scieloorg/usage

SciELO Usage

Application for processing SciELO access logs, extracting COUNTER R5.1 metrics, and exporting monthly/yearly usage documents to OpenSearch.

Quick Start

Local development runs with Docker Compose using local.yml.

make build
make django_migrate
make django_createsuperuser
make up

Admin: http://localhost:8009/admin

Main local services:

| Service        | Port |
|----------------|------|
| Django/Wagtail | 8009 |
| PostgreSQL     | 5439 |
| Redis          | 6399 |
| Mailhog        | 8029 |
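
To confirm the services are reachable after `make up`, a small check like the following may help. It is a minimal sketch, not part of the project: the port numbers come from the table above, while the host name `localhost` and the helper names `is_port_open` and `check_services` are assumptions made for illustration.

```python
import socket

# Ports copied from the services table; the host is an assumption
# for a default local Docker Compose setup.
LOCAL_SERVICES = {
    "Django/Wagtail": 8009,
    "PostgreSQL": 5439,
    "Redis": 6399,
    "Mailhog": 8029,
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_services(host: str = "localhost", services: dict = LOCAL_SERVICES) -> dict:
    """Map each service name to whether its port accepts TCP connections."""
    return {name: is_port_open(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name}: {'up' if ok else 'down'}")
```

An open port only shows that something is listening; it does not prove that the service behind it is healthy.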

Full Pipeline Setup

After the app is running, open a Django shell:

make django_shell

Seed the base data and resources:

from collection.tasks import task_load_collections
from log_manager_config.tasks import task_load_log_manager_collection_settings
from resources.tasks import task_load_geoip, task_load_robots

log_config = [
    {
        "acronym": "scl",
        "directory_name": "SciELO Brasil",
        "path": "/app/logs/scielo.br",
        "quantity": 1,
        "e-mail": "tecnologia@scielo.org",
        "translator_class": "opac",
    }
]

task_load_collections.delay()
task_load_log_manager_collection_settings.delay(data=log_config)
task_load_robots.delay()
task_load_geoip.delay()
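
Before queuing the settings task, it can be useful to sanity-check the `log_config` entries. The sketch below is hypothetical and not part of the project: it assumes that every key shown in the example above is required, that `quantity` is a positive integer, and that `path` is absolute. The real task may accept other shapes.

```python
# Keys taken from the example entry above; treating all of them as
# required is an assumption, not documented behavior.
REQUIRED_KEYS = {
    "acronym",
    "directory_name",
    "path",
    "quantity",
    "e-mail",
    "translator_class",
}


def validate_log_config(entries):
    """Return a list of problems; an empty list means the entries look consistent."""
    problems = []
    for index, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {index}: missing keys {sorted(missing)}")
        if "quantity" in entry:
            quantity = entry["quantity"]
            if not (isinstance(quantity, int) and quantity > 0):
                problems.append(f"entry {index}: quantity must be a positive integer")
        if "path" in entry and not str(entry["path"]).startswith("/"):
            problems.append(f"entry {index}: path should be absolute")
    return problems
```

Calling `validate_log_config(log_config)` in the shell before `task_load_log_manager_collection_settings.delay(data=log_config)` surfaces typos early instead of in the worker logs.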

Load sources and documents before processing logs. For a first run, restrict document synchronization to a smaller date range:

from document.tasks import (
    task_load_dataset_metadata_into_documents,
    task_load_documents_from_article_meta,
    task_load_documents_from_opac,
    task_load_preprints_into_documents,
    task_sync_documents_from_scielo_books,
)
from source.tasks import (
    task_load_sources_from_article_meta,
    task_load_sources_from_scielo_books,
)

task_load_sources_from_article_meta.delay(collections=["scl"])
task_load_sources_from_scielo_books.delay(limit=1000)

date_range = {"from_date": "2025-01-01", "until_date": "2025-12-31"}
task_load_documents_from_article_meta.delay(**date_range)
task_load_documents_from_opac.delay(collection="scl", **date_range)
task_load_preprints_into_documents.delay(**date_range)
task_load_dataset_metadata_into_documents.delay(**date_range)
task_sync_documents_from_scielo_books.delay()
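
One way to keep each synchronization run small is to split a long period into calendar-month chunks and queue one task per chunk. The helper below is a sketch under that assumption: `month_ranges` is a hypothetical name, and only the `from_date` and `until_date` keyword names are taken from the example above.

```python
import calendar
from datetime import date


def month_ranges(from_date: str, until_date: str):
    """Split an inclusive ISO date range into per-month from/until dicts."""
    start = date.fromisoformat(from_date)
    end = date.fromisoformat(until_date)
    if start > end:
        raise ValueError("from_date must not be after until_date")

    ranges = []
    current = start
    while current <= end:
        # Last calendar day of the current month, capped at the overall end.
        last_day = calendar.monthrange(current.year, current.month)[1]
        chunk_end = min(date(current.year, current.month, last_day), end)
        ranges.append(
            {"from_date": current.isoformat(), "until_date": chunk_end.isoformat()}
        )
        # Advance to the first day of the next month.
        if current.month == 12:
            current = date(current.year + 1, 1, 1)
        else:
            current = date(current.year, current.month + 1, 1)
    return ranges


# Possible use inside the Django shell (not executed here):
# for chunk in month_ranges("2025-01-01", "2025-12-31"):
#     task_load_documents_from_article_meta.delay(**chunk)
```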

Before starting the log pipeline, confirm in the admin that each collection has an active Log Manager configuration pointing to a readable log directory mounted in the container.

For the example above, place a log file under the configured directory:

mkdir -p <mounted-logs-dir>/scielo.br
cp metrics/tests/fixtures/usage.log <mounted-logs-dir>/scielo.br/usage-2021-05-21.log

Run the full Search -> Validate -> Parse -> Export chain for a date range:

from log_manager.tasks import task_search_log_files

task_search_log_files.delay(
    collections=["scl"],
    from_date="2021-05-21",
    until_date="2021-05-21",
    trigger_validation=True,
)

Monitor execution with:

make logs

Commands

make help                    # list available targets
make app_version             # show VERSION
make build                   # build local images
make build_no_cache          # build local images without cache
make up                      # start local services
make logs                    # follow service logs
make stop                    # stop local services
make restart                 # restart local services
make ps                      # list running services
make django_bash             # open bash in the django container
make django_shell            # open Django shell
make django_createsuperuser  # create an admin user
make django_migrate          # apply migrations
make django_makemigrations   # create migrations
make django_makemessages     # update translation messages
make django_compilemessages  # compile translation messages
make wagtail_update_translation_field
make wagtail_sync
make test                    # run pytest
make django_test             # run pytest
make django_fast             # run pytest --failfast
make lint                    # run flake8
make format_check            # run black/isort checks
make precommit               # run pre-commit hooks

To run a target against another Compose file, such as production.yml, pass compose=<file>:

make ps compose=production.yml

Run a single test file or path:

docker compose -f local.yml run --rm django pytest metrics/tests/test_opensearch.py

Pipeline

The log pipeline is coordinated by Celery tasks:

  1. Search configured directories for new .log and .gz files.
  2. Validate log samples and detect usage date.
  3. Parse requests with scielo_usage_counter.
  4. Aggregate COUNTER R5.1 metrics.
  5. Export idempotent monthly/yearly documents to OpenSearch.
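
The first step of the list above can be illustrated with a short standalone sketch. It is not the project's implementation: the function names `find_log_files` and `read_lines` are hypothetical, and the recursive search and UTF-8 decoding are assumptions. It only shows how .log and .gz files might be discovered and read in the same way.

```python
import gzip
from pathlib import Path

# Suffixes named in step 1 of the pipeline description.
LOG_SUFFIXES = {".log", ".gz"}


def find_log_files(directory):
    """Return sorted paths of .log and .gz files found recursively under directory."""
    root = Path(directory)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix in LOG_SUFFIXES
    )


def read_lines(path, limit=None):
    """Yield text lines from a plain or gzip-compressed log file.

    `limit` caps the number of lines, which is handy for sampling a file
    before committing to a full parse.
    """
    opener = gzip.open if Path(path).suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle):
            if limit is not None and number >= limit:
                break
            yield line.rstrip("\n")
```

Reading only the first few lines with `limit` mirrors the idea of validating a sample before parsing the whole file.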

Metadata synchronization keeps sources and documents updated from ArticleMeta, OPAC, SciELO Books, SciELO Preprints, and SciELO Data.

Periodic Tasks

Configure the default schedule manually in Wagtail/Admin through django-celery-beat PeriodicTask records. Exact cron times may vary by installation, but the default operational setup should include:

| Task | Suggested schedule | Notes |
|------|--------------------|-------|
| [Metadata] Daily Sync Routine (Auto) | Daily, early morning | Refreshes sources and documents before log processing. Use the load queue. |
| [Log Pipeline] Daily Routine (Auto) | Daily, after metadata sync | Runs Search -> Validate -> Parse -> Export for new logs. Use the load queue. |
| [Metrics] Resume Log Exports | Every 15-30 minutes | Retries errored or stale daily metric export jobs. |
| [Metrics] Resume Stale Parsing Logs | Every 30-60 minutes | Marks stale PAR logs for retry. |
| [Metrics] Cleanup Daily Payloads | Daily or weekly | Removes old exported daily payload files. |
| [Reports] Populate All Reports | Daily, after log processing | Refreshes weekly, monthly, and yearly log report tables. |
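
The suggested schedules can be written down as standard five-field cron expressions before entering them in the admin. The sketch below is illustrative only: the concrete times are invented examples that respect the relative order described above, and `crontab_kwargs` is a hypothetical helper. Its output keys match the field names of the django-celery-beat `CrontabSchedule` model (`minute`, `hour`, `day_of_month`, `month_of_year`, `day_of_week`).

```python
# Field order of a standard five-field cron expression, using the
# CrontabSchedule field names from django-celery-beat.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month_of_year", "day_of_week")

# Example times only: the schedule table gives relative ordering,
# not exact cron values, so adjust these per installation.
SUGGESTED_SCHEDULES = {
    "[Metadata] Daily Sync Routine (Auto)": "0 2 * * *",
    "[Log Pipeline] Daily Routine (Auto)": "0 4 * * *",
    "[Metrics] Resume Log Exports": "*/15 * * * *",
    "[Metrics] Resume Stale Parsing Logs": "*/30 * * * *",
    "[Metrics] Cleanup Daily Payloads": "0 1 * * 0",
    "[Reports] Populate All Reports": "0 8 * * *",
}


def crontab_kwargs(expression: str) -> dict:
    """Split a five-field cron expression into named schedule fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 cron fields, got {len(parts)}: {expression!r}")
    return dict(zip(CRON_FIELDS, parts))


if __name__ == "__main__":
    for name, expression in SUGGESTED_SCHEDULES.items():
        print(name, crontab_kwargs(expression))
```

Inside a Django shell, the resulting dictionary could presumably be passed to `CrontabSchedule.objects.get_or_create(**crontab_kwargs(...))`, but check the fields against the installed django-celery-beat version first.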

Optional operational tasks:

| Task | Suggested schedule | Notes |
|------|--------------------|-------|
| [Reports] Generate Log Report Summary (Manual) | Manual or scheduled as needed | Sends summary emails using configured collection contacts. |
| [Resources] Load Robots Data | Weekly | Refreshes robots list used during parsing. |
| [Resources] Load Geolocation Data | Monthly | Refreshes GeoIP data used during parsing. |

Version

Project release version is stored in VERSION.

About

This repository contains the code for the SciELO Usage application, a tool for managing and analyzing SciELO usage data.
