docs/ai-context/archive/cursor-skills/dpla-orchestrator/SKILL.md
Run or monitor the DPLA Python ingest orchestrator. Use when the user says run orchestrator, parallel ingest, ingest status, run hubs, orchestrator dry-run, or retry failed hubs. Covers venv, main entry point, status script, and logs.
Run and monitor the Python orchestrator that drives the full ingestion pipeline (harvest → mapping → enrichment → JSONL → anomaly → S3 sync) for one or more hubs, with Slack notifications and parallel execution.
Environment: Always run source .env from the repo root before running the orchestrator so that JAVA_HOME, SLACK_WEBHOOK, and other variables are set. Make sure the fat JAR is current: after sourcing .env, run sbt assembly from the repo root before starting the orchestrator, or confirm there have been no Scala changes since the last build. Full checklist: AGENTS.md § Environment and build.
# From repo root; source .env for JAVA_HOME, SLACK_WEBHOOK, etc.
source .env
./venv/bin/python -m scheduler.orchestrator.main [options]
Do not use system python3; use ./venv/bin/python so dependencies and environment are correct.
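The environment requirements above can be checked before a long run. The following is a minimal sketch, not part of the repository: require_files and require_vars are hypothetical helper names, and the only paths and variables they check are the ones this guide names (.env, venv/bin/python, JAVA_HOME, SLACK_WEBHOOK).

```shell
# Hypothetical preflight helpers (not part of the repo).

# Check that .env and the venv interpreter exist under the given repo root.
# Returns 0 if both exist, 1 otherwise.
require_files() {
  local root="${1:-.}"
  local path missing=0
  for path in ".env" "venv/bin/python"; do
    if [ ! -e "$root/$path" ]; then
      echo "missing: $root/$path" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Check that each named shell variable is set and non-empty.
# Returns 0 if all are set, 1 otherwise.
require_vars() {
  local name value missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name, empty if unset.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "unset or empty: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

One possible use, from the repo root after source .env: require_files . && require_vars JAVA_HOME SLACK_WEBHOOK && ./venv/bin/python -m scheduler.orchestrator.main --dry-run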
| Goal | Command |
|------|--------|
| Current month, all scheduled hubs | ./venv/bin/python -m scheduler.orchestrator.main |
| Specific hubs | ./venv/bin/python -m scheduler.orchestrator.main --hub=wisconsin,p2p |
| Parallel (2–3 hubs at once) | ./venv/bin/python -m scheduler.orchestrator.main --hub=wi,va,mn --parallel=3 |
| Specific month | ./venv/bin/python -m scheduler.orchestrator.main --month=2 |
| Preview only (no run) | ./venv/bin/python -m scheduler.orchestrator.main --dry-run |
| Retry last run's failures | ./venv/bin/python -m scheduler.orchestrator.main --retry-failed |
| Skip harvest (reuse data) | ./venv/bin/python -m scheduler.orchestrator.main --hub=wisconsin --skip-harvest |
| Skip S3 sync | ./venv/bin/python -m scheduler.orchestrator.main --hub=wisconsin --skip-s3-sync |
Per-hub status is written to logs/status/<hub>.status (JSON). Use the status script:
# Table view
./scripts/status/ingest-status.sh
# Auto-refresh (e.g. every 30s)
./scripts/status/ingest-status.sh --watch
# Specific hubs
./scripts/status/ingest-status.sh wisconsin p2p
# Verbose (stage history, durations)
./scripts/status/ingest-status.sh -v
# JSON (for scripting)
./scripts/status/ingest-status.sh --json
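Because each hub's status lives in its own logs/status/<hub>.status file, the set of hubs with recorded status can be derived from file names alone. The sketch below assumes only that naming convention and does not read any JSON fields, since the status schema is not described here. list_status_hubs is a hypothetical helper name.

```shell
# Hypothetical helper (not part of the repo): print one hub name per
# <hub>.status file found in the status directory (default: logs/status).
list_status_hubs() {
  local dir="${1:-logs/status}"
  local f
  for f in "$dir"/*.status; do
    # If the glob matched nothing, the literal pattern is left; skip it.
    [ -e "$f" ] || continue
    basename "$f" .status
  done
}
```

The output can be passed back to the status script as hub arguments, for example ./scripts/status/ingest-status.sh $(list_status_hubs).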
| Resource | Path |
|----------|------|
| Orchestrator entry | scheduler/orchestrator/main.py |
| Config | scheduler/orchestrator/config.py; .env for SLACK_WEBHOOK, JAVA_HOME |
| Per-hub status | logs/status/<hub>.status |
| Escalation reports | data/escalations/failures-<run_id>.md |
| Email drafts (after run) | logs/hub-emails-<run_id>/ |
Harvests can run 12–24 hours. Use tmux or nohup so the run survives disconnection:
# tmux (recommended; reattach with: tmux attach -t ingest)
tmux new -s ingest
cd /path/to/ingestion3 && source .env
./venv/bin/python -m scheduler.orchestrator.main --hub=wisconsin,p2p --parallel=2
# Ctrl-B, D to detach
# nohup (fire and forget)
nohup ./venv/bin/python -m scheduler.orchestrator.main --hub=wi,p2p --parallel=2 \
> logs/orchestrator-$(date +%Y%m%d_%H%M%S).log 2>&1 &
Slack notifications go to #tech-alerts; hub-complete notifications go to #tech when configured. Failures are written to data/escalations/.
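The nohup example above writes to a timestamped file, logs/orchestrator-<YYYYmmdd_HHMMSS>.log, so after reconnecting you need to find the most recent one. A minimal sketch, assuming that naming pattern; latest_orchestrator_log is a hypothetical helper name. It relies on the timestamp format sorting lexically in chronological order.

```shell
# Hypothetical helper (not part of the repo): print the newest
# orchestrator-*.log in a directory (default: logs).
# Returns 1 and prints nothing if no matching log exists.
latest_orchestrator_log() {
  local dir="${1:-logs}"
  local f latest=""
  for f in "$dir"/orchestrator-*.log; do
    # If the glob matched nothing, the literal pattern is left; skip it.
    [ -e "$f" ] || continue
    # Globs expand in sorted order, so the last match has the latest timestamp.
    latest="$f"
  done
  if [ -z "$latest" ]; then
    return 1
  fi
  printf '%s\n' "$latest"
}
```

To follow a background run: tail -f "$(latest_orchestrator_log)"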