docs/ai-context/archive/cursor-skills/dpla-community-webs-ingest/SKILL.md
Run Community Webs ingest from SQLite DB. Use when the user says harvest community-webs, run community-webs ingest, export community webs, or process community webs DB.
Run the Community Webs ingest workflow: export SQLite DB → JSONL → ZIP → harvest → (optional) full pipeline. Internet Archive sends a SQLite database; the scripts handle the intermediate export step automatically.
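The export step described above (SQLite DB → JSONL → ZIP) can be sketched roughly as follows. This is a minimal illustration, not the logic of community-webs-export.sh: the table name `records`, the output file names, and the function `export_db_to_zip` are all hypothetical, and the real script may select or reshape columns differently.

```python
import json
import sqlite3
import zipfile
from pathlib import Path


def export_db_to_zip(db_path: Path, out_dir: Path, table: str = "records") -> Path:
    """Dump every row of `table` to JSONL, then wrap the JSONL file in a ZIP.

    The table name and output file names are assumptions for illustration.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    jsonl_path = out_dir / "community-webs.jsonl"
    zip_path = out_dir / "community-webs.zip"

    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # rows can be converted to dicts by column name
    try:
        with jsonl_path.open("w", encoding="utf-8") as fh:
            # A table name cannot be a bound parameter, so it is quoted inline.
            for row in conn.execute(f'SELECT * FROM "{table}"'):
                fh.write(json.dumps(dict(row)) + "\n")  # one JSON object per line
    finally:
        conn.close()

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(jsonl_path, arcname=jsonl_path.name)
    return zip_path
```

Writing the JSONL to disk before zipping keeps an inspectable intermediate file, which is convenient if a validation pass runs between the two steps.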
Environment: Source .env before running so JAVA_HOME, DPLA_DATA, I3_CONF are set. Scripts that source common.sh load $I3_HOME/.env when present. Full checklist: AGENTS.md § Environment and build.
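A quick pre-flight check for the three variables named above might look like this. It is only a sketch: the helper name `missing_env_vars` is hypothetical, and it does not source .env itself, so it reports what the current process can see.

```python
import os

# Variable names taken from the Environment note above.
REQUIRED_VARS = ("JAVA_HOME", "DPLA_DATA", "I3_CONF")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("Environment looks complete.")
```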
1. Place the SQLite DB in `$DPLA_DATA/community-webs/originalRecords/` (or provide the path via `--db=`).
2. Export and harvest: `./scripts/harvest/community-webs-ingest.sh`
3. Full pipeline: `./scripts/harvest/community-webs-ingest.sh --full`
4. Full pipeline, also updating i3.conf: `./scripts/harvest/community-webs-ingest.sh --full --update-conf`
5. Verify `_SUCCESS` in the harvest/mapping/enrichment/jsonl dirs.

| Script | Purpose |
|--------|---------|
| community-webs-export.sh | Export DB → JSONL → ZIP; validates schema before zipping |
| community-webs-ingest.sh | Orchestrates export + harvest (+ optional full pipeline) |
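The `_SUCCESS` check from the workflow can be automated along these lines. This is a sketch under assumptions: the helper name `stages_missing_success` is hypothetical, and because the depth of the marker inside each stage directory is not stated here, it searches recursively. That also means a marker left by an older run would satisfy the check, so treat the result as a first-pass signal only.

```python
from pathlib import Path

# Stage directory names taken from the verification step in the workflow.
STAGES = ("harvest", "mapping", "enrichment", "jsonl")


def stages_missing_success(hub_dir: Path):
    """Return the stages that have no _SUCCESS marker beneath their directory."""
    missing = []
    for stage in STAGES:
        stage_dir = hub_dir / stage
        # rglob is used because the marker's exact depth is an assumption.
        if not stage_dir.is_dir() or not any(stage_dir.rglob("_SUCCESS")):
            missing.append(stage)
    return missing
```

For example, `stages_missing_success(Path(os.environ["DPLA_DATA"]) / "community-webs")` would return an empty list when every stage has a marker (with `os` imported).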
community-webs-export.sh:

- `--db=PATH`: Explicit DB path (default: auto-detect latest `*.db` in originalRecords)
- `--update-conf`: Update i3.conf `community-webs.harvest.endpoint`
- `--skip-validate`: Skip JSONL schema validation (not recommended)

community-webs-ingest.sh:

- `--db=PATH`: Pass to export script
- `--skip-export`: Use existing ZIP (endpoint must already point to it)
- `--full`: Run harvest + mapping + enrichment + jsonl
- `--update-conf`: Update i3.conf with export output directory

Output paths:

- `$DPLA_DATA/community-webs/originalRecords/<YYYYMMDD>/community-webs-<timestamp>.zip` (export ZIP)
- `$DPLA_DATA/community-webs/harvest/`
- `$DPLA_DATA/community-webs/`

The export script runs community-webs-validate-jsonl.py on the JSONL before zipping. It checks:
- `id`
- Records with `"status":"deleted"` are skipped (harvester filters these)

Tests: `./venv/bin/python -m pytest scripts/tests/test_community_webs_validation.py -v`
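To make the validation behaviour above concrete, here is a minimal sketch of a JSONL check that requires an `id` and skips deleted records. It is an illustration only and not the contents of community-webs-validate-jsonl.py: the function name `validate_jsonl`, the return shape, and the error wording are assumptions, and the real script may check more than this.

```python
import json


def validate_jsonl(lines):
    """Return (checked, skipped, errors) for an iterable of JSONL lines."""
    checked, skipped, errors = 0, 0, []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {lineno}: expected a JSON object")
            continue
        if record.get("status") == "deleted":
            skipped += 1  # deleted records are not validated further
            continue
        checked += 1
        if not record.get("id"):
            errors.append(f"line {lineno}: missing id")
    return checked, skipped, errors
```

Passing an open file handle works directly, since iterating a text file yields one line at a time.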
- Place the `*.db` in `$DPLA_DATA/community-webs/originalRecords/` or use `--db=/path/to/file.db`
- Requires `sqlite3` and `jq` (brew install / apt install)
- `community-webs.harvest.endpoint` must point to the directory containing the ZIP

| Resource | Path |
|----------|------|
| Script reference | scripts/SCRIPTS.md |
| Ingest docs | README_INGESTS.md |
| Agent guide | AGENTS.md |