agentic/code/addons/doc-intelligence/skills/doc-scraper/SKILL.md
Scrape documentation websites into organized reference files. Use when converting docs sites to searchable references or building Claude skills.
npx skillsauth add jmagly/aiwg doc-scraperInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Single responsibility: Convert documentation websites into organized, categorized reference files suitable for Claude skills or offline archives. (BP-4)
Before executing, VERIFY:
curl -I)DO NOT proceed without verification. Inspect before scraping.
ASK USER instead of guessing when:
<article> or <main> elements)NEVER substitute missing configuration with assumptions.
| Context Type | Included | Excluded | |--------------|----------|----------| | RELEVANT | Target URL, selectors, output path | Unrelated documentation | | PERIPHERAL | Similar site examples for selector hints | Historical scrape data | | DISTRACTOR | Other projects, unrelated URLs | Previous failed attempts |
# Test URL accessibility
curl -I <target-url>
# Check robots.txt
curl <base-url>/robots.txt
# Inspect page structure (use browser dev tools or fetch sample)
Generate scraper config based on inspection:
{
"name": "skill-name",
"description": "When to use this skill",
"base_url": "https://docs.example.com/",
"selectors": {
"main_content": "article",
"title": "h1",
"code_blocks": "pre code"
},
"url_patterns": {
"include": ["/docs", "/guide", "/api"],
"exclude": ["/blog", "/changelog", "/releases"]
},
"categories": {
"getting_started": ["intro", "quickstart", "installation"],
"api_reference": ["api", "reference", "methods"],
"guides": ["guide", "tutorial", "how-to"]
},
"rate_limit": 0.5,
"max_pages": 500
}
Option A: With skill-seekers (if installed)
# Verify skill-seekers is available
pip show skill-seekers
# Run scraper
skill-seekers scrape --config config.json
# For large docs, use async mode
skill-seekers scrape --config config.json --async --workers 8
Option B: Manual scraping guidance
# Check output structure
ls -la output/<skill-name>/
# Verify content quality
head -50 output/<skill-name>/references/index.md
# Count extracted pages
find output/<skill-name>_data/pages -name "*.json" | wc -l
On error:
Connection error → Verify URL, check networkSelector not found → Re-inspect page structureRate limited → Increase delay, reduce workersMemory/disk → Reduce batch size, clear temp filesState saved to: .aiwg/working/checkpoints/doc-scraper/
Resume interrupted scrape:
skill-seekers scrape --config config.json --resume
Clear checkpoint and start fresh:
skill-seekers scrape --config config.json --fresh
output/<skill-name>/
├── SKILL.md # Main skill description
├── references/ # Categorized documentation
│ ├── index.md # Category index
│ ├── getting_started.md
│ ├── api_reference.md
│ └── guides.md
├── scripts/ # (empty, for user additions)
└── assets/ # (empty, for user additions)
output/<skill-name>_data/
├── pages/ # Raw scraped JSON (one per page)
└── summary.json # Scrape statistics
{
"name": "myframework",
"base_url": "https://docs.example.com/",
"max_pages": 100
}
{
"name": "myframework",
"description": "MyFramework documentation for building web apps",
"base_url": "https://docs.example.com/",
"selectors": {
"main_content": "article, main, div[role='main']",
"title": "h1, .title",
"code_blocks": "pre code, .highlight code",
"navigation": "nav, .sidebar"
},
"url_patterns": {
"include": ["/docs/", "/api/", "/guide/"],
"exclude": ["/blog/", "/changelog/", "/v1/", "/v2/"]
},
"categories": {
"getting_started": ["intro", "quickstart", "install", "setup"],
"concepts": ["concept", "overview", "architecture"],
"api": ["api", "reference", "method", "function"],
"guides": ["guide", "tutorial", "how-to", "example"],
"advanced": ["advanced", "internals", "customize"]
},
"rate_limit": 0.5,
"max_pages": 1000,
"checkpoint": {
"enabled": true,
"interval": 100
}
}
| Issue | Diagnosis | Solution |
|-------|-----------|----------|
| No content extracted | Selector mismatch | Inspect page, update main_content selector |
| Wrong pages scraped | URL pattern issue | Check include/exclude patterns |
| Rate limited | Too aggressive | Increase rate_limit to 1.0+ seconds |
| Memory issues | Too many pages | Add max_pages limit, enable checkpoints |
| Categories wrong | Keyword mismatch | Update category keywords in config |
data-ai
Report which research-corpus radar sidecars are overdue for refresh. Computes staleness (days since last refresh vs the cadence window) for every radar, sorted most-overdue-first. Runs via `aiwg corpus radar-status`.
data-ai
Aggregate research-corpus radar sidecars into a corpus or per-cluster freshness report — totals, overdue count, per-cluster / per-GRADE / per-trajectory breakdowns, an overdue table, and per-radar rationale snippets. Runs via `aiwg corpus radar-report`.
testing
Scaffold radar/freshness sidecars for research-corpus REFs. Pulls title/authors from the citation sidecar and GRADE from the analysis doc, defaults the refresh cadence from GRADE and the cluster from a corpus-local map, and stamps documentation/radar/REF-XXX-radar.md. Runs via `aiwg corpus radar-init`.
data-ai
Compute an entity's publication trajectory — per-year paper counts, topic drift, hot-streak detection (≥3 consecutive A-grade years), and career phase. Runs via `aiwg corpus profile-temporal`.