jmagly/doc-scraper

Name: doc-scraper
Author: jmagly

agentic/code/addons/doc-intelligence/skills/doc-scraper/SKILL.md

npx skillsauth add jmagly/aiwg doc-scraper

Documentation Scraper Skill

Purpose

Single responsibility: Convert documentation websites into organized, categorized reference files suitable for Claude skills or offline archives. (BP-4)

Grounding Checkpoint (Archetype 1 Mitigation)

Before executing, VERIFY:

[ ] Target URL is accessible (test with curl -I)
[ ] Documentation structure is identifiable (inspect page for content selectors)
[ ] Output directory is writable
[ ] Rate limiting requirements are known (check robots.txt)

DO NOT proceed without verification. Inspect before scraping.

Uncertainty Escalation (Archetype 2 Mitigation)

ASK USER instead of guessing when:

- Content selector is ambiguous (multiple <article> or <main> elements)
- URL patterns are unclear (can't determine include/exclude rules)
- Category mapping is uncertain (content doesn't fit predefined categories)
- Rate limiting is unknown (no robots.txt, unclear ToS)

NEVER substitute missing configuration with assumptions.

Context Scope (Archetype 3 Mitigation)

| Context Type | Included | Excluded |
|--------------|----------|----------|
| RELEVANT | Target URL, selectors, output path | Unrelated documentation |
| PERIPHERAL | Similar site examples for selector hints | Historical scrape data |
| DISTRACTOR | Other projects, unrelated URLs | Previous failed attempts |

Workflow Steps

Step 1: Verify Target (Grounding)

# Test URL accessibility
curl -I <target-url>

# Check robots.txt
curl <base-url>/robots.txt

# Inspect page structure (use browser dev tools or fetch sample)
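
The robots.txt check above can also be scripted. The following is a minimal sketch using Python's standard `urllib.robotparser`; the `check_robots` helper and the `SAMPLE` rules are hypothetical and only illustrate the idea. It parses robots.txt text that has already been fetched (for example with the `curl` command above), so it runs offline.

```python
from urllib.robotparser import RobotFileParser


def check_robots(robots_txt: str, url: str, user_agent: str = "*"):
    """Return (allowed, crawl_delay) for url, given the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


# Hypothetical robots.txt content, used here instead of a live request.
SAMPLE = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

allowed, delay = check_robots(SAMPLE, "https://docs.example.com/docs/intro")
blocked, _ = check_robots(SAMPLE, "https://docs.example.com/private/keys")
print(allowed, delay, blocked)
```

A crawl delay found this way is a reasonable lower bound for `rate_limit`. If `crawl_delay` returns `None`, the rate limit is still unknown and the escalation rule above applies.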

Step 2: Create Configuration

Generate scraper config based on inspection:

{
  "name": "skill-name",
  "description": "When to use this skill",
  "base_url": "https://docs.example.com/",
  "selectors": {
    "main_content": "article",
    "title": "h1",
    "code_blocks": "pre code"
  },
  "url_patterns": {
    "include": ["/docs", "/guide", "/api"],
    "exclude": ["/blog", "/changelog", "/releases"]
  },
  "categories": {
    "getting_started": ["intro", "quickstart", "installation"],
    "api_reference": ["api", "reference", "methods"],
    "guides": ["guide", "tutorial", "how-to"]
  },
  "rate_limit": 0.5,
  "max_pages": 500
}
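
Before running the scraper, it may help to sanity-check the generated config. The sketch below is not part of skill-seekers: `validate_config` and `REQUIRED_KEYS` are hypothetical names, and treating `name` and `base_url` as required is an assumption inferred from the Minimal Config template. It reports problems instead of silently filling in defaults, in line with the escalation rule above.

```python
import json

# Assumption: "name" and "base_url" are the keys every config needs,
# inferred from the Minimal Config template rather than a documented schema.
REQUIRED_KEYS = ("name", "base_url")


def validate_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for key in REQUIRED_KEYS:
        if not config.get(key):
            problems.append(f"missing required key: {key}")

    base_url = config.get("base_url") or ""
    if base_url and not base_url.startswith(("http://", "https://")):
        problems.append("base_url must start with http:// or https://")

    if "rate_limit" in config:
        rate = config["rate_limit"]
        if not isinstance(rate, (int, float)) or rate <= 0:
            problems.append("rate_limit must be a positive number of seconds")

    if "max_pages" in config:
        pages = config["max_pages"]
        if not isinstance(pages, int) or pages < 1:
            problems.append("max_pages must be a positive integer")

    patterns = config.get("url_patterns", {})
    overlap = set(patterns.get("include", [])) & set(patterns.get("exclude", []))
    if overlap:
        problems.append(f"patterns both included and excluded: {sorted(overlap)}")
    return problems


good = json.loads('{"name": "myframework", "base_url": "https://docs.example.com/", "max_pages": 100}')
bad = json.loads('{"name": "myframework", "base_url": "docs.example.com", "rate_limit": 0}')
print(validate_config(good))
print(validate_config(bad))
```

Any non-empty result is a cue to ask the user rather than guess a replacement value.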

Step 3: Execute Scraping

Option A: With skill-seekers (if installed)

# Verify skill-seekers is available
pip show skill-seekers

# Run scraper
skill-seekers scrape --config config.json

# For large docs, use async mode
skill-seekers scrape --config config.json --async --workers 8

Option B: Manual scraping guidance

1. Discover pages from sitemap.xml, or crawl from the starting URL
2. Extract content using the configured selectors
3. Categorize pages based on URL patterns and keywords
4. Save the results to an organized directory structure
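
For the manual route, the filtering and categorization steps can be sketched as below. `should_scrape` and `categorize` are hypothetical helpers, and the matching rules are assumptions: include/exclude entries are treated as path prefixes, and category keywords as case-insensitive substrings of the path and title. skill-seekers may match differently.

```python
from urllib.parse import urlparse


def should_scrape(url, base_url, include, exclude):
    """Keep a URL only if it is under base_url, matches an include prefix, and no exclude prefix."""
    if not url.startswith(base_url):
        return False
    path = urlparse(url).path
    if any(path.startswith(prefix) for prefix in exclude):
        return False
    return any(path.startswith(prefix) for prefix in include)


def categorize(url, title, categories):
    """Return the first category whose keyword appears in the path or title, else None."""
    haystack = f"{urlparse(url).path} {title}".lower()
    for category, keywords in categories.items():
        if any(keyword in haystack for keyword in keywords):
            return category
    return None  # no match: a candidate for asking the user


BASE = "https://docs.example.com/"
INCLUDE = ["/docs", "/guide", "/api"]
EXCLUDE = ["/blog", "/changelog", "/releases"]
CATEGORIES = {
    "getting_started": ["intro", "quickstart", "installation"],
    "api_reference": ["api", "reference", "methods"],
    "guides": ["guide", "tutorial", "how-to"],
}

print(should_scrape(BASE + "docs/quickstart", BASE, INCLUDE, EXCLUDE))
print(should_scrape(BASE + "blog/news", BASE, INCLUDE, EXCLUDE))
print(categorize(BASE + "docs/quickstart", "Quickstart", CATEGORIES))
print(categorize(BASE + "docs/licensing", "Licensing", CATEGORIES))
```

Returning `None` for an unmatched page, rather than a catch-all category, keeps uncertain mappings visible so they can be escalated.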

Step 4: Validate Output

# Check output structure
ls -la output/<skill-name>/

# Verify content quality
head -50 output/<skill-name>/references/index.md

# Count extracted pages
find output/<skill-name>_data/pages -name "*.json" | wc -l
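
The page count above says how many files exist, not whether they contain anything. As a rough complement, the sketch below counts the page files and lists those that fail to parse or have no content. The schema of the page JSON is not documented here, so the `content` field name is an assumption, and `summarize_pages` is a hypothetical helper. The demo runs against a temporary directory rather than a real scrape.

```python
import json
import tempfile
from pathlib import Path


def summarize_pages(pages_dir, content_key="content"):
    """Count page files and list those that are unreadable or have no content."""
    total, empty, unreadable = 0, [], []
    for path in sorted(Path(pages_dir).glob("*.json")):
        total += 1
        try:
            page = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            unreadable.append(path.name)
            continue
        # content_key is an assumed field name; adjust it to the real schema.
        if not isinstance(page, dict) or not str(page.get(content_key) or "").strip():
            empty.append(path.name)
    return {"total": total, "empty": empty, "unreadable": unreadable}


# Demo against a throwaway directory with one good, one blank, and one broken file.
with tempfile.TemporaryDirectory() as tmp:
    pages = Path(tmp)
    (pages / "intro.json").write_text(json.dumps({"title": "Intro", "content": "Hello"}), encoding="utf-8")
    (pages / "blank.json").write_text(json.dumps({"title": "Blank", "content": "  "}), encoding="utf-8")
    (pages / "broken.json").write_text("{not json", encoding="utf-8")
    report = summarize_pages(pages)

print(report)
```

A high share of empty pages usually points to a selector mismatch, which is the first row of the Troubleshooting table.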

Recovery Protocol (Archetype 4 Mitigation)

On error:

1. PAUSE - Stop scraping, preserve already-fetched pages
2. DIAGNOSE - Check error type:
   - Connection error → Verify URL, check network
   - Selector not found → Re-inspect page structure
   - Rate limited → Increase delay, reduce workers
   - Memory/disk → Reduce batch size, clear temp files
3. ADAPT - Adjust configuration based on diagnosis
4. RETRY - Resume from checkpoint (max 3 attempts)
5. ESCALATE - Ask user for guidance
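
The protocol above can be sketched as a small retry loop. Everything here is illustrative: `ScrapeError`, `adapt`, and `run_with_recovery` are hypothetical names, the doubling and halving factors are assumptions, and "max 3 attempts" is read as three attempts in total. The `scrape` callable stands in for whatever actually resumes the scrape from its checkpoint.

```python
import time

MAX_ATTEMPTS = 3  # from the RETRY step; interpreted as total attempts (assumption)


class ScrapeError(Exception):
    """Hypothetical error type carrying a diagnosed kind such as 'rate_limited'."""

    def __init__(self, kind):
        super().__init__(kind)
        self.kind = kind


def adapt(settings, kind):
    """ADAPT: return adjusted settings for the diagnosed error kind."""
    adjusted = dict(settings)
    if kind == "rate_limited":
        adjusted["rate_limit"] = settings["rate_limit"] * 2  # increase delay
        adjusted["workers"] = max(1, settings["workers"] // 2)  # reduce workers
    elif kind == "memory":
        adjusted["batch_size"] = max(1, settings["batch_size"] // 2)
    return adjusted


def run_with_recovery(scrape, settings, sleep=time.sleep):
    """Retry scrape up to MAX_ATTEMPTS times, adapting settings between attempts."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return scrape(settings)
        except ScrapeError as err:
            if attempt == MAX_ATTEMPTS:
                # ESCALATE: give up and surface the problem to the user.
                raise RuntimeError(f"escalate to user after {attempt} attempts: {err.kind}") from err
            settings = adapt(settings, err.kind)
            sleep(settings["rate_limit"])  # PAUSE before retrying


# Demo: a fake scraper that is rate limited twice, then succeeds.
calls = []


def flaky_scrape(settings):
    calls.append(dict(settings))
    if len(calls) < 3:
        raise ScrapeError("rate_limited")
    return "done"


result = run_with_recovery(
    flaky_scrape,
    {"rate_limit": 0.5, "workers": 8, "batch_size": 50},
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(result, calls[-1]["rate_limit"], calls[-1]["workers"])
```

Some error kinds, such as a selector that is not found, cannot be fixed by retrying with different numbers; in this sketch they pass through `adapt` unchanged and end in escalation.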

Checkpoint Support

State saved to: .aiwg/working/checkpoints/doc-scraper/

Resume interrupted scrape:

skill-seekers scrape --config config.json --resume

Clear checkpoint and start fresh:

skill-seekers scrape --config config.json --fresh

Output Structure

output/<skill-name>/
├── SKILL.md              # Main skill description
├── references/           # Categorized documentation
│   ├── index.md          # Category index
│   ├── getting_started.md
│   ├── api_reference.md
│   └── guides.md
├── scripts/              # (empty, for user additions)
└── assets/               # (empty, for user additions)

output/<skill-name>_data/
├── pages/                # Raw scraped JSON (one per page)
└── summary.json          # Scrape statistics

Configuration Templates

Minimal Config

{
  "name": "myframework",
  "base_url": "https://docs.example.com/",
  "max_pages": 100
}

Full Config

{
  "name": "myframework",
  "description": "MyFramework documentation for building web apps",
  "base_url": "https://docs.example.com/",
  "selectors": {
    "main_content": "article, main, div[role='main']",
    "title": "h1, .title",
    "code_blocks": "pre code, .highlight code",
    "navigation": "nav, .sidebar"
  },
  "url_patterns": {
    "include": ["/docs/", "/api/", "/guide/"],
    "exclude": ["/blog/", "/changelog/", "/v1/", "/v2/"]
  },
  "categories": {
    "getting_started": ["intro", "quickstart", "install", "setup"],
    "concepts": ["concept", "overview", "architecture"],
    "api": ["api", "reference", "method", "function"],
    "guides": ["guide", "tutorial", "how-to", "example"],
    "advanced": ["advanced", "internals", "customize"]
  },
  "rate_limit": 0.5,
  "max_pages": 1000,
  "checkpoint": {
    "enabled": true,
    "interval": 100
  }
}

Troubleshooting

| Issue | Diagnosis | Solution |
|-------|-----------|----------|
| No content extracted | Selector mismatch | Inspect page, update main_content selector |
| Wrong pages scraped | URL pattern issue | Check include/exclude patterns |
| Rate limited | Too aggressive | Increase rate_limit to 1.0+ seconds |
| Memory issues | Too many pages | Add max_pages limit, enable checkpoints |
| Categories wrong | Keyword mismatch | Update category keywords in config |

References

Skill Seekers: https://github.com/jmagly/Skill_Seekers
REF-001: Production-Grade Agentic Workflows (BP-1, BP-4, BP-9)
REF-002: LLM Failure Modes (Archetype 1-4 mitigations)

Scrape documentation websites into organized reference files. Use when converting docs sites to searchable references or building Claude skills.

122 stars

development

Updated Apr 24, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add jmagly/aiwg doc-scraper

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Status | Scanner | Description | Reported % |
|--------|---------|-------------|------------|
| Clean | Trivy | Container and dependency vulnerability scanner | 95% |
| Clean | Semgrep | Static code analysis for vulnerabilities | 95% |
| Clean | mcp-scan (Snyk) | Model Context Protocol security validation | 95% |
| Skipped | Snyk (dep) | Open source security scanning | 50% |
| Skipped | Socket.dev | Supply chain security analysis | 50% |
| Skipped | VirusTotal | Multi-engine malware detection | 50% |
| Skipped | CrowdStrike | Advanced threat intelligence | 50% |
| Skipped | OSV-Scanner | Open Source Vulnerability database check | 50% |
| Skipped | OWASP Dep-Check | | 50% |

Last scanned: Apr 24, 2026, 5:19 PM · 301.5s · 1 file scanned

SKILL.md

namespace: aiwg
name: doc-scraper
description: Scrape documentation websites into organized reference files. Use when converting docs sites to searchable references or building Claude skills.
tools: Read, Write, Bash, WebFetch
platforms: [all]

Related Skills

jmagly/radar-status

data-ai

Verified · Trusted · Community

Report which research-corpus radar sidecars are overdue for refresh. Computes staleness (days since last refresh vs the cadence window) for every radar, sorted most-overdue-first. Runs via `aiwg corpus radar-status`.

140 · SKILL.md · Updated May 28, 2026

jmagly/radar-report

data-ai

Verified · Trusted · Community

Aggregate research-corpus radar sidecars into a corpus or per-cluster freshness report: totals, overdue count, per-cluster / per-GRADE / per-trajectory breakdowns, an overdue table, and per-radar rationale snippets. Runs via `aiwg corpus radar-report`.

140 · SKILL.md · Updated May 28, 2026

jmagly/radar-init

testing

Verified · Trusted · Community

Scaffold radar/freshness sidecars for research-corpus REFs. Pulls title/authors from the citation sidecar and GRADE from the analysis doc, defaults the refresh cadence from GRADE and the cluster from a corpus-local map, and stamps documentation/radar/REF-XXX-radar.md. Runs via `aiwg corpus radar-init`.

140 · SKILL.md · Updated May 28, 2026

jmagly/profile-temporal

data-ai

Verified · Trusted · Community

Compute an entity's publication trajectory: per-year paper counts, topic drift, hot-streak detection (≥3 consecutive A-grade years), and career phase. Runs via `aiwg corpus profile-temporal`.

140 · SKILL.md · Updated May 28, 2026

Install manually

# Clone the repo
git clone https://github.com/jmagly/aiwg.git

# Copy into Claude Code skills folder (global)
cp -r aiwg/agentic/code/addons/doc-intelligence/skills/doc-scraper ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

jmagly/aiwg

122 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT