skills/documentation-scraper/SKILL.md
Use when you need to scrape documentation websites into markdown for AI context. Triggers on "scrape docs", "download documentation", "get docs for [library]", or creating local copies of online documentation. CRITICAL - always analyze the sitemap before scraping.
npx skillsauth add ratacat/claude-skills documentation-scraper
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
slurp-ai scrapes documentation websites and compiles them into a single markdown file optimized for AI agent context. It is lightweight, fast, and deterministic: it does NOT use AI to scrape; its output is FOR AI consumption.
All commands in this skill MUST be run outside the sandbox. Use dangerouslyDisableSandbox: true for all Bash commands, including:
- which slurp (installation check)
- node analyze-sitemap.js (sitemap analysis)
- slurp (scraping)
- File inspection commands (wc, head, cat, etc.)

The sandbox blocks network access and file operations required for web scraping.
Before scraping, verify slurp-ai is installed:
which slurp || echo "NOT INSTALLED"
If not installed, ask the user to run:
npm install -g slurp-ai
Requires: Node.js v20+
Do NOT proceed with scraping until slurp-ai is confirmed installed.
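The Node.js requirement above can also be checked programmatically. The following is a minimal sketch, not part of slurp-ai; the helper name `meetsNodeRequirement` is hypothetical and it only inspects the major version number.

```javascript
// Hypothetical helper (not part of slurp-ai): returns true when a Node.js
// version string such as "20.11.0" or "v20.11.0" satisfies the v20+ requirement.
function meetsNodeRequirement(version, minMajor = 20) {
  const major = Number.parseInt(String(version).replace(/^v/, "").split(".")[0], 10);
  return Number.isInteger(major) && major >= minMajor;
}

// process.versions.node holds the running Node.js version without the "v" prefix.
console.log(meetsNodeRequirement(process.versions.node));
```

If this prints false, upgrade Node.js before asking the user to install slurp-ai.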
| Command | Purpose |
|---------|---------|
| slurp <url> | Fetch and compile in one step |
| slurp fetch <url> [version] | Download docs to partials only |
| slurp compile | Compile partials into single file |
| slurp read <package> [version] | Read local documentation |
Output: Creates slurp_compiled/compiled_docs.md from partials in slurp_partials/.
Before running slurp, ALWAYS analyze the sitemap. This reveals the complete site structure and informs your --base-path and --max decisions.
Use the included analyze-sitemap.js script:
node analyze-sitemap.js https://docs.example.com
This outputs:
- Total page count (informs --max)
- URLs grouped by section, with suggested prefixes (informs --base-path)

Example output:
📊 Total URLs in sitemap: 247
📁 URLs by top-level section:
/docs 182 pages
/api 45 pages
/blog 20 pages
🎯 Suggested --base-path options:
https://docs.example.com/docs/guides/ (67 pages)
https://docs.example.com/docs/reference/ (52 pages)
https://docs.example.com/api/ (45 pages)
💡 Recommended slurp commands:
# Just "/docs/guides" section (67 pages)
slurp https://docs.example.com/docs/guides/ --base-path https://docs.example.com/docs/guides/ --max 80
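To make the "URLs by top-level section" figures easier to interpret, here is a rough sketch of how such a grouping could be computed from a list of sitemap URLs. It is an illustration only and is not claimed to be how analyze-sitemap.js works; the function name `countByTopLevelSection` is hypothetical.

```javascript
// Hypothetical sketch: count URLs per first path segment, largest group first.
// This is NOT the actual analyze-sitemap.js implementation.
function countByTopLevelSection(urls) {
  const counts = new Map();
  for (const raw of urls) {
    const { pathname } = new URL(raw);
    const first = pathname.split("/").filter(Boolean)[0];
    const section = first ? `/${first}` : "/";
    counts.set(section, (counts.get(section) ?? 0) + 1);
  }
  // Array.prototype.sort is stable, so ties keep first-seen order.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

const sample = [
  "https://docs.example.com/docs/guides/intro",
  "https://docs.example.com/docs/reference/cli",
  "https://docs.example.com/api/users",
];
for (const [section, count] of countByTopLevelSection(sample)) {
  console.log(`${section} ${count} pages`);
}
```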
| Sitemap Shows | Action |
|---------------|--------|
| < 50 pages total | Scrape entire site: slurp <url> --max 60 |
| 50-200 pages | Scope to relevant section with --base-path |
| 200+ pages | Must scope down - pick specific subsection |
| No sitemap found | Start with --max 30, inspect partials, adjust |
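The decision table above can be expressed as a small function. This is a sketch under one stated assumption: a count of exactly 200 is treated as part of the "50-200" row, since the table leaves that boundary ambiguous. The name `chooseScrapeStrategy` and the returned field names are hypothetical.

```javascript
// Hypothetical helper mirroring the decision table above.
// Assumption: exactly 200 pages falls in the "50-200" row.
// Pass null or undefined when no sitemap was found.
function chooseScrapeStrategy(totalPages) {
  if (totalPages === null || totalPages === undefined) {
    return { scope: "unknown", advice: "start with --max 30, inspect partials, adjust" };
  }
  if (totalPages < 50) {
    return { scope: "entire-site", advice: "scrape the whole site, e.g. --max 60" };
  }
  if (totalPages <= 200) {
    return { scope: "section", advice: "scope to the relevant section with --base-path" };
  }
  return { scope: "subsection", advice: "pick a specific subsection with --base-path" };
}

console.log(chooseScrapeStrategy(247).scope); // prints "subsection"
```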
With sitemap data, you can now set accurate parameters:
# From sitemap: /docs/api has 45 pages
slurp https://docs.example.com/docs/api/intro \
--base-path https://docs.example.com/docs/api/ \
--max 55
Key insight: Starting URL is where crawling begins. Base path filters which links get followed. They can differ (useful when base path itself returns 404).
# Express.js 4.x docs
slurp https://expressjs.com/en/4x/api.html --base-path https://expressjs.com/en/4x/
# React docs (latest)
slurp https://react.dev/learn --base-path https://react.dev/learn
# API section only (start URL differs from base path)
slurp https://docs.example.com/api/introduction --base-path https://docs.example.com/api/
# Full documentation site
slurp https://docs.example.com/
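The difference between the starting URL and the base path can be illustrated with a small filter. This sketch assumes the base path acts as a plain string-prefix match on the absolute URL; the source only says it "filters which links get followed", so slurp's real matching rules may differ. The function name `followsBasePath` is hypothetical.

```javascript
// Hypothetical sketch of base-path filtering.
// Assumption: a link is followed only if its absolute URL starts with the base path.
function followsBasePath(link, basePath) {
  return new URL(link).href.startsWith(new URL(basePath).href);
}

const basePath = "https://expressjs.com/en/4x/";
// The starting URL lies under the base path, so crawling can begin there.
console.log(followsBasePath("https://expressjs.com/en/4x/api.html", basePath)); // prints true
// A link into another version's docs falls outside the prefix and is skipped.
console.log(followsBasePath("https://expressjs.com/en/5x/api.html", basePath)); // prints false
```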
Scraping flags:

| Flag | Default | Purpose |
|------|---------|---------|
| --max <n> | 20 | Maximum pages to scrape |
| --concurrency <n> | 5 | Parallel page requests |
| --headless <bool> | true | Use headless browser |
| --base-path <url> | start URL | Filter links to this prefix |
| --output <dir> | ./slurp_partials | Output directory for partials |
| --retry-count <n> | 3 | Retries for failed requests |
| --retry-delay <ms> | 1000 | Delay between retries |
| --yes | - | Skip confirmation prompts |

Compile flags:

| Flag | Default | Purpose |
|------|---------|---------|
| --input <dir> | ./slurp_partials | Input directory |
| --output <file> | ./slurp_compiled/compiled_docs.md | Output file |
| --preserve-metadata | true | Keep metadata blocks |
| --remove-navigation | true | Strip nav elements |
| --remove-duplicates | true | Eliminate duplicates |
| --exclude <json> | - | JSON array of regex patterns to exclude |
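Because --exclude takes a JSON array of regex patterns, it can help to build and validate that value in code rather than by hand. The sketch below only produces the JSON string; the example patterns are made up, and what the patterns are matched against (URLs or content) is not stated here, so treat that as unknown. The name `buildExcludeArg` is hypothetical.

```javascript
// Hypothetical helper: validates each pattern as a JavaScript regex,
// then serializes the list as the JSON array that --exclude expects.
function buildExcludeArg(patterns) {
  for (const pattern of patterns) {
    new RegExp(pattern); // throws SyntaxError if the pattern is not a valid regex
  }
  return JSON.stringify(patterns);
}

// Example patterns are illustrative only.
console.log(buildExcludeArg(["Table of Contents", "^Copyright"]));
// prints ["Table of Contents","^Copyright"]
```

When passing the result on a shell command line, wrap it in single quotes so the double quotes inside the JSON survive.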
Use --headless false for:
Default is headless (true) - works for most modern doc sites including SPAs.
slurp_partials/ # Intermediate files
├── page1.md
└── page2.md
slurp_compiled/ # Final output
└── compiled_docs.md # Compiled result
# 1. ALWAYS analyze sitemap first
node analyze-sitemap.js https://docs.example.com
# 2. Scrape with informed parameters (from sitemap analysis)
slurp https://docs.example.com/docs/ --base-path https://docs.example.com/docs/ --max 80
# 3. Skip prompts for automation
slurp https://docs.example.com/ --yes
# 4. Check output
head -100 slurp_compiled/compiled_docs.md
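For automation, step 2 of the workflow above can be assembled from the sitemap findings. This is a minimal sketch that uses only flags documented in this skill (--base-path, --max, --yes); the function name `buildSlurpArgs` and its option names are hypothetical.

```javascript
// Hypothetical helper: turns sitemap findings into an argument list for slurp.
// Only flags documented in this skill are emitted.
function buildSlurpArgs({ startUrl, basePath, max, yes = false }) {
  const args = [startUrl];
  if (basePath) args.push("--base-path", basePath);
  if (Number.isInteger(max)) args.push("--max", String(max));
  if (yes) args.push("--yes");
  return args;
}

const args = buildSlurpArgs({
  startUrl: "https://docs.example.com/docs/",
  basePath: "https://docs.example.com/docs/",
  max: 80,
  yes: true,
});
console.log(["slurp", ...args].join(" "));
// prints slurp https://docs.example.com/docs/ --base-path https://docs.example.com/docs/ --max 80 --yes
```

The resulting array could be handed to Node's `child_process.spawnSync("slurp", args, { stdio: "inherit" })`, which avoids shell quoting issues.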
| Problem | Cause | Solution |
|---------|-------|----------|
| Wrong --max value | Guessing page count | Run analyze-sitemap.js first |
| Too few pages scraped | --max limit (default 20) | Set --max based on sitemap analysis |
| Missing content | JS not rendering | Ensure --headless true (default) |
| Crawl stuck/slow | Rate limiting | Lower concurrency, e.g. --concurrency 3 |
| Duplicate sections | Similar content | Use --remove-duplicates (default) |
| Wrong pages included | Base path too broad | Use sitemap to find correct --base-path |
| Prompts blocking automation | Interactive mode | Add --yes flag |
The output markdown is designed for AI context injection:
# Check file size (context budget)
wc -c slurp_compiled/compiled_docs.md
# Preview structure
grep "^#" slurp_compiled/compiled_docs.md | head -30
# Use with Claude Code - reference in prompt or via @file
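The byte count from wc -c can be turned into a rough context-budget check. The sketch below assumes about 4 bytes per token, which is a common rule of thumb for English text and not a measured value for any particular model; both function names are hypothetical.

```javascript
// Hypothetical helpers for a rough context-budget check.
// Assumption: roughly 4 bytes per token; real tokenizers vary with content.
function estimateTokens(byteCount, bytesPerToken = 4) {
  return Math.ceil(byteCount / bytesPerToken);
}

function fitsBudget(byteCount, tokenBudget, bytesPerToken = 4) {
  return estimateTokens(byteCount, bytesPerToken) <= tokenBudget;
}

// A 400,000-byte compiled_docs.md is roughly 100,000 tokens under this assumption.
console.log(estimateTokens(400000)); // prints 100000
console.log(fitsBudget(400000, 150000)); // prints true
```

If the estimate exceeds the budget, narrow --base-path or lower --max and scrape again.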