skills/backup/website-doc-scraper/SKILL.md
Scrapes website pages to Markdown with state management and deduplication. Supports Fast Path (single page), Layer Mode (incremental crawling), and Site Mode (full discovery).
npx skillsauth add sampx/agent-tools website-doc-scraper
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Scrapes website documentation and saves it as Markdown files.
Default Output: docs/scraped/<domain>/
Before ANY batch scraping (Layer Mode or Site Mode), you MUST:
⛔ NEVER start batch scraping without user confirmation!
| Use Case | Pattern | Description |
|----------|---------|-------------|
| Docs only | ^/docs | Match paths starting with /docs |
| Chinese docs | ^/docs/zh | Match /docs/zh paths |
| English docs | ^/docs/en | Match /docs/en paths |
| Exclude blog | ^(?!/blog).* | Exclude /blog paths |
| Exclude multiple | ^(?!/blog\|/forum\|/api).* | Exclude multiple paths |
| All pages | ^.* | Or leave empty |
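The patterns in the table can be sanity-checked before they are passed to set-path-filter. The following is a minimal sketch, assuming the filter is matched against the URL path with Python's `re.search`; how state_manager.py actually applies the pattern is not shown here, and `path_allowed` is a hypothetical helper. Note that the `\|` in the table is Markdown cell escaping, so the regex itself uses a plain `|`.

```python
import re
from urllib.parse import urlparse


def path_allowed(url: str, pattern: str) -> bool:
    """Return True if the URL's path matches the filter pattern.

    Hypothetical helper: assumes the filter is tested against the path
    component only, and that an empty pattern means "all pages".
    """
    if not pattern:
        return True
    path = urlparse(url).path or "/"
    return re.search(pattern, path) is not None


urls = [
    "https://example.com/docs/en/intro",
    "https://example.com/docs/zh/intro",
    "https://example.com/blog/release-notes",
    "https://example.com/api/v1",
]

# Keep only documentation pages.
docs_only = [u for u in urls if path_allowed(u, r"^/docs")]
# Negative lookahead: everything except paths starting with /blog.
no_blog = [u for u in urls if path_allowed(u, r"^(?!/blog).*")]
# Several exclusions combined with alternation inside the lookahead.
no_multi = [u for u in urls if path_allowed(u, r"^(?!/blog|/forum|/api).*")]
```

Running a candidate pattern over a handful of representative URLs like this gives the user something concrete to confirm before batch scraping starts.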
| Mode | Trigger | Description |
|------|---------|-------------|
| Fast Path | Single URL | Quick single page save |
| Layer Mode | "Scrape related pages" | Layer-by-layer with confirmation |
| Site Mode | "Full site", "all docs" | Complete site discovery |
Always use the largest batch size possible (recommended: 20+ URLs per batch).
Why? Tavily's behavior depends on result size:
| Result Size | Tavily Behavior | Agent Handling |
|-------------|-----------------|----------------|
| > 25k tokens | Saves to file, returns file path | ✅ Use save-batch --input-file (efficient) |
| ≤ 25k tokens | Returns JSON in response | ⚠️ Parse JSON manually (token-heavy, error-prone) |
Best Practice:
Get filename & Extract:
python ./scripts/state_manager.py get-filename --output-dir <dir> --url <url>
Then use tavily_extract(urls=[<url>], extract_depth="advanced").
Save: Write content to the full_path returned.
Ask user: "Continue scraping more pages? Choose:\n1. Layer Mode (incremental crawling)\n2. Site Mode (full site discovery)\n3. Finish"
If user chooses Layer Mode or Site Mode, import the page first:
python ./scripts/state_manager.py import-single --output-dir <dir> --url <url> --file <path>
Then follow the User Confirmation Required section above before proceeding.
⛔ Confirm URL filter with user before starting! (See "Critical: User Confirmation Required" above)
# Set path filter based on user-confirmed criteria
python ./scripts/state_manager.py set-path-filter --output-dir <dir> --pattern "<user-confirmed-pattern>"
# Verify and show to user
python ./scripts/state_manager.py stats --output-dir <dir>
For each layer N:
1. Discover Links
python ./scripts/state_manager.py get-base-url --output-dir <dir>
python ./scripts/extract_links.py <output-dir> --base-url <base_url> --include-internal --output /tmp/urls.txt
python ./scripts/state_manager.py add-urls --output-dir <dir> --urls-file /tmp/urls.txt
2. Preview & Confirm
python ./scripts/state_manager.py stats --output-dir <dir>
python ./scripts/state_manager.py preview --output-dir <dir> --size 10
⚠️ WAIT for user confirmation before scraping.
3. Batch Scrape (loop until queue empty)
# Get batch
python ./scripts/state_manager.py next-batch --output-dir <dir> --size 20
# Extract with Tavily
tavily_extract(urls=batch, extract_depth="advanced")
# Save results (choose one):
# Option A: If Tavily returns a file path
python ./scripts/state_manager.py save-batch --output-dir <dir> --input-file <file>
# Option B: If Tavily returns JSON content
# Save files directly, then mark as scraped
python ./scripts/state_manager.py mark-scraped --output-dir <dir> --urls <url1> <url2>
4. Continue: Ask "Continue to Layer N+1?" → repeat from step 1, or proceed to Link Verification.
When no new links discovered → proceed to Link Verification.
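The two save options in step 3 can be sketched as a small dispatcher. This is an illustration only: it assumes the Tavily result reaches the agent either as a file path string (large result) or as already-parsed JSON (small result). `build_save_command` and the shape of `result` are hypothetical; the subcommands and flags are the ones shown in the steps above.

```python
from typing import List, Sequence, Union

STATE_MANAGER = "./scripts/state_manager.py"


def build_save_command(
    output_dir: str,
    result: Union[str, dict],
    batch_urls: Sequence[str],
) -> List[str]:
    """Build the state_manager.py call that records a scraped batch.

    Assumption: `result` is a str when Tavily saved the response to a file
    (Option A) and a dict when the JSON came back inline (Option B). In the
    inline case the caller is expected to have written the Markdown files
    itself before marking the URLs as scraped.
    """
    base = ["python", STATE_MANAGER]
    if isinstance(result, str):
        # Option A: hand the saved file to save-batch.
        return base + ["save-batch", "--output-dir", output_dir, "--input-file", result]
    # Option B: files already written by the caller; only update state.
    return base + ["mark-scraped", "--output-dir", output_dir, "--urls", *batch_urls]


batch = ["https://example.com/docs/a", "https://example.com/docs/b"]
cmd_a = build_save_command("docs/scraped/example.com", "/tmp/tavily_result.json", batch)
cmd_b = build_save_command("docs/scraped/example.com", {"results": []}, batch)
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building the arguments as a list rather than a shell string avoids quoting problems with URLs that contain `&` or `?`.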
Use tavily_map for initial discovery, then iterate with Layer Mode for complete coverage.
⛔ Confirm URL filter with user before starting! (See "Critical: User Confirmation Required" above)
# Initialize
python ./scripts/state_manager.py init --output-dir <dir> --base-url <url>
# Set filter based on user-confirmed criteria
python ./scripts/state_manager.py set-path-filter --output-dir <dir> --pattern "<user-confirmed-pattern>"
# Show to user and get confirmation
python ./scripts/state_manager.py stats --output-dir <dir>
Tell the user: "✓ State manager initialized\n✓ Filter rule set: <pattern>\n\nConfirm to start site discovery?"
⛔ DO NOT proceed with site discovery until user explicitly confirms!
1. Discover via Sitemap
tavily_map(url=<base_url>, max_depth=2)
Then add discovered URLs:
python ./scripts/state_manager.py add-urls --output-dir <dir> --urls <url1> <url2> ...
2. Hybrid Iteration (recommended for complete coverage)
After scraping initial batch, extract links from scraped pages and add new URLs. Repeat until pending_urls = 0.
3. Batch Scrape: Same as Layer Mode step 3.
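The hybrid iteration amounts to a breadth-first crawl over a deduplicated queue that stops when nothing is pending. Below is a minimal in-memory sketch, assuming a hypothetical `links_of` function that stands in for link extraction from scraped pages; the real queue lives in state_manager.py, whose internals are not shown here.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl_until_empty(
    seeds: Iterable[str],
    links_of: Callable[[str], Iterable[str]],
    batch_size: int = 20,
) -> List[str]:
    """Scrape in batches, queueing newly discovered URLs until none are pending."""
    seen = set()
    pending = deque()
    scraped = []

    def enqueue(urls):
        for url in urls:
            if url not in seen:  # deduplicate: each URL is queued once
                seen.add(url)
                pending.append(url)

    enqueue(seeds)
    while pending:  # corresponds to "repeat until pending_urls = 0"
        batch = [pending.popleft() for _ in range(min(batch_size, len(pending)))]
        for url in batch:
            scraped.append(url)     # stand-in for extract + save
            enqueue(links_of(url))  # links found on the scraped page
    return scraped


# Toy link graph standing in for a real site.
site = {
    "/docs": ["/docs/a", "/docs/b"],
    "/docs/a": ["/docs/b", "/docs/c"],
    "/docs/b": ["/docs"],
    "/docs/c": [],
}
order = crawl_until_empty(["/docs"], lambda u: site.get(u, []), batch_size=2)
```

Because every URL enters the queue at most once, the loop terminates even when pages link back to each other, as `/docs/b` does here.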
After scraping completes:
# Check links (Requires all pending URLs to be scraped first!)
python ./scripts/check_markdown_links.py <output-dir>
# Fix links (absolute → relative, add .md, convert in-scope external links)
python ./scripts/fix_markdown_links.py <output-dir>
# Re-verify
python ./scripts/check_markdown_links.py <output-dir>
Note:
check_markdown_links.py will properly identify external links that are within the scraping scope. If they are already scraped, fix_markdown_links.py will convert them to internal relative links. If they are pending, the checker will alert you.
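The absolute → relative conversion can be illustrated as follows. This is a sketch of the idea, not the logic of fix_markdown_links.py: it assumes each page is saved as `<path>.md` in a layout that mirrors the URL path, and `to_relative_md` is a hypothetical helper.

```python
import posixpath
from urllib.parse import urlparse


def to_relative_md(link: str, current_page: str, base_url: str) -> str:
    """Rewrite an in-scope absolute link as a relative .md link.

    Assumed layout (hypothetical): the page at <base_url>/<path> is stored as
    <path>.md, and `current_page` is the URL path of the page being fixed.
    Links on a different host than base_url are returned unchanged.
    """
    base = urlparse(base_url)
    target = urlparse(link)
    if target.netloc != base.netloc:
        return link  # genuinely external: leave as-is
    target_file = target.path.rstrip("/") + ".md"
    current_dir = posixpath.dirname(current_page.rstrip("/"))
    relative = posixpath.relpath(target_file, start=current_dir or "/")
    if target.fragment:
        relative += "#" + target.fragment  # keep in-page anchors
    return relative


fixed = to_relative_md(
    "https://example.com/docs/api/auth#tokens",
    current_page="/docs/guide/intro",
    base_url="https://example.com",
)
external = to_relative_md(
    "https://other.org/page", "/docs/guide/intro", "https://example.com"
)
```

Here `fixed` resolves to a path that climbs out of `guide/` and into `api/`, while the link on another host passes through untouched.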
| Command | Description |
|---------|-------------|
| get-filename | Get correct filename for URL |
| get-base-url | Get base URL from state |
| init | Initialize project state |
| add-urls | Add URLs to queue |
| next-batch | Get next batch |
| save-batch | Save Tavily results |
| mark-scraped | Mark URLs completed |
| stats | Show statistics |
| preview | Preview next batch |
| set-path-filter | ⚠️ Set path filter (REQUIRED before batch scraping!) |
For detailed arguments and state file format, see ./references/scraper-guide.md.