sampx/website-doc-scraper

Name: website-doc-scraper
Author: sampx

skills/backup/website-doc-scraper/SKILL.md

npx skillsauth add sampx/agent-tools website-doc-scraper

Website Doc Scraper

Scrapes website documentation and saves as Markdown files.

Default Output: docs/scraped/<domain>/
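
As a rough illustration of the default output location, the sketch below derives `docs/scraped/<domain>/` from a URL. It assumes that `<domain>` means the URL's full hostname; the helper name `default_output_dir` is hypothetical and is not part of the skill's scripts.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def default_output_dir(url: str, root: str = "docs/scraped") -> str:
    """Return '<root>/<hostname>/' for a URL (assumption: <domain> is the hostname)."""
    host = urlparse(url).hostname
    if not host:
        raise ValueError(f"URL has no hostname: {url!r}")
    return f"{PurePosixPath(root) / host}/"


print(default_output_dir("https://docs.example.com/guide/intro"))
```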

⚠️ Critical: User Confirmation Required

Before ANY batch scraping (Layer Mode or Site Mode), you MUST:

1. Ask user what URL range to scrape
2. Set path_filter based on user requirements
3. Show filter rules and get explicit confirmation
4. Only start scraping after confirmation

⛔ NEVER start batch scraping without user confirmation!

Common Filter Patterns

| Use Case | Pattern | Description |
|----------|---------|-------------|
| Docs only | `^/docs` | Match paths starting with /docs |
| Chinese docs | `^/docs/zh` | Match /docs/zh paths |
| English docs | `^/docs/en` | Match /docs/en paths |
| Exclude blog | `^(?!/blog).*` | Exclude /blog paths |
| Exclude multiple | `^(?!/blog\|/forum\|/api).*` | Exclude multiple paths |
| All pages | `^.*` | Or leave empty |
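
To preview what a pattern from the table keeps before confirming it with the user, a quick check with Python's `re` module can help. This is a minimal sketch that assumes the filter is applied to the URL path as an ordinary Python regular expression; the helper `path_allowed` and the sample paths are illustrative, not taken from the skill's scripts. The `\|` in the table is only Markdown escaping, so the actual pattern uses a plain `|`.

```python
import re

PATTERNS = {
    "docs_only": r"^/docs",
    "exclude_blog": r"^(?!/blog).*",
    "exclude_multiple": r"^(?!/blog|/forum|/api).*",
}


def path_allowed(pattern: str, path: str) -> bool:
    """True if the path matches the filter; an empty pattern allows everything."""
    return not pattern or re.search(pattern, path) is not None


paths = ["/docs/intro", "/docs/zh/intro", "/blog/post-1", "/api/v1", "/pricing"]
for name, pattern in PATTERNS.items():
    kept = [p for p in paths if path_allowed(pattern, p)]
    print(f"{name}: {kept}")
```

These patterns are prefix matches: `^/docs` also matches `/docs-legacy`, and `^(?!/blog).*` also excludes `/blogroll`. Append a `/` to the prefix if that matters for the site being scraped.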

Mode Selection

| Mode | Trigger | Description |
|------|---------|-------------|
| Fast Path | Single URL | Quick single page save |
| Layer Mode | "Scrape related pages" | Layer-by-layer with confirmation |
| Site Mode | "Full site", "all docs" | Complete site discovery |

Scraping Strategy

Batch Size Optimization

Always use the largest batch size possible (recommended: 20+ URLs per batch).

Why? Tavily's behavior depends on result size:

| Result Size | Tavily Behavior | Agent Handling |
|-------------|-----------------|----------------|
| > 25k tokens | Saves to file, returns file path | ✅ Use save-batch --input-file (efficient) |
| ≤ 25k tokens | Returns JSON in response | ⚠️ Parse JSON manually (token-heavy, error-prone) |

Best Practice:

- Prefer file-based results: larger batches are more likely to exceed the 25k-token threshold, which makes Tavily write the results to a file.
- Avoid small batches: JSON responses consume context tokens and increase parsing errors.
- Default to a batch size of 20, a good balance between efficiency and API limits.
- Error recovery: if JSON parsing or file writing fails, spawn multiple subagents in parallel, each handling one URL.
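
The batching rule above can be sketched in a few lines. This is an illustration only: the helper names `batches` and `handling_for` are hypothetical, and the 25,000-token threshold is simply the figure quoted in the table, not something queried from Tavily.

```python
from typing import Iterator

TOKEN_THRESHOLD = 25_000  # figure quoted in the table above


def batches(urls: list[str], size: int = 20) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def handling_for(result_tokens: int) -> str:
    """Pick the handling path described in the table for a given result size."""
    if result_tokens > TOKEN_THRESHOLD:
        return "save-batch --input-file"
    return "parse JSON, then mark-scraped"


urls = [f"https://example.com/docs/page-{i}" for i in range(45)]
print([len(batch) for batch in batches(urls)])
print(handling_for(30_000))
```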

Fast Path (Single Page)

1. Get filename & Extract:
```
python ./scripts/state_manager.py get-filename --output-dir <dir> --url <url>
```
   Then use tavily_extract(urls=[<url>], extract_depth="advanced").
2. Save: Write content to the full_path returned.
3. Ask user: "Continue scraping more pages? Choose:\n1. Layer Mode (incremental crawling)\n2. Site Mode (full site discovery)\n3. Finish"

If user chooses Layer Mode or Site Mode, import the page first:

python ./scripts/state_manager.py import-single --output-dir <dir> --url <url> --file <path>

Then follow the User Confirmation Required section above before proceeding.

Layer Mode (Incremental Crawling)

⛔ Confirm URL filter with user before starting! (See "Critical: User Confirmation Required" above)

Setup (First Time Only)

# Set path filter based on user-confirmed criteria
python ./scripts/state_manager.py set-path-filter --output-dir <dir> --pattern "<user-confirmed-pattern>"

# Verify and show to user
python ./scripts/state_manager.py stats --output-dir <dir>

Layer Loop

For each layer N:

1. Discover Links

python ./scripts/state_manager.py get-base-url --output-dir <dir>
python ./scripts/extract_links.py <output-dir> --base-url <base_url> --include-internal --output /tmp/urls.txt
python ./scripts/state_manager.py add-urls --output-dir <dir> --urls-file /tmp/urls.txt

2. Preview & Confirm

python ./scripts/state_manager.py stats --output-dir <dir>
python ./scripts/state_manager.py preview --output-dir <dir> --size 10

⚠️ WAIT for user confirmation before scraping.

3. Batch Scrape (loop until queue empty)

# Get batch
python ./scripts/state_manager.py next-batch --output-dir <dir> --size 20

# Extract with Tavily
tavily_extract(urls=batch, extract_depth="advanced")

# Save results (choose one):
# Option A: If Tavily returns a file path
python ./scripts/state_manager.py save-batch --output-dir <dir> --input-file <file>

# Option B: If Tavily returns JSON content
# Save files directly, then mark as scraped
python ./scripts/state_manager.py mark-scraped --output-dir <dir> --urls <url1> <url2>

4. Continue: Ask "Continue to Layer N+1?" → repeat from step 1, or proceed to Link Verification.

Termination

When no new links discovered → proceed to Link Verification.
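
The layer loop is essentially a breadth-first crawl with deduplication and a path filter. The sketch below simulates it on an in-memory link graph; `crawl_layers`, `SITE`, and the use of bare paths in place of full URLs are all illustrative assumptions, and the per-layer user confirmation required by the skill is left out for brevity.

```python
import re
from typing import Callable, Iterable


def crawl_layers(
    start: str,
    get_links: Callable[[str], Iterable[str]],
    path_filter: str = r"^.*",
    max_layers: int = 10,
) -> list[list[str]]:
    """Return pages grouped by layer; stops when a layer discovers no new links."""
    pattern = re.compile(path_filter)
    seen = {start}
    frontier = [start]
    layers: list[list[str]] = []
    while frontier and len(layers) < max_layers:
        layers.append(frontier)
        discovered = []
        for page in frontier:
            for link in get_links(page):
                # Skip duplicates and anything outside the confirmed filter.
                if link not in seen and pattern.search(link):
                    seen.add(link)
                    discovered.append(link)
        frontier = discovered
    return layers


# Toy site: each page maps to the links found on it.
SITE = {
    "/docs": ["/docs/a", "/docs/b", "/blog/news"],
    "/docs/a": ["/docs/b", "/docs/c"],
    "/docs/b": ["/docs"],
    "/docs/c": [],
}

layers = crawl_layers("/docs", lambda page: SITE.get(page, []), path_filter=r"^/docs")
print(layers)  # [['/docs'], ['/docs/a', '/docs/b'], ['/docs/c']]
```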

Site Mode (Full Site Discovery)

Use tavily_map for initial discovery, then iterate with Layer Mode for complete coverage.

⛔ Confirm URL filter with user before starting! (See "Critical: User Confirmation Required" above)

Setup

# Initialize
python ./scripts/state_manager.py init --output-dir <dir> --base-url <url>

# Set filter based on user-confirmed criteria
python ./scripts/state_manager.py set-path-filter --output-dir <dir> --pattern "<user-confirmed-pattern>"

# Show to user and get confirmation
python ./scripts/state_manager.py stats --output-dir <dir>

Tell the user: "✓ State manager initialized\n✓ Filter rule set: <pattern>\n\nConfirm to start site discovery?"

⛔ DO NOT proceed with site discovery until user explicitly confirms!

Execution

1. Discover via Sitemap

tavily_map(url=<base_url>, max_depth=2)

Then add discovered URLs:

python ./scripts/state_manager.py add-urls --output-dir <dir> --urls <url1> <url2> ...

2. Hybrid Iteration (recommended for complete coverage)

After scraping initial batch, extract links from scraped pages and add new URLs. Repeat until pending_urls = 0.

3. Batch Scrape: Same as Layer Mode step 3.

Link Verification

After scraping completes:

# Check links (Requires all pending URLs to be scraped first!)
python ./scripts/check_markdown_links.py <output-dir>

# Fix links (absolute → relative, add .md, convert in-scope external links)
python ./scripts/fix_markdown_links.py <output-dir>

# Re-verify
python ./scripts/check_markdown_links.py <output-dir>

Note: check_markdown_links.py will properly identify external links that are within the scraping scope. If they are already scraped, fix_markdown_links.py will convert them to internal relative links. If they are pending, the checker will alert you.
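
To make the "absolute → relative, add .md" conversion concrete, here is a minimal sketch of that kind of rewrite. It is not the logic of fix_markdown_links.py, which is not shown here. The URL-to-file mapping in `to_local_path`, the helper `fix_links`, and the sample URLs are all assumptions made for illustration.

```python
import posixpath
import re
from urllib.parse import urlparse

LINK_RE = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")


def to_local_path(url_path: str) -> str:
    """Hypothetical mapping from a URL path to a Markdown file: /docs/a -> docs/a.md."""
    trimmed = url_path.strip("/") or "index"
    return trimmed + ".md"


def fix_links(markdown: str, page_url: str, scraped_paths: set[str]) -> str:
    """Rewrite same-site links to already-scraped pages as relative .md links."""
    page = urlparse(page_url)
    page_dir = posixpath.dirname(to_local_path(page.path))

    def replace(match: re.Match) -> str:
        text, target = match.group(1), match.group(2)
        parsed = urlparse(target)
        same_site = parsed.netloc in ("", page.netloc)
        if not (same_site and parsed.path.startswith("/") and parsed.path in scraped_paths):
            return match.group(0)  # external or not yet scraped: leave untouched
        relative = posixpath.relpath("/" + to_local_path(parsed.path), start="/" + page_dir)
        fragment = f"#{parsed.fragment}" if parsed.fragment else ""
        return f"[{text}]({relative}{fragment})"

    return LINK_RE.sub(replace, markdown)


source = (
    "See [setup](https://example.com/docs/guide/setup#install), "
    "[API](/docs/api) and [blog](https://example.com/blog/x)."
)
fixed = fix_links(
    source,
    page_url="https://example.com/docs/guide/intro",
    scraped_paths={"/docs/guide/setup", "/docs/api"},
)
print(fixed)
# See [setup](setup.md#install), [API](../api.md) and [blog](https://example.com/blog/x).
```

Links to pages that have not been scraped are left as-is in this sketch, which mirrors the note above that pending pages should be scraped before links are fixed.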

Quick Reference

| Command | Description |
|---------|-------------|
| get-filename | Get correct filename for URL |
| get-base-url | Get base URL from state |
| init | Initialize project state |
| add-urls | Add URLs to queue |
| next-batch | Get next batch |
| save-batch | Save Tavily results |
| mark-scraped | Mark URLs completed |
| stats | Show statistics |
| preview | Preview next batch |
| set-path-filter | ⚠️ Set path filter (REQUIRED before batch scraping!) |

For detailed arguments and state file format, see ./references/scraper-guide.md.
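
If the commands are driven from a script rather than typed by hand, building them as argument lists avoids shell-quoting problems with regex patterns and URLs. The sketch below only assembles the command lines and does not run them; `state_cmd` is a hypothetical helper, and it uses only subcommands and flags that appear in this document.

```python
import shlex

SCRIPT = "./scripts/state_manager.py"


def state_cmd(subcommand: str, output_dir: str, **options) -> list[str]:
    """Build an argv list; keyword names become flags (urls_file -> --urls-file)."""
    argv = ["python", SCRIPT, subcommand, "--output-dir", output_dir]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if isinstance(value, (list, tuple)):
            argv += [flag, *map(str, value)]  # e.g. --urls <url1> <url2>
        else:
            argv += [flag, str(value)]
    return argv


out = "docs/scraped/example.com"
commands = [
    state_cmd("set-path-filter", out, pattern="^/docs"),
    state_cmd("next-batch", out, size=20),
    state_cmd("mark-scraped", out, urls=["https://example.com/docs/a", "https://example.com/docs/b"]),
]
for argv in commands:
    print(shlex.join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` once the user has confirmed the filter.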

Scrapes website pages to Markdown with state management and deduplication. Supports Fast Path (single page), Layer Mode (incremental crawling), and Site Mode (full discovery).

Stars: 1
Category: development
Updated: Apr 15, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add sampx/agent-tools website-doc-scraper

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 24, 2026, 8:46 PM (1.8s, 1 file scanned)

Install manually

# Clone the repo
git clone https://github.com/sampx/agent-tools.git

# Copy into Claude Code skills folder (global)
cp -r agent-tools/skills/backup/website-doc-scraper ~/.claude/skills/

For the skills folder location, see the official Claude Code Skills docs.

Repository: sampx/agent-tools

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT