deepresearch/.claude/skills/web-scrape/SKILL.md
Intelligent web scraper with content extraction, multiple output formats, and error handling
npx skillsauth add 21pounder/terminalagent web-scrapeInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
/web-scrape <url> [options]
Options:
--format=markdown|json|text - Output format (default: markdown)--full - Include full page content (skip smart extraction)--screenshot - Also save a screenshot--scroll - Scroll to load dynamic content (infinite scroll pages)Examples:
/web-scrape https://example.com/article
/web-scrape https://news.site.com/story --format=json
/web-scrape https://spa-app.com/page --scroll --screenshot
1. mcp__playwright__browser_navigate
url: "<target URL>"
2. mcp__playwright__browser_wait_for
time: 2 (allow initial render)
If --scroll option: Execute scroll sequence to trigger lazy loading:
3. mcp__playwright__browser_evaluate
function: "async () => {
for (let i = 0; i < 3; i++) {
window.scrollTo(0, document.body.scrollHeight);
await new Promise(r => setTimeout(r, 1000));
}
window.scrollTo(0, 0);
}"
4. mcp__playwright__browser_snapshot
→ Returns full accessibility tree with all text content
If --screenshot option:
5. mcp__playwright__browser_take_screenshot
filename: "scraped_<domain>_<timestamp>.png"
fullPage: true
6. mcp__playwright__browser_close
After getting the snapshot, apply intelligent extraction:
| Page Type | Indicators | Extraction Strategy |
|-----------|------------|---------------------|
| Article/Blog | <article>, long paragraphs, date/author | Extract main article body |
| Product Page | Price, "Add to Cart", specs | Extract title, price, description, specs |
| Documentation | Code blocks, headings hierarchy | Preserve structure and code |
| List/Search | Repeated item patterns | Extract as structured list |
| Landing Page | Hero section, CTAs | Extract key messaging |
ALWAYS REMOVE these elements from output:
For Articles:
# [Title]
**Source:** [URL]
**Date:** [if available]
**Author:** [if available]
---
[Main content in clean markdown]
For Product Pages:
# [Product Name]
**Price:** [price]
**Availability:** [in stock/out of stock]
## Description
[product description]
## Specifications
| Spec | Value |
|------|-------|
| ... | ... |
Clean, readable markdown with proper headings, lists, and formatting.
{
"url": "https://...",
"title": "Page Title",
"type": "article|product|docs|list",
"content": {
"main": "...",
"metadata": {}
},
"extracted_at": "ISO timestamp"
}
Plain text with minimal formatting, suitable for further processing.
| Error | Detection | Action | |-------|-----------|--------| | Timeout | Page doesn't load in 30s | Report error, suggest retry | | 404 Not Found | "404" in title/content | Report "Page not found" | | 403 Forbidden | "403", "Access Denied" | Report access restriction | | CAPTCHA | "captcha", "verify you're human" | Report CAPTCHA detected, cannot proceed | | Paywall | "subscribe", "premium content" | Extract visible content, note paywall |
If page load fails:
1. Report the specific error to user
2. Suggest: "Try again?" or "Different URL?"
3. Close browser cleanly
If content is blocked:
1. Report what was detected (CAPTCHA/paywall/geo-block)
2. Extract any available preview content
3. Suggest alternatives if applicable
1. Navigate to URL
2. Wait longer (3-5 seconds) for JS hydration
3. Use browser_wait_for with specific text if known
4. Then snapshot
1. Navigate
2. Execute scroll loop (see Phase 1)
3. Snapshot after scrolling completes
1. Snapshot first to identify clickable elements
2. Use browser_click on "Read more" / "Show all" buttons
3. Wait briefly
4. Snapshot again for full content
1. Scrape first page
2. Identify "Next" or pagination links
3. Ask user: "Article has X pages. Scrape all?"
4. If yes, iterate through pages and combine
| Metric | Target | How | |--------|--------|-----| | Speed | < 15 seconds | Minimal waits, parallel where possible | | Token Usage | < 5000 tokens | Smart extraction, not full DOM | | Reliability | > 95% success | Proper error handling |
Minimum viable scrape (4 tool calls):
1. browser_navigate → 2. browser_wait_for → 3. browser_snapshot → 4. browser_close
Full-featured scrape (with scroll + screenshot):
1. browser_navigate
2. browser_wait_for
3. browser_evaluate (scroll)
4. browser_snapshot
5. browser_take_screenshot
6. browser_close
Remember: The goal is to deliver clean, useful content to the user, not raw HTML/DOM dumps.
tools
Comprehensive PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging/splitting documents, and handling forms. When Claude needs to fill in a PDF form or programmatically process, generate, or analyze PDF documents at scale.
tools
Use this skill when user asks to "commit changes", "create a commit", "stage and commit", or wants help with git commit workflow.
development
Conduct comprehensive deep research on any topic using Dify-powered workflow - searches documentation, academic papers, tutorials, APIs, best practices, and returns structured analysis with insights.
development
Use this skill when user asks to "review code", "check for issues", "analyze code quality", "find bugs", or wants feedback on code implementation.