
aidotnet/web-scraper

Name: web-scraper
Author: aidotnet

resources/skills/web-scraper/SKILL.md

npx skillsauth add aidotnet/opencowork web-scraper


Web Scraper

Fetch, search, and extract content from websites.

When to use this skill

User asks to fetch or read a webpage / URL
User wants to search the internet for information
User needs to extract links, tables, or structured data from a website
User asks to crawl a JavaScript-rendered (dynamic) page
User wants web content converted to clean Markdown for analysis

Scripts overview

| Script | Purpose | Dependencies |
| --- | --- | --- |
| fetch_page.py | Fetch a URL and extract readable content as Markdown | requests, beautifulsoup4, readability-lxml, html2text |
| search_web.py | Search the web via DuckDuckGo | ddgs |
| crawl_dynamic.py | Crawl JS-rendered pages with a headless browser | crawl4ai |
| extract_links.py | Extract and categorize all links from a page | requests, beautifulsoup4 |

Steps

1. Install dependencies (first time only)

For lightweight scraping (static pages, search, link extraction):

pip install requests beautifulsoup4 readability-lxml html2text ddgs

For dynamic / JavaScript-rendered pages (heavier, installs Playwright + Chromium):

pip install crawl4ai
crawl4ai-setup

Note: crawl4ai-setup downloads a Chromium browser (~150 MB). Only install if you actually need dynamic page support.

CRITICAL — Dependency Error Recovery: If ANY script below fails with an ImportError or "module not found" error, install the missing dependencies using the command above, then re-run the EXACT SAME script command that failed. Do NOT write inline Python code (python -c "...") or your own ad-hoc scripts as a substitute. These scripts handle encoding, error handling, and output formatting that inline code will miss.
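The recovery rule above can be sketched as a small wrapper. This is only an illustration, not part of the skill: `PIP_NAMES`, `missing_package`, and `run_with_recovery` are hypothetical names, and the mapping assumes the usual import names for the packages listed in the install step (for example `bs4` for beautifulsoup4). Note that crawl4ai additionally needs `crawl4ai-setup`, which this sketch does not run.

```python
import re
import subprocess
import sys
from typing import List, Optional

# Import name -> pip package name, for the packages named in the install step.
# (Hypothetical helper table; the import names are assumptions.)
PIP_NAMES = {
    "requests": "requests",
    "bs4": "beautifulsoup4",
    "readability": "readability-lxml",
    "html2text": "html2text",
    "ddgs": "ddgs",
    "crawl4ai": "crawl4ai",
}

_MISSING = re.compile(r"No module named '([A-Za-z0-9_]+)")


def missing_package(stderr: str) -> Optional[str]:
    """Return the pip package behind a "No module named ..." error, if any."""
    match = _MISSING.search(stderr)
    if match is None:
        return None
    module = match.group(1)  # top-level module only, e.g. "readability"
    return PIP_NAMES.get(module, module)


def run_with_recovery(command: List[str]) -> subprocess.CompletedProcess:
    """Run a script; on a missing-module error, install it and re-run once."""
    result = subprocess.run(command, capture_output=True, text=True)
    package = missing_package(result.stderr)
    if result.returncode != 0 and package is not None:
        subprocess.run([sys.executable, "-m", "pip", "install", package], check=True)
        # Re-run the exact same command, unchanged.
        result = subprocess.run(command, capture_output=True, text=True)
    return result
```

The wrapper never replaces the script with inline code; it only installs the missing package and repeats the original command, which is what the rule asks for.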

2. Fetch a web page (static — recommended first choice)

Use this for most websites. It's fast, lightweight, and works for articles, docs, blogs, etc.

python scripts/fetch_page.py "URL"

Options:

--raw — Output full page Markdown instead of extracted article content
--selector "CSS_SELECTOR" — Extract only elements matching the CSS selector (e.g. ".article-body", "table", "#content")
--save OUTPUT_PATH — Also save output to a file
--max-length N — Truncate output to N characters (default: no limit)

Examples:

# Fetch an article
python scripts/fetch_page.py "https://example.com/article"

# Extract only tables
python scripts/fetch_page.py "https://example.com/data" --selector "table"

# Fetch raw full-page Markdown, limit to 5000 chars
python scripts/fetch_page.py "https://example.com" --raw --max-length 5000
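For readers who want to see how the options above fit together, here is a minimal argparse sketch that mirrors the documented flags. The real fetch_page.py is not shown on this page, so `build_parser` and `truncate` are hypothetical and the actual script may be structured differently.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented fetch_page.py options (a sketch)."""
    parser = argparse.ArgumentParser(prog="fetch_page.py")
    parser.add_argument("url", help="Page to fetch")
    parser.add_argument("--raw", action="store_true",
                        help="Output full page Markdown instead of the extracted article")
    parser.add_argument("--selector", metavar="CSS_SELECTOR", default=None,
                        help="Extract only elements matching this CSS selector")
    parser.add_argument("--save", metavar="OUTPUT_PATH", default=None,
                        help="Also save output to a file")
    parser.add_argument("--max-length", type=int, default=None,
                        help="Truncate output to N characters (default: no limit)")
    return parser


def truncate(text: str, max_length: Optional[int]) -> str:
    """Apply --max-length: None means no limit."""
    if max_length is None or len(text) <= max_length:
        return text
    return text[:max_length]


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```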

3. Search the web

Search using DuckDuckGo (no API key required).

python scripts/search_web.py "search query"

Options:

--max-results N — Number of results to return (default: 10)
--region REGION — Region code, e.g. cn-zh, us-en, jp-jp (default: wt-wt for worldwide)
--news — Search news instead of general web

Examples:

# General search
python scripts/search_web.py "Python web scraping best practices 2025"

# News search, Chinese region, 5 results (the query means "latest AI developments")
python scripts/search_web.py "AI 最新进展" --news --region cn-zh --max-results 5
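When the search script is driven from Python rather than typed by hand, passing the arguments as a list avoids shell-quoting problems with non-ASCII or multi-word queries. The helper below is a hypothetical sketch that uses only the flags documented above; `search_command` is not part of the skill.

```python
import subprocess
import sys
from typing import List


def search_command(query: str, max_results: int = 10,
                   region: str = "wt-wt", news: bool = False) -> List[str]:
    """Build the argv list for scripts/search_web.py from the documented flags."""
    command = [
        sys.executable, "scripts/search_web.py", query,
        "--max-results", str(max_results),
        "--region", region,
    ]
    if news:
        command.append("--news")
    return command


if __name__ == "__main__":
    # Equivalent to the second example above; run from the skill's root folder.
    cmd = search_command("AI 最新进展", max_results=5, region="cn-zh", news=True)
    print(subprocess.run(cmd, capture_output=True, text=True).stdout)
```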

4. Crawl a dynamic / JavaScript-rendered page

Use this only when fetch_page.py returns empty or incomplete content (SPA, React/Vue apps, pages that load content via JS).

python scripts/crawl_dynamic.py "URL"

Options:

--wait N — Wait N seconds after page load for JS to finish (default: 3)
--selector "CSS_SELECTOR" — Wait for a specific element to appear before extracting
--scroll — Scroll to bottom of page to trigger lazy loading
--save OUTPUT_PATH — Also save output to a file
--max-length N — Truncate output to N characters

5. Extract links from a page

Extract all links with their text labels, categorized by type (internal, external, resource).

python scripts/extract_links.py "URL"

Options:

--filter PATTERN — Only show links matching a regex pattern (applied to URL)
--external-only — Only show external links
--json — Output as JSON instead of Markdown
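The internal / external / resource split can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the script's implementation: extract_links.py uses requests and beautifulsoup4, and its exact definition of "resource" is not given here, so the extension list below is a guess.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed notion of a "resource" link: a URL whose path ends in a file extension.
RESOURCE_EXTENSIONS = (".pdf", ".zip", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".css", ".js")


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def categorize(html, base_url):
    """Group a page's http(s) links into internal, external, and resource."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    base_host = urlparse(base_url).netloc
    groups = {"internal": [], "external": [], "resource": []}
    for href, text in collector.links:
        url = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parsed.path.lower().endswith(RESOURCE_EXTENSIONS):
            kind = "resource"
        elif parsed.netloc == base_host:
            kind = "internal"
        else:
            kind = "external"
        groups[kind].append((url, text))
    return groups
```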

Decision guide: which script to use

Start with fetch_page.py: it handles most static websites (articles, docs, blogs, wikis).
If fetch_page.py returns empty/garbled content → try crawl_dynamic.py (the page likely needs JavaScript).
Need to find URLs first? → Use search_web.py to discover relevant pages.
Need to navigate a site structure? → Use extract_links.py to map out links, then fetch individual pages.
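The first two rules of the guide amount to a static-first fallback, sketched below. The helper names and the 200-character threshold are assumptions for illustration; the sketch only detects empty or very short output, not garbled content, which still needs a human or model judgment.

```python
import subprocess
import sys


def run_script(script, url):
    """Run one of the skill's scripts from the skill root and return its stdout."""
    result = subprocess.run(
        [sys.executable, f"scripts/{script}", url],
        capture_output=True, text=True,
    )
    return result.stdout


def looks_incomplete(markdown, min_chars=200):
    """Heuristic: treat empty or very short output as a sign the page needs JS."""
    return len(markdown.strip()) < min_chars


def fetch_with_fallback(url, run_static=None, run_dynamic=None, min_chars=200):
    """Try fetch_page.py first; fall back to crawl_dynamic.py if output is thin."""
    run_static = run_static or (lambda u: run_script("fetch_page.py", u))
    run_dynamic = run_dynamic or (lambda u: run_script("crawl_dynamic.py", u))
    content = run_static(url)
    if not looks_incomplete(content, min_chars):
        return "fetch_page.py", content
    return "crawl_dynamic.py", run_dynamic(url)
```

The runners are injectable so the decision logic can be exercised without network access or the heavier crawl4ai install.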

Common workflows

Research a topic

search_web.py "topic" → get relevant URLs
fetch_page.py "best_url" → read the content
Repeat for multiple sources, then synthesize

Scrape structured data from a page

fetch_page.py "url" --selector "table" → extract tables
Or fetch_page.py "url" --selector ".product-card" → extract specific elements

Crawl a modern web app (SPA)

crawl_dynamic.py "url" --wait 5 --scroll → full JS-rendered content

Edge cases

Paywalled sites: May return partial content or login pages. Inform the user.
Rate limiting / CAPTCHAs: If requests fail with 403/429, wait and retry or inform the user.
Very large pages: Use --max-length to truncate output and avoid overwhelming the context window.
Encoding issues: Scripts handle UTF-8 by default. Exotic encodings may need manual adjustment.
Robots.txt: These scripts do not check robots.txt. Use responsibly and respect website terms of service.
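Two of the edge cases above lend themselves to a short sketch: checking robots.txt before fetching (the scripts do not do this themselves) and waiting before retrying on 403/429. Both helpers are hypothetical and use only the standard library; the retry count and delays are arbitrary example values.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Check a URL against the text of a robots.txt file fetched beforehand."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def retry_on_rate_limit(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it stops returning 403/429 or retries run out.

    fetch is any callable returning (status_code, body). The delay doubles
    after each failed attempt (exponential backoff).
    """
    for attempt in range(retries + 1):
        status, body = fetch()
        if status not in (403, 429):
            return status, body
        if attempt < retries:
            sleep(base_delay * 2 ** attempt)
    return status, body
```

If the retries are exhausted the last 403/429 response is returned unchanged, at which point the guidance above applies: inform the user rather than keep retrying.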

Scripts

fetch_page.py — Fetch and extract readable content as Markdown
search_web.py — Search the web via DuckDuckGo
crawl_dynamic.py — Crawl JavaScript-rendered pages
extract_links.py — Extract and categorize page links


284 stars

development

Updated Mar 27, 2026

Install this skill globally with one command:

npx skillsauth add aidotnet/opencowork web-scraper

Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

4 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
| --- | --- | --- | --- |
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| VirusTotal | Multi-engine malware detection | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 4, 2026, 12:29 AM (63.7s, 1 file scanned)

SKILL.md

name: web-scraper
description: Scrape web pages, search the internet, and extract structured content using Python. Use when the user wants to fetch a webpage, search for information online, extract links, or crawl JavaScript-rendered dynamic pages.
compatibility: Requires Python 3. Lightweight mode needs requests, beautifulsoup4, readability-lxml, html2text. Dynamic mode needs crawl4ai. Search needs ddgs (the package formerly published as duckduckgo-search).


Related Skills

aidotnet/resources/skills/post-to-x (tools)
Post tweets to X.com (Twitter) using the system browser's login state.
448 · SKILL.md · Updated Mar 27, 2026

aidotnet/docx (development)
Comprehensive document creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction. When GLM needs to work with professional documents (.docx files) for: (1) Creating new documents, (2) Modifying or editing content, (3) Working with tracked changes, (4) Adding comments, or any other document tasks.
448 · SKILL.md · Updated Mar 27, 2026

aidotnet/xlsx (development)
Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When GLM needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc.) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modifying existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas.
284 · SKILL.md · Updated Mar 27, 2026

aidotnet/xiaohongshu-search (testing)
Search Xiaohongshu (Rednote) by keyword and extract note image URLs and titles with Playwright. Use when the user wants Xiaohongshu search-result scraping, image-link extraction, or title collection and export. Supports terminal JSON output and optional local text export.
284 · SKILL.md · Updated Mar 27, 2026


Install manually


# Clone the repo
git clone https://github.com/aidotnet/opencowork.git

# Copy into Claude Code skills folder (global)
cp -r opencowork/resources/skills/web-scraper ~/.claude/skills/


Repository: aidotnet/opencowork (284 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT