resources/skills/web-scraper/SKILL.md
Scrape web pages, search the internet, and extract structured content using Python. Use when the user wants to fetch a webpage, search for information online, extract links, or crawl JavaScript-rendered dynamic pages.
npx skillsauth add aidotnet/opencowork web-scraper
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Fetch, search, and extract content from websites.
| Script | Purpose | Dependencies |
| ------------------ | ---------------------------------------------------- | ------------------------------------------------------------- |
| fetch_page.py | Fetch a URL and extract readable content as Markdown | requests, beautifulsoup4, readability-lxml, html2text |
| search_web.py | Search the web via DuckDuckGo | ddgs |
| crawl_dynamic.py | Crawl JS-rendered pages with a headless browser | crawl4ai |
| extract_links.py | Extract and categorize all links from a page | requests, beautifulsoup4 |
For lightweight scraping (static pages, search, link extraction):
pip install requests beautifulsoup4 readability-lxml html2text ddgs
For dynamic / JavaScript-rendered pages (heavier, installs Playwright + Chromium):
pip install crawl4ai
crawl4ai-setup
Note: `crawl4ai-setup` downloads a Chromium browser (~150 MB). Only install it if you actually need dynamic page support.
CRITICAL (Dependency Error Recovery): If ANY script below fails with an `ImportError` or "module not found" error, install the missing dependencies using the command above, then re-run the EXACT SAME script command that failed. Do NOT write inline Python code (`python -c "..."`) or your own ad-hoc scripts as a substitute. These scripts handle encoding, error handling, and output formatting that inline code will miss.
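The recovery rule above can be sketched as a small wrapper that re-runs the identical command after installing dependencies. This is only an illustration under assumptions: `run_with_recovery`, `looks_like_missing_dependency`, and `INSTALL_CMD` are hypothetical names that are not part of the skill, the install step uses `python -m pip` in place of the bare `pip` shown above, and only the lightweight dependency set is covered (not `crawl4ai`).

```python
import subprocess
import sys

# The lightweight install command from above, as an argv list.
# Assumption: "python -m pip" stands in for the bare "pip" shown in the text.
INSTALL_CMD = [
    sys.executable, "-m", "pip", "install",
    "requests", "beautifulsoup4", "readability-lxml", "html2text", "ddgs",
]


def looks_like_missing_dependency(stderr):
    """Return True if stderr text suggests a failed import."""
    markers = ("ImportError", "ModuleNotFoundError", "No module named")
    return any(marker in stderr for marker in markers)


def run_with_recovery(command, run=subprocess.run):
    """Run command; on a dependency error, install, then re-run the SAME command once."""
    result = run(command, capture_output=True, text=True)
    if result.returncode != 0 and looks_like_missing_dependency(result.stderr):
        run(INSTALL_CMD, capture_output=True, text=True)
        # Re-run the identical argv list, never a rewritten substitute.
        result = run(command, capture_output=True, text=True)
    return result
```

The `run` parameter exists only so the behaviour can be exercised without touching the network; in normal use it defaults to `subprocess.run`.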
Use fetch_page.py for most websites. It is fast and lightweight, and it works for articles, docs, blogs, and similar static pages.
python scripts/fetch_page.py "URL"
Options:
- `--raw`: Output full-page Markdown instead of extracted article content
- `--selector "CSS_SELECTOR"`: Extract only elements matching the CSS selector (e.g. ".article-body", "table", "#content")
- `--save OUTPUT_PATH`: Also save output to a file
- `--max-length N`: Truncate output to N characters (default: no limit)

Examples:
# Fetch an article
python scripts/fetch_page.py "https://example.com/article"
# Extract only tables
python scripts/fetch_page.py "https://example.com/data" --selector "table"
# Fetch raw full-page Markdown, limited to 5000 characters
python scripts/fetch_page.py "https://example.com" --raw --max-length 5000
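When fetch_page.py is driven from Python rather than typed by hand, the options above map naturally onto an argument list. The helper below is a minimal sketch: `build_fetch_command` is a hypothetical name, it only assembles the documented flags, and it assumes the command is run from the skill's root directory so that `scripts/fetch_page.py` resolves.

```python
import sys


def build_fetch_command(url, raw=False, selector=None, save=None, max_length=None):
    """Assemble an argv list for scripts/fetch_page.py from the documented options."""
    cmd = [sys.executable, "scripts/fetch_page.py", url]
    if raw:
        cmd.append("--raw")
    if selector is not None:
        cmd += ["--selector", selector]
    if save is not None:
        cmd += ["--save", str(save)]
    if max_length is not None:
        cmd += ["--max-length", str(max_length)]
    return cmd
```

Passing the resulting list to `subprocess.run(cmd, capture_output=True, text=True)` avoids shell quoting problems with selectors such as ".article-body".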
Search using DuckDuckGo (no API key required).
python scripts/search_web.py "search query"
Options:
- `--max-results N`: Number of results to return (default: 10)
- `--region REGION`: Region code, e.g. cn-zh, us-en, jp-jp (default: wt-wt for worldwide)
- `--news`: Search news instead of the general web

Examples:
# General search
python scripts/search_web.py "Python web scraping best practices 2025"
# News search, Chinese region, 5 results
python scripts/search_web.py "AI 最新进展" --news --region cn-zh --max-results 5
Use crawl_dynamic.py only when fetch_page.py returns empty or incomplete content (SPAs, React/Vue apps, and other pages that load content via JavaScript).
python scripts/crawl_dynamic.py "URL"
Options:
- `--wait N`: Wait N seconds after page load for JS to finish (default: 3)
- `--selector "CSS_SELECTOR"`: Wait for a specific element to appear before extracting
- `--scroll`: Scroll to the bottom of the page to trigger lazy loading
- `--save OUTPUT_PATH`: Also save output to a file
- `--max-length N`: Truncate output to N characters

Extract all links with their text labels, categorized by type (internal, external, resource).
python scripts/extract_links.py "URL"
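To make the three link categories concrete, here is a rough sketch of how such a classification could work. It is not the code of extract_links.py: it uses the standard library's `html.parser` in place of requests and beautifulsoup4 so that it is self-contained, and both the `RESOURCE_EXTENSIONS` list and the rule "same host means internal" are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed definition of a "resource" link: a URL whose path ends in a file extension.
RESOURCE_EXTENSIONS = (".pdf", ".zip", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".css", ".js")


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def categorize_links(html, base_url):
    """Group links into internal, external, and resource lists of (url, text)."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    groups = {"internal": [], "external": [], "resource": []}
    for href, text in collector.links:
        url = urljoin(base_url, href)  # resolve relative links against the page URL
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if parsed.path.lower().endswith(RESOURCE_EXTENSIONS):
            kind = "resource"
        elif parsed.netloc == base_host:
            kind = "internal"
        else:
            kind = "external"
        groups[kind].append((url, text))
    return groups
```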
Options:
- `--filter PATTERN`: Only show links matching a regex pattern (applied to the URL)
- `--external-only`: Only show external links
- `--json`: Output as JSON instead of Markdown

Which script to use:

1. Start with fetch_page.py: it handles 90% of websites (articles, docs, blogs, wikis).
2. If fetch_page.py returns empty or garbled content, try crawl_dynamic.py (the page likely needs JavaScript).
3. Use search_web.py to discover relevant pages.
4. Use extract_links.py to map out links, then fetch individual pages.

Common command patterns:

- search_web.py "topic" → get relevant URLs
- fetch_page.py "best_url" → read the content
- fetch_page.py "url" --selector "table" → extract tables
- fetch_page.py "url" --selector ".product-card" → extract specific elements
- crawl_dynamic.py "url" --wait 5 --scroll → full JS-rendered content

Use --max-length to truncate output and avoid overwhelming the context window.
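The fallback from fetch_page.py to crawl_dynamic.py described above could be automated roughly as follows. This is a sketch under assumptions: `fetch_with_fallback` and `run_script` are hypothetical helpers, and the 200-character threshold for "empty or incomplete" is an arbitrary choice rather than anything the skill defines. Garbled output is not detected here, only output that is too short.

```python
import subprocess
import sys

MIN_USEFUL_CHARS = 200  # assumed cutoff for "empty or incomplete" output


def run_script(script, url, *extra):
    """Run one of the skill's scripts and return its stdout.

    check=True makes failures (including missing dependencies) raise, so they
    are handled by the recovery rule instead of silently triggering a fallback.
    """
    result = subprocess.run(
        [sys.executable, f"scripts/{script}", url, *extra],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def fetch_with_fallback(url, run=run_script, min_chars=MIN_USEFUL_CHARS):
    """Try fetch_page.py first; use crawl_dynamic.py if too little content comes back."""
    content = run("fetch_page.py", url)
    if len(content.strip()) >= min_chars:
        return "fetch_page.py", content
    # Too little content: the page probably needs JavaScript rendering.
    return "crawl_dynamic.py", run("crawl_dynamic.py", url, "--wait", "5", "--scroll")
```

The function returns the name of the script that produced the content alongside the content itself, which makes it easy to log how often the heavier browser-based path is actually needed.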