abhishekj9621/web-scraper-skill/SKILL.md
Use this skill to scrape, crawl, or extract data from websites using Apify or Firecrawl APIs. Trigger whenever the user wants to: scrape a URL, crawl a website, extract structured data from web pages, run an Apify Actor, batch scrape multiple URLs, search and scrape the web, map a site's URLs, collect product/price/review data, or build any web data pipeline. If the user says things like "scrape this site", "get data from this URL", "crawl this website", "run an Apify actor", "use Firecrawl", "extract content from a page", "pull data from the web", or mentions any web data extraction task — always use this skill. Also use it when the user wants to choose between Apify and Firecrawl.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add openclaw/skills web-scraper-skill
This skill helps Openclaw scrape and extract data from websites using two APIs, Firecrawl and Apify. Use the table below to choose between them:

| Use Case | Recommended Tool |
|---|---|
| Scrape a single page into markdown/JSON | Firecrawl /scrape |
| Crawl an entire website (follow links) | Firecrawl /crawl |
| Map all URLs on a site | Firecrawl /map |
| Search web + scrape results | Firecrawl /search |
| Scrape Instagram / TikTok / Twitter | Apify (social actors) |
| Scrape Google Maps / reviews | Apify (compass/crawler-google-places) |
| Scrape Amazon products | Apify (apify/amazon-scraper) |
| Scrape Google Search results | Apify (apify/google-search-scraper) |
| Custom actor / any Apify Store actor | Apify |
Both APIs require API keys passed via headers. Always ask the user for their key if not provided.
Firecrawl: Authorization: Bearer fc-YOUR_API_KEY
Apify: Authorization: Bearer YOUR_APIFY_TOKEN (or ?token=YOUR_TOKEN in URL)
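
The header scheme above can be sketched in Python as follows. This is a minimal illustration, not part of either API: the helper name `bearer_headers` is hypothetical, and the key values are placeholders. It also encodes the rule of asking the user for a key when none is provided.

```python
def bearer_headers(key: str, service: str) -> dict[str, str]:
    """Build JSON request headers with a Bearer token for Firecrawl or Apify.

    Raises ValueError when no key is given, so the caller knows to ask the user.
    """
    if not key or not key.strip():
        raise ValueError(f"No {service} API key provided; ask the user for one.")
    return {
        "Authorization": f"Bearer {key.strip()}",
        "Content-Type": "application/json",
    }


if __name__ == "__main__":
    # Placeholder keys; real keys come from the user.
    print(bearer_headers("fc-YOUR_API_KEY", "Firecrawl"))
    print(bearer_headers("YOUR_APIFY_TOKEN", "Apify"))
```

Sending the Apify token in a header rather than as `?token=` keeps it out of URLs that may end up in logs.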
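Base URL: https://api.firecrawl.dev/v2 (the endpoint paths below already include the /v2 prefix, so the scrape endpoint is https://api.firecrawl.dev/v2/scrape)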
POST /v2/scrape
Authorization: Bearer fc-YOUR_API_KEY
Content-Type: application/json
{
  "url": "https://example.com",
  "formats": ["markdown"],   // Options: markdown, html, rawHtml, links, screenshot, json
  "onlyMainContent": true,   // Strips nav/footer/ads
  "waitFor": 0,              // ms to wait before scraping (for JS-heavy pages)
  "timeout": 30000,          // ms
  "blockAds": true,
  "proxy": "auto"            // "auto", "basic", or "stealth"
}
Response: { "success": true, "data": { "markdown": "...", "metadata": {...} } }
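
As a rough sketch, the scrape call above could be built with Python's standard library like this. The function names `build_scrape_request` and `extract_markdown` are hypothetical helpers, and the default payload is only one reasonable choice. The request is built but not sent, so the example runs without a real key.

```python
import json
import urllib.request

FIRECRAWL_SCRAPE_URL = "https://api.firecrawl.dev/v2/scrape"


def build_scrape_request(api_key: str, url: str, **options) -> urllib.request.Request:
    """Build (but do not send) a POST /v2/scrape request."""
    payload = {"url": url, "formats": ["markdown"], "onlyMainContent": True}
    payload.update(options)  # e.g. waitFor=2000 or proxy="stealth"
    return urllib.request.Request(
        FIRECRAWL_SCRAPE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_markdown(response_body: dict) -> str:
    """Return the markdown from a scrape response, or raise if success is false."""
    if not response_body.get("success"):
        raise RuntimeError(f"Firecrawl scrape failed: {response_body}")
    return response_body["data"]["markdown"]


if __name__ == "__main__":
    req = build_scrape_request("fc-YOUR_API_KEY", "https://example.com", waitFor=1000)
    print(req.get_method(), req.full_url)
    # Sending needs a real key:
    # with urllib.request.urlopen(req, timeout=60) as resp:
    #     print(extract_markdown(json.load(resp))[:200])
```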
Crawling is asynchronous: the request starts a job, and you then poll for the results.
POST /v2/crawl
{
  "url": "https://docs.example.com",
  "limit": 50,               // Max pages
  "maxDepth": 3,
  "allowExternalLinks": false,
  "scrapeOptions": {
    "formats": ["markdown"],
    "onlyMainContent": true
  }
}
Response: { "success": true, "id": "crawl-job-id" }
Poll status:
GET /v2/crawl/{crawl-job-id}
Response: { "status": "completed", "total": 50, "data": [...] }
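
The start-then-poll pattern for crawls might look like the sketch below. It assumes the caller supplies `fetch_status`, a hypothetical function that performs `GET /v2/crawl/{crawl-job-id}` and returns the parsed JSON. Only the `completed` status appears in the text above; treating `failed` and `cancelled` as terminal is an assumption, as are the interval and timeout defaults.

```python
import time
from typing import Callable


def wait_for_crawl(
    fetch_status: Callable[[], dict],
    interval_s: float = 2.0,
    max_wait_s: float = 300.0,
) -> list:
    """Poll a crawl job until it reports 'completed', then return its data list."""
    deadline = time.monotonic() + max_wait_s
    while True:
        body = fetch_status()
        status = body.get("status")
        if status == "completed":
            return body.get("data", [])
        # Assumed terminal failure statuses; not listed in the text above.
        if status in ("failed", "cancelled"):
            raise RuntimeError(f"Crawl ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError("Crawl did not finish within max_wait_s")
        time.sleep(interval_s)


if __name__ == "__main__":
    # Simulated responses stand in for real HTTP calls.
    fake = iter([
        {"status": "scraping"},
        {"status": "completed", "total": 1, "data": [{"markdown": "# Page"}]},
    ])
    print(wait_for_crawl(lambda: next(fake), interval_s=0))
```

Injecting the fetch function keeps the polling logic independent of any HTTP client and easy to test offline.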
POST /v2/map
{ "url": "https://example.com" }
Response: { "success": true, "links": [{ "url": "...", "title": "..." }] }
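
A common follow-up to mapping is to filter the returned links before scraping them. The sketch below shows one way to do that, assuming the response shape shown above. The helper name `urls_from_map` and the substring filter are illustrative choices only.

```python
def urls_from_map(response_body: dict, must_contain: str = "") -> list[str]:
    """Return the URLs from a /v2/map response, optionally filtered by substring."""
    if not response_body.get("success"):
        raise RuntimeError(f"Firecrawl map failed: {response_body}")
    urls = []
    for link in response_body.get("links", []):
        # Links are objects with a "url" key in the response above;
        # plain strings are also accepted here as a defensive assumption.
        url = link["url"] if isinstance(link, dict) else link
        if must_contain in url:
            urls.append(url)
    return urls


if __name__ == "__main__":
    sample = {
        "success": True,
        "links": [
            {"url": "https://example.com/docs/intro", "title": "Intro"},
            {"url": "https://example.com/blog/post", "title": "Post"},
        ],
    }
    print(urls_from_map(sample, "/docs/"))
```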
POST /v2/search
{
  "query": "best web scraping tools 2025",
  "limit": 5,
  "scrapeOptions": { "formats": ["markdown"] }
}
Response: { "data": [{ "url": "...", "title": "...", "markdown": "..." }] }
POST /v2/batch/scrape
{
  "urls": ["https://a.com", "https://b.com"],
  "formats": ["markdown"]
}
Returns a job ID; poll with GET /v2/batch/scrape/{id}
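Base URL: https://api.apify.com/v2 (the endpoint paths below already include the /v2 prefix, so the run endpoint is https://api.apify.com/v2/acts/{actorId}/runs)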
Auth: Pass token as query param ?token=YOUR_TOKEN or in Authorization header.
Apify runs "Actors" (pre-built scrapers). The flow is:
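1. Start an Actor run, which returns a runId and defaultDatasetId
2. Poll the run status until it is SUCCEEDED
3. Fetch the results from the dataset

POST /v2/acts/{actorId}/runs?token=YOUR_TOKEN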
Content-Type: application/json
{ ...actor-specific input... }
Response:
{
  "data": {
    "id": "RUN_ID",
    "status": "RUNNING",
    "defaultDatasetId": "DATASET_ID"
  }
}
Common Actor IDs:
- apify/web-scraper: generic JS scraper
- apify/google-search-scraper: Google SERPs
- compass/crawler-google-places: Google Maps
- apify/instagram-scraper: Instagram
- clockworks/free-tiktok-scraper: TikTok
- apify/amazon-scraper: Amazon products

In API URL paths, write the slash in an Actor ID as a tilde, for example apify~web-scraper.

GET /v2/acts/{actorId}/runs/{runId}?token=YOUR_TOKEN
Poll until status is SUCCEEDED or FAILED. Recommended interval: 5 seconds.
GET /v2/datasets/{datasetId}/items?token=YOUR_TOKEN&format=json
Optional params: format (json/csv/xlsx/xml), limit, offset
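
The three Apify steps (start a run, poll it, fetch the dataset) can be sketched as URL builders plus a polling loop. All function names here are hypothetical. `fetch_run` is a caller-supplied function that performs the status GET and returns the parsed JSON. Treating `ABORTED` and `TIMED-OUT` as terminal goes beyond the two statuses named above and is an assumption, as is the `max_wait_s` default.

```python
import time
from typing import Callable, Optional
from urllib.parse import urlencode

APIFY_BASE = "https://api.apify.com/v2"
# SUCCEEDED and FAILED come from the text above; the other two are assumed.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


def actor_path_id(actor_id: str) -> str:
    """Convert 'user/actor' to the 'user~actor' form used in URL paths."""
    return actor_id.replace("/", "~")


def run_url(actor_id: str, token: str) -> str:
    """URL for POST: start an Actor run."""
    query = urlencode({"token": token})
    return f"{APIFY_BASE}/acts/{actor_path_id(actor_id)}/runs?{query}"


def status_url(actor_id: str, run_id: str, token: str) -> str:
    """URL for GET: read the status of one run."""
    query = urlencode({"token": token})
    return f"{APIFY_BASE}/acts/{actor_path_id(actor_id)}/runs/{run_id}?{query}"


def dataset_items_url(
    dataset_id: str,
    token: str,
    fmt: str = "json",
    limit: Optional[int] = None,
    offset: Optional[int] = None,
) -> str:
    """URL for GET: fetch dataset items, with optional paging."""
    params: dict[str, object] = {"token": token, "format": fmt}
    if limit is not None:
        params["limit"] = limit
    if offset is not None:
        params["offset"] = offset
    return f"{APIFY_BASE}/datasets/{dataset_id}/items?{urlencode(params)}"


def wait_for_run(
    fetch_run: Callable[[], dict],
    interval_s: float = 5.0,
    max_wait_s: float = 600.0,
) -> dict:
    """Poll until the run reaches a terminal status and return its data object."""
    deadline = time.monotonic() + max_wait_s
    while True:
        run = fetch_run()["data"]
        if run["status"] in TERMINAL_STATUSES:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError("Actor run did not finish within max_wait_s")
        time.sleep(interval_s)


if __name__ == "__main__":
    print(run_url("apify/web-scraper", "YOUR_TOKEN"))
    print(dataset_items_url("DATASET_ID", "YOUR_TOKEN", limit=100))
```

The caller still needs to check that the returned status is `SUCCEEDED` before reading the dataset.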
For short runs, use the sync endpoint — it waits and returns dataset items directly:
POST /v2/acts/{actorId}/run-sync-get-dataset-items?token=YOUR_TOKEN
Content-Type: application/json
{ ...actor input... }
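
A minimal sketch of building the synchronous request with the standard library follows. The helper name `build_sync_run_request` is hypothetical. It sends the token in the Authorization header instead of the query string, which the auth note above allows. The request is built but not sent.

```python
import json
import urllib.request


def build_sync_run_request(actor_id: str, token: str, actor_input: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to run-sync-get-dataset-items."""
    path_id = actor_id.replace("/", "~")  # 'user/actor' becomes 'user~actor' in URLs
    url = f"https://api.apify.com/v2/acts/{path_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_sync_run_request(
        "apify/google-search-scraper",
        "YOUR_APIFY_TOKEN",
        {"queries": "web scraping tools", "maxPagesPerQuery": 1},
    )
    print(req.get_method(), req.full_url)
    # Sending needs a real token; the response body is the list of dataset items:
    # with urllib.request.urlopen(req, timeout=310) as resp:
    #     items = json.load(resp)
```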
Google Search Scraper:
{ "queries": "web scraping tools", "maxPagesPerQuery": 1, "resultsPerPage": 10 }
Google Maps Scraper:
{ "searchStringsArray": ["restaurants in Mumbai"], "maxCrawledPlaces": 20 }
Web Scraper (generic):
{
  "startUrls": [{ "url": "https://example.com" }],
  "pageFunction": "async function pageFunction(context) { const $ = context.jQuery; return { title: $('title').text() }; }",
  "maxPagesPerCrawl": 10
}
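
Because `pageFunction` is JavaScript passed as a JSON string, it is easier to maintain as a multi-line string and let the JSON encoder handle escaping. The sketch below builds the same Web Scraper input as above. The helper name `web_scraper_input` is hypothetical.

```python
import json
import textwrap

# The same page function as above, kept readable as a multi-line string.
PAGE_FUNCTION = textwrap.dedent("""\
    async function pageFunction(context) {
        const $ = context.jQuery;
        return { title: $('title').text() };
    }
""")


def web_scraper_input(start_urls: list[str], max_pages: int = 10) -> dict:
    """Build the input object for the generic Web Scraper Actor."""
    return {
        "startUrls": [{"url": u} for u in start_urls],
        "pageFunction": PAGE_FUNCTION,
        "maxPagesPerCrawl": max_pages,
    }


if __name__ == "__main__":
    # json.dumps escapes quotes and newlines inside the page function.
    print(json.dumps(web_scraper_input(["https://example.com"]), indent=2))
```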
Fetch the results of a finished run with GET /v2/datasets/{id}/items.

See references/code-templates.md for ready-to-run Python and JavaScript code for both APIs.
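- If an Apify run fails, check its log: GET /v2/acts/{id}/runs/{runId}/log
- Check for success: false in Firecrawl responses
- robots.txt is respected by default
- For JS-heavy pages, increase waitFor (Firecrawl) or use Playwright/Puppeteer actors (Apify)
- Test with a small limit before scaling
- Use onlyMainContent: true in Firecrawl to remove nav/footer noise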
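
For JS-heavy pages, one option is to retry a Firecrawl scrape with progressively larger waitFor values and to fail fast when a response has success: false. The sketch below illustrates that idea. The function name, the wait schedule, and the `min_chars` threshold are all illustrative assumptions. `scrape` is a caller-supplied function that calls POST /v2/scrape with the given waitFor and returns the parsed JSON.

```python
from typing import Callable


def scrape_with_escalating_wait(
    scrape: Callable[[str, int], dict],
    url: str,
    waits_ms: tuple[int, ...] = (0, 2000, 5000),
    min_chars: int = 200,
) -> str:
    """Retry a scrape with longer waitFor values until the page has content."""
    markdown = ""
    for wait_ms in waits_ms:
        body = scrape(url, wait_ms)
        if not body.get("success"):
            raise RuntimeError(f"Firecrawl returned success: false for {url}")
        markdown = body.get("data", {}).get("markdown", "")
        if len(markdown) >= min_chars:
            break  # enough content rendered; stop escalating
    return markdown


if __name__ == "__main__":
    # Simulated scraper: the page only has real content after a 2 s wait.
    def fake_scrape(url: str, wait_ms: int) -> dict:
        text = "content " * 50 if wait_ms >= 2000 else "Loading..."
        return {"success": True, "data": {"markdown": text}}

    print(len(scrape_with_escalating_wait(fake_scrape, "https://example.com")))
```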