skills/scraperapi-crawler/SKILL.md
Product-usage reference for ScraperAPI's Crawler — crawl an entire site or section by following links automatically. Consult when the user needs to extract data from many pages of a site without knowing the URLs upfront. Use when user asks: "crawl an entire website with ScraperAPI", "scrape all pages on a domain", "follow links and scrape each page", "how do I use the ScraperAPI crawler API", "scrape a site map", "extract data from every product page on a site", "ScraperAPI crawler job API". Covers job creation, URL regex patterns, depth vs budget, per-page scraping parameters, status polling, webhooks, scheduling, and credit costs. Also invoke when the user is building a site-wide scraper and asks which ScraperAPI product to use.
npx skillsauth add scraperapi/scraperapi-skills scraperapi-crawlerInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
The Crawler discovers and scrapes linked pages automatically, starting from a seed URL and following links that match your regex pattern. Use it when you need data from many pages of a site but don't have the URLs upfront.
api.scraperapi.com).max_depth: 1.| Action | Method | URL |
|--------|--------|-----|
| Start a crawl | POST | https://crawler.scraperapi.com/job |
| Check status | GET | https://crawler.scraperapi.com/job/<jobId> |
| Cancel a crawl | DELETE | https://crawler.scraperapi.com/job/<jobId> |
Auth: include api_key in the JSON body (POST) or as a query parameter (GET/DELETE).
import os, requests
API_KEY = os.environ["SCRAPERAPI_API_KEY"]
job = requests.post(
"https://crawler.scraperapi.com/job",
json={
"api_key": API_KEY,
"start_url": "https://example.com/blog/",
"url_regexp_include": "(?<full_url>https?://example\\.com/blog/.*)",
"url_regexp_exclude": ".*\\.(pdf|png|jpg|zip)",
"max_depth": 3,
"crawl_budget": 500,
"api_params": {
"country_code": "us",
"render": False,
},
"callback": {
"type": "webhook",
"url": "https://yourapp.com/crawler-results"
},
}
).json()
job_id = job["jobId"]
print(f"Started: {job_id}")
| Parameter | Description |
|-----------|-------------|
| api_key | Your ScraperAPI key (read from env, never hardcode) |
| start_url | Seed URL — where the crawl begins |
| url_regexp_include | Regex defining which URLs to crawl (see patterns below) |
| max_depth or crawl_budget | Must provide at least one; providing both applies whichever limit is hit first |
url_regexp_include and url_regexp_exclude are standard regexes. Include patterns must use
named capture groups: (?<full_url>...) for absolute URLs, (?<relative_url>...) for relative.
# All pages on a domain
(?<full_url>https?://example\.com/.*)
# Only blog posts
(?<full_url>https?://example\.com/blog/.*)
# Only HTML pages — exclude files by requiring no extension
(?<full_url>https?://example\.com/[^.]*$)
# Relative paths (for sites that use relative links)
(?<relative_url>/products/.*)
Exclusion patterns (url_regexp_exclude — plain regex, no capture group needed):
# Skip binary files
.*\.(pdf|png|jpg|jpeg|gif|svg|zip|mp4)
# Skip auth and admin pages
.*(login|logout|signup|admin|auth).*
Start with a narrow url_regexp_include and validate on a small test run before scaling up — a
pattern like .* will follow external links and crawl the entire internet.
| Use max_depth when | Use crawl_budget when |
|---------------------|------------------------|
| Site structure is known and bounded | Site size is unknown |
| You want a specific section (e.g., 3 hops from landing page) | You want to cap credit spend |
| Small site | Large site with unpredictable link density |
Use both together for safe exploration: max_depth: 3, crawl_budget: 500 stops at whichever
limit is hit first.
Depth 0 = start URL only. Depth 1 = start URL + every page it links to. Depth 2 = one level deeper, and so on.
api_params)Controls how each discovered page is fetched. All standard ScraperAPI parameters are supported:
{
"render": true,
"premium": false,
"ultra_premium": false,
"country_code": "us",
"device_type": "desktop",
"output_format": "markdown"
}
Add render: true only if the site uses JavaScript to load content — it multiplies the credit
cost of every crawled page. Test without it first.
import time, requests
def wait_for_crawl(job_id, api_key, poll_interval=10):
url = f"https://crawler.scraperapi.com/job/{job_id}"
while True:
data = requests.get(url, params={"api_key": api_key}).json()
status = data.get("status")
print(f"Status: {status}")
if status in ("completed", "failed", "cancelled", "delivered"):
return data
time.sleep(poll_interval)
Job states: delayed → running → completed / failed / cancelled → in delivery → delivered.
Use webhooks instead of polling for long crawls — they push results as each page is scraped, rather than requiring you to poll for a final result.
When callback is set, ScraperAPI POSTs results to your endpoint as each page is scraped (not
just at the end). Each payload contains the crawled URL, HTML/content, credit cost for that
request, and the current depth.
A final summary payload is sent when the crawl completes, listing all succeeded/failed URLs and total cost.
{
"callback": {
"type": "webhook",
"url": "https://yourapp.com/crawler-results"
}
}
The webhook URL must be publicly reachable. ScraperAPI retries delivery up to 3 times; a failed webhook does not stop the crawl itself. Results are also retained for up to 30 days if you need to re-fetch them.
Omit schedule for a one-time crawl. Add it to run on a recurring interval:
{
"schedule": {
"name": "weekly-blog-crawl",
"interval": "daily"
}
}
Available intervals: once, hourly, daily, weekly, monthly. Scheduling requires a paid
plan — free accounts can only run one-time crawls at max_depth: 1.
Crawler credit cost = sum of all individual page requests in the crawl. Each page costs the
same as a direct Standard API call with the same api_params (render, premium, etc.).
Set crawl_budget to cap total credits. The crawl stops gracefully when the budget is reached —
it will not exceed it. Failed requests are not charged.
Check the sa-credit-cost header on each webhook payload to see per-page cost; use this to
calibrate crawl_budget for future runs.
| Code | Meaning | Action |
|------|---------|--------|
| 200 | Job created or status returned | Continue |
| 400 | Malformed request | Check required fields and regex syntax |
| 401 | Invalid API key | Check SCRAPERAPI_API_KEY |
| 403 | Credits exhausted or free plan depth limit | Upgrade or reduce max_depth/crawl_budget |
| 429 | Too many concurrent jobs | Wait and retry |
| 500 | Transient failure | Retry with backoff |
development
SERP landscape analysis for SEO strategy decisions. Use this skill when the user wants to understand what a search results page actually looks like for their target keywords — including AI Overview presence and attribution, SERP feature composition, how Google is interpreting query intent, which competitors dominate specific keyword sets, and where organic rankings actually translate to visible traffic. Trigger on requests like "analyze the SERP for [keyword]," "why isn't my content getting traffic even though it ranks," "what does Google show for [keyword]," "which keywords are worth targeting," "is [keyword] dominated by AI Overviews," "who owns the SERP for [topic]," "SERP analysis," "keyword landscape," or any request to understand what's happening on a search results page before making a content or SEO strategy decision.
tools
Run a comprehensive SEO audit using ScraperAPI's live SERP and scraping tools — no setup required. Use this skill whenever the user wants to: audit SEO for a website, understand why a page isn't ranking, check SEO health, analyze keyword rankings, compare against competitors in search results, find content gaps, review on-page signals (titles, meta, headings, schema), diagnose a traffic drop, check indexation, or get prioritized SEO recommendations. Also trigger when the user says things like "why am I not showing up on Google," "my traffic dropped," "how do I rank for X," "what's wrong with my SEO," "SEO check," or "SEO review." This skill works out of the box — it uses the ScraperAPI MCP tools already connected to this session, with no CLI or API key setup needed.
development
Build and implement web scrapers using ScraperAPI. Use this skill whenever the user asks to build, write, create, or implement a scraper, or wants runnable code that extracts data from a website. Trigger on: "build me a scraper for [website]", "write a scraper that fetches product pages from [ecommerce site]", "I need to scrape [data] from [website]", "create a script that extracts [fields] from [URL]", "help me scrape [website] — I need [fields]", "write code to scrape [website]", "make a script that scrapes [website]", "implement a scraper for [URL]". Guides architectural decisions (structured endpoint vs. raw HTML, JS rendering, proxy tier, sync vs. async batch), then generates a complete runnable Python or Node.js script with retry logic, error handling, pagination, and credit estimation.
development
Use this skill whenever the user wants to check, track, or be alerted about product prices on Amazon, Walmart, or via Google Shopping. Trigger on: "monitor the price of this Amazon product", "did the price drop on [Walmart URL]?", "track these ASINs", "compare today's prices to last week", "alert me if [product] goes below $X", "what's the current price of [product]?", "check my price watchlist", "scrape the price of [URL]", "is [product] cheaper anywhere else?". Accepts ASINs, Amazon/Walmart product URLs, or free-text product queries for Google Shopping. Reads an optional baseline JSON file to detect changes, fetches live prices via ScraperAPI's structured endpoints, and reports increases, decreases, restocks, and out-of-stock transitions in a structured change report. Use this skill even when the user does not say the word "monitor" — any one-shot or recurring price-check request belongs here.