skills/scraperapi-scraper-builder/SKILL.md
Build and implement web scrapers using ScraperAPI. Use this skill whenever the user asks to build, write, create, or implement a scraper, or wants runnable code that extracts data from a website. Trigger on: "build me a scraper for [website]", "write a scraper that fetches product pages from [ecommerce site]", "I need to scrape [data] from [website]", "create a script that extracts [fields] from [URL]", "help me scrape [website] — I need [fields]", "write code to scrape [website]", "make a script that scrapes [website]", "implement a scraper for [URL]". Guides architectural decisions (structured endpoint vs. raw HTML, JS rendering, proxy tier, sync vs. async batch), then generates a complete runnable Python or Node.js script with retry logic, error handling, pagination, and credit estimation.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf): `npx skillsauth add scraperapi/scraperapi-skills scraperapi-scraper-builder`
Build production-quality scrapers using ScraperAPI. Work through four phases: gather requirements, probe the target site, make architecture decisions, then generate a complete runnable script.
Before writing code, collect the following. Pull from the user's prompt; ask only what is missing.
| Info | Default if not specified |
|------|--------------------------|
| Target URL or website | Required; ask if missing |
| Data fields to extract | Ask if vague ("product info" → which fields exactly?) |
| Volume | Infer: single URL / paginated / bulk list of URLs |
| Language | Ask if not clear from context; Python is a reasonable default |
| Output format | stdout JSON |
| Geo-targeting needed? | Infer from site type; confirm for e-commerce pricing |
Before making any architecture decisions, fetch 1–2 sample pages from the target site using ScraperAPI to observe actual behavior. This replaces guesswork with evidence and costs at most 2 credits.
Fetch at most two pages:
Always start with a standard request (no render, no premium) — the cheapest probe:
GET https://api.scraperapi.com/?api_key=<SCRAPERAPI_API_KEY>&url=<target_url>
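The probe request above can be assembled as in the following minimal sketch. The helper name `build_probe_url` is hypothetical; only the endpoint and the `api_key` and `url` parameters come from the request line above.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://api.scraperapi.com/"


def build_probe_url(api_key: str, target_url: str) -> str:
    """Return the standard probe URL (no render, no premium)."""
    # urlencode percent-encodes the target URL so that its own query
    # string is not mistaken for ScraperAPI parameters.
    query = urlencode({"api_key": api_key, "url": target_url})
    return API_ENDPOINT + "?" + query


if __name__ == "__main__":
    print(build_probe_url("YOUR_KEY", "https://example.com/list?page=1"))
```

Encoding the target URL matters most when it carries its own query string, as paginated listing pages usually do.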
1. Response status
| Status | Meaning |
|--------|---------|
| 200 | Proceed to content analysis |
| 403 + body has "Just a moment", cf-ray, or "Cloudflare" | Cloudflare detected |
| 403 + body has "DataDome", "PerimeterX", or "Akamai" | Bot manager detected |
| 403 (generic) | Anti-bot protection present; premium proxies likely needed |
| 429 | Rate limited; note for architecture phase |
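The status table above could be applied to a probe response roughly as follows. This is a sketch: the function name and the returned labels are illustrative choices, and the marker strings are the ones listed in the table.

```python
CLOUDFLARE_MARKERS = ("just a moment", "cf-ray", "cloudflare")
BOT_MANAGER_MARKERS = ("datadome", "perimeterx", "akamai")


def classify_probe(status: int, body: str) -> str:
    """Map a probe response to one of the outcomes in the status table."""
    text = body.lower()  # markers are matched case-insensitively
    if status == 200:
        return "ok"
    if status == 403:
        if any(marker in text for marker in CLOUDFLARE_MARKERS):
            return "cloudflare"
        if any(marker in text for marker in BOT_MANAGER_MARKERS):
            return "bot-manager"
        return "anti-bot"  # generic 403: premium proxies likely needed
    if status == 429:
        return "rate-limited"
    return "other"  # not covered by the table; inspect manually
```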
2. Content completeness (for 200 responses)
Signals that render=true will be needed:
- `<div id="root"></div>`, `<div id="app"></div>`, `<div id="__next"></div>` with no children
- `<script>` tags: `_next`, `__nuxt__`, `react`, `vue`, `angular`
- `<script>` and `<div>` tags with little visible text

If the target data is visibly present in the raw HTML → standard rendering is sufficient, do not enable render.
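The empty-mount-point signal can be checked mechanically. The sketch below covers only that first signal, using a regular expression rather than a full HTML parser; `looks_like_js_shell` is a hypothetical helper name, and a match is a hint to investigate, not proof that rendering is required.

```python
import re

# Matches an empty SPA mount point such as <div id="root"></div>.
EMPTY_MOUNT = re.compile(
    r'<div\s+id=["\'](?:root|app|__next)["\'][^>]*>\s*</div>',
    re.IGNORECASE,
)


def looks_like_js_shell(html: str) -> bool:
    """Return True when the page contains an empty root/app/__next div."""
    return bool(EMPTY_MOUNT.search(html))
```

A stronger check is to search the raw HTML for a value you expect to scrape (a known product name, for instance): if it is present, rendering is not needed.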
3. Pagination structure
Look for the site's pagination scheme in <a> hrefs and link tags:
- `<a rel="next">` or `<link rel="next">` → follow-the-next-link pattern
- `?page=N`, `?p=N`, `/page/N/` in hrefs → page-number pagination
- `?start=N` or `?offset=N` → offset-based pagination
- `data-infinite-scroll` or "Load more" button → JS-driven infinite scroll (requires `render=true`)

Note the exact URL pattern; it feeds directly into the pagination loop in Phase 4.
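A rough sketch of how the pagination signals above might be detected in probe HTML. The function name and the returned labels are illustrative, and the regular expressions are deliberately simple, so treat the result as a starting point to confirm by eye.

```python
import re


def detect_pagination(html: str) -> str:
    """Return a label for the first pagination signal found in the HTML."""
    if re.search(r'<(?:a|link)\b[^>]*\brel=["\']next["\']', html, re.I):
        return "rel-next"
    if re.search(r'href=["\'][^"\']*(?:[?&](?:page|p)=\d+|/page/\d+)', html, re.I):
        return "page-number"
    if re.search(r'href=["\'][^"\']*[?&](?:start|offset)=\d+', html, re.I):
        return "offset"
    if re.search(r"data-infinite-scroll|load more", html, re.I):
        return "infinite-scroll"  # JS-driven; requires render=true
    return "unknown"
```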
4. Data structure (when target fields are visible)
- The repeating container element for each record (e.g., `<li class="product-item">`, `<div class="listing-card">`)
- `<script type="application/ld+json">` (JSON-LD) or `data-*` attributes; these are often cleaner to parse than raw HTML

5. Internal API endpoints (bonus)
Scan inline <script> blocks for fetch( or axios.get( calls pointing to internal paths (e.g., /api/products, /_next/data/). If found, scraping that JSON endpoint is usually simpler and more stable than parsing HTML.
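The scan described above might look like the sketch below. It assumes the endpoint is written as a quoted string literal starting with `/`; calls built from variables or template strings will be missed. `find_internal_api_paths` is a hypothetical name.

```python
import re

SCRIPT_BLOCK = re.compile(r"<script\b[^>]*>(.*?)</script>", re.S | re.I)
# fetch("/path") or axios.get('/path') with a literal internal path.
API_CALL = re.compile(r'(?:fetch|axios\.get)\(\s*["\'](/[^"\']+)["\']')


def find_internal_api_paths(html: str) -> list[str]:
    """Collect unique internal paths called from inline <script> blocks."""
    paths: list[str] = []
    for script in SCRIPT_BLOCK.findall(html):
        for path in API_CALL.findall(script):
            if path not in paths:
                paths.append(path)  # keep first-seen order
    return paths
```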
Summarize findings as a table before proceeding. Every row maps to a specific decision in Phase 3.
| Signal | Observed |
|-------------------------|---------------------------------------|
| HTTP status (standard) | 200 / 403 / ... |
| Anti-bot protection | None / Cloudflare / DataDome / ... |
| JS rendering needed | Yes / No / Uncertain |
| Pagination pattern | ?page=N / rel=next / load-more / ... |
| Target data visible | Yes / No (skeleton HTML) |
| Data container selector | e.g. div.product-card |
| Structured data (LD+J) | Present / Not found |
| Internal API detected | Yes (<path>) / No |
Use the Site Profile from Phase 2 as primary evidence for each decision. Work through all six decisions in order, citing specific observations from the profile. Summarize all decisions as a table (see end of this section) before writing any code.
Check whether the target site is one of the supported verticals. If it is, use the structured endpoint: structured endpoints return clean JSON with no HTML parsing, and they handle rendering and anti-bot automatically. Skip Decisions 2 and 3.
Supported verticals (verify against docs):
- product, search, offers, reviews
- search, news, jobs, shopping, maps
- product, search, category
- product, search
- listing, search
- jobs, job
- profile, jobs (availability may vary)

Endpoint pattern: `GET https://api.scraperapi.com/structured/<site>/<type>?api_key=...&<required_param>=...`
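The endpoint pattern can be filled in with a small helper, sketched below. The values `somesite`, `product`, and `item_id` in the usage line are placeholders, not real vertical or parameter names; check the ScraperAPI docs for the actual site, type, and required parameter.

```python
from urllib.parse import urlencode

STRUCTURED_BASE = "https://api.scraperapi.com/structured"


def structured_url(api_key: str, site: str, kind: str, **required) -> str:
    """Build a structured-endpoint URL from the pattern above."""
    query = urlencode({"api_key": api_key, **required})
    return f"{STRUCTURED_BASE}/{site}/{kind}?{query}"


if __name__ == "__main__":
    # Placeholder site, type, and parameter; replace after checking the docs.
    print(structured_url("YOUR_KEY", "somesite", "product", item_id="123"))
```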
Enable render=true only when the Site Profile confirms that target data was absent in the standard response. Cost: ~10 credits/req.
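- Site Profile shows "JS rendering needed: Yes" → `render=true`
- Site Profile shows target data not visible (skeleton HTML) → `render=true`
- JS rendering uncertain → repeat the probe with `render=true` and compare

Do not enable `render=true` speculatively; the Phase 2 probe is the evidence.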
Use the Site Profile's "Anti-bot protection" row as the starting point. Escalate gradually — never start at the highest tier.
- Anti-bot protection detected, or the standard request returns 403 → `premium=true` (~10 credits)
- `premium=true` still returns 403 → escalate to `ultra_premium=true` (~30 credits)

`ultra_premium=true` and `premium=true` are mutually exclusive; never set both.
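One way to express the gradual escalation is a tier list walked in order, as in this sketch. The `fetch(url, params)` callable is an assumption: it stands in for whatever function performs the ScraperAPI request and returns `(status, body)`. Because each tier sets at most one flag, `premium` and `ultra_premium` can never be combined.

```python
TIERS = [
    {},                         # standard, ~1 credit
    {"premium": "true"},        # ~10 credits
    {"ultra_premium": "true"},  # ~30 credits
]


def fetch_with_escalation(fetch, url, start_tier=0):
    """Try each tier in order and stop at the first non-403 response.

    Returns (status, body, tier_index) for the last attempt made.
    """
    status, body = None, ""
    tier = start_tier
    while tier < len(TIERS):
        status, body = fetch(url, TIERS[tier])
        if status != 403:
            return status, body, tier
        tier += 1  # still blocked: move up one tier
    return status, body, len(TIERS) - 1
```

Passing `fetch` in as an argument keeps the escalation logic testable without network access.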
| Volume | Approach |
|--------|----------|
| 1–10 URLs | Sync loop, no special handling |
| 10–100 URLs | Sync with 1 req/sec rate limiting |
| >100 URLs | Async Jobs API (POST https://async.scraperapi.com/batchjobs) |
For async batches, the script submits all URLs, then polls until all jobs complete. See references/code-templates.md for the async pattern.
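The volume table can be reduced to a small planning function, sketched here. The table lists 10 in two rows, so this sketch treats exactly 10 URLs as the unthrottled case; that boundary choice and the name `plan_volume` are assumptions.

```python
def plan_volume(n_urls: int) -> tuple[str, float]:
    """Return (mode, delay_seconds_between_requests) for a URL count."""
    if n_urls <= 10:
        return ("sync", 0.0)   # plain loop, no special handling
    if n_urls <= 100:
        return ("sync", 1.0)   # sync loop limited to 1 request per second
    return ("async", 0.0)      # submit to the Async Jobs API and poll
```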
Add session_number=<int> when:
Pick any integer as the session ID. Requests with the same session_number are routed through the same proxy IP.
Add country_code (ISO-3166, e.g., us, gb, de) when:
Before generating code, print this table:
| Parameter | Value | Reason |
|----------------|------------|-----------------------------------|
| url | ... | target |
| render | true/false | (why) |
| premium | true/false | (why) |
| ultra_premium | true/false | (why) |
| country_code | us / – | (why) |
| session_number | N / – | (why) |
| volume mode | sync/async | (why) |
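The decisions in the table above map onto request parameters. The sketch below, with the hypothetical helper `build_params`, assembles them and rejects the one combination the guidance forbids; boolean flags are sent as the string `true`, matching the `render=true` form used throughout.

```python
def build_params(api_key, url, render=False, premium=False,
                 ultra_premium=False, country_code=None, session_number=None):
    """Turn the decision table into a ScraperAPI query-parameter dict."""
    if premium and ultra_premium:
        raise ValueError("premium and ultra_premium are mutually exclusive")
    params = {"api_key": api_key, "url": url}
    if render:
        params["render"] = "true"
    if premium:
        params["premium"] = "true"
    if ultra_premium:
        params["ultra_premium"] = "true"
    if country_code:
        params["country_code"] = country_code
    if session_number is not None:
        params["session_number"] = str(session_number)
    return params
```

Options left at their defaults are omitted entirely, so a standard request carries only `api_key` and `url`.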
Ask for language if not already known (default: Python). Generate a complete, runnable script — not pseudocode or snippets. See references/code-templates.md for the base templates to adapt.
Use the Site Profile from Phase 2 to pre-fill the parsing logic: if a data container selector was identified (e.g., div.product-card), write it directly into the extract() function rather than leaving a TODO. If JSON-LD was found, parse <script type="application/ld+json"> instead of navigating the HTML tree. If an internal API path was found, target that endpoint instead of the page HTML.
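When JSON-LD is present, extraction can be as small as the sketch below. It uses a regular expression and the standard `json` module instead of an HTML parser, which is an illustrative shortcut; `extract_json_ld` is a hypothetical helper name.

```python
import json
import re

LD_JSON = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.S | re.I,
)


def extract_json_ld(html: str) -> list:
    """Return every JSON-LD object embedded in the page."""
    items = []
    for raw in LD_JSON.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of aborting the run
        # A block may hold a single object or a list of objects.
        items.extend(data if isinstance(data, list) else [data])
    return items
```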
- argparse CLI with `--url`, `--pages`, `--output`, `--max-credits`, plus flags for all ScraperAPI params decided in Phase 3
- API key read from `os.environ["SCRAPERAPI_API_KEY"]`, never hardcoded
- `scrape()` function with exponential backoff retry (5 attempts)
- `extract()` function pre-filled with selectors from the Site Profile (or a TODO if none were identified)
- Results written to stdout or to the `--output` file

Node.js: same flags and behavior as above. Use `node-fetch@2` + `commander`. Match the user's existing module format (ESM vs. CommonJS) if known, otherwise default to CommonJS.
# Requirements: pip install requests (Python)
# Requirements: npm install node-fetch@2 commander (Node.js)
# Usage: SCRAPERAPI_API_KEY=your_key python scraper.py --url "https://example.com" --pages 5
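The `scrape()` retry behaviour could be sketched as follows. The injected `fetch(url)` callable, the one-second base delay, and the set of retryable statuses are assumptions for illustration; in a real script `fetch` would wrap the HTTP call to ScraperAPI and return `(status, body)`.

```python
import time

RETRYABLE = {429, 500, 503}


def scrape(url, fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) with exponential backoff on retryable statuses.

    Returns the last (status, body) pair, whether or not it succeeded.
    """
    status, body = None, ""
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status not in RETRYABLE:
            break  # success or a non-retryable error: hand back to caller
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, 8s
    return status, body
```

Injecting `sleep` lets a test record the delays instead of waiting for them.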
| Configuration | Credits per request |
|---------------|---------------------|
| Standard | 1 |
| render=true | ~10 |
| premium=true | ~10 |
| ultra_premium=true | ~30 |
| Structured endpoint | 1–10 (varies by vertical) |
Print a cost estimate before executing: # Estimated: 50 pages × 10 credits = ~500 credits
The generated script must include a --max-credits flag that aborts if the estimate exceeds the budget.
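A minimal sketch of the estimate and the `--max-credits` guard, using the approximate per-request costs from the table above. The function names are hypothetical, and the sketch handles one configuration at a time because the table does not state costs for combined options.

```python
CREDITS_PER_REQUEST = {
    "standard": 1,
    "render": 10,
    "premium": 10,
    "ultra_premium": 30,
}


def estimate_credits(pages: int, config: str) -> int:
    """Approximate total credits for `pages` requests in one configuration."""
    return pages * CREDITS_PER_REQUEST[config]


def check_budget(pages: int, config: str, max_credits=None) -> int:
    """Print the estimate and abort before any request if it is over budget."""
    estimate = estimate_credits(pages, config)
    per_request = CREDITS_PER_REQUEST[config]
    print(f"# Estimated: {pages} pages × {per_request} credits = ~{estimate} credits")
    if max_credits is not None and estimate > max_credits:
        raise SystemExit(
            f"Aborting: estimate {estimate} exceeds --max-credits {max_credits}"
        )
    return estimate
```

Calling `check_budget` before the first request ensures an over-budget run spends nothing.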
| Status | Action |
|--------|--------|
| 200 | Parse and return |
| 401 | Abort — invalid API key, do not retry |
| 403 | Retry with premium=true; if still blocked, try ultra_premium=true |
| 404 | Skip this URL — page does not exist |
| 429 | Exponential backoff and retry; switch to async for large batches |
| 500/503 | Exponential backoff, max 5 attempts |
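The table above can be encoded as a single dispatch function so the main loop stays readable. This is a sketch: the action labels are illustrative, and failing fast on statuses the table does not list is an assumption.

```python
def action_for(status: int) -> str:
    """Map an HTTP status to the action named in the error-handling table."""
    if status == 200:
        return "parse"
    if status == 401:
        return "abort"      # invalid API key: retrying cannot help
    if status == 403:
        return "escalate"   # premium first, then ultra_premium
    if status == 404:
        return "skip"       # page does not exist
    if status in (429, 500, 503):
        return "backoff"    # exponential backoff, max 5 attempts
    return "abort"          # assumption: unlisted statuses fail fast
```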
Before presenting the script:
- API key is read from the `$SCRAPERAPI_API_KEY` env var, not hardcoded
- `--max-credits` guard is present and enforced before the first request
- `ultra_premium` and `premium` are not both set