1kalin/afrexai-web-scraping-engine/SKILL.md
Complete web scraping methodology — legal compliance, architecture design, anti-detection, data pipelines, and production operations. Use when building scrapers, extracting web data, monitoring competitors, or automating data collection at scale.
npx skillsauth add openclaw/skills Web Scraping & Data Extraction EngineInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Score your scraping operation (2 points each):
| Signal | Healthy | Unhealthy | |--------|---------|-----------| | Legal compliance | robots.txt checked, ToS reviewed | Scraping blindly | | Architecture | Tool matches site complexity | Using Puppeteer for static HTML | | Anti-detection | Rotation, delays, fingerprint diversity | Single IP, no delays | | Data quality | Validation + dedup pipeline | Raw dumps, no cleaning | | Error handling | Retry logic, circuit breakers | Crashes on first 403 | | Monitoring | Success rates tracked, alerts set | No visibility | | Storage | Structured, deduplicated, versioned | Flat files, duplicates | | Scheduling | Appropriate frequency, off-peak | Hammering during business hours |
Score: /16 → 12+: Production-ready | 8-11: Needs work | <8: Stop and redesign
compliance_brief:
target_domain: ""
date_assessed: ""
robots_txt:
checked: false
target_paths_allowed: false
crawl_delay_specified: ""
ai_bot_rules: "" # Many sites now block AI crawlers specifically
terms_of_service:
reviewed: false
scraping_mentioned: false
scraping_prohibited: false
api_available: false
api_sufficient: false
data_classification:
type: "" # public-factual | public-personal | behind-auth | copyrighted
contains_pii: false
pii_types: [] # name, email, phone, address, photo
gdpr_applies: false # EU residents' data
ccpa_applies: false # California residents' data
legal_risk: "" # low | medium | high | do-not-scrape
decision: "" # proceed | use-api | request-permission | abandon
justification: ""
| Scenario | Risk Level | Key Case Law | |----------|-----------|--------------| | Public data, no login, robots.txt allows | LOW | hiQ v. LinkedIn (2022) | | Public data, robots.txt disallows | MEDIUM | Meta v. Bright Data (2024) | | Behind authentication | HIGH | Van Buren v. US (2021), CFAA | | Personal data without consent | HIGH | GDPR Art. 6, CCPA §1798.100 | | Republishing copyrighted content | HIGH | Copyright Act §106 | | Price/product comparison | LOW | eBay v. Bidder's Edge (fair use) | | Academic/research use | LOW-MEDIUM | Varies by jurisdiction | | Bypassing anti-bot measures | HIGH | CFAA "exceeds authorized access" |
Many sites now specifically block AI-related crawlers:
# Common AI bot blocks in robots.txt
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: Google-Extended
User-agent: CCBot
User-agent: anthropic-ai
User-agent: ClaudeBot
User-agent: Bytespider
User-agent: PerplexityBot
Rule: If collecting data for AI training, check for these specific blocks.
| Tool/Approach | Best For | Speed | JS Support | Complexity | Cost | |---------------|----------|-------|------------|------------|------| | HTTP client (requests/axios) | Static HTML, APIs | ⚡⚡⚡ | ❌ | Low | Free | | Beautiful Soup / Cheerio | Static HTML parsing | ⚡⚡⚡ | ❌ | Low | Free | | Scrapy | Large-scale structured crawling | ⚡⚡⚡ | Plugin | Medium | Free | | Playwright / Puppeteer | JS-rendered, SPAs, interactions | ⚡ | ✅ | Medium | Free | | Selenium | Legacy, browser automation | ⚡ | ✅ | High | Free | | Crawlee | Hybrid (HTTP + browser fallback) | ⚡⚡ | ✅ | Medium | Free | | Firecrawl / ScrapingBee | Managed, anti-bot bypass | ⚡⚡ | ✅ | Low | Paid | | Bright Data / Oxylabs | Enterprise, proxy + browser | ⚡⚡ | ✅ | Low | Paid |
Is the content in the initial HTML source?
├── YES → Is the site structure consistent?
│ ├── YES → Static scraper (requests + BeautifulSoup/Cheerio)
│ └── NO → Scrapy with custom parsers
└── NO → Does the page require user interaction?
├── YES → Playwright/Puppeteer with interaction scripts
└── NO → Playwright in non-interactive mode
└── At scale (>10K pages)? → Crawlee (hybrid mode)
└── Heavy anti-bot? → Managed service (Firecrawl/ScrapingBee)
scraping_project:
name: ""
objective: "" # What data, why, how often
targets:
- domain: ""
pages_estimated: 0
rendering: "static" | "javascript" | "spa"
anti_bot: "none" | "basic" | "cloudflare" | "advanced"
rate_limit: "" # requests per second safe limit
tool_selected: ""
justification: ""
data_schema:
fields: []
output_format: "" # json | csv | database
schedule:
frequency: "" # once | hourly | daily | weekly
preferred_time: "" # off-peak for target timezone
infrastructure:
proxy_needed: false
proxy_type: "" # residential | datacenter | mobile
storage: ""
monitoring: ""
# Python example — production request pattern
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
session = requests.Session()
# Retry strategy
retry = Retry(
total=3,
backoff_factor=1, # 1s, 2s, 4s
status_forcelist=[429, 500, 502, 503, 504],
respect_retry_after_header=True
)
session.mount("https://", HTTPAdapter(max_retries=retry))
# Realistic headers
session.headers.update({
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Cache-Control": "no-cache",
})
Rotate these to avoid fingerprinting:
| Header | Rotation Pool Size | Notes | |--------|-------------------|-------| | User-Agent | 20-50 real browser UAs | Match OS distribution | | Accept-Language | 5-10 locale combos | Match proxy geo | | Sec-Ch-Ua | Match User-Agent | Chrome/Edge/Brave | | Referer | Vary per request | Previous page or search engine |
| Site Type | Safe Delay | Aggressive (risky) | |-----------|-----------|-------------------| | Small business site | 5-10 seconds | 2-3 seconds | | Medium site | 2-5 seconds | 1-2 seconds | | Large platform (Amazon, etc.) | 3-5 seconds | 1 second | | API endpoint | Per API docs | Never exceed | | robots.txt crawl-delay | Respect exactly | Never below |
Rules:
Crawl-delay in robots.txtRetry-After headers — they mean it[data-product-id], [data-price] (most stable)#product-title, #price (stable but can change)[aria-label="Price"] (accessibility, fairly stable)article, main, nav (structural, stable).product-card (can change with redesigns)//div[3]/span[2] (FRAGILE — last resort)Structured data first — Check before writing CSS selectors:
# 1. Check JSON-LD (best source — structured, clean)
import json
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for script in soup.find_all('script', type='application/ld+json'):
data = json.loads(script.string)
# Often contains: Product, Article, Organization, etc.
# 2. Check Open Graph meta tags
og_title = soup.find('meta', property='og:title')
og_price = soup.find('meta', property='product:price:amount')
# 3. Check microdata
items = soup.find_all(itemtype=True)
# 4. Fall back to CSS selectors only if above are empty
Table extraction pattern:
import pandas as pd
# Quick table extraction
tables = pd.read_html(html) # Returns list of DataFrames
# For complex tables with merged cells
def extract_table(soup, selector):
table = soup.select_one(selector)
headers = [th.get_text(strip=True) for th in table.select('thead th')]
rows = []
for tr in table.select('tbody tr'):
cells = [td.get_text(strip=True) for td in tr.select('td')]
rows.append(dict(zip(headers, cells)))
return rows
Pagination handling:
# Pattern 1: Next button
while True:
# ... scrape current page ...
next_link = soup.select_one('a.next-page, [rel="next"], .pagination .next a')
if not next_link or not next_link.get('href'):
break
url = urljoin(base_url, next_link['href'])
# Pattern 2: API pagination (infinite scroll sites)
page = 1
while True:
resp = session.get(f"{api_url}?page={page}&limit=50")
data = resp.json()
if not data.get('results'):
break
# ... process results ...
page += 1
# Pattern 3: Cursor-based
cursor = None
while True:
params = {"limit": 50}
if cursor:
params["cursor"] = cursor
resp = session.get(api_url, params=params)
data = resp.json()
# ... process ...
cursor = data.get('next_cursor')
if not cursor:
break
# Playwright pattern for JS-rendered pages
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
context = browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent="Mozilla/5.0 ...",
)
page = context.new_page()
# Block unnecessary resources (speed + stealth)
page.route("**/*.{png,jpg,jpeg,gif,svg,woff,woff2}",
lambda route: route.abort())
page.goto(url, wait_until="networkidle")
# Wait for specific content (better than arbitrary sleep)
page.wait_for_selector('[data-product-id]', timeout=10000)
# Extract after JS rendering
content = page.content()
# ... parse with BeautifulSoup/Cheerio ...
browser.close()
| Signal | Detection Method | Mitigation | |--------|-----------------|------------| | IP reputation | IP blacklists, datacenter ranges | Residential proxies | | Request rate | Requests/min from same IP | Rate limiting + jitter | | TLS fingerprint | JA3/JA4 hash matching | Use real browser or curl-impersonate | | Browser fingerprint | Canvas, WebGL, fonts | Playwright with stealth plugin | | JavaScript challenges | Cloudflare Turnstile, hCaptcha | Managed browser services | | Cookie/session behavior | Missing cookies, no history | Full session management | | Navigation pattern | Direct URL hits, no referrer | Simulate natural browsing | | Mouse/keyboard events | No interaction telemetry | Event simulation (Playwright) | | Header consistency | Mismatched headers vs UA | Header sets that match |
proxy_strategy:
# Tier 1: Free/Datacenter (for non-protected sites)
basic:
type: "datacenter"
cost: "$1-5/GB"
success_rate: "60-80%"
use_for: "APIs, small sites, no anti-bot"
# Tier 2: Residential (for most protected sites)
standard:
type: "residential"
cost: "$5-15/GB"
success_rate: "90-95%"
use_for: "Cloudflare, major platforms"
rotation: "per-request or sticky 10min"
# Tier 3: Mobile/ISP (for maximum stealth)
premium:
type: "mobile"
cost: "$15-30/GB"
success_rate: "95-99%"
use_for: "Aggressive anti-bot, social media"
rules:
- Start with cheapest tier, escalate only on blocks
- Match proxy geo to target audience geo
- Rotate on 403/429, not every request
- Use sticky sessions for multi-page scrapes
- Monitor proxy health — remove slow/blocked IPs
# Essential stealth for Playwright
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(
headless=True,
args=[
'--disable-blink-features=AutomationControlled',
'--disable-features=IsolateOrigins,site-per-process',
]
)
context = browser.new_context(
viewport={"width": 1920, "height": 1080},
locale="en-US",
timezone_id="America/New_York",
geolocation={"latitude": 40.7128, "longitude": -74.0060},
permissions=["geolocation"],
)
# Remove automation indicators
page = context.new_page()
page.add_init_script("""
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3]});
""")
Cloudflare detected?
├── JS Challenge only → Playwright with stealth + residential proxy
├── Turnstile CAPTCHA → Managed service (ScrapingBee/Bright Data)
├── Under Attack Mode → Wait, try later, or managed service
└── WAF blocking → Different approach needed
├── Check for API endpoints (network tab)
├── Check for mobile app API
└── Consider if data is available elsewhere
# Validation pattern — validate BEFORE storing
from dataclasses import dataclass, field
from typing import Optional
import re
from datetime import datetime
@dataclass
class ScrapedProduct:
url: str
title: str
price: Optional[float]
currency: str = "USD"
scraped_at: str = field(default_factory=lambda: datetime.utcnow().isoformat())
def validate(self) -> list[str]:
errors = []
if not self.url.startswith('http'):
errors.append("Invalid URL")
if not self.title or len(self.title) < 3:
errors.append("Title too short or missing")
if self.price is not None and self.price < 0:
errors.append("Negative price")
if self.price is not None and self.price > 1_000_000:
errors.append("Price suspiciously high — verify")
if self.currency not in ("USD", "EUR", "GBP", "BTC"):
errors.append(f"Unknown currency: {self.currency}")
return errors
| Method | When to Use | Implementation | |--------|------------|----------------| | URL-based | Pages with unique URLs | Hash the canonical URL | | Content hash | Same URL, changing content | MD5/SHA256 of key fields | | Fuzzy matching | Near-duplicate detection | Jaccard similarity > 0.85 | | Composite key | Multi-field uniqueness | Hash(domain + product_id + variant) |
import hashlib
def dedup_key(item: dict, fields: list[str]) -> str:
"""Generate dedup key from selected fields."""
values = "|".join(str(item.get(f, "")) for f in fields)
return hashlib.sha256(values.encode()).hexdigest()
# Usage
seen = set()
for item in scraped_items:
key = dedup_key(item, ["url", "product_id"])
if key not in seen:
seen.add(key)
clean_items.append(item)
Raw HTML → Parse → Extract → Validate → Clean → Deduplicate → Store
↓
Quarantine (failed validation)
Common cleaning operations:
| Problem | Solution |
|---------|----------|
| HTML entities (&) | html.unescape() |
| Extra whitespace | " ".join(text.split()) |
| Unicode issues | unicodedata.normalize('NFKD', text) |
| Price in text ("$49.99") | Regex: r'[\$£€]?([\d,]+\.?\d*)' |
| Date formats vary | dateutil.parser.parse() with dayfirst flag |
| Relative URLs | urllib.parse.urljoin(base, relative) |
| Encoding issues | chardet.detect() then decode |
| Volume | Frequency | Query Needs | Recommendation | |--------|-----------|-------------|----------------| | <10K records | One-time | None | JSON/CSV files | | <10K records | Recurring | Simple lookups | SQLite | | 10K-1M records | Recurring | Complex queries | PostgreSQL | | 1M+ records | Continuous | Analytics | PostgreSQL + partitioning | | Append-only logs | Continuous | Time-series | ClickHouse / TimescaleDB |
import sqlite3
import json
from datetime import datetime
def init_db(path="scraper_data.db"):
conn = sqlite3.connect(path)
conn.execute("""
CREATE TABLE IF NOT EXISTS items (
id INTEGER PRIMARY KEY,
url TEXT UNIQUE,
data JSON NOT NULL,
scraped_at TEXT DEFAULT (datetime('now')),
updated_at TEXT,
checksum TEXT
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_url ON items(url)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_scraped ON items(scraped_at)")
return conn
def upsert(conn, url, data, checksum):
conn.execute("""
INSERT INTO items (url, data, checksum) VALUES (?, ?, ?)
ON CONFLICT(url) DO UPDATE SET
data = excluded.data,
updated_at = datetime('now'),
checksum = excluded.checksum
WHERE items.checksum != excluded.checksum
""", (url, json.dumps(data), checksum))
conn.commit()
# CSV export
import csv
def to_csv(items, path, fields):
with open(path, 'w', newline='') as f:
writer = csv.DictWriter(f, fieldnames=fields)
writer.writeheader()
writer.writerows(items)
# JSON Lines (best for large datasets — streaming)
def to_jsonl(items, path):
with open(path, 'w') as f:
for item in items:
f.write(json.dumps(item) + '\n')
# Incremental export (only new/changed since last export)
def export_since(conn, last_export_time):
cursor = conn.execute(
"SELECT data FROM items WHERE scraped_at > ? OR updated_at > ?",
(last_export_time, last_export_time)
)
return [json.loads(row[0]) for row in cursor]
| HTTP Code | Meaning | Action | |-----------|---------|--------| | 200 | Success | Process normally | | 301/302 | Redirect | Follow (max 5 hops) | | 403 | Forbidden/blocked | Rotate proxy, slow down | | 404 | Not found | Log, skip, mark URL dead | | 429 | Rate limited | Respect Retry-After, back off 2x | | 500-504 | Server error | Retry 3x with backoff | | Connection timeout | Network issue | Retry with different proxy | | SSL error | Certificate issue | Log, investigate, skip |
class CircuitBreaker:
def __init__(self, failure_threshold=5, reset_timeout=300):
self.failures = 0
self.threshold = failure_threshold
self.reset_timeout = reset_timeout
self.last_failure = 0
self.state = "closed" # closed | open | half-open
def record_failure(self):
self.failures += 1
self.last_failure = time.time()
if self.failures >= self.threshold:
self.state = "open"
# Alert: "Circuit open — too many failures"
def record_success(self):
self.failures = 0
self.state = "closed"
def can_proceed(self):
if self.state == "closed":
return True
if self.state == "open":
if time.time() - self.last_failure > self.reset_timeout:
self.state = "half-open"
return True # Try one request
return False
return True # half-open: allow attempt
import json
from pathlib import Path
class Checkpointer:
def __init__(self, path="checkpoint.json"):
self.path = Path(path)
self.state = self._load()
def _load(self):
if self.path.exists():
return json.loads(self.path.read_text())
return {"completed_urls": [], "last_page": 0, "cursor": None}
def save(self):
self.path.write_text(json.dumps(self.state))
def is_done(self, url):
return url in self.state["completed_urls"]
def mark_done(self, url):
self.state["completed_urls"].append(url)
if len(self.state["completed_urls"]) % 50 == 0:
self.save() # Periodic save
dashboard:
real_time:
- metric: "requests_per_minute"
alert_if: "> 60 for small sites"
- metric: "success_rate"
alert_if: "< 90%"
- metric: "avg_response_time_ms"
alert_if: "> 5000"
- metric: "blocked_rate"
alert_if: "> 10%"
per_run:
- metric: "pages_scraped"
- metric: "items_extracted"
- metric: "items_validated"
- metric: "items_deduplicated"
- metric: "new_items"
- metric: "updated_items"
- metric: "errors_by_type"
- metric: "run_duration"
- metric: "proxy_cost"
weekly:
- metric: "data_freshness"
description: "% of records updated in last 7 days"
- metric: "site_structure_changes"
description: "Selectors that stopped matching"
- metric: "total_cost"
description: "Proxy + compute + storage"
Sites redesign. Selectors break. Detect it early:
def health_check(results: list[dict], expected_fields: list[str]) -> dict:
"""Check if scraper is still extracting correctly."""
total = len(results)
if total == 0:
return {"status": "CRITICAL", "message": "Zero results — likely broken"}
field_coverage = {}
for field in expected_fields:
filled = sum(1 for r in results if r.get(field))
coverage = filled / total
field_coverage[field] = coverage
issues = []
for field, coverage in field_coverage.items():
if coverage < 0.5:
issues.append(f"{field}: {coverage:.0%} fill rate (expected >50%)")
if issues:
return {"status": "WARNING", "issues": issues}
return {"status": "OK", "field_coverage": field_coverage}
Daily:
Weekly:
Monthly:
use_case: "Track competitor prices daily"
tool: "requests + BeautifulSoup"
schedule: "Daily at 03:00 UTC (off-peak)"
targets: ["competitor-a.com/products", "competitor-b.com/api"]
data:
- product_id
- product_name
- price
- currency
- in_stock
- scraped_at
storage: "SQLite with price history"
alerts: "Price change > 10% → notify"
use_case: "Aggregate job listings from multiple boards"
tool: "Scrapy with per-site spiders"
schedule: "Every 6 hours"
targets: ["board-a.com", "board-b.com", "board-c.com"]
data:
- title
- company
- location
- salary_range
- posted_date
- url
- source
dedup: "Hash(title + company + location)"
storage: "PostgreSQL"
use_case: "Monitor industry news mentions"
tool: "requests + RSS feeds (preferred) + web fallback"
schedule: "Every 30 minutes"
approach:
1: "RSS/Atom feeds (fastest, cleanest)"
2: "Google News RSS for topic"
3: "Direct scraping if no feed"
data:
- headline
- source
- url
- published_at
- snippet
- sentiment
alerts: "Keyword match → immediate notification"
use_case: "Track brand mentions and sentiment"
tool: "Official APIs (always) + web search fallback"
rules:
- NEVER scrape social platforms directly — use APIs
- Twitter/X: Official API ($100/mo basic)
- Reddit: Official API (free tier available)
- LinkedIn: No scraping (aggressive legal action)
- Instagram: Official API only (Meta Business)
fallback: "Brave/Google search for public mentions"
use_case: "Track property listings and prices"
tool: "Playwright (most listing sites are JS-heavy)"
schedule: "Daily"
challenges:
- Heavy JavaScript rendering
- Anti-bot measures (Cloudflare common)
- Frequent layout changes
- Map-based results
approach: "API endpoint discovery via network tab first"
Single machine (small scale):
├── asyncio + aiohttp (Python) → 50-200 concurrent requests
├── Worker pool (ThreadPoolExecutor) → 10-50 threads
└── Scrapy reactor → Built-in concurrency
Multi-machine (large scale):
├── URL queue: Redis / RabbitMQ / SQS
├── Workers: Multiple Scrapy/custom workers
├── Results: Shared PostgreSQL / S3
└── Coordinator: Celery / custom scheduler
| Lever | Impact | How | |-------|--------|-----| | Static > Browser | 10-50x cheaper | Always try HTTP first | | Block images/CSS/fonts | 60-80% bandwidth saved | Route filtering | | Cache DNS | Minor but cumulative | Local DNS cache | | Compress responses | 50-70% bandwidth | Accept-Encoding: gzip, br | | Smart scheduling | Avoid redundant scrapes | Change detection before full re-scrape | | Proxy tier matching | 3-10x cost difference | Don't use residential for easy sites |
Before building a scraper, check if the site has hidden API endpoints:
Common hidden API patterns:
/api/v1/products?page=1&limit=20/graphql with query parameters/_next/data/... (Next.js data routes)/wp-json/wp/v2/posts (WordPress)# Minimize browser resource usage
context = browser.new_context(
viewport={"width": 1280, "height": 720},
java_script_enabled=True, # Only if needed
has_touch=False,
is_mobile=False,
)
# Block resource types you don't need
page.route("**/*", lambda route: (
route.abort() if route.request.resource_type in
["image", "stylesheet", "font", "media"]
else route.continue_()
))
# When authorized to scrape behind login
# ALWAYS use session-based auth, never store passwords in code
# Pattern: Login once, reuse session
session = requests.Session()
login_resp = session.post("https://example.com/login", data={
"username": os.environ["SCRAPE_USER"],
"password": os.environ["SCRAPE_PASS"],
})
assert login_resp.ok, "Login failed"
# Session cookies are now stored — use for subsequent requests
data_resp = session.get("https://example.com/api/data")
import hashlib
def has_changed(url, session, last_etag=None, last_modified=None):
"""Check if page changed without downloading full content."""
headers = {}
if last_etag:
headers["If-None-Match"] = last_etag
if last_modified:
headers["If-Modified-Since"] = last_modified
resp = session.head(url, headers=headers)
if resp.status_code == 304:
return False, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
return True, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
| Dimension | Weight | What to Assess | |-----------|--------|---------------| | Legal compliance | 20% | robots.txt, ToS, PII handling, audit trail | | Data quality | 20% | Validation, accuracy, completeness, freshness | | Resilience | 15% | Error handling, retries, circuit breakers, checkpointing | | Anti-detection | 15% | Proxy rotation, fingerprint diversity, rate limiting | | Architecture | 10% | Right tool selection, clean code, modularity | | Monitoring | 10% | Success rates, breakage detection, alerting | | Performance | 5% | Speed, cost efficiency, resource usage | | Documentation | 5% | Runbook, schema docs, legal assessment |
Grading: 90+ Excellent | 75-89 Good | 60-74 Needs work | <60 Redesign
| # | Mistake | Fix | |---|---------|-----| | 1 | No robots.txt check | Always check first — it's your legal defense | | 2 | Fixed delays (no jitter) | Add ±30% random jitter to all delays | | 3 | No data validation | Validate every field before storing | | 4 | Using browser for static HTML | HTTP client is 10-50x faster and cheaper | | 5 | Single IP, no rotation | Proxy rotation for any serious scraping | | 6 | No breakage detection | Monitor extraction counts and field fill rates | | 7 | Storing raw HTML only | Extract + structure immediately | | 8 | No checkpoint/resume | Long scrapes must be resumable | | 9 | Ignoring structured data | JSON-LD/microdata is cleaner than CSS selectors | | 10 | Scraping when API exists | Always check for API first |
Single-page apps (React/Vue/Angular): Must use browser rendering OR find the underlying API (network tab). Prefer API discovery — it's faster and more reliable.
Infinite scroll: Intercept the XHR/fetch calls that load more content. Simulate scrolling only as last resort. The API endpoint usually accepts page or offset params.
CAPTCHAs: If you're hitting CAPTCHAs, you're scraping too aggressively. Slow down first. If CAPTCHAs persist: managed services (2Captcha, Anti-Captcha) or rethink approach.
Dynamic class names (CSS modules, Tailwind): Use data attributes, ARIA labels, or text content selectors instead. [data-testid="price"] survives redesigns. .sc-bdVTJa does not.
Multi-language sites: Detect language via html[lang] attribute. Set Accept-Language header to get desired locale. Watch for different URL structures (/en/, /de/, subdomains).
tools
Use when the user wants to connect to, test, or use the McDonalds service at mcp.mcd.cn, including checking authentication, probing MCP endpoints, listing tools, or calling McDonalds MCP tools through a reusable local CLI.
development
Web scraping platform — Twitter/X data, Vinted marketplace, and general web scraping API
development
SlowMist AI Agent Security Review — comprehensive security framework for skills, repositories, URLs, on-chain addresses, and products (Claude Code version)
data-ai
去除中文文本中的 AI 写作痕迹,使其读起来自然。基于维基百科 AI 写作特征指南,检测 24 种 AI 模式。触发词:humanizer-cn、去除 AI 痕迹、去除 AI 写作痕迹、中文文本人性化。