plugins/grabber-development/skills/grabber-development/SKILL.md
Comprehensive Python web scraping knowledge base covering stealth browser automation (Patchright, Camoufox, Nodriver), TLS/HTTP fingerprint impersonation (curl_cffi, primp), anti-bot bypass (Cloudflare, DataDome, PerimeterX), CAPTCHA solving, proxy architecture, AI-assisted extraction (Crawl4AI, Firecrawl, ScrapeGraphAI), framework selection (Scrapy, Crawlee), rate limiting, and production observability. TRIGGER WHEN: building, implementing, writing, coding, creating, optimizing, or debugging Python web scrapers. DO NOT TRIGGER WHEN: the task is outside the specific scope of this component.
Install this skill globally with one command: `npx skillsauth add acaprino/alfio-claude-plugins grabber-development`. Works with Claude Code, Cursor, and Windsurf.
Knowledge base for building production-grade Python web scraping systems. Covers the full stack from target assessment through production observability.
This section overrides everything else in this skill if there is any conflict. Read it first, act on it first.
When this skill activates on a scraping task, your next non-question tool call MUST launch a visible browser with the capture surface attached. Not Write pyproject.toml. Not Write models.py. Not "let me sketch the architecture first". Browser first, then code.
The default path is user-driven navigation with live capture, not Claude-clicks. The user knows their data and their portal better than you do, and authenticated SaaS sites need them anyway. Steps:
1. Launch the browser through playwright-skill (preferred) or write an inline Patchright script via Bash. The script must run with headless=False, attach every handler in the Capture Surface below, and park on input() waiting for the user.

The Claude-drives variant is fine only when there is no login, no 2FA, and no UI-knowledge gap. Same launch, same capture handlers; you call page.goto / page.click yourself instead of parking on input().
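The user-driven launch described above could be sketched roughly as follows. This is a minimal illustration, not the skill's own script: the function names `attach_basic_capture` and `run_user_driven_capture` are hypothetical, only the request and response handlers are attached, and it assumes Patchright exposes the same async API as Playwright under `patchright.async_api`. The async API is used so the event loop keeps delivering network events while `input()` waits in a worker thread.

```python
import asyncio


def attach_basic_capture(page, capture: list) -> None:
    """Append every request and response the page sees to `capture`."""

    def on_request(request):
        capture.append({
            "kind": "request",
            "method": request.method,
            "url": request.url,
            "post_data": request.post_data,  # kept in memory only
        })

    def on_response(response):
        capture.append({
            "kind": "response",
            "status": response.status,
            "url": response.url,
        })

    page.on("request", on_request)
    page.on("response", on_response)


async def run_user_driven_capture(start_url: str) -> list:
    """Open a visible browser, let the user navigate, return the capture."""
    # Assumption: Patchright is installed and mirrors Playwright's async API.
    from patchright.async_api import async_playwright

    capture: list = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        attach_basic_capture(page, capture)  # attach before the first navigation
        await page.goto(start_url)
        # Park on input() in a thread so handlers keep firing meanwhile.
        await asyncio.to_thread(input, "Navigate, then press Enter when done... ")
        await browser.close()
    return capture
```

A call such as `asyncio.run(run_user_driven_capture("https://example.test/login"))` (placeholder URL) would return the in-memory capture once the user presses Enter.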
Writing project files (pyproject.toml, src/<pkg>/..., models.py) before the capture is in your hands is the failure mode this section exists to prevent. If you catch yourself drafting field-name alias tuples from "common patterns" (Italian + English, REST conventions, framework defaults), stop and launch the browser instead.
The full capture surface, output checklist, and anti-patterns are in the Discovery Gate section below. Read those too. But the imperative is here: browser before code, every time, user navigates by default.
Phase 1 (Target Assessment) and Phase 2 (Data Discovery) are blocking gates, not optional steps. You MUST execute them yourself and have their concrete outputs in hand before scaffolding any project file (pyproject.toml, modules, models, CLI). No exceptions.
You always control the browser session and the capture. The deliverable of discovery is not a script you hand over; it is a live capture you watched. Always launch the browser yourself (via playwright-skill or inline Patchright) with headless=False and the full capture surface attached, and keep the session open inside your turn.
Who clicks depends on the task. The capture is yours either way:
- User drives (default): park on an input() checkpoint, let the user navigate while the network capture streams live, then dump the capture when they signal "done".
- Claude drives (no login, no 2FA, no UI-knowledge gap): same launch and capture handlers, but you call page.goto / page.click yourself.

What you must not do is write scripts/discover.py and tell the user "run this and paste the output back". That breaks the loop: by the time the user runs it, you have no eyes on the session and no chance to ask "wait, click that filter again, I lost the payload".

Capture surface (attach all of these from page launch):
- page.on("request") / page.on("response") for XHR + fetch (URL, method, status, headers, cookies, request body, response body when JSON or text)
- page.on("websocket") then ws.on("framesent") / ws.on("framereceived") for WebSocket traffic in both directions
- Streaming responses: text/event-stream (SSE) and chunked transfer
- page.on("worker") for service-worker- and dedicated-worker-initiated requests
- GraphQL: requests to /graphql whose request body has operationName / variables / extensions.persistedQuery.sha256Hash
- context.cookies() after login, plus any anti-bot cookies (cf_clearance, __cf_bm, datadome, _px3, ak_bmsc, incap_ses)
- page.on("framenavigated") filtered to the main frame, to record every landing URL after redirects

Redact Authorization, Cookie, and password fields in anything saved to disk. Keep them in the in-memory capture you reason from.
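The redaction rule above can be illustrated with a small helper that copies a capture entry before masking it, so the in-memory original stays intact. This is a sketch under assumptions: the entry shape (a `headers` dict and a JSON-decoded `body`) and the names `redact_entry` and `REDACTED` are hypothetical, not something this skill prescribes.

```python
import copy

REDACTED = "[REDACTED]"
SENSITIVE_HEADERS = {"authorization", "cookie"}
SENSITIVE_BODY_KEYS = {"password"}


def _redact_body(value):
    """Recursively mask sensitive keys inside a decoded JSON body."""
    if isinstance(value, dict):
        return {
            key: REDACTED if key.lower() in SENSITIVE_BODY_KEYS else _redact_body(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [_redact_body(item) for item in value]
    return value


def redact_entry(entry: dict) -> dict:
    """Return a deep copy of one capture entry that is safe to write to disk."""
    safe = copy.deepcopy(entry)
    headers = safe.get("headers") or {}
    for name in headers:
        if name.lower() in SENSITIVE_HEADERS:
            headers[name] = REDACTED
    if "body" in safe:
        safe["body"] = _redact_body(safe["body"])
    return safe
```

Only the copy returned by `redact_entry` would be serialized; the unredacted entry remains available for reasoning during the session.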
Discovery outputs you MUST collect before scaffolding (treat as a checklist; if any item is still a guess, you have not finished discovery):
- Observed routes and URLs (no /#/... guesses)
- GraphQL operationName and variables, persisted-query SHA if present
- Anti-bot cookies: cf_clearance, __cf_bm, datadome, _px3, ak_bmsc, incap_ses present or absent

If any of those is still a guess, you have not finished discovery; do not proceed to scaffolding.
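Two of the checklist items, the GraphQL operation details and the anti-bot cookie inventory, can be derived mechanically from the capture. The sketch below is illustrative only: the function names are hypothetical, cookies are assumed to be the list of dicts returned by context.cookies(), and cookie names are matched by prefix on the assumption that some vendors append suffixes to the base name.

```python
import json
from typing import Dict, List, Optional
from urllib.parse import urlsplit

ANTI_BOT_COOKIES = ("cf_clearance", "__cf_bm", "datadome", "_px3", "ak_bmsc", "incap_ses")


def anti_bot_cookie_report(cookies: List[dict]) -> Dict[str, bool]:
    """Map each known anti-bot cookie name to present (True) or absent (False)."""
    names = [cookie.get("name", "") for cookie in cookies]
    return {
        marker: any(name.startswith(marker) for name in names)
        for marker in ANTI_BOT_COOKIES
    }


def graphql_operation(url: str, post_data: Optional[str]) -> Optional[dict]:
    """Extract operationName, variables and persisted-query hash, if any."""
    if not post_data or not urlsplit(url).path.endswith("/graphql"):
        return None
    try:
        body = json.loads(post_data)
    except ValueError:
        return None
    if not isinstance(body, dict):
        return None  # batched or malformed payloads are out of scope here
    extensions = body.get("extensions")
    persisted = extensions.get("persistedQuery") if isinstance(extensions, dict) else None
    sha = persisted.get("sha256Hash") if isinstance(persisted, dict) else None
    return {
        "operationName": body.get("operationName"),
        "variables": body.get("variables"),
        "sha256Hash": sha,
    }
```

Running both over the capture turns "present or absent" and "persisted-query SHA if present" into observed values instead of guesses.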
Anti-patterns:

- Writing pyproject.toml and module skeleton before observing one real network request from the target
- Guessing routes (/api/invoices, /#/fatture-ricevute) without observation
- Using a regex of likely names ((fatture|invoice|received|ricevute|passive)) as a substitute for the real endpoint name
- Writing Field(alias=...) tuples of "Italian + English likely names" instead of the names actually returned by the API
- Handing over a discover.py script as the first discovery step when you could open the browser yourself

For every scraping task, follow this sequence (the Discovery Gate above governs steps 1 and 2):
1. Launch a visible browser (playwright-skill or inline Patchright)
2. Capture traffic with page.on("request") and page.on("response")
3. Check for embedded `<script>` JSON before parsing DOM
4. If the request succeeds with impersonate="chrome" -- done

| Target Profile | HTTP Client | Browser | Framework |
|---------------|-------------|---------|-----------|
| No JS, no protection | curl_cffi | none | Scrapy / httpx |
| JS-rendered, no protection | none | Playwright | Crawlee |
| Basic Cloudflare | curl_cffi + cf_clearance | Patchright (for cookie) | Scrapy |
| Heavy Cloudflare | none | Patchright persistent | Crawlee |
| DataDome | none | Camoufox + ghost-cursor | custom |
| PerimeterX | none | Nodriver / Patchright | custom |
| AI extraction needed | none | Crawl4AI / Firecrawl | standalone |
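The sequence's preference for embedded `<script>` JSON over DOM parsing can be sketched with the standard library alone. This is an illustrative approach, not the skill's prescribed one: the class and function names are hypothetical, and only scripts typed `application/json` or `application/ld+json` are considered (Next.js pages, for example, ship their state in a `<script id="__NEXT_DATA__" type="application/json">` tag).

```python
import json
from html.parser import HTMLParser

JSON_SCRIPT_TYPES = ("application/json", "application/ld+json")


class ScriptJSONExtractor(HTMLParser):
    """Collect the decoded payload of every JSON-typed <script> tag."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._buffer = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            script_type = dict(attrs).get("type") or ""
            self._in_json_script = script_type in JSON_SCRIPT_TYPES
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            try:
                self.payloads.append(json.loads("".join(self._buffer)))
            except ValueError:
                pass  # not valid JSON after all; skip it


def extract_script_json(html: str) -> list:
    """Return every JSON payload embedded in <script> tags of `html`."""
    parser = ScriptJSONExtractor()
    parser.feed(html)
    parser.close()
    return parser.payloads
```

If the fields you need already appear in one of the returned payloads, DOM selectors may not be needed at all.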
| Tier | Type | Price Range | Use When |
|------|------|-------------|----------|
| 0 | No proxy | free | Unprotected targets, development |
| 1 | Datacenter | $0.10-0.50/GB | Light protection, high volume |
| 2 | ISP (static residential) | $0.53-1.47/IP | Account management, login flows |
| 3 | Residential | $0.49-8.00/GB | Anti-bot bypass, geo-targeting |
| 4 | Mobile | $4-13/GB | Highest trust, last resort |
field-guide.md -- full 2025-2026 Python web scraping field guide covering browser stealth, TLS fingerprinting, behavioral biometrics, anti-bot bypass, CAPTCHA solving, proxy landscape, frameworks, AI-assisted scraping, GraphQL reverse engineering, rate limiting, and observability