skills/extract/SKILL.md
Extract structured data from websites and produce an executable Playwright script plus extracted data. Use when the user wants to scrape, extract, pull, collect, or harvest data from any website — product listings, tables, search results, feeds, profiles, or any repeating content.
npx skillsauth add actionbook/actionbook extractInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Activate when the user wants to obtain data from a website:
The deliverable is always two artifacts:
.cjs file that reproduces the extraction without Actionbook at runtime.Use Actionbook as a conditional accelerator, not a mandatory step. The goal is reliable selectors in the shortest path.
User request
│
├─► actionbook search "<site> <intent>"
│ ├─ Results with Health Score ≥ 70% ──► actionbook get "<ID>" ──► use selectors
│ └─ No results / low score ──► Fallback
│
└─► Fallback: actionbook browser open <url>
├─ actionbook browser snapshot (accessibility tree → find selectors)
├─ actionbook browser screenshot (visual confirmation)
└─ manual selector discovery via DOM inspection
Priority order for selector sources:
| Priority | Source | When |
|----------|--------|------|
| 1 | actionbook get | Site is indexed, health score ≥ 70% |
| 2 | actionbook browser snapshot | Not indexed or selectors outdated |
| 3 | DOM inspection via screenshot + snapshot | Complex SPA / dynamic content |
Non-negotiable rule: if search + get already provides usable selectors for required fields, start from get selectors and do not jump to full fallback (snapshot/screenshot) by default. Exception: lightweight mechanism probes (for hydration/virtualization/pagination) are allowed when runtime behavior may affect script correctness. Escalate to snapshot/screenshot only when probes/sample validation indicate selector gaps or instability.
Websites use patterns that break naive scraping. The generated Playwright script must account for these:
Pages may render a shell first, then stream or hydrate content.
// Wait for hydration to complete — not just DOMContentLoaded
await page.waitForSelector('[data-item]', { state: 'attached' });
await page.waitForFunction(() => {
const items = document.querySelectorAll('[data-item]');
return items.length > 0 && !document.querySelector('[data-pending]');
});
Detection cues: React root with data-reactroot, Next.js __NEXT_DATA__, empty containers that fill after JS runs. If actionbook browser text "<selector>" returns empty but the screenshot shows content, hydration hasn't completed.
Only visible rows exist in the DOM. Scrolling renders new rows and destroys old ones.
// Scroll-and-collect loop for virtualized lists (scroll container aware)
const allItems = [];
const maxScrolls = 50;
let scrolls = 0;
const container = await page.$('<scroll-container-selector>');
if (!container) throw new Error('Scroll container not found');
let previousTop = await container.evaluate(el => el.scrollTop);
while (scrolls < maxScrolls) {
const items = await page.$$eval('[data-row]', rows =>
rows.map(r => ({ text: r.textContent.trim() }))
);
for (const item of items) {
if (!allItems.find(i => i.text === item.text)) allItems.push(item);
}
await container.evaluate(el => el.scrollBy(0, 600));
await page.waitForTimeout(300);
const currentTop = await container.evaluate(el => el.scrollTop);
if (currentTop === previousTop) break;
previousTop = currentTop;
scrolls += 1;
}
Detection cues: Container has fixed height with overflow: auto/scroll, row count in DOM is much smaller than stated total, rows have transform: translateY(...) or position: absolute; top: ...px.
New content appends when the user scrolls near the bottom.
// Scroll to bottom until no new content loads (with no-growth tolerance)
let itemCount = 0;
let noGrowthStreak = 0;
const maxScrolls = 80;
let scrolls = 0;
while (scrolls < maxScrolls && noGrowthStreak < 3) {
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
await page.waitForTimeout(1200);
const newCount = await page.$$eval('.item', els => els.length);
if (newCount > itemCount) {
itemCount = newCount;
noGrowthStreak = 0;
} else {
noGrowthStreak += 1;
}
scrolls += 1;
}
Detection cues: Intersection Observer in page JS, "Load more" button, sentinel element at bottom, network requests firing on scroll.
Multi-page results behind "Next" buttons or numbered pages.
// Click-through pagination (navigation-aware, SPA-safe)
const allData = [];
const maxPages = 50;
let pageIndex = 0;
while (pageIndex < maxPages) {
const pageData = await page.$$eval('.result-item', items =>
items.map(el => ({ title: el.querySelector('h3')?.textContent?.trim() }))
);
allData.push(...pageData);
const nextBtn = await page.$('a.next-page:not([disabled])');
if (!nextBtn) break;
const previousUrl = page.url();
const previousFirstItem = await page
.$eval('.result-item', el => el.textContent?.trim() || '')
.catch(() => '');
await nextBtn.click();
// Post-click detection only: advance must be caused by this click
const advanced = await Promise.any([
page
.waitForURL(url => url.toString() !== previousUrl, { timeout: 5000 })
.then(() => true),
page
.waitForFunction(
prev => {
const first = document.querySelector('.result-item');
return !!first && (first.textContent || '').trim() !== prev;
},
previousFirstItem,
{ timeout: 5000 }
)
.then(() => true),
]).catch(() => false);
if (!advanced) break;
await page.waitForLoadState('networkidle').catch(() => {});
pageIndex += 1;
}
Identify from the user request:
# Try Actionbook index first
actionbook search "<site> <data-description>" --domain <domain>
# If good results (health ≥ 70%), get full selectors
actionbook get "<ID>"
Use this routing strictly:
Path A (default when get is good): requested fields are covered by get selectors and quality is acceptable.
get selectors and move to script draft quickly.browser text, quick scroll checks) before finalizing script strategy.snapshot / screenshot) before first draft unless probe/sample validation shows mismatch.get selectors and mark source as actionbook_get.Path B (partial / unstable): get exists but required fields are missing, selector resolves 0 elements, or validation fails.
Path C (no usable coverage): search/get has no usable result.
Path A mechanism detection timing:
actionbook browser open "<url>" (if current tab context is unknown/stale)Fallback discovery by path:
Path B targeted fallback (only failed fields/steps):
actionbook browser open "<url>" # if not already open
actionbook browser snapshot # focus on failed field/container mapping
# actionbook browser screenshot # optional visual confirmation for failed area
Path C full fallback (no usable coverage):
actionbook browser open "<url>"
actionbook browser snapshot
actionbook browser screenshot
Mechanism probes (run when script strategy needs confirmation):
# Hydration / streaming check
actionbook browser text "<container-selector>"
# Infinite scroll quick signal (explicit before/after decision)
actionbook browser eval "document.querySelectorAll('<item-selector>').length" # before
actionbook browser click "<scroll-container-selector-or-body>" # focus scroll context
actionbook browser eval "const c=document.querySelector('<scroll-container-selector>') || document.scrollingElement; c.scrollBy(0, c.clientHeight || window.innerHeight);"
actionbook browser eval "document.querySelectorAll('<item-selector>').length" # after
# If count increases, treat page as lazy-load/infinite-scroll.
Fallback trigger conditions:
actionbook get cannot map all required fields.actionbook get selectors return empty/unstable values in sample run.Write a standalone Playwright script (extract_<domain>_<slug>.cjs) that:
load — see mechanisms above).JSON.stringify / CSV).maxPages, maxScrolls, timeout budget) to avoid infinite loops.Script template:
// extract_<domain>_<slug>.cjs
// Generated by Actionbook extract skill
// Usage: node extract_<domain>_<slug>.cjs
const { chromium } = require('playwright');
(async () => {
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('<URL>', { waitUntil: 'domcontentloaded' });
// -- wait for readiness --
await page.waitForSelector('<container>', { state: 'visible' });
// -- extract --
const data = await page.$$eval('<item-selector>', items =>
items.map(el => ({
// fields mapped from user request
}))
);
// -- output --
const fs = require('fs');
fs.writeFileSync('output.json', JSON.stringify(data, null, 2));
console.log(`Extracted ${data.length} items → output.json`);
await browser.close();
})();
Run the script to confirm it works:
node extract_<domain>_<slug>.cjs
Validation rules:
| Check | Pass condition |
|-------|---------------|
| Script exits 0 | No runtime errors |
| Output file exists | Non-empty file written |
| Record count > 0 | At least one item extracted |
| No null/empty fields | Every declared field has a value in ≥ 90% of records |
| Data matches page | Spot-check first and last record against actionbook browser text |
If validation fails, inspect the output, adjust selectors or wait strategy, and re-run.
Present to the user:
.cjs file they can re-run anytime.Every extract invocation produces:
| Artifact | Path | Format |
|----------|------|--------|
| Playwright script | ./extract_<domain>_<slug>.cjs | Standalone Node.js script using playwright |
| Extracted data | ./output.json (default) or user-specified path | JSON array of objects (default), CSV, or user-specified |
The script must be re-runnable — a user should be able to execute it later without Actionbook installed, as long as Node.js + Playwright are available in the runtime environment.
When multiple selector types are available from actionbook get:
| Priority | Type | Reason |
|----------|------|--------|
| 1 | data-testid | Stable, test-oriented, rarely changes |
| 2 | aria-label | Accessibility-driven, semantically meaningful |
| 3 | CSS selector | Structural, may break on redesign |
| 4 | XPath | Last resort, most brittle |
| Error | Action |
|-------|--------|
| actionbook search returns no results | Fall back to snapshot + screenshot |
| Selector returns 0 elements | Re-snapshot, compare with screenshot, update selector |
| Script times out | Add longer waitForTimeout, check for anti-bot measures |
| Partial data (some fields empty) | Check if content is lazy-loaded; add scroll/wait |
| Anti-bot / CAPTCHA | Inform user; suggest running with headless: false or using their own browser session via actionbook setup extension mode |
development
Browser action engine. Provides up-to-date action manuals for the modern web — operate any website instantly, one tab or dozens, concurrently.
tools
Deep research and analysis tool. Generates comprehensive HTML reports on any topic, domain, paper, or technology. Enhanced with advanced browser automation — SPA handling, network idle wait, batch operations, stealth browsing, and intelligent page analysis. Use when user asks to research, analyze, investigate, deep-dive, or generate a report on any subject.
development
Learn Rust language features and crate updates. Use when user asks about Rust version changelog, what's new in Rust, crate updates, Cargo.toml dependencies, tokio/serde/axum features, or any Rust ecosystem questions.
development
Automates browser interactions for web testing, form filling, screenshots, and data extraction. Use when the user needs to navigate websites, interact with web pages, fill forms, take screenshots, test web applications, or extract information from web pages.