skills/orthogonal-scrape/SKILL.md
Scrape websites, extract structured data, and automate browsers. Use when asked to scrape, extract, crawl, parse, or pull data from web pages or any URL.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add orthogonal-sh/skills scrape
Scrape websites, extract structured data, and automate browser interactions. Pick the best API for the task — or combine several for comprehensive extraction.
scrapegraph
Best for: Extracting data using plain English prompts, converting pages to markdown, crawling with AI extraction, and search-based scraping.
AI-powered extraction (describe what you want in natural language):
orth run scrapegraph /v1/smartscraper --body '{
"website_url": "https://example.com/products",
"user_prompt": "Extract all product names, prices, descriptions, and image URLs"
}'
With output schema (enforce structure):
orth run scrapegraph /v1/smartscraper --body '{
"website_url": "https://example.com/products",
"user_prompt": "Extract all products",
"output_schema": {
"properties": {
"products": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "number"},
"description": {"type": "string"}
}
}
}
}
}
}'
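Long JSON bodies like the one above are easy to break with shell quoting, particularly when a prompt contains an apostrophe. One possible workaround, sketched here in plain POSIX shell, is to build the body with a quoted heredoc and pass it through the same --body flag. The prompt text is illustrative only.

```shell
# Build the request body with a quoted heredoc so the shell leaves
# quotes, apostrophes, and $ signs inside the JSON untouched.
body=$(cat <<'JSON'
{
  "website_url": "https://example.com/products",
  "user_prompt": "Extract each product's name and price"
}
JSON
)

# Then pass it the same way as in the examples above:
#   orth run scrapegraph /v1/smartscraper --body "$body"
```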
Search + scrape (search the web and extract from results):
orth run scrapegraph /v1/searchscraper --body '{"user_prompt": "Find the latest iPhone prices from major retailers"}'
# Poll for results:
orth run scrapegraph /v1/searchscraper/{request_id}
Convert page to markdown:
orth run scrapegraph /v1/markdownify --body '{"website_url": "https://example.com/article"}'
Crawl with AI extraction:
orth run scrapegraph /v1/crawl --body '{
"url": "https://docs.example.com",
"prompt": "Extract all API endpoints and their descriptions",
"max_pages": 20
}'
# Poll for results:
orth run scrapegraph /v1/crawl/{task_id}
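Several endpoints in this skill return an ID that has to be polled. A minimal sketch of a generic polling helper follows. The helper name poll_until is hypothetical, and the "status" field and its values in the usage comment are assumptions, not taken from the API documentation; check the real response shape with orth api show before relying on them.

```shell
# poll_until PATTERN INTERVAL MAX_TRIES CMD...
# Re-runs CMD until its output matches the extended regex PATTERN,
# then prints that output. Returns 1 if CMD fails or MAX_TRIES runs out.
poll_until() {
  pattern=$1; interval=$2; max_tries=$3
  shift 3
  try=0
  while [ "$try" -lt "$max_tries" ]; do
    response=$("$@") || return 1          # propagate CLI failures
    if printf '%s\n' "$response" | grep -Eq "$pattern"; then
      printf '%s\n' "$response"
      return 0
    fi
    try=$((try + 1))
    sleep "$interval"
  done
  echo "poll_until: gave up after $max_tries tries" >&2
  return 1
}

# Usage (the status values here are ASSUMED; TASK_ID is a placeholder):
#   poll_until '"status" *: *"(completed|failed)"' 5 60 \
#     orth run scrapegraph /v1/crawl/TASK_ID
```

Taking the pattern as an argument keeps the helper independent of any one API's response format, so the same function could serve the other polling endpoints in this document.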
Raw HTML scrape:
orth run scrapegraph /v1/scrape --body '{"website_url": "https://example.com"}'
Get sitemap:
orth run scrapegraph /v1/sitemap --body '{"website_url": "https://example.com"}'
Key parameters: stealth (bypass bot protection, +4 credits), total_pages (paginate up to 100), number_of_scrolls (infinite scroll pages), render_heavy_js (React/Vue/Angular SPAs), steps (interaction steps before extraction).
olostep
Best for: High-volume scraping, batch processing, site crawling, URL discovery, and AI-powered answers from pages.
Scrape a single page:
orth run olostep /v1/scrapes --body '{"url_to_scrape": "https://example.com/page"}'
AI-powered answer from the web:
orth run olostep /v1/answers --body '{"task": "What is the pricing for Stripe?"}'
Discover all URLs on a site:
orth run olostep /v1/maps --body '{"url": "https://example.com", "search_query": "pricing"}'
Crawl a site (async):
# Step 1: Start crawl
orth run olostep /v1/crawls --body '{
"start_url": "https://docs.example.com",
"max_pages": 100,
"include_urls": ["/docs/**"]
}'
# Step 2: Check status
orth run olostep /v1/crawls/{crawl_id}
# Step 3: Get pages
orth run olostep /v1/crawls/{crawl_id}/pages
# Step 4: Retrieve content
orth run olostep /v1/retrieve --body '{"retrieve_id": "RETRIEVE_ID"}'
Batch scrape (process many URLs at once):
orth run olostep /v1/batches --body '{
"items": [
{"url_to_scrape": "https://example.com/page1"},
{"url_to_scrape": "https://example.com/page2"},
{"url_to_scrape": "https://example.com/page3"}
]
}'
# Check status:
orth run olostep /v1/batches/{batch_id}
# Get items:
orth run olostep /v1/batches/{batch_id}/items
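For larger batches, writing the items array by hand becomes tedious. The sketch below builds the same {"items": [...]} body from a newline-separated URL list. The helper name build_batch_body and the file urls.txt are hypothetical, and the escaping only covers backslashes and double quotes.

```shell
# Hypothetical helper: read URLs from stdin (one per line) and print the
# {"items": [...]} body used by /v1/batches above.
build_batch_body() {
  items=""
  while IFS= read -r url || [ -n "$url" ]; do
    [ -n "$url" ] || continue            # skip blank lines
    # Escape backslashes and double quotes so the JSON stays valid.
    escaped=$(printf '%s' "$url" | sed 's/\\/\\\\/g; s/"/\\"/g')
    items="${items:+$items,}{\"url_to_scrape\":\"$escaped\"}"
  done
  printf '{"items":[%s]}\n' "$items"
}

# Usage (urls.txt is an assumed file with one URL per line):
#   orth run olostep /v1/batches --body "$(build_batch_body < urls.txt)"
```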
Key parameters: formats (markdown/html/text), country (US, CA, IT, IN, GB, JP, etc.), actions (page interactions before scraping), wait_before_scraping, remove_css_selectors, llm_extract.
riveter
Best for: Extracting data into a consistent, predefined structure. Define input URLs and output fields with prompts.
Simple page scrape:
orth run riveter /v1/scrape --body '{"url": "https://example.com/article"}'
Structured extraction (define your output schema):
orth run riveter /v1/run --body '{
"input": {
"urls": ["https://example.com/products"]
},
"output": {
"name": {"prompt": "Product name", "contexts": ["urls"]},
"price": {"prompt": "Product price", "contexts": ["urls"], "format": "number"},
"description": {"prompt": "Product description", "contexts": ["urls"]}
}
}'
# Check status:
orth run riveter /v1/run_status --query 'run_key=RUN_KEY'
# Get data:
orth run riveter /v1/run_data --query 'run_key=RUN_KEY'
Multi-URL extraction with tools:
orth run riveter /v1/run --body '{
"input": {
"company_urls": ["https://stripe.com", "https://vercel.com"]
},
"output": {
"company_name": {"prompt": "Company name", "contexts": ["company_urls"]},
"pricing_url": {"prompt": "URL to pricing page", "contexts": ["company_urls"], "format": "url"},
"pricing_details": {"prompt": "Pricing tiers and costs", "contexts": ["pricing_url"], "tools": ["web_scrape"]}
}
}'
Key parameters: Output format options (number/json/url/text/email/tag/date/boolean), tools (web_search/web_scrape/query_pdf/query_image), max_tool_calls (0-10), run_when (always/any_filled/all_filled).
brand-dev
Best for: Extracting brand logos, colors, fonts, design systems, screenshots, and AI-powered data extraction from company websites.
Get full brand data:
orth run brand-dev /v1/brand/retrieve --query 'domain=stripe.com'
By company name / email / ticker:
orth run brand-dev /v1/brand/retrieve-by-name --query 'name=Stripe'
orth run brand-dev /v1/brand/retrieve-by-email --query 'email=EMAIL_ADDRESS'
orth run brand-dev /v1/brand/retrieve-by-ticker --query 'ticker=AAPL'
Extract design system / styleguide:
orth run brand-dev /v1/brand/styleguide --query 'domain=linear.app'
Extract fonts:
orth run brand-dev /v1/brand/fonts --query 'domain=vercel.com'
Take website screenshot:
orth run brand-dev /v1/brand/screenshot --query 'domain=github.com&fullScreenshot=true'
AI-powered data extraction:
orth run brand-dev /v1/brand/ai/query --body '{
"domain": "anthropic.com",
"data_to_extract": [{"name": "products", "description": "What products does this company offer?"}]
}'
Extract products:
orth run brand-dev /v1/brand/ai/products --body '{"domain": "stripe.com"}'
notte
Best for: Scraping pages that require browser interaction, CAPTCHAs, login flows, or complex JavaScript rendering. Also supports autonomous AI agents for multi-step browser tasks.
Quick scrape (no session needed):
orth run notte /scrape --body '{"url": "https://example.com"}'
Session-based scraping (for complex interactions):
# Step 1: Start a browser session
orth run notte /sessions/start --body '{"url": "https://example.com", "proxies": true, "solve_captchas": true}'
# Step 2: Observe available actions
orth run notte /sessions/{session_id}/page/observe --body '{"instruction": "Find the search box"}'
# Step 3: Execute actions
orth run notte /sessions/{session_id}/page/execute --body '{"instruction": "Click the search button"}'
# Step 4: Scrape the page
orth run notte /sessions/{session_id}/page/scrape --body '{"only_main_content": true}'
# Step 5: Stop session
orth run notte /sessions/{session_id}/stop
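If a step fails partway through the flow above, the stop call in Step 5 may never run and the session stays open. A minimal sketch of a wrapper that always stops the session after scraping follows. The function name scrape_and_stop is hypothetical, and it expects a session ID you have already extracted from the start response.

```shell
# Hypothetical wrapper: scrape an existing Notte session, then stop it
# whether or not the scrape succeeded. Returns the scrape's exit status.
scrape_and_stop() {
  session_id=$1
  status=0
  orth run notte "/sessions/$session_id/page/scrape" \
    --body '{"only_main_content": true}' || status=$?
  # Always release the browser session; ignore errors from the stop call.
  orth run notte "/sessions/$session_id/stop" >/dev/null || true
  return "$status"
}

# Usage (SESSION_ID is a placeholder for the ID returned by /sessions/start):
#   scrape_and_stop SESSION_ID
```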
AI agent (autonomous multi-step browser task):
orth run notte /agents/start --body '{
"task": "Go to Google, search for AI news, and summarize the top 5 results",
"url": "https://google.com",
"max_steps": 20
}'
# Check status:
orth run notte /agents/{agent_id}
Take screenshot:
orth run notte /sessions/{session_id}/page/screenshot --body '{"full_page": true}'
Key parameters: proxies (rotate proxies), solve_captchas (auto-solve), headless (default true), browser_type (chromium/chrome/firefox), viewport_width/viewport_height.
output_schema enforces a structure on the extracted data
stealth: true, or Notte's proxies: true + solve_captchas: true, for sites with bot protection
render_heavy_js: true, or Notte browser sessions, for JavaScript-heavy pages (React/Vue/Angular SPAs)
steps for simple interactions before extraction, or Notte sessions for complex multi-step flows
total_pages (up to 100) handles multi-page extraction automatically
/v1/markdownify for clean markdown from any page
List all endpoints for any API, or add a path for parameter details:
orth api show scrapegraph
orth api show olostep
orth api show riveter
orth api show brand-dev
orth api show notte
Example: orth api show scrapegraph /v1/smartscraper for full parameter details.