.claude/skills/ts-crawl4ai/SKILL.md
You are an expert in Crawl4AI, the open-source web crawler built for AI applications. You help developers extract clean, structured data from websites for LLM training, RAG pipelines, and content analysis — with automatic markdown conversion, JavaScript rendering, CSS-based extraction, LLM-powered structured extraction, and session management for multi-page crawling.
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode


async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://docs.example.com/getting-started",
            config=CrawlerRunConfig(
                word_count_threshold=10,       # Skip blocks with <10 words
                cache_mode=CacheMode.ENABLED,  # Reuse cached results on repeat runs
            ),
        )

    print(result.markdown)           # Clean markdown (LLM-ready)
    print(result.cleaned_html)       # Cleaned HTML
    print(result.media["images"])    # Extracted images
    print(result.links["internal"])  # Internal links
    print(result.links["external"])  # External links
    print(result.metadata)           # Title, description, keywords


asyncio.run(main())
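The internal links from one crawl can seed the next. The following is a minimal sketch, assuming each entry in `result.links["internal"]` is a dict with an `"href"` key (worth verifying against the installed Crawl4AI version). `next_urls` is a hypothetical helper name, not part of the library.

```python
from urllib.parse import urldefrag, urljoin


def next_urls(base_url: str, internal_links: list, seen: set) -> list:
    """Return unseen absolute URLs from a list of link dicts, in order."""
    fresh = []
    for link in internal_links:
        href = link.get("href")  # assumed key; entries without it are skipped
        if not href:
            continue
        # Resolve relative links and drop #fragments so anchors on the
        # same page are not treated as separate pages.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute not in seen:
            seen.add(absolute)
            fresh.append(absolute)
    return fresh
```

Passing the same `seen` set across calls keeps a multi-page crawl from revisiting URLs.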
import asyncio
import json
import os

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel


class Product(BaseModel):
    name: str
    price: float
    description: str
    rating: float | None = None  # Optional: not every listing shows a rating
    features: list[str]


extraction = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",
    api_token=os.environ["OPENAI_API_KEY"],
    schema=Product.model_json_schema(),
    instruction="Extract all products from this page",
)


async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://shop.example.com/keyboards",
            config=CrawlerRunConfig(extraction_strategy=extraction),
        )

    # extracted_content is a JSON string; validate each record against the model
    products = [Product(**p) for p in json.loads(result.extracted_content)]
    for p in products:
        print(f"{p.name}: ${p.price} - {p.rating}★")


asyncio.run(main())
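LLM output is not guaranteed to match the schema, so building `Product` objects directly can raise on a bad record. The following is a standard-library sketch that separates usable records from rejects before validation. It assumes `extracted_content` is a JSON string holding a list of dicts (or a single dict), and that failed chunks may carry a truthy `"error"` key; both are assumptions about Crawl4AI's output. `split_records` is a hypothetical helper.

```python
import json
from typing import Optional


def split_records(extracted_content: Optional[str], required: set) -> tuple:
    """Split extracted JSON into (usable, rejected) lists of records."""
    if not extracted_content:
        return [], []
    data = json.loads(extracted_content)
    # Tolerate a single object as well as a list of objects.
    items = data if isinstance(data, list) else [data]
    usable, rejected = [], []
    for item in items:
        ok = (
            isinstance(item, dict)
            and not item.get("error")      # assumed failure marker
            and required.issubset(item)    # all required fields present
        )
        (usable if ok else rejected).append(item)
    return usable, rejected
```

A call such as `split_records(result.extracted_content, {"name", "price"})` lets the rejected list be logged rather than aborting the whole run.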
import asyncio
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Product listings",
    "baseSelector": ".product-card",  # Repeating element
    "fields": [
        {"name": "title", "selector": "h3.product-title", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
    ],
}

extraction = JsonCssExtractionStrategy(schema)


async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://shop.example.com/all",
            config=CrawlerRunConfig(
                extraction_strategy=extraction,
                js_code="window.scrollTo(0, document.body.scrollHeight);",  # Scroll to load more
                wait_for=".product-card:nth-child(20)",  # Wait until a 20th card exists
            ),
        )

    # One dict per .product-card, keyed by the field names in the schema
    products = json.loads(result.extracted_content)
    print(f"Extracted {len(products)} products")


asyncio.run(main())
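Because the `price` field is declared as `"type": "text"`, each value comes back as a display string rather than a number. The following is a minimal sketch of normalizing it, assuming US-style formatting (comma thousands separator, dot decimal). `parse_price` is a hypothetical helper, not part of Crawl4AI.

```python
import re
from decimal import Decimal

# First run of digits, optionally with thousands commas and a decimal part.
_PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text):
    """Parse a string such as "$1,299.00" into a Decimal, or None if no number is found."""
    match = _PRICE_RE.search(text or "")
    if match is None:
        return None
    # Decimal avoids the rounding surprises of float for currency values.
    return Decimal(match.group().replace(",", ""))
```

Applied to the extracted records, this could look like `for p in products: p["price"] = parse_price(p.get("price"))`.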
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig


async def main():
    async with AsyncWebCrawler() as crawler:
        # Session-based: maintains cookies, auth state
        session_id = "docs-crawl"

        # Login first (placeholder credentials; load real ones from the environment)
        await crawler.arun(
            url="https://docs.example.com/login",
            config=CrawlerRunConfig(
                session_id=session_id,
                js_code="""
                    document.querySelector('#email').value = '[email protected]';
                    document.querySelector('#password').value = 'pass';
                    document.querySelector('form').submit();
                """,
                wait_for="#dashboard",  # Proceed only once the post-login page renders
            ),
        )

        # Crawl authenticated pages
        urls = ["https://docs.example.com/api", "https://docs.example.com/guides"]
        for url in urls:
            result = await crawler.arun(
                url=url,
                config=CrawlerRunConfig(session_id=session_id),
            )
            # Save markdown for RAG indexing.
            # save_to_knowledge_base is your own storage function, not part of Crawl4AI.
            save_to_knowledge_base(url, result.markdown)


asyncio.run(main())
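The session example calls `save_to_knowledge_base` without defining it. One possible shape for that function is sketched below, assuming a plain directory of markdown files is an acceptable store. The `knowledge_base` directory name and the file naming scheme are illustrative choices, not anything Crawl4AI prescribes.

```python
import hashlib
import re
from pathlib import Path
from urllib.parse import urlparse

KB_DIR = Path("knowledge_base")  # hypothetical output directory


def save_to_knowledge_base(url: str, markdown: str, root: Path = KB_DIR) -> Path:
    """Write one page's markdown to <root>/<slug>-<hash>.md and return the path."""
    parsed = urlparse(url)
    # Readable slug from host and path, e.g. "docs.example.com-api".
    slug = re.sub(r"[^A-Za-z0-9.]+", "-", f"{parsed.netloc}{parsed.path}").strip("-")
    # A short hash of the full URL keeps URLs that differ only by query
    # string from overwriting each other.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{slug}-{digest}.md"
    # Record the source URL at the top so indexed chunks can cite their origin.
    path.write_text(f"<!-- source: {url} -->\n\n{markdown}", encoding="utf-8")
    return path
```

A vector store or database writer could replace the file write while keeping the same `(url, markdown)` signature.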
pip install crawl4ai
crawl4ai-setup # Install Playwright browsers
- Use result.markdown for LLM/RAG input: clean, structured, no HTML noise.
- Use JsonCssExtractionStrategy for structured pages: no LLM cost, fast, deterministic.
- Use LLMExtractionStrategy for unstructured pages; a Pydantic schema defines the expected output shape and lets you validate each record.
- Reuse a session_id for multi-page crawls to keep cookies and auth state across requests.
- Set CacheMode.ENABLED during development to avoid re-crawling the same pages.
- Use wait_for with a CSS selector to ensure content is loaded before extraction.