chicagopeabodydev-sudo/crawl4ai-skill

Name: crawl4ai-skill
Author: chicagopeabodydev-sudo

.cursor/skills/crawl4ai-skill/SKILL.md

npx skillsauth add chicagopeabodydev-sudo/library_bot_poc crawl4ai-skill

When to use this skill

Use this skill to crawl a website, scrape its contents, and convert that content into clean markdown. It requires a starting URL. Depending on max_depth, max_pages, and the chosen crawl strategy, Crawl4AI can either crawl a single page or continue to additional pages beyond the start URL.

General Steps

1. Import the needed Crawl4AI modules.
2. Define a CrawlerRunConfig with the crawl strategy and crawl settings you need.
3. Create an AsyncWebCrawler, passing browser-level configuration if needed.
4. Call AsyncWebCrawler.arun(...) to start the crawl.
5. Handle the returned result. A single-page crawl returns one CrawlResult; a deep crawl returns a list of CrawlResult objects (in the default non-streaming mode).
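
The steps above can be sketched as follows. This is a minimal sketch, assuming Crawl4AI is installed and that a default CrawlerRunConfig performs a single-page crawl; `crawl` and `as_result_list` are hypothetical helper names, not part of the library. The Crawl4AI import sits inside the function so the normalizing helper can be used on its own.

```python
def as_result_list(outcome):
    """Normalize arun() output: a single result or a list of results becomes a list."""
    if isinstance(outcome, (list, tuple)):
        return list(outcome)
    return [outcome]


async def crawl(url: str):
    # Step 1: import the Crawl4AI modules (lazily, so the helper above works without them).
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

    # Step 2: define the run configuration (defaults assumed to mean a single-page crawl).
    config = CrawlerRunConfig()

    # Steps 3 and 4: create the crawler and start the crawl.
    async with AsyncWebCrawler() as crawler:
        outcome = await crawler.arun(url=url, config=config)

    # Step 5: handle one result or many in the same way.
    return as_result_list(outcome)
```

Calling `asyncio.run(crawl("https://example.com"))` would run it; that requires network access and a working Crawl4AI browser setup.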

Crawling Strategies

BFSDeepCrawlStrategy (Breadth-First Search) explores all links at one depth before moving deeper.
DFSDeepCrawlStrategy (Depth-First Search) explores as far down one branch as possible before backtracking.
BestFirstCrawlingStrategy scores pages to prioritize the most relevant pages first. This strategy requires a scorer configuration so pages can be ranked before crawling.
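
To make the difference in visiting order concrete, here is a small standard-library sketch over a made-up link graph. It is not Crawl4AI's implementation: `LINKS`, the three functions, and the scoring function are all hypothetical and only illustrate breadth-first, depth-first, and best-first ordering.

```python
import heapq
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api", "/docs/cli"],
    "/blog": ["/blog/post-1"],
    "/docs/api": [],
    "/docs/cli": [],
    "/blog/post-1": [],
}


def bfs_order(start, max_depth):
    """Visit every page at one depth before going deeper."""
    seen, order = {start}, []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth < max_depth:
            for link in LINKS[page]:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return order


def dfs_order(start, max_depth):
    """Follow one branch as far as allowed before backtracking."""
    seen, order = set(), []
    stack = [(start, 0)]
    while stack:
        page, depth = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        if depth < max_depth:
            # Reversed so the first link on a page is explored first.
            for link in reversed(LINKS[page]):
                if link not in seen:
                    stack.append((link, depth + 1))
    return order


def best_first_order(start, max_depth, score):
    """Always visit the highest-scoring discovered page next."""
    seen, order = {start}, []
    heap = [(-score(start), start, 0)]  # negate: heapq pops the smallest item
    while heap:
        _, page, depth = heapq.heappop(heap)
        order.append(page)
        if depth < max_depth:
            for link in LINKS[page]:
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(heap, (-score(link), link, depth + 1))
    return order


def prefers_blog(url):
    """Hypothetical scorer: blog URLs are more relevant than anything else."""
    return 1.0 if "blog" in url else 0.2


print(bfs_order("/", 2))
print(dfs_order("/", 2))
print(best_first_order("/", 2, prefers_blog))
```

Breadth-first reaches `/blog` before any `/docs/...` page, depth-first finishes the `/docs` branch first, and best-first jumps to the blog pages because the scorer ranks them highest.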

Key Crawler Settings

max_depth = number of levels deep to crawl from the starting page
include_external = whether to follow links outside the starting domain (False stays within the same domain)
max_pages = maximum number of pages to crawl
score_threshold = minimum score a URL must have to be crawled when using a scoring-based strategy
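
One way to picture how these four settings interact is as a gate applied to each discovered link. The sketch below is an illustration under stated assumptions, not the library's logic: `should_follow` and its arguments are hypothetical, and the same-domain check is a simple hostname comparison.

```python
from urllib.parse import urljoin, urlparse


def should_follow(link, *, start_url, depth, pages_crawled,
                  max_depth=2, max_pages=30, include_external=False,
                  score=None, score_threshold=None):
    """Return True if a discovered link passes every configured limit.

    depth is the depth the link would be crawled at (the start page is depth 0).
    """
    if depth > max_depth:
        return False  # too many levels away from the starting page
    if pages_crawled >= max_pages:
        return False  # page budget already used up
    target = urljoin(start_url, link)  # resolve relative links
    same_domain = urlparse(target).netloc == urlparse(start_url).netloc
    if not include_external and not same_domain:
        return False  # external link while restricted to the start domain
    if score_threshold is not None and score is not None and score < score_threshold:
        return False  # scored below the minimum relevance
    return True


start = "https://example.com/"
print(should_follow("/docs", start_url=start, depth=1, pages_crawled=0))
print(should_follow("https://other.org/", start_url=start, depth=1, pages_crawled=0))
```

The first call passes every check; the second is rejected because the link leaves the starting domain while include_external is False.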

Example: configuring a deep crawl strategy

from crawl4ai.deep_crawling import DFSDeepCrawlStrategy

strategy = DFSDeepCrawlStrategy(
    max_depth=2,               # Crawl initial page + 2 levels deep
    include_external=False,    # Stay within the same domain
    max_pages=30,              # Maximum number of pages to crawl (optional)
    score_threshold=0.5,       # Minimum score for URLs to be crawled (optional)
)
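
A strategy like the one above only takes effect once it is attached to a run. The following is a hedged sketch of that wiring, assuming the `deep_crawl_strategy` and `stream` parameters of CrawlerRunConfig and the `depth` metadata key described in the Crawl4AI deep crawling documentation; import paths may differ between versions. `group_by_depth`, `deep_crawl`, and `best_first_strategy` are hypothetical helper names.

```python
from collections import defaultdict


def group_by_depth(results):
    """Group crawled URLs by the depth recorded in each result's metadata."""
    grouped = defaultdict(list)
    for result in results:
        depth = (result.metadata or {}).get("depth", 0)
        grouped[depth].append(result.url)
    return dict(grouped)


def best_first_strategy(keywords):
    """Build a best-first strategy; it needs a scorer to rank URLs."""
    # Lazy imports: only required when the strategy is actually built.
    from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
    from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

    scorer = KeywordRelevanceScorer(keywords=keywords, weight=0.7)
    return BestFirstCrawlingStrategy(
        max_depth=2, include_external=False, url_scorer=scorer, max_pages=30
    )


async def deep_crawl(start_url: str):
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
    from crawl4ai.deep_crawling import DFSDeepCrawlStrategy

    strategy = DFSDeepCrawlStrategy(max_depth=2, include_external=False, max_pages=30)
    # stream=False: arun() returns the full list once the crawl has finished.
    config = CrawlerRunConfig(deep_crawl_strategy=strategy, stream=False)
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun(url=start_url, config=config)
    return group_by_depth(results)
```

Swapping `strategy` for `best_first_strategy(["pricing", "docs"])` would switch the same run to score-based ordering.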

Output (CrawlResult object)

When you call arun() on a page, Crawl4AI returns a CrawlResult object containing the crawl output.
CrawlResult properties include raw HTML, cleaned HTML, optional screenshots or PDFs, structured extraction results, and more.
The markdown property may contain a MarkdownGenerationResult, which gives access to variants such as raw_markdown and fit_markdown.
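
Because markdown may be a plain string or a MarkdownGenerationResult, code that consumes a result usually has to handle both. The helper below is a minimal sketch of one way to do that; `pick_markdown` is a hypothetical name, and the stand-in object only mimics the fields listed in this document.

```python
from types import SimpleNamespace


def pick_markdown(result, prefer_fit=True):
    """Return a markdown string from a CrawlResult-like object."""
    if not result.success:
        raise RuntimeError(result.error_message or "crawl failed")
    md = result.markdown
    if md is None:
        return ""
    # fit_markdown is optional, so fall back to raw_markdown when it is empty.
    fit = getattr(md, "fit_markdown", None)
    if prefer_fit and fit:
        return fit
    raw = getattr(md, "raw_markdown", None)
    # A plain string has neither attribute; use it as-is.
    return raw if raw is not None else str(md)


# Stand-in object for illustration; a real CrawlResult comes from arun().
demo = SimpleNamespace(
    success=True,
    error_message=None,
    markdown=SimpleNamespace(raw_markdown="# Raw", fit_markdown="# Fit"),
)
print(pick_markdown(demo))  # prints "# Fit"
```

Checking `success` first keeps a failed crawl from being mistaken for an empty page.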

CrawlResult classes and properties

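# Pydantic BaseModel syntax.
# SSLCertificate and DispatchResult are Crawl4AI types that are not shown here.
from typing import Any, Dict, List, Optional, Union
from pydantic import BaseModel, Field
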
class MarkdownGenerationResult(BaseModel):
    raw_markdown: str
    markdown_with_citations: str
    references_markdown: str
    fit_markdown: Optional[str] = None
    fit_html: Optional[str] = None

class CrawlResult(BaseModel):
    url: str
    html: str
    fit_html: Optional[str] = None
    success: bool
    cleaned_html: Optional[str] = None
    media: Dict[str, List[Dict]] = {}
    links: Dict[str, List[Dict]] = {}
    downloaded_files: Optional[List[str]] = None
    js_execution_result: Optional[Dict[str, Any]] = None
    screenshot: Optional[str] = None
    pdf: Optional[bytes] = None
    mhtml: Optional[str] = None
    markdown: Optional[Union[str, MarkdownGenerationResult]] = None
    extracted_content: Optional[str] = None
    metadata: Optional[dict] = None
    error_message: Optional[str] = None
    session_id: Optional[str] = None
    response_headers: Optional[dict] = None
    status_code: Optional[int] = None
    ssl_certificate: Optional[SSLCertificate] = None
    dispatch_result: Optional[DispatchResult] = None
    redirected_url: Optional[str] = None
    redirected_status_code: Optional[int] = None
    network_requests: Optional[List[Dict[str, Any]]] = None
    console_messages: Optional[List[Dict[str, Any]]] = None
    tables: List[Dict] = Field(default_factory=list)

LLM Extraction

Use LLM extraction when you want Crawl4AI to structure results according to a supplied schema, usually a Pydantic model.

Steps

1. Chunking (optional): The HTML or markdown is split into smaller segments if it is very long.
2. Prompt construction: For each chunk, Crawl4AI builds a prompt that includes your instruction and optional schema.
3. LLM inference: Each chunk is sent to the model, either in parallel or sequentially.
4. Combining: The results from each chunk are merged and parsed into JSON.
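
The four steps above can be mimicked with the standard library to show the data flow. This is a simplified sketch, not Crawl4AI's implementation: it chunks by words rather than tokens, runs chunks sequentially, and uses `fake_llm`, a hypothetical stand-in that needs no API key. All function names here are assumptions for illustration.

```python
import json


def chunk_words(text, chunk_size, overlap_rate=0.0):
    """Step 1: split text into word-based chunks that overlap by overlap_rate."""
    words = text.split()
    step = max(1, int(chunk_size * (1 - overlap_rate)))
    chunks, i = [], 0
    while i < len(words):
        chunks.append(" ".join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
        i += step
    return chunks


def build_prompt(instruction, chunk, schema=None):
    """Step 2: combine the instruction, optional schema, and chunk."""
    parts = [instruction]
    if schema is not None:
        parts.append("SCHEMA: " + json.dumps(schema))
    parts.append("CONTENT: " + chunk)
    return "\n".join(parts)


def extract(text, instruction, llm, chunk_size=1000, overlap_rate=0.0, schema=None):
    """Steps 3 and 4: call the model per chunk, then merge the parsed JSON lists."""
    merged = []
    for chunk in chunk_words(text, chunk_size, overlap_rate):
        reply = llm(build_prompt(instruction, chunk, schema))
        merged.extend(json.loads(reply))
    return merged


def fake_llm(prompt):
    """Hypothetical model: 'extracts' every upper-case word in the content as a name."""
    content = prompt.split("CONTENT: ", 1)[1]
    return json.dumps([{"name": word} for word in content.split() if word.isupper()])


items = extract("alpha BETA gamma DELTA epsilon ZETA", "list the names", fake_llm, chunk_size=2)
print(items)
```

With a chunk size of two words the text is split into three chunks, and the merged output contains one entry from each.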

Extraction types:

"schema": The model tries to return JSON conforming to your Pydantic-based schema. You provide schema=YourPydanticModel.model_json_schema().
"block": The model returns freeform text or smaller JSON structures, which the library collects.

import os

from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai import LLMExtractionStrategy

# step 1 - define a Pydantic model
class Product(BaseModel):
    name: str
    price: str

# step 2 - assign the Pydantic model to "schema" of LLMExtractionStrategy
async def main():
    llm_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv('OPENAI_API_KEY')),
        schema=Product.model_json_schema(),  # JSON Schema dict generated from the Pydantic model
        extraction_type="schema",
        instruction="Extract all product objects with 'name' and 'price' from the content.",
        chunk_token_threshold=1000,
        overlap_rate=0.0,
        apply_chunking=True,
        input_format="markdown",   # or "html", "fit_markdown"
        extra_args={"temperature": 0.0, "max_tokens": 800}
    )

    # remaining crawler code here...
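
The example above stops before the crawl itself. The sketch below shows one possible way to wire an LLMExtractionStrategy into a run and read the output; it is not the author's omitted code. It assumes the `extraction_strategy` parameter of CrawlerRunConfig, that `extracted_content` holds a JSON array string, and that failed chunks are flagged with a truthy `error` key, which may vary by version. `parse_products` and `extract_products` are hypothetical names.

```python
import json
import os


def parse_products(extracted_content):
    """Decode the JSON string in extracted_content and drop entries flagged as errors."""
    if not extracted_content:
        return []
    items = json.loads(extracted_content)
    return [item for item in items if not item.get("error")]


async def extract_products(url: str):
    # Lazy imports: only needed when the crawl actually runs.
    from pydantic import BaseModel
    from crawl4ai import (AsyncWebCrawler, CacheMode, CrawlerRunConfig,
                          LLMConfig, LLMExtractionStrategy)

    class Product(BaseModel):
        name: str
        price: str

    llm_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(provider="openai/gpt-4o-mini",
                             api_token=os.getenv("OPENAI_API_KEY")),
        schema=Product.model_json_schema(),
        extraction_type="schema",
        instruction="Extract all product objects with 'name' and 'price' from the content.",
    )
    # Attach the strategy to the run; bypass the cache so the page is fetched fresh.
    config = CrawlerRunConfig(extraction_strategy=llm_strategy,
                              cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
    if not result.success:
        raise RuntimeError(result.error_message)
    return parse_products(result.extracted_content)
```

Running `asyncio.run(extract_products(...))` needs an OPENAI_API_KEY in the environment and makes paid model calls, so it is worth testing `parse_products` on a saved string first.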

Additional Resources

For usage examples, see examples.md
Crawl4AI documentation
Deep crawling documentation
Simple crawling
Browser, crawler, and LLM configuration
Markdown generation
Command line interface
LLM extraction

Crawls websites and extracts (scrapes) web page content including text, tables, lists, and images into clean LLM-friendly markdown. Use when crawling websites, scraping web content, extracting web page data for RAG/LLMs, or when the user mentions crawl4ai, web scraping, or deep crawling.

development

Updated Apr 17, 2026

npx skillsauth add chicagopeabodydev-sudo/library_bot_poc crawl4ai-skill

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each entry below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 24, 2026, 8:59 PM (2.5 s, 1 file scanned)

Install manually

# Clone the repo
git clone https://github.com/chicagopeabodydev-sudo/library_bot_poc.git

# Copy into Claude Code skills folder (global)
cp -r library_bot_poc/.cursor/skills/crawl4ai-skill ~/.claude/skills/

Claude Code Skills: official skills path docs.

Repository

chicagopeabodydev-sudo/library_bot_poc

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT