skills/ck991357/crawl4ai/SKILL.md
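A powerful open-source web scraping and data processing tool. It supports 6 working modes, including screenshots, PDF export, and intelligent crawling.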
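Crawl4AI is a powerful open-source web scraping and data processing tool that supports 6 different working modes. All binary output (screenshots, PDFs) is returned base64-encoded so the model can process it easily.
Every call to crawl4ai must strictly follow the nested structure below. This is a general rule that applies to all modes.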
{
"mode": "<模式名称>",
"parameters": {
"param1": "value1",
"param2": "value2"
// ...all parameters for the specific mode go here
}
}
// This call is guaranteed to fail!
{
"mode": "scrape",
"url": "https://example.com" // Wrong! 'url' must be inside 'parameters'
}
{
"mode": "scrape",
"parameters": {
"url": "https://example.com"
}
}
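Important update in version 1.2: a new smart tiered scraping system adapts automatically to different types of websites:
| Configuration tier | Suitable for | Characteristics | Performance |
|---------|---------|------|----------|
| Standard | Ordinary static websites | High performance, fast scraping | ⚡ Fast (90-second timeout) |
| Enhanced | JS-heavy and anti-scraping websites | Stronger anti-bot countermeasures, full rendering | 🛡️ Robust (120-second timeout) |
| Fallback | Extremely complex websites | Maximum compatibility | 🐢 Conservative (180-second timeout) |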
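The source does not say how a tier is chosen for a given site. One possible reading of the table is escalation on failure, sketched below in Python. The tier names and timeouts come from the table; the `fetch` callable and the try-in-order logic are assumptions made purely for illustration, not the tool's documented behaviour.

```python
from typing import Callable, Optional, Tuple

# Tier names and timeouts (seconds) are taken from the table above.
TIERS = [
    ("standard", 90),
    ("enhanced", 120),
    ("fallback", 180),
]


def fetch_with_tiers(
    url: str,
    fetch: Callable[[str, str, int], Optional[str]],
) -> Tuple[str, str]:
    """Try each tier in order and return (tier_name, content) for the first success.

    `fetch` is a hypothetical callable (url, tier_name, timeout_seconds) that
    returns the page content, or None when that tier fails.
    """
    for name, timeout in TIERS:
        content = fetch(url, name, timeout)
        if content is not None:
            return name, content
    raise RuntimeError(f"all tiers failed for {url}")
```

With a stub `fetch` that fails on the standard tier, the function moves on to the enhanced tier and stops there, so the slower 180-second tier is only reached when both earlier tiers fail.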
Smart detection capabilities:
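| Mode | Description | Main use | Complexity | Recommended scenario |
|------|----------|----------|---------|----------|
| scrape | Scrape a single web page | Get page content, screenshots, PDF | ⭐⭐ | Fetching single-page content |
| deep_crawl | Intelligent deep crawl | Crawl a site in depth using a strategy | ⭐⭐⭐⭐ | Exploring site content |
| batch_crawl | Batch URL processing | Process multiple URLs at once | ⭐⭐ | Bulk data collection |
| extract | Structured data extraction | Extract data via CSS or LLM | ⭐⭐⭐ | Extracting specific data |
| pdf_export | PDF export | Export a web page as PDF | ⭐ | Saving documents |
| screenshot | Screenshot capture | Capture a screenshot of a web page | ⭐ | Preserving visual evidence |
Note: the crawl mode has been removed in the latest version; use scrape or deep_crawl instead.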
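The nesting rule and the mode list above can be enforced on the client before a call is sent. The Python sketch below is one way to do that; `build_request` and `SUPPORTED_MODES` are hypothetical names introduced here, not part of the tool.

```python
# Mode names copied from the table above.
SUPPORTED_MODES = {
    "scrape", "deep_crawl", "batch_crawl",
    "extract", "pdf_export", "screenshot",
}


def build_request(mode: str, **parameters) -> dict:
    """Return a payload with every argument nested under "parameters"."""
    if mode == "crawl":
        # The document states that this mode was removed.
        raise ValueError('mode "crawl" was removed; use "scrape" or "deep_crawl"')
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    return {"mode": mode, "parameters": parameters}


request = build_request("scrape", url="https://example.com", format="markdown")
```

Because the keyword arguments are collected into `parameters`, a caller cannot accidentally place `url` at the top level of the payload.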
{
"mode": "scrape",
"parameters": {
"url": "https://example.com/article",
"format": "markdown",
"word_count_threshold": 10,
"include_links": true,
"include_images": true
}
}
{
"mode": "batch_crawl",
"parameters": {
"urls": [
"https://example.com/product1",
"https://example.com/product2",
"https://example.com/product3"
],
"concurrent_limit": 4
}
}
{
"mode": "deep_crawl",
"parameters": {
"url": "https://example.com/docs",
"max_depth": 3,
"max_pages": 80,
"keywords": ["tutorial", "guide", "API"],
"strategy": "best_first"
}
}
{
"mode": "extract",
"parameters": {
"url": "https://news.example.com/article",
"schema_definition": {
"name": "Article",
"baseSelector": ".article-content",
"fields": [
{
"name": "title",
"selector": "h1",
"type": "text"
},
{
"name": "author",
"selector": ".author",
"type": "text"
},
{
"name": "content",
"selector": ".content",
"type": "text"
}
]
},
"extraction_type": "css"
}
}
{
"mode": "scrape",
"parameters": {
"url": "https://example.com",
"return_screenshot": true,
"return_pdf": true,
"screenshot_quality": 90,
"screenshot_max_width": 1200
}
}
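Since screenshots and PDFs come back base64-encoded, the caller has to decode them before saving or inspecting them. The Python sketch below shows the decoding step. The response keys `screenshot` and `pdf` are assumed names: the source does not document the shape of the response.

```python
import base64


def decode_binary_fields(result: dict, keys=("screenshot", "pdf")) -> dict:
    """Return {key: raw_bytes} for every listed key that holds base64 text.

    The key names are assumptions about the response; pass your own
    `keys` if the actual fields are named differently.
    """
    return {
        key: base64.b64decode(result[key])
        for key in keys
        if result.get(key)
    }


# Stand-in response built locally so the example runs without the tool.
fake_result = {
    "content": "# Example",
    "screenshot": base64.b64encode(b"image-bytes").decode("ascii"),
    "pdf": None,
}
decoded = decode_binary_fields(fake_result)
```

The decoded bytes can then be written to disk with `pathlib.Path(...).write_bytes(...)`, using whatever file extension matches the format the tool actually returns.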
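scrape mode
Smart tiered scraping: the new version of the tool automatically selects the best configuration for the type of website.
✅ Correct example: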
{
"mode": "scrape",
"parameters": {
"url": "https://example.com",
"format": "markdown",
"css_selector": ".article-content",
"include_links": true,
"include_images": true,
"return_screenshot": true,
"return_pdf": false,
"screenshot_quality": 80,
"screenshot_max_width": 1200,
"word_count_threshold": 10,
"exclude_external_links": true
}
}
❌ Wrong example (parameters not nested):
{
"mode": "scrape",
"url": "https://example.com", // Wrong! Missing the parameters wrapper
"format": "markdown"
}
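How smart tiered scraping works:
Parameters:
- url (required): URL of the page to scrape; must start with http:// or https://
- format: output format, markdown (default) / html / text
- css_selector: CSS selector for extracting specific content
- include_links: whether to include links in the output, default true
- include_images: whether to include images in the output, default true
- return_screenshot: whether to return a screenshot (base64), default false
- return_pdf: whether to return a PDF (base64), default false
- screenshot_quality: screenshot quality (10-100), default 70
- screenshot_max_width: maximum screenshot width, default 1920
- word_count_threshold: minimum word count per content block, default 10
- exclude_external_links: whether to exclude external links, default true

deep_crawl mode
Crawls an entire site in depth using an intelligent strategy, with support for keyword scoring and URL filtering.
✅ Correct example: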
{
"mode": "deep_crawl",
"parameters": {
"url": "https://example.com",
"max_depth": 3,
"max_pages": 80,
"strategy": "best_first",
"include_external": false,
"keywords": ["product", "price", "specs"],
"url_patterns": ["/products/", "/docs/"],
"stream": false
}
}
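To make the roles of `keywords`, `url_patterns`, and `max_pages` in a `best_first` crawl more concrete, here is an illustrative Python sketch of keyword-scored link ordering. The source does not describe the tool's actual scoring, so the substring-count score, the substring pattern match, and all function names are assumptions.

```python
import heapq
from typing import Iterable, List, Tuple


def keyword_score(text: str, keywords: Iterable[str]) -> int:
    """Count how many keywords occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for kw in keywords if kw.lower() in lowered)


def matches_patterns(url: str, patterns: Iterable[str]) -> bool:
    """Treat each pattern as a plain substring; an empty list allows every URL."""
    patterns = list(patterns)
    return not patterns or any(p in url for p in patterns)


def order_best_first(
    links: Iterable[Tuple[str, str]],
    keywords: List[str],
    url_patterns: List[str],
    max_pages: int,
) -> List[str]:
    """Return up to max_pages URLs from (url, anchor_text) pairs, best score first."""
    heap = []
    for index, (url, anchor_text) in enumerate(links):
        if not matches_patterns(url, url_patterns):
            continue
        score = keyword_score(url + " " + anchor_text, keywords)
        # heapq is a min-heap, so negate the score; index keeps ties in input order.
        heapq.heappush(heap, (-score, index, url))
    ordered = []
    while heap and len(ordered) < max_pages:
        ordered.append(heapq.heappop(heap)[2])
    return ordered
```

In this sketch a link outside every `url_patterns` entry is never queued, and `max_pages` caps how many of the remaining links are visited, which is why narrowing the patterns is usually cheaper than raising the page limit.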
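Enhanced features:
Parameters:
- url (required): starting URL
- max_depth: maximum crawl depth, default 3
- max_pages: maximum number of pages, default 80
- strategy: crawl strategy, bfs (default) / dfs / best_first
- include_external: whether to follow external links, default false
- keywords: list of keywords used for relevance scoring
- url_patterns: list of URL patterns used for filtering
- stream: whether to stream results, default false

batch_crawl mode
Processes multiple URLs at once; suited to bulk data collection.
✅ Correct example: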
{
"mode": "batch_crawl",
"parameters": {
"urls": [
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3"
],
"stream": false,
"concurrent_limit": 4
}
}
❌ Wrong example (urls is not an array):
{
"mode": "batch_crawl",
"parameters": {
"urls": "https://example.com/page1" // Wrong! urls must be an array
}
}
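The array requirement shown above is easy to guard against in client code. The Python sketch below wraps a lone string in a list and removes duplicates before building the payload. Both helper names are hypothetical; the default `concurrent_limit` of 4 and `stream` of false follow the values stated in this document.

```python
from typing import List, Union


def normalize_urls(urls: Union[str, List[str]]) -> List[str]:
    """Wrap a lone URL string in a list and drop duplicates, keeping order."""
    if isinstance(urls, str):
        urls = [urls]
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


def batch_request(urls, concurrent_limit: int = 4, stream: bool = False) -> dict:
    """Build a batch_crawl payload whose "urls" value is always a list."""
    return {
        "mode": "batch_crawl",
        "parameters": {
            "urls": normalize_urls(urls),
            "stream": stream,
            "concurrent_limit": concurrent_limit,
        },
    }
```

Calling `batch_request("https://example.com/page1")` therefore produces a one-element array rather than the bare string shown in the wrong example.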
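Safety limits:
Parameters:
- urls (required): list of URLs; must be an array
- stream: whether to stream results, default false
- concurrent_limit: maximum concurrency, default 4 (adjusted after the server memory upgrade, with a hard upper limit enforced)

extract mode
Extracts structured data from web pages, relying mainly on precise CSS selectors.
✅ Correct example (CSS extraction):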
{
"mode": "extract",
"parameters": {
"url": "https://news.example.com/article",
"schema_definition": {
"name": "Article",
"baseSelector": ".article-content",
"fields": [
{
"name": "title",
"selector": "h1",
"type": "text"
},
{
"name": "author",
"selector": ".author",
"type": "text"
},
{
"name": "publish_date",
"selector": ".date",
"type": "text"
},
{
"name": "content",
"selector": ".content",
"type": "text"
}
]
},
"css_selector": ".article-content",
"extraction_type": "css"
}
}
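⚠️ Key limitations and best practices:
- The extraction_type: "llm" mode of crawl4ai does not yet have a working LLM instance deployed.
- When crawl4ai extraction applies: only for simple pages with a fixed structure for which you can supply precise CSS selectors. It cannot intelligently extract data from complex pages (such as tables and lists) given only a natural-language description and a JSON Schema, that is, without CSS selectors.

🛡️ Automatic repair: the tool automatically fixes common schema format problems:
- Missing baseSelector: set automatically to css_selector or 'body'
- Missing fields: a default field configuration is created automatically
- Missing name: set automatically to "ExtractedData"

💡 Best practice: although the tool repairs schemas automatically, supplying a complete schema yields more precise extraction results.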
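The automatic repair rules described above can be pictured with a short Python sketch. It mirrors the three stated defaults. The function name is hypothetical, and the content of the default field is a placeholder, since the source only says that a default field configuration is created without showing it.

```python
import copy
from typing import Optional


def repair_schema(schema: Optional[dict], css_selector: Optional[str] = None) -> dict:
    """Return a copy of the schema with the three documented defaults filled in."""
    fixed = copy.deepcopy(schema) if schema else {}
    if not fixed.get("name"):
        fixed["name"] = "ExtractedData"
    if not fixed.get("baseSelector"):
        fixed["baseSelector"] = css_selector or "body"
    if not fixed.get("fields"):
        # Hypothetical placeholder: the real default field configuration
        # is not documented in the source.
        fixed["fields"] = [{"name": "content", "selector": "*", "type": "text"}]
    return fixed
```

A schema that already carries `name`, `baseSelector`, and `fields` passes through unchanged, which matches the advice to supply a complete schema rather than rely on the defaults.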
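Parameters:
- url (required): URL of the page to extract from
- schema_definition (required): JSON schema that defines the output structure
- css_selector: base CSS selector (used for CSS extraction)
- extraction_type: extraction type, css (default) / llm
- prompt: prompt for LLM extraction
{
"name": "YourSchemaName", // required: schema name
"baseSelector": "css-selector", // required: base CSS selector
"fields": [ // required: array of field definitions
{
"name": "field_name", // required: field name
"selector": "css-selector", // required: field selector
"type": "text", // required: field type
"multiple": true // optional: whether multiple values are allowed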
}
]
}
{
"type": "object",
"properties": {
"field1": {"type": "string"},
"field2": {"type": "array", "items": {"type": "string"}}
}
}
pdf_export mode
Exports a web page as a PDF.
✅ Correct example:
{
"mode": "pdf_export",
"parameters": {
"url": "https://example.com/document",
"return_as_base64": true
}
}
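Parameters:
- url (required): URL of the page to export as PDF
- return_as_base64: whether to return base64-encoded output, default true

screenshot mode
Captures a screenshot of a web page, with support for quality compression and resizing.
✅ Correct example: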
{
"mode": "screenshot",
"parameters": {
"url": "https://example.com",
"full_page": true,
"return_as_base64": true,
"quality": 80,
"max_width": 1200,
"max_height": 3000
}
}
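Parameters:
- url (required): URL of the page to capture
- full_page: whether to capture the full page, default true
- return_as_base64: whether to return base64-encoded output, default true
- quality: screenshot quality (10-100), default 70
- max_width: maximum width, default 1920
- max_height: maximum height, default 5000

Anti-scraping features added in version 1.2: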
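- Browser launched with --disable-blink-features=AutomationControlled

Goal: automatically collect and analyze news content
1. Use deep_crawl to discover links to relevant articles
2. Use batch_crawl to fetch the content in bulk
3. Use extract to pull out key information in structured form

Goal: systematically analyze competitor websites
1. Use screenshot to capture competitor pages
2. Use scrape to get detailed content
3. Use pdf_export to save evidence

Goal: build a complete product database
1. Use deep_crawl to discover all product pages
2. Use extract to extract product information

- Use stream: true and reduce the number of URLs per batch. The backend enforces a memory cap (6000 MB), and high concurrency may trigger a browser restart.
- Check word_count_threshold and css_selector.
- Check the max_height value and make sure full_page: true is set.
- Check whether each selector in the fields array accurately matches the page elements.
- Make sure schema_definition contains the complete name, baseSelector, and fields structure.
- Test a single page with scrape mode.
- Make sure all parameters are inside the parameters object.

- All parameters are inside the parameters object.
- URLs start with http:// or https://.
- Streaming (stream: true).
- Schemas include the name, baseSelector, and fields structure.

Error 1: missing nested parameters
// ❌ Wrong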
{
"mode": "scrape",
"url": "https://example.com"
}
// ✅ Correct
{
"mode": "scrape",
"parameters": {
"url": "https://example.com"
}
}
Error 2: URL is missing the protocol
// ❌ Wrong
{
"mode": "scrape",
"parameters": {
"url": "example.com"
}
}
// ✅ Correct
{
"mode": "scrape",
"parameters": {
"url": "https://example.com"
}
}
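The protocol rule above can be checked before a request is sent. This minimal Python sketch applies the rule exactly as stated (the URL must start with http:// or https://); the function name is hypothetical.

```python
def require_http_url(url: str) -> str:
    """Return the URL unchanged, or raise if it lacks an http(s) prefix."""
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"URL must start with http:// or https://: {url!r}")
    return url


checked = require_http_url("https://example.com")
```

The sketch deliberately raises instead of silently prepending a scheme, because guessing between http and https could send the crawler to a different site than the caller intended.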
Error 3: wrong parameter type
// ❌ Wrong - urls should be an array
{
"mode": "batch_crawl",
"parameters": {
"urls": "https://example.com"
}
}
// ✅ Correct
{
"mode": "batch_crawl",
"parameters": {
"urls": ["https://example.com"]
}
}
Error 4: wrong parameter name in extract mode
// ❌ Wrong - should use schema_definition
{
"mode": "extract",
"parameters": {
"url": "https://example.com",
"schema": { // Wrong! Should be schema_definition
"title": "string"
}
}
}
// ✅ Correct
{
"mode": "extract",
"parameters": {
"url": "https://example.com",
"schema_definition": {
"name": "Article",
"baseSelector": ".content",
"fields": [
{
"name": "title",
"selector": "h1",
"type": "text"
}
]
}
}
}
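A small pre-flight check can catch the misnamed parameter and an incomplete schema before the call is made. The Python sketch below reports problems as a list of messages. The function and constant names are hypothetical; the required keys are the ones this document marks as required.

```python
REQUIRED_SCHEMA_KEYS = ("name", "baseSelector", "fields")
REQUIRED_FIELD_KEYS = ("name", "selector", "type")


def check_extract_parameters(parameters: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if "schema" in parameters and "schema_definition" not in parameters:
        problems.append('rename "schema" to "schema_definition"')
    schema = parameters.get("schema_definition")
    if not isinstance(schema, dict):
        problems.append('"schema_definition" is missing or not an object')
        return problems
    for key in REQUIRED_SCHEMA_KEYS:
        if key not in schema:
            problems.append(f'schema_definition lacks "{key}"')
    for i, field in enumerate(schema.get("fields") or []):
        for key in REQUIRED_FIELD_KEYS:
            if key not in field:
                problems.append(f'fields[{i}] lacks "{key}"')
    return problems
```

Run against the wrong example above, the check reports both the misnamed key and the missing schema_definition; the corrected example produces an empty list.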
{
"mode": "scrape",
"parameters": {
"url": "https://example.com",
"include_links": true,
"include_images": true,
"return_screenshot": true,
"return_pdf": true,
"screenshot_quality": 90,
"screenshot_max_width": 1200,
"word_count_threshold": 15
}
}
{
"mode": "deep_crawl",
"parameters": {
"url": "https://docs.example.com",
"strategy": "best_first",
"keywords": ["API", "tutorial", "example"],
"max_depth": 3,
"max_pages": 80
}
}
{
"mode": "batch_crawl",
"parameters": {
"urls": [
"https://example.com/home",
"https://example.com/about",
"https://example.com/contact",
"https://example.com/products"
],
"concurrent_limit": 4
}
}
{
"mode": "extract",
"parameters": {
"url": "https://news.example.com/article",
"schema_definition": {
"name": "NewsArticle",
"baseSelector": ".article-container",
"fields": [
{
"name": "headline",
"selector": "h1.news-title",
"type": "text"
},
{
"name": "author",
"selector": ".author-name",
"type": "text"
},
{
"name": "publish_date",
"selector": ".publish-date",
"type": "text"
},
{
"name": "main_content",
"selector": ".article-body",
"type": "text"
},
{
"name": "tags",
"selector": ".tag",
"type": "text",
"multiple": true
}
]
},
"extraction_type": "css"
}
}
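- Put all parameters inside the parameters object.
- Use the schema_definition parameter name.
- Use include_links and include_images to control what the output contains.
- Use word_count_threshold to filter out low-quality content blocks.
- The crawl mode is no longer available; use scrape or deep_crawl instead.
- Removal of the crawl mode: simplifies mode selection and focuses on core functionality.
- Do not send mode: "crawl"; use one of the replacement modes.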