crwl-cli/skills/SKILL.md
Headless crawler for public web pages. Use to extract clean markdown, structured links, and batch crawl docs/articles.
Use crwl-cli for public docs, articles, blogs, product pages, and index pages
that need LLM-readable markdown or structured links. Do not use it for
logged-in/private pages, login flows, clicking/typing, uploads, downloads, or
browser automation.
Choose an approach before crawling:
| Situation | Approach |
| ----------------------------------------------------------------------------------- | -------------------------------------------- |
| Single page (article, docs, blog post) | crwl-cli fetch URL |
| Multiple pages linked from one page (product listings, search results, index pages) | JSON links pipeline |
| Public CMS homepage with notices, menus, sliders, or portal links | --format json --scan-full-page --block-ads |
| JS-rendered content missing | Add --wait-for or --scan-full-page |
| Ad/tracker noise | Add --block-ads |
| Basic bot blocking on public page | Add --stealth --user-agent-mode random |
Never manually copy URLs from markdown output. For link discovery, crawl
with --format json and extract .links with jq. Markdown links may be
truncated or malformed; .links contains structured hrefs.
```bash
# Single public page → filtered markdown
crwl-cli fetch https://docs.python.org/3/library/asyncio.html

# Limit noisy pages to main content
crwl-cli fetch https://docs.python.org/3/ --css "#content"

# Structured output for diagnostics and pipelines
crwl-cli fetch https://example.com --format json

# Raw markdown when filtered markdown misses content
crwl-cli fetch https://example.com --format raw

# JS-rendered content
crwl-cli fetch https://example.com --wait-for ".loaded"

# Quote URLs with query strings so the shell does not treat & as a command separator
crwl-cli fetch 'https://grad.example.edu/site/index.do?epTicket=LOG&lang=en' \
  --format json --scan-full-page --block-ads

# Fast text-only crawl
crwl-cli fetch https://example.com --text-mode
```
Use when a page links to multiple detail pages you need to read. Public CMS homepages often mix notices, menus, sliders, and portal/login links; extract structured links and follow only public content links.
```bash
# 1. Crawl listing/index page as JSON
crwl-cli fetch https://shop.example.com/products --format json > listing.json

# 2. Extract canonical detail URLs from .links, not .markdown
jq -r '.links.internal[] | select(.href | test("/products/")) | .href' listing.json > urls.txt

# 3. Batch crawl details
crwl-cli fetch --urls-file urls.txt --format json
```
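Between steps 2 and 3 it can help to tidy urls.txt, since a listing page often links to the same detail page several times and mixes in login or portal links. The sketch below is one possible cleanup pass, not part of crwl-cli: `clean_urls` is a hypothetical helper, the login path pattern is an assumption to adapt per site, and stripping `#fragment` suffixes assumes the site does not use hash-based routing.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of crwl-cli): normalize a URL list
# before passing it to --urls-file.
clean_urls() {
  # 1. Strip "#..." suffixes; this also blanks out "# comment" lines.
  # 2. Trim trailing whitespace.
  # 3. Keep only absolute http(s) URLs (drops blanks, mailto:, javascript:).
  # 4. Drop obvious login/portal paths (assumed pattern, adjust per site).
  # 5. Remove duplicates while preserving the original order.
  sed -e 's/#.*$//' -e 's/[[:space:]]*$//' "$1" \
    | grep -E '^https?://' \
    | grep -Eiv '/(login|logout|signin|sso)([/?]|$)' \
    | awk '!seen[$0]++'
}
```

Usage would look like `clean_urls urls.txt > urls.clean.txt`, followed by `crwl-cli fetch --urls-file urls.clean.txt --format json`.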
--format json output includes:
```json
{
  "url": "...",
  "success": true,
  "status_code": 200,
  "markdown": "...",
  "links": {
    "internal": [{ "href": "...", "text": "...", "title": "..." }],
    "external": [{ "href": "...", "text": "...", "title": "..." }]
  },
  "error": null
}
```
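The `success`, `status_code`, and `error` fields make it possible to gate on a result before trusting `.markdown`. The following is a minimal sketch, assuming a file that holds a single result object shaped like the sample above (the layout of batch output is not specified here, so this does not cover it). `usable_result` and `report_failure` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: inspect one saved result from --format json.
# Assumes the file contains a single object with the fields shown above.

usable_result() {
  # jq -e exits 0 only when the expression evaluates to true.
  jq -e '.success == true
         and .status_code == 200
         and ((.markdown // "") | length > 0)' "$1" > /dev/null
}

report_failure() {
  # Print "url<TAB>status_code<TAB>error" for logging.
  jq -r '[.url, (.status_code | tostring), (.error // "unknown error")] | @tsv' "$1"
}
```

A caller could then write `usable_result page.json || report_failure page.json >&2` and only read `.markdown` from results that pass the check.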
- URL: single public URL to crawl.
- --urls-file FILE: one URL per line. Empty lines and # comments ignored. Use for batch crawling URLs extracted from JSON links.
- --format md|raw|json
  - md (default): filtered markdown for LLM reading.
  - raw: unfiltered markdown for debugging missing content.
  - json: structured output for pipelines and diagnostics.
- --screenshot: capture a screenshot for rendering/debugging issues.
- --css SELECTOR: limit extraction to a CSS selector.
- --exclude-tags TAGS: comma-separated tags to exclude. Default: nav,footer,script,style.
- --wait-for SELECTOR: wait for a CSS selector before extraction.
- --scan-full-page: scroll through the full page before extraction; use for lazy-loaded public content.
- --text-mode: disable images for faster text-only crawls.
- --block-ads: block common ad and tracker requests.
- --stealth: enable Crawl4AI/Playwright stealth mode for basic bot blocking.
- --user-agent-mode default|random: use default or randomized user agent.
- --viewport WIDTHxHEIGHT: set viewport, e.g. 1920x1080.
- --ignore-https-errors: ignore invalid TLS certificates.
- --timeout MS: page timeout in milliseconds. Default: 30000.
- --cache: enable local cache. Default is off; use only when stale content is acceptable.

| Problem | Try |
| ----------------------- | -------------------------------------------------------------- |
| Empty markdown | --format raw, --wait-for SELECTOR, or --scan-full-page |
| Too much noise | --css SELECTOR or --exclude-tags TAGS |
| Slow pages | --timeout 60000 |
| Images slow things down | --text-mode |
| Ad/tracker noise | --block-ads |
| Basic bot block | --stealth --user-agent-mode random |
| Need links | --format json, then read .links.internal[] / .external[] |
| Login required | Stop. crwl-cli is for public headless crawling only. |
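The "Empty markdown" row can be automated as a small escalation ladder that tries the plain fetch first and only then reaches for heavier flags. This is a sketch under stated assumptions: it assumes crwl-cli writes markdown to stdout and that empty stdout means empty markdown, neither of which is confirmed above. `fetch_with_fallback` and the `CRWL_BIN` override are hypothetical names, and `--wait-for` is left out because it needs a page-specific selector.

```bash
#!/usr/bin/env bash
# Sketch only. Assumes markdown is printed to stdout and that empty
# stdout means "empty markdown". CRWL_BIN is a hypothetical override
# that lets a different binary (or a stub) stand in for crwl-cli.
CRWL_BIN="${CRWL_BIN:-crwl-cli}"

fetch_with_fallback() {
  local url="$1" flags out
  # Flag sets from the table, mildest first; "" is the plain default fetch.
  for flags in "" "--scan-full-page" "--format raw"; do
    # $flags is deliberately unquoted so "--format raw" becomes two arguments.
    out="$("$CRWL_BIN" fetch "$url" $flags 2>/dev/null)" || continue
    if [ -n "$out" ]; then
      printf '%s\n' "$out"
      return 0
    fi
  done
  echo "no content from $url after all fallbacks" >&2
  return 1
}
```

Trying the cheapest option first keeps ordinary pages fast. A page that still returns nothing after the last step is likely one of the cases this tool does not cover, such as content behind a login.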