skills/crwl-cli/SKILL.md
Crawl web pages and extract markdown. Handles auth via browser profiles.
npx skillsauth add mulatta/skillz crwl-cli
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Choose an approach before crawling:
| Situation | Approach |
| ----------------------------------------------------------------------------------- | ------------------------------------------------- |
| Single page (article, docs, blog post) | crwl-cli fetch URL |
| Multiple pages linked from one page (product listings, search results, index pages) | JSON links pipeline (see Multi-step Crawling) |
NEVER manually copy URLs from markdown output. Use --format json and extract .links with jq instead. Markdown text may contain malformed or incomplete URLs, while .links provides structured, reliable hrefs.
# Single URL — markdown output (default)
crwl-cli fetch https://docs.python.org/3/library/asyncio.html
# CSS selector to limit scope
crwl-cli fetch https://docs.python.org/3/ --css "#content"
# JSON output (for pipelines)
crwl-cli fetch https://example.com --format json
# Raw markdown (no content filtering)
crwl-cli fetch https://example.com --format raw
# Fast mode — disable images
crwl-cli fetch https://example.com --text-mode
# Wait for dynamic content
crwl-cli fetch https://example.com --wait-for ".loaded"
# Batch crawl from file
crwl-cli fetch --urls-file urls.txt --format json
Multi-step Crawling
When to use: the target page links to multiple detail pages you need data from. Detect this when the page is a product listing, search results, category index, or configurator.
# 1. Crawl listing page → JSON (always use --format json for listings)
crwl-cli fetch https://shop.example.com/products --format json > listing.json
# 2. Extract detail page URLs via .links (NOT from .markdown)
jq -r '.links.internal[] | select(.href | test("/products/")) | .href' listing.json > urls.txt
# 3. Batch crawl all detail pages
crwl-cli fetch --urls-file urls.txt --format json
links structure (--format json only):
{
  "internal": [{ "href": "...", "text": "...", "title": "..." }],
  "external": [{ "href": "...", "text": "...", "title": "..." }]
}
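Listing pages often link to the same detail page more than once, or with differing #fragment suffixes. As a minimal sketch, the URL list from step 2 could be normalized before the batch crawl using only standard POSIX tools. The helper name clean_urls is hypothetical and not part of crwl-cli, and the assumption that fragments can be dropped safely may not hold for single-page apps that route on the fragment.

```shell
# clean_urls: hypothetical helper, not part of crwl-cli.
# Reads one URL per line on stdin, strips any "#fragment" suffix,
# drops blank lines, and prints each remaining URL once (sorted).
clean_urls() {
    sed -e 's/#.*$//' -e '/^[[:space:]]*$/d' | sort -u
}

# Example: three links that point at two distinct pages.
printf '%s\n' \
    'https://shop.example.com/products/2' \
    'https://shop.example.com/products/1#reviews' \
    'https://shop.example.com/products/1' \
    | clean_urls
# prints:
# https://shop.example.com/products/1
# https://shop.example.com/products/2
```

In the pipeline above, this would sit between jq and the redirect, for example `jq -r '...' listing.json | clean_urls > urls.txt`.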
Crawl the page with --format json.
Does .links.internal contain multiple URLs matching a detail page pattern?
Yes: extract them with jq, write them to a file, and batch crawl with --urls-file.
No: use .markdown directly.
Do not parse .markdown to find URLs: it is unreliable, manual, and misses links hidden in JS-rendered elements. Always use .links.
.links provides the canonical hrefs.
Use --urls-file for batch crawling instead of sequential fetch calls.
When crawl output contains login prompts ("sign in", "log in") or a 401/403 status, follow these steps:
Create a profile — opens Chromium for manual login (requires GUI display; not available in SSH/headless environments):
crwl-cli profile create github
Log in to the site in the browser window, then press q in the terminal to save.
Verify the profile works:
crwl-cli profile check github https://github.com/settings/profile
Check that the preview shows authenticated content.
Crawl with the profile:
crwl-cli fetch https://github.com/settings/profile --profile github
Re-crawl with a profile when the result contains login prompts ("sign in", "log in") or a 401/403 status.
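The login-prompt check could be automated with a small filter over the crawl output. This is a sketch under stated assumptions: looks_like_auth_wall is a hypothetical helper, not part of crwl-cli, and the phrase list covers only the two phrases named in this document. It matches whole phrases so that text such as "blog index" does not trigger on "log in". Checking for 401/403 would additionally need the status_code field from --format json, which this sketch does not inspect.

```shell
# looks_like_auth_wall: hypothetical helper, not part of crwl-cli.
# Reads crawl output on stdin and succeeds (exit 0) when it contains a
# login phrase as a standalone phrase, case-insensitively.
# The phrase list is an assumption; extend it per site.
looks_like_auth_wall() {
    grep -Eiq '(^|[^[:alnum:]])(sign in|log in)([^[:alnum:]]|$)'
}

# Example with canned text instead of a live crawl.
if printf 'Please sign in to continue\n' | looks_like_auth_wall; then
    echo "auth wall detected: re-crawl with --profile <name>"
fi
```

With a live crawl, the same helper could be fed from a pipe, for example `crwl-cli fetch <url> | looks_like_auth_wall`.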
crwl-cli profile list # List all profiles
crwl-cli profile create <name> # Create (opens browser)
crwl-cli profile check <name> <url> # Test profile session
crwl-cli profile delete <name> # Delete profile
Profiles stored at: ~/.local/share/crwl-cli/profiles/<name>/
profile create opens a Chromium window and requires a GUI display.
Cache is off by default. Enable with --cache.
crwl-cli fetch https://example.com --cache # Store result
crwl-cli cache list # List cached entries
crwl-cli cache clear # Clear all
crwl-cli cache clear --older-than 7 # Clear entries >7 days old
Cache stored at: ~/.local/share/crwl-cli/cache/
| Format | Flag | Content | Use Case |
| ------ | ----------------------- | ----------------------------------------------------- | ------------------------------ |
| md | --format md (default) | Filtered markdown (PruningContentFilter) | LLM consumption |
| raw | --format raw | Full markdown, no filtering | Debugging, complete extraction |
| json | --format json | {url, success, status_code, markdown, links, error} | Pipelines, batch processing |
| Problem | Solution |
| ----------------------- | --------------------------------------------------- |
| Empty markdown | Add --wait-for <selector> for JS-rendered content |
| Timeout | Increase --timeout 60000 |
| Too much noise | Use --css <selector> to scope extraction |
| Images slow things down | Use --text-mode |
| Auth wall | Create a profile: crwl-cli profile create <name> |
| Stale session | Re-check: crwl-cli profile check <name> <url> |
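The first two rows of the table can be combined into a retry wrapper. The following is a minimal sketch, not part of crwl-cli: fetch_with_retry and the CRWL_BIN variable are hypothetical names, and it assumes that a fetch of a JS-rendered page prints nothing to stdout when the markdown is empty. Only flags already shown in this document are used.

```shell
# Hypothetical wrapper, not part of crwl-cli. CRWL_BIN lets the command
# be swapped out (for example with a stub when testing).
CRWL_BIN="${CRWL_BIN:-crwl-cli}"

# fetch_with_retry URL SELECTOR
# First tries a plain fetch. If that fails or prints nothing, retries
# once with --wait-for SELECTOR and --timeout 60000, following the
# "Empty markdown" and "Timeout" rows of the table.
fetch_with_retry() {
    url=$1
    selector=$2
    out=$("$CRWL_BIN" fetch "$url") || out=""
    if [ -z "$out" ]; then
        out=$("$CRWL_BIN" fetch "$url" --wait-for "$selector" --timeout 60000)
    fi
    printf '%s\n' "$out"
}
```

A call would look like `fetch_with_retry https://example.com ".loaded"`. Retrying only once keeps the wrapper predictable; if the second attempt is also empty, the remaining rows of the table (scoping with --css, or creating a profile) are the next things to try by hand.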