webtomd/SKILL.md
Scrapes a webpage and saves it as a clean Markdown file with frontmatter. Use when the user provides a URL and wants to scrape, save, archive, or convert a page to Markdown. Triggers on "scrape this", "save this page", "convert to markdown", "download docs from", or any URL with intent to save.
npx skillsauth add aluvia-connect/aluvia-skills webtomd
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Two modes:
- fast: WebFetch. Quick, may summarize. Best for articles and blog posts.
- precise: urllib + markdownify. Verbatim content. Best for technical docs and API references.
Mode is auto-detected in Step 3. The user can override at any time by saying "use fast" or "use precise".
Copy and track progress:
[ ] 1. Resolve proxy (if ALUVIA_API_KEY set)
[ ] 2. Fetch raw HTML + JS-render check
[ ] 3. Auto-detect mode
[ ] 4. Detect and filter nav links
[ ] 5. Ask user which pages to scrape
[ ] 6. Convert and save each page
[ ] 7. Print summary
Check for ALUVIA_API_KEY in the environment:
import os, urllib.request, json
api_key = os.environ.get("ALUVIA_API_KEY")
proxy_url = None
if api_key:
    try:
        # Ask the Aluvia API for the account's connections
        req = urllib.request.Request(
            "https://api.aluvia.io/v1/account/connections",
            headers={"Authorization": f"Bearer {api_key}"}
        )
        with urllib.request.urlopen(req, timeout=10) as r:
            data = json.loads(r.read())
        connections = data.get("data", [])
        if connections:
            # Use the first connection's proxy URL
            proxy_url = connections[0]["proxy_urls"]["url"]
    except Exception:
        proxy_url = None  # lookup failed: continue without a proxy
If proxy_url is set, use it for all subsequent urllib requests via ProxyHandler:
handlers = [urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})] if proxy_url else []
opener = urllib.request.build_opener(*handlers)
Use opener.open(req) instead of urllib.request.urlopen() for every fetch in Steps 2 and 6.
If ALUVIA_API_KEY is not set or the call fails, proceed without proxy — do not error.
Use python3 + urllib with a browser User-Agent. Extract <title>, strip trailing | site name.
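The fetch and title cleanup described above could be sketched as follows. This is a minimal illustration, not the skill's actual code: `BROWSER_UA`, `fetch_html`, and `extract_title` are hypothetical names, and the User-Agent string is just one plausible browser value.

```python
import re
import urllib.request

# Hypothetical User-Agent string; any realistic browser UA should do.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def fetch_html(url, opener=None, timeout=15):
    """Fetch a page as text, sending a browser-like User-Agent.

    Pass the proxy-aware opener from Step 1 if one was built.
    """
    req = urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})
    open_fn = opener.open if opener else urllib.request.urlopen
    with open_fn(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def extract_title(html):
    """Return the <title> text with a trailing '| site name' removed."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
    if not match:
        return ""
    title = " ".join(match.group(1).split())  # collapse whitespace
    # Drop everything after the last " | " separator, if present.
    return title.rsplit(" | ", 1)[0].strip() if " | " in title else title
```

Only the last `|` segment is dropped, so a title such as "A | B | Site" keeps "A | B".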
JS-render check — flag only if BOTH are true:
- <body> has < 3 block tags (p, h1-h6, li, section, div)
If flagged, warn and offer:
"⚠️ Appears JS-rendered. Options: (1) continue anyway (2) retry via Jina Reader: https://r.jina.ai/<url> (3) cancel"
If user picks Jina, replace URL and use fast mode. Skip nav detection.
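The block-tag condition of the JS-render check might be implemented roughly like this, using only the standard library. The names `BlockTagCounter` and `has_few_block_tags` are hypothetical. The sketch covers only the block-tag count, so it would need to be combined with the other condition before flagging a page.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "section", "div"}


class BlockTagCounter(HTMLParser):
    """Count block-level opening tags that appear inside <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body and tag in BLOCK_TAGS:
            self.count += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False


def has_few_block_tags(html, threshold=3):
    """True when <body> contains fewer than `threshold` block tags."""
    counter = BlockTagCounter()
    counter.feed(html)
    return counter.count < threshold
```

A typical single-page-app shell such as `<body><div id="root"></div></body>` has one block tag and would satisfy this condition.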
Using the URL and raw HTML already fetched, pick the mode silently then inform the user:
Use precise if any of these match:
- URL path contains /docs/, /api/, /reference/, /guide/, /manual/, or /sdk/
- <code> or <pre> blocks in the HTML
- Docs-style domain (platform.claude.com, docs., developer.)
Use fast otherwise (blogs, marketing, articles, landing pages).
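As a rough sketch, the mode rules above could be expressed as a single function. The name `detect_mode`, the `MIN_CODE_BLOCKS` threshold, and the fallback reason string are assumptions made for illustration; the text does not state how many code blocks count as dense.

```python
import re
from urllib.parse import urlparse

DOC_PATH_HINTS = ("/docs/", "/api/", "/reference/", "/guide/", "/manual/", "/sdk/")
# Assumed threshold: the skill text gives no number for "code density".
MIN_CODE_BLOCKS = 3


def detect_mode(url, html):
    """Return (mode, reason), where mode is 'precise' or 'fast'."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = parsed.path.lower()
    if not path.endswith("/"):
        path += "/"  # so a path ending in "/docs" also matches "/docs/"
    if any(hint in path for hint in DOC_PATH_HINTS):
        return "precise", "URL pattern"
    # Count opening <code> and <pre> tags only.
    if len(re.findall(r"<(?:code|pre)\b", html, re.I)) >= MIN_CODE_BLOCKS:
        return "precise", "code density"
    if host == "platform.claude.com" or host.startswith(("docs.", "developer.")):
        return "precise", "domain"
    return "fast", "no docs signals"  # hypothetical reason label
```

The returned reason can be dropped straight into the "Auto-selected [mode] mode based on [reason]" message.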
Inform the user:
"Auto-selected [mode] mode based on [reason: URL pattern / code density / domain]. Reply 'fast' or 'precise' to override, or press Enter to continue."
Wait for reply. If Enter or no override, proceed.
Use the raw HTML already fetched in Step 2 — do not re-fetch the URL.
Scan the HTML and identify all navigation and sidebar links that look like documentation or content pages. Extract ALL matching (title, url) pairs — do not truncate or summarize.
- If a link starts with /, prepend the base domain (e.g. /docs/guide → https://example.com/docs/guide)
If no links are found → scrape main URL only.
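One way to collect and resolve navigation links with the standard library is sketched below. `NavLinkParser` and `nav_links` are hypothetical names, and two choices are assumptions rather than rules from this skill: looking only inside `<nav>` and `<aside>`, and skipping anchor, mailto, and javascript links. `urllib.parse.urljoin` handles the root-relative case described above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NavLinkParser(HTMLParser):
    """Collect (title, href) pairs for <a> tags inside <nav> or <aside>."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside nav/aside
        self.href = None    # href of the <a> currently open, if any
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in ("nav", "aside"):
            self.depth += 1
        elif tag == "a" and self.depth:
            self.href = dict(attrs).get("href")
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag in ("nav", "aside") and self.depth:
            self.depth -= 1
        elif tag == "a" and self.href is not None:
            title = " ".join("".join(self.text).split())
            if title:
                self.links.append((title, self.href))
            self.href = None


def nav_links(html, base_url):
    """Return de-duplicated (title, absolute_url) pairs in page order."""
    parser = NavLinkParser()
    parser.feed(html)
    seen, result = set(), []
    for title, href in parser.links:
        # Assumed filter: skip in-page anchors and non-HTTP schemes.
        if href.startswith(("#", "mailto:", "javascript:")):
            continue
        url = urljoin(base_url, href)  # resolves "/docs/guide" against the base
        if url not in seen:
            seen.add(url)
            result.append((title, url))
    return result
```

An empty result corresponds to the "scrape main URL only" case.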
If nav links found:
Found N related pages. Which to scrape?
1. Title — url
2. Title — url
Reply: "all", "none", or "1 3 5"
If "all" and N > 10: confirm — "That's N pages with ~1s delays between each. Proceed?"
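Parsing the user's reply could look something like this. `parse_selection` and `needs_confirmation` are hypothetical helpers; treating an empty reply as "none" and accepting commas between numbers are assumptions.

```python
def parse_selection(reply, total):
    """Map a reply ("all", "none", or "1 3 5") to zero-based indexes."""
    text = reply.strip().lower()
    if text == "all":
        return list(range(total))
    if text in ("none", ""):  # assumption: empty reply means "none"
        return []
    picked = []
    for token in text.replace(",", " ").split():
        if not token.isdigit():
            raise ValueError(f"not a page number: {token!r}")
        index = int(token) - 1  # the list shown to the user is 1-based
        if not 0 <= index < total:
            raise ValueError(f"page number out of range: {token}")
        if index not in picked:
            picked.append(index)
    return picked


def needs_confirmation(reply, total, limit=10):
    """True when "all" is chosen and there are more than `limit` pages."""
    return reply.strip().lower() == "all" and total > limit
```

Raising `ValueError` on bad input lets the caller re-ask instead of silently scraping the wrong pages.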
For each URL — check if file exists first:
"⚠️ ./scraped/file.md exists. Overwrite? (yes / no / skip)"
Rate limit: 1–2s delay between requests when scraping multiple URLs.
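The per-URL loop with the existence check and delay might be structured as below. This is only a sketch: `scrape_all`, `scrape_one`, and `confirm_overwrite` are hypothetical names, the overwrite prompt is reduced to a yes/no callback, and the sleep function is injectable so the loop can be exercised without waiting.

```python
import os
import random
import time


def scrape_all(targets, scrape_one, confirm_overwrite, sleep=time.sleep):
    """Process (url, path) pairs; return {path: "saved" | "skipped"}."""
    results = {}
    fetched = 0
    for url, path in targets:
        # Ask before touching a file that already exists.
        if os.path.exists(path) and not confirm_overwrite(path):
            results[path] = "skipped"
            continue
        if fetched:  # pause between requests, not before the first one
            sleep(random.uniform(1.0, 2.0))
        scrape_one(url, path)
        fetched += 1
        results[path] = "saved"
    return results
```

Skipped files do not trigger a delay, since no request is made for them.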
fast mode:
Use WebFetch. If response says "Output too large... saved to: /path/file.txt", read it with bash (cat).
If result < 500 chars or has no headings and no lists, suggest:
"⚠️ Fast mode returned thin content. Retry with precise? (yes / no)"
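The thin-content rule above can be checked with a small helper. `is_thin` is a hypothetical name, and the regular expressions are one reasonable reading of "headings" and "lists" in Markdown (ATX headings and `-`, `*`, `+`, or numbered items).

```python
import re


def is_thin(markdown, min_chars=500):
    """True if the text is short, or has neither headings nor list items."""
    has_heading = re.search(r"^#{1,6}\s+\S", markdown, re.M) is not None
    has_list = re.search(r"^\s*(?:[-*+]|\d+[.)])\s+\S", markdown, re.M) is not None
    return len(markdown) < min_chars or not (has_heading or has_list)
```

When it returns True, the skill would offer the retry with precise mode.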
precise mode:
Reuse HTML from step 2 for the main URL. For nav pages, fetch fresh using opener (with proxy if set).
- Ensure markdownify is installed: python3 -c "import markdownify" 2>/dev/null || pip install markdownify -q --break-system-packages
- Strip <script>, <style>, <nav>, <footer>, <header>, <aside>
- Pick the main content, first match wins: <main> → <article> → element with class/id content|docs|prose → largest <div> → <body>
- Convert with markdownify (ATX headings, tables, no images), clean blank lines
Quality check (both modes):
- No # heading → warn ⚠️ No headings
- ⚠️ Single text block
- JS-rendered page → prepend > ⚠️ JS-rendered page: structure may be lost
Save to ./scraped/<slug>.md:
---
title: "Page Title"
source: "https://..."
scraped: "YYYY-MM-DD"
mode: fast | precise
---
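Writing a file in the format above could be sketched as follows. The slug scheme in `slugify` is an assumption, since the skill does not say how `<slug>` is derived. `build_document` and `save_page` are hypothetical names. `json.dumps` is used to quote the title and source so that embedded double quotes are escaped.

```python
import datetime
import json
import os
import re
from urllib.parse import urlparse


def slugify(url):
    """Derive a file slug from the URL path (assumed scheme)."""
    parsed = urlparse(url)
    raw = parsed.path.strip("/") or parsed.netloc
    slug = re.sub(r"[^a-z0-9]+", "-", raw.lower()).strip("-")
    return slug or "page"


def build_document(title, source, mode, body, today=None):
    """Return the frontmatter block followed by the Markdown body."""
    today = today or datetime.date.today().isoformat()
    frontmatter = [
        "---",
        f"title: {json.dumps(title, ensure_ascii=False)}",
        f"source: {json.dumps(source)}",
        f'scraped: "{today}"',
        f"mode: {mode}",
        "---",
    ]
    return "\n".join(frontmatter) + "\n\n" + body.strip() + "\n"


def save_page(url, title, mode, body, out_dir="./scraped"):
    """Write the page to <out_dir>/<slug>.md and return the path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, slugify(url) + ".md")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(build_document(title, url, mode, body))
    return path
```

The existence check and overwrite prompt from Step 6 would run before `save_page`, which overwrites unconditionally.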
Done. Saved N file(s) [mode: fast|precise]:
- ./scraped/page.md (4,200 chars)
- ./scraped/other.md (skipped — already exists)
- ./scraped/broken.md ⚠️ No headings