dev-toolkit/skills/web-scraping/SKILL.md
Web scraping with anti-bot bypass, content extraction, undocumented APIs and poison pill detection. Use when extracting content from websites, handling paywalls, implementing scraping cascades or processing social media. Covers requests, trafilatura, Playwright with stealth mode, yt-dlp and instaloader patterns.
Install this skill globally with one command: `npx skillsauth add jamditis/claude-skills-journalism web-scraping`. Works with Claude Code, Cursor, and Windsurf.
Patterns for reliable, ethical web scraping with fallback strategies and anti-bot handling.
Implement multiple extraction strategies with automatic fallback:
```python
import random
from abc import ABC, abstractmethod
from typing import Optional

import requests
import trafilatura
from bs4 import BeautifulSoup

# For .py files
from playwright.sync_api import sync_playwright
# stealth_sync is the playwright-stealth 1.x entry point. The 2.x releases
# reorganized the API, so match this import to the installed version.
from playwright_stealth import stealth_sync

# For .ipynb files
import asyncio
from playwright.async_api import async_playwright


class ScrapingResult:
    def __init__(self, content: str, title: str, method: str):
        self.content = content
        self.title = title
        self.method = method  # Track which method succeeded


class Scraper(ABC):
    @abstractmethod
    def fetch(self, url: str) -> Optional[ScrapingResult]: ...


class TrafilaturaScraper(Scraper):
    """Fast, lightweight extraction for standard articles."""

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        try:
            downloaded = trafilatura.fetch_url(url)
            if not downloaded:
                return None
            content = trafilatura.extract(
                downloaded,
                include_comments=False,
                include_tables=True,
                favor_recall=True
            )
            if not content or len(content) < 100:
                return None
            # Extract title separately
            soup = BeautifulSoup(downloaded, 'html.parser')
            title = soup.find('title')
            title_text = title.get_text() if title else ''
            return ScrapingResult(content, title_text, 'trafilatura')
        except Exception:
            return None


class RequestsScraper(Scraper):
    """HTTP requests with rotating user agents."""

    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
    ]

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        headers = {
            'User-Agent': random.choice(self.USER_AGENTS),
            'Accept': 'text/html,application/xhtml+xml',
            'Accept-Language': 'en-US,en;q=0.9',
        }
        try:
            response = requests.get(url, headers=headers, timeout=30)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')
            # Remove script/style and page-chrome elements
            for element in soup(['script', 'style', 'nav', 'footer', 'aside']):
                element.decompose()
            # Find main content
            main = soup.find('main') or soup.find('article') or soup.find('body')
            content = main.get_text(separator='\n', strip=True) if main else ''
            title = soup.find('title')
            title_text = title.get_text() if title else ''
            if len(content) < 100:
                return None
            return ScrapingResult(content, title_text, 'requests')
        except Exception:
            return None


class PlaywrightScraper(Scraper):
    """Heavy JavaScript rendering with stealth mode for anti-bot bypass."""

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        try:
            with sync_playwright() as p:
                browser = p.chromium.launch(headless=True)
                context = browser.new_context(
                    viewport={'width': 1920, 'height': 1080},
                    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                )
                page = context.new_page()
                # Apply stealth to avoid detection
                stealth_sync(page)
                page.goto(url, wait_until='networkidle', timeout=60000)
                # Wait for content to load
                page.wait_for_timeout(2000)
                # Extract content
                content = page.evaluate('''() => {
                    const article = document.querySelector('article, main, .content, #content');
                    return article ? article.innerText : document.body.innerText;
                }''')
                title = page.title()
                browser.close()
                if not content or len(content) < 100:
                    return None
                return ScrapingResult(content, title, 'playwright')
        except Exception:
            return None


class PlaywrightScraperAsync:
    """Async Playwright scraper for Jupyter notebooks (.ipynb files).

    Jupyter notebooks run their own event loop, so sync Playwright won't work.
    Use this async version with `await` in notebook cells.
    """

    async def fetch(self, url: str) -> Optional[ScrapingResult]:
        try:
            async with async_playwright() as p:
                browser = await p.chromium.launch(headless=True)
                context = await browser.new_context(
                    viewport={'width': 1920, 'height': 1080},
                    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                )
                page = await context.new_page()
                # Note: playwright-stealth async version
                # from playwright_stealth import stealth_async
                # await stealth_async(page)
                await page.goto(url, wait_until='networkidle', timeout=60000)
                # Wait for content to load
                await page.wait_for_timeout(2000)
                # Extract content
                content = await page.evaluate('''() => {
                    const article = document.querySelector('article, main, .content, #content');
                    return article ? article.innerText : document.body.innerText;
                }''')
                title = await page.title()
                await browser.close()
                if not content or len(content) < 100:
                    return None
                return ScrapingResult(content, title, 'playwright_async')
        except Exception:
            return None


# Usage in Jupyter notebook cells:
# scraper = PlaywrightScraperAsync()
# result = await scraper.fetch('https://example.com')


class ScrapingCascade:
    """Try multiple scrapers in order until one succeeds."""

    def __init__(self):
        # Cheapest first: the browser only starts when both HTTP fetchers fail
        self.scrapers = [
            TrafilaturaScraper(),
            RequestsScraper(),
            PlaywrightScraper(),
        ]

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        for scraper in self.scrapers:
            result = scraper.fetch(url)
            if result:
                return result
        return None
```
The cascade above (trafilatura → requests → Playwright + playwright-stealth) handles plain HTML and lightly-protected JS sites. Modern anti-bot stacks (Cloudflare Bot Management / Turnstile, DataDome, Akamai Bot Manager, PerimeterX) layer multiple detection signals: TLS / HTTP-2 fingerprints, browser fingerprints, JS-execution proofs, residential-IP reputation, session behavior. No single tool defeats all of them.
playwright-stealth (2.0+, current) patches obvious detection vectors — navigator.webdriver, chrome.runtime, plugin enumeration, language settings, WebGL fingerprints. Treat it as the floor, not the ceiling. If a target fingerprints TLS or runs Turnstile, stealth alone won't pass.
| Tool | Layer it addresses | Notes |
|---|---|---|
| curl_cffi | TLS / HTTP-2 fingerprint | Drop-in replacement for requests that mimics Chrome/Safari/Edge JA3+ALPN. Can't run JS — pair with a parsed-HTML extractor when JS isn't required. |
| playwright-stealth 2.x | JS-runtime fingerprint | The starting line for Playwright/Chromium. Updates lag the bot stacks; expect to combine with rotation. |
| Camoufox | JS + browser fingerprint at C++ level | Firefox-based stealth browser. Spoofs fingerprint values at a level low enough that JS-side checks can't see through them. Use when Chromium-based stealth is detected. |
| SeleniumBase UC Mode | Turnstile + browser fingerprint | The closest thing to a one-shot Turnstile solver in 2026, but heavier than playwright-stealth. |
| Residential proxy pool | IP reputation | Datacenter IPs (DigitalOcean, AWS) get challenged on first request. Residential pools cost more but bypass the cheapest layer of defense. |
Use the lightest tool that works. Targets without aggressive defense don't need Camoufox or proxy pools — curl_cffi plus a sleep is usually enough. Reserve heavier tools for sites that explicitly serve a Turnstile challenge or DataDome interstitial.
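The "lightest tool that works" rule can be written down as a small decision function. The sketch below is only an illustration: the `DefenseSignals` fields and the `pick_tools` helper are hypothetical names, and how you detect each signal (for example, by inspecting a failed response) is left out. The tool names are the ones from the table above.

```python
from dataclasses import dataclass


@dataclass
class DefenseSignals:
    """Hypothetical summary of what earlier failed attempts revealed."""
    needs_js: bool = False             # content only appears after JS runs
    tls_fingerprinted: bool = False    # plain requests are blocked, browsers pass
    chromium_detected: bool = False    # Chromium + stealth still gets challenged
    turnstile_challenge: bool = False  # page serves a Turnstile challenge
    ip_challenged: bool = False        # challenged on the very first request


def pick_tools(signals: DefenseSignals) -> list[str]:
    """Return the lightest tool set that covers the observed signals."""
    if signals.turnstile_challenge:
        tools = ['SeleniumBase UC Mode']
    elif signals.chromium_detected:
        tools = ['Camoufox']
    elif signals.needs_js:
        tools = ['playwright-stealth']
    elif signals.tls_fingerprinted:
        tools = ['curl_cffi']
    else:
        tools = ['requests or trafilatura']
    # IP reputation is independent of the browser layer, so it stacks on top
    if signals.ip_challenged:
        tools.append('residential proxy pool')
    return tools


print(pick_tools(DefenseSignals(tls_fingerprinted=True)))
print(pick_tools(DefenseSignals(needs_js=True, ip_challenged=True)))
```

The ordering of the branches encodes the escalation path: each heavier tool is reached only when a signal rules out the lighter ones.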
Use browser developer tools to discover APIs.
When you copy a cURL from dev tools, it includes many parameters. Strip it down to the parameters the endpoint actually requires (e.g., `prefix` for search terms).
```python
import time

import requests


def search_suggestions(keyword: str) -> dict:
    """
    Get autocompleted search suggestions from an undocumented API.
    Stripped down from browser dev tools capture.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Language': 'en-US,en;q=0.5',
    }
    params = {
        'prefix': keyword,
        'suggestion-type': ['WIDGET', 'KEYWORD'],
        'alias': 'aps',
        'plain-mid': '1',
    }
    response = requests.get(
        'https://completion.amazon.com/api/2017/suggestions',
        params=params,
        headers=headers,
        timeout=30,
    )
    response.raise_for_status()  # fail loudly instead of parsing an error page
    return response.json()


# Collect suggestions for multiple keywords
keywords = ['a', 'b', 'cookie', 'sock']
data = []
for keyword in keywords:
    suggestions = search_suggestions(keyword)
    for suggestion in suggestions.get('suggestions', []):
        suggestion['search_word'] = keyword  # track seed keyword on each row
        data.append(suggestion)
    time.sleep(1)  # rate limit yourself
```
Source: Leon Yin, "Finding Undocumented APIs," Inspect Element, 2023
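One way to strip a captured request down is to drop one parameter at a time and keep it only if the request stops working without it. The sketch below assumes a caller-supplied `works` callable that sends the request and reports success. The helper name `minimal_params`, the stand-in `fake_works`, and the `session-id` value are all hypothetical, and the stand-in makes no claim about which parameters the real endpoint requires.

```python
from typing import Callable, Dict


def minimal_params(
    params: Dict[str, str],
    works: Callable[[Dict[str, str]], bool],
) -> Dict[str, str]:
    """Drop each parameter in turn; keep it only if the request breaks without it."""
    if not works(params):
        raise ValueError('the full parameter set does not work; nothing to strip')
    kept = dict(params)
    for name in list(params):
        trial = {k: v for k, v in kept.items() if k != name}
        if works(trial):
            kept = trial  # the request still succeeds, so the parameter was noise
    return kept


def fake_works(candidate: Dict[str, str]) -> bool:
    """Offline stand-in for a real request: pretends two parameters are required."""
    return 'prefix' in candidate and 'alias' in candidate


captured = {'prefix': 'sock', 'alias': 'aps', 'session-id': '123', 'plain-mid': '1'}
print(minimal_params(captured, fake_works))
```

In real use, `works` would issue the request, check the status code and response shape, and sleep between attempts so the probing itself stays polite.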
Detect paywalls, anti-bot pages, and other failures:
```python
import re
from dataclasses import dataclass
from enum import Enum
from urllib.parse import urlparse


class PoisonPillType(Enum):
    PAYWALL = 'paywall'
    CAPTCHA = 'captcha'
    RATE_LIMIT = 'rate_limit'
    CLOUDFLARE = 'cloudflare'
    LOGIN_REQUIRED = 'login_required'
    NOT_FOUND = 'not_found'
    NONE = 'none'


@dataclass
class PoisonPillResult:
    detected: bool
    type: PoisonPillType
    confidence: float
    details: str


class PoisonPillDetector:
    PATTERNS = {
        PoisonPillType.PAYWALL: [
            r'subscribe to continue',
            r'subscription required',
            r'become a member',
            r'sign up to read',
            r'you\'ve reached your limit',
            r'article limit reached',
        ],
        PoisonPillType.CAPTCHA: [
            r'verify you are human',
            r'captcha',
            r'robot verification',
            r'prove you\'re not a robot',
        ],
        PoisonPillType.RATE_LIMIT: [
            r'too many requests',
            r'rate limit exceeded',
            r'slow down',
            r'\b429\b',  # word boundaries avoid matching longer numbers
        ],
        PoisonPillType.CLOUDFLARE: [
            r'checking your browser',
            r'cloudflare',
            r'ddos protection',
            r'please wait while we verify',
        ],
        PoisonPillType.LOGIN_REQUIRED: [
            r'sign in to continue',
            r'log in required',
            r'create an account',
        ],
    }

    PAYWALL_DOMAINS = {
        'nytimes.com': PoisonPillType.PAYWALL,
        'wsj.com': PoisonPillType.PAYWALL,
        'washingtonpost.com': PoisonPillType.PAYWALL,
        'ft.com': PoisonPillType.PAYWALL,
        'bloomberg.com': PoisonPillType.PAYWALL,
    }

    def detect(self, url: str, content: str, status_code: int = 200) -> PoisonPillResult:
        # Check status code
        if status_code == 429:
            return PoisonPillResult(True, PoisonPillType.RATE_LIMIT, 1.0, 'HTTP 429')
        if status_code == 403:
            return PoisonPillResult(True, PoisonPillType.CLOUDFLARE, 0.8, 'HTTP 403')
        if status_code == 404:
            return PoisonPillResult(True, PoisonPillType.NOT_FOUND, 1.0, 'HTTP 404')

        # Check known paywall domains
        domain = urlparse(url).netloc.lower().removeprefix('www.')
        for paywall_domain, pill_type in self.PAYWALL_DOMAINS.items():
            # Match the domain or its subdomains, not any host containing the string
            if domain == paywall_domain or domain.endswith('.' + paywall_domain):
                # Check if content is suspiciously short (paywall truncation)
                if len(content) < 500:
                    return PoisonPillResult(True, pill_type, 0.9, f'Short content from {domain}')

        # Pattern matching
        content_lower = content.lower()
        for pill_type, patterns in self.PATTERNS.items():
            for pattern in patterns:
                if re.search(pattern, content_lower):
                    return PoisonPillResult(True, pill_type, 0.7, f'Pattern match: {pattern}')

        return PoisonPillResult(False, PoisonPillType.NONE, 0.0, '')
```
```python
from pathlib import Path

import yt_dlp


def download_video_metadata(url: str) -> dict:
    """Extract metadata without downloading video."""
    ydl_opts = {
        'skip_download': True,
        'quiet': True,
        'no_warnings': True,
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=False)
        return {
            'title': info.get('title'),
            'description': info.get('description'),
            'duration': info.get('duration'),
            'upload_date': info.get('upload_date'),
            'view_count': info.get('view_count'),
            'channel': info.get('channel'),
            'thumbnail': info.get('thumbnail'),
        }


def download_video(url: str, output_dir: Path, audio_only: bool = False) -> Path:
    """Download video or audio."""
    output_template = str(output_dir / '%(title)s.%(ext)s')
    ydl_opts = {
        'outtmpl': output_template,
        'quiet': True,
    }
    if audio_only:
        ydl_opts['format'] = 'bestaudio/best'
        ydl_opts['postprocessors'] = [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
        }]
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=True)
        filename = Path(ydl.prepare_filename(info))
        if audio_only:
            # The postprocessor rewrites the extension after download
            filename = filename.with_suffix('.mp3')
        return filename


def get_transcript(url: str) -> list[dict]:
    """Extract auto-generated or manual subtitles.

    Returns the matching English subtitle tracks as {'ext', 'url'} dicts.
    """
    ydl_opts = {
        'skip_download': True,
        'writesubtitles': True,
        'writeautomaticsub': True,
        'subtitleslangs': ['en'],
        'quiet': True,
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=False)
    # Check for subtitles
    subtitles = info.get('subtitles') or {}
    auto_captions = info.get('automatic_captions') or {}
    # Prefer manual subtitles over auto-generated
    subs = subtitles.get('en') or auto_captions.get('en')
    if not subs:
        return []
    # Keep the vtt or json3 tracks. Downloading and parsing each track
    # depends on its format, so that step is left to the caller.
    return [
        {'ext': sub['ext'], 'url': sub.get('url')}
        for sub in subs
        if sub.get('ext') in ('vtt', 'json3')
    ]
```
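Parsing a subtitle track depends on its format. For WebVTT, a minimal standard-library parser might look like the sketch below. It assumes the track text has already been downloaded; `parse_vtt` is a hypothetical helper, not part of yt-dlp, and it does not handle styling blocks or the rolling, overlapping cues that auto-generated captions often contain.

```python
import re

# Matches "HH:MM:SS.mmm --> HH:MM:SS.mmm" with optional hour fields
CUE_TIMING = re.compile(
    r'(\d{2,}:)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(\d{2,}:)?(\d{2}):(\d{2})\.(\d{3})'
)
INLINE_TAG = re.compile(r'<[^>]+>')


def _to_seconds(hours, minutes, seconds, millis) -> float:
    whole_hours = int(hours[:-1]) if hours else 0  # strip the trailing colon
    return whole_hours * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_vtt(text: str) -> list[dict]:
    """Turn WebVTT text into a list of {'start', 'end', 'text'} cues."""
    cues = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = CUE_TIMING.match(line)
        if match:
            groups = match.groups()
            current = {
                'start': _to_seconds(*groups[:4]),
                'end': _to_seconds(*groups[4:]),
                'text': '',
            }
            cues.append(current)
        elif not line:
            current = None  # a blank line ends the current cue
        elif current is not None:
            # Cue payload: drop inline tags and join wrapped lines with spaces
            clean = INLINE_TAG.sub('', line)
            current['text'] = (current['text'] + ' ' + clean).strip()
    return cues


sample = (
    'WEBVTT\n\n'
    '00:00:01.000 --> 00:00:03.500\n'
    'Hello <c>world</c>\n\n'
    '00:01:00.000 --> 00:01:02.000 align:start\n'
    'second\nline\n'
)
for cue in parse_vtt(sample):
    print(cue)
```

Lines that appear outside a cue, such as the `WEBVTT` header or cue identifiers, are ignored because no cue is open when they are read.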
```python
from pathlib import Path
from typing import Optional

import instaloader


class InstagramScraper:
    def __init__(self, username: Optional[str] = None, session_file: Optional[str] = None):
        self.loader = instaloader.Instaloader(
            download_videos=True,
            download_video_thumbnails=False,
            download_geotags=False,
            download_comments=False,
            save_metadata=True,
            compress_json=False,
        )
        # A saved session is only usable together with the username it belongs to
        if username and session_file and Path(session_file).exists():
            self.loader.load_session_from_file(username, session_file)

    def get_profile_posts(self, username: str, limit: int = 50) -> list[dict]:
        """Get recent posts from a profile."""
        profile = instaloader.Profile.from_username(self.loader.context, username)
        posts = []
        for i, post in enumerate(profile.get_posts()):
            if i >= limit:
                break
            posts.append({
                'shortcode': post.shortcode,
                'url': f'https://instagram.com/p/{post.shortcode}/',
                'caption': post.caption,
                'timestamp': post.date_utc.isoformat(),
                'likes': post.likes,
                'comments': post.comments,
                'is_video': post.is_video,
                'video_url': post.video_url if post.is_video else None,
            })
        return posts

    def download_post(self, shortcode: str, output_dir: Path):
        """Download a single post's media."""
        post = instaloader.Post.from_shortcode(self.loader.context, shortcode)
        self.loader.download_post(post, target=str(output_dir))
```
```python
from pathlib import Path

import yt_dlp


def scrape_tiktok_profile(username: str, output_dir: Path, limit: int = 50) -> list[dict]:
    """Scrape TikTok profile videos."""
    profile_url = f'https://tiktok.com/@{username}'
    ydl_opts = {
        'quiet': True,
        'extract_flat': True,  # Don't download, just get info
        'playlistend': limit,
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(profile_url, download=False)
    videos = []
    for entry in info.get('entries') or []:
        if not entry:
            continue  # skip entries that failed to extract
        videos.append({
            'id': entry.get('id'),
            'title': entry.get('title'),
            'url': entry.get('url'),
            'timestamp': entry.get('timestamp'),
            'view_count': entry.get('view_count'),
        })
    return videos


def download_tiktok_video(url: str, output_dir: Path) -> Path:
    """Download a single TikTok video."""
    ydl_opts = {
        'outtmpl': str(output_dir / '%(id)s.%(ext)s'),
        'quiet': True,
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=True)
        return Path(ydl.prepare_filename(info))
```
```python
import time

import requests
from fake_useragent import UserAgent


class RequestManager:
    def __init__(self):
        self.ua = UserAgent()
        self.session = requests.Session()

    def get_headers(self) -> dict:
        return {
            'User-Agent': self.ua.random,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            # Advertise 'br' only if a Brotli decoder is installed;
            # otherwise the response body cannot be decoded.
            'Accept-Encoding': 'gzip, deflate',
            'DNT': '1',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
        }

    def fetch(self, url: str, retry_count: int = 3) -> requests.Response:
        for attempt in range(retry_count):
            try:
                response = self.session.get(
                    url,
                    headers=self.get_headers(),
                    timeout=30
                )
                response.raise_for_status()
                return response
            except requests.RequestException:
                if attempt == retry_count - 1:
                    raise
                time.sleep(2 ** attempt)  # Exponential backoff
        # Only reachable when retry_count is zero or negative
        raise ValueError('retry_count must be at least 1')
```
```python
import random
import time
from urllib.parse import urlparse


class PoliteRequester:
    def __init__(self, min_delay: float = 1.0, max_delay: float = 3.0):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.last_request_per_domain = {}

    def wait_for_domain(self, url: str):
        domain = urlparse(url).netloc
        last_request = self.last_request_per_domain.get(domain, 0)
        elapsed = time.time() - last_request
        delay = random.uniform(self.min_delay, self.max_delay)
        if elapsed < delay:
            time.sleep(delay - elapsed)
        self.last_request_per_domain[domain] = time.time()
```
Scraping is technically simple, ethically nuanced, and legally a moving target. The current state in the US (2026):
Computer Fraud and Abuse Act (CFAA). Van Buren v. United States (2021) and hiQ Labs v. LinkedIn (2022) narrowed the CFAA so that scraping public, non-credentialed pages does NOT constitute "unauthorized access." Logging in (or using credentials), bypassing technical access controls, or scraping after an explicit cease-and-desist letter remains legally fraught. State equivalents (e.g., California's CDAFA) sometimes go further than federal law.
Terms of service. Many sites' ToS forbid scraping. ToS is a contract, not a criminal statute — breach exposes you to civil claims (breach of contract, tortious interference, trespass to chattels in some jurisdictions), not jail. The risk profile differs sharply from CFAA.
robots.txt is a polite request, not a legal mandate. Ignoring it doesn't make you criminally liable, but courts have cited it as evidence of intent. For journalism in the public interest, that intent can be defensible; for commercial use, it's harder.
EU GDPR / UK DPA. If your scraping pulls personal data of EU/UK residents, GDPR/DPA apply regardless of where you run the scraper. Public availability does NOT exempt personal data from these regimes — Lloyd v. Google (UK Supreme Court 2021) and CJEU's Schrems II lineage make scraping personal data without a lawful basis a real liability.
Practical baseline:
- Check robots.txt. Honor crawl delays. Honor `Disallow:`.
- Back off when a server returns 429.

Notes on specific platforms. Instagram's instaloader and TikTok scraping via yt-dlp work today but break frequently: Meta and TikTok roll out anti-bot updates monthly. Account bans on the credentials you used are common. For journalism, the official APIs (Meta Content Library, TikTok Research API) are slower but more durable.
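The robots.txt part of the baseline can be checked with Python's standard library. The sketch below parses an inline sample so it runs offline; the sample rules, the `load_rules` helper, and the `newsroom-bot` user agent are made up for illustration. Against a live site you would fetch `/robots.txt` first and pass its text in.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; a real file comes from https://<site>/robots.txt
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)
agent = 'newsroom-bot'  # hypothetical user agent string

print(rules.can_fetch(agent, 'https://example.com/articles/1'))  # allowed path
print(rules.can_fetch(agent, 'https://example.com/private/x'))   # disallowed path
print(rules.crawl_delay(agent))                                  # seconds, or None
```

A crawl delay read this way can be used as the minimum wait between requests to that domain, so the site's own request is respected rather than a guessed value.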