skills/web-scraper/SKILL.md
Extract structured data from web pages and load it into databases. Use when a user asks to scrape a website, build a data pipeline, extract data from a webpage, pull prices from a site, collect links, gather product listings, download page content, parse HTML, set up ETL, or automate data collection. Handles static HTML, JavaScript-rendered pages, anti-bot proxies (Bright Data), data transformation, deduplication, and database loading.
npx skillsauth add tusosos/manus-knowledge-base web-scraper
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Extract structured data from web pages by parsing HTML, selecting elements with CSS selectors, and outputting clean data in JSON, CSV, or other formats. Handles both static HTML and JavaScript-rendered pages.
When a user asks you to scrape or extract data from a web page, follow these steps:
Determine:
For static HTML:
import requests
from bs4 import BeautifulSoup

def fetch_page(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; research-bot/1.0)"
    }
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
For JavaScript-rendered pages:
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def fetch_js_page(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Wait until network activity settles so client-side content has rendered
        page.goto(url, wait_until="networkidle")
        content = page.content()
        browser.close()
    return BeautifulSoup(content, "html.parser")
Identify the right selectors by inspecting the page structure:
def extract_items(soup, selectors):
    items = []
    containers = soup.select(selectors["container"])
    for container in containers:
        item = {}
        for field, selector in selectors["fields"].items():
            el = container.select_one(selector)
            if el:
                if el.name == "img":
                    # Images: keep the source URL
                    item[field] = el.get("src", "")
                elif el.name == "a":
                    # Links: keep both the visible text and the target
                    item[field] = {"text": el.get_text(strip=True), "href": el.get("href", "")}
                else:
                    item[field] = el.get_text(strip=True)
            else:
                # Missing element: record None rather than dropping the field
                item[field] = None
        items.append(item)
    return items
Usage example:
selectors = {
    "container": "div.product-card",
    "fields": {
        "name": "h2.product-title",
        "price": "span.price",
        "rating": "span.rating-value",
        "link": "a.product-link",
    }
}
items = extract_items(soup, selectors)
import time

def scrape_all_pages(base_url, selectors, max_pages=10):
    all_items = []
    for page_num in range(1, max_pages + 1):
        url = f"{base_url}?page={page_num}"
        soup = fetch_page(url)
        items = extract_items(soup, selectors)
        if not items:
            # An empty page means we have run past the last page
            break
        all_items.extend(items)
        print(f"Page {page_num}: {len(items)} items (total: {len(all_items)})")
        time.sleep(1)  # Pause between requests so the server is not hammered
    return all_items
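The f-string in scrape_all_pages assumes base_url carries no query string of its own; a URL such as https://example-store.com/catalog?sort=price would end up with two question marks. A minimal sketch of one way to build the page URL with urllib.parse instead. The helper name page_url and the default parameter name "page" are assumptions for illustration, not part of the skill.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def page_url(base_url, page_num, param="page"):
    """Return base_url with its page parameter set to page_num (hypothetical helper)."""
    parts = urlsplit(base_url)
    # Keep existing query parameters, dropping any stale page parameter
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, str(page_num)))
    return urlunsplit(parts._replace(query=urlencode(query)))

plain = page_url("https://example-store.com/catalog", 3)
sorted_listing = page_url("https://example-store.com/catalog?sort=price", 2)
```

Sites that paginate with offsets or cursor tokens need a different scheme; this only covers numbered pages passed as a query parameter.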
import json
import csv

def save_json(data, filename):
    with open(filename, "w") as f:
        json.dump(data, f, indent=2)

def save_csv(data, filename):
    if not data:
        return
    with open(filename, "w", newline="") as f:
        # Column order follows the keys of the first record
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)
User request: "Scrape all product names and prices from this catalog page"
Script outline:
soup = fetch_page("https://example-store.com/catalog")
products = []
for card in soup.select("div.product-item"):
    name = card.select_one("h3.title")
    price = card.select_one("span.price")
    products.append({
        "name": name.get_text(strip=True) if name else "N/A",
        "price": price.get_text(strip=True) if price else "N/A",
    })
save_csv(products, "products.csv")
print(f"Extracted {len(products)} products")
Output:
Extracted 48 products
Saved to products.csv
Preview:
| name | price |
|-----------------------|---------|
| Wireless Keyboard | $49.99 |
| USB-C Hub 7-port | $34.99 |
| Ergonomic Mouse | $29.99 |
User request: "Pull the statistics table from this Wikipedia article"
Script outline:
soup = fetch_page("https://en.wikipedia.org/wiki/Example_Article")
table = soup.select_one("table.wikitable")
headers = [th.get_text(strip=True) for th in table.select("tr:first-child th")]
rows = []
for tr in table.select("tr")[1:]:
    cells = [td.get_text(strip=True) for td in tr.select("td")]
    # Skip rows whose cell count does not match the header (e.g. merged cells)
    if len(cells) == len(headers):
        rows.append(dict(zip(headers, cells)))
save_json(rows, "table_data.json")
print(f"Extracted {len(rows)} rows with columns: {headers}")
Output:
Extracted 25 rows with columns: ['Year', 'Population', 'Growth Rate']
Saved to table_data.json
Clean raw scraped data before loading — normalize prices, deduplicate by content hash, validate required fields.
import hashlib
from datetime import datetime, timezone

def transform_products(raw_items):
    """Clean and deduplicate scraped product data."""
    seen_hashes = set()
    clean = []
    for item in raw_items:
        # Skip items missing required fields
        if not item.get("name") or not item.get("price"):
            continue
        # Normalize price: "$1,299.99" → 129999 (cents)
        price_str = item["price"].replace(",", "").replace("$", "").strip()
        try:
            # round() rather than int(): 0.29 * 100 is 28.999... in floating point
            price_cents = round(float(price_str) * 100)
        except (ValueError, OverflowError):
            continue
        # extract_items yields {"text", "href"} for <a> elements, or None if missing
        link = item.get("link") or {}
        href = link.get("href", "") if isinstance(link, dict) else str(link)
        # Deduplicate by content hash of name and URL
        content_hash = hashlib.md5(f"{item['name']}|{href}".encode()).hexdigest()
        if content_hash in seen_hashes:
            continue
        seen_hashes.add(content_hash)
        clean.append({
            "name": item["name"][:500],
            "price_cents": price_cents,
            "url": href,
            "rating": item.get("rating"),
            "content_hash": content_hash,
            "scraped_at": datetime.now(timezone.utc).isoformat(),
        })
    return clean
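Converting prices through float can be off by a cent when the result is truncated: int(float("0.29") * 100) evaluates to 28. A minimal sketch that parses with decimal.Decimal instead, so no binary floating point is involved. The helper name price_to_cents is hypothetical, and it assumes US-style formatting ("$" prefix, "," as thousands separator), as in the examples above.

```python
from decimal import ROUND_HALF_UP, Decimal, InvalidOperation

def price_to_cents(price_text):
    """Parse a price string such as "$1,299.99" into integer cents, or None if unparseable."""
    cleaned = price_text.replace("$", "").replace(",", "").strip()
    try:
        amount = Decimal(cleaned)
    except InvalidOperation:
        return None
    if not amount.is_finite():
        # Reject "NaN" and "Infinity", which Decimal accepts as valid input
        return None
    # Scale to cents and round half up to the nearest whole cent
    return int((amount * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

laptop = price_to_cents("$1,299.99")
cable = price_to_cents("$0.29")
missing = price_to_cents("N/A")
```

Returning None for unparseable input matches the convention of using None for missing values; callers can skip those items the same way transform_products skips a ValueError.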
Batch upsert into Postgres/Supabase for persistent storage. Rows are keyed on content_hash, so re-scraping a product updates its existing row and the stored price reflects the latest scrape.
from supabase import create_client

def load_to_supabase(products, supabase_url, supabase_key):
    """Batch upsert products into Supabase with conflict handling."""
    client = create_client(supabase_url, supabase_key)
    # Upsert in batches of 100
    for i in range(0, len(products), 100):
        batch = products[i:i+100]
        client.table("products").upsert(
            batch,
            on_conflict="content_hash"  # Update if exists
        ).execute()
        print(f"  Loaded batch {i//100 + 1}: {len(batch)} records")
    return len(products)
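An upsert overwrites the previous price, so keeping a record of price changes needs a comparison step before loading. A minimal sketch of that comparison as a pure function; detect_price_changes and the shape of the change records are assumptions for illustration, and existing_rows stands for rows read from the products table before the upsert.

```python
def detect_price_changes(existing_rows, incoming_rows):
    """Return one record per product whose price differs from the stored row."""
    # Index stored prices by content_hash for constant-time lookup
    stored = {row["content_hash"]: row["price_cents"] for row in existing_rows}
    changes = []
    for row in incoming_rows:
        old_price = stored.get(row["content_hash"])
        # New products (no stored row) are not price changes
        if old_price is not None and old_price != row["price_cents"]:
            changes.append({
                "content_hash": row["content_hash"],
                "old_price_cents": old_price,
                "new_price_cents": row["price_cents"],
            })
    return changes

existing = [
    {"content_hash": "a1", "price_cents": 4999},
    {"content_hash": "b2", "price_cents": 3499},
]
incoming = [
    {"content_hash": "a1", "price_cents": 4499},  # price dropped
    {"content_hash": "b2", "price_cents": 3499},  # unchanged
    {"content_hash": "c3", "price_cents": 2999},  # new product
]
changes = detect_price_changes(existing, incoming)
```

The resulting records could be written to a separate history table before calling load_to_supabase; the table layout is left open here.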
For sites that block datacenter IPs, use Bright Data's residential proxy or Web Unlocker.
import requests
from bs4 import BeautifulSoup

def fetch_with_proxy(url, bright_data_config):
    """Fetch a page through Bright Data residential proxy."""
    # Credentials are embedded in the proxy URL; keep them out of logs
    proxy_url = (
        f"http://{bright_data_config['customer']}"
        f"-zone-{bright_data_config['zone']}"
        f":{bright_data_config['password']}"
        f"@brd.superproxy.io:22225"
    )
    response = requests.get(
        url,
        proxies={"http": proxy_url, "https": proxy_url},
        headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"},
        timeout=30,
    )
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
Respect robots.txt. Check it before scraping and honor disallow rules.
Add a delay between requests (e.g. time.sleep(1)). Do not hammer servers.
Use None for missing values.
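The robots.txt check can be sketched with the standard library's urllib.robotparser. The example below parses an inline robots.txt so it runs without network access; build_robots_checker, the sample rules, and the "research-bot" agent name are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

def build_robots_checker(robots_txt, user_agent="research-bot"):
    """Return a function that reports whether user_agent may fetch a URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines; read() would download them instead
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed

# Hypothetical rules: everything is allowed except paths under /private/
sample_rules = """User-agent: *
Disallow: /private/
"""
allowed = build_robots_checker(sample_rules)
catalog_ok = allowed("https://example-store.com/catalog")
private_ok = allowed("https://example-store.com/private/orders")
```

Against a live site, calling set_url() with the site's /robots.txt address followed by read() loads the real rules; a URL that fails the check should be skipped before fetch_page is called.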