.claude/skills/ts-data-extractor/SKILL.md
Extract structured data from any document format using unified document processing. Use when a user asks to extract data from a document, parse a PDF, pull structured data from files, convert documents to JSON or CSV, extract fields from invoices or forms, or scrape data from documents.
npx skillsauth add eliferjunior/Claude data-extractor
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Extract structured data from documents in any format: PDF, DOCX, HTML, TXT, images, and more. Converts unstructured or semi-structured content into clean JSON, CSV, or other structured formats. Handles invoices, forms, reports, and free-text documents.
When a user asks you to extract data from a document, follow this process:
# Determine file type
file document.pdf
# Install dependencies based on format
pip install pdfplumber python-docx beautifulsoup4 lxml openpyxl
Library selection by format:
PDF: pdfplumber (text + tables), PyMuPDF (fitz) for complex layouts
DOCX: python-docx
HTML: beautifulsoup4 with lxml
Spreadsheets: openpyxl or pandas
Images: pytesseract (OCR) with Pillow
PDF extraction:
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        text = page.extract_text()
        print(f"--- Page {i+1} ---")
        print(text)
        # Extract tables if present
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)
DOCX extraction:
from docx import Document

doc = Document("document.docx")
for para in doc.paragraphs:
    print(f"[{para.style.name}] {para.text}")

# Extract tables
for table in doc.tables:
    for row in table.rows:
        print([cell.text for cell in row.cells])
HTML extraction:
from bs4 import BeautifulSoup

with open("document.html") as f:
    soup = BeautifulSoup(f, "lxml")

# Extract specific elements
for table in soup.find_all("table"):
    rows = table.find_all("tr")
    for row in rows:
        cells = [td.get_text(strip=True) for td in row.find_all(["td", "th"])]
        print(cells)
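The library-by-format selection above can be tied to the per-format snippets with a small lookup keyed on file extension. This is a minimal sketch, not part of the original skill: the name `pick_library`, the mapping `LIBRARY_BY_EXTENSION`, and the exact set of extensions are illustrative assumptions.

```python
from pathlib import Path

# Extension -> suggested library, following the selection list above.
# The extensions chosen here are an assumption; adjust to the files at hand.
LIBRARY_BY_EXTENSION = {
    ".pdf": "pdfplumber",
    ".docx": "python-docx",
    ".html": "beautifulsoup4",
    ".htm": "beautifulsoup4",
    ".xlsx": "openpyxl",
    ".png": "pytesseract",
    ".jpg": "pytesseract",
    ".jpeg": "pytesseract",
}


def pick_library(path: str) -> str:
    """Return the library suggested for this file's extension (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix not in LIBRARY_BY_EXTENSION:
        raise ValueError(f"Unsupported file type: {suffix or '(no extension)'}")
    return LIBRARY_BY_EXTENSION[suffix]


print(pick_library("Invoice_0042.PDF"))  # extension match is case-insensitive
```

Extension checks are only a first guess; the `file` command shown earlier is the safer check when a file may be mislabeled.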
Once you have raw text, extract the target fields:
Pattern-based extraction:
import re
import json

text = "..."  # extracted text

# Define patterns for common fields
patterns = {
    # Accept multi-segment numbers such as INV-2025-0042
    "invoice_number": r"Invoice\s*#?\s*:?\s*(\w+(?:[-/]\w+)*)",
    "date": r"Date\s*:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})",
    # \b keeps "Total" from matching inside "Subtotal"
    "total": r"\bTotal\s*:?\s*\$?([\d,]+\.?\d*)",
    "email": r"[\w.-]+@[\w.-]+\.\w+",
}

extracted = {}
for field, pattern in patterns.items():
    match = re.search(pattern, text, re.IGNORECASE)
    if match:
        # Use the capture group when the pattern has one, else the whole match
        extracted[field] = match.group(1) if match.lastindex else match.group(0)

print(json.dumps(extracted, indent=2))
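To see how the pattern-based step behaves on concrete input, here is a small worked sketch. The sample text is invented for illustration, and `extract_fields` is a hypothetical wrapper around the same `re.search` loop. The patterns are slightly stricter variants: `\b` before `Total` avoids matching inside `Subtotal`, and the invoice-number group accepts repeated hyphen- or slash-separated segments.

```python
import re

# Invented sample text; real extractor output will look different.
SAMPLE = """Acme Corp
Invoice #: INV-2025-0042
Date: 03/15/2025
Subtotal: $1,250.00
Tax: $100.00
Total: $1,350.00
Questions? billing@acme.example
"""

PATTERNS = {
    "invoice_number": r"Invoice\s*#?\s*:?\s*(\w+(?:[-/]\w+)*)",
    "date": r"Date\s*:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})",
    "total": r"\bTotal\s*:?\s*\$?([\d,]+\.?\d*)",
    "email": r"[\w.-]+@[\w.-]+\.\w+",
}


def extract_fields(text, patterns=PATTERNS):
    """Return the first match per field; fields with no match are omitted."""
    found = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            # Prefer the capture group; fall back to the whole match.
            found[field] = match.group(1) if match.lastindex else match.group(0)
    return found


fields = extract_fields(SAMPLE)
print(fields)
```

Values come back as raw strings (`"1,350.00"`, `"03/15/2025"`), so type conversion is still a separate step.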
Line-item extraction from tables:
import pandas as pd
# table_data is a list of table rows (header row first),
# for example one table returned by page.extract_tables()
headers = table_data[0]
rows = table_data[1:]
df = pd.DataFrame(rows, columns=headers)
# Clean up
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df = df.dropna(how="all")
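When pandas is not available, roughly the same cleanup can be done with plain lists and dicts. The sketch below assumes `table_data` is shaped like one table from pdfplumber's `extract_tables()` (a list of rows whose cells are strings or `None`); the function name `table_to_records` and the sample rows are invented for illustration.

```python
def table_to_records(table_data):
    """Turn [header_row, *rows] into a list of dicts with snake_case keys."""
    headers = [
        (cell or "").strip().lower().replace(" ", "_")
        for cell in table_data[0]
    ]
    records = []
    for row in table_data[1:]:
        # Skip rows where every cell is None or blank
        # (similar in spirit to dropna(how="all")).
        if all(cell is None or not str(cell).strip() for cell in row):
            continue
        records.append(dict(zip(headers, row)))
    return records


# Invented sample rows, including one fully empty row.
table_data = [
    ["Description", "Qty", "Unit Price", "Amount"],
    ["Widget A", "10", "75.00", "750.00"],
    [None, None, None, None],
    ["Widget B", "5", "100.00", "500.00"],
]

records = table_to_records(table_data)
print(records[0])
```

Cell values stay as strings here; numeric conversion is left to the cleaning step.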
from datetime import datetime

# Type conversion (skipped if the field was not found)
if "total" in extracted:
    extracted["total"] = float(extracted["total"].replace(",", ""))

# Date normalization to an ISO 8601 date (assumes month/day/four-digit year)
if "date" in extracted:
    extracted["date"] = datetime.strptime(extracted["date"], "%m/%d/%Y").date().isoformat()

# Validate required fields
required = ["invoice_number", "date", "total"]
missing = [name for name in required if name not in extracted]
if missing:
    print(f"Warning: missing fields: {missing}")
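The date pattern used earlier accepts both `/` and `-` separators and two- or four-digit years, so a single `strptime` format may not cover every match. The sketch below tries a few candidate formats in turn. The format list, the month-first assumption, and the helper names `parse_amount`, `parse_date`, and `missing_fields` are all illustrative assumptions.

```python
from datetime import datetime

# Candidate formats are an assumption (month first, US style); extend as needed.
DATE_FORMATS = ("%m/%d/%Y", "%m-%d-%Y", "%m/%d/%y", "%m-%d-%y")


def parse_amount(raw: str) -> float:
    """Convert a string such as '$1,350.00' to a float."""
    return float(raw.replace("$", "").replace(",", "").strip())


def parse_date(raw: str) -> str:
    """Return an ISO 8601 date string, trying each candidate format in order."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")


def missing_fields(record, required=("invoice_number", "date", "total")):
    """List required fields that are absent or empty (0.0 counts as present)."""
    return [name for name in required if record.get(name) in (None, "")]


record = {"invoice_number": "INV-2025-0042", "date": "3/15/25", "total": "1,350.00"}
record["total"] = parse_amount(record["total"])
record["date"] = parse_date(record["date"])
print(record, missing_fields(record))
```

Day-first documents would need a different format list, since `03/04/2025` is ambiguous without knowing the locale.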
# JSON output
with open("extracted_data.json", "w") as f:
    json.dump(extracted, f, indent=2)
# CSV output
df.to_csv("extracted_items.csv", index=False)
# Pretty print summary
print(f"Extracted {len(extracted)} fields from document")
print(f"Line items: {len(df)} rows")
User request: "Extract the invoice details from this PDF"
Actions:
Output:
{
  "invoice_number": "INV-2025-0042",
  "date": "2025-03-15",
  "vendor": "Acme Corp",
  "subtotal": 1250.00,
  "tax": 100.00,
  "total": 1350.00,
  "line_items": [
    {"description": "Widget A", "qty": 10, "unit_price": 75.00, "amount": 750.00},
    {"description": "Widget B", "qty": 5, "unit_price": 100.00, "amount": 500.00}
  ]
}
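A nested result like the one above can be sanity-checked and flattened to CSV with the standard library alone. This is a sketch under stated assumptions: the helper names `totals_consistent` and `line_items_to_csv` are hypothetical, and the CSV column order is a choice made for this example.

```python
import csv
import io
import math

# Same values as the example output above.
invoice = {
    "invoice_number": "INV-2025-0042",
    "date": "2025-03-15",
    "vendor": "Acme Corp",
    "subtotal": 1250.00,
    "tax": 100.00,
    "total": 1350.00,
    "line_items": [
        {"description": "Widget A", "qty": 10, "unit_price": 75.00, "amount": 750.00},
        {"description": "Widget B", "qty": 5, "unit_price": 100.00, "amount": 500.00},
    ],
}


def totals_consistent(inv) -> bool:
    """Check that line items sum to the subtotal and subtotal + tax equals total."""
    items_sum = sum(item["amount"] for item in inv["line_items"])
    return (math.isclose(items_sum, inv["subtotal"])
            and math.isclose(inv["subtotal"] + inv["tax"], inv["total"]))


def line_items_to_csv(inv) -> str:
    """Flatten line items to CSV text, repeating the invoice number on each row."""
    fieldnames = ["invoice_number", "description", "qty", "unit_price", "amount"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for item in inv["line_items"]:
        writer.writerow({"invoice_number": inv["invoice_number"], **item})
    return buffer.getvalue()


csv_text = line_items_to_csv(invoice)
print(totals_consistent(invoice))
print(csv_text)
```

An arithmetic check of this kind is a cheap way to catch a misread digit or a dropped table row before the data is handed on.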
User request: "Pull all names and email addresses from this company directory document"
Actions:
Output: A CSV file with columns: name, email, department, phone.
User request: "Extract the quarterly results table from this HTML page"
Actions:
Output: A clean CSV with quarterly metrics and a summary of key figures.