dist/pi/skills/parsing-documents/SKILL.md
Extract structured data from PDF documents — text, tables, forms, and metadata. Use when reading or extracting content from a `.pdf` file, parsing invoices/reports/scanned documents, or converting PDF data to JSON/CSV. NOT for generating PDFs, and NOT for plain-text/markdown files (read those directly).
npx skillsauth add alexei-led/claude-code-config parsing-documents
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Extract structured information from PDF documents. Try the cheapest reliable method first.
Use the Read tool directly — it handles PDFs natively, including scanned/image PDFs:
Read → /path/to/document.pdf
pdftotext document.pdf output.txt # Extract text
pdftotext -layout document.pdf - # Preserve layout
pdfinfo document.pdf # PDF metadata
pdftoppm -png document.pdf output # Convert pages to images
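When these commands are driven from a script, they can be wrapped with `subprocess`. The following is a minimal sketch, assuming `pdftotext` is installed and on `PATH`; the helper names `build_pdftotext_cmd` and `run_pdftotext` are hypothetical and not part of any library.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, layout=False):
    """Build the pdftotext argument list, writing text to stdout ("-")."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # preserve the physical page layout
    cmd += [pdf_path, "-"]
    return cmd


def run_pdftotext(pdf_path, layout=False):
    """Return the extracted text, or None if pdftotext is missing or fails."""
    if shutil.which("pdftotext") is None:
        return None  # tool not installed; the caller should fall back
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, layout),
        capture_output=True,
        text=True,
    )
    return result.stdout if result.returncode == 0 else None
```

Returning `None` instead of raising keeps the caller free to try the next method when the CLI is unavailable.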
pdfplumber — best for tables:
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        text = page.extract_text()
pypdf — metadata and forms:
from pypdf import PdfReader
reader = PdfReader("document.pdf")
metadata = reader.metadata
fields = reader.get_form_text_fields()
def extract_tables(pdf_path):
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, 1):
            for table in page.extract_tables():
                tables.append({"page": page_num, "data": table})
    return tables
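Each extracted table is a list of rows, so a common next step is turning it into keyed records. A minimal sketch, assuming the first row of the table is a header row (this will not hold for every PDF); `table_to_records` is a hypothetical helper name and the sample rows are made up.

```python
def table_to_records(table):
    """Convert one table (a list of rows) into dicts keyed by the header row."""
    if not table:
        return []
    header, *rows = table
    # Empty cells can come back as None; normalise them to "".
    keys = [(cell or "").strip() for cell in header]
    return [
        {key: (cell or "").strip() for key, cell in zip(keys, row)}
        for row in rows
    ]


# Made-up sample in the same shape as one entry of a page's tables.
sample = [["Item", "Qty"], ["Widget", "2"], ["Gadget", None]]
records = table_to_records(sample)
```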
def extract_region(pdf_path, page_num, bbox):
    # page_num is a 0-based index into pdf.pages; extract_tables() above reports 1-based pages.
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_num]
        return page.crop(bbox).extract_text()  # bbox = (x0, top, x1, bottom)
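Absolute coordinates are awkward to hard-code when page sizes vary. One possible approach is to express the region as fractions of the page and scale it to a `(x0, top, x1, bottom)` tuple. This is a sketch; `fractional_bbox` is a hypothetical helper, not a pdfplumber function.

```python
def fractional_bbox(page_width, page_height, left, top, right, bottom):
    """Scale fractions in [0, 1] to an absolute (x0, top, x1, bottom) bbox."""
    for value in (left, top, right, bottom):
        if not 0.0 <= value <= 1.0:
            raise ValueError("fractions must be between 0 and 1")
    if left >= right or top >= bottom:
        raise ValueError("bbox must have positive width and height")
    return (
        left * page_width,
        top * page_height,
        right * page_width,
        bottom * page_height,
    )


# Top half of a 612 x 792 point page (US Letter).
bbox = fractional_bbox(612, 792, 0.0, 0.0, 1.0, 0.5)
```

With pdfplumber, `page.width` and `page.height` would supply the first two arguments.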
import re
patterns = {
    "invoice_number": r"Invoice\s*#?\s*(\d+)",
    "date": r"Date:\s*([\d/\-]+)",
    "total": r"Total:\s*\$?([\d,]+\.?\d*)",
}
data = {
    key: (m.group(1) if (m := re.search(p, text, re.IGNORECASE)) else None)
    for key, p in patterns.items()
}
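To see what the patterns above capture, here is a small worked example on made-up invoice text. The sample string and the `extract_fields` and `parse_total` helpers are illustrative assumptions; real invoices will likely need adjusted patterns.

```python
import re
from decimal import Decimal, InvalidOperation

patterns = {
    "invoice_number": r"Invoice\s*#?\s*(\d+)",
    "date": r"Date:\s*([\d/\-]+)",
    "total": r"Total:\s*\$?([\d,]+\.?\d*)",
}


def extract_fields(text):
    """Apply each pattern; fields that do not match are None, never guessed."""
    return {
        key: (m.group(1) if (m := re.search(p, text, re.IGNORECASE)) else None)
        for key, p in patterns.items()
    }


def parse_total(raw):
    """Convert a captured total such as "1,234.50" to Decimal, or None."""
    if not raw:
        return None
    try:
        return Decimal(raw.replace(",", ""))
    except InvalidOperation:
        return None  # captured text was not a usable number


# Made-up sample text, standing in for text extracted from a PDF.
sample_text = "Invoice #10423\nDate: 2024-03-15\nTotal: $1,234.50\n"
fields = extract_fields(sample_text)
total = parse_total(fields["total"])
```

Keeping the captured strings separate from the parsed values makes it easy to report exactly what was found in the document.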
JSON for structured extraction:
output = {
    "metadata": {"title": "...", "pages": 5},
    "pages": [{"page_number": 1, "text": "...", "tables": [...]}],
}
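The structure above can be assembled from per-page results and serialised with the standard `json` module. A minimal sketch, assuming page texts and page tables arrive as parallel lists; `build_output` is a hypothetical helper and the sample values are made up.

```python
import json


def build_output(title, page_texts, page_tables):
    """Assemble the metadata/pages structure from parallel per-page lists."""
    pages = [
        {"page_number": number, "text": text, "tables": tables}
        for number, (text, tables) in enumerate(zip(page_texts, page_tables), 1)
    ]
    return {"metadata": {"title": title, "pages": len(pages)}, "pages": pages}


# One page, one table with an empty cell (None becomes null in JSON).
output = build_output("Sample", ["Hello"], [[[["a", "b"], ["1", None]]]])
encoded = json.dumps(output, indent=2, ensure_ascii=False)
```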
CSV for table data: `csv.writer(f).writerows(table)`.
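A slightly fuller sketch of the CSV path, assuming cells may be `None` or contain embedded newlines (both are treated here as things to normalise, which is a design choice rather than a requirement); `table_to_csv` is a hypothetical helper and the sample rows are made up.

```python
import csv
import io


def table_to_csv(table):
    """Serialise one table (a list of rows) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(
        # Empty cells become "", and line breaks inside a cell become spaces.
        ["" if cell is None else str(cell).replace("\n", " ") for cell in row]
        for row in table
    )
    return buffer.getvalue()


csv_text = table_to_csv([["Item", "Price"], ["Widget, large", None]])
```

When writing to a file instead of a string, open it with `newline=""` so the `csv` module controls line endings.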
pdftotext; tables → pdfplumber; forms → pypdf
pdfinfo; may need a password: ask the user, do not guess
pdftotext
table_settings = {"vertical_strategy": "lines", "horizontal_strategy": "lines"}
tables = page.extract_tables(table_settings)
If content cannot be reliably extracted (scanned images, encryption), report the failure explicitly rather than inventing values. If the task would require changes beyond extraction, stop and ask.
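One way to make such failures explicit is to summarise which pages produced usable text before anything is returned to the user. A minimal sketch; the `summarize_extraction` helper, the status labels, and the `min_chars` threshold are all assumptions chosen for illustration.

```python
def summarize_extraction(page_texts, min_chars=20):
    """Report which pages yielded usable text instead of guessing content."""
    empty_pages = [
        number
        for number, text in enumerate(page_texts, 1)
        if len((text or "").strip()) < min_chars
    ]
    if not page_texts or len(empty_pages) == len(page_texts):
        status = "failed"  # nothing usable, e.g. a scanned or encrypted PDF
    elif empty_pages:
        status = "partial"  # some pages need another method
    else:
        status = "ok"
    return {"status": status, "empty_pages": empty_pages}


report = summarize_extraction(["x" * 50, None])
```

A `"failed"` or `"partial"` status is a cue to report the problem or try another extraction method, not to fill in the missing values.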