.cursor/skills/pdf-processing-pro/SKILL.md
Production-ready PDF processing with forms, tables, OCR, validation, and batch operations. Use when working with complex PDF workflows in production environments, processing large volumes of PDFs, or requiring robust error handling and validation.
npx skillsauth add nilecui/SkillsBase PDF Processing Pro
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Production-ready PDF processing toolkit with pre-built scripts, comprehensive error handling, and support for complex workflows.
import pdfplumber

# Open the PDF and print the text of the first page
with pdfplumber.open("document.pdf") as pdf:
    text = pdf.pages[0].extract_text()
    print(text)
python scripts/analyze_form.py input.pdf --output fields.json
# Returns: JSON with all form fields, types, and positions
python scripts/fill_form.py input.pdf data.json output.pdf
# Validates all fields before filling, includes error reporting
python scripts/extract_tables.py report.pdf --output tables.csv
# Extracts all tables with automatic column detection
All scripts include:
--help flag for all scripts
For complete form workflows:
See FORMS.md
For complex table extraction:
See TABLES.md
For scanned PDFs and image-based documents:
See OCR.md
analyze_form.py - Extract form field information
python scripts/analyze_form.py input.pdf [--output fields.json] [--verbose]
fill_form.py - Fill PDF forms with data
python scripts/fill_form.py input.pdf data.json output.pdf [--validate]
validate_form.py - Validate form data before filling
python scripts/validate_form.py data.json schema.json
extract_tables.py - Extract tables to CSV/Excel
python scripts/extract_tables.py input.pdf [--output tables.csv] [--format csv|excel]
extract_text.py - Extract text with formatting preservation
python scripts/extract_text.py input.pdf [--output text.txt] [--preserve-formatting]
merge_pdfs.py - Merge multiple PDFs
python scripts/merge_pdfs.py file1.pdf file2.pdf file3.pdf --output merged.pdf
split_pdf.py - Split PDF into individual pages
python scripts/split_pdf.py input.pdf --output-dir pages/
validate_pdf.py - Validate PDF integrity
python scripts/validate_pdf.py input.pdf
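Before handing a file to validate_pdf.py, a cheap pre-check can catch truncated or mislabeled uploads. The sketch below is not that script's implementation, which is not shown here. It only tests two structural markers a well-formed PDF is expected to have: a `%PDF-` header at the start and an `%%EOF` marker near the end. The helper name `looks_like_pdf` and the 1024-byte tail window are assumptions made for illustration.

```python
import os


def looks_like_pdf(path, tail_bytes=1024):
    """Cheap structural pre-check (hypothetical helper, not a bundled script).

    Passing does not prove the file is a valid PDF; failing means it is
    very likely truncated or not a PDF at all.
    """
    with open(path, "rb") as f:
        # A PDF is expected to begin with the "%PDF-" header.
        if f.read(5) != b"%PDF-":
            return False
        # Read only the tail of the file instead of loading it whole.
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - tail_bytes))
        tail = f.read()
    # The end-of-file marker may be followed by whitespace.
    return b"%%EOF" in tail
```

Because only the first five bytes and the tail are read, the check stays cheap even for very large files. It complements a full validation pass rather than replacing it.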
# 1. Analyze form structure
python scripts/analyze_form.py template.pdf --output schema.json
# 2. Validate submission data
python scripts/validate_form.py submission.json schema.json
# 3. Fill form
python scripts/fill_form.py template.pdf submission.json completed.pdf
# 4. Validate output
python scripts/validate_pdf.py completed.pdf
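The four steps above can also be chained from Python so that the workflow stops at the first failing step. This is a minimal sketch, assuming each script signals failure with a non-zero exit code as described in the error-handling section. The names `STEPS` and `run_steps` are hypothetical, and the `runner` parameter exists only so the function can be exercised without the real scripts.

```python
import subprocess
import sys

# Steps mirror the form workflow; file names are the ones used there.
STEPS = [
    ["scripts/analyze_form.py", "template.pdf", "--output", "schema.json"],
    ["scripts/validate_form.py", "submission.json", "schema.json"],
    ["scripts/fill_form.py", "template.pdf", "submission.json", "completed.pdf"],
    ["scripts/validate_pdf.py", "completed.pdf"],
]


def run_steps(steps, runner=subprocess.run):
    """Run steps in order; return (step_number, returncode) of the first
    failure, or None if every step exits with 0."""
    for number, args in enumerate(steps, start=1):
        # Use the current interpreter so the scripts share its environment.
        result = runner([sys.executable, *args])
        if result.returncode != 0:
            return number, result.returncode
    return None
```

Calling `run_steps(STEPS)` returns `None` when the whole workflow succeeds. Otherwise it returns the 1-based step number together with that step's exit code, which the caller can log or pass to `sys.exit`.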
# 1. Extract tables
python scripts/extract_tables.py monthly_report.pdf --output data.csv
# 2. Extract text for analysis
python scripts/extract_text.py monthly_report.pdf --output report.txt
import glob
import subprocess
from pathlib import Path

# Process all PDFs in directory
output_dir = Path("processed")
output_dir.mkdir(exist_ok=True)

for pdf_file in glob.glob("invoices/*.pdf"):
    # Extracted text goes to a .txt file named after the source PDF
    output_file = output_dir / Path(pdf_file).with_suffix(".txt").name
    result = subprocess.run([
        "python", "scripts/extract_text.py",
        pdf_file,
        "--output", str(output_file)
    ], capture_output=True, text=True)  # text=True decodes stderr to str

    if result.returncode == 0:
        print(f"✓ Processed: {pdf_file}")
    else:
        print(f"✗ Failed: {pdf_file} - {result.stderr}")
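For larger batches, the sequential loop above can be run concurrently. The sketch below is one possible approach, not part of the bundled scripts: it uses a thread pool, which is sufficient here because each worker only waits on a child process. The helper names `extract_text_command`, `run_one`, and `run_batch`, and the default of four workers, are assumptions for illustration.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_text_command(pdf_path, output_dir):
    """Build the extract_text.py command line for one PDF (hypothetical helper)."""
    pdf_path = Path(pdf_path)
    output_file = Path(output_dir) / pdf_path.with_suffix(".txt").name
    return [sys.executable, "scripts/extract_text.py",
            str(pdf_path), "--output", str(output_file)]


def run_one(command):
    """Run one command; return (command, returncode, stderr text)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return command, result.returncode, result.stderr.strip()


def run_batch(commands, max_workers=4):
    """Run commands concurrently; results are returned in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))
```

A caller would build one command per file, for example with `extract_text_command(p, "processed")` for each path in `invoices/`, pass the list to `run_batch`, and then report every entry whose return code is non-zero.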
All scripts follow consistent error patterns:
# Exit codes
# 0 - Success
# 1 - File not found
# 2 - Invalid input
# 3 - Processing error
# 4 - Validation error
# Example usage in automation
result = subprocess.run(["python", "scripts/fill_form.py", ...])
if result.returncode == 0:
    print("Success")
elif result.returncode == 4:
    print("Validation failed - check input data")
else:
    print(f"Error occurred: {result.returncode}")
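In automation code, the exit codes listed above are easier to work with as named constants than as bare integers. The following is a minimal sketch: the `ExitCode` enum and `describe_exit` helper are hypothetical names, and only the numeric values and their meanings come from the list above.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes documented for the scripts."""
    SUCCESS = 0
    FILE_NOT_FOUND = 1
    INVALID_INPUT = 2
    PROCESSING_ERROR = 3
    VALIDATION_ERROR = 4


def describe_exit(returncode):
    """Return a readable label for a script's return code."""
    try:
        return ExitCode(returncode).name.replace("_", " ").lower()
    except ValueError:
        # Codes outside the documented range, e.g. from a crash or a signal.
        return f"unknown exit code {returncode}"
```

Since `ExitCode` is an `IntEnum`, comparisons such as `result.returncode == ExitCode.VALIDATION_ERROR` work directly against the integer returned by `subprocess.run`.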
All scripts require:
pip install pdfplumber pypdf pillow pytesseract pandas
Optional for OCR:
# Install tesseract-ocr system package
# macOS: brew install tesseract
# Ubuntu: apt-get install tesseract-ocr
# Windows: Download from GitHub releases
--parallel flag (where supported)
"Module not found" errors:
pip install -r requirements.txt
Tesseract not found:
# Install tesseract system package (see Dependencies)
Memory errors with large PDFs:
# Process page by page instead of loading entire PDF
with pdfplumber.open("large.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        # Process page immediately
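One way to "process page immediately" is to stream each page's text straight to disk so that no more than one page of extracted text is held at a time. The sketch below assumes only that each page object exposes `extract_text()`, as pdfplumber pages do. The function name `write_pages_to_text` and the form-feed page separator are illustrative choices, and the library may still keep its own per-page caches, which this sketch does not address.

```python
def write_pages_to_text(pages, out_path, separator="\f"):
    """Write text to out_path one page at a time; return the page count.

    `pages` is any iterable of objects with an extract_text() method.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for page in pages:
            # Treat a missing result (e.g. an image-only page) as empty text.
            out.write(page.extract_text() or "")
            out.write(separator)
            count += 1
    return count
```

With pdfplumber this would be called inside the `with pdfplumber.open("large.pdf") as pdf:` block as `write_pages_to_text(pdf.pages, "large.txt")`.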
Permission errors:
chmod +x scripts/*.py
All scripts support --help:
python scripts/analyze_form.py --help
python scripts/extract_tables.py --help
For detailed documentation on specific topics, see FORMS.md, TABLES.md, and OCR.md.