PDF Processing Pro

File Index

| File | Contents | When to Load | |------|----------|--------------| | FORMS.md | AcroForm field types, filling workflows, validation, flattening | Form analysis, filling, or validation tasks | | TABLES.md | pdfplumber table settings, multi-page merge, export formats | Table extraction from any PDF | | OCR.md | Tesseract integration, preprocessing pipeline, language setup | Scanned PDFs or image-based documents | | scripts/analyze_form.py | CLI tool: extract form field metadata to JSON | Run directly for form field discovery |

Scope Boundary

| Task | This Skill | Defer To | |------|-----------|----------| | Extract text/tables from native PDFs | Yes | -- | | Fill and flatten PDF forms | Yes | -- | | OCR scanned PDFs | Yes | -- | | Batch process PDF directories | Yes | -- | | Diagnose extraction failures | Yes | -- | | Create Word documents | No | docx | | Read/write Excel files | No | xlsx | | Create presentations | No | pptx | | Generate PDF from HTML/templates | No | reportlab, weasyprint | | Edit PDF layout/design | No | Manual tools (Acrobat, LibreOffice) |

Library Selection Decision Matrix

Pick the FIRST library that matches the PDF characteristics:

| PDF Characteristic | Primary Library | Why | Fallback | |-------------------|----------------|-----|----------| | Native text, simple layout | pdfplumber | Best text positioning, word-level coords | pypdf for metadata-only | | Tables with visible gridlines | pdfplumber | Line-based table detection built in | camelot (Lattice mode) | | Tables WITHOUT gridlines | camelot (Stream mode) | Text-position clustering superior | pdfplumber with text strategy | | Form fields (AcroForm) | pypdf | Native AcroForm read/write support | -- | | Form fields (XFA) | REJECT or pdfrw | pypdf/pdfplumber cannot read XFA; warn user XFA support is limited | External: Adobe API | | Scanned/image-based | pdf2image + pytesseract | Convert to image then OCR | ocrmypdf for batch | | Mixed (some pages scanned, some native) | Detect per-page, route accordingly | See Hybrid Detection below | -- | | Encrypted (user password) | pypdf with decrypt() | Built-in decryption support | pikepdf for owner-password | | Encrypted (owner password, no user pw) | pikepdf | Can remove restrictions when no user password set | -- | | Very large (500+ pages) | pdfplumber page-by-page | Stream processing, never load full doc | pypdf with lazy loading | | PDF/A archival format | pypdf | Preserves PDF/A compliance on write | -- |

Hybrid Detection Procedure

Before processing any PDF, determine its nature:

def classify_pdf_page(page):
    """Returns 'native', 'scanned', or 'mixed'."""
    text = page.extract_text() or ""
    images = page.images
    word_count = len(text.split())
    if word_count > 20 and not images:
        return "native"
    if word_count < 5 and images:
        return "scanned"
    return "mixed"  # has both -- extract text first, OCR image regions

PDF Forensics Checklist

When extraction produces garbage, empty strings, or wrong characters, diagnose in this order:

| Check | How | What It Means | |-------|-----|---------------| | 1. Text embedded? | page.extract_text() returns content | If empty: scanned PDF, route to OCR | | 2. Encoding correct? | Look for \x00 between chars, or (cid:XX) in output | CIDFont mapping missing -- try pikepdf to extract or re-encode | | 3. Text vs image layer? | len(page.images) > 0 AND len(page.extract_text()) > 0 | Both present: text may be hidden behind image overlay (common in redacted PDFs) | | 4. Custom font encoding? | Characters display correctly in viewer but extract as symbols | Font uses custom ToUnicode CMap -- extract with pdfplumber using x_tolerance=1 | | 5. Right-to-left text? | Arabic/Hebrew content appears reversed | Use pdfplumber -- it handles bidi better than pypdf | | 6. Multi-column layout? | Text from adjacent columns interleaved | Crop to column bounding boxes, extract each separately | | 7. Rotated pages? | Content appears rotated 90/180/270 degrees | Check page.rotation -- apply inverse before extraction | | 8. Form type? | reader.get_fields() returns None but form visible | XFA form -- pypdf cannot read; check /AcroForm/XFA in trailer |

Table Extraction Failure Modes

| Failure | Symptom | Fix | |---------|---------|-----| | No gridlines | extract_tables() returns [] | Switch to "vertical_strategy": "text", "horizontal_strategy": "text" | | Merged cells | None values in middle of rows | Post-process: forward-fill None values from left/above | | Multi-page table | Headers repeat, data split across pages | Detect repeated header row, concatenate data rows, skip duplicate headers | | Rotated text in cells | Cell values empty but table structure detected | Extract chars with rotation attribute, reconstruct text | | Nested tables | Outer table detected, inner tables become single cells | Crop to inner cell bbox, run extract_tables() on cropped region | | Spanning columns | Column count inconsistent across rows | max_cols = max(len(r) for r in table), pad short rows | | Invisible borders (white lines) | Looks bordered in viewer but lines strategy fails | Use "edge_min_length": 1 and lower "snap_tolerance" to 1 | | Table is actually an image | No text or lines detected in table region | Crop region to image, OCR that region only |

OCR Quality Decision

| DPI of Source | Expected Accuracy | Recommendation | |--------------|-------------------|----------------| | <150 | <70% -- unreliable | Reject or request re-scan at 300 DPI | | 150-200 | 70-85% -- usable with review | Process with aggressive preprocessing | | 200-300 | 85-95% -- production quality | Standard preprocessing sufficient | | >300 | 95%+ -- diminishing returns | Use as-is; higher DPI wastes processing time |

OCR Preprocessing Pipeline (in order)

Deskew -- correct rotation (>0.5 degree skew kills accuracy)
Binarize -- convert to black/white (Otsu threshold, not fixed)
Denoise -- median filter (kernel=3 for 300 DPI, kernel=5 for 150 DPI)
Remove borders -- crop scan artifacts and black edges
Scale -- upscale to 300 DPI if source is lower (bicubic interpolation)

Skip steps that don't apply. Order matters -- deskew before binarize prevents artifacts.

Tesseract PSM Selection

| Page Layout | PSM Value | When to Use | |-------------|-----------|-------------| | Full page, single column | --psm 3 (default) | Standard documents | | Single column, variable text sizes | --psm 4 | Invoices, receipts | | Single line of text | --psm 7 | Extracting one field | | Single word | --psm 8 | Captchas, labels | | Sparse text on page | --psm 11 | Forms with scattered fields | | Table/structured data | --psm 6 | Tables, spreadsheets |

Form Processing Edge Cases

| Scenario | Detection | Handling | |----------|-----------|----------| | AcroForm vs XFA | Check reader.trailer['/Root'].get('/AcroForm') for /XFA key | AcroForm: use pypdf. XFA: warn user, suggest Adobe API or pdfrw | | Read-only fields | field_flags & 1 == 1 | Skip during fill, preserve existing value | | Calculated fields | Field has /AA (additional actions) with /C (calculate) | DO NOT overwrite -- value auto-computes from other fields | | Digital signature fields | Field type /Sig | NEVER fill -- invalidates existing signatures | | Checkbox on-value varies | On-value can be /Yes, /On, /1, or custom | Read /AP/N keys to discover actual on-value per field | | Radio button groups | Multiple fields share same /T name | Set value on parent group, not individual buttons | | Rich text fields | /Ff bit 26 set (RichText) | Value may contain XML; pass plain text, let renderer format | | Barcode fields | Custom field type, renders barcode from value | Validate barcode format (Code128, QR, etc.) before filling |

Batch Processing Architecture

| PDF Count | File Size (avg) | Strategy | Why | |-----------|----------------|----------|-----| | <50 | Any | Sequential for loop | Overhead of parallelism not worth it | | 50-500 | <5MB | ProcessPoolExecutor(max_workers=cpu_count) | CPU-bound extraction benefits from multiprocessing | | 50-500 | >5MB | ProcessPoolExecutor(max_workers=4) | Cap workers to avoid memory exhaustion | | 500+ | <5MB | Chunked multiprocessing (chunks of 100) | Prevents file descriptor exhaustion | | 500+ | >5MB | Sequential with explicit gc.collect() per file | Memory safety over speed | | Any | OCR needed | ProcessPoolExecutor (OCR is CPU-heavy) | Tesseract releases GIL; true parallelism |

Memory rule of thumb: pdfplumber uses ~10x file size in RAM. A 50MB PDF needs ~500MB. Plan worker count accordingly.

Rationalization Table

| Decision | Rationale | |----------|-----------| | pdfplumber as default over pypdf | Superior text positioning and table detection; pypdf better only for form fields and metadata | | Diagnose before fix | 80% of extraction failures are misdiagnosed (e.g., treating encoding issues as OCR problems) | | Classify pages individually | Mixed PDFs (scanned + native) are common; blanket OCR wastes time and reduces accuracy on native pages | | Reject low-DPI scans | Sub-150 DPI OCR produces errors that compound downstream; better to fail fast than propagate garbage | | Sequential for large files | Memory exhaustion from parallel large-PDF processing causes silent corruption; safety over speed | | XFA as explicit rejection | Silently failing on XFA forms wastes hours of debugging; explicit detection and warning saves time |

Red Flags

Applying OCR to a native PDF -- text extraction failed for another reason (encoding, font mapping); OCR will produce inferior results
Using pypdf for table extraction -- pypdf has no table detection; always use pdfplumber or camelot for tables
Loading entire PDF into memory for batch processing -- process page-by-page or file-by-file
Ignoring checkbox on-values -- assuming /Yes when the PDF uses /On or /1 causes silent fill failures
Filling calculated or signature fields -- overwrites auto-computed values or invalidates signatures
Using fixed binarization threshold for OCR -- Otsu adaptive thresholding handles varying scan quality; fixed thresholds fail on light or dark scans
Skipping deskew before OCR -- even 1-2 degree rotation drops accuracy 10-20%
Parallel processing 500+ large PDFs without memory limits -- will exhaust RAM and crash or corrupt output

NEVER

NEVER fill a /Sig (signature) field -- invalidates the document's digital signatures
NEVER assume XFA forms work with pypdf -- they silently return empty fields; detect and warn explicitly
NEVER OCR a native-text PDF without first checking why text extraction failed -- the problem is encoding/fonts, not image-based content
NEVER use threading for CPU-bound PDF extraction -- GIL prevents parallelism; use multiprocessing
NEVER skip the forensics checklist when extraction returns empty or garbage -- random fixes waste more time than systematic diagnosis

PDF Processing Pro

File Index

Scope Boundary

Library Selection Decision Matrix

Pick the FIRST library that matches the PDF characteristics:

Hybrid Detection Procedure

Before processing any PDF, determine its nature:

def classify_pdf_page(page):
    """Returns 'native', 'scanned', or 'mixed'."""
    text = page.extract_text() or ""
    images = page.images
    word_count = len(text.split())
    if word_count > 20 and not images:
        return "native"
    if word_count < 5 and images:
        return "scanned"
    return "mixed"  # has both -- extract text first, OCR image regions

PDF Forensics Checklist

When extraction produces garbage, empty strings, or wrong characters, diagnose in this order:

Table Extraction Failure Modes

OCR Quality Decision

OCR Preprocessing Pipeline (in order)

Deskew -- correct rotation (>0.5 degree skew kills accuracy)
Binarize -- convert to black/white (Otsu threshold, not fixed)
Denoise -- median filter (kernel=3 for 300 DPI, kernel=5 for 150 DPI)
Remove borders -- crop scan artifacts and black edges
Scale -- upscale to 300 DPI if source is lower (bicubic interpolation)

Skip steps that don't apply. Order matters -- deskew before binarize prevents artifacts.

Tesseract PSM Selection

Form Processing Edge Cases

Batch Processing Architecture

Memory rule of thumb: pdfplumber uses ~10x file size in RAM. A 50MB PDF needs ~500MB. Plan worker count accordingly.

Rationalization Table

Red Flags

Applying OCR to a native PDF -- text extraction failed for another reason (encoding, font mapping); OCR will produce inferior results
Using pypdf for table extraction -- pypdf has no table detection; always use pdfplumber or camelot for tables
Loading entire PDF into memory for batch processing -- process page-by-page or file-by-file
Ignoring checkbox on-values -- assuming /Yes when the PDF uses /On or /1 causes silent fill failures
Filling calculated or signature fields -- overwrites auto-computed values or invalidates signatures
Using fixed binarization threshold for OCR -- Otsu adaptive thresholding handles varying scan quality; fixed thresholds fail on light or dark scans
Skipping deskew before OCR -- even 1-2 degree rotation drops accuracy 10-20%
Parallel processing 500+ large PDFs without memory limits -- will exhaust RAM and crash or corrupt output

NEVER

NEVER fill a /Sig (signature) field -- invalidates the document's digital signatures
NEVER assume XFA forms work with pypdf -- they silently return empty fields; detect and warn explicitly
NEVER OCR a native-text PDF without first checking why text extraction failed -- the problem is encoding/fonts, not image-based content
NEVER use threading for CPU-bound PDF extraction -- GIL prevents parallelism; use multiprocessing
NEVER skip the forensics checklist when extraction returns empty or garbage -- random fixes waste more time than systematic diagnosis

Adoption

sharkitect-solutions/pdf-processing-pro

$ install --global

Security Scan Results

SKILL.md

PDF Processing Pro

File Index

Scope Boundary

Library Selection Decision Matrix

Hybrid Detection Procedure

PDF Forensics Checklist

Table Extraction Failure Modes

OCR Quality Decision

OCR Preprocessing Pipeline (in order)

Tesseract PSM Selection

Form Processing Edge Cases

Batch Processing Architecture

Rationalization Table

Red Flags

NEVER

Related Skills

sharkitect-solutions/paid-ads

sharkitect-solutions/skills/using-sharkitect-methodology

sharkitect-solutions/end-session

sharkitect-solutions/humanizer

sharkitect-solutions/pdf-processing-pro

$ install --global

Security Scan Results

SKILL.md

PDF Processing Pro

File Index

Scope Boundary

Library Selection Decision Matrix

Hybrid Detection Procedure

PDF Forensics Checklist

Table Extraction Failure Modes

OCR Quality Decision

OCR Preprocessing Pipeline (in order)

Tesseract PSM Selection

Form Processing Edge Cases

Batch Processing Architecture

Rationalization Table

Red Flags

NEVER

Related Skills

sharkitect-solutions/paid-ads

sharkitect-solutions/skills/using-sharkitect-methodology

sharkitect-solutions/end-session

sharkitect-solutions/humanizer