agentic/code/addons/doc-intelligence/skills/pdf-extractor/SKILL.md
Extract text, tables, and images from PDF files. Use when converting PDF documentation, manuals, or reports to searchable text.
Single responsibility: Extract structured content (text, tables, images) from PDF files into organized, searchable formats. (BP-4)
Before executing, VERIFY:
- The file is a PDF (`file <path>` confirms PDF format)
- The PDF is readable (`pdfinfo <path>` returns metadata)

DO NOT proceed without verification. Inspect PDF metadata first.
ASK USER instead of guessing when:
NEVER assume PDF structure without inspection.
| Context Type | Included | Excluded |
|--------------|----------|----------|
| RELEVANT | Target PDF, extraction options, output path | Other PDF files |
| PERIPHERAL | Similar PDF structure examples | Unrelated documents |
| DISTRACTOR | Previous extraction attempts | Other file formats |
# Check file type
file document.pdf
# Get PDF metadata
pdfinfo document.pdf
# Check page count
pdfinfo document.pdf | grep Pages
# Check if encrypted
pdfinfo document.pdf | grep Encrypted
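The checks above can also be scripted. The following is a minimal sketch, assuming the text printed by `pdfinfo` has already been captured as a string; `parse_pdfinfo` and `preflight` are hypothetical helper names, not part of this skill or of skill-seekers.

```python
def parse_pdfinfo(output: str) -> dict[str, str]:
    """Turn the "Key:   value" lines printed by pdfinfo into a dict."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so values that contain colons
        # (dates, permission flags) stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def preflight(info: dict[str, str]) -> list[str]:
    """Return the problems that should pause extraction (hypothetical policy)."""
    problems = []
    pages = info.get("Pages", "")
    if not pages.isdigit() or int(pages) == 0:
        problems.append("no page count reported")
    if info.get("Encrypted", "no").lower().startswith("yes"):
        problems.append("encrypted: ask the user for the password")
    return problems


# Illustrative sample only; real pdfinfo output has more fields.
sample = """Title:          Product Manual
Pages:          120
Encrypted:      no
"""
info = parse_pdfinfo(sample)
print(info["Pages"], preflight(info))
```

An empty problem list means the metadata gives no reason to stop; anything else is a prompt to ask the user rather than guess.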
| PDF Type | Detection | Strategy |
|----------|-----------|----------|
| Text-based | pdftotext produces readable text | Direct extraction |
| Scanned/Image | pdftotext produces empty/garbled | OCR required |
| Mixed | Some pages text, some images | Hybrid approach |
| Tables | Visual grid patterns | Table extraction mode |
| Forms | Interactive fields | Form field extraction |
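The first three rows of the table can be approximated with a per-page readability heuristic. This is a sketch under stated assumptions: the thresholds (`min_chars`, `min_ratio`) and the function names are illustrative choices, not values defined by this skill, and tables or forms are not detected here.

```python
def looks_like_text(page: str, min_chars: int = 50, min_ratio: float = 0.8) -> bool:
    """Heuristic: a page counts as text if it has enough readable characters."""
    stripped = "".join(page.split())  # drop all whitespace
    if len(stripped) < min_chars:
        return False
    readable = sum(ch.isalnum() or ch in ".,;:!?'\"()-%/" for ch in stripped)
    return readable / len(stripped) >= min_ratio


def classify_pdf(pages: list[str]) -> str:
    """Map per-page results onto the Text-based / Scanned / Mixed rows."""
    flags = [looks_like_text(page) for page in pages]
    if flags and all(flags):
        return "text-based"   # direct extraction
    if not any(flags):
        return "scanned"      # OCR required (also returned for no pages)
    return "mixed"            # hybrid approach


body = "The quick brown fox jumps over the lazy dog. " * 3
print(classify_pdf([body, body]))
print(classify_pdf([body, ""]))
```

The page strings are expected to come from an earlier extraction pass, for example `pdftotext` output split per page.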
Option A: With skill-seekers (if installed)
# Basic extraction
skill-seekers pdf --pdf document.pdf --name myskill
# With table extraction
skill-seekers pdf --pdf document.pdf --name myskill --extract-tables
# With OCR for scanned docs
skill-seekers pdf --pdf document.pdf --name myskill --ocr
# With parallel processing (large PDFs)
skill-seekers pdf --pdf document.pdf --name myskill --parallel --workers 8
# Password-protected
skill-seekers pdf --pdf document.pdf --name myskill --password "secret"
Option B: Manual extraction guidance
# Basic text extraction
pdftotext -layout document.pdf output.txt
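# Extract with Unix line endings (pages are separated by form feeds by default)
pdftotext -layout -eol unix document.pdf output.txt
# Extract images (pdfimages takes a filename prefix, so the directory must exist)
mkdir -p images
pdfimages -all document.pdf images/img
# OCR scanned PDF (requires tesseract, which reads one image per call)
pdftoppm -r 300 -png document.pdf page
for f in page-*.png; do tesseract "$f" "${f%.png}" -l eng; done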
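`pdftotext` ends each page with a form feed character (`\f`) unless `-nopgbrk` is given, so explicit page markers can be added after extraction. The sketch below assumes that default behavior; the marker format `--- page N ---` and the function name are hypothetical.

```python
def add_page_markers(text: str) -> str:
    """Insert a visible marker before each form-feed-delimited page."""
    pages = text.split("\f")
    # The form feed after the last page leaves an empty trailing element.
    if pages and pages[-1].strip() == "":
        pages.pop()
    parts = []
    for number, page in enumerate(pages, start=1):
        parts.append(f"--- page {number} ---\n{page.rstrip()}\n")
    return "\n".join(parts)


raw = "First page text\n\fSecond page text\n\f"
print(add_page_markers(raw))
```

The same split also yields per-page strings for quality checks or chunked processing.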
# Check extraction quality
head -100 output/<skill-name>/references/content.md
# Verify table extraction
grep -A 10 "| " output/<skill-name>/references/*.md
# Check image extraction
ls -la output/<skill-name>/assets/images/
On error:
- File not found → Verify path
- Password required → Ask user for password
- Corrupt PDF → Try repair with qpdf --check
- OCR failed → Check tesseract installation, language packs
- Memory error → Process in chunks, reduce workers

State saved to: .aiwg/working/checkpoints/pdf-extractor/
For large PDFs, extraction saves progress per chunk:
checkpoints/pdf-extractor/
├── document_metadata.json
├── pages_1-50.json
├── pages_51-100.json
└── current_position.json
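Resuming from such a layout could work roughly as follows. This sketch assumes 50-page chunks and the `pages_<first>-<last>.json` naming shown in the tree; the contents of `current_position.json` are not documented here, so the example relies only on which chunk files exist. Function names are hypothetical.

```python
from pathlib import Path


def chunk_ranges(total_pages: int, chunk_size: int = 50) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (first, last) ranges."""
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def pending_chunks(checkpoint_dir: Path, total_pages: int,
                   chunk_size: int = 50) -> list[tuple[int, int]]:
    """Return the ranges that have no checkpoint file yet (assumed naming)."""
    return [
        (first, last)
        for first, last in chunk_ranges(total_pages, chunk_size)
        if not (checkpoint_dir / f"pages_{first}-{last}.json").exists()
    ]


print(chunk_ranges(120))
```

A resumed run would process only the ranges returned by `pending_chunks`, so an interrupted extraction does not redo finished chunks.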
output/<skill-name>/
├── SKILL.md              # Skill description with PDF summary
├── references/
│   ├── index.md          # Table of contents
│   ├── chapter_1.md      # Content by section
│   ├── chapter_2.md
│   └── tables.md         # Extracted tables
└── assets/
    └── images/           # Extracted images (if enabled)
        ├── page_1_fig_1.png
        └── page_5_chart_1.png
{
  "name": "mymanual",
  "description": "Product manual documentation",
  "pdf_path": "docs/manual.pdf",
  "extract_options": {
    "chunk_size": 10,
    "min_quality": 6.0,
    "extract_images": true,
    "min_image_size": 150,
    "ocr_enabled": false,
    "ocr_language": "eng",
    "table_extraction": true
  },
  "categories": {
    "getting_started": ["introduction", "setup", "installation"],
    "usage": ["using", "operation", "guide"],
    "reference": ["appendix", "specifications", "api"]
  }
}
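One plausible reading of the `categories` map is keyword routing: a section lands in the first category whose keyword appears in its title. The sketch below assumes case-insensitive substring matching and a fallback bucket named `uncategorized`; both are illustrative assumptions, not confirmed behavior of skill-seekers.

```python
import json

# Subset of the example configuration; only "categories" is used here.
config = json.loads("""
{
  "categories": {
    "getting_started": ["introduction", "setup", "installation"],
    "usage": ["using", "operation", "guide"],
    "reference": ["appendix", "specifications", "api"]
  }
}
""")


def categorize(title: str, categories: dict[str, list[str]],
               default: str = "uncategorized") -> str:
    """Return the first category with a keyword found in the title."""
    lowered = title.lower()
    for name, keywords in categories.items():
        if any(keyword in lowered for keyword in keywords):
            return name
    return default


for title in ("Installation and Setup", "Operation Guide", "Appendix A", "Warranty"):
    print(title, "->", categorize(title, config["categories"]))
```

Substring matching is deliberately simple: a short keyword such as `api` also matches words like "rapid", so word-boundary matching may be preferable for real documents.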
| Metric | Good | Acceptable | Poor |
|--------|------|------------|------|
| Text extraction rate | >95% | 80-95% | <80% |
| Table accuracy | >90% | 70-90% | <70% |
| Image quality | >300 DPI | 150-300 DPI | <150 DPI |
| OCR confidence | >90% | 70-90% | <70% |
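The thresholds in this table translate directly into a small grading helper. This is a sketch only: the metric keys and the `grade` function are hypothetical names, and how each metric is measured is outside its scope.

```python
# (good_above, poor_below) per metric, taken from the table above.
THRESHOLDS = {
    "text_extraction_rate": (95.0, 80.0),  # percent
    "table_accuracy": (90.0, 70.0),        # percent
    "image_dpi": (300.0, 150.0),           # DPI
    "ocr_confidence": (90.0, 70.0),        # percent
}


def grade(metric: str, value: float) -> str:
    """Classify a measured value as Good, Acceptable, or Poor."""
    good_above, poor_below = THRESHOLDS[metric]
    if value > good_above:
        return "Good"
    if value >= poor_below:
        return "Acceptable"
    return "Poor"


print(grade("text_extraction_rate", 97.5))
print(grade("image_dpi", 200))
```

Boundary values (for example exactly 95% or exactly 80%) fall into Acceptable, matching the inclusive ranges in the middle column.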
| Issue | Diagnosis | Solution |
|-------|-----------|----------|
| Garbled text | Scanned PDF | Enable OCR mode |
| Missing tables | Complex layout | Use --extract-tables with pdfplumber |
| Poor OCR | Low resolution | Increase DPI, check language pack |
| Memory error | Large PDF | Use chunked extraction, reduce workers |
| Corrupt PDF | File damaged | Try qpdf --check or mutool clean |
Required:
Optional (for advanced features):