Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

jmagly/pdf-extractor

Name: pdf-extractor
Author: jmagly

agentic/code/addons/doc-intelligence/skills/pdf-extractor/SKILL.md

npx skillsauth add jmagly/aiwg pdf-extractor


PDF Extractor Skill

Purpose

Single responsibility: Extract structured content (text, tables, images) from PDF files into organized, searchable formats. (BP-4)

Grounding Checkpoint (Archetype 1 Mitigation)

Before executing, VERIFY:

- [ ] PDF file exists and is readable (`file <path>` confirms PDF format)
- [ ] PDF is not corrupted (`pdfinfo <path>` returns metadata)
- [ ] Password known if encrypted
- [ ] Output directory is writable
- [ ] Required tools available (pdfplumber, pytesseract for OCR)

DO NOT proceed without verification. Inspect PDF metadata first.
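The checklist above could be automated along these lines. This is a minimal sketch using only the Python standard library, not the skill's actual implementation: the function name `preflight` and the default tool list are assumptions for illustration, and the header test only looks for the `%PDF-` marker near the start of the file rather than validating the whole document.

```python
import os
import shutil
from pathlib import Path


def preflight(pdf_path, output_dir, tools=("pdfinfo", "pdftotext")):
    """Return a list of problems; an empty list means every check passed.

    `tools` is an assumed default; pass whichever executables you rely on.
    """
    problems = []

    pdf = Path(pdf_path)
    if not pdf.is_file():
        problems.append(f"PDF not found: {pdf}")
    else:
        # A PDF carries the "%PDF-" marker near the start of the file.
        with pdf.open("rb") as fh:
            header = fh.read(1024)
        if b"%PDF-" not in header:
            problems.append(f"Not a PDF (no %PDF- header): {pdf}")

    out = Path(output_dir)
    if not out.is_dir() or not os.access(out, os.W_OK):
        problems.append(f"Output directory missing or not writable: {out}")

    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"Required tool not on PATH: {tool}")

    return problems
```

If the returned list is non-empty, the agent should stop and report the problems instead of starting extraction.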

Uncertainty Escalation (Archetype 2 Mitigation)

ASK USER instead of guessing when:

- PDF appears to be scanned (needs OCR) but OCR tools unavailable
- Multiple table formats detected - unclear which parser to use
- Password-protected but no password provided
- Image extraction quality unclear (resolution, format preferences)
- Language detection needed for OCR

NEVER assume PDF structure without inspection.

Context Scope (Archetype 3 Mitigation)

| Context Type | Included | Excluded |
|--------------|----------|----------|
| RELEVANT | Target PDF, extraction options, output path | Other PDF files |
| PERIPHERAL | Similar PDF structure examples | Unrelated documents |
| DISTRACTOR | Previous extraction attempts | Other file formats |

Workflow Steps

Step 1: Inspect PDF (Grounding)

# Check file type
file document.pdf

# Get PDF metadata
pdfinfo document.pdf

# Check page count
pdfinfo document.pdf | grep Pages

# Check if encrypted
pdfinfo document.pdf | grep Encrypted
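To act on the metadata rather than just print it, the `pdfinfo` output can be parsed into a dictionary. The sketch below assumes the usual `Key:   value` line layout of `pdfinfo`; the helper names `parse_pdfinfo` and `summarize` are hypothetical.

```python
def parse_pdfinfo(text):
    """Parse pdfinfo output ("Key:   value" per line) into a dict of strings."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as dates stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def summarize(info):
    """Pull out the two facts the later steps depend on."""
    pages = int(info.get("Pages", "0"))
    # The Encrypted value starts with "yes" or "no", optionally followed by details.
    encrypted = info.get("Encrypted", "no").lower().startswith("yes")
    return {"pages": pages, "encrypted": encrypted}
```

The input text would typically come from `subprocess.run(["pdfinfo", path], capture_output=True, text=True, check=True).stdout`.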

Step 2: Determine Extraction Strategy

| PDF Type | Detection | Strategy |
|----------|-----------|----------|
| Text-based | pdftotext produces readable text | Direct extraction |
| Scanned/Image | pdftotext produces empty/garbled | OCR required |
| Mixed | Some pages text, some images | Hybrid approach |
| Tables | Visual grid patterns | Table extraction mode |
| Forms | Interactive fields | Form field extraction |
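The first three rows of the table can be approximated from `pdftotext` output alone. The following is a rough sketch under stated assumptions: the function names and the `min_chars` threshold are hypothetical, it only detects empty pages (not garbled text), and it does not attempt to detect tables or form fields.

```python
def split_pages(raw):
    """Split pdftotext output into pages on form feed characters."""
    pages = raw.split("\f")
    # Drop the empty chunk left behind by a trailing form feed.
    if pages and not pages[-1].strip():
        pages.pop()
    return pages


def classify_pages(page_texts, min_chars=20):
    """Classify a PDF as 'text', 'scanned', or 'mixed' from per-page text."""
    if not page_texts:
        raise ValueError("no pages to classify")
    has_text = [len(text.strip()) >= min_chars for text in page_texts]
    if all(has_text):
        return "text"
    if not any(has_text):
        return "scanned"
    return "mixed"


# Strategy names copied from the table.
STRATEGY = {
    "text": "Direct extraction",
    "scanned": "OCR required",
    "mixed": "Hybrid approach",
}
```

A result of `"mixed"` is a natural point to ask the user how to proceed rather than guessing.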

Step 3: Execute Extraction

Option A: With skill-seekers (if installed)

# Basic extraction
skill-seekers pdf --pdf document.pdf --name myskill

# With table extraction
skill-seekers pdf --pdf document.pdf --name myskill --extract-tables

# With OCR for scanned docs
skill-seekers pdf --pdf document.pdf --name myskill --ocr

# With parallel processing (large PDFs)
skill-seekers pdf --pdf document.pdf --name myskill --parallel --workers 8

# Password-protected
skill-seekers pdf --pdf document.pdf --name myskill --password "secret"

Option B: Manual extraction guidance

# Basic text extraction
pdftotext -layout document.pdf output.txt

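# Extract with Unix line endings (pdftotext separates pages with form feed characters by default)
pdftotext -layout -eol unix document.pdf output.txt

# Extract images (the last argument is a filename prefix; the directory must exist)
mkdir -p images
pdfimages -all document.pdf images/img

# OCR scanned PDF (requires tesseract, which takes one input image per call)
pdftoppm -r 300 -png document.pdf page
for f in page-*.png; do tesseract "$f" "${f%.png}" -l eng; done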

Step 4: Validate Output

# Check extraction quality
head -100 output/<skill-name>/references/content.md

# Verify table extraction
grep -A 10 "| " output/<skill-name>/references/*.md

# Check image extraction
ls -la output/<skill-name>/assets/images/

Recovery Protocol (Archetype 4 Mitigation)

On error:

1. PAUSE - Stop extraction, preserve partial output
2. DIAGNOSE - Check error type:
   - File not found → Verify path
   - Password required → Ask user for password
   - Corrupt PDF → Diagnose with `qpdf --check`, then attempt a repair by rewriting the file (`qpdf document.pdf repaired.pdf`) or with `mutool clean`
   - OCR failed → Check tesseract installation, language packs
   - Memory error → Process in chunks, reduce workers
3. ADAPT - Switch strategy based on diagnosis
4. RETRY - Resume with adapted approach (max 3 attempts)
5. ESCALATE - Ask user for guidance
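The protocol above amounts to a bounded retry loop. Here is one way it might look; this is an illustrative sketch, not the skill's code. `ExtractionError`, the error kind strings, and the `workers` and `chunk_size` option names used in `ADAPTATIONS` are all hypothetical, and only the memory case is given an automatic adaptation because the other diagnoses call for user input or an environment fix.

```python
class ExtractionError(Exception):
    """Hypothetical error type carrying a diagnosed error kind."""

    def __init__(self, kind, message=""):
        super().__init__(message or kind)
        self.kind = kind


def _reduce_load(options):
    # Memory error: halve the worker count and the chunk size, never below 1.
    return {
        **options,
        "workers": max(1, options.get("workers", 1) // 2),
        "chunk_size": max(1, options.get("chunk_size", 10) // 2),
    }


# Error kinds that can be adapted to without asking the user (assumed names).
ADAPTATIONS = {"memory": _reduce_load}


def run_with_recovery(extract, options, max_attempts=3):
    """Call extract(options); on a known error, adapt the options and retry."""
    history = []
    for attempt in range(1, max_attempts + 1):
        try:
            return {"status": "ok", "result": extract(options), "history": history}
        except ExtractionError as err:           # PAUSE: this attempt stops here
            history.append((attempt, err.kind))  # DIAGNOSE: record the error kind
            adapt = ADAPTATIONS.get(err.kind)
            if adapt is None:
                break                            # nothing to adapt, go ask the user
            options = adapt(options)             # ADAPT, then loop around to RETRY
    return {"status": "escalate", "history": history}  # ESCALATE
```

Returning the attempt history alongside the status gives the agent something concrete to show the user when it escalates.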

Checkpoint Support

State saved to: .aiwg/working/checkpoints/pdf-extractor/

For large PDFs, extraction saves progress per chunk:

checkpoints/pdf-extractor/
├── document_metadata.json
├── pages_1-50.json
├── pages_51-100.json
└── current_position.json
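A per-chunk checkpoint scheme like the one shown could be implemented roughly as follows. The file names match the tree above, but the JSON layout is an assumption: the `last_page` key inside `current_position.json` and the helper names are hypothetical, since the actual schema is not documented here.

```python
import json
from pathlib import Path


def save_chunk(ckpt_dir, first_page, last_page, payload):
    """Write one chunk file, then record the last completed page."""
    ckpt = Path(ckpt_dir)
    ckpt.mkdir(parents=True, exist_ok=True)
    chunk_file = ckpt / f"pages_{first_page}-{last_page}.json"
    chunk_file.write_text(json.dumps(payload), encoding="utf-8")
    # Update the position only after the chunk itself is on disk.
    position = {"last_page": last_page}  # assumed schema
    (ckpt / "current_position.json").write_text(json.dumps(position), encoding="utf-8")


def next_start_page(ckpt_dir):
    """Return the page to resume from (1 if no checkpoint exists)."""
    pos = Path(ckpt_dir) / "current_position.json"
    if not pos.exists():
        return 1
    return json.loads(pos.read_text(encoding="utf-8"))["last_page"] + 1


def chunk_ranges(start, total_pages, chunk_size=50):
    """Yield inclusive (first, last) page ranges from start to total_pages."""
    for first in range(start, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)
```

On restart, calling `chunk_ranges(next_start_page(ckpt_dir), total_pages)` skips the chunks that were already saved.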

Output Structure

output/<skill-name>/
├── SKILL.md              # Skill description with PDF summary
├── references/
│   ├── index.md          # Table of contents
│   ├── chapter_1.md      # Content by section
│   ├── chapter_2.md
│   └── tables.md         # Extracted tables
└── assets/
    └── images/           # Extracted images (if enabled)
        ├── page_1_fig_1.png
        └── page_5_chart_1.png

Configuration Options

{
  "name": "mymanual",
  "description": "Product manual documentation",
  "pdf_path": "docs/manual.pdf",
  "extract_options": {
    "chunk_size": 10,
    "min_quality": 6.0,
    "extract_images": true,
    "min_image_size": 150,
    "ocr_enabled": false,
    "ocr_language": "eng",
    "table_extraction": true
  },
  "categories": {
    "getting_started": ["introduction", "setup", "installation"],
    "usage": ["using", "operation", "guide"],
    "reference": ["appendix", "specifications", "api"]
  }
}
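Before running an extraction it can help to check a config file like the one above for typos and wrong types. This is a small sketch using only the standard library. The option names and types are read off the example, while the choice of `name` and `pdf_path` as required keys and the function names are assumptions.

```python
import json

REQUIRED_KEYS = ("name", "pdf_path")  # assumed to be mandatory
OPTION_TYPES = {
    "chunk_size": int,
    "min_quality": float,
    "extract_images": bool,
    "min_image_size": int,
    "ocr_enabled": bool,
    "ocr_language": str,
    "table_extraction": bool,
}


def _type_ok(value, expected):
    if expected is bool:
        return isinstance(value, bool)
    if isinstance(value, bool):
        return False  # bool is a subclass of int, so reject it for other types
    if expected is float:
        return isinstance(value, (int, float))
    return isinstance(value, expected)


def validate_config(cfg):
    """Return a list of problems found in a parsed config dict (empty if none)."""
    errors = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in cfg]
    for key, value in cfg.get("extract_options", {}).items():
        expected = OPTION_TYPES.get(key)
        if expected is None:
            errors.append(f"unknown option: {key}")
        elif not _type_ok(value, expected):
            errors.append(f"wrong type for {key}: expected {expected.__name__}")
    return errors


def load_config(path):
    """Load a JSON config file and raise ValueError if validation fails."""
    with open(path, encoding="utf-8") as fh:
        cfg = json.load(fh)
    problems = validate_config(cfg)
    if problems:
        raise ValueError("; ".join(problems))
    return cfg
```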

Extraction Quality Metrics

| Metric | Good | Acceptable | Poor |
|--------|------|------------|------|
| Text extraction rate | >95% | 80-95% | <80% |
| Table accuracy | >90% | 70-90% | <70% |
| Image quality | >300 DPI | 150-300 DPI | <150 DPI |
| OCR confidence | >90% | 70-90% | <70% |
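The thresholds in the table translate directly into a lookup. In this sketch the metric keys (`text_extraction_rate` and so on) and the function name `rate` are hypothetical labels; the numeric boundaries are the ones listed above, with boundary values falling into the Acceptable band.

```python
# (good_above, poor_below) pairs taken from the quality metrics table.
THRESHOLDS = {
    "text_extraction_rate": (95, 80),  # percent
    "table_accuracy": (90, 70),        # percent
    "image_dpi": (300, 150),           # dots per inch
    "ocr_confidence": (90, 70),        # percent
}


def rate(metric, value):
    """Return 'Good', 'Acceptable', or 'Poor' for a measured value."""
    good_above, poor_below = THRESHOLDS[metric]
    if value > good_above:
        return "Good"
    if value < poor_below:
        return "Poor"
    return "Acceptable"
```

For example, `rate("ocr_confidence", 65)` returns `"Poor"`, which would be a reasonable trigger for the Poor OCR troubleshooting steps.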

Troubleshooting

| Issue | Diagnosis | Solution |
|-------|-----------|----------|
| Garbled text | Scanned PDF | Enable OCR mode |
| Missing tables | Complex layout | Use --extract-tables with pdfplumber |
| Poor OCR | Low resolution | Increase DPI, check language pack |
| Memory error | Large PDF | Use chunked extraction, reduce workers |
| Corrupt PDF | File damaged | Diagnose with `qpdf --check`; repair by rewriting with `qpdf` or `mutool clean` |

Dependencies

Required:

- Python 3.10+
- pdfplumber or pypdf

Optional (for advanced features):

- pytesseract + tesseract-ocr (for OCR)
- Pillow (for image processing)
- camelot-py (for complex tables)

References

Skill Seekers PDF Support: https://github.com/jmagly/Skill_Seekers/blob/main/docs/PDF_MCP_TOOL.md
REF-001: Production-Grade Agentic Workflows (BP-1, BP-4)
REF-002: LLM Failure Modes (Archetype 1-4 mitigations)

Extract text, tables, and images from PDF files. Use when converting PDF documentation, manuals, or reports to searchable text.

122 stars

documentation

Updated Apr 24, 2026

npx skillsauth add jmagly/aiwg pdf-extractor

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 24, 2026, 5:16 PM · 107.5s · 1 file scanned

SKILL.md

namespace: aiwg
name: pdf-extractor
description: Extract text, tables, and images from PDF files. Use when converting PDF documentation, manuals, or reports to searchable text.
tools: Read, Write, Bash
platforms: [all]


Related Skills

jmagly/radar-status

data-ai

Verified · Trusted · Community

Report which research-corpus radar sidecars are overdue for refresh. Computes staleness (days since last refresh vs the cadence window) for every radar, sorted most-overdue-first. Runs via `aiwg corpus radar-status`.

140 · SKILL.md · Updated May 28, 2026

jmagly/radar-report

data-ai

Verified · Trusted · Community

Aggregate research-corpus radar sidecars into a corpus or per-cluster freshness report — totals, overdue count, per-cluster / per-GRADE / per-trajectory breakdowns, an overdue table, and per-radar rationale snippets. Runs via `aiwg corpus radar-report`.

140 · SKILL.md · Updated May 28, 2026

jmagly/radar-init

testing

Verified · Trusted · Community

Scaffold radar/freshness sidecars for research-corpus REFs. Pulls title/authors from the citation sidecar and GRADE from the analysis doc, defaults the refresh cadence from GRADE and the cluster from a corpus-local map, and stamps documentation/radar/REF-XXX-radar.md. Runs via `aiwg corpus radar-init`.

140 · SKILL.md · Updated May 28, 2026

jmagly/profile-temporal

data-ai

Verified · Trusted · Community

Compute an entity's publication trajectory — per-year paper counts, topic drift, hot-streak detection (≥3 consecutive A-grade years), and career phase. Runs via `aiwg corpus profile-temporal`.

140 · SKILL.md · Updated May 28, 2026


Install manually


# Clone the repo
git clone https://github.com/jmagly/aiwg.git

# Copy into Claude Code skills folder (global), creating it if needed
mkdir -p ~/.claude/skills
cp -r aiwg/agentic/code/addons/doc-intelligence/skills/pdf-extractor ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

jmagly/aiwg

122 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT