skills/extracting-pdf-text/SKILL.md
Extract text from PDFs for LLM consumption. Use when processing PDFs for RAG, document analysis, or text extraction. Supports API services (Mistral OCR) and local tools (PyMuPDF, pdfplumber). Handles text-based PDFs, tables, and scanned documents with OCR.
This skill provides tools and guidance for extracting text from PDFs in formats suitable for language model consumption.
| PDF Type | Best Approach | Script |
|----------|--------------|--------|
| Simple text PDF | PyMuPDF | ./scripts/extract_pymupdf.py |
| PDF with tables | pdfplumber | ./scripts/extract_pdfplumber.py |
| Scanned/image PDF (local) | pytesseract | ./scripts/extract_with_ocr.py |
| Complex layout, highest accuracy | Mistral OCR API | ./scripts/extract_mistral_ocr.py |
| End-to-end RAG pipeline | marker-pdf | pip install marker-pdf |
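The table above can be read as a small decision procedure. The sketch below is one possible encoding of it; the function name `choose_extractor` and its boolean flags are hypothetical and not part of this skill, and the rule that scanned PDFs go to Mistral OCR whenever API access is available is an interpretation of the table, not something the scripts enforce.

```python
def choose_extractor(
    scanned: bool,
    has_tables: bool = False,
    complex_layout: bool = False,
    api_available: bool = False,
) -> str:
    """Return the script suggested by the table for a given kind of PDF.

    Hypothetical helper: the flags are assumptions about what the caller
    already knows about the document.
    """
    # Complex layouts and scanned pages go to the API when a key is available.
    if api_available and (complex_layout or scanned):
        return "./scripts/extract_mistral_ocr.py"
    # Scanned pages without API access fall back to local OCR.
    if scanned:
        return "./scripts/extract_with_ocr.py"
    # Machine-generated PDFs with tables are handled by pdfplumber.
    if has_tables:
        return "./scripts/extract_pdfplumber.py"
    # Plain text PDFs use the fastest option.
    return "./scripts/extract_pymupdf.py"
```

For example, `choose_extractor(scanned=True)` returns the local OCR script, while adding `api_available=True` switches to the Mistral OCR script.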
Best for: Text-heavy PDFs, speed-critical workflows, basic structure preservation.
uv run ./scripts/extract_pymupdf.py input.pdf output.md
The script outputs markdown with headings and paragraphs preserved. For LLM-optimized output, it uses pymupdf4llm, which formats text for RAG systems.
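For orientation, a minimal sketch of what a pymupdf4llm-based conversion might look like. This is not the contents of `extract_pymupdf.py`; the helper names `tidy_markdown` and `pdf_to_markdown` are hypothetical, and the blank-line cleanup is an added convenience rather than something the library requires.

```python
import re


def tidy_markdown(md: str) -> str:
    """Collapse runs of three or more newlines into a single blank line."""
    return re.sub(r"\n{3,}", "\n\n", md).strip() + "\n"


def pdf_to_markdown(pdf_path: str) -> str:
    """Convert a text-based PDF to markdown (sketch, assumes pymupdf4llm is installed)."""
    # Imported lazily so tidy_markdown stays usable without the dependency.
    import pymupdf4llm  # pip install pymupdf4llm

    return tidy_markdown(pymupdf4llm.to_markdown(pdf_path))
```

Calling `pdf_to_markdown("input.pdf")` would return the whole document as one markdown string, which can then be written to `output.md` or chunked for retrieval.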
Best for: PDFs with tables, financial documents, structured data.
uv run ./scripts/extract_pdfplumber.py input.pdf output.md
Tables are converted to markdown format. Note: pdfplumber works best on machine-generated PDFs, not scanned documents.
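A sketch of how the table-to-markdown step could work, assuming the usual pdfplumber shape where `page.extract_tables()` returns each table as a list of rows and each row as a list of cell strings (or `None` for empty cells). The helper names are hypothetical and the bundled script may differ; the first row is treated as the header, which is an assumption.

```python
def table_to_markdown(rows):
    """Render one extracted table (list of rows) as a markdown table."""
    if not rows:
        return ""

    def clean(cell):
        # Empty cells come back as None; newlines and pipes would break the row.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    header, *body = [[clean(cell) for cell in row] for row in rows]
    width = len(header)
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join(["---"] * width) + "|",
    ]
    for row in body:
        # Pad or trim ragged rows so every line has the header's column count.
        row = (row + [""] * width)[:width]
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


def pdf_tables_to_markdown(pdf_path):
    """Return every table in the PDF as a markdown string (needs pdfplumber)."""
    import pdfplumber  # pip install pdfplumber

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables.extend(table_to_markdown(t) for t in page.extract_tables())
    return tables
```

Escaping `|` and flattening newlines inside cells keeps multi-line cells from splitting a markdown row, which is a common failure with financial statements.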
Best for: Scanned PDFs when API access is unavailable.
uv run ./scripts/extract_with_ocr.py input.pdf output.txt
Requires: pytesseract, pdf2image, and Tesseract installed (brew install tesseract on macOS).
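A rough sketch of the local OCR flow with these dependencies: render each page to an image with pdf2image, then run Tesseract on it. The helper names, the page-separator format, and the 300 DPI default are assumptions for illustration, not taken from `extract_with_ocr.py`. Note that pdf2image also relies on Poppler being installed.

```python
def join_pages(page_texts):
    """Join per-page OCR text with a visible page marker between pages."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"--- Page {number} ---\n{text.strip()}")
    return "\n\n".join(parts) + "\n"


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """OCR a scanned PDF page by page (needs pytesseract, pdf2image, Tesseract)."""
    # Imported lazily so join_pages works without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    return join_pages(pytesseract.image_to_string(img, lang=lang) for img in images)
```

Page markers are optional, but they make it easier to cite a page number when the text is later fed to a language model.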
Best for: Complex layouts, scanned documents, highest accuracy, multilingual content, math formulas.
Pricing: about 1,000 pages per dollar (roughly $0.001 per page).
export MISTRAL_API_KEY="your-key"
uv run ./scripts/extract_mistral_ocr.py input.pdf output.md
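As a worked example of the pricing figure above, a batch's cost can be estimated before sending it to the API. The function name and the fixed rate are illustrative assumptions based only on the approximate figure quoted here; actual billing may differ.

```python
PAGES_PER_DOLLAR = 1000  # approximate rate quoted above, not an official price


def estimate_cost_usd(page_count, pages_per_dollar=PAGES_PER_DOLLAR):
    """Estimate OCR cost in dollars for a given number of pages."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return page_count / pages_per_dollar
```

Under this rate, a 2,500-page archive comes to about $2.50.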
Features:
For detailed API options and other services, see references/api-services.md.
For detailed comparisons of local tools, see references/local-tools.md.