tools/extracting-pdf-text/SKILL.md
Extract text from PDFs for LLM consumption. Use when processing PDFs for RAG, document analysis, or text extraction. Supports API services (Mistral OCR) and local tools (PyMuPDF, pdfplumber). Handles text-based PDFs, tables, and scanned documents with OCR.
npx skillsauth add letta-ai/skills extracting-pdf-text
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill provides tools and guidance for extracting text from PDFs in formats suitable for language model consumption.
| PDF Type | Best Approach | Script |
|----------|--------------|--------|
| Simple text PDF | PyMuPDF | scripts/extract_pymupdf.py |
| PDF with tables | pdfplumber | scripts/extract_pdfplumber.py |
| Scanned/image PDF (local) | pytesseract | scripts/extract_with_ocr.py |
| Complex layout, highest accuracy | Mistral OCR API | scripts/extract_mistral_ocr.py |
| End-to-end RAG pipeline | marker-pdf | pip install marker-pdf |
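The table above can be read as a small decision rule. The following is a minimal sketch of that rule, not part of the skill itself: the function name `pick_script` and its boolean parameters are hypothetical, and only the script paths come from the table.

```python
def pick_script(
    scanned: bool = False,
    has_tables: bool = False,
    complex_layout: bool = False,
    api_key_available: bool = False,
) -> str:
    """Map rough PDF traits to one of the scripts listed in the table.

    Hypothetical helper: the parameter names are illustrative assumptions.
    """
    # Complex layouts and scans go to the API when a key is available.
    if api_key_available and (scanned or complex_layout):
        return "scripts/extract_mistral_ocr.py"
    # Scanned pages have no text layer, so fall back to local OCR.
    if scanned:
        return "scripts/extract_with_ocr.py"
    # Machine-generated PDFs with tables suit pdfplumber.
    if has_tables:
        return "scripts/extract_pdfplumber.py"
    # Default: plain text-based PDF.
    return "scripts/extract_pymupdf.py"


print(pick_script(has_tables=True))
```

The order of the checks is a design choice: OCR comes first because table extraction on a scanned page has no text layer to work with.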
PyMuPDF is best for text-heavy PDFs, speed-critical workflows, and basic structure preservation.
uv run scripts/extract_pymupdf.py input.pdf output.md
The script outputs markdown with headings and paragraphs preserved. For LLM-optimized output, it uses pymupdf4llm, which formats text for RAG systems.
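For readers who want to call pymupdf4llm directly, here is a minimal sketch. It assumes `pymupdf4llm` is installed and uses its `to_markdown` function; the `tidy_markdown` helper is a hypothetical post-processing step, and the bundled script may work differently.

```python
import re


def tidy_markdown(md: str) -> str:
    """Collapse runs of three or more newlines into a single blank line."""
    return re.sub(r"\n{3,}", "\n\n", md).strip() + "\n"


def pdf_to_markdown(pdf_path: str) -> str:
    """Convert a text-based PDF to markdown (assumes pymupdf4llm is installed)."""
    import pymupdf4llm  # third-party, imported lazily: pip install pymupdf4llm

    return tidy_markdown(pymupdf4llm.to_markdown(pdf_path))


# Example (needs a real file on disk):
# markdown = pdf_to_markdown("input.pdf")
```

Collapsing blank lines is optional. It only trims whitespace that would otherwise consume tokens in an LLM prompt.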
pdfplumber is best for PDFs with tables, financial documents, and structured data.
uv run scripts/extract_pdfplumber.py input.pdf output.md
Tables are converted to markdown format. Note: pdfplumber works best on machine-generated PDFs, not scanned documents.
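The table-to-markdown step can be sketched as follows. This is an illustration under assumptions, not the bundled script: it relies on pdfplumber's `page.extract_tables()`, which returns each table as a list of rows whose cells are strings or `None`, and the helper names are hypothetical.

```python
from typing import List, Optional

Row = List[Optional[str]]


def table_to_markdown(rows: List[Row]) -> str:
    """Render one extracted table as a markdown table (first row is the header)."""
    if not rows:
        return ""

    def cell(value: Optional[str]) -> str:
        # Empty cells come back as None; newlines and pipes would break the row.
        text = "" if value is None else str(value)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    width = max(len(row) for row in rows)
    padded = [[cell(v) for v in row] + [""] * (width - len(row)) for row in rows]
    lines = [
        "| " + " | ".join(padded[0]) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in padded[1:]]
    return "\n".join(lines)


def extract_tables(pdf_path: str) -> List[str]:
    """Return every table in the PDF as markdown (assumes pdfplumber is installed)."""
    import pdfplumber  # third-party, imported lazily: pip install pdfplumber

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                tables.append(table_to_markdown(table))
    return tables
```

Treating the first row as the header is an assumption. Tables without a header row would need an empty header line instead.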
Local OCR with pytesseract is best for scanned PDFs when API access is unavailable.
uv run scripts/extract_with_ocr.py input.pdf output.txt
Requires: pytesseract, pdf2image, and Tesseract installed (brew install tesseract on macOS).
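A minimal sketch of the local OCR flow, assuming the requirements above are installed: render each page to an image with pdf2image, then run Tesseract on it through pytesseract. Note that pdf2image also needs Poppler on the system. The function names, the page separator format, and the default `dpi` are illustrative assumptions rather than what the bundled script does.

```python
from typing import Iterable


def join_pages(page_texts: Iterable[str]) -> str:
    """Join per-page OCR text with a visible page marker between pages."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"--- Page {number} ---\n{text.strip()}")
    return "\n\n".join(parts) + "\n"


def ocr_pdf(pdf_path: str, dpi: int = 300, lang: str = "eng") -> str:
    """OCR a scanned PDF locally (assumes pdf2image, pytesseract, Tesseract)."""
    from pdf2image import convert_from_path  # third-party, needs Poppler
    import pytesseract  # third-party wrapper around the Tesseract binary

    # Each page becomes a PIL image; a higher dpi is slower but usually cleaner.
    images = convert_from_path(pdf_path, dpi=dpi)
    return join_pages(pytesseract.image_to_string(img, lang=lang) for img in images)


# Example (needs a real file on disk):
# text = ocr_pdf("input.pdf")
```

Page markers are optional, but they make it easier to trace a retrieved chunk back to its source page.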
The Mistral OCR API is best for complex layouts, scanned documents, highest accuracy, multilingual content, and math formulas.
Pricing: ~1000 pages per dollar (very cost-effective)
export MISTRAL_API_KEY="your-key"
uv run scripts/extract_mistral_ocr.py input.pdf output.md
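As a rough illustration of what such a script might do, the sketch below uploads a local PDF, requests OCR, and joins the per-page markdown. It assumes the `mistralai` Python SDK with its `files.upload`, `files.get_signed_url`, and `ocr.process` calls and the `mistral-ocr-latest` model name. These may differ between SDK versions, and the bundled script may be implemented differently. The helper names and the cost estimate, which simply applies the approximate pricing quoted above, are hypothetical.

```python
import os
from pathlib import Path
from typing import Iterable


def estimate_cost_usd(page_count: int, pages_per_dollar: float = 1000.0) -> float:
    """Rough cost estimate from the approximate ~1000 pages per dollar figure."""
    return page_count / pages_per_dollar


def pages_to_markdown(pages: Iterable) -> str:
    """Join the markdown of each OCR page object into one document."""
    return "\n\n".join(page.markdown.strip() for page in pages) + "\n"


def ocr_with_mistral(pdf_path: str) -> str:
    """OCR a local PDF through the Mistral API (assumes the mistralai SDK)."""
    from mistralai import Mistral  # third-party, imported lazily

    client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

    # Upload the file, then request a signed URL the OCR endpoint can read.
    with open(pdf_path, "rb") as handle:
        uploaded = client.files.upload(
            file={"file_name": Path(pdf_path).name, "content": handle},
            purpose="ocr",
        )
    signed = client.files.get_signed_url(file_id=uploaded.id)

    response = client.ocr.process(
        model="mistral-ocr-latest",
        document={"type": "document_url", "document_url": signed.url},
    )
    return pages_to_markdown(response.pages)


# Example (needs MISTRAL_API_KEY and a real file):
# Path("output.md").write_text(ocr_with_mistral("input.pdf"), encoding="utf-8")
```

At the quoted rate, `estimate_cost_usd(2500)` gives 2.5, so a 2,500-page batch would cost roughly $2.50.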
Features:
For detailed API options and other services, see references/api-services.md.
For LLM consumption, markdown is preferred:
For detailed comparisons of local tools, see references/local-tools.md.