document_ocr/SKILL.md
Convert scanned PDFs and document images into clean Markdown using docling for layout (figures, tables, reading order) plus a vision-language OCR model. Use when a user needs high-quality OCR of scanned documents, historical literature, or photographed pages — preserving multi-column reading order, diacritics, special characters, and figures. Supports local vLLM/Ollama servers and cloud vision APIs (OpenAI, Anthropic). Assumes an OCR backend already exists.
Turn scanned PDFs and document images into clean, structured Markdown with figures preserved. The pipeline combines docling for layout (figures, tables, reading order) with a vision-language model (VLM) for OCR: docling supplies reliable figure/table crops; the VLM supplies high-quality text. Born-digital PDFs are detected and extracted directly (no server needed).
Adapted from Bruno de Medeiros' OntoMorphoGrapher proof-of-concept, where the
olmocr-docling backend processed an 18-PDF historical-entomology corpus
(1833–2015, 5 languages) with no catastrophic failures.
Use when:
Do not use when:
extract-from-pdfs skill, which can run downstream of this one.

This skill does not start or manage any server; standing one up is out of scope. Before running, the user must have one of:

- a vLLM server serving olmOCR-2 (--backend olmocr-docling);
- an Ollama server serving DeepSeek-OCR (--backend ollama);
- a cloud API key for Anthropic (--backend anthropic-docling) or OpenAI
  (--backend vlm-docling pointed at the OpenAI base URL).

Always remind the user of this requirement and confirm which backend they
have. For example setup commands (illustrative, not maintained), see
references/server_setup.md. The born-digital PDF fast path needs no backend.
conda env create -f environment.yml
conda activate document_ocr
Then make the backend reachable (pick one):
export OCR_HOST=http://YOUR_HOST:30001 # vLLM / Ollama / OpenAI-compatible
# or
export ANTHROPIC_API_KEY=sk-ant-... # cloud Claude
# or
export OPENAI_API_KEY=sk-... # cloud OpenAI
| --backend | What it uses | Needs |
|--------------------|-------------------------------------------|-------|
| olmocr-docling ★ | docling figures + olmOCR-2 (vLLM) | --host |
| vlm-docling | docling figures + any VLM (OpenAI-compat) | --host (+ --api-key for cloud) |
| anthropic-docling| docling figures + Claude (cloud) | --api-key / ANTHROPIC_API_KEY |
| ollama | full-page DeepSeek-OCR (Ollama) | --host |
| docling | docling layout + per-region OCR | --host |
★ Recommended. See references/backends.md for the full matrix, why full-page
beats per-region OCR, model defaults, and tuning notes.
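
The "Needs" column above can be read as a small preflight rule: each backend requires either a reachable host or an API key. The following is a minimal sketch of such a check, not the tool's actual implementation. The function name `missing_requirement`, the `BACKEND_NEEDS` dictionary, and the fallback to the `OCR_HOST` environment variable are assumptions made for illustration; only the backend names and the `--host` / `--api-key` / `ANTHROPIC_API_KEY` requirements come from the table.

```python
import os

# Requirement per backend, transcribed from the "Needs" column of the table.
# "host" means a server URL is required; "api_key" means a cloud key is required.
BACKEND_NEEDS = {
    "olmocr-docling": "host",
    "vlm-docling": "host",  # cloud endpoints additionally need an API key
    "anthropic-docling": "api_key",
    "ollama": "host",
    "docling": "host",
}


def missing_requirement(backend, host=None, api_key=None, env=None):
    """Return a problem description, or None if the backend looks configured.

    Hypothetical helper: falling back to OCR_HOST when no host is passed
    is an assumption, not documented behaviour of ocr_document.py.
    """
    env = os.environ if env is None else env
    if backend not in BACKEND_NEEDS:
        return f"unknown backend: {backend}"
    if BACKEND_NEEDS[backend] == "host":
        if not (host or env.get("OCR_HOST")):
            return f"{backend} needs --host (or OCR_HOST)"
    else:
        if not (api_key or env.get("ANTHROPIC_API_KEY")):
            return f"{backend} needs --api-key (or ANTHROPIC_API_KEY)"
    return None
```

A caller could run this before any page is rendered and stop with the returned message, which mirrors the "fail early with a clear message" behaviour the workflow describes.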
.png/.jpg/.jpeg/.tif/.tiff), or a
directory of them. Ask where output should go.

python scripts/ocr_document.py --input docs/ --dry-run

This labels each PDF DIGITAL or NEEDS_OCR so the user knows what will hit
the server.

python scripts/ocr_document.py \
  --input docs/ --output-dir out/ \
  --backend olmocr-docling --host "$OCR_HOST" --dpi 200
The tool runs a preflight check and fails with a clear message if the backend
isn't configured. Per-page failures (timeouts, content filters) are skipped
with a placeholder; the rest of the document still processes.

out/<stem>/<stem>.md and the figures/ crops. Iterate if needed:
raise --dpi, switch --model, or adjust the caption regex / bad_words /
language list (see references/backends.md).

Per-document folder with the stitched markdown (<stem>.md), a per-page JSON
cache/, cropped figures/, and rendered pages/. Full schema and frontmatter
fields: references/output_format.md.
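
To make the folder layout and the skip-with-placeholder behaviour concrete, here is a minimal sketch of how per-page results might be stitched into `out/<stem>/<stem>.md`. It is an illustration under stated assumptions, not the code in scripts/ocr_document.py: the function name `stitch_document`, the `None`-means-failed convention, and the placeholder wording are all hypothetical, and frontmatter is omitted. Only the folder names (`cache/`, `figures/`, `pages/`) and the `<stem>.md` file name come from the description above.

```python
from pathlib import Path

# Hypothetical placeholder text; the real tool's wording is not documented here.
PLACEHOLDER = "<!-- page {n}: OCR failed, skipped -->"


def stitch_document(stem, out_dir, pages):
    """Write <out_dir>/<stem>/<stem>.md from per-page results.

    `pages` is a list of Markdown strings, with None marking a page whose
    OCR failed (for example a timeout). Returns the stitched file's path.
    """
    doc_dir = Path(out_dir) / stem
    # Subfolders named in the text: cache/, figures/, pages/.
    for sub in ("cache", "figures", "pages"):
        (doc_dir / sub).mkdir(parents=True, exist_ok=True)

    parts = []
    for n, text in enumerate(pages, start=1):
        # A failed page becomes a placeholder so the remaining pages still land.
        parts.append(text if text is not None else PLACEHOLDER.format(n=n))

    md_path = doc_dir / f"{stem}.md"
    md_path.write_text("\n\n".join(parts) + "\n", encoding="utf-8")
    return md_path
```

For example, `stitch_document("paper", "out", ["# Title", None, "Body text"])` would create `out/paper/paper.md` with a placeholder where page 2 would have been.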
--dry-run first and a small sample before the full
run, so cost and quality are known before committing.

Pipeline and prompts adapted from Bruno de Medeiros' OntoMorphoGrapher proof-of-concept (docling + olmOCR-2 over vLLM).