skills/document-vision-reader/SKILL.md
Convert rendered non-plain-text files into temporary screenshots, inspect those screenshots as images, and answer the user's request from the visible visual output. Use when an agent needs to inspect PDFs, PPT/PPTX decks, scanned documents, forms, spreadsheets, or other files whose rendered appearance matters more than raw text extraction.
npx skillsauth add laitszkin/apollo-toolkit document-vision-reader
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
screenshot when the host needs OS-level captures; pdf, doc, pptx, or spreadsheet when the source file needs format-specific rendering before screenshot capture.
Use visual reading as the source of truth when the target file is not a plain-text artifact and the rendered appearance matters more than extractable text.
Typical cases:
Do not use this skill for plain-text files when normal text reading is sufficient.
Before capturing anything, determine:
If the file is long, capture only the requested pages or the smallest range needed to answer correctly.
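As a minimal sketch of narrowing a capture to the requested pages, the helper below expands a page spec such as "2,5-7" into individual page numbers. The name `expand_pages` and the spec syntax are assumptions made for illustration; the skill itself does not define them.

```sh
# expand_pages is a hypothetical helper, not part of the skill.
# It turns a comma-separated spec such as "2,5-7" into one page number per line.
# The body runs in a subshell, so the IFS and globbing changes do not leak out.
expand_pages() (
  IFS=','
  set -f
  for part in $1; do
    case "$part" in
      *-*)
        # A range: take the text before and after the hyphen.
        page=${part%-*}
        last=${part#*-}
        while [ "$page" -le "$last" ]; do
          printf '%s\n' "$page"
          page=$((page + 1))
        done
        ;;
      *)
        # A single page number.
        printf '%s\n' "$part"
        ;;
    esac
  done
)

expand_pages "2,5-7"   # prints 2, 5, 6, 7 on separate lines
```

Each resulting page number can then be passed to whichever renderer handles the source format, so only those pages are captured.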
Create a dedicated temporary directory, for example:
TMP_DIR="$(mktemp -d "${TMPDIR:-/tmp}/document-vision-reader.XXXXXX")"
Inside that directory, keep deterministic names such as:
- page-001.png
- page-002-top.png
- slide-004-notes.png
- sheet-budget-row-01.png

Do not write temporary screenshots into the user's project unless they explicitly ask for that.
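One way to keep capture names deterministic is to build them from a kind, a zero-padded index, and an optional suffix. The sketch below assumes a hypothetical `capture_name` helper and reproduces the page and slide patterns listed above; the sheet pattern uses a different layout and is not covered.

```sh
# capture_name is a hypothetical helper, not part of the skill.
# $1 = kind (for example page or slide)
# $2 = 1-based index, written without leading zeros
# $3 = optional suffix (for example top or notes)
capture_name() {
  if [ -n "${3:-}" ]; then
    printf '%s-%03d-%s.png\n' "$1" "$2" "$3"
  else
    printf '%s-%03d.png\n' "$1" "$2"
  fi
}

capture_name page 1          # page-001.png
capture_name page 2 top      # page-002-top.png
capture_name slide 4 notes   # slide-004-notes.png
```

Zero-padding keeps the files in page order when the directory is listed alphabetically.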
Use the most reliable renderer available for the source format.
Prefer format-specific helper skills when they improve rendering fidelity:
- pdf for PDF page handling
- doc for Word-like documents
- pptx for slides
- spreadsheet for sheets/tables
- screenshot for OS-level capture when no direct renderer is available

If rendering tooling is missing, stop and ask for the minimal set of screenshots or pages needed.
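The helper choice above can be sketched as a lookup on the file extension. The function name `pick_helper` and the exact extension lists are assumptions for illustration; the skill only names the helper skills, not which extensions map to them.

```sh
# pick_helper is a hypothetical helper, not part of the skill.
# It prints the name of the helper skill suggested for a given file name.
# The body runs in a subshell, so the ext variable does not leak out.
pick_helper() (
  # Lowercase the extension so "Deck.PPTX" and "deck.pptx" are treated alike.
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    pdf)      echo pdf ;;
    doc|docx) echo doc ;;          # assumed Word-like extensions
    ppt|pptx) echo pptx ;;
    xls|xlsx) echo spreadsheet ;;  # assumed spreadsheet extensions
    *)        echo screenshot ;;   # fallback: OS-level capture
  esac
)

pick_helper "Quarterly Deck.PPTX"   # pptx
pick_helper scan.tiff               # screenshot
```

Falling back to screenshot mirrors the last list item: it applies only when no direct renderer is available for the format.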
Check every capture before extracting facts.
Use references/legibility-checklist.md when a page is crowded or the answer depends on small details.
Read the screenshots with image-capable tools only after the captures are ready.
When the answer depends on multiple screenshots, reconcile them explicitly instead of paraphrasing from memory.
In the response:
If the user asked for extraction, structure the output so it can be checked against the screenshots.
After the answer is prepared:
Before finishing, confirm all of the following:
- references/rendering-guide.md: format-by-format screenshot staging guidance.
- references/legibility-checklist.md: quick quality gate for readable visual extraction.