skills/research/SKILL.md
Extract research content from YouTube presentations, PDFs, or PPTX files into structured markdown. Dispatches each pass to a dedicated sub-agent (research-extractor / research-vision / research-refiner) so per-deck vision passes scale to hundreds of slides without bloating the parent context.
npx skillsauth add api-haus/my-claude-workflow research
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Extract research material into /mnt/archive4/PAPERS/Prepared/ as annotated markdown with images, transcripts, and OCR. The orchestrator (you) is a thin coordinator: every load-bearing pass runs in a dedicated sub-agent's context window so the parent session stays small.
The Python pipeline ships with the skill, including its own venv. Everything is self-contained at ~/.claude/skills/research/:
~/.claude/skills/research/
├── SKILL.md ← this file
├── pyproject.toml ← uv-managed Python dep spec
├── package.json ← npm-managed Node dep spec (Pass 2.5 validators)
├── .venv/ ← skill-local Python venv (created by `uv sync`, gitignored)
├── node_modules/ ← skill-local Node modules (created by `npm install`, gitignored)
└── tools/
├── README.md
├── extract_research.py
├── extract_research_phase2.py
├── cleanup_research.py
├── research_video.py
├── transcribe_to_srt.py
├── upgrade_to_slide_renders.py
├── redetect_scenes.py
├── subsample_long_scenes.py
├── srt_to_windows.py
├── validate_research.py ← Pass 2.5: extract LaTeX/Mermaid blocks → Node validator
├── validate_md.mjs ← Pass 2.5: KaTeX + mermaid.parse() syntax checker
└── render_md_html.mjs ← Pass 2.5: optional self-contained HTML preview
Why a skill-local venv (not the project's tools/.venv/): projects vary wildly in their Python requirements — some have no venv at all, some have one with conflicting versions (numpy pinned for ML, opencv with GUI flavour, …). The research pipeline needs specific versions of PyMuPDF, opencv-python-headless, faster-whisper, etc. Pinning those at the skill level decouples the toolchain from whatever the project happens to have lying around.
uv as the package manager. Dependencies are pinned in pyproject.toml; the venv is created/updated with uv sync from the skill directory. uv resolves and installs in seconds vs minutes for plain pip — important when the skill is dispatched from many projects.
cd ~/.claude/skills/research
uv sync # Python: creates .venv, installs deps (~30 s cold)
npm install --no-audit --no-fund # Node: installs Pass 2.5 validators (~10 s cold)
The Node install pulls KaTeX (LaTeX validator + renderer), mermaid + jsdom (Mermaid parser), markdown-it + @vscode/markdown-it-katex (HTML preview). Required by tools/validate_research.py (Pass 2.5). Total disk footprint ~30 MB. Skip only if you intend to never run Pass 2.5 — the rest of the pipeline does not depend on it.
Plus system dependencies (tracked in tools/README.md):
# Arch
sudo pacman -S libreoffice-fresh yt-dlp ffmpeg
# Debian / Ubuntu
sudo apt install libreoffice yt-dlp ffmpeg
# macOS
brew install --cask libreoffice
brew install yt-dlp ffmpeg
LibreOffice is needed for PPTX → PNG rendering. yt-dlp + ffmpeg are needed for the video pipeline. OCR is provided by OpenOCR (openocr-python), pinned in pyproject.toml — no system OCR engine is needed. OpenOCR auto-downloads ONNX detection + recognition models (~36 MB total) to ~/.cache/openocr/ on first use.
The agents invoke scripts using the skill venv directly:
~/.claude/skills/research/.venv/bin/python ~/.claude/skills/research/tools/<script>.py --only=<slug>
Or via uv run (which auto-syncs if pyproject.toml changed):
uv run --project ~/.claude/skills/research python ~/.claude/skills/research/tools/<script>.py --only=<slug>
Both forms work; the explicit-path form is faster because it skips the uv sync check.
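As an illustration of the two invocation forms above, here is a minimal sketch that builds the argument list for either one. The helper name `build_command` is hypothetical and not part of the skill; only the paths and the `--only=<slug>` flag come from this section.

```python
from pathlib import Path

# Skill location as described above; build_command itself is a hypothetical helper.
SKILL_DIR = Path.home() / ".claude" / "skills" / "research"


def build_command(script: str, slug: str, use_uv: bool = False) -> list[str]:
    """Return the argv for running one pipeline script against one slug."""
    script_path = str(SKILL_DIR / "tools" / script)
    if use_uv:
        # `uv run --project` checks the environment first, so it is the slower form.
        return ["uv", "run", "--project", str(SKILL_DIR),
                "python", script_path, f"--only={slug}"]
    # Explicit venv interpreter: no sync check.
    return [str(SKILL_DIR / ".venv" / "bin" / "python"),
            script_path, f"--only={slug}"]


cmd = build_command("validate_research.py", "brands-2024-gpu-occlusion")
print(" ".join(cmd))
```

The list form can be handed to `subprocess.run(cmd, check=True)` without shell quoting concerns.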
One-off helpers (e.g. split_<slug>_notes.py for a particular deck's PowerPoint Notes-Pages text-layer split) stay in <project>/tools/, never in the skill — they are project-specific and would clutter the shared skill.
Projects that adopted /research before this restructuring (notably woweyreey) may have their own <project>/tools/*.py copies running against <project>/tools/.venv/. Those continue to work but are legacy: no further updates land there. Migration path for those projects: cd ~/.claude/skills/research && uv sync, then update any project-specific tools/<script>.py invocations to point at the skill copy.
You are the orchestrator for the /research skill. You do not read 268-slide PDFs or run the OCR pipeline yourself. Instead, you scope the work and dispatch it to specialised sub-agents:
| Pass | Agent | Purpose |
|---|---|---|
| 1 — extract & mark | research-extractor | Add the source to tools/extract_research.py SOURCES, run scripts, archive source to /mnt/archive4/PAPERS/. Then read the produced markdown and mark every problematic area inline with a <!-- FIXME(extract): … --> comment — garbled equations, suspect OCR, and (critically) each page that needs a vision-pass description. Report the slug, asset counts, and the count of pages flagged for vision. |
| 2 — vision (conditional) | research-vision | Read slide / figure images and write **Diagram (LLM vision pass):** blocks via Edit. Dispatched ONLY when more than 5 pages need a vision pass (per the extractor's FIXME(extract): … needs vision marks). When 5 or fewer pages need vision, skip this pass entirely — Pass 3 folds the handful of descriptions in. Batches well — dispatch one agent per ~30 slides to keep individual context lean. |
| 2.5 — validate | (orchestrator runs inline) | Run tools/validate_research.py --only=<slug>: every LaTeX block ($…$, $$…$$) is parsed by KaTeX and every Mermaid fenced block by mermaid.parse(). Errors are written to findings-pass2.5-validate.md for the refiner to fix, and to stderr for the orchestrator. Optional --html produces a browser-openable preview. |
| 3 — refine (+ inline vision) | research-refiner | Heading fixes, broken-Unicode equation re-transcription, speaker-notes typo cleanup, optional top-of-doc summary. Resolves every FIXME(extract) and FIXME(vision) mark left in the document and deletes the comment once handled. When Pass 2 was skipped (≤5 vision pages), the refiner also writes the **X (LLM vision pass):** blocks for those pages itself. Brief MUST cite the Pass-2.5 sidecar so the refiner has a concrete error list to address. |
| 3.5 — re-validate | (orchestrator runs inline, optional) | Re-run tools/validate_research.py --only=<slug> as a clean-room check after refine. If anything regressed (new errors introduced, old errors not fixed), re-dispatch the refiner. |
You may also dispatch additional vision-pass batches between Pass 2 and 3 (e.g. "vision-pass slides 100-130 of the same doc, focusing on plot panels") if the first pass missed coverage. Multiple batches against the same paper must run sequentially — they all Edit the same <slug>.md and concurrent edits collide.
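The Pass 2 gating rule (dispatch only when more than 5 pages are flagged, roughly 30 slides per agent, batches run one after another) can be sketched as below. The function name and its defaults are assumptions for illustration; the skill describes the rule in prose only.

```python
def plan_vision_batches(flagged_pages: list[int], threshold: int = 5,
                        batch_size: int = 30) -> list[list[int]]:
    """Return the sequential Pass 2 batches, or [] when Pass 3 folds the pages in."""
    if len(flagged_pages) <= threshold:
        # 5 or fewer flagged pages: skip Pass 2, the refiner writes the blocks inline.
        return []
    pages = sorted(flagged_pages)
    return [pages[i:i + batch_size] for i in range(0, len(pages), batch_size)]


batches = plan_vision_batches(list(range(1, 69)))  # 68 flagged pages
print([len(b) for b in batches])  # [30, 30, 8]
```

Batches against the same paper are meant to be dispatched in list order, never concurrently, because each one edits the same `<slug>.md`.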
Sub-agent model pins — the orchestrator MUST pass model explicitly on every Agent dispatch.
The model: field in each agent's frontmatter is advisory only and is NOT reliably applied — in practice a dispatched sub-agent inherits the parent session's model (typically Opus) unless the orchestrator passes the model parameter explicitly on the Agent tool call. Relying on the frontmatter silently runs the cheap mechanical passes (extract / vision) on Opus, which is a large, avoidable cost. Every /research dispatch must set model explicitly:
| Agent | Pass model: = | Rationale |
|---|---|---|
| research-extractor | "sonnet" | Pass 1 is mostly script orchestration (edit SOURCES, run scripts, archive) — Sonnet handles it cheaply. |
| research-vision | "sonnet" | Cost-dominant workload — hundreds of images per deck in ~30-image batches. Sonnet 4.6 vision quality is strong at meaningfully lower per-token cost than Opus. |
| research-refiner | "opus" | Quality-control gate, and — when Pass 2 is skipped — the inline vision pass. Refiner is where math correctness is verified, OCR-corrupted equations are reconstructed, and scientifically-load-bearing formulas (HG, Rayleigh, Mie, transport equations) get their final form before the document becomes citable. A wrong-but-plausible LaTeX equation that ships through is harder to detect than a missing one — for example, marker once produced (1 + cos a) for the Rayleigh phase function where the canonical form is (1 + cos²a); a Sonnet refiner missed the dropped exponent because the broken form is syntactically valid LaTeX, but the surrounding paragraph ("forward = backward scatter") only makes sense for the squared form. Opus's stronger cross-source reasoning catches that class of error. The refiner is the ONLY pass that runs on Opus — and it must be the 1M-context Opus build (claude-opus-4-7[1m]); the Agent tool's model enum is coarse, so pass model: "opus" and state the 1M-context requirement in the brief. |
Concretely: extractor / vision dispatches pass model: "sonnet"; refiner dispatches pass model: "opus". The orchestrator itself inherits whatever model the parent session runs on — that is fine and unavoidable; only the dispatched sub-agents need the explicit override. If you find yourself dispatching a /research sub-agent without a model argument, that is a bug — add it.
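One way to make a missing model argument impossible is to route every dispatch through a small lookup. This is a sketch only: the dictionary keys `agent`, `model`, and `brief` are placeholders and do not claim to be the Agent tool's real parameter names.

```python
# Pins taken from the table above.
MODEL_PINS = {
    "research-extractor": "sonnet",
    "research-vision": "sonnet",
    "research-refiner": "opus",
}


def dispatch_args(agent: str, brief: str) -> dict[str, str]:
    """Build the arguments for one dispatch; fail loudly on an unpinned agent."""
    if agent not in MODEL_PINS:
        raise KeyError(f"no model pin for {agent!r}")
    if agent == "research-refiner":
        # The model enum is coarse, so the 1M-context requirement goes in the brief.
        brief += "\nRequirement: run on the 1M-context Opus build."
    return {"agent": agent, "model": MODEL_PINS[agent], "brief": brief}


print(dispatch_args("research-vision", "Vision-pass slides 1-30.")["model"])  # sonnet
```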
Math-heavy papers — recommend a manual orchestrator vision-pass audit after the refiner: when the source paper carries scientifically load-bearing equations (radiative-transfer integrals, BRDF formulas, phase functions, error metrics, transport equations) AND the source is photoscanned with Acrobat OCR, the orchestrator (running on the parent-session model, typically Opus 4.x) should do a quick equation-by-equation cross-reference against the page renders before finishing the run. The refiner agent does this against text-layer artefacts, but a second pass by the orchestrator with direct visual access to the page renders is the right belt-and-braces approach for canonical primary sources. Note the audit in your final summary to the user.
FIXME marks: cross-pass communication channel
Passes do not write separate findings-sidecar files. Instead, each pass that spots a problem it is not the right pass to fix marks it inline in <slug>.md with a greppable HTML comment placed at the problem site:
- Extractor (Pass 1): <!-- FIXME(extract): … -->. Garbled / suspect equations, OCR artefacts, and every page that needs a vision-pass description (<!-- FIXME(extract): pNNN needs vision — <one line on what's on the page> -->).
- Vision (Pass 2): <!-- FIXME(vision): … -->. Uncertainty flags (a label / number it could not read cleanly) and body-text-vs-render divergences it spotted but does not have authority to fix.
- Refiner (Pass 3): resolves every FIXME(extract) and FIXME(vision) mark and deletes the comment once handled. Anything the refiner cannot resolve from text-layer + render evidence alone it re-marks <!-- FIXME(audit): … --> and surfaces in its return message for an orchestrator audit.
Why inline instead of sidecar files: the mark lives exactly where the problem is, so the fixing pass sees it in context while reading the document it is already reading end-to-end. There is no second file to open, no template ceremony, and no findings content passing through the orchestrator's context. The orchestrator counts marks with a single grep -c, never by reading them.
Orchestrator dispatch briefs must tell each agent (a) to leave its FIXME(<pass>) marks inline at the problem site, and (b) — for the refiner — to grep for FIXME(extract) and FIXME(vision), resolve each, and delete the comment. A refiner that finishes with FIXME(extract) / FIXME(vision) comments still in the document has not completed its pass. The marks are plain text inside the file every pass already edits, so there is no separate-file write that can fail — but if a sub-agent reports it cannot Edit <slug>.md at all (permission denied, harness block), the orchestrator's correct response is to re-dispatch, never to make the edits itself from the agent's return text. (If sub-agent edits are failing systemically, the root cause is usually a missing Write/Edit permission-allow rule for the /mnt/archive4/PAPERS/Prepared/** path — add it to the user's global ~/.claude/settings.json and fix that, don't work around it.)
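For cases where the orchestrator wants per-kind totals rather than a single `grep -c` line count, a minimal counting sketch follows. The function name is hypothetical; it counts marks (not lines) and treats any `FIXME(extract)` mark containing the phrase "needs vision" as a vision flag, which is an assumption based on the mark format described above.

```python
import re

# Matches <!-- FIXME(extract|vision|audit): ... --> including multi-line comments.
FIXME_RE = re.compile(r"<!--\s*FIXME\((extract|vision|audit)\):(.*?)-->", re.DOTALL)


def count_fixmes(markdown: str) -> dict[str, int]:
    """Count FIXME marks per pass, plus the extract marks that request vision."""
    counts = {"extract": 0, "vision": 0, "audit": 0, "needs_vision": 0}
    for kind, body in FIXME_RE.findall(markdown):
        counts[kind] += 1
        if kind == "extract" and "needs vision" in body:
            counts["needs_vision"] += 1
    return counts


sample = "<!-- FIXME(extract): garbled equation -->\n<!-- FIXME(vision): axis label unreadable -->"
print(count_fixmes(sample))
```

A refiner run is complete only when the `extract` and `vision` counts are both zero; `needs_vision` after Pass 1 is the number that gates Pass 2.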
The /research pipeline mixes GPU-bound local inference (marker / surya) with API-only LLM dispatch (Sonnet vision / marker's LLM structural cleanup), and the per-paper .md files are written through by every pass. The right concurrency strategy depends on which pass is running and whether you're processing one paper or a batch.
Hard sequential — never run two of these at once on the same machine:
- <slug>.md; the next pass reads what the previous wrote. Pipelined, not parallel.
- <slug>.md again.
Parallel-safe, only when each agent owns a different <slug>.md:
Recommended dispatch flow for a single paper (the typical /research invocation):
Pass 1 → Pass 2 (only if >5 pages need vision; one or more sequential batches if the deck is large) → Pass 3. Fully sequential. This is what the three-agent dance assumes by default.
Recommended dispatch flow for a batch of N papers (when the user passes a list of sources):
- Phase A: sequential extract. Run Pass 1 once per paper, one at a time (GPU is shared). Wait for each extractor to finish before dispatching the next. Per-paper marker output is cached at assets/<slug>/marker.md, so this phase is the GPU-bound bottleneck and worth getting right on the first try (avoid --force retries unless you observe a marker failure in the agent's report).
- Phase B: parallel vision. Once Phase A is done, dispatch a research-vision agent for each paper whose vision-flag count is >5, in parallel, one per paper. Each owns its own <slug>.md so there's no edit collision; vision is API-bound (Sonnet) so there's no local-GPU contention. Papers with ≤5 vision flags skip this phase (folded into Phase C).
- Phase C: parallel refine. Same pattern: N research-refiner agents, one per paper, dispatched in parallel. For papers that skipped Phase B, the refiner brief carries the inline-vision page list.
The phased flow turns what would be N × (Pass 1 + Pass 2 + Pass 3) sequential dispatches into approximately (N × Pass 1) + max(Pass 2) + max(Pass 3) wall time, which on a typical 5-paper batch is roughly 2× faster.
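The wall-time arithmetic can be checked with a small worked example. The per-pass durations below are invented placeholders, not measurements; only the two formulas come from the paragraph above.

```python
def sequential_wall_time(papers: list[tuple[float, float, float]]) -> float:
    """N x (Pass 1 + Pass 2 + Pass 3): every dispatch waits for the previous one."""
    return sum(p1 + p2 + p3 for p1, p2, p3 in papers)


def phased_wall_time(papers: list[tuple[float, float, float]]) -> float:
    """(N x Pass 1) + max(Pass 2) + max(Pass 3): only extraction is serialised."""
    extract = sum(p1 for p1, _, _ in papers)
    vision = max((p2 for _, p2, _ in papers), default=0)
    refine = max((p3 for _, _, p3 in papers), default=0)
    return extract + vision + refine


# Hypothetical minutes per paper (Pass 1, Pass 2, Pass 3) for a 5-paper batch.
batch = [(4, 6, 5)] * 5
print(sequential_wall_time(batch))  # 75
print(phased_wall_time(batch))      # 31
```

With these placeholder numbers the phased flow is about 2.4× faster; the actual ratio depends on how much of the total time is spent in the serialised Pass 1.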
Pre-marker history: the old skill text said research extraction was exempt from the project's "no parallel sub-agents" rule because PyMuPDF + OpenOCR were CPU-bound and embarrassingly parallel. That exemption no longer applies — the marker prepass moved Pass 1 onto the GPU and Pass 1 is now hard-serialised across papers. Pass 2 and Pass 3 remain parallel-safe across different papers because they don't touch the local GPU.
If your top-level invocation came from /delegate (the multi-agent orchestration mode that uses shared docs/orchestrate/<topic>/ files), you are doubly orchestrating: /delegate dispatched you to handle the research portion, and you in turn dispatch the three research sub-agents. In that mode:
- The /delegate orchestrator owns docs/orchestrate/<topic>/ and expects status reports there. Pass that directory path through to each sub-agent's brief so they append their findings to docs/orchestrate/<topic>/<NN>-research-<pass>.md.
- /delegate already covered those. Treat your role as "the one that knows /research" within the larger plan.
If you are invoked directly (not via /delegate), skip the docs/orchestrate/<topic>/ dance: the briefs talk to the three research agents directly, and your final message to the user summarises the work.
For a small extraction (single-page paper, < 5 slides, or "just rerun extraction on an existing source"), running the three-agent dance is wasteful. In that case:
Every diagram, plot, image-only table, photograph, or code listing the vision agent processes lands in the markdown as a tagged block immediately before the image reference. The tag is one of:
- **Diagram (LLM vision pass):** schematic, flowchart, polar plot, geometry sketch.
- **Plot (LLM vision pass):** quantitative-axis graph (density profile, error curve, …).
- **Table (LLM vision pass):** image-only data table (transcribed as markdown table inline).
- **Image (LLM vision pass):** photograph, screenshot, before/after.
- **Code (LLM vision pass):** code shown as image (transcribed as fenced block with language tag).
Why "(LLM vision pass)": the parenthetical attribution is non-negotiable. It does two things:
The research-vision agent is required to use these tags. The research-refiner agent is allowed to flag suspicious blocks but must not silently rewrite them — flag for human review instead.
A slide is skipped (no vision block written) ONLY when:
- **X (LLM vision pass):** block already exists for that slide; re-tagging would duplicate.
Any slide with a real visual gets a per-slide block.
Every extracted document MUST be renamed (and its asset directory MUST be renamed) to a citable canonical slug before you finish the run. The Pass 1 script emits a slug derived from the source title (e.g. intro-to-gpu-occlusion) — this is scaffolding only and is never the final filename.
Pattern: <author-surname(-coauthor)?>-<year>-<short-topic>.md
brands. For 2: aaltonen-haar. For 3+: first author only or first-last (match adjacent corpus precedent).
Examples (from the existing corpus):
| Source title | Canonical slug |
|---|---|
| "Intro to GPU Occlusion" (Leon Brands, GPC 2024) | brands-2024-gpu-occlusion |
| "GPU-Driven Rendering Pipelines" (Haar & Aaltonen, SIGGRAPH 2015) | aaltonen-haar-2015-gpu-driven |
| "Improved Culling for Tiled and Clustered Rendering" (Drobot, SIGGRAPH 2017) | drobot-2017-improved-culling |
| "Real-Time, All-Frequency Shadows in Dynamic Scenes" (Annen et al., TOG 2008) | annen-2008-all-frequency-shadows |
| "Adaptive Shadow Maps" (Fernando et al., SIGGRAPH 2001) | fernando-2001-adaptive-shadow-maps |
| "Sparse Virtual Textures" (Sean Barrett, GDC 2008) | barrett-2008-sparse-virtual-textures |
| "Creating the Atmospheric World of Red Dead Redemption 2" (Bauer, SIGGRAPH 2019) | bauer-2019-rdr2-atmospherics |
Why this matters: the corpus is cross-referenced from docs/, memory files, and other research notes by slug. Title-derived slugs (intro-to-gpu-occlusion, volumetric-fog-in-enshrouded) are not citation-stable — two unrelated talks could share a generic title — and they break the corpus convention. Anything filed under a non-canonical slug must be renamed before commit; deferring this creates dangling references.
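A minimal sketch of the naming pattern, assuming the caller passes surnames already in the order the corpus precedent uses. The helper name is hypothetical, and for three or more authors it implements only the "first author only" option, not the first-last variant.

```python
import re


def canonical_slug(surnames: list[str], year: int, topic: str) -> str:
    """Build <author-surname(-coauthor)?>-<year>-<short-topic>."""
    def norm(text: str) -> str:
        # Lowercase, collapse anything that is not a-z / 0-9 into single hyphens.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    if not surnames:
        raise ValueError("need at least one surname or organisation")
    # One or two authors are kept as given; three or more collapse to the first.
    authors = surnames[:2] if len(surnames) == 2 else surnames[:1]
    return "-".join([norm(a) for a in authors] + [str(year), norm(topic)])


print(canonical_slug(["Brands"], 2024, "GPU Occlusion"))         # brands-2024-gpu-occlusion
print(canonical_slug(["Aaltonen", "Haar"], 2015, "GPU-Driven"))  # aaltonen-haar-2015-gpu-driven
```

Choosing the short topic and the author order is still a judgment call against adjacent corpus entries; the sketch only normalises the result.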
Required actions before you finish the run:
- *.md files in /mnt/archive4/PAPERS/Prepared/ for adjacent precedent if unsure; match the surrounding pattern).
- mv /mnt/archive4/PAPERS/Prepared/<scaffolding>.md /mnt/archive4/PAPERS/Prepared/<canonical>.md
- mv /mnt/archive4/PAPERS/Prepared/assets/<scaffolding>/ /mnt/archive4/PAPERS/Prepared/assets/<canonical>/
- slug: frontmatter field, every assets/<scaffolding>/ image path.
If the source genuinely has no clear single author (e.g. an Epic UE documentation page, a vendor whitepaper), use the publishing organisation in lowercase as the "author": epic-2022-ue51-virtual-shadow-maps-docs, khronos-2023-.... Match adjacent corpus precedent.
/mnt/archive4/PAPERS/
Every research source (PDF, PPTX, YouTube video, HLS / m3u8 stream, local mp4) MUST be preserved at its canonical name in /mnt/archive4/PAPERS/. This is the long-term archive of every primary document the project depends on. The markdown extracts in /mnt/archive4/PAPERS/Prepared/*.md are derived artefacts; PAPERS/ is the source of truth.
Layout:
| Source type | Where it lives in PAPERS/ |
|---|---|
| PDF | /mnt/archive4/PAPERS/<canonical-slug>.pdf |
| PPTX | /mnt/archive4/PAPERS/<canonical-slug>.pptx |
| YouTube video | /mnt/archive4/PAPERS/<year>-<slug-tail>/<canonical-slug>.mp4 + <canonical-slug>.en.srt |
| HLS / m3u8 stream | same folder layout as YouTube |
| Local mp4/mkv/webm + SRT | same folder layout as YouTube |
The canonical slug is the same one used for /mnt/archive4/PAPERS/Prepared/<slug>.md (see "REQUIRED: Citable Canonical Naming" above). The video-folder prefix <year>-<slug-tail> is just the canonical slug rotated so the year sorts first — e.g. canonical feller-2024-volumetric-fog-enshrouded → folder 2024-feller-volumetric-fog-enshrouded/.
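The slug-to-folder rotation can be sketched as below. The function name is hypothetical, and the regex assumes the first four-digit token in the slug is the year.

```python
import re


def video_folder(slug: str) -> str:
    """Rotate <authors>-<year>-<topic> into <year>-<authors>-<topic>."""
    m = re.fullmatch(r"(?P<authors>[a-z0-9-]+?)-(?P<year>\d{4})-(?P<tail>.+)", slug)
    if m is None:
        raise ValueError(f"not a canonical slug: {slug!r}")
    return f"{m['year']}-{m['authors']}-{m['tail']}"


print(video_folder("feller-2024-volumetric-fog-enshrouded"))
# 2024-feller-volumetric-fog-enshrouded
```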
Examples:
/mnt/archive4/PAPERS/
├── annen-2008-all-frequency-shadows.pdf
├── hillaire-2020-sky-atmosphere.pdf
├── bauer-2019-rdr2-atmospherics.pptx
├── wright-2021-radiance-caching-lumen.pptx
├── 2024-feller-volumetric-fog-enshrouded/
│ ├── feller-2024-volumetric-fog-enshrouded.mp4
│ └── feller-2024-volumetric-fog-enshrouded.en.srt
└── 2024-dekeersmaecker-numerical-precision-large-worlds/
├── dekeersmaecker-2024-numerical-precision-large-worlds.mp4
└── dekeersmaecker-2024-numerical-precision-large-worlds.en.srt
When to copy: After Pass 1 (automated extraction) finishes and the canonical slug is decided, copy the source(s) into PAPERS/ before you finish the run. Copy must use the canonical slug, never the scaffolding slug emitted by Pass 1.
For PDFs / PPTXs:
cp "<source-path>" "/mnt/archive4/PAPERS/<slug>.<ext>"
For YouTube / HLS / local videos (yt-dlp + research_video.py write into /tmp/research-<random>/<scaffolding>.{mp4,en.srt}):
mkdir -p "/mnt/archive4/PAPERS/<year>-<slug-tail>/"
cp "/tmp/research-XXXX/<scaffolding>.mp4" "/mnt/archive4/PAPERS/<year>-<slug-tail>/<slug>.mp4"
cp "/tmp/research-XXXX/<scaffolding>.en.srt" "/mnt/archive4/PAPERS/<year>-<slug-tail>/<slug>.en.srt"
Why this matters:
- tempfile.mkdtemp(prefix="research-") does not auto-clean, but /tmp is wiped on reboot, and the videos are typically 100 MB+. Without the explicit copy, every /research rerun re-downloads from the network.
Skip criteria: none. Even small or "obvious" sources get archived; the point of the archive is that it's complete. The only exception is a source that is genuinely already at its canonical path in PAPERS/ (cp into the same path is a no-op, but check the size: if the existing copy is smaller / corrupt, replace it).
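The copy-unless-already-archived rule can be sketched as follows. The helper is hypothetical, and using "existing copy is at least as large as the source" as the test for a healthy archive copy is an assumption; the skill only says to check the size.

```python
import shutil
from pathlib import Path


def archive_source(src: Path, dest: Path) -> bool:
    """Copy src to its canonical path unless an equal-or-larger copy exists.

    Returns True when a copy was made, False when the archive was left alone.
    """
    if dest.exists() and dest.stat().st_size >= src.stat().st_size:
        return False
    # Creates the <year>-<slug-tail>/ folder for video sources when needed.
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)
    return True
```

A smaller existing file is treated as truncated and overwritten; a stricter check (for example a checksum) would be needed to detect corruption that preserves the file size.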
The argument is a URL or file path:
- .pdf path → extract text via marker (paper PDFs with a text layer) or PyMuPDF (slide-deck PDFs and scanned PDFs) + full-page rendering (every page for both slide decks and papers; paper-mode pure-prose pages render as pNNN-text.png and are out of vision-pass scope; see render-policy table below). Marker uses Anthropic Claude claude-sonnet-4-6 for structural cleanup (headings, tables, equations) with redo_inline_math enabled; requires CLAUDE_API_KEY (or ANTHROPIC_API_KEY) in env. Cached per-document at assets/<slug>/marker.md so re-runs are free; cache invalidates on PDF mtime change OR provider/model change OR redo_inline_math flag flip.
- .pptx path → render every slide via LibreOffice → PDF → PNG, plus python-pptx text + speaker notes
- .mp4/.mkv/.webm local path → video pipeline with --title and --slug flags
The vision pass NEVER runs on per-figure cutouts extracted from a PDF. This rule supersedes the older "paper-mode extracts embedded images as figures" policy, which produced unusable input for the vision agent and is now removed.
Why cutouts fail. PDF figures — diagrams, flowcharts, plots, cone-tracing illustrations, octree pyramids, cache-architecture diagrams — are typically authored as PostScript/vector composites or as tiled raster mosaics. PyMuPDF's page.get_images(full=True) decomposes a single authored figure into 5-40 separate xref entries: chart chrome split from plot data, sub-panels (a, b, c, d) split apart, vector strokes split from filled regions, decorative banners separated from the photo they frame. When the vision agent receives these cutouts, it cannot recover the authored figure — each cutout is a meaningless fragment. The agent then leans entirely on text-layer prose anchoring to write the description, which means the "vision pass" is in fact a prose-paraphrase pass with image attribution. That defeats the entire point of attaching **Diagram (LLM vision pass):** blocks for auditability.
The unified rule. Every PDF (paper or slide deck) and PPTX (always a slide deck) renders full-page images for the vision pass. The vision agent sees the page exactly as a reader would — caption, figure boundary, surrounding context, and full visual fidelity intact.
| Source class | What renders | Asset filename pattern | Naming rationale |
|---|---|---|---|
| PPTX (always slide deck) | every slide | assets/<slug>/sNNN-slide.png | one render per slide |
| Slide-deck PDF (PowerPoint / Keynote / Google Slides / Beamer / Impress export, or any landscape PDF with 4:3 / 16:10 / 16:9 aspect ratio across all pages) | every page | assets/<slug>/sNNN-slide.png | one render per slide |
| Paper PDF — figure-bearing page (per _page_has_figure) | rendered, vision-pass scope | assets/<slug>/pNNN-page.png | one render per figure-bearing page |
| Paper PDF — pure-prose page | rendered, reference-only embed (vision skips) | assets/<slug>/pNNN-text.png | one render per text-only page; embedded for visual reference (math equations, citation context, marker-fidelity spot-check) but the vision agent does not write a description block |
Paper-mode rendering policy: the extractor renders every paper-PDF page. The figure-bearing/text split is a vision-pass scope decision (encoded in the filename suffix), not a "render or skip" decision. The split is computed by _page_has_figure(page), which returns True if either:
- the page carries embedded images (page.get_images(full=True) non-empty), OR
- the page carries a dense vector drawing (len(page.get_drawings()) >= 12): catches vector flowcharts, cone diagrams, cache-architecture schematics, octree pyramids.
A True result writes the page as pNNN-page.png and the vision agent processes it. A False result writes the page as pNNN-text.png and the markdown emitter prepends <!-- vision-skip: text-only page (embedded for reference / math equation visual) --> immediately above the image reference; the vision agent's hard-input contract treats both signals (filename suffix and HTML comment) as out-of-scope. The reference embed is what lets a human (or a future re-extraction audit) visually verify marker's text-layer extraction of math-bearing prose pages, which is otherwise unverifiable from the markdown alone: marker's LLM cleanup pass occasionally produces KaTeX-incompatible LaTeX (misplaced &, undefined macros), and Pass 2.5 catches the syntax error, but the visual reference is what catches the semantic corruption (dropped exponent, swapped operator, etc.).
Why we render text-only pages too (vs the pre-2026-05 policy that skipped them entirely): math-bearing prose pages were unverifiable when the page render was missing — a refiner could only see marker's text-layer output, with no way to audit it against the source PDF page. Embedding the page render at pNNN-text.png is cheap (a few hundred KB per page) and turns "trust marker's LaTeX" into "spot-check marker's LaTeX against the rendered page". Cost: marginal disk; benefit: catches the class of corruption Pass 2.5 cannot (semantically-wrong-but-syntactically-valid LaTeX).
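The figure-bearing / text-only split described above can be sketched against any object exposing PyMuPDF's `get_images(full=True)` and `get_drawings()` page methods. This is an illustration rather than the skill's actual `_page_has_figure`; the three-digit zero padding, the 1-based page number, and the `StubPage` stand-in are assumptions.

```python
def page_has_figure(page, min_drawings: int = 12) -> bool:
    """True when the page has embedded images or at least `min_drawings` vector paths."""
    if page.get_images(full=True):
        return True
    return len(page.get_drawings()) >= min_drawings


def render_name(page_number: int, page) -> str:
    """pNNN-page.png for vision-scope pages, pNNN-text.png for reference-only pages."""
    suffix = "page" if page_has_figure(page) else "text"
    return f"p{page_number:03d}-{suffix}.png"


class StubPage:
    """Stand-in for a PyMuPDF page, used only to exercise the logic."""

    def __init__(self, images: int, drawings: int):
        self._images, self._drawings = images, drawings

    def get_images(self, full: bool = False) -> list:
        return [object()] * self._images

    def get_drawings(self) -> list:
        return [{}] * self._drawings


print(render_name(7, StubPage(images=0, drawings=3)))   # p007-text.png
print(render_name(8, StubPage(images=0, drawings=12)))  # p008-page.png
```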
Slide-deck classification (is_slide_deck_pdf):
- PPTX: is_slide_deck = True. Rendered via soffice --headless --convert-to pdf, then PyMuPDF rasterises each page.
- PDF: is_slide_deck_pdf(doc) triggers True if any of:
  - creator / producer / title / subject mentions PowerPoint / Keynote / Google Slides / Beamer / Impress / "presentation".
  - every page is landscape with an aspect ratio in [1.25, 1.85] (4:3 ≈ 1.33, 16:10 ≈ 1.6, 16:9 ≈ 1.78), AND page count ≥ 3.
- Override: set "slide_deck": True/False in the SOURCES entry to force a particular mode.
Paper-PDF body-text classification (_classify_paper_pdf):
- text-paper → marker prepass via marker_extract.convert_pdf (real markdown structure, LaTeX equations, table reconstruction). Per-page markdown becomes PageData.text; marker_extract.first_heading() populates PageData.heading (replaces font-size > 14 heuristic).
- scanned → existing OpenOCR fallback path on PyMuPDF page renders. Marker is NOT used (its quality on scan-only PDFs without a usable text layer is poor; OpenOCR's CPU pipeline is the canonical fallback here).
- --no-marker reverts text-paper PDFs to the legacy PyMuPDF span-walker. --no-llm runs marker locally (surya OCR + layout) without LLM API calls; this is viable when offline, but loses table-merge / equation / form / section-header repair.
Marker prepass output (text-paper route only):
- assets/<slug>/ cache dir.
- assets/<slug>/marker.md (paginated markdown) + assets/<slug>/marker-meta.json (PDF mtime + use_llm + provider + model + redo_inline_math flag + LLM token totals). Cache invalidates on PDF mtime change OR use_llm flip OR provider/model change OR redo_inline_math flip; bypass with --force.
- LLMImageDescriptionProcessor: that processor auto-describes every figure with the configured LLM, which would duplicate the /research vision pass with a less-strict prompt and inflate the LLM bill ~10×. Image FILES are also not extracted (we use PyMuPDF page renders for the vision pass).
- claude-sonnet-4-6 via marker.services.claude.ClaudeService, with redo_inline_math: True. API key resolved as: CLAUDE_API_KEY (project .envrc convention, preferred) → ANTHROPIC_API_KEY (Anthropic SDK fallback). convert_pdf raises a clear error if neither is set when use_llm=True. Override: pass claude_model_name="claude-opus-4-7" for the most math-dense primary sources where the cost premium is justified.
- llm_provider="gemini" to fall back to GoogleGeminiService (model gemini-2.0-flash, key GOOGLE_API_KEY/GEMINI_API_KEY). Not recommended: Flash is the documented source of broken-LaTeX output the Pass 2.5 validator was built to catch (misplaced & inside \begin{split}, undefined macros like \ddy, dropped exponents on phase-function formulas). Use only for compatibility with older cached outputs that you don't want to re-extract.
- redo_inline_math also one per inline-math block). Cost-per-call dominates the bill on multi-page papers. Sonnet 4.6 matches Opus 4.7 on focused VQA + structured-JSON math/table cleanup at ~5× lower per-token cost; reserve Opus for thesis-scale math-dense sources where a wrong-formula citation would be especially expensive.
- &, undefined macros), so the extra LLM call per inline-math block is a worthwhile baseline. Marker's own docs: "If you want the absolute highest quality inline math conversion, use this along with --use_llm."
- marker_extract._maybe_force_cpu_inference runs before torch imports and sets CUDA_VISIBLE_DEVICES="" when free VRAM is below 2 GiB. Surya layout + recognition models need ~1.5 GiB contiguous; on dev workstations running Unity editor / Steam fossilize_replay / ML loads in parallel, free VRAM fluctuates unpredictably and a snapshot check a second before allocation is not enough. Forcing CPU when low is preferable to torch OOM-then-fall-through-to-PyMuPDF (which silently degrades quality). CPU inference is ~5–10× slower but acceptable for a typical 8-page paper (under a minute). Override: pre-set CUDA_VISIBLE_DEVICES (any value, including empty) to bypass the check.
- extract_research.py prints "marker FAILED (...); falling back to PyMuPDF span-walker". Surface this in the extractor agent's report so the orchestrator can decide whether to re-run.
Render scale: slide decks render at 2.0× (a slide is already large with low information density per pixel), papers render at 2.5× (paper figures pack smaller-detail axis labels, sub-panel letters, equation glyphs that need extra resolution to be vision-readable).
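The marker cache-invalidation rule (PDF mtime, use_llm, provider, model, redo_inline_math, with `--force` as a bypass) can be sketched as a field-by-field comparison against marker-meta.json. The JSON key names below are assumptions: the section lists what is recorded, not how the fields are spelled.

```python
import json
from pathlib import Path

# Hypothetical key names for the fields the section says are recorded.
CACHE_KEYS = ("pdf_mtime", "use_llm", "provider", "model", "redo_inline_math")


def cache_is_valid(meta_path: Path, current: dict, force: bool = False) -> bool:
    """True when the cached marker output can be reused for the current settings."""
    if force or not meta_path.exists():
        return False
    cached = json.loads(meta_path.read_text(encoding="utf-8"))
    # Any differing field (or a field missing from the cache) invalidates it.
    return all(k in cached and cached[k] == current.get(k) for k in CACHE_KEYS)
```

Token totals are deliberately left out of the comparison, since they describe the previous run rather than its inputs.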
Removed: the old pNNN-figXX.png cutout pattern. Existing extractions that used it must be re-run with --force against the updated extract_research.py. Any vision-pass blocks generated against cutouts are suspect — re-run the vision pass on the new full-page renders.
PPTX → slide image rendering requires LibreOffice headless (soffice / libreoffice on PATH).
# Arch
sudo pacman -S libreoffice-fresh
# Debian / Ubuntu
sudo apt install libreoffice
# macOS
brew install --cask libreoffice
The script auto-detects soffice / libreoffice via shutil.which and falls back to common install paths (/usr/bin/soffice, /Applications/LibreOffice.app/Contents/MacOS/soffice). If LibreOffice is missing, extract_research.py raises a clear error pointing back to this section.
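The detection order described above (PATH lookup first, then the common install paths) might look like the sketch below. It is not the code in extract_research.py; the injectable `which` / `exists` parameters are an assumption added so the logic can be exercised without LibreOffice installed.

```python
import os
import shutil

FALLBACK_PATHS = (
    "/usr/bin/soffice",
    "/Applications/LibreOffice.app/Contents/MacOS/soffice",
)


def find_soffice(which=shutil.which, exists=os.path.exists) -> str:
    """Return the LibreOffice binary path, or raise if none can be found."""
    for name in ("soffice", "libreoffice"):
        found = which(name)
        if found:
            return found
    for path in FALLBACK_PATHS:
        if exists(path):
            return path
    raise FileNotFoundError(
        "LibreOffice not found: install it before rendering PPTX sources"
    )
```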
PPTX conversion takes ~30-60s per deck (LibreOffice cold-start + PDF export). The intermediate PDF is written to a tempfile.mkdtemp directory, which is newly created on every call, so a later run does not reuse it; expect ~1 minute per deck on every run that performs the conversion.
You may see MuPDF error: format error: No common ancestor in structure tree warnings during PPTX rendering — these are non-fatal, MuPDF complaining about LibreOffice's PDF tagging structure. The rendered images are still correct.
For PDFs / PPTXs and recorded-talk videos, dispatch a research-extractor agent with a brief naming the source path / URL and the canonical slug. The agent adds the source to tools/extract_research.py SOURCES, runs the extraction script(s), runs phase2 OCR + cleanup, archives the source to /mnt/archive4/PAPERS/, then reads the produced markdown end-to-end and marks every problematic area inline with a <!-- FIXME(extract): … --> comment — garbled / suspect equations, OCR artefacts, and (critically) each page that carries a figure / plot / diagram and therefore needs a vision-pass description (<!-- FIXME(extract): pNNN needs vision — <one line> -->). It reports back the slug, asset counts, and the count of pages flagged for vision — the orchestrator uses that count to decide whether Pass 2 is dispatched at all (see Pass 2 below).
For trivial cases (single-page paper, source already in SOURCES, just need to rerun under --force), the orchestrator may run inline:
Determine input type and run the appropriate script:
YouTube URL:
~/.claude/skills/research/.venv/bin/python ~/.claude/skills/research/tools/research_video.py "URL"
HLS stream (m3u8 URL, e.g. GDC Vault):
Step 1 — verify and pick quality from the master playlist:
curl -s "MASTER_M3U8_URL" -H 'Origin: ...' -H 'Referer: ...'
# Lists quality sub-playlists; pick the highest resolution index_1.m3u8
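If picking the variant by eye is error-prone, the choice in Step 1 can be sketched in code. The example assumes a standard master playlist in which each `#EXT-X-STREAM-INF` tag carries a `RESOLUTION=WxH` attribute and is followed by the variant URI; `pick_highest_variant` is a hypothetical helper, not part of the tracked tools.

```python
import re
from urllib.parse import urljoin

RESOLUTION = re.compile(r"RESOLUTION=(\d+)x(\d+)")


def pick_highest_variant(master_text, master_url):
    """Return the absolute URL of the variant with the most pixels."""
    best_pixels, best_uri = -1, None
    pending = None  # pixel count taken from the latest #EXT-X-STREAM-INF tag
    for raw in master_text.splitlines():
        line = raw.strip()
        if line.startswith("#EXT-X-STREAM-INF"):
            match = RESOLUTION.search(line)
            pending = int(match.group(1)) * int(match.group(2)) if match else 0
        elif line and not line.startswith("#") and pending is not None:
            # First non-comment line after the tag is that variant's URI.
            if pending > best_pixels:
                best_pixels, best_uri = pending, line
            pending = None
    if best_uri is None:
        raise ValueError("no variant streams found in master playlist")
    return urljoin(master_url, best_uri)  # resolve relative URIs
```

The returned URL is what Step 2 passes to ffmpeg as the quality-specific sub-m3u8.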
Step 2 — download with ffmpeg (use the quality-specific sub-m3u8, not the master):
mkdir -p /tmp/research-SLUG
ffmpeg -y \
-headers $'User-Agent: Mozilla/5.0...\r\nOrigin: https://...\r\nReferer: https://...\r\n' \
-i 'QUALITY_SUB_M3U8_URL' \
-c copy /tmp/research-SLUG/SLUG.mp4
Step 3 — transcribe with faster-whisper (CUDA). If no auto-captions are available (non-YouTube source), generate SRT from audio using the tracked helper tools/transcribe_to_srt.py:
# Requires libcublas on the dynamic-linker search path; set LD_LIBRARY_PATH for this machine:
LD_LIBRARY_PATH=/usr/local/lib/ollama/cuda_v12:$LD_LIBRARY_PATH \
~/.claude/skills/research/.venv/bin/python ~/.claude/skills/research/tools/transcribe_to_srt.py \
/tmp/research-SLUG/SLUG.mp4 \
/tmp/research-SLUG/SLUG.en.srt \
medium
tools/transcribe_to_srt.py is the canonical tracked version of the old inline /tmp/transcribe_to_srt.py snippet — do not recreate it inline. If you need to extend it (different language, larger model, word-level timestamps), edit the tracked file in tools/ and commit the change so the next /research run benefits.
Step 4 — run video pipeline on local file:
~/.claude/skills/research/.venv/bin/python ~/.claude/skills/research/tools/research_video.py \
/tmp/research-SLUG/SLUG.mp4 \
"--title=Full Talk Title (Event Year)" \
"--slug=my-slug"
The script finds the SRT automatically next to the mp4 file (same stem, .en.srt suffix).
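A minimal sketch of that naming convention, assuming the SRT path is derived purely from the video's stem. `expected_srt_path` is a hypothetical helper for illustration, not a function in research_video.py.

```python
from pathlib import Path


def expected_srt_path(video_path):
    """Same directory, same stem, with the .en.srt suffix."""
    video = Path(video_path)
    return video.with_name(video.stem + ".en.srt")
```

This is why Step 3 writes its output as SLUG.en.srt in the same directory as SLUG.mp4.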
PDF/PPTX file: Add to SOURCES in tools/extract_research.py, then:
~/.claude/skills/research/.venv/bin/python ~/.claude/skills/research/tools/extract_research.py --only=SLUG
~/.claude/skills/research/.venv/bin/python ~/.claude/skills/research/tools/extract_research_phase2.py --only=SLUG
Also add to SOURCES_BY_SLUG in tools/extract_research_phase2.py for the --only filter to work.
Cleanup (all types):
~/.claude/skills/research/.venv/bin/python ~/.claude/skills/research/tools/cleanup_research.py --only=SLUG
This produces a rough markdown with native-text-extracted body, screenshots, and transcript text. OCR's role in this skill is narrow: when a page has no native text layer (scanned PDFs, image-only slide exports, PPTX slides whose content is rasterised), phase 1 OCRs the page's image asset and uses the result as the page body — same role native PyMuPDF text extraction plays for PDFs that have a text layer. There is no separate "OCR pass". OCR is just one of two body-text sources phase 1 selects between, gated on whether len(native_text) < 20. This is the only path on which OCR enters the canonical document body.
OCR is never applied to image inclusions inside an otherwise-text-rich doc. Image inclusions are read by the vision pass with full visual context — modern multi-modal LLMs vastly outclass any CPU OCR engine at structured figure description, and an OCR scaffolding block alongside an image only narrows what the vision agent looks at and primes it with mistakes. Phase 2 has no per-image OCR — it only handles PPTX video transcription.
Phase 1's OCR fallback fires for:
- research_video.py, not phase 2): captions cover speaker audio but miss slide content shown only visually, so OCR on each detected scene's representative frame supplies the missing slide text.

extract_research.py no longer emits any OCR-PENDING markers; the OCR fallback runs inline during phase 1 and the result lands directly in the page body.
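The body-text selection described in this section reduces to a single gate. The sketch below assumes the comparison is exactly `len(native_text) < 20` as quoted; `choose_body_text` and the `run_ocr` callback are hypothetical names.

```python
MIN_NATIVE_CHARS = 20  # threshold quoted in the text


def choose_body_text(native_text, run_ocr):
    """Return (body, source). run_ocr is called only when the native
    text layer is too short to serve as the page body."""
    if len(native_text) < MIN_NATIVE_CHARS:
        return run_ocr(), "ocr"
    return native_text, "native"
```

Taking OCR as a callback means the engine is never invoked for pages that already have a usable text layer.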
research_video.py uses a 0.35 Bhattacharyya histogram threshold tuned for typical recorded-talk video. For slide-heavy talks where consecutive slides share a template (same chrome, only text changes), it under-detects badly — e.g. a 40-min, 64-slide deck can collapse to 5–7 detected scenes. Symptom: pass 1 finishes with a number of frame-XXXX-NNNN.jpg files much smaller than the slide count visible in the deck.
When this happens, use the tracked helpers (do not re-create them inline in /tmp):
# Cut down to ~60 scenes (or whatever the deck has) at threshold 0.18
~/.claude/skills/research/.venv/bin/python ~/.claude/skills/research/tools/redetect_scenes.py \
/tmp/research-SLUG/SLUG.mp4 SLUG \
--threshold 0.18 --interval 1.0
# For any scene that's still > 40s, sample additional frames every 20s
~/.claude/skills/research/.venv/bin/python ~/.claude/skills/research/tools/subsample_long_scenes.py \
/tmp/research-SLUG/SLUG.mp4 SLUG \
--interval 20 --min-len 40
# After identifying real slide-start timestamps via vision, group the SRT
# into per-slide windows (one paragraph per slide):
~/.claude/skills/research/.venv/bin/python ~/.claude/skills/research/tools/srt_to_windows.py \
/tmp/research-SLUG/SLUG.en.srt /tmp/slide_starts.txt \
--out /tmp/windowed_transcript.txt
These helpers are non-destructive — they only write new scene-NNN-*.jpg / sub-NNN-MM-*.jpg files into the asset dir and a TSV in /tmp. To clear stale scene files from a prior run with different parameters, delete them explicitly first; the helpers intentionally do not.
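To see why lowering the threshold from 0.35 to 0.18 recovers template-sharing slides, consider a toy version of histogram-based cut detection. This is a sketch under assumptions: it uses the square-root form of the Bhattacharyya distance on normalised histograms, treats a distance above the threshold as a scene cut, and uses made-up 4-bin histograms. The actual logic in research_video.py and redetect_scenes.py may differ.

```python
import math


def bhattacharyya_distance(h1, h2):
    """sqrt(1 - BC) for two histograms with the same number of bins."""
    s1, s2 = sum(h1), sum(h2)
    if s1 == 0 or s2 == 0:
        raise ValueError("empty histogram")
    bc = sum(math.sqrt((a / s1) * (b / s2)) for a, b in zip(h1, h2))
    return math.sqrt(max(0.0, 1.0 - bc))


def detect_cuts(histograms, threshold):
    """Indices of frames that start a new scene (frame 0 always does)."""
    cuts = [0]
    for i in range(1, len(histograms)):
        if bhattacharyya_distance(histograms[i - 1], histograms[i]) > threshold:
            cuts.append(i)
    return cuts


# Two slides sharing a template (only a little text changes), then a hard cut.
frames = [[90, 10, 0, 0], [90, 4, 6, 0], [0, 0, 50, 50]]
```

With these toy frames, `detect_cuts(frames, 0.35)` misses the slide change between the first two frames, while `detect_cuts(frames, 0.18)` reports it.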
If you find yourself wanting yet-another redetection knob (different colourspace, edge-detection instead of histogram, etc.), edit tools/redetect_scenes.py and commit the change — never spawn a one-off /tmp/*.py for it.
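The per-slide windowing step can be pictured as a bisect over slide-start times. The sketch assumes cues have already been parsed into `(start_seconds, text)` pairs and that slide starts are a sorted list of seconds; the real input formats of srt_to_windows.py are not specified here, and `group_cues_by_slide` is a hypothetical name.

```python
from bisect import bisect_right


def group_cues_by_slide(cues, slide_starts):
    """Return one transcript paragraph per slide.

    cues: iterable of (start_seconds, text); slide_starts: sorted seconds.
    """
    if not slide_starts:
        raise ValueError("need at least one slide start")
    windows = [[] for _ in slide_starts]
    for start, text in cues:
        index = bisect_right(slide_starts, start) - 1
        if index < 0:
            index = 0  # speech before the first slide joins the first slide
        windows[index].append(text)
    return [" ".join(parts) for parts in windows]
```

Each cue lands in the window of the latest slide that started at or before the cue's own start time.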
Pass 2 is dispatched ONLY when more than 5 pages need a vision pass. After Pass 1, count the vision flags the extractor left:
grep -c 'FIXME(extract):.*needs vision' /mnt/archive4/PAPERS/Prepared/<slug>.md
- More than 5 flagged pages: dispatch a research-vision sub-agent (Sonnet 4.6). This is the expensive, context-heavy case. Never run it inline; a 268-slide deck would burn the orchestrator's context window in vision-pass alone, and the agent boundary is what makes the pass scale.
- 5 or fewer flagged pages: Pass 2 is not dispatched; the refiner writes the **X (LLM vision pass):** blocks itself.

The rest of this section describes the dispatched-agent case (>5 pages).
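The dispatch decision can also be made programmatically. The sketch below mirrors the grep pattern shown above and counts matching lines, as `grep -c` does; the function names are hypothetical.

```python
import re

VISION_FLAG = re.compile(r"FIXME\(extract\):.*needs vision")
DISPATCH_THRESHOLD = 5  # Pass 2 runs only when MORE than 5 pages are flagged


def count_vision_flags(markdown):
    """Count lines carrying a vision flag (line-based, like grep -c)."""
    return sum(1 for line in markdown.splitlines() if VISION_FLAG.search(line))


def should_dispatch_pass2(markdown):
    return count_vision_flags(markdown) > DISPATCH_THRESHOLD
```

Exactly 5 flagged pages stays below the cutoff, so the refiner handles them in Pass 3.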
The orchestrator decides which pages / slides need vision treatment, then dispatches one research-vision agent per batch. Batches are typically 20-40 images. For very large decks/theses, dispatch multiple batches sequentially (not in parallel — they all Edit the same file).
Triage step (orchestrator does this BEFORE the first dispatch): the extractor only renders figure-bearing pages for paper PDFs (per _page_has_figure), but even within those, not every figure is worth a vision pass. The orchestrator should skim the table of contents / chapter structure and decide which sections are load-bearing for the project's purposes, then list those page ranges in the dispatch brief. For a 200-page thesis, skipping intro / related-work / conclusion / appendix chapters typically halves the vision-pass token cost.
For each batch, the brief MUST contain:
- suzuki-yasutomi-2023-gt7-sky-dome).
- **X (LLM vision pass):** block from a prior batch; re-tagging would duplicate).

The agent is responsible for the format: **Diagram (LLM vision pass):** / **Plot (LLM vision pass):** / **Image (LLM vision pass):** / **Table (LLM vision pass):** / **Code (LLM vision pass):**. See agent definition ~/.claude/agents/research-vision.md and the "Diagram description policy" section earlier in this skill.

sNNN-slide.png (one per slide); paper PDFs → pNNN-page.png (one per figure-bearing page) AND pNNN-text.png (one per pure-prose page, out of vision-pass scope, reference embed only; see "Vision pass MUST run on full-page renders" earlier in this skill); videos → frame-XXXX-NNNN.jpg scene captures. Per-figure cutouts (pNNN-figXX.png) are no longer produced.

Same as the agent-level skip list (see "Skip rules" earlier in this skill):
- **X (LLM vision pass):** block from an earlier batch; re-tagging would duplicate.
- pNNN-text.png (pure-prose, see render-policy table above). The vision agent treats both the -text filename suffix and the <!-- vision-skip: ... --> HTML comment as out-of-scope, so the orchestrator should not include these page numbers in the dispatch list.

The vision agent enforces the same list as a second pass.
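Building the dispatch list can be sketched as a filter followed by chunking. The example assumes assets are identified by filename, that already-tagged pages are known to the orchestrator as a set of filenames, and that a batch size of 30 (inside the 20-40 range mentioned earlier) is acceptable. `plan_vision_batches` is a hypothetical helper.

```python
def plan_vision_batches(assets, already_tagged=(), batch_size=30):
    """Drop out-of-scope assets, then split the rest into ordered batches."""
    tagged = set(already_tagged)
    todo = [
        name
        for name in assets
        # -text renders are pure prose; tagged pages already have a block.
        if not name.endswith("-text.png") and name not in tagged
    ]
    return [todo[i:i + batch_size] for i in range(0, len(todo), batch_size)]
```

The batches come back in document order, which suits sequential dispatch since every batch edits the same file.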
After all Pass-2 vision batches complete and before dispatching Pass 3, the orchestrator runs the syntax validator. This catches LaTeX and Mermaid syntax errors that vision-pass output, marker prepass, or refiner edits may have left behind, and gives the refiner a concrete error list to fix instead of relying on a second model pass to catch every parse error visually.
~/.claude/skills/research/.venv/bin/python ~/.claude/skills/research/tools/validate_research.py --only=<slug>
# add --html for a browser-openable preview at assets/<slug>/<slug>.preview.html
What it does:
- /mnt/archive4/PAPERS/Prepared/<slug>.md line-by-line, extracting every LaTeX block (inline $…$, display $$…$$) and every fenced ```mermaid block. Skips fenced code blocks for non-mermaid languages so dollar signs in shell snippets don't trip the inline-math regex.
- tools/validate_md.mjs (Node helper).
- katex.renderToString({throwOnError: true}): KaTeX is strict about brace balance, undefined macros, missing \right partners, misplaced &, etc.
- mermaid.parse() (jsdom-backed). When mermaid fails to load in Node, blocks downgrade to warnings rather than errors.
- /mnt/archive4/PAPERS/Prepared/assets/<slug>/findings-pass2.5-validate.md with file:line references, snippet previews, and KaTeX/Mermaid error messages.

What gets caught:
- Misplaced & in \begin{split} (e.g. \Phi[q] \in & \left\{ … \\ & \quad …): Expected '\right', got '&' at position N.
- Unbalanced braces (\frac{1}{2, x_{).
- Undefined macros (\ddy, \ddx instead of \partial y / \partial x or plain ddy/ddx).

Refiner brief MUST cite the report. When dispatching Pass 3, the orchestrator brief states the path to findings-pass2.5-validate.md and lists the specific errors the refiner is expected to address. Skipping this step puts the refiner back into "find the bug visually" mode, which is the failure mode that motivated this pass.
Pass 3.5 — re-validate after refine (recommended): re-run the same command after Pass 3 completes. If the report shows zero errors, the run is done. If errors regressed (refiner introduced new ones, missed some, or the LaTeX they wrote doesn't compile), re-dispatch the refiner with the new error list. This is a fast loop — the validator runs in seconds even on 5K-line documents.
HTML preview (--html): writes assets/<slug>/<slug>.preview.html — a self-contained page with KaTeX-rendered math (server-side, so KaTeX errors paint inline in red) and mermaid client-side render (loads mermaid from jsdelivr CDN). Open in a browser to visually verify the document end-to-end. The preview is gitignored implicitly (under assets/<slug>/); if you want it committed, set --html-out=<path> to direct it elsewhere.
Exit-status contract:
- 0: all blocks clean. Proceed to Pass 3 (or, on the post-refine re-validate, finish the run).
- 1: at least one parse error. Block downstream dispatch until resolved.
- 2: tool error (Node missing, node_modules/ missing, malformed CLI args). Fix the toolchain before retrying; do NOT skip Pass 2.5 because the validator failed to set up.

Inline math false-positive guard: the extractor is conservative about $…$ matches. It requires at least one LaTeX-ish character (\^_{}=<>+-*/) and skips matches that look like currency ($5.00, $200/month). It will not flag prose containing dollar signs. If the validator reports an "inline" block that's actually prose, file it as an extractor false-positive and refine the heuristic in tools/validate_research.py rather than wrapping the prose in a math escape.
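A rough sketch of such a guard follows. It is not the heuristic in tools/validate_research.py; the currency pattern in particular is an assumption made for this example. It treats the text between two dollar signs as currency when it opens with an amount (optionally with a `/unit`) and then either ends or continues with ordinary words.

```python
import re

# The character set quoted in the text.
LATEX_ISH = set("\\^_{}=<>+-*/")

# Assumed currency shape: "5.00", "200/month", or either followed by prose.
CURRENCY = re.compile(r"\d[\d,]*(?:\.\d+)?(?:/[A-Za-z]+)?(?:\s+[A-Za-z]|\s*$)")


def looks_like_inline_math(content):
    """content is the text found between two dollar signs."""
    text = content.strip()
    if not text or CURRENCY.match(text):
        return False
    return any(ch in LATEX_ISH for ch in text)
```

In a sentence such as "costs $5.00 and $200/month", the candidate between the two dollar signs is "5.00 and ", which this guard rejects.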
After Pass 2.5 (validate) completes with errors enumerated to disk, dispatch a single research-refiner agent (Opus 4.7, 1M context) with a brief listing the specific concerns the orchestrator wants fixed:
- Every FIXME mark. The brief MUST tell the refiner to grep -n 'FIXME(extract)\|FIXME(vision)' the document, fix each flagged item, and delete the comment once handled. Anything it cannot resolve from text-layer + render evidence alone it re-marks <!-- FIXME(audit): … --> and lists in its return message. A refiner that finishes with FIXME(extract) / FIXME(vision) comments still present has not completed its pass.
- When Pass 2 was not dispatched, the brief lists the pages flagged needs vision and instructs the refiner to write the **X (LLM vision pass):** blocks for them itself, following the "Diagram description policy" section of this skill. (When Pass 2 was dispatched, those blocks already exist; the refiner only flags suspect ones, never rewrites them.)
- The Pass 2.5 validator report (/mnt/archive4/PAPERS/Prepared/assets/<slug>/findings-pass2.5-validate.md): REQUIRED. The refiner is expected to address every error the validator reported. Brief explicitly: "Read the sidecar first; every entry under ## Errors must be fixed in your edit pass."

The refiner reads the document end-to-end in its own context window. Never run this inline either, because the document is typically 3-5 K lines long after Pass 2.
After Pass 3 returns, the orchestrator re-runs tools/validate_research.py --only=<slug> (Pass 3.5) as the clean-room check — see Pass 2.5 above. When 3.5 is clean and no FIXME(extract) / FIXME(vision) marks remain, the run is done; summarise the work in your final message to the user (note any FIXME(audit) marks the refiner escalated).
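The completion check can be summarised in a few lines. The sketch assumes the validator's exit-status contract from Pass 2.5 (0 clean, 1 parse errors, 2 tool error); `finish_status` and its return labels are hypothetical.

```python
import re

BLOCKING = re.compile(r"FIXME\((?:extract|vision)\)")
AUDIT = re.compile(r"FIXME\(audit\)")


def finish_status(validator_exit, markdown):
    """Return (next_action, audit_count) after the Pass 3.5 re-validate."""
    if validator_exit not in (0, 1, 2):
        raise ValueError(f"unexpected validator exit status: {validator_exit}")
    if validator_exit == 2:
        return "fix-toolchain", 0  # validator never ran; nothing to conclude
    audits = len(AUDIT.findall(markdown))
    blocking = len(BLOCKING.findall(markdown))
    if validator_exit == 0 and blocking == 0:
        return "done", audits  # report escalated audit marks to the user
    return "redispatch-refiner", audits
```

A "done" result with a non-zero audit count is the case where the final message should mention the escalated FIXME(audit) marks.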
All scripts live in ~/.claude/skills/research/tools/ and use the skill-local venv at ~/.claude/skills/research/.venv/. None of them silently overwrites an existing per-slug .md: if a <slug>.md already exists, they either skip or write a <slug>.regen-<YYYYMMDD-HHMMSS>-<6hex>.md sidecar.
| Script | Purpose | Destructive? |
|--------|---------|---------------|
| tools/research_video.py | YouTube or local video → scene detection, OCR, transcript alignment. Accepts --title= --slug= flags. SRT is found automatically next to the mp4 (same stem, .en.srt). | Refuses to overwrite existing <slug>.md — writes <slug>.regen-<YYYYMMDD-HHMMSS>-<6hex>.md next to it instead (randomised so concurrent agents don't clobber each other). Pass --force to overwrite in place. |
| tools/redetect_scenes.py | Aggressive scene re-detection for slide-heavy talks (low histogram threshold, finer interval). Writes scene-NNN-*.jpg to the asset dir + a TSV to /tmp. | Append-only. Manually clear stale scene-*.jpg first if you re-run with different parameters. |
| tools/subsample_long_scenes.py | Reads the redetect_scenes TSV and writes additional sub-NNN-MM-*.jpg frames inside any scene longer than --min-len. | Append-only. |
| tools/srt_to_windows.py | Groups an SRT into per-slide transcript windows from a slide_starts.txt. Output to a chosen path (defaults to /tmp). | Writes only to the explicit --out path. |
| tools/transcribe_to_srt.py | faster-whisper SRT generation for non-YouTube videos (HLS, local mp4 with no captions). | Refuses to overwrite an existing SRT — writes <srt-stem>.regen-<YYYYMMDD-HHMMSS>-<6hex>.srt sidecar instead. Pass --force to overwrite in place. |
| tools/extract_research.py | PDF/PPTX → text + image extraction. Supports --only=SLUG and --force. | Refuses to overwrite an existing per-slug .md even under --only — writes a <slug>.regen-<YYYYMMDD-HHMMSS>-<6hex>.md sidecar instead. Pass --force to overwrite in place. Sidecar suffixes are randomised so concurrent agents don't clobber each other. |
| tools/extract_research_phase2.py | Extract videos embedded in PPTX decks and transcribe them with faster-whisper. (Body-text OCR fallback for image-only PDFs / slides moved into phase 1; per-image OCR was removed entirely — the vision pass owns image description.) Supports --only=SLUG[,SLUG2]. | Per-slug .md only. |
| tools/cleanup_research.py | Strip watermarks, duplicate headings, garbage OCR. Supports --only=SLUG. | Per-slug .md only. |
| tools/validate_research.py | Pass 2.5: extract every LaTeX/Mermaid block from /mnt/archive4/PAPERS/Prepared/<slug>.md, validate via the Node helper, write findings-pass2.5-validate.md sidecar. Supports --only=SLUG[,SLUG2], --html. Exits 1 on any parse error. | Read-only on the markdown source; writes only to assets/<slug>/findings-pass2.5-validate.md (and <slug>.preview.html under --html). |
| tools/validate_md.mjs | Node helper invoked by validate_research.py. Reads JSON blocks on stdin, validates LaTeX via katex.renderToString({throwOnError:true}) and Mermaid via mermaid.parse() (jsdom-backed). Returns JSON with per-block ok + error. Not normally called directly. | Pure stdin → stdout, no file writes. |
| tools/render_md_html.mjs | Node helper invoked by validate_research.py --html. Compiles a single markdown to a self-contained HTML preview (KaTeX server-side via @vscode/markdown-it-katex, mermaid client-side via jsdelivr CDN). Not normally called directly. | Writes to the explicit output path passed on argv. |
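
The sidecar naming used by several rows in the table can be sketched as below. This is an illustration of the `<stem>.regen-<YYYYMMDD-HHMMSS>-<6hex>` pattern only; `regen_sidecar_path` and `output_path` are hypothetical names, and the tracked scripts may build the name differently.

```python
import secrets
from datetime import datetime
from pathlib import Path


def regen_sidecar_path(existing, now=None):
    """Build <stem>.regen-<YYYYMMDD-HHMMSS>-<6hex><suffix> next to existing."""
    target = Path(existing)
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    token = secrets.token_hex(3)  # 3 random bytes give 6 hex characters
    return target.with_name(f"{target.stem}.regen-{stamp}-{token}{target.suffix}")


def output_path(existing, force=False):
    """Write in place only under --force or when nothing exists yet."""
    target = Path(existing)
    if force or not target.exists():
        return target
    return regen_sidecar_path(target)
```

The random token is what keeps two agents that start in the same second from choosing the same sidecar name.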
If you need a one-off media-processing helper that doesn't fit the above: edit / add a tracked file under tools/ and commit it. Do not write throwaway helpers to /tmp — every future /research run will re-derive the same script from scratch otherwise.
- http or https and contains m3u8 → HLS stream pipeline (ffmpeg download + whisper)
- http or https → YouTube pipeline (yt-dlp)
- .pdf → PDF extraction
- .pptx → PPTX extraction
- .mp4, .mkv, .webm → local video (research_video.py with --title and --slug)

/mnt/archive4/PAPERS/Prepared/
{slug}.md # one markdown per source
assets/{slug}/ # images, frames, videos
In addition, the source master lives in /mnt/archive4/PAPERS/ (see "REQUIRED: Source Archive in /mnt/archive4/PAPERS/" above). PDFs/PPTXs go top-level as <slug>.pdf/<slug>.pptx; videos go in a <year>-<slug-tail>/ subfolder with both the mp4 and the .en.srt. Copying source into PAPERS/ is part of every /research run, not optional.
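The input-type rules listed in this section can be expressed as a small classifier. The sketch assumes the rules are checked in the order given (m3u8 URLs before other URLs) and that matching is case-insensitive; `classify_input` and its return labels are hypothetical.

```python
from pathlib import Path

VIDEO_SUFFIXES = {".mp4", ".mkv", ".webm"}


def classify_input(source):
    """Map a URL or file path to the pipeline that should handle it."""
    lowered = source.lower()
    if lowered.startswith(("http://", "https://")):
        # m3u8 URLs must be tested first, or they would fall into YouTube.
        return "hls" if "m3u8" in lowered else "youtube"
    suffix = Path(lowered).suffix
    if suffix == ".pdf":
        return "pdf"
    if suffix == ".pptx":
        return "pptx"
    if suffix in VIDEO_SUFFIXES:
        return "local-video"
    raise ValueError(f"unsupported input: {source}")
```

Raising on unknown input keeps an unsupported file from silently entering the wrong pipeline.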
Python venv at ~/.claude/skills/research/.venv/ (managed by uv sync from pyproject.toml): pymupdf, python-pptx, opencv-python-headless, openocr-python, faster-whisper, marker-pdf.
Node modules at ~/.claude/skills/research/node_modules/ (managed by npm install from package.json): katex, mermaid, jsdom, markdown-it, @vscode/markdown-it-katex. Required by Pass 2.5 validators (validate_research.py, validate_md.mjs, render_md_html.mjs).
System: node (>=20), yt-dlp, ffmpeg, libreoffice (PPTX rendering)
OCR engine: OpenOCR (mobile/ONNX backend, auto-downloads models to ~/.cache/openocr/ on first run). Wrapped behind tools/openocr_engine.py as a singleton — model load happens once per process and is shared between phase 1 (body-text fallback for image-only sources) and the video pipeline (research_video.py per-frame OCR). Override behaviour via env vars: OPENOCR_MODE=server (higher accuracy, requires pip install torch torchvision), OPENOCR_BACKEND=torch, OPENOCR_DROP_SCORE=0.5.