skills/distill/SKILL.md
Convert heavy document formats (PDF, Word, Excel, PowerPoint, and 10+ others) to token-efficient Markdown/CSV with structurally-aware digest compression. Use when Claude needs to read documents without burning excessive context budget. Triggers on /distill, 'distill this', 'convert to markdown', 'make this readable'.
All subagent dispatches use disk-mediated dispatch. See shared/dispatch-convention.md for the full protocol.
Convert heavy document formats to token-efficient representations (Markdown, CSV) for LLM consumption. The core deliverable is the .digest.md: a structurally-aware compression at roughly 20-30% of the converted file's token count.
Skill type: Rigid — follow exactly, no shortcuts.
Models:
Announce at start: "I'm using the distill skill to convert documents to token-efficient formats."
/distill <path> [path2 ...]
/distill <directory>
Examples:
- /distill docs/report.pdf: convert one file
- /distill docs/report.pdf data/sheet.xlsx slides/deck.pptx: convert multiple files
- /distill docs/: convert all supported files in directory (single-level, not recursive)
Mixed mode is supported: /distill docs/ extra/report.pdf
Execute phases in this order. Each phase completes for all files before the next begins.
At skill start, before processing any files, check for required tools:
| Check | Command | If Missing |
|---|---|---|
| Tier 1 | which pandoc | "pandoc not found. Install: apt install pandoc (Debian/Ubuntu) or brew install pandoc (macOS). Tier 1 formats will be skipped." |
| Tier 2 | which pdftotext | "pdftotext not found. Install: apt install poppler-utils (Debian/Ubuntu) or brew install poppler (macOS). PDF conversion will be skipped." |
| Tier 3 | which python3 | "python3 not found. PPTX and XLSX conversion will be skipped." |
| Pre-flight | which unzip | Skip zip bomb detection with note. Not a conversion blocker. |
| Pre-flight | which pdfdetach | Skip PDF attachment detection with note. Not a conversion blocker. |
Build a set of available tiers. Route files only to available tiers. Files targeting unavailable tiers get routed to unsupported-with-guidance (Phase 1b).
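The tier bookkeeping above can be sketched as a few small shell functions. This is a minimal illustration, not the skill's own implementation: the function names (have_tool, available_tiers, tier_is_available) are hypothetical, and it uses `command -v` where the table uses `which`, since `command -v` is the POSIX-specified lookup.

```shell
# Hypothetical helper: succeeds when a tool is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Print the available tiers as a space-separated list, e.g. "1 3".
available_tiers() {
    tiers=""
    have_tool pandoc    && tiers="$tiers 1"
    have_tool pdftotext && tiers="$tiers 2"
    have_tool python3   && tiers="$tiers 3"
    # Drop the leading space before printing.
    echo "${tiers# }"
}

# Succeeds when tier $1 appears in the space-separated list $2.
tier_is_available() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

A caller would compute the list once with `TIERS=$(available_tiers)` and then guard each file with `tier_is_available "$tier" "$TIERS"`.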
Individual file paths: Use directly. Verify each file exists.
Directory paths: Single-level glob for files with supported extensions (not recursive). Build file list sorted alphabetically. Report: "Found {N} convertible files in {directory}: {list}."
Supported extensions for glob: .pdf, .docx, .rtf, .html, .htm, .odt, .epub, .rst, .org, .tex, .ipynb, .pptx, .xlsx
Mixed mode: Process both directory globs and individual paths. Deduplicate by absolute path.
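One way to sketch the resolution rules above (single-level glob, pass-through for individual paths, deduplication by absolute path). The helper names are hypothetical, and the sketch assumes file names do not contain newlines. It sorts the combined list, which is slightly broader than sorting each directory listing on its own.

```shell
# Print the absolute path of an existing file without relying on realpath.
abs_path() {
    dir=$(cd "$(dirname "$1")" && pwd -P) || return 1
    printf '%s/%s\n' "$dir" "$(basename "$1")"
}

# Succeeds when the name ends in one of the supported extensions.
is_supported() {
    case "$1" in
        *.pdf|*.docx|*.rtf|*.html|*.htm|*.odt|*.epub|*.rst|*.org|*.tex|*.ipynb|*.pptx|*.xlsx) return 0 ;;
        *) return 1 ;;
    esac
}

# Directories expand to their supported files (one level only);
# individual files pass through unchanged so routing can report on them.
resolve_inputs() {
    for arg in "$@"; do
        if [ -d "$arg" ]; then
            for f in "$arg"/*; do
                [ -f "$f" ] && is_supported "$f" && abs_path "$f"
            done
        elif [ -f "$arg" ]; then
            abs_path "$arg"
        else
            echo "File not found: $arg" >&2
        fi
    done | sort -u   # sort -u both orders the list and removes duplicates
}
```

Called as `resolve_inputs docs/ extra/report.pdf`, this prints one absolute path per line, so a file named both directly and via its directory appears once.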
For each file, determine the conversion tier by extension:
| Extension | Tier | Format Flag |
|---|---|---|
| .docx | 1 | docx |
| .rtf | 1 | rtf |
| .html | 1 | html |
| .htm | 1 | html |
| .odt | 1 | odt |
| .epub | 1 | epub |
| .rst | 1 | rst |
| .org | 1 | org |
| .tex | 1 | latex |
| .ipynb | 1 | ipynb |
| .pdf | 2 | — |
| .pptx | 3 | — |
| .xlsx | 3 | — |
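The routing table maps naturally onto a shell `case` statement. The sketch below uses a hypothetical function name (route_file) and prints "-" where the table has no format flag. Matching is case-sensitive here, which is an assumption since the table does not say how uppercase extensions are treated.

```shell
# Print "<tier> <format-flag>" for a path, following the routing table.
# Returns 1 for extensions that are not in the table.
route_file() {
    case "$1" in
        *.docx)        echo "1 docx" ;;
        *.rtf)         echo "1 rtf" ;;
        *.html|*.htm)  echo "1 html" ;;
        *.odt)         echo "1 odt" ;;
        *.epub)        echo "1 epub" ;;
        *.rst)         echo "1 rst" ;;
        *.org)         echo "1 org" ;;
        *.tex)         echo "1 latex" ;;
        *.ipynb)       echo "1 ipynb" ;;
        *.pdf)         echo "2 -" ;;
        *.pptx|*.xlsx) echo "3 -" ;;
        *)             return 1 ;;
    esac
}
```

For example, `route_file paper.tex` prints `1 latex`, and a non-zero return signals that the file belongs in the unsupported or unknown path.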
Unsupported formats: Output actionable guidance per this table, then continue with remaining files:
| Extension | Guidance |
|---|---|
| .xls | "Legacy Excel format. Export as .xlsx from Excel/LibreOffice, then re-run /distill." |
| .ods | "OpenDocument Spreadsheet. Export as .csv (single-sheet) or .xlsx (multi-sheet), then re-run /distill." |
| .odp | "OpenDocument Presentation. Export as .pptx, then re-run /distill." |
| .key | "Apple Keynote. Export as .pptx from Keynote, then re-run /distill." |
| .numbers | "Apple Numbers. Export as .xlsx from Numbers, then re-run /distill." |
| .pages | "Apple Pages. Export as .docx from Pages, then re-run /distill." |
Unknown extensions: "Unsupported format: {ext}. Supported formats: docx, rtf, html, odt, epub, rst, org, tex, ipynb, pdf, pptx, xlsx."
Unavailable tier: If a file's tier is unavailable (tool missing from Phase 0), report: "{file}: requires {tool} (not installed). Skipping."
Run per-file safety checks before conversion. Failures are per-file — do not halt the batch.
Office formats are ZIP archives. If unzip is available:
UNCOMPRESSED=$(unzip -l "$INPUT_PATH" 2>/dev/null | tail -1 | awk '{print $1}')
If uncompressed size exceeds 500MB (524288000 bytes), abort this file: "File uncompressed size ({size}) exceeds 500MB safety limit. Skipping."
If unzip is not available, skip this check (noted in Phase 0).
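A sketch of how the size comparison could be wrapped so that it stays per-file and never blocks when unzip is missing. The function names and the ZIP_LIMIT_BYTES variable are hypothetical. Treating an empty or non-numeric size as "not over the limit" is an assumption, not something the skill specifies.

```shell
ZIP_LIMIT_BYTES=524288000   # 500MB, the limit stated above

# Succeeds when the given byte count is over the limit.
# Empty or non-numeric input is treated as "not over" (assumption).
exceeds_zip_limit() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -gt "$ZIP_LIMIT_BYTES" ]
}

# Returns 1 (and prints the skip message) when the file should be skipped.
check_zip_size() {
    # Without unzip the check is skipped rather than treated as a failure.
    command -v unzip >/dev/null 2>&1 || return 0
    size=$(unzip -l "$1" 2>/dev/null | tail -1 | awk '{print $1}')
    if exceeds_zip_limit "$size"; then
        echo "File uncompressed size ($size) exceeds 500MB safety limit. Skipping."
        return 1
    fi
}
```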
For PDF files, if pdfdetach is available:
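# Match only the numbered entries ("1: name"); the "N embedded files" header line also starts with a digit
ATTACHMENTS=$(pdfdetach -list "$INPUT_PATH" 2>/dev/null | grep -c "^[0-9][0-9]*:")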
If attachments found, warn: "PDF contains {N} embedded attachments. These are not extracted — only text content is converted." Continue with conversion.
After conversion (not before), verify output is valid UTF-8:
file --mime-encoding "$OUTPUT_PATH"
If not UTF-8, attempt re-encoding: iconv -f <detected-charset> -t UTF-8 "$OUTPUT_PATH" -o "$OUTPUT_PATH.tmp" && mv "$OUTPUT_PATH.tmp" "$OUTPUT_PATH". If re-encoding fails, report and skip.
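The verify-then-re-encode step might look like the sketch below. Function names are hypothetical. Two choices here are assumptions rather than the skill's stated behavior: us-ascii is accepted without conversion (every ASCII file is also valid UTF-8), and iconv writes through a shell redirect instead of -o.

```shell
# Succeeds when the reported encoding needs no conversion.
is_utf8_compatible() {
    case "$1" in
        utf-8|us-ascii) return 0 ;;
        *)              return 1 ;;
    esac
}

# Verify one output file and re-encode it in place if needed.
ensure_utf8() {
    # -b prints only the encoding name, without the "path:" prefix.
    enc=$(file -b --mime-encoding "$1") || return 1
    is_utf8_compatible "$enc" && return 0
    # Write to a temp file first so a failed iconv never clobbers the output.
    if iconv -f "$enc" -t UTF-8 "$1" > "$1.tmp" 2>/dev/null; then
        mv "$1.tmp" "$1"
    else
        rm -f "$1.tmp"
        echo "$1: could not re-encode from $enc to UTF-8. Skipping."
        return 1
    fi
}
```

When `file` reports something iconv cannot read, such as `binary`, the function falls through to the report-and-skip branch.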
Process files sequentially. For each file:
INPUT_PATH="$1"
OUTPUT_PATH="${INPUT_PATH%.*}.md"
FORMAT="$2" # from routing table
pandoc -f "$FORMAT" -t markdown --wrap=none "$INPUT_PATH" -o "$OUTPUT_PATH"
Shell safety: All file paths via quoted shell variables. Never inline interpolation. Never use unquoted $() or backtick interpolation of file paths.
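To see why the quoting rule matters, the short demonstration below counts the arguments a command receives for a path containing a space. The path and the count_args helper are made up for illustration.

```shell
# Print how many arguments were received.
count_args() {
    echo "$#"
}

path="docs/annual report.pdf"   # hypothetical file name containing a space

quoted=$(count_args "$path")    # the whole path arrives as one argument
unquoted=$(count_args $path)    # word splitting yields "docs/annual" and "report.pdf"

echo "quoted=$quoted unquoted=$unquoted"   # prints: quoted=1 unquoted=2
```

With the unquoted form, a converter would be handed two nonexistent files instead of one real one.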
Error handling:
Idempotency: Overwrites existing output files without warning.
Step 1 — Extract:
INPUT_PATH="$1"
TEXT_PATH="${INPUT_PATH%.*}.txt"
OUTPUT_PATH="${INPUT_PATH%.*}.md"
pdftotext -layout "$INPUT_PATH" "$TEXT_PATH"
Scanned PDF detection: Count total characters and pages:
CHARS=$(wc -c < "$TEXT_PATH")
PAGES=$(pdfinfo "$INPUT_PATH" 2>/dev/null | grep "^Pages:" | awk '{print $2}')
If pdfinfo is unavailable, estimate pages from pdftotext output (count form-feed characters). If average chars/page < 50, report: "This PDF appears to be scanned/image-based. Text extraction produced minimal content. Consider OCR processing externally before distilling." Skip structuring pass. Clean up temp .txt file.
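The scanned-PDF heuristic can be sketched as follows, including the form-feed fallback. The function names are hypothetical, and the fallback assumes pdftotext was run without -nopgbrk so that each page ends in a form feed.

```shell
# Estimate the page count by counting form-feed bytes in the extracted text.
pages_from_text() {
    n=$(tr -cd '\f' < "$1" | wc -c | tr -d ' ')
    # Guard against division by zero when no form feed is present.
    [ "$n" -gt 0 ] || n=1
    echo "$n"
}

# Succeeds when the average bytes per page is below 50.
# $1 is the extracted text file; $2 is an optional page count from pdfinfo.
looks_scanned() {
    chars=$(wc -c < "$1" | tr -d ' ')
    pages=${2:-$(pages_from_text "$1")}
    [ "$pages" -gt 0 ] || pages=1
    [ $((chars / pages)) -lt 50 ]
}
```

Note that `wc -c` counts bytes, so multi-byte text looks slightly denser than it is; for a threshold this low the difference rarely matters.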
Step 2 — Structure: Dispatch a Sonnet agent using skills/distill/pdf-structurer-prompt.md to transform the raw pdftotext output into clean Markdown with recovered headings, lists, tables, and code blocks. Write result to OUTPUT_PATH. Clean up temp .txt file.
Venv setup (once per invocation, only if Tier 3 files exist):
VENV="/tmp/crucible-distill-venv"
# Health check: discard a venv whose interpreter no longer runs
if [ -d "$VENV" ]; then
    "$VENV/bin/python3" -c "import sys" 2>/dev/null || rm -rf "$VENV"
fi
# Create if missing
if [ ! -d "$VENV" ]; then
    echo "Installing Python dependencies (one-time setup, ~15 seconds)..."
    python3 -m venv "$VENV"
    if ! "$VENV/bin/pip" install --quiet python-pptx==1.0.2 openpyxl==3.1.5; then
        echo "Failed to install Python dependencies."
        echo "Manual install: pip install python-pptx==1.0.2 openpyxl==3.1.5"
        echo "PPTX and XLSX conversion will be skipped."
        # Remove the partial venv so the next invocation retries the install
        rm -rf "$VENV"
        # Route remaining Tier 3 files to unsupported
        return
    fi
fi
PPTX conversion:
"$VENV/bin/python3" skills/distill/convert_pptx.py --input "$INPUT_PATH" --output "$OUTPUT_PATH"
XLSX conversion:
"$VENV/bin/python3" skills/distill/convert_xlsx.py --input "$INPUT_PATH" --output-dir "$(dirname "$INPUT_PATH")"
Output: one CSV per sheet at {basename}-{sheetname}.csv. Sheetnames sanitized (spaces → hyphens, special chars stripped).
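The naming rule can be illustrated in shell, although the real sanitization happens inside convert_xlsx.py and may differ. The function names are hypothetical, and the set of characters kept (letters, digits, dot, underscore, hyphen) is an assumption about what "special chars stripped" means.

```shell
# Replace spaces with hyphens, then drop everything outside the kept set.
sanitize_sheet_name() {
    printf '%s' "$1" | tr ' ' '-' | LC_ALL=C tr -cd 'A-Za-z0-9._-'
}

# Build the CSV path for one sheet next to the input workbook.
# $1 is the workbook path; $2 is the sheet name.
csv_path_for_sheet() {
    base=$(basename "${1%.*}")
    printf '%s/%s-%s.csv\n' "$(dirname "$1")" "$base" "$(sanitize_sheet_name "$2")"
}
```

For instance, `csv_path_for_sheet data/sheet.xlsx "Q1 Sales (final)"` prints `data/sheet-Q1-Sales-final.csv`.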
After all conversions complete, run the digest pass on eligible files.
Eligibility:
- .md (not .csv)
Dispatch: For each eligible file, dispatch a Sonnet digest agent using skills/distill/digest-prompt.md. Before dispatching, fill template placeholders: replace {{ORIGINAL_WORDS}} with the converted file's word count and {{TARGET_WORDS}} with 25% of that count. The raw pdftotext output (for pdf-structurer-prompt.md) or converted .md content (for digest-prompt.md) is included as a content block below the prompt template in the dispatch file.
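Filling the two placeholders could be done with wc and sed, roughly as sketched here. The function name is hypothetical, and rounding the 25% target down with integer arithmetic is an assumption.

```shell
# Print the template with both placeholders filled in.
# $1 is the prompt template; $2 is the converted .md file.
fill_digest_template() {
    original=$(wc -w < "$2" | tr -d ' ')
    target=$((original * 25 / 100))   # integer arithmetic rounds down
    sed -e "s/{{ORIGINAL_WORDS}}/$original/g" \
        -e "s/{{TARGET_WORDS}}/$target/g" "$1"
}
```

A dispatch file would then be the filled template followed by the content block, for example `{ fill_digest_template "$TEMPLATE" "$OUTPUT_PATH"; cat "$OUTPUT_PATH"; } > "$DISPATCH_FILE"`, where TEMPLATE and DISPATCH_FILE are placeholder variable names.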
Quality check: After the digest agent returns, count words in the digest:
Output: Write digest to {original-path-without-ext}.digest.md.
Word count is a proxy for token count. These diverge for code-heavy or CJK content, but word count is sufficient for v1.
After all conversions and digests complete, output:
## Distill Summary
| File | Format | Tier | Converted | Digest | Token Savings |
|---|---|---|---|---|---|
| {file} | {format} | {tier} | {output} ({words} words) | {digest} ({words} words) | ~{pct}% |
**Total:** {N} files converted, {M} digests produced, ~{pct}% average token savings on digestible content.
Generated files can be added to .gitignore if not needed in version control.
Token savings per file = 1 - (digest words / converted words), expressed as a percentage.
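As a worked example of the formula, a 1000-word converted file with a 250-word digest gives 1 - 250/1000 = 75% savings. The sketch below does the same calculation with integer arithmetic. The function name is hypothetical, and returning 0 for an empty converted file is an assumption.

```shell
# Print the savings percentage as a whole number.
# $1 is the digest word count; $2 is the converted word count.
token_savings_pct() {
    # Avoid dividing by zero for an empty conversion.
    [ "$2" -gt 0 ] || { echo "0"; return 1; }
    echo $((100 - $1 * 100 / $2))
}

token_savings_pct 250 1000   # prints 75
```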
Files that were skipped (unsupported, tool missing, pre-flight failure) are listed separately:
**Skipped:** {N} files
- {file}: {reason}
Every Bash command that touches file paths MUST use quoted shell variables:
# CORRECT
pandoc -f "$FORMAT" -t markdown --wrap=none "$INPUT_PATH" -o "$OUTPUT_PATH"
# WRONG — never do this
pandoc -f $FORMAT -t markdown --wrap=none $INPUT_PATH -o $OUTPUT_PATH
- Always "$VAR", never bare $VAR
- Never unquoted $() or backtick interpolation of paths
| Failure | Behavior |
|---|---|
| Tool not installed | Skip tier, report with install guidance, continue |
| Conversion fails (non-zero exit) | Report per-file, continue with remaining files |
| Empty conversion output | Report per-file, continue |
| Zip bomb detected | Skip file, report, continue |
| Scanned PDF | Report, skip digest, continue |
| Venv/pip failure | Skip Tier 3, report with manual install instructions |
| Digest out of range | One retry, accept second result regardless |
| File not found | Report, continue with remaining files |
| Permission denied | Report, continue |
| Encoding error | Attempt re-encode, skip on failure, continue |
Principle: Never halt the batch for a single file failure. Report and continue.
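The report-and-continue principle amounts to a loop that records each failure instead of exiting. In this minimal sketch, convert_one is a hypothetical stand-in for the real per-tier converters and simply fails on any name containing "bad".

```shell
# Hypothetical stand-in for a real converter.
convert_one() {
    case "$1" in
        *bad*) return 1 ;;
        *)     return 0 ;;
    esac
}

# Process every file; a failure is recorded and the loop moves on.
run_batch() {
    converted=0
    skipped_count=0
    skipped=""
    for f in "$@"; do
        if convert_one "$f"; then
            converted=$((converted + 1))
        else
            skipped_count=$((skipped_count + 1))
            skipped="$skipped
- $f: conversion failed"
        fi
    done
    echo "Converted: $converted"
    if [ "$skipped_count" -gt 0 ]; then
        printf '**Skipped:** %d files%s\n' "$skipped_count" "$skipped"
    fi
    return 0
}
```

Running `run_batch a.docx bad.pdf c.pptx` converts two files and lists bad.pdf under the skipped heading, and the batch itself still returns success.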
Standalone usage:
- /distill <path>: convert one or more files
- /distill <directory>: convert all supported files in directory
Called by:
Dispatches:
- skills/distill/pdf-structurer-prompt.md
- skills/distill/digest-prompt.md
Does not dispatch: No quality gate, no red-team, no review loop. Distill is a utility skill: it converts and compresses. Quality is ensured by the digest quality metric (word count check + one retry).