skills/manuscript-provenance/SKILL.md
Computational provenance audit verifying every number, table, and figure in a manuscript derives from code, not manual entry. Triggers on: "check provenance", "verify reproducibility", "audit my pipeline", "are my numbers from code", "provenance audit". Companion to manuscript-review (prose audit).
Pipeline position: Phase 2a (grounding audit). Runs in parallel with manuscript-typography. Depends on: content settled after Phase 1 fixes. Produces macro manifest consumed by manuscript-review Pass 13 (Cross-Element Coherence).
Verify that a manuscript is a faithful rendering of computational outputs. Every number, table, figure, category label, ordering, and threshold in the document must trace to a specific script, config file, or pipeline output. Manual data entry in a manuscript is a reproducibility defect.
This skill produces a provenance map — a structured report linking each manuscript artifact to its generating code — and flags every break in the chain.
Companion skill: manuscript-review audits the document as prose (structure,
argumentation, citations). This skill audits whether the document content is
computationally grounded. Run both for complete pre-publication coverage.
| Concern | manuscript-review | This skill (manuscript-provenance) |
| ------------------------- | ----------------------------------------------------------- | -------------------------------------------------------------- |
| Reproducibility | Does the paper describe enough to reproduce? (§6) | Does the code actually produce what the paper claims? (§1, §7) |
| Figures/Tables | Legible, accessible, well-formatted? (§12) | Generated by scripts, not manual entry? (§2, §3) |
| Rendered visuals | Readable at print scale? Floats near references? (§23) | Figure generation script produces correct format? (§3) |
| Hyperparameters | Listed in the paper with rationale? (§6) | Values trace to config files, not hardcoded? (§1, §8) |
| Code availability | Statement exists in the paper? (§17) | Repo URL valid, README accurate, pipeline works? (§11) |
| Terminology | Abbreviations consistent within document? (§14) | Terms match code identifiers? (§5) |
| Significant figures | Consistent precision within document? (§12) | Precision matches script output? (§2) |
| Figure format | Appropriate format for document quality? (§12) | Format generated by script, not manually exported? (§3) |
| Computational cost | Reported in the paper? (§7) | Values trace to benchmarking scripts? (§1) |
| Macro-prose coherence | Prose framing appropriate for injected value? (§24) | Value traced to code, macro manifest produced? (§4) |
| Cross-element consistency | Prose, captions, figures, tables mutually consistent? (§24) | All elements from same run/pipeline output? (§9) |
Rule: This skill never judges prose quality. manuscript-review never opens the codebase. Each reads the other's report when available.
Integration point (Macro Manifest): This skill produces a macro manifest as part of the §4 audit: a structured list of every macro-injected value with:

- the macro name (e.g. \bestf)
- the value (e.g. 0.847)

manuscript-review's Pass 13 (Cross-Element Coherence, §24) consumes this manifest to check whether the prose surrounding each injected value is appropriate for the actual numeric value. Provenance owns "is this value computationally grounded?" Review owns "does the text wrapping this value make sense given what the value is?"
In scope:

- Macro definitions (\newcommand, \def, \pgfmathsetmacro)

Out of scope:
This audit requires TWO artifacts:

1. Manuscript source: .tex files (preferred), or PDF/DOCX as fallback
2. The project codebase that produces the manuscript's outputs

If the user provides only one, ask for the other. LaTeX source is strongly preferred over compiled PDF: provenance auditing requires seeing the raw markup, macros, and input commands.
1a. Manuscript Artifact Extraction
Read all .tex files (main + included via \input/\include). Extract:

- \newcommand, \def, \pgfmathsetmacro, and custom command definitions that carry data values
- tabular/table environment: cell values, row/column ordering, headers
- \includegraphics paths, caption content, referenced data
- \input{generated/*.tex} patterns that pull from script-generated LaTeX fragments
- \label/\ref pairs for cross-referencing

Build an artifact registry: a flat list of every data-carrying element in the manuscript with its location (file, line number).
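The extraction step above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the `Artifact` record, the function name, and the regular expressions are all hypothetical, and the patterns are deliberately simple (they miss `\def`, `\pgfmathsetmacro`, nested braces, and multi-line definitions).

```python
import re
from dataclasses import dataclass


@dataclass
class Artifact:
    kind: str   # "macro", "figure", or "input"
    name: str   # macro name, or the path passed to the command
    value: str  # macro body; empty for figures and inputs
    file: str
    line: int


# Simplified patterns (assumption): single-line, no nested braces.
MACRO_RE = re.compile(r"\\newcommand\{?(\\[A-Za-z]+)\}?\{([^{}]*)\}")
FIGURE_RE = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^{}]+)\}")
INPUT_RE = re.compile(r"\\(?:input|include)\{([^{}]+)\}")


def extract_artifacts(tex: str, filename: str) -> list[Artifact]:
    """Return a flat registry of data-carrying elements with file and line."""
    registry = []
    for lineno, line in enumerate(tex.splitlines(), start=1):
        for m in MACRO_RE.finditer(line):
            registry.append(Artifact("macro", m.group(1), m.group(2), filename, lineno))
        for m in FIGURE_RE.finditer(line):
            registry.append(Artifact("figure", m.group(1), "", filename, lineno))
        for m in INPUT_RE.finditer(line):
            registry.append(Artifact("input", m.group(1), "", filename, lineno))
    return registry


SAMPLE = r"""\newcommand{\bestf}{0.847}
\input{generated/metrics.tex}
\includegraphics[width=\linewidth]{figures/roc.pdf}
"""
registry = extract_artifacts(SAMPLE, "paper.tex")
```

A real audit would also need to follow `\input`/`\include` recursively and parse table cells, which a line-by-line regex pass does not attempt.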
1b. Codebase Mapping
Scan the project directory. Identify:
- Makefile, snakemake, dvc.yaml, run.sh, main.py, or equivalent orchestration
- config.toml, config.yaml, .env, params.yaml, hyperparameter files
- Output directories (results/, output/, figures/, tables/, generated/)
- .tex files in output directories that scripts produce for \input inclusion

Build a source registry: a flat list of every code artifact that produces or configures manuscript content.
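As a rough sketch of the codebase scan, the snippet below walks a project directory and buckets files by the names listed above. The function name and the registry keys are hypothetical, and `Snakefile` is an assumption about what "snakemake" orchestration looks like on disk.

```python
import tempfile
from pathlib import Path

# File and directory names taken from the list above; "Snakefile" is assumed.
ORCHESTRATION = {"Makefile", "Snakefile", "dvc.yaml", "run.sh", "main.py"}
CONFIG_NAMES = {"config.toml", "config.yaml", ".env", "params.yaml"}
OUTPUT_DIRS = {"results", "output", "figures", "tables", "generated"}


def build_source_registry(root: Path) -> dict[str, list[str]]:
    """Bucket project files into a flat source registry (paths relative to root)."""
    registry = {"orchestration": [], "config": [], "output_dirs": [], "generated_tex": []}
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if path.is_dir():
            if path.name in OUTPUT_DIRS:
                registry["output_dirs"].append(rel.as_posix())
            continue
        if path.name in ORCHESTRATION:
            registry["orchestration"].append(rel.as_posix())
        elif path.name in CONFIG_NAMES:
            registry["config"].append(rel.as_posix())
        elif path.suffix == ".tex" and any(p in OUTPUT_DIRS for p in rel.parts[:-1]):
            # Only .tex files under an output directory count as generated.
            registry["generated_tex"].append(rel.as_posix())
    return registry


# Demo on a throwaway project layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "generated").mkdir()
    (root / "generated" / "metrics.tex").write_text("% generated\n")
    (root / "Makefile").write_text("all:\n")
    (root / "config.yaml").write_text("epochs: 50\n")
    (root / "paper.tex").write_text("\\input{generated/metrics.tex}\n")
    source_registry = build_source_registry(root)
```

Matching on file names alone is a heuristic; a `.tex` file sitting in `generated/` is only presumed to be script output until a script that writes it is found.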
For each entry in the artifact registry, attempt to establish a provenance chain: manuscript value → generated output → script → input data/config.
2a. Value Provenance
For every number in the manuscript:
Classification:
2b. Table Provenance
For each table:
Classification:
2c. Figure Provenance
For each figure:
\includegraphics?

Classification:
2d. Terminology Provenance
For each named mode, mechanism, category, or method label:
Classification:
greedy_search, manuscript says "Greedy Search" in some places and "greedy approach" in others)

2e. Ordering Provenance
For each ordered list, ranked comparison, or sequenced enumeration:
Classification:
3a. LaTeX Macro Hygiene
- \newcommand{\someMetric}{42.7} defined directly in .tex files (bad) vs \input{generated/metrics.tex} where that file is script output (good)
- .tex files that carry numeric/data values

3b. Pipeline Completeness
3c. Config/Code Separation
3d. Stale Output Detection
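One way to approximate a staleness check is to compare file modification times: an output older than any script or input that feeds it is suspect. The sketch below assumes that approach; `is_stale` and the file names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def is_stale(output: Path, sources: list[Path]) -> bool:
    """Return True when any source was modified after the output was written."""
    out_mtime = output.stat().st_mtime
    return any(src.stat().st_mtime > out_mtime for src in sources)


# Demo: set modification times explicitly so the result is deterministic.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "plot_results.py"
    figure = Path(tmp) / "roc.pdf"
    script.write_text("# plotting script\n")
    figure.write_bytes(b"")

    os.utime(figure, (1_700_000_000, 1_700_000_000))  # figure written first
    os.utime(script, (1_700_000_500, 1_700_000_500))  # script edited later
    stale_before = is_stale(figure, [script])

    os.utime(figure, (1_700_001_000, 1_700_001_000))  # figure regenerated
    stale_after = is_stale(figure, [script])
```

Modification times are not preserved by a fresh `git clone` or checkout, so in a cloned repository a content hash or commit-history comparison is likely the more reliable signal.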
3e. Version Pinning
4a. Macro Manifest Generation
Produce the macro manifest — the primary handoff artifact to manuscript-review. For every data-carrying macro identified in Phase 1a and traced in Phase 2a:
Macro: \bestf
Value: 0.847
Source: results/metrics.json → scripts/generate_latex_macros.py → generated/metrics.tex
Locations:
- paper.tex:142 — "achieving an F1 score of \bestf{}"
- paper.tex:287 — "The \bestf{} result represents a substantial improvement"
- abstract.tex:8 — "...with \bestf{} F1 score"
Classification: MACRO-TRACED
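The source chain in the entry above implies a script that turns a metrics file into LaTeX macro definitions. A minimal sketch of such a generator follows; it is an illustration under assumptions, not the contents of any real `generate_latex_macros.py`, and the metric key `bestf` and the three-decimal precision are chosen only to match the example.

```python
import json


def render_macros(metrics: dict[str, float], precision: int = 3) -> str:
    """Render one \\newcommand per metric, formatted to a fixed precision."""
    lines = ["% AUTO-GENERATED: do not edit by hand."]
    for name, value in sorted(metrics.items()):
        if not name.isalpha():
            # LaTeX control words may contain letters only.
            raise ValueError(f"invalid macro name: {name!r}")
        lines.append(f"\\newcommand{{\\{name}}}{{{value:.{precision}f}}}")
    return "\n".join(lines) + "\n"


# Stand-in for the contents of results/metrics.json (hypothetical values).
metrics = json.loads('{"bestf": 0.8472}')
generated = render_macros(metrics)
```

In a pipeline the returned string would be written to the generated `.tex` file that the manuscript pulls in with `\input`, so the precision shown in the paper is decided in one place.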
Also include every bare number (not a macro) found in Phase 1a that carries data (metrics, counts, parameters) — these are values that SHOULD be macros but aren't:
Bare value: 50
Location: paper.tex:198 — "convergence after 50 epochs"
Should-be-macro: YES — this is a training parameter, should trace to config
Classification: UNTRACED (no macro, no provenance)
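Candidate bare numbers like the one above can be surfaced with a simple scan of the prose. The sketch below is a heuristic under stated assumptions: the function name and patterns are hypothetical, comment stripping ignores escaped `\%`, and every hit still needs a judgment on whether it actually carries data.

```python
import re

# Integers or decimals not glued to a word, a backslash, or a preceding dot.
NUMBER_RE = re.compile(r"(?<![\w\\.])\d+(?:\.\d+)?(?!\w)")
# Commands whose arguments are references or paths, not data (assumed list).
SKIP_RE = re.compile(
    r"\\(?:ref|cite|label|includegraphics|input)\*?(?:\[[^\]]*\])?\{[^{}]*\}"
)


def find_bare_numbers(tex: str, filename: str) -> list[dict]:
    """Return candidate bare numbers with their file, line, and context."""
    hits = []
    for lineno, raw in enumerate(tex.splitlines(), start=1):
        line = raw.split("%", 1)[0]  # drop comments (approximation)
        if line.lstrip().startswith("\\newcommand"):
            continue  # macro definitions are handled by the macro audit
        cleaned = SKIP_RE.sub("", line)
        for m in NUMBER_RE.finditer(cleaned):
            hits.append({
                "value": m.group(0),
                "location": {"file": filename, "line": lineno, "context": line.strip()},
            })
    return hits


SAMPLE = r"""\newcommand{\bestf}{0.847}
We observe convergence after 50 epochs, achieving \bestf{} (see Table~\ref{tab:1}).
"""
bare = find_bare_numbers(SAMPLE, "paper.tex")
```

On this sample the scan reports only the `50` on line 2; the macro body and the digit inside `\ref{tab:1}` are skipped.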
Save the manifest as [manuscript-name]-macro-manifest.json alongside the
provenance report. This file is consumed by manuscript-review Pass 13
(Cross-Element Coherence) to verify prose-value appropriateness.
4b. Cross-Reference with manuscript-review
If a manuscript-review report exists for this manuscript, load it and:
If no manuscript-review report exists, recommend running it as a companion audit and note that the macro manifest is available for its Pass 13.
Load references/checklist.md and references/report-template.md.
Read references/checklist.md
Read references/report-template.md
Generate the provenance report following the template structure:
Save two files in the manuscript directory:

- [manuscript-name]-provenance-report.md: the full provenance report
- [manuscript-name]-macro-manifest.json: the structured macro manifest for consumption by manuscript-review Pass 13

The macro manifest JSON structure:
{
  "macros": [
    {
      "name": "\\bestf",
      "value": "0.847",
      "source_chain": "results/metrics.json → scripts/gen_macros.py → generated/metrics.tex",
      "locations": [
        {
          "file": "paper.tex",
          "line": 142,
          "context": "achieving an F1 score of \\bestf{}"
        },
        {
          "file": "paper.tex",
          "line": 287,
          "context": "The \\bestf{} result represents a substantial improvement"
        }
      ],
      "classification": "MACRO-TRACED"
    }
  ],
  "bare_numbers": [
    {
      "value": "50",
      "location": {
        "file": "paper.tex",
        "line": 198,
        "context": "convergence after 50 epochs"
      },
      "section": "methodology",
      "should_be_macro": true,
      "rationale": "Training parameter — should trace to config",
      "classification": "UNTRACED"
    }
  ]
}
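Because another skill consumes this file, a light structural check before saving may be worthwhile. The sketch below validates the key sets visible in the structure above; the validator itself and its error-message format are hypothetical, and it checks key presence only, not value types.

```python
import json

# Key sets read off the manifest structure shown above.
MACRO_KEYS = {"name", "value", "source_chain", "locations", "classification"}
LOCATION_KEYS = {"file", "line", "context"}
BARE_KEYS = {"value", "location", "section", "should_be_macro", "rationale", "classification"}


def validate_manifest(manifest: dict) -> list[str]:
    """Return human-readable problems; an empty list means the shape is valid."""
    problems = []
    for i, macro in enumerate(manifest.get("macros", [])):
        missing = MACRO_KEYS - macro.keys()
        if missing:
            problems.append(f"macros[{i}] missing {sorted(missing)}")
        for j, loc in enumerate(macro.get("locations", [])):
            missing = LOCATION_KEYS - loc.keys()
            if missing:
                problems.append(f"macros[{i}].locations[{j}] missing {sorted(missing)}")
    for i, bare in enumerate(manifest.get("bare_numbers", [])):
        missing = BARE_KEYS - bare.keys()
        if missing:
            problems.append(f"bare_numbers[{i}] missing {sorted(missing)}")
    return problems


# Demo: the bare-number entry deliberately omits "rationale".
manifest = json.loads(r"""
{
  "macros": [{
    "name": "\\bestf", "value": "0.847",
    "source_chain": "results/metrics.json",
    "locations": [{"file": "paper.tex", "line": 142, "context": "F1 of \\bestf{}"}],
    "classification": "MACRO-TRACED"
  }],
  "bare_numbers": [{
    "value": "50",
    "location": {"file": "paper.tex", "line": 198, "context": "after 50 epochs"},
    "section": "methodology", "should_be_macro": true,
    "classification": "UNTRACED"
  }]
}
""")
problems = validate_manifest(manifest)
```

Here `problems` contains a single entry pointing at the missing `rationale` key.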
Present to the user:
CRITICAL — Value in manuscript has no provenance chain AND is a key result (main finding, abstract metric, table headline number). This means the paper's core claims cannot be verified from code.
HIGH — Value/table/figure is untraced or stale, and appears in results or methodology sections. Reproducibility gap.
MEDIUM — Terminology mismatch, manual ordering, partial table generation, config values hardcoded in scripts. Maintenance and consistency risk.
LOW — Minor issues: display-name mapping missing but terms are close, non-critical figures without generation scripts, cosmetic post-editing of generated figures.
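The four levels above can be read as an ordered rule set, checked from most to least severe. The sketch below encodes that reading; the `Finding` fields are hypothetical, and treating everything that matches no earlier rule as LOW is an assumption, since the definitions do not say how an untraced value outside results or methodology should be ranked.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    traced: bool                     # a provenance chain was established
    stale: bool = False              # output older than its generating code
    key_result: bool = False         # main finding, abstract metric, headline number
    section: str = ""                # e.g. "results", "methodology"
    maintenance_issue: bool = False  # terminology mismatch, manual ordering, etc.


def severity(finding: Finding) -> str:
    """Apply the severity rules in descending order and return the first match."""
    if not finding.traced and finding.key_result:
        return "CRITICAL"
    if (not finding.traced or finding.stale) and finding.section in {"results", "methodology"}:
        return "HIGH"
    if finding.maintenance_issue:
        return "MEDIUM"
    return "LOW"  # assumption: fall-through for anything not matched above


abstract_metric = severity(Finding(traced=False, key_result=True, section="abstract"))
stale_table = severity(Finding(traced=True, stale=True, section="results"))
label_mismatch = severity(Finding(traced=True, maintenance_issue=True))
appendix_figure = severity(Finding(traced=True, section="appendix"))
```

Checking CRITICAL first matters: an untraced headline number in the results section satisfies both of the first two rules and should be reported at the higher level.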
Binary provenance. Every artifact is either traced or not. No "partially reproducible" — partial means broken.
Code is truth. When manuscript and code disagree, the manuscript is wrong until proven otherwise. Flag the disagreement; do not assume the manuscript author "meant to" override code output.
Macros over magic numbers. Every data value in LaTeX should be a macro. Every macro should be generated. No exceptions for "obvious" values.
Pipeline as proof. If make (or equivalent) does not produce the PDF from
raw data, the manuscript is not reproducible. Partial pipelines get partial
credit, not a pass.
Config is not code. Hyperparameters, thresholds, model names, file paths — all belong in config files, not scattered through script bodies.
Ordering is data. The sequence of items in a table or enumeration is an assertion. It must come from code (sort order, enum definition) not from the author's sense of what "looks right."
Timestamps matter. A figure generated last month from a script modified yesterday is suspect. Stale outputs are provenance failures.
Companion, not replacement. This audit checks computational grounding. manuscript-review checks document quality. Both are needed. Neither subsumes the other.
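User says any of: "check provenance", "verify reproducibility", "audit my pipeline", "are my numbers from code", "provenance audit".
All trigger this skill.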