.claude/skills/audit-reproducibility/SKILL.md
Enforce the replication-protocol.md rule by cross-checking numeric claims in a manuscript against the actual R / Stata / Python outputs. Report PASS/FAIL per claim against tolerance thresholds. Use before submission and before releasing a replication package.
Compare numeric claims in a manuscript (point estimates, standard errors, p-values, counts) against the actual outputs produced by the analysis pipeline. Report PASS / FAIL per claim against the tolerance thresholds defined in .claude/rules/replication-protocol.md.
Core principle: If the paper says ATT = -1.632 (0.584) and the code produces -1.628 (0.591), we verify — numerically — that the difference is within the documented tolerance. No more "looks close enough" eyeballing.
- /commit: pair with a pre-commit invocation on manuscript and analysis changes.

- $0: path to the manuscript (.tex, .qmd, .md, .pdf). Required.
- $1: path to the outputs directory. Defaults to scripts/R/_outputs/. Can be _targets/objects/, a Stata .do-file log directory, etc.

- Read replication-protocol.md for the tolerance thresholds currently in effect.
- Run the analysis pipeline (Rscript scripts/R/00_run_all.R) before auditing.
- Check that sessionInfo.txt or an equivalent environment capture exists in the outputs directory.

Parse the manuscript for numeric claims. Patterns to match:

- ATT = -1.632 (0.584), $\beta = 0.342$ (0.091), \hat{\tau} = 1.28** with starred significance
- & -1.632$^{***}$ & 0.584 & in LaTeX table environments
- our sample of 2,847 firms, $N = 2{,}847$
- mean = 0.423, SD = 0.087
- p < 0.01, $p = 0.003$

Record each claim as a tuple:
{
claim_id: "Table2_col3_ATT",
location: "Table 2, Column 3, row 'Treatment'",
kind: "point_estimate" | "standard_error" | "p_value" | "count" | "percentage",
reported_value: -1.632,
uncertainty: 0.584, # only for point estimates
significance_stars: 3, # 0-3 or None
raw_context: "the ATT estimate of -1.632 (0.584) indicates..."
}
Write the extracted claims to quality_reports/reproducibility_claims_[manuscript-name].json so the user can review the extraction before audit.
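The extraction step could be sketched roughly as follows. This is a minimal illustration that assumes plain-text input and covers only the "estimate (SE)" and comma-grouped count patterns. The function name extract_claims, the claim_id scheme, and both regexes are hypothetical, not part of the skill.

```python
import json
import re

# Hypothetical patterns: "-1.632 (0.584)" style estimates and "2,847" style counts.
ESTIMATE_RE = re.compile(r"(-?\d+\.\d+)\s*\((\d+\.\d+)\)")
COUNT_RE = re.compile(r"\b(\d{1,3}(?:,\d{3})+)\b")


def extract_claims(text: str) -> list[dict]:
    """Return claim records holding a subset of the fields in the tuple above."""
    claims = []
    for i, m in enumerate(ESTIMATE_RE.finditer(text)):
        claims.append({
            "claim_id": f"estimate_{i}",  # placeholder ID, not the skill's scheme
            "kind": "point_estimate",
            "reported_value": float(m.group(1)),
            "uncertainty": float(m.group(2)),
            "raw_context": text[max(0, m.start() - 40): m.end() + 40],
        })
    for i, m in enumerate(COUNT_RE.finditer(text)):
        claims.append({
            "claim_id": f"count_{i}",
            "kind": "count",
            "reported_value": int(m.group(1).replace(",", "")),
            "uncertainty": None,
            "raw_context": text[max(0, m.start() - 40): m.end() + 40],
        })
    return claims


sample = "In our sample of 2,847 firms, the ATT estimate of -1.632 (0.584) indicates a decline."
claims = extract_claims(sample)
claims_json = json.dumps(claims, indent=2)  # the text that would go into the claims file
```

Serializing to JSON before auditing keeps the extraction reviewable: a wrong regex match shows up as a bad record in the file, not as a silent mismatch later.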
Scan $1 for corresponding values. Priority order:
1. .rds files: readRDS(path)$coef[["treatment"]] style lookups. Can use Rscript -e "saveRDS(summary(readRDS(...)), '/tmp/audit.rds')" to extract.
2. .tex tables: parse LaTeX table cells directly; match on column headers + row labels.
3. .csv summary files: pandas/readr parse, key-value lookup.
4. .out / .log files (Stata, regress output): regex extraction.
5. .json: direct key lookup.

Record each extracted result:
{
source: "scripts/R/_outputs/results.rds",
lookup_key: "fit_main$coefficients['treated']",
value: -1.628,
uncertainty: 0.591,
p_value: 0.005
}
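For the .csv case, a lookup along these lines may be enough. It is a sketch using only the standard library. The column names (term, estimate, std_error, p_value) and the file name coefs.csv are assumptions for illustration; real summary files may be laid out differently.

```python
import csv
import io


def load_results(csv_text: str, source: str) -> list[dict]:
    """Turn a tidy coefficient table into result records like the one above."""
    results = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        results.append({
            "source": source,
            "lookup_key": row["term"],  # hypothetical column name
            "value": float(row["estimate"]),
            "uncertainty": float(row["std_error"]),
            "p_value": float(row["p_value"]),
        })
    return results


# Illustrative contents only; real files come from the outputs directory.
csv_text = "term,estimate,std_error,p_value\ntreated,-1.628,0.591,0.005\n"
results = load_results(csv_text, source="coefs.csv")
```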
Use fuzzy heuristics when exact labels don't match:
- Synonymous labels ("treatment effect" ~ "ATT" ~ "treated")
- The raw_context field (table number, row label, description)

For every claim, produce a match candidate with a confidence score. Claims below 0.7 confidence get flagged as UNMATCHED (manual review needed) rather than silently passing.
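One way to produce such a confidence score is sketched below, using a small synonym table plus string similarity from Python's difflib. The SYNONYMS table and the helper names are hypothetical, and the scoring rule is an assumption; only the 0.7 cutoff comes from the text.

```python
from difflib import SequenceMatcher

# Hypothetical synonym groups; the text gives only this one example.
SYNONYMS = [{"treatment effect", "att", "treated"}]
THRESHOLD = 0.7  # below this, the claim is flagged instead of passed


def label_confidence(claim_label: str, output_label: str) -> float:
    """Score 1.0 for identical or synonymous labels, else a similarity ratio in [0, 1]."""
    a, b = claim_label.lower().strip(), output_label.lower().strip()
    if a == b or any(a in group and b in group for group in SYNONYMS):
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def best_match(claim_label: str, output_labels: list[str]) -> tuple[str, float, str]:
    """Pick the highest-scoring output label; assumes output_labels is non-empty."""
    score, label = max((label_confidence(claim_label, lab), lab) for lab in output_labels)
    status = "MATCHED" if score >= THRESHOLD else "UNMATCHED"
    return label, score, status
```

For example, best_match("ATT", ["treated", "log_assets"]) pairs the claim with "treated" through the synonym group, while a claim with no similar label falls below the cutoff and comes back UNMATCHED.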
For each matched claim, apply the thresholds from replication-protocol.md:
| Kind | Tolerance | Example |
|---|---|---|
| Integers (N, counts) | Exact | 2,847 must equal 2,847 |
| Point estimates | abs(reported - computed) < 0.01 | -1.632 vs -1.628 → diff = 0.004 → PASS |
| Standard errors | abs(reported - computed) < 0.05 | 0.584 vs 0.591 → diff = 0.007 → PASS |
| P-values | Same significance level | p<0.01 and p<0.01 → PASS; p<0.01 and p=0.03 → FAIL |
| Percentages | ±0.1pp | 42.3% vs 42.35% → PASS |
Respect any tolerance overrides the user has written into their replication-protocol.md fork (they may loosen for MC noise or tighten for administrative data).
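The table translates into a small comparison function. The sketch below hard-codes the default thresholds from the table. The 0.10 / 0.05 / 0.01 star cutoffs are the conventional ones and are an assumption here, as are the names check and significance_level.

```python
# Default tolerances copied from the table above; a project's
# replication-protocol.md may override them.
TOLERANCES = {"point_estimate": 0.01, "standard_error": 0.05, "percentage": 0.1}


def significance_level(p: float) -> int:
    """Map a p-value to 0-3 stars using assumed 0.10 / 0.05 / 0.01 cutoffs."""
    return sum(p < cutoff for cutoff in (0.10, 0.05, 0.01))


def check(kind: str, reported: float, computed: float) -> str:
    """Return "PASS" or "FAIL" for one matched claim."""
    if kind == "count":
        return "PASS" if reported == computed else "FAIL"  # integers must match exactly
    if kind == "p_value":
        same = significance_level(reported) == significance_level(computed)
        return "PASS" if same else "FAIL"
    return "PASS" if abs(reported - computed) < TOLERANCES[kind] else "FAIL"


status_att = check("point_estimate", -1.632, -1.628)  # diff = 0.004
status_se = check("standard_error", 0.584, 0.591)     # diff = 0.007
status_p = check("p_value", 0.003, 0.03)              # different significance levels
```

A claim reported only as a bound, such as p < 0.01, would need to be converted to its significance level before comparison; this sketch handles numeric p-values only.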
Write quality_reports/reproducibility_audit_[manuscript-name].md:
# Reproducibility Audit: [Manuscript Title]
**Date:** [YYYY-MM-DD]
**Manuscript:** [path]
**Outputs directory:** [path]
**Tolerance source:** .claude/rules/replication-protocol.md
## Summary
| Status | Count |
|---|---|
| PASS | N |
| FAIL (diff > tolerance) | M |
| UNMATCHED (manual review) | K |
| **Overall verdict** | **PASS / FAIL** |
## PASS (all within tolerance)
| Claim | Reported | Computed | Diff | Tolerance |
|---|---|---|---|---|
| Table2_col3_ATT | -1.632 (0.584) | -1.628 (0.591) | 0.004 / 0.007 | 0.01 / 0.05 |
## FAIL (outside tolerance — BLOCKER)
| Claim | Reported | Computed | Diff | Tolerance | Location in paper |
|---|---|---|---|---|---|
## UNMATCHED (manual review)
| Claim | Raw context | Candidate sources |
|---|---|---|
## Environment
[sessionInfo excerpt]
## Next steps
1. Fix any FAIL rows — either update the manuscript or rerun analysis.
2. Review UNMATCHED rows — add explicit lookup keys or widen the search scope.
3. After zero FAILs, the paper is replication-ready.
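The Summary table of the report can be generated from the per-claim statuses. The sketch below assumes that any FAIL makes the overall verdict FAIL and that UNMATCHED rows are reported without failing the audit; the template does not spell out that rule, so treat it as an assumption. The function name summarize is hypothetical.

```python
from collections import Counter


def summarize(statuses: list[str]) -> str:
    """Render the Summary table from a list of PASS / FAIL / UNMATCHED statuses."""
    counts = Counter(statuses)
    # Assumption: any FAIL blocks; UNMATCHED is surfaced for review but does not fail.
    verdict = "FAIL" if counts["FAIL"] else "PASS"
    rows = [
        "| Status | Count |",
        "|---|---|",
        f"| PASS | {counts['PASS']} |",
        f"| FAIL (diff > tolerance) | {counts['FAIL']} |",
        f"| UNMATCHED (manual review) | {counts['UNMATCHED']} |",
        f"| **Overall verdict** | **{verdict}** |",
    ]
    return "\n".join(rows)
```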
- /commit pre-commit gate: see replication-protocol.md for the enforcement pattern.
- .claude/rules/replication-protocol.md: the tolerance contract.
- .claude/skills/review-r/SKILL.md: catches code-style issues; this skill catches NUMERICAL reproducibility.
- .claude/skills/review-paper/SKILL.md: content review; pair with this skill for a full pre-submission audit.

- This skill only verifies that -1.632 is reproducible. Whether -1.632 is the RIGHT estimand is a review-paper / domain-reviewer question.
- A sessionInfo.txt capture lets a reviewer see the environment; pinning versions is on the user (via renv.lock or a DESCRIPTION file).