skills/research-results-auditor/SKILL.md
Use when auditing completed results for confounds, claim-drift, protocol integrity, or attribution before locking claims into the paper. Not for deciding what to do after a surprising result (use result-diagnosis). Not for significance tests or effect sizes (use statistical-analysis-planner). Not for engineering failures (use experiment-debugger).
npx skillsauth add a-green-hand-jack/ml-research-skills research-results-auditor
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Produce a structured validity verdict for completed experiment results before they become paper claims. This skill asks: is the evidence scientifically defensible as stated?
Use this skill when:
Do not use this skill when experiments are still running or results are not yet final — use result-diagnosis for in-progress diagnosis and next-action decisions. Do not use this skill for test selection or variance reporting — use statistical-analysis-planner.
Pair this skill with:
- result-diagnosis upstream: resolve engineering and scientific issues before auditing validity
- statistical-analysis-planner upstream: ensure variance is characterized before the audit
- paper-evidence-board downstream: update evidence slots with the validity verdict
- paper-writing-contract-planner downstream: narrow or forbid claims that fail the audit
- experiment-design-planner when the audit identifies a controlled experiment needed to rule out a confound
<installed-skill-dir>/
├── SKILL.md
└── references/
    └── audit-criteria.md
- references/audit-criteria.md.
- memory/claim-board.md and memory/evidence-board.md before auditing.
- paper/.agent/paper-evidence-board.md when claims have paper-facing evidence slots.
For each result to audit, write in one sentence:
Claim: [method/component X] causes [effect Y] on [task/benchmark Z], as shown by [metric M].
Then identify:
If the claim cannot be written in one sentence without hedges, that is an early signal of drift.
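The one-sentence template and the hedge check can be sketched in Python. This is only an illustration: the `Claim` class, its field names, and the `HEDGE_WORDS` list are assumptions introduced here, not part of the skill, and the word list is a rough heuristic that would need tuning.

```python
from dataclasses import dataclass

# Hypothetical hedge vocabulary (an assumption); adjust to the project's writing style.
HEDGE_WORDS = ("may", "might", "could", "possibly", "somewhat", "arguably", "tends to")


@dataclass
class Claim:
    component: str  # method/component X
    effect: str     # effect Y
    task: str       # task/benchmark Z
    metric: str     # metric M

    def sentence(self) -> str:
        """Render the claim using the one-sentence template."""
        return (
            f"{self.component} causes {self.effect} on {self.task}, "
            f"as shown by {self.metric}."
        )


def hedges_in(text: str) -> list[str]:
    """Return the hedge phrases found in text (case-insensitive, whole words)."""
    padded = f" {text.lower()} "
    for ch in ",.;:()":
        padded = padded.replace(ch, " ")
    return [w for w in HEDGE_WORDS if f" {w} " in padded]
```

A non-empty result from `hedges_in` on a draft claim is a prompt to look for drift, not proof of it.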
Read references/audit-criteria.md.
For each claim, verify:
statistical-analysis-planner if not)
Flag each violation as:
- blocker: the claim cannot be made as stated without fixing this
- risk: the claim is defensible but a reviewer will likely raise it
- note: disclose in limitations or experiment settings
For each claim, list at least two alternative explanations and classify:
Alternative: [alternative explanation]
Likelihood: high / medium / low
Ruling-out evidence: [experiment, analysis, or prior work that rules it out]
Status: ruled-out / possible / unaddressed
Common confounds in ML:
Any unaddressed confound with high or medium likelihood is an audit blocker.
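The blocker rule above can be expressed as a small filter. This is a minimal sketch: the `Alternative` class and the function names are hypothetical, while the field values mirror the Likelihood and Status options in the template.

```python
from dataclasses import dataclass


@dataclass
class Alternative:
    explanation: str
    likelihood: str  # "high", "medium", or "low"
    status: str      # "ruled-out", "possible", or "unaddressed"


def confound_blockers(alternatives: list[Alternative]) -> list[Alternative]:
    """Return alternatives that block the claim: unaddressed with high or medium likelihood."""
    return [
        a for a in alternatives
        if a.status == "unaddressed" and a.likelihood in ("high", "medium")
    ]


def has_enough_alternatives(alternatives: list[Alternative]) -> bool:
    """The audit asks for at least two alternative explanations per claim."""
    return len(alternatives) >= 2
```

An unaddressed alternative with low likelihood is not returned by `confound_blockers`; under the rule as stated it is not a blocker, though it may still be worth disclosing.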
Compare the measured quantity to the stated conclusion. Flag drift if:
For each drift finding, record:
Check that comparative claims meet minimum statistical support:
Any comparative claim without reported uncertainty is classified as a risk.
For each audited claim, produce:
Claim: [one-sentence claim]
Protocol verdict: pass | conditional | fail
Confound verdict: clean | possible-[confound-name] | unaddressed-[confound-name]
Drift verdict: none | scoped-[drift-type]
Inferential verdict: supported | conditional | unsupported
Overall verdict: valid | narrowed | blocked | insufficient-evidence
Recommended wording:
Allowed: [maximum defensible phrasing]
Forbidden: [specific language that overclaims]
Required actions:
- [action to resolve each blocker or risk]
Save the audit report to paper/.agent/results-audit-<date>.md.
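One way to produce a report in the format above is sketched below. The `AuditVerdict` class and `report_path` helper are hypothetical names, and the sketch assumes `<date>` means an ISO date (YYYY-MM-DD), which the skill does not specify.

```python
from dataclasses import dataclass, field
from datetime import date
from pathlib import Path


@dataclass
class AuditVerdict:
    claim: str
    protocol: str      # pass | conditional | fail
    confound: str      # clean | possible-<name> | unaddressed-<name>
    drift: str         # none | scoped-<drift-type>
    inferential: str   # supported | conditional | unsupported
    overall: str       # valid | narrowed | blocked | insufficient-evidence
    allowed: str
    forbidden: str
    actions: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render one audited claim in the report template's field order."""
        lines = [
            f"Claim: {self.claim}",
            f"Protocol verdict: {self.protocol}",
            f"Confound verdict: {self.confound}",
            f"Drift verdict: {self.drift}",
            f"Inferential verdict: {self.inferential}",
            f"Overall verdict: {self.overall}",
            "Recommended wording:",
            f"Allowed: {self.allowed}",
            f"Forbidden: {self.forbidden}",
            "Required actions:",
        ]
        lines += [f"- {action}" for action in self.actions]
        return "\n".join(lines) + "\n"


def report_path(root: Path, day: date) -> Path:
    """Build paper/.agent/results-audit-<date>.md under root (ISO date is an assumption)."""
    return root / "paper" / ".agent" / f"results-audit-{day.isoformat()}.md"
```

Writing the file is then `path.parent.mkdir(parents=True, exist_ok=True)` followed by `path.write_text(verdict.render())`. The sketch does not derive the overall verdict from the four sub-verdicts, since the mapping is not given here.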
- memory/claim-board.md: narrow or strengthen claims based on verdicts
- memory/risk-board.md: add confound and drift risks with severity
- memory/action-board.md: add required actions for each blocker
- paper/.agent/paper-evidence-board.md: annotate evidence slots with validity verdict
Before finishing: