skills/baseline-selection-audit/SKILL.md
Audit ML/AI experimental baselines for necessity, fairness, currency, and reviewer risk. Use when choosing baselines or checking SOTA comparisons.
npx skillsauth add a-green-hand-jack/ml-research-skills baseline-selection-audit
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Turn a claim, method, draft experiment plan, or literature map into a reviewer-proof baseline set and fairness ledger.
Use this skill when:
Do not use this skill for citation metadata checks. Use citation-audit for BibTeX and LaTeX correctness. Use citation-coverage-audit when the primary question is missing references rather than missing comparisons.
Pair this skill with:
- literature-review-sprint before this skill when the competing paper map is incomplete
- algorithm-design-planner when the closest baseline changes the method design
- experiment-design-planner after this skill to turn selected baselines into a concrete experiment matrix
- run-experiment only after baseline scope, fairness rules, and stop conditions are clear
- result-diagnosis when baseline results are surprising, unstable, or stronger than the proposed method
- paper-evidence-board when baseline risks must be linked to paper claims, figures, and sections
- research-project-memory when baseline decisions, risks, and actions should persist across sessions

<installed-skill-dir>/
├── SKILL.md
└── references/
    ├── baseline-taxonomy.md
    ├── fairness-ledger.md
    ├── memory-writeback.md
    ├── report-template.md
    └── reviewer-risk.md
- Read references/baseline-taxonomy.md, references/fairness-ledger.md, and references/reviewer-risk.md.
- Read references/report-template.md before writing the final audit.
- Read references/memory-writeback.md when the project has memory/, component .agent/ folders, or the user asks for persistent project memory.
- experiment-design-planner.

Collect:
- CLM-###, EVD-###, RSK-###, or ACT-###

Rewrite the claim into:
We need to show that [method] improves [property] over [comparison set] under [task/protocol], without the result being explained by [confound].
If this cannot be written, route to research-idea-validator, algorithm-design-planner, or paper-evidence-board.
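The claim rewrite can be treated as a fill-in-the-blanks check. The sketch below is only an illustration: the `Claim` class, its field names, and both helper functions are hypothetical and are not part of the skill. It renders the sentence when every slot is filled and reports the empty slots otherwise, which is the case where the text above says to route elsewhere.

```python
from dataclasses import dataclass, fields

# Template copied from the sentence above. The slot names are hypothetical labels.
CLAIM_TEMPLATE = (
    "We need to show that {method} improves {prop} over {comparison_set} "
    "under {protocol}, without the result being explained by {confound}."
)


@dataclass
class Claim:
    method: str = ""
    prop: str = ""            # the property being improved
    comparison_set: str = ""
    protocol: str = ""        # the task/protocol slot
    confound: str = ""


def missing_slots(claim: Claim) -> list[str]:
    """Return the names of slots that are still empty, in template order."""
    return [f.name for f in fields(claim) if not getattr(claim, f.name).strip()]


def render_claim(claim: Claim) -> str:
    """Fill the template, or raise if the claim cannot be written yet."""
    missing = missing_slots(claim)
    if missing:
        raise ValueError(f"claim cannot be written; missing: {', '.join(missing)}")
    return CLAIM_TEMPLATE.format(**vars(claim))
```

A non-empty result from `missing_slots` is the signal to route to research-idea-validator, algorithm-design-planner, or paper-evidence-board instead of continuing the audit.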
Use:
Classify each candidate using references/baseline-taxonomy.md.
The pool should include:
For each candidate, assign exactly one:
- must-have: paper is hard to defend without it
- should-have: materially improves reviewer confidence, but omission may be defensible
- optional: useful context, low acceptance impact
- not-comparable: related but unfair or invalid as a direct comparison
- citation-only: should be discussed/cited but does not need an experiment

Every must-have baseline needs an owner, experiment form, fairness constraints, and fallback if impossible.
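One way to keep the category rules checkable is to store each candidate as a record and validate it. This is a minimal sketch under assumptions: the `Candidate` class, its field names, and `problems` are hypothetical. Only the five category names and the required must-have fields come from the text above, and the `reason` check reflects the requirement that not-comparable baselines be justified.

```python
from dataclasses import dataclass, field

# The five categories listed above; each candidate gets exactly one.
CATEGORIES = {"must-have", "should-have", "optional", "not-comparable", "citation-only"}


@dataclass
class Candidate:
    name: str
    category: str
    owner: str = ""
    experiment_form: str = ""
    fairness_constraints: list[str] = field(default_factory=list)
    fallback: str = ""
    reason: str = ""  # only used for not-comparable candidates


def problems(candidate: Candidate) -> list[str]:
    """List the rule violations for one candidate (an empty list means it passes)."""
    issues = []
    if candidate.category not in CATEGORIES:
        issues.append(f"unknown category: {candidate.category!r}")
    if candidate.category == "must-have":
        # Every must-have needs all four of these to be filled in.
        for attr in ("owner", "experiment_form", "fairness_constraints", "fallback"):
            if not getattr(candidate, attr):
                issues.append(f"must-have is missing {attr}")
    if candidate.category == "not-comparable" and not candidate.reason:
        issues.append("not-comparable is missing a reason")
    return issues
```

Because `category` is a single string field, a candidate cannot carry two categories at once, which mirrors the "assign exactly one" rule.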
Every not-comparable baseline needs a reason:
Read references/fairness-ledger.md.
For each must-have and should-have baseline, check:
If fairness cannot be achieved, decide whether to:
Read references/reviewer-risk.md.
For each missing, weak, or unfair baseline, write the likely reviewer objection:
Reviewer could say: [attack].
Severity: fatal / major / medium / minor
Mitigation: run / cite / justify / narrow claim / move to appendix / accept risk
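The objection template above maps naturally onto a small record type. The following is a hedged sketch, not the skill's own format: `ReviewerRisk`, `by_severity`, and the `baseline` field are hypothetical names, and sorting with fatal first is an assumed stand-in for ordering by acceptance impact. The allowed severity and mitigation values are taken from the template.

```python
from dataclasses import dataclass

# Allowed values, copied from the template above, most severe first.
SEVERITIES = ("fatal", "major", "medium", "minor")
MITIGATIONS = ("run", "cite", "justify", "narrow claim", "move to appendix", "accept risk")


@dataclass
class ReviewerRisk:
    baseline: str      # hypothetical field: which baseline the objection concerns
    attack: str
    severity: str
    mitigation: str

    def __post_init__(self):
        # Reject values that are not in the template's vocabulary.
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if self.mitigation not in MITIGATIONS:
            raise ValueError(f"unknown mitigation: {self.mitigation!r}")

    def render(self) -> str:
        """Format the record in the three-line template shown above."""
        return (
            f"Reviewer could say: {self.attack.rstrip('.')}.\n"
            f"Severity: {self.severity}\n"
            f"Mitigation: {self.mitigation}"
        )


def by_severity(risks: list[ReviewerRisk]) -> list[ReviewerRisk]:
    """Order risks from fatal to minor (assumed proxy for acceptance impact)."""
    return sorted(risks, key=lambda r: SEVERITIES.index(r.severity))
```

Validating the vocabulary at construction time keeps free-form severities such as "high" or "blocker" out of the ledger, so entries stay comparable across audits.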
Prioritize by acceptance impact:
For experiment-design-planner, output:
If compute is limited, propose a staged plan:
Read references/report-template.md.
If saving to a project and no path is given, use:
docs/experiments/baseline_selection_audit_YYYY-MM-DD_<short-name>.md
If working inside a code repo or code worktree created by init-python-project / new-workspace, prefer:
docs/reports/baseline_selection_audit_YYYY-MM-DD_<short-name>.md
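The two default locations above differ only in the folder, so the path can be built from a date and a short name. This is a sketch under assumptions: `report_path` and its `in_code_repo` flag are hypothetical, and the lowercase, hyphen-separated slug is one possible reading of `<short-name>`, which the source does not define.

```python
import re
from datetime import date
from pathlib import PurePosixPath
from typing import Optional


def report_path(short_name: str, in_code_repo: bool = False,
                on: Optional[date] = None) -> PurePosixPath:
    """Build the default report path for an audit.

    in_code_repo=True selects docs/reports/, the location preferred above for
    repos or worktrees created by init-python-project / new-workspace.
    """
    on = on or date.today()
    # Assumed normalization: lowercase, runs of other characters become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", short_name.lower()).strip("-")
    if not slug:
        raise ValueError("short_name must contain at least one letter or digit")
    folder = "docs/reports" if in_code_repo else "docs/experiments"
    # date.isoformat() yields YYYY-MM-DD, matching the pattern above.
    return PurePosixPath(folder) / f"baseline_selection_audit_{on.isoformat()}_{slug}.md"
```

Passing `on` explicitly keeps the filename reproducible when an audit is regenerated on a later day.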
The report must include:
Read references/memory-writeback.md when memory exists.
Update the smallest useful set of entries:
- memory/risk-board.md: missing, unfair, unavailable, or not-comparable baseline risks
- memory/evidence-board.md: planned baseline comparisons and ablations
- memory/action-board.md: implementation, run, citation, or justification actions
- memory/claim-board.md: claims narrowed by baseline feasibility
- memory/decision-log.md: durable decisions to include, exclude, or stage baselines
- .agent/worktree-status.md: baseline implementation purpose and exit condition
- paper/.agent/: table/section implications when a draft exists

Use certainty labels:
- verified for baselines checked against primary sources or official code
- user-stated for constraints supplied by the user
- inferred for reviewer risks and fairness judgments
- unverified for candidates not yet checked

Before finalizing:
- must-have baselines are explicit
- experiment-design-planner