skills/experiment-design-planner/SKILL.md
Design hypothesis-driven ML/AI experiments before running. Use for ablations, baselines, metrics, controls, seeds, logging, and claim-evidence matrices.
npx skillsauth add a-green-hand-jack/ml-research-skills experiment-design-planner
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Turn a research claim into an experiment plan that can actually answer it. This skill is for planning before running, not for reporting completed results.
Use this skill when:
Pair this skill with:
- research-project-memory when the experiment plan should become project-level evidence, risk, and action memory
- run-experiment after the design is ready to execute
- experiment-report-writer after results exist
- paper-reviewer-simulator to stress-test whether the evidence will satisfy reviewers
- baseline-selection-audit before finalizing the experiment matrix when baseline choice, fairness, or reviewer-proof comparisons need deeper review
- figure-results-review after plotted or tabulated results exist and need claim-support review

<installed-skill-dir>/
├── SKILL.md
└── references/
├── ablation-matrix.md
├── evidence-standards.md
├── metrics-and-controls.md
└── report-template.md
- Read references/evidence-standards.md and references/metrics-and-controls.md.
- Read references/ablation-matrix.md when the plan compares variants, components, baselines, hyperparameters, datasets, or model sizes.
- Read references/report-template.md when saving or returning a substantial experiment plan.
- If the project has memory/, update planned evidence, experiment families, risks, and actions using research-project-memory conventions.

Extract:
- single: one controlled comparison
- ablation: component or variable isolation
- benchmark: compare methods across datasets/tasks
- theory: empirical support for a theoretical prediction
- diagnostic: understand a failure mode or surprising result

Rewrite vague goals into testable questions:
Vague: Does our method work?
Testable: Does component X improve metric M over baseline B on datasets D1/D2 under the same training budget?
Write:
If the user cannot state a falsification condition, the experiment is not ready.
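The testable-question pattern and the falsification rule above can be captured in a small record. This is a minimal sketch, not part of the skill itself: the class name `ClaimQuestion` and all of its field names are hypothetical, chosen only to mirror the Vague/Testable example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimQuestion:
    """One research claim phrased so a result can confirm or refute it (hypothetical schema)."""

    component: str
    metric: str
    baseline: str
    datasets: tuple
    budget: str
    falsification: str = ""  # the result that would refute the claim

    def is_ready(self) -> bool:
        # Mirrors the rule above: without a falsification condition, the experiment is not ready.
        return bool(self.falsification.strip())

    def as_question(self) -> str:
        # Fill the "Testable" template with the concrete slots.
        return (
            f"Does {self.component} improve {self.metric} over {self.baseline} "
            f"on {'/'.join(self.datasets)} under {self.budget}?"
        )


question = ClaimQuestion(
    component="component X",
    metric="metric M",
    baseline="baseline B",
    datasets=("D1", "D2"),
    budget="the same training budget",
)
print(question.as_question())
print("ready to plan:", question.is_ready())
```

Here `question.is_ready()` is false because no falsification condition was supplied, which is the signal to go back to the user before designing runs.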
Read references/evidence-standards.md.
Decide what evidence is needed:
Identify:
If no baseline exists, make the first experiment a baseline-establishment experiment.
Read references/metrics-and-controls.md.
For each metric, specify:
Define required logging:
Read references/ablation-matrix.md when there is more than one run.
Create a run table with:
Split experiments if a run changes more than one conceptual variable.
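One way to apply the split rule mechanically is to compare every row of the run table against a reference configuration and flag rows that differ in more than one field. The sketch below assumes each dictionary key stands for one conceptual variable; the column names (`model`, `component_x`, `dataset`) and run ids are illustrative, not columns prescribed by references/ablation-matrix.md.

```python
# Reference configuration that every run is compared against (illustrative values).
BASELINE = {"model": "baseline B", "component_x": False, "dataset": "D1"}

# A tiny run table; each row is one planned run.
RUNS = [
    {"run_id": "r0", "model": "baseline B", "component_x": False, "dataset": "D1"},
    {"run_id": "r1", "model": "baseline B", "component_x": True, "dataset": "D1"},
    {"run_id": "r2", "model": "baseline B", "component_x": True, "dataset": "D2"},
]


def changed_variables(run, baseline):
    """Return the sorted names of variables whose value differs from the baseline."""
    return sorted(key for key, value in baseline.items() if run.get(key) != value)


def runs_to_split(runs, baseline):
    """Map run_id to its changed variables for every run that changes more than one."""
    flagged = {}
    for run in runs:
        diff = changed_variables(run, baseline)
        if len(diff) > 1:
            flagged[run["run_id"]] = diff
    return flagged


print(runs_to_split(RUNS, BASELINE))
```

In this table only `r2` is flagged, because it changes both the component and the dataset at once. A benchmark-style plan would instead compare each run against a baseline on the same dataset, so the reference configuration should be chosen per comparison.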
Write:
Before finalizing, ask:
If the answer exposes a major weakness, update the design before execution.
Use references/report-template.md.
If saving to a project and no path is given, use:
docs/experiments/experiment_plan_YYYY-MM-DD_<short-name>.md
If working inside a code repo or code worktree created by init-python-project / new-workspace, prefer:
docs/reports/experiment_plan_YYYY-MM-DD_<short-name>.md
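The two default locations above differ only in the folder, so the path can be built from a date and a short name. This is a small sketch; the function name `experiment_plan_path` and the `in_code_repo` flag are hypothetical helpers, and deciding whether the project was created by init-python-project / new-workspace is left to the caller.

```python
from datetime import date
from pathlib import PurePosixPath


def experiment_plan_path(short_name, today, in_code_repo=False):
    """Build the default experiment-plan path for a given date and short name."""
    # docs/reports/ for code repos or worktrees, docs/experiments/ otherwise.
    folder = "docs/reports" if in_code_repo else "docs/experiments"
    # date.isoformat() yields YYYY-MM-DD, matching the naming pattern above.
    return PurePosixPath(folder) / f"experiment_plan_{today.isoformat()}_{short_name}.md"


print(experiment_plan_path("component-x-ablation", date(2025, 1, 31)))
print(experiment_plan_path("component-x-ablation", date(2025, 1, 31), in_code_repo=True))
```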
The final plan should be runnable by run-experiment and later reportable by experiment-report-writer.
If the project uses research-project-memory, update:
- memory/evidence-board.md: planned EVD-### items and EXP-### experiment families
- memory/provenance-board.md: planned source classes, expected CSV/report outputs, and aggregation requirements when known
- memory/claim-board.md: linked claims, marking planned, evidence-needed, or provisional claims honestly
- memory/risk-board.md: baseline, mechanism, metric, seed, compute, and reviewer risks exposed by the design
- memory/action-board.md: runnable next actions, including which experiment to launch first
- memory/handoff-board.md: create a ready handoff to run-experiment when the plan is runnable
- memory/phase-dashboard.md: update the active experiment-design or evidence-production gate
- .agent/worktree-status.md: experiment purpose and exit condition if a branch/worktree is involved

Use planned status for experiments that have not run. Do not record expected outcomes as observed evidence.
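The rule about planned status can be enforced with a small guard before anything is written to a board. The sketch below is an assumption-heavy illustration: the `EvidenceItem` class, its field names, and the reading of `EVD-###` as exactly three digits are all hypothetical and do not reproduce the actual research-project-memory file format.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumption: "EVD-###" means the prefix followed by exactly three digits.
EVIDENCE_ID = re.compile(r"EVD-\d{3}")


@dataclass
class EvidenceItem:
    """A hypothetical in-memory record for one evidence-board entry."""

    evidence_id: str
    claim: str
    status: str = "planned"
    observed_result: Optional[str] = None

    def __post_init__(self):
        if not EVIDENCE_ID.fullmatch(self.evidence_id):
            raise ValueError(f"unexpected evidence id: {self.evidence_id!r}")
        # An experiment that has not run must not carry a result, expected or otherwise.
        if self.status == "planned" and self.observed_result is not None:
            raise ValueError("planned evidence must not carry an observed result")


item = EvidenceItem(evidence_id="EVD-001", claim="component X improves metric M")
print(item.status)

try:
    EvidenceItem(evidence_id="EVD-002", claim="same claim", observed_result="expected gain")
except ValueError as error:
    print("rejected:", error)
```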
Before finalizing: