a-green-hand-jack/experiment-design-planner

Name: experiment-design-planner
Author: a-green-hand-jack

skills/experiment-design-planner/SKILL.md

npx skillsauth add a-green-hand-jack/ml-research-skills experiment-design-planner

Experiment Design Planner

Turn a research claim into an experiment plan that can actually answer it. This skill is for planning before running, not for reporting completed results.

Use this skill when:

a user is about to run a new experiment or ablation
a paper claim needs evidence
baselines, metrics, controls, or datasets are unclear
the user is changing too many variables at once
cluster/compute time should not be wasted on ambiguous runs
reviewer-proof evidence is needed before submission

Pair this skill with:

research-project-memory when the experiment plan should become project-level evidence, risk, and action memory
run-experiment after the design is ready to execute
experiment-report-writer after results exist
paper-reviewer-simulator to stress-test whether the evidence will satisfy reviewers
baseline-selection-audit before finalizing the experiment matrix when baseline choice, fairness, or reviewer-proof comparisons need deeper review
figure-results-review after plotted or tabulated results exist and need claim-support review

Skill Directory Layout

<installed-skill-dir>/
├── SKILL.md
└── references/
    ├── ablation-matrix.md
    ├── evidence-standards.md
    ├── metrics-and-controls.md
    └── report-template.md

Progressive Loading

Always read references/evidence-standards.md and references/metrics-and-controls.md.
Read references/ablation-matrix.md when the plan compares variants, components, baselines, hyperparameters, datasets, or model sizes.
Use references/report-template.md when saving or returning a substantial experiment plan.
If the target repo has memory/, update planned evidence, experiment families, risks, and actions using research-project-memory conventions.
If the experiment depends on current baselines, benchmarks, or leaderboard conventions, verify current sources with web search.

Core Principles

Start from the claim, not the command line.
State the hypothesis before running experiments.
Use a baseline before introducing a new method.
Change one variable at a time unless the experiment is explicitly factorial.
Define controls and nuisance variables before interpreting results.
Make negative results useful by defining falsification and fallback decisions.
Design the table or figure before running the experiment.
Stop conditions matter: decide what result is enough to move on.

Step 1 - Define the Claim and Question

Extract:

- paper or project claim
- research question
- target audience: internal debugging, advisor update, paper evidence, rebuttal, benchmark claim
- expected output: Markdown plan, LaTeX experiment section outline, run matrix, or saved file
- experiment mode:
  - single: one controlled comparison
  - ablation: component or variable isolation
  - benchmark: compare methods across datasets/tasks
  - theory: empirical support for a theoretical prediction
  - diagnostic: understand a failure mode or surprising result

Rewrite vague goals into testable questions:

Vague: Does our method work?
Testable: Does component X improve metric M over baseline B on datasets D1/D2 under the same training budget?
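
The vague-to-testable rewrite above can be treated as filling named slots. The following is a minimal sketch, not part of the skill itself: the class name `ResearchQuestion` and its fields are hypothetical, chosen to mirror the X/M/B/D placeholders in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResearchQuestion:
    """Hypothetical container for the slots of a testable question."""

    component: str   # X: the thing being changed
    metric: str      # M: how improvement is measured
    baseline: str    # B: what the change is compared against
    datasets: tuple  # D1, D2, ...: where the comparison is run
    budget: str      # the resource that is held fixed

    def render(self) -> str:
        # An empty slot means the goal is still vague, so refuse to render it.
        slots = (self.component, self.metric, self.baseline, self.budget)
        if not all(slots) or not self.datasets:
            raise ValueError("question is still vague: fill every slot")
        return (
            f"Does {self.component} improve {self.metric} over {self.baseline} "
            f"on datasets {'/'.join(self.datasets)} under the same {self.budget}?"
        )
```

Calling `render()` with the placeholders from the example reproduces the testable question word for word; leaving any slot empty raises instead of producing a half-specified question.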

Step 2 - State Hypotheses

Write:

primary hypothesis
alternative explanations
expected metric direction and rough effect size
falsification condition
decision rule

If the user cannot state a falsification condition, the experiment is not ready.
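
One way to enforce the readiness rule is to record the hypothesis as structured data and check it before anything is launched. This is a sketch under assumptions: the `Hypothesis` class and its field names are illustrative and do not come from the skill's reference files.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    """Illustrative record of the items listed in Step 2."""

    primary: str
    expected_direction: str  # "higher" or "lower" on the chosen metric
    expected_effect: str     # rough size, stated before any run
    falsification: str = ""  # what result would count against the hypothesis
    decision_rule: str = ""  # what to do for each plausible outcome
    alternatives: list = field(default_factory=list)

    def is_ready(self) -> bool:
        # Mirrors the rule above: no falsification condition, no experiment.
        return bool(self.falsification.strip())
```

A plan generator could refuse to emit a run matrix while `is_ready()` returns `False`.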

Step 3 - Define Evidence Standard

Read references/evidence-standards.md.

Decide what evidence is needed:

one table, one curve, one ablation, one qualitative example, one theorem-aligned diagnostic, or a benchmark suite
number of datasets/tasks
number of seeds or repeats
required baselines
acceptable variance
whether statistical testing or confidence intervals are needed
whether results must support a paper claim or only guide next steps
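
To make "number of seeds" and "acceptable variance" concrete, it can help to fix in advance how per-seed results will be summarized. The sketch below uses only the Python standard library. The function name `summarize_seeds` is hypothetical, and the interval is a normal approximation, which is optimistic for the small seed counts typical of ML experiments; a t-based interval would be wider.

```python
import statistics
from math import sqrt


def summarize_seeds(values, confidence=0.95):
    """Mean, sample standard deviation, and an approximate confidence interval."""
    if len(values) < 2:
        raise ValueError("need at least two seeds to estimate variance")
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    # Normal approximation: an assumption, not a recommendation of the skill.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * std / sqrt(len(values))
    return {
        "n": len(values),
        "mean": mean,
        "std": std,
        "ci_low": mean - half_width,
        "ci_high": mean + half_width,
    }
```

Deciding on this summary before running makes it harder to choose the aggregation after seeing which one favors the method.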

Step 4 - Choose Baselines and Controls

Identify:

primary baseline
strongest prior method or current SOTA, if relevant
simple baseline
ablation baseline
oracle or upper bound, if useful
controlled variables
nuisance variables

If no baseline exists, make the first experiment a baseline-establishment experiment.

Step 5 - Choose Metrics and Logging

Read references/metrics-and-controls.md.

For each metric, specify:

definition
direction
aggregation
split
variance reporting
failure interpretation
why it answers the question

Define required logging:

command
config path
git commit
dataset version
seed
hyperparameters
hardware/runtime
metrics
artifacts: tables, figures, checkpoints, logs
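
The logging list above maps naturally onto one JSON record per run. The following is a minimal sketch: the function names and key names are assumptions, and the only external command used is `git rev-parse HEAD`, with a fallback when git or a repository is unavailable.

```python
import json
import platform
import shlex
import subprocess
import sys


def current_git_commit():
    """Return the current commit hash, or "unknown" outside a git repo."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def build_run_record(config_path, dataset_version, seed, hyperparameters):
    """Collect the required logging fields into one JSON-serializable dict."""
    return {
        "command": shlex.join(sys.argv),
        "config_path": config_path,
        "git_commit": current_git_commit(),
        "dataset_version": dataset_version,
        "seed": seed,
        "hyperparameters": dict(hyperparameters),
        "hardware": {"machine": platform.machine(), "platform": platform.platform()},
        "runtime_seconds": None,  # to be filled in when the run finishes
        "metrics": {},            # to be filled in when the run finishes
        "artifacts": {"tables": [], "figures": [], "checkpoints": [], "logs": []},
    }
```

Writing the record with `json.dumps(record, indent=2)` next to the run's outputs keeps the command, commit, and seed attached to the results they produced.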

Step 6 - Build Run Matrix

Read references/ablation-matrix.md when there is more than one run.

Create a run table with:

run ID
changed variable
fixed controls
dataset/split
metric
seed/repeats
expected result
status
output path

Split experiments if a run changes more than one conceptual variable.
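
A run matrix that changes exactly one variable per run can be generated mechanically from a baseline configuration. The sketch below is illustrative: `build_run_matrix`, the run ID format, and the row keys are hypothetical and cover only some of the columns listed above.

```python
def build_run_matrix(base, variants, seeds):
    """One-variable-at-a-time run table: baseline rows, then single changes."""
    rows = []

    def add(changed_variable, config, seed):
        rows.append({
            "run_id": f"R{len(rows) + 1:03d}",
            "changed_variable": changed_variable,
            "config": config,
            "seed": seed,
            "status": "planned",  # nothing has run yet
        })

    # Baseline first, repeated for every seed.
    for seed in seeds:
        add("none (baseline)", dict(base), seed)

    # Each variant row differs from the baseline in exactly one key.
    for name, values in variants.items():
        if name not in base:
            raise KeyError(f"variant {name!r} has no baseline value")
        for value in values:
            if value == base[name]:
                continue  # already covered by the baseline rows
            for seed in seeds:
                add(name, {**base, name: value}, seed)
    return rows
```

For example, a baseline of `{"lr": 1e-3, "aug": False}` with variants `{"lr": [1e-3, 3e-3], "aug": [True]}` and two seeds yields six rows: two baseline runs, two runs changing only `lr`, and two changing only `aug`. A factorial design would need a different generator and should be declared as such.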

Step 7 - Define Stop Conditions and Next Decisions

Write:

what result is sufficient to support the claim
what result falsifies or weakens the claim
what result triggers another ablation
what result means stop and write/report
compute budget ceiling
deadline constraints
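
Stop conditions are easier to honor when they are written as an explicit rule before results arrive. The following is a sketch only: the function name, the threshold parameters, and the returned labels are all assumptions, and real decision rules will usually also account for variance across seeds.

```python
def next_decision(effect, support_at, falsify_at, budget_used, budget_ceiling):
    """Map an observed effect size to a pre-registered next step."""
    if support_at <= falsify_at:
        raise ValueError("support threshold must exceed falsification threshold")
    if effect >= support_at:
        return "stop_and_report"        # sufficient to support the claim
    if effect <= falsify_at:
        return "claim_weakened"         # falsifies or weakens the claim
    if budget_used >= budget_ceiling:
        return "stop_budget_exhausted"  # ambiguous, but no compute left
    return "run_another_ablation"       # ambiguous, and budget remains
```

The thresholds should come from the hypothesis stated in Step 2, not from the results themselves.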

Step 8 - Reviewer Risk Check

Before finalizing, ask:

Would a reviewer complain that the baseline is weak?
Is the comparison fair?
Are seeds/repeats enough?
Does the experiment isolate the claimed mechanism?
Are metrics aligned with the claim?
Is there a confounder that could explain the result?
Would a negative result still teach something?

If any answer exposes a major weakness, update the design before execution.

Step 9 - Write the Experiment Plan

Use references/report-template.md.

If saving to a project and no path is given, use:

docs/experiments/experiment_plan_YYYY-MM-DD_<short-name>.md

If working inside a code repo or code worktree created by init-python-project / new-workspace, prefer:

docs/reports/experiment_plan_YYYY-MM-DD_<short-name>.md

The final plan should be runnable by run-experiment and later reportable by experiment-report-writer.
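
The two default locations above differ only in the folder, so the path can be built by one small helper. This is a sketch: `plan_path` and its parameters are hypothetical, and lowercasing the short name into a hyphenated slug is an assumption, since the source only specifies `<short-name>`.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def plan_path(short_name, in_code_repo=False, today=None):
    """Build the default save path for an experiment plan."""
    day = (today or date.today()).isoformat()  # YYYY-MM-DD
    # Assumption: normalize the short name to lowercase words joined by hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", short_name.lower()).strip("-")
    if not slug:
        raise ValueError("short_name must contain at least one letter or digit")
    folder = "docs/reports" if in_code_repo else "docs/experiments"
    return PurePosixPath(folder) / f"experiment_plan_{day}_{slug}.md"
```

With `today=date(2026, 5, 5)`, `plan_path("LR Warmup")` gives `docs/experiments/experiment_plan_2026-05-05_lr-warmup.md`, and passing `in_code_repo=True` switches the folder to `docs/reports`.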

Step 10 - Write Back to Project Memory

If the project uses research-project-memory, update:

memory/evidence-board.md: planned EVD-### items and EXP-### experiment families
memory/provenance-board.md: planned source classes, expected CSV/report outputs, and aggregation requirements when known
memory/claim-board.md: linked claims, each honestly marked as planned, evidence-needed, or provisional
memory/risk-board.md: baseline, mechanism, metric, seed, compute, and reviewer risks exposed by the design
memory/action-board.md: runnable next actions, including which experiment to launch first
memory/handoff-board.md: create a ready handoff to run-experiment when the plan is runnable
memory/phase-dashboard.md: update the active experiment-design or evidence-production gate
relevant worktree .agent/worktree-status.md: experiment purpose and exit condition if a branch/worktree is involved

Use planned status for experiments that have not run. Do not record expected outcomes as observed evidence.

Final Sanity Check

Before finalizing:

claim and hypothesis are explicit
baseline is defined
independent variable is isolated
controls and nuisance variables are listed
metrics are tied to the question
run matrix is concrete
logging requirements are sufficient for reproduction
stop condition and decision rule are explicit
reviewer risks are stated
project memory is updated when the repo has memory/
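
The checklist above can also be run as a mechanical gate on a plan held as a dictionary. The sketch below is illustrative: the key names in `REQUIRED_ITEMS` are hypothetical and simply restate the checklist, and the project-memory item is left out because it applies only when the repo has memory/.

```python
REQUIRED_ITEMS = (
    "claim",
    "hypothesis",
    "baseline",
    "independent_variable",
    "controls",
    "nuisance_variables",
    "metrics",
    "run_matrix",
    "logging",
    "stop_condition",
    "decision_rule",
    "reviewer_risks",
)


def missing_items(plan):
    """Return checklist items that are absent or empty in the plan."""
    return [key for key in REQUIRED_ITEMS if not plan.get(key)]
```

An empty return value means every item is present; it says nothing about whether the content of each item is good, which still needs the reviewer risk check from Step 8.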

Design hypothesis-driven ML/AI experiments before running. Use for ablations, baselines, metrics, controls, seeds, logging, and claim-evidence matrices.

3 stars

data-ai

Updated May 5, 2026

Install this skill globally with one command:

npx skillsauth add a-green-hand-jack/ml-research-skills experiment-design-planner

Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: May 5, 2026, 4:24 AM (147.8s, 5 files scanned)

SKILL.md

name: experiment-design-planner
description: Design hypothesis-driven ML/AI experiments before running. Use for ablations, baselines, metrics, controls, seeds, logging, and claim-evidence matrices.
argument-hint: [project-dir] [--claim <claim>] [--mode single|ablation|benchmark|theory|diagnostic]
allowed-tools: Read, Write, Edit, Bash, Glob, WebSearch, WebFetch

Related Skills

a-green-hand-jack/ml-research-bootstrap (testing)
Bootstrap project-local ml-research-skills. Use from global installs when creating a new ML research project, enabling this collection in an existing ML research repo, or deciding whether to install the full bundle locally. Route to project-init for new projects; do not handle paper or experiment work directly.
Updated May 26, 2026

a-green-hand-jack/project-ops-router (development)
Route project operations tasks (git, memory, bootstrap, remote, workspace, code review, timeline, ops) to the correct skill. Use when the task involves commits, pushes, worktrees, project memory, enabling project-local skills, SSH/server coordination, sidecar runners, or audits. Do not solve the ops task directly.
Updated May 19, 2026

a-green-hand-jack/paper-writing-router (testing)
Route ML/AI paper writing tasks to the correct skill: contract planning, prose drafting, section writing, consistency editing, review simulation, rebuttal, submission, or citation work. Use when the task involves writing, revising, reviewing, or submitting a paper instead of guessing between paper-writing-assistant, paper-writing-contract-planner, paper-reviewer-simulator, auto-paper-improvement-loop, or citation skills. Do not draft prose directly.
Updated May 19, 2026

a-green-hand-jack/ml-research-router (data-ai)
Project-local router for ML research skill selection. Use inside an initialized ML research project, or while maintaining this skill repo, when the user describes an ML research/paper/experiment/discovery/ops/release workflow and may not know the skill; route to a domain router or high-signal leaf. Do not use for generic non-ML projects.
Updated May 19, 2026

Install manually

# Clone the repo
git clone https://github.com/a-green-hand-jack/ml-research-skills.git

# Copy into Claude Code skills folder (global)
cp -r ml-research-skills/skills/experiment-design-planner ~/.claude/skills/

Repository: a-green-hand-jack/ml-research-skills (3 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT