src/autoskillit/skills_extended/plan-experiment/SKILL.md
Convert a scope report into a structured experiment plan with hypothesis, variables, phases, and success criteria.
Transform a scope report into an experiment plan. The output is both a research design AND an implementation plan — it describes what is being tested, how to build the experiment infrastructure, and what to run. The plan is posted as a GitHub issue for human review before any compute is spent.
The plan must be specific and actionable: an implementer should be able to read
it and know exactly what files to create, what environment to set up, what
commands to run, and what results to collect. Everything is planned to live in
one self-contained folder under research/.
research recipe (phase 1)

`/autoskillit:plan-experiment {scope_report_path} [{revision_guidance}]`
{scope_report_path} — Absolute path to the scope report produced by /autoskillit:scope
(required). Scan tokens after the skill name for the first path-like token
(starts with /, ./, or .autoskillit/).
{revision_guidance} — Optional. Absolute path to revision guidance produced by
/autoskillit:review-design when verdict=REVISE. Scan for the second path-like token.
When absent or empty (first pass), proceed normally. When present, read it and
incorporate the feedback before writing the plan.
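The token scan described above can be sketched as follows. This is a minimal illustration, not the skill's actual parser: the function name `parse_plan_experiment_args` and the use of `shlex` for tokenizing are assumptions, while the three path prefixes come from the argument description.

```python
import shlex
from typing import Optional

# Prefixes that mark a token as "path-like", per the argument description.
PATH_PREFIXES = ("/", "./", ".autoskillit/")


def parse_plan_experiment_args(raw_args: str) -> tuple[str, Optional[str]]:
    """Return (scope_report_path, revision_guidance) from the text after the skill name.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    tokens = shlex.split(raw_args)
    # Keep only path-like tokens, preserving their order of appearance.
    paths = [t for t in tokens if t.startswith(PATH_PREFIXES)]
    if not paths:
        raise ValueError("scope_report_path is required but no path-like token was found")
    # First path-like token is the scope report; the second, if any, is revision guidance.
    revision_guidance = paths[1] if len(paths) > 1 else None
    return paths[0], revision_guidance
```

When the second element is `None`, the run is a first pass and no revision guidance is read.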
NEVER:
- `{{AUTOSKILLIT_TEMP}}/plan-experiment/` directory
- `# Experiment Plan:` heading; it always goes BEFORE
- (`run_in_background: true` is prohibited)

ALWAYS:
- `model: "sonnet"` when spawning all subagents via the Task tool
- `{{AUTOSKILLIT_TEMP}}/plan-experiment/` directory
- `research/YYYY-MM-DD-{slug}/` folder

Detect and read inputs:
{scope_report_path}. Extract:
{revision_guidance}. Extract all revision instructions — these take priority over
your initial analysis in Step 2. Note which sections of the plan need rework.
When absent or empty, omit this sub-step and proceed normally (first pass).

Launch subagents (model: "sonnet") to assess feasibility. The following are the minimum required; launch as many additional subagents as needed to fill information gaps and produce the best possible experiment plan.
Minimum subagents:
Subagent A — Measurement Feasibility:
Search the project for any existing metrics infrastructure: files that define canonical metric names, measurement dimensions, quality thresholds, or standardized assessment tooling (e.g., files named `metrics.*`, `benchmark.*`, `evaluation.*`, or any assessment/scoring module). If no dedicated metrics infrastructure exists, treat all dependent variables as "NEW". Cross-reference against the scope report's Metric Context section if present; if absent, proceed without it and note the gap. For each dependent variable in the research question: verify it has an existing measurement mechanism, or flag it as "NEW" requiring formula, unit, and threshold definition. Report what measurement infrastructure already exists vs. what needs to be built.
Subagent B — Data & Input Feasibility:
Assess what data the experiment needs to operate on. Can it be generated synthetically? Does it need to be constructed with specific properties? Are there existing datasets or fixtures that can be reused? What generators or construction scripts would need to be written?
Research Task Directive Compliance: When the research task directive or issue specifies using particular data (e.g., "use MERFISH data", "benchmark on real-world datasets"), the Data Manifest MUST include acquisition steps for that data. The plan must NOT assume the data will already be present, especially in worktrees where gitignored directories are empty. If the directive specifies data that requires download or generation, include the exact commands in the `acquisition` field.
Subagent C — Environment Assessment:
Determine whether the experiment can run with the project's existing toolchain, or whether it requires additional tools, libraries, or runtimes. If external tools are needed, research the correct package names and versions for a micromamba/conda environment.yml.
Additional subagents (launch as many as needed):
Produce a structured experiment plan. The plan has two halves: the research design (what and why) and the implementation plan (how to build it).
Choose a date-stamped slug for the experiment folder:
research/YYYY-MM-DD-{slug}/ where {slug} is a kebab-case summary of the
research topic (max 40 chars).
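One way to derive the folder name is sketched below. The helper name `experiment_folder` and the exact slugging rules (lowercasing, collapsing non-alphanumerics to hyphens) are assumptions; only the `research/YYYY-MM-DD-{slug}/` shape and the 40-character limit come from the text above.

```python
import re
from datetime import date


def experiment_folder(topic: str, day: date, max_len: int = 40) -> str:
    """Build research/YYYY-MM-DD-{slug}/ with a kebab-case slug of at most max_len chars.

    Hypothetical helper for illustration only.
    """
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    # Truncate, then drop any hyphen left dangling at the cut.
    slug = slug[:max_len].rstrip("-")
    return f"research/{day.isoformat()}-{slug}/"


print(experiment_folder("Cache Eviction: LRU vs. LFU", date(2025, 1, 15)))
# research/2025-01-15-cache-eviction-lru-vs-lfu/
```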
# Experiment Plan: {title}
## Motivation
{Why this experiment matters. What decision will its results inform?}
## Hypothesis
**Null hypothesis (H0):** {The default assumption — no effect, no difference}
**Alternative hypothesis (H1):** {The claim being tested — stated with a
measurable outcome}
## Independent Variables
{What is being varied}
| Variable | Values | Rationale |
|----------|--------|-----------|
| {var1} | {value_a, value_b} | {why these values} |
## Dependent Variables (Metrics)
{What is being measured}
| Metric | Unit | Collection Method | Canonical Name |
|--------|------|-------------------|----------------|
| {metric1} | {unit} | {how collected} | {canonical name in evaluation framework, or "NEW"} |
Canonical names must match entries in the project's evaluation framework (if one
exists). For any metric marked "NEW", include: formula, unit, threshold value, and
a note that it must be registered in whatever evaluation catalog the project uses
before the experiment is finalized.
## Controlled Variables
{What is held constant and how}
| Variable | Fixed Value | Rationale |
|----------|-------------|-----------|
| {var1} | {value} | {why fixed} |
## Inputs and Data
{What data the experiment operates on. The inputs determine what the
experiment can prove.}
- {What datasets are needed — existing, synthetic, or constructed?}
- {How will datasets be generated or obtained?}
- {What properties must the data have to be a valid test of the hypothesis?}
- {What range and diversity of inputs avoids narrow conclusions?}
| Dataset | Source | Properties | Purpose |
|---------|--------|------------|---------|
| {dataset1} | {generated/existing/external} | {key characteristics} | {what it tests} |
## Experiment Directory Layout
All experiment artifacts live in one self-contained folder:

    research/YYYY-MM-DD-{slug}/
    ├── Dockerfile          # Docker image spec; builds from environment.yml
    ├── Taskfile.yml        # build-env, run-experiment, test tasks
    ├── environment.yml     # Micromamba/conda env (required)
    ├── scripts/
    │   ├── {script_1}      # {description}
    │   └── ...
    ├── tests/              # Test suite for experiment scripts
    │   ├── conftest.py
    │   └── test_{script_1}.py
    ├── data/
    ├── results/
    └── report.md
{Describe each planned file and its purpose.}
## Environment
{Explicitly state one of:}
**Option A — No custom environment needed:**
{The project's existing toolchain is sufficient because {reason}. No
environment.yml will be created.}
Verify that `pytest` is available in the existing toolchain (`pytest --version`). If not, note that test_check requires pytest to pass.
**Option B — Custom environment required (standard for research experiments):**
{The experiment requires {tools/libraries} not available in the project toolchain.
An environment.yml will be created. The implement-experiment skill builds a Docker
image from this YAML — all experiment code runs inside the container. Nothing is
installed on the host.}
```yaml
name: {experiment-slug}
channels:
  - conda-forge
dependencies:
  - {package1}={version}
  - {package2}={version}
  - pytest  # Required: enables test_check to discover and run experiment tests
```
{Rationale for each dependency.}
- `research/YYYY-MM-DD-{slug}/` and subdirectories
- `environment.yml` (if needed) and build the environment
- `scripts/`
- `data/`
- `scripts/`
- `tests/` directory and `tests/conftest.py` with shared fixtures
- `scripts/`, create a corresponding `tests/test_{script_name}.py`
- `pytest --collect-only` in the research directory to verify test discovery

{Adapt phases as needed; experiments may not require every phase. Add or remove phases to match the specific experiment. Each phase should list the specific files to create and commands to run.}
{Step-by-step procedure for running the actual experiment after implementation is complete. Be specific about what commands to run, what data to collect, and in what order.}
## Analysis Plan
{How to interpret the results. Include statistical analysis if relevant to the experiment type; not all experiments require it. Describe what patterns or outcomes would support or refute the hypothesis.}
## Success Criteria
{Explicit, measurable conditions that answer the research question}
{Confounds that could invalidate results within this experiment}
{Limits on generalizability beyond the test conditions}
{Approximate compute time, disk space, dependencies needed}
### Step 3a — Extract YAML Frontmatter
After writing the prose plan, extract structured metadata and write the
complete experiment plan file with YAML frontmatter prepended before the
`# Experiment Plan:` heading. The final file layout is:

    ---
    experiment_type: {one of: benchmark, configuration_study, causal_inference, robustness_audit, exploratory}

    estimand:
      treatment: "{the intervention}"  # RECOMMENDED; required when causal_inference
      outcome: "{the measured effect}"
      population: "{scope of units}"
      contrast: "{A vs B vs C}"  # REQUIRED for causal_inference

    hypothesis_h0: "{null hypothesis with measurable threshold}"  # REQUIRED
    hypothesis_h1: "{alt hypothesis with measurable threshold}"  # REQUIRED

    metrics:  # REQUIRED, min 1

    baselines:  # REQUIRED for benchmark/causal_inference

    statistical_plan:  # REQUIRED unless exploratory
      test: "{primary statistical test}"
      alpha: 0.05
      power_target: 0.80
      correction_method: "Holm-Bonferroni"  # null | Bonferroni | Holm-Bonferroni | BH
      sample_size_justification: "{why N is sufficient}"
      min_detectable_effect: "{MDE in metric units}"

    environment:  # REQUIRED
      type: "custom"  # standard | custom
      spec_path: "research/{slug}/environment.yml"  # required when type=custom

    success_criteria:  # REQUIRED
      conclusive_positive: "{conditions supporting H1, referencing metrics}"
      conclusive_negative: "{conditions supporting H0}"
      inconclusive: "{conditions where no conclusion can be drawn}"

    data_manifest:  # REQUIRED: one entry per hypothesis (or shared)
    ---

    ...prose sections unchanged...
Use this prose section ↔ frontmatter mapping to extract fields:
| Prose Section | Frontmatter Field(s) |
|---------------|---------------------|
| `## Hypothesis` (H0/H1 bold labels) | `hypothesis_h0`, `hypothesis_h1`, `estimand` |
| `## Independent Variables` table | `estimand.contrast`, `baselines[]` |
| `## Dependent Variables (Metrics)` table | `metrics[]` |
| `## Environment` | `environment` |
| `## Analysis Plan` | `statistical_plan` |
| `## Success Criteria` | `success_criteria` |
| `## Experiment Directory Layout` | `experiment_slug` |
| `## Inputs and Data` | `data_manifest[]` |
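As an illustration of the first mapping row, the sketch below pulls `hypothesis_h0` and `hypothesis_h1` out of the prose by matching the bold H0/H1 labels from the plan template. The function name and the regular expressions are assumptions made for this example; the skill itself does not prescribe an extraction mechanism.

```python
import re

# Capture the text after each bold label, up to the next bold label, heading, or end of text.
_STOP = r"(?=\n\*\*|\n## |\Z)"
H0_RE = re.compile(r"\*\*Null hypothesis \(H0\):\*\*\s*(.+?)" + _STOP, re.S)
H1_RE = re.compile(r"\*\*Alternative hypothesis \(H1\):\*\*\s*(.+?)" + _STOP, re.S)


def extract_hypotheses(plan_markdown: str) -> dict[str, str]:
    """Return the hypothesis frontmatter fields found in the prose plan (hypothetical helper)."""
    fields = {}
    for key, pattern in (("hypothesis_h0", H0_RE), ("hypothesis_h1", H1_RE)):
        match = pattern.search(plan_markdown)
        if match:
            # Join wrapped lines into a single string suitable for a YAML scalar.
            fields[key] = " ".join(match.group(1).split())
    return fields
```

A missing key in the result signals that the corresponding label was not found, which would leave a required field empty.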
### data_manifest (required)
A list of data source entries, one per hypothesis (or shared across hypotheses). Each entry:
**Field definitions:**
| Field | Required | Description |
|-------|----------|-------------|
| `hypothesis` | yes | List of hypothesis IDs that consume this data |
| `source_type` | yes | One of: `synthetic`, `fixture`, `external`, `gitignored` |
| `description` | yes | Human-readable description of the data |
| `acquisition` | yes | Exact command or method to produce/retrieve the data |
| `location` | no | Filesystem path where data will reside (null for in-script) |
| `verification` | yes | How to confirm the data is present and valid |
| `depends_on` | no | Prerequisite acquisition step (e.g., download before subset) |
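The field table can be read as a small schema. The sketch below checks one entry against it; the helper name `manifest_entry_problems` is an assumption, and the example entry is entirely hypothetical (its hypothesis ID, command, and paths are placeholders, not values from any real plan).

```python
# Required fields and allowed source types, taken from the field definitions table.
REQUIRED_FIELDS = ("hypothesis", "source_type", "description", "acquisition", "verification")
SOURCE_TYPES = {"synthetic", "fixture", "external", "gitignored"}


def manifest_entry_problems(entry: dict) -> list[str]:
    """List schema problems for one data_manifest entry (hypothetical helper)."""
    problems = [f"missing required field: {f}" for f in REQUIRED_FIELDS if not entry.get(f)]
    source_type = entry.get("source_type")
    if source_type and source_type not in SOURCE_TYPES:
        problems.append(f"unknown source_type: {source_type}")
    return problems


# Hypothetical entry; every value here is a placeholder for illustration.
example_entry = {
    "hypothesis": ["H1"],
    "source_type": "synthetic",
    "description": "Randomly generated input records",
    "acquisition": "python scripts/generate_data.py --seed 0",
    "location": "research/YYYY-MM-DD-{slug}/data/",
    "verification": "data/ contains the generated files",
    "depends_on": None,
}

print(manifest_entry_problems(example_entry))  # []
```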
Apply these validation rules in order before writing the frontmatter:
V1: benchmark/causal_inference → len(baselines) >= 1 AND each baseline.version not empty
ERROR: "Benchmark/causal_inference experiments require at least one named baseline with a version"

V2: causal_inference → estimand.contrast is not null
ERROR: "causal_inference requires estimand with treatment, outcome, and contrast fields"

V3: !exploratory → statistical_plan present AND test not null
ERROR: "Non-exploratory experiments require a statistical_plan; use {test: 'none'} to waive"

V4: environment.type=custom → spec_path not null
ERROR: "Custom environment requires spec_path pointing to environment.yml"

V5: len(metrics) >= 2 → exactly one metric has primary: true
WARNING: "Multiple metrics but no primary designated; H1 threshold ambiguous"

V6: any metric.canonical_name = "NEW"
WARNING: "Plan includes NEW metrics not yet in any registered evaluation framework"

V7: hypothesis_h1 has no numeric threshold
WARNING: "H1 should include a measurable numeric threshold"

V8: success_criteria.conclusive_positive should reference at least one metric.name
WARNING: "Success criteria does not reference any declared metric"
V9: data_manifest completeness
ERROR if:
- Any hypothesis referenced in success_criteria has no entry in data_manifest
- Any entry with source_type: external or source_type: gitignored lacks a non-null location
- Any entry with source_type: external lacks a depends_on or explicit download command in acquisition
ERROR: "Data Manifest incomplete: {specific missing field or hypothesis}"
- ERRORs (V1–V4, V9): Stop frontmatter generation, append the error message to the plan
prose under a `## Frontmatter Validation Errors` section, and save the plan WITHOUT
a frontmatter block. Emit the `experiment_plan` token as usual.
- WARNINGs (V5–V8): Continue; log each as a `# WARNING: ...` YAML comment on the
relevant field line.
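The rules V1 through V8 could be checked along these lines. This is a sketch under assumptions, not the skill's implementation: the frontmatter is assumed to be already loaded into a plain `dict`, the function name is hypothetical, "numeric threshold" in V7 is approximated as "contains a digit", and V9 is left out for brevity.

```python
import re


def validate_frontmatter(plan: dict) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for rules V1-V8, applied in order."""
    errors, warnings = [], []
    exp_type = plan.get("experiment_type")
    metrics = plan.get("metrics") or []

    # V1: benchmark/causal_inference need at least one baseline, each with a version.
    if exp_type in ("benchmark", "causal_inference"):
        baselines = plan.get("baselines") or []
        if not baselines or not all(b.get("version") for b in baselines):
            errors.append("Benchmark/causal_inference experiments require at least one named baseline with a version")
    # V2: causal_inference needs estimand.contrast.
    if exp_type == "causal_inference" and not (plan.get("estimand") or {}).get("contrast"):
        errors.append("causal_inference requires estimand with treatment, outcome, and contrast fields")
    # V3: every non-exploratory type needs statistical_plan.test.
    if exp_type != "exploratory" and not (plan.get("statistical_plan") or {}).get("test"):
        errors.append("Non-exploratory experiments require a statistical_plan; use {test: 'none'} to waive")
    # V4: a custom environment needs spec_path.
    env = plan.get("environment") or {}
    if env.get("type") == "custom" and not env.get("spec_path"):
        errors.append("Custom environment requires spec_path pointing to environment.yml")

    # V5: with two or more metrics, exactly one must be primary.
    if len(metrics) >= 2 and sum(1 for m in metrics if m.get("primary")) != 1:
        warnings.append("Multiple metrics but no primary designated; H1 threshold ambiguous")
    # V6: any metric still marked NEW.
    if any(m.get("canonical_name") == "NEW" for m in metrics):
        warnings.append("Plan includes NEW metrics not yet in any registered evaluation framework")
    # V7: approximated as "H1 contains at least one digit".
    if not re.search(r"\d", plan.get("hypothesis_h1") or ""):
        warnings.append("H1 should include a measurable numeric threshold")
    # V8: conclusive_positive should mention at least one declared metric name.
    positive = (plan.get("success_criteria") or {}).get("conclusive_positive") or ""
    if not any(m.get("name") and m["name"] in positive for m in metrics):
        warnings.append("Success criteria does not reference any declared metric")

    return errors, warnings
```

A non-empty `errors` list corresponds to the case where the plan is saved without a frontmatter block; `warnings` would be written as YAML comments.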
Field requirements by experiment type:
| Field | benchmark | config_study | causal_inference | robustness_audit | exploratory |
|-------|-----------|-------------|-----------------|-----------------|-------------|
| experiment_type | required | required | required | required | required |
| estimand | recommended | recommended | **required** | recommended | optional |
| hypothesis_h0/h1 | required | required | required | required | required |
| metrics | required | required | required | required | required |
| baselines | **required** | optional | **required** | optional | optional |
| statistical_plan | required | required | required | required | **waived** |
| environment | required | required | required | required | required |
| success_criteria | required | required | required | required | required |
| data_manifest | required | required | required | required | required |
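The table above can also be expressed as a lookup, which makes the per-type differences easy to see. The sketch assumes the `config_study` column corresponds to the `configuration_study` experiment type, and the helper name is hypothetical; "recommended" and "optional" cells are deliberately not enforced.

```python
# Fields marked "required" in every column of the table.
ALWAYS_REQUIRED = (
    "experiment_type", "hypothesis_h0", "hypothesis_h1", "metrics",
    "environment", "success_criteria", "data_manifest",
)
# Additional fields marked "required" for each experiment type.
EXTRA_REQUIRED = {
    "benchmark": ("baselines", "statistical_plan"),
    "configuration_study": ("statistical_plan",),  # assumed to match the config_study column
    "causal_inference": ("estimand", "baselines", "statistical_plan"),
    "robustness_audit": ("statistical_plan",),
    "exploratory": (),  # statistical_plan is waived
}


def missing_required_fields(plan: dict) -> list[str]:
    """Return required fields that are absent or empty for the plan's experiment type."""
    exp_type = plan.get("experiment_type")
    if exp_type not in EXTRA_REQUIRED:
        return ["experiment_type"]
    required = ALWAYS_REQUIRED + EXTRA_REQUIRED[exp_type]
    return [field for field in required if not plan.get(field)]
```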
### Step 4 — Write Output
Save the experiment plan to:
`{{AUTOSKILLIT_TEMP}}/plan-experiment/experiment_plan_{topic}_{YYYY-MM-DD_HHMMSS}.md` (relative to the current working directory)
After saving, emit the structured output token as the very last line of your
text output:
> **IMPORTANT:** Emit the structured output tokens as **literal plain text with no
> markdown formatting on the token names**. Do not wrap token names in `**bold**`,
> `*italic*`, or any other markdown. The adjudicator performs a regex match on the
> exact token name — decorators cause match failure.
experiment_plan = {absolute_path_to_experiment_plan}
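A sketch of the output naming and token line follows. The helper names are hypothetical, `temp_dir` stands in for whatever `{{AUTOSKILLIT_TEMP}}` expands to, and `TOKEN_RE` is only an assumed stand-in for the adjudicator's regex, used here to show why markdown decoration on the token name would fail to match.

```python
import re
from datetime import datetime
from pathlib import Path


def plan_output_path(temp_dir: str, topic: str, now: datetime) -> Path:
    """Build experiment_plan_{topic}_{YYYY-MM-DD_HHMMSS}.md under plan-experiment/."""
    stamp = now.strftime("%Y-%m-%d_%H%M%S")
    return Path(temp_dir) / "plan-experiment" / f"experiment_plan_{topic}_{stamp}.md"


def output_token(plan_path: Path) -> str:
    """Format the final line of output as plain text with an absolute path."""
    return f"experiment_plan = {plan_path.absolute()}"


# Assumed pattern for illustration; the real adjudicator regex is not shown in this document.
TOKEN_RE = re.compile(r"^experiment_plan = (\S+)$")

print(bool(TOKEN_RE.match("experiment_plan = /work/plan.md")))      # True
print(bool(TOKEN_RE.match("**experiment_plan** = /work/plan.md")))  # False
```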