src/autoskillit/skills_extended/run-experiment/SKILL.md
Execute a designed experiment in a worktree and collect structured results. Supports --adjust retry mode.
Execute an experiment that has been implemented in a worktree. This skill runs whatever the experiment requires — scripts, benchmarks, custom tooling, manual procedures, data collection, or any combination. It collects results and produces a structured results file.
The nature of the experiment is entirely determined by the experiment plan. This skill does NOT prescribe how experiments should be run — it reads the plan, executes what the plan describes, and reports what happened.
- Invoked by the research recipe (phase 2)
- After /autoskillit:implement-worktree-no-merge has set up experiment code
- When the --adjust flag is passed, re-run with modified approach after a failure

/autoskillit:run-experiment {worktree_path} [--adjust]
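The argument handling for this invocation (first path-like token as the worktree path, plus an optional --adjust flag) could be sketched roughly as follows. This is only an illustration: `parse_run_args` and `RunArgs` are hypothetical names, and splitting the raw argument string with `shlex` is an assumption about how tokens are delimited.

```python
import shlex
from typing import NamedTuple, Optional

# Prefixes that mark a token as path-like, per the argument description.
PATH_PREFIXES = ("/", "./", ".autoskillit/")


class RunArgs(NamedTuple):
    worktree_path: Optional[str]  # first path-like token, or None if absent
    adjust: bool                  # True when --adjust was passed


def parse_run_args(raw: str) -> RunArgs:
    """Scan the raw argument string for the worktree path and --adjust."""
    tokens = shlex.split(raw)
    adjust = "--adjust" in tokens
    # str.startswith accepts a tuple of prefixes.
    worktree = next((t for t in tokens if t.startswith(PATH_PREFIXES)), None)
    return RunArgs(worktree, adjust)
```

A missing worktree path comes back as `None`, so the caller can decide how to report the missing required argument.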
- {worktree_path}: Absolute path to the worktree containing experiment code (required). Scan tokens for the first path-like token (starts with /, ./, or .autoskillit/).
- --adjust: Optional flag indicating this is a retry after a previous failure. When present, read the previous results/errors from {{AUTOSKILLIT_TEMP}}/run-experiment/ and adjust the approach before re-running.

NEVER:
- Commit files under {{AUTOSKILLIT_TEMP}}/: this directory is gitignored working space, NOT for version control. Do not use git add -f or git add --force to bypass the gitignore.
- Run subagents in the background (run_in_background: true is prohibited)

ALWAYS:
- Use model: "sonnet" when spawning all subagents via the Task tool
- Save results to {{AUTOSKILLIT_TEMP}}/run-experiment/ in the worktree (disk only, never committed)
- Report failures as they are, leaving it to an --adjust retry to fix them

When context is exhausted mid-execution, experiment results may be partially written to {{AUTOSKILLIT_TEMP}}/run-experiment/. The recipe routes to on_context_limit, abandoning the partial experiment run.
Before emitting structured output tokens:
- Emit experiment_results = (empty) as a fallback
- The on_context_limit route handles the partial state; the downstream --adjust retry can restart the experiment from scratch

Read the experiment plan from {{AUTOSKILLIT_TEMP}}/experiment-plan.md in the worktree (or the project root, checking both locations). This was saved by the recipe's save_experiment_plan step from the approved GitHub issue.
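A minimal sketch of the two-location lookup, assuming the resolved value of {{AUTOSKILLIT_TEMP}} is passed in as a relative directory name. `find_experiment_plan` and its parameters are hypothetical, and checking the worktree before the project root is an assumption about precedence.

```python
from pathlib import Path
from typing import Optional


def find_experiment_plan(worktree: Path, project_root: Path, temp_dir: str) -> Optional[Path]:
    """Return the first existing experiment-plan.md, checking the worktree first."""
    for base in (worktree, project_root):
        candidate = base / temp_dir / "experiment-plan.md"
        if candidate.is_file():
            return candidate
    # Neither location has a plan; the caller decides how to report this.
    return None
```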
Also scan the worktree for experiment-related files:
- Files created by implement-worktree-no-merge

Understand what the experiment requires before attempting to run anything.
Before running the experiment:
- If the --adjust flag is set, read previous results from {{AUTOSKILLIT_TEMP}}/run-experiment/ and identify what went wrong.

Launch subagents (model: "sonnet") if needed to investigate the experiment setup, resolve dependencies, or research how to use specific tools mentioned in the plan.
Before executing any hypothesis:
Read the Data Manifest from the experiment plan's YAML frontmatter
(data_manifest field). If no frontmatter or no data_manifest field exists,
log a warning and proceed with best-effort artifact checks.
For each data_manifest entry, verify:
- If location is specified: the path exists and is non-empty
- If verification criteria are specified: evaluate them (e.g., file count, size)
- If acquisition command is specified and data is missing: attempt to run the acquisition command. If it fails, mark the entry as BLOCKED.

Produce a data readiness table:
| Hypothesis | Source Type | Location | Status |
|------------|-------------|----------|--------|
| H1, H2 | synthetic | in-script | READY |
| H5 | external | temp/merfish_100k/ | BLOCKED — directory empty |
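The location check and the readiness table could be produced along these lines. This is a sketch under assumptions: manifest entries are taken to be already-parsed dicts, the `hypotheses` and `source_type` keys are hypothetical field names, and a location of `in-script` is assumed to mean there is nothing on disk to check. Verification criteria and acquisition commands are left out.

```python
from pathlib import Path


def check_entry(entry: dict, root: Path) -> str:
    """Return READY or a BLOCKED reason for one data_manifest entry."""
    location = entry.get("location")
    if not location or location == "in-script":
        return "READY"  # assumption: no on-disk artifact to verify
    path = root / location
    if not path.exists():
        return "BLOCKED (path missing)"
    if path.is_dir() and not any(path.iterdir()):
        return "BLOCKED (directory empty)"
    if path.is_file() and path.stat().st_size == 0:
        return "BLOCKED (file empty)"
    return "READY"


def readiness_table(entries: list[dict], root: Path) -> str:
    """Render the data readiness table as markdown."""
    rows = [
        "| Hypothesis | Source Type | Location | Status |",
        "|------------|-------------|----------|--------|",
    ]
    for e in entries:
        hyps = ", ".join(e["hypotheses"])
        rows.append(
            f"| {hyps} | {e['source_type']} | {e.get('location', '')} | {check_entry(e, root)} |"
        )
    return "\n".join(rows)
```

Any entry whose status starts with BLOCKED would then feed the gate described next.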
If any entry the plan said would be acquired is BLOCKED:
- Emit blocked_hypotheses listing all blocked entries
- Set ## Status to FAILED

This replaces the current behavior of silently marking missing-data hypotheses as N/A. When the plan declared acquisition steps for data and those steps did not produce the data, this is a pipeline failure, not a hypothesis-level degradation.
Read env_mode from context (set by setup-environment earlier in the
recipe). Dispatch execution based on the mode:
env_mode = docker:
The Docker image research-{slug} was pre-built by setup-environment.
Execute the experiment inside the container:
RESEARCH_DIR=$(ls -d "${WORKTREE_PATH}"/research/*/ 2>/dev/null | head -1)
SLUG=$(basename "${RESEARCH_DIR%/}")
docker run --rm -v "${RESEARCH_DIR}:/workspace" "research-${SLUG}" \
bash -c "cd /workspace && python scripts/run.py"
Adjust the entry-point command to match the actual script from the experiment
plan. If the research directory contains a Taskfile.yml with a
run-experiment task, prefer task run-experiment inside the container.
env_mode = micromamba-host:
A host micromamba environment experiment-{slug} was created by
setup-environment. Execute the experiment inside that environment:
RESEARCH_DIR=$(ls -d "${WORKTREE_PATH}"/research/*/ 2>/dev/null | head -1)
SLUG=$(basename "${RESEARCH_DIR%/}")
cd "${RESEARCH_DIR}"
micromamba run -n "experiment-${SLUG}" python scripts/run.py
Adjust the entry-point command to match the actual script from the experiment plan.
env_mode = unavailable:
No suitable environment could be provisioned. Emit the blocked_experiment
structured output token and set the results status to FAILED:
blocked_experiment = env_mode is unavailable — setup-environment could not provision docker or micromamba-host
Write a results file with ## Status: FAILED and the reason, then proceed
to Step 5 (Save Results) to emit the results_path token.
env_mode = none:
Standard environment — no container or micromamba needed. Run the experiment directly in the worktree using the system Python:
RESEARCH_DIR=$(ls -d "${WORKTREE_PATH}"/research/*/ 2>/dev/null | head -1)
cd "${RESEARCH_DIR}" && python scripts/run.py
If the plan specifies multiple configurations or comparisons, execute all of them under the dispatched environment mode and collect results for each.
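The four env_mode branches above can be summarized as one dispatch function. The following is a sketch, not the skill's actual implementation: `build_command` is a hypothetical helper that only builds an argv list mirroring the shell commands shown earlier, and the default entry point `python scripts/run.py` should be replaced by whatever the experiment plan specifies.

```python
from pathlib import Path


def build_command(env_mode: str, research_dir: Path,
                  entry: tuple = ("python", "scripts/run.py")) -> list:
    """Map env_mode to an argv list.

    The caller runs it, e.g. subprocess.run(cmd, cwd=research_dir) for the
    micromamba-host and none modes, which expect to start in research_dir.
    """
    slug = research_dir.name  # e.g. research/<slug>/ -> <slug>
    if env_mode == "docker":
        inner = "cd /workspace && " + " ".join(entry)
        return ["docker", "run", "--rm", "-v", f"{research_dir}:/workspace",
                f"research-{slug}", "bash", "-c", inner]
    if env_mode == "micromamba-host":
        return ["micromamba", "run", "-n", f"experiment-{slug}", *entry]
    if env_mode == "none":
        return list(entry)
    if env_mode == "unavailable":
        # No command to run: emit blocked_experiment and mark results FAILED.
        raise RuntimeError("env_mode is unavailable")
    raise ValueError(f"unknown env_mode: {env_mode!r}")
```

Returning an argv list rather than a shell string keeps quoting of the research directory out of the picture, except for the inner `bash -c` string required by the Docker branch.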
Structure the results as a markdown file:
# Experiment Results: {title}
## Run Metadata
- Date: {YYYY-MM-DD HH:MM:SS}
- Worktree: {worktree_path}
- Commit: {git rev-parse HEAD}
- Environment: {relevant version info}
## Configuration
{Parameters used for this run — from the experiment plan}
## Results
{Present the data collected. Use tables, code blocks, or whatever format
best represents the measurements. Include raw data when feasible.}
## Observations
{Notable patterns, anomalies, unexpected behaviors, anything worth noting}
## Recommendation
{Based on the evidence collected, what does this suggest? This is the
experimenter's interpretation — the generate-report skill will synthesize
the final conclusions.}
## Status
{One of: CONCLUSIVE_POSITIVE | CONCLUSIVE_NEGATIVE | INCONCLUSIVE | FAILED}
{Brief justification for the status}
- Save the results to {{AUTOSKILLIT_TEMP}}/run-experiment/results_{topic}_{YYYY-MM-DD_HHMMSS}.md (relative to the current working directory) within the worktree.
- Do NOT git add or commit files under {{AUTOSKILLIT_TEMP}}/. This directory is gitignored working space. The files persist on the worktree filesystem for generate-report to read. Final results are published to research/ by the generate-report skill.

After saving, emit the structured output token as the very last line of your text output:
IMPORTANT: Emit the structured output tokens as literal plain text with no markdown formatting on the token names. Do not wrap token names in **bold**, *italic*, or any other markdown. The adjudicator performs a regex match on the exact token name, so decorators cause match failure.
results_path = {absolute_path_to_results_file}
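As an illustration of the naming scheme and the token line, the sketch below builds the results filename, writes the file, and returns the `results_path` line. `results_file_path` and `save_results` are hypothetical helpers, and `temp_dir` stands in for the resolved {{AUTOSKILLIT_TEMP}} directory.

```python
from datetime import datetime
from pathlib import Path


def results_file_path(temp_dir: Path, topic: str, now: datetime) -> Path:
    """Build results_{topic}_{YYYY-MM-DD_HHMMSS}.md under run-experiment/."""
    stamp = now.strftime("%Y-%m-%d_%H%M%S")
    return temp_dir / "run-experiment" / f"results_{topic}_{stamp}.md"


def save_results(temp_dir: Path, topic: str, body: str, now: datetime) -> str:
    """Write the results markdown to disk and return the results_path token line."""
    path = results_file_path(temp_dir, topic, now)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(body, encoding="utf-8")
    # The token carries an absolute path, printed as plain text.
    return f"results_path = {path.resolve()}"
```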
When pre-flight blocks hypotheses due to missing planned data:
blocked_hypotheses = H5: MERFISH data missing at temp/merfish_100k/ (acquisition: generate_merfish_subset.py --n 100000)
This token is emitted ONLY when the pre-flight gate fails due to data declared in the Data Manifest being inaccessible. It is NOT emitted during normal execution.
When blocked_hypotheses is emitted, results_path still points to the results file
with ## Status: FAILED.
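One way to guard against decorated token names is to build every token line through a single helper that rejects anything other than a bare name. This is a sketch only: the adjudicator's actual regex is not given here, so `TOKEN_LINE` is an assumed pattern, and joining multiple blocked entries with `; ` is likewise an assumption.

```python
import re

# Assumption: a token line is "name = value" with a bare lowercase identifier.
TOKEN_LINE = re.compile(r"^[a-z_]+ = .+$")


def format_token(name: str, value: str) -> str:
    """Return a plain-text token line, rejecting markdown-decorated names."""
    line = f"{name} = {value}"
    if not TOKEN_LINE.match(line):
        raise ValueError(f"token line is not plain text: {line!r}")
    return line


def blocked_hypotheses_token(blocked: dict) -> str:
    """blocked maps hypothesis id -> reason for the block."""
    value = "; ".join(f"{h}: {reason}" for h, reason in blocked.items())
    return format_token("blocked_hypotheses", value)
```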
When --adjust is passed, this is a retry after a previous execution failed.
- Read the previous results and errors from {{AUTOSKILLIT_TEMP}}/run-experiment/ in the worktree

Do NOT redesign the entire experiment. Make minimal adjustments to address the specific failure. If the experiment design itself is fundamentally flawed, return a FAILED status so the recipe can escalate.
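For a retry, locating the previous run's results and reading its status might look like the sketch below. `latest_results` and `extract_status` are hypothetical helpers; picking the most recently modified file is an assumption about which earlier run counts as "previous".

```python
import re
from pathlib import Path
from typing import Optional

STATUSES = ("CONCLUSIVE_POSITIVE", "CONCLUSIVE_NEGATIVE", "INCONCLUSIVE", "FAILED")


def latest_results(run_dir: Path) -> Optional[Path]:
    """Return the most recently modified results_*.md, or None if there is none."""
    files = sorted(run_dir.glob("results_*.md"), key=lambda p: p.stat().st_mtime)
    return files[-1] if files else None


def extract_status(markdown: str) -> Optional[str]:
    """Find the first known status keyword after a '## Status' heading."""
    match = re.search(r"^## Status\b(.*)", markdown, flags=re.MULTILINE | re.DOTALL)
    if not match:
        return None
    found = re.search("|".join(STATUSES), match.group(1))
    return found.group(0) if found else None
```

The status lookup accepts both the template form (status on the line after the heading) and the inline `## Status: FAILED` form.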