skills/autoresearch/SKILL.md
Run a rigorous autonomous experiment loop for any optimization target using explicit hypotheses, repeated trials, structured experiment logs, and local HTML reports. Use when asked to "run autoresearch", "optimize X in a loop", "start experiments", or "improve this with benchmark-driven iteration".
npx skillsauth add ckorhonen/claude-skills autoresearch

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Autonomous optimization is only useful when it behaves like disciplined research, not benchmark gambling.
This skill runs a strict experiment loop:
If the request is vague, stop and ask a short up-front Q&A before starting. Do not silently invent the target workload, correctness bar, or tradeoffs.
State the proposed change and why it should help before making it. Avoid bundles of unrelated tweaks.
Every experiment must be enumerated in a machine-readable ledger, including discarded ideas, crashes, and failed checks.
Do not keep a change because of one fast run. Use warmups, repeated measurements, and an explicit decision rule.
Reports, CSVs, JSONL ledgers, and graphs belong in a local .autoresearch/ directory and should not be committed.
- scripts/init_experiment.py: initialize .autoresearch/session.json, ensure .autoresearch/ stays untracked, and scaffold autoresearch.md when needed.
- scripts/run_experiment.py: run warmups and measured trials, parse METRIC lines, run optional checks, and emit a JSON experiment record.
- scripts/log_experiment.py: append an experiment to .autoresearch/results.jsonl, decide keep vs discard, and refresh CSV and HTML artifacts.
- scripts/render_report.py: regenerate .autoresearch/results.csv and .autoresearch/report.html from the JSONL ledger.

All scripts are non-interactive, expose --help, emit structured JSON on stdout, and keep diagnostics on stderr.
Initialize the session after the up-front Q&A:
python3 scripts/init_experiment.py \
--goal "Reduce latency in hot path" \
--metric-name latency_ms \
--unit ms \
--direction lower \
--command ./autoresearch.sh \
--checks-command ./autoresearch.checks.sh \
--scope src/hot_path.ts
Record the baseline:
# Option A: save to file first, then log
python3 scripts/run_experiment.py \
--id baseline \
--hypothesis "Control run" \
--change-summary "No code changes" \
--baseline \
--output .autoresearch/baseline.json
python3 scripts/log_experiment.py --input .autoresearch/baseline.json
# Option B: pipe directly
python3 scripts/run_experiment.py \
--id baseline \
--hypothesis "Control run" \
--change-summary "No code changes" \
--baseline \
| python3 scripts/log_experiment.py
Record each candidate experiment:
python3 scripts/run_experiment.py \
--id exp-001 \
--hypothesis "Inlining removes allocation churn" \
--change-summary "Inline helper and pre-size buffer" \
--output .autoresearch/exp-001.json
python3 scripts/log_experiment.py --input .autoresearch/exp-001.json
Re-render artifacts on demand:
python3 scripts/render_report.py
Before starting, gather or confirm all of the following. If any item is vague or missing, ask focused questions first.
- 1% if the user does not specify

If the user says "just run with it," write the assumptions explicitly into autoresearch.md before the first experiment, then encode them via python3 scripts/init_experiment.py ....
Prefer a dedicated worktree on a fresh branch:
git worktree add ../autoresearch-<goal>-<date> -b autoresearch/<goal>-<date>
If a worktree is not practical, use a new branch in a clean working tree. Do not run this skill in a dirty tree with unrelated user changes.
Create:
- autoresearch.md: checked in, durable session brief
- autoresearch.sh: checked in, benchmark runner
- autoresearch.checks.sh: checked in only when correctness gates are required
- .autoresearch/: local artifact directory, not checked in

Ensure local artifacts stay untracked:
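exclude_file="$(git rev-parse --git-path info/exclude)"  # in a linked worktree, .git is a file, so resolve the real path
rg -qxF '.autoresearch/' "$exclude_file" || printf '\n.autoresearch/\n' >> "$exclude_file"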
Run python3 scripts/init_experiment.py --help if you need the exact interface, then initialize the session, collect a baseline with python3 scripts/run_experiment.py ..., and log it with python3 scripts/log_experiment.py ... before making any code changes.
autoresearch.md

This is the durable contract for the session. A fresh agent should be able to resume from it without guessing.
# Autoresearch: <goal>
## Objective
<What is being optimized and why it matters.>
## Up-Front Answers
- Primary metric:
- Unit:
- Direction:
- Minimum meaningful improvement:
- Workload command:
- Correctness gates:
- Budget / stop criteria:
## Scope
- In scope:
- Off limits:
## Decision Rule
<How many warmups, how many measured trials, how "keep" is decided.>
## Experiment Ledger
`.autoresearch/results.jsonl`
## Report Outputs
- `.autoresearch/report.html`
- `.autoresearch/results.csv`
- `.autoresearch/plots/`
## Current Best Result
<Best known baseline or kept variant.>
## What We've Learned
<Key wins, dead ends, confounders, and structural insights.>
Update autoresearch.md whenever assumptions change, a new best result is found, or a pattern becomes clear.
autoresearch.sh

Bash script with set -euo pipefail that:

- emits METRIC name=value lines

Keep it as small and deterministic as possible. If you change the benchmark protocol, record that in autoresearch.md and establish a new baseline.
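The METRIC convention can be illustrated with a small parser. This is only a sketch: it assumes one metric per line in the form `METRIC <name>=<number>`, and the function name `parse_metric_lines` is hypothetical. The actual grammar accepted by scripts/run_experiment.py may differ.

```python
import re

# Assumed line format: "METRIC <name>=<number>", one metric per line.
# The real parser in scripts/run_experiment.py may accept a different grammar.
METRIC_RE = re.compile(r"^METRIC\s+([A-Za-z_][\w.-]*)=(\S+)\s*$")


def parse_metric_lines(stdout: str) -> dict[str, float]:
    """Collect METRIC lines from benchmark output, ignoring everything else."""
    metrics: dict[str, float] = {}
    for line in stdout.splitlines():
        match = METRIC_RE.match(line.strip())
        if match is None:
            continue
        name, raw_value = match.groups()
        try:
            metrics[name] = float(raw_value)
        except ValueError:
            # Skip malformed values rather than guessing what was meant.
            continue
    return metrics
```

Ignoring non-matching lines lets the benchmark print build or progress output without affecting the measurement, which is one reason to keep the metric lines on a fixed prefix.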
autoresearch.checks.sh

Create this when the session has correctness gates. It must:
If checks fail, the experiment disposition is checks_failed, not keep. The default script workflow handles this automatically.
Everything below lives in .autoresearch/ and should remain untracked.
.autoresearch/results.jsonl

Append one JSON object per experiment. This is the source of truth for enumerating all experiments.
Required fields:
{
  "id": "exp-007",
  "timestamp": "2026-03-14T13:00:00Z",
  "hypothesis": "Inlining X removes allocation churn in hot path Y.",
  "change_summary": "Inline helper and pre-size buffer.",
  "files_touched": ["src/hot_path.ts"],
  "baseline_commit": "abc1234",
  "candidate_ref": "def5678",
  "metric_name": "latency_ms",
  "direction": "lower",
  "warmup_trials": [12.8, 12.7],
  "measured_trials": [11.9, 12.0, 11.8, 11.9, 12.1],
  "summary": {
    "median": 11.9,
    "mean": 11.94,
    "min": 11.8,
    "max": 12.1
  },
  "secondary_metrics": {
    "memory_mb": 84.1
  },
  "checks": "passed",
  "disposition": "keep",
  "reason": "Median improved by 5.2%, checks passed, complexity acceptable."
}
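The summary block in the record above follows directly from measured_trials. A minimal sketch of deriving it and appending one record per line is shown here. The helper names `summarize` and `append_record` are hypothetical and are not the interface of scripts/log_experiment.py.

```python
import json
import statistics
from pathlib import Path


def summarize(measured_trials: list[float]) -> dict[str, float]:
    """Compute the summary fields shown in the example record."""
    return {
        "median": statistics.median(measured_trials),
        "mean": round(statistics.fmean(measured_trials), 4),
        "min": min(measured_trials),
        "max": max(measured_trials),
    }


def append_record(ledger: Path, record: dict) -> None:
    """Append one experiment as a single JSON line (hypothetical helper)."""
    ledger.parent.mkdir(parents=True, exist_ok=True)
    with ledger.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
```

Opening the ledger in append mode means earlier experiments, including discarded ones, are never rewritten.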
.autoresearch/results.csv

Flatten the JSONL into a spreadsheet-friendly summary after each experiment or at regular intervals.

.autoresearch/report.html

Generate an HTML report that visualizes the session. At minimum include:

.autoresearch/plots/

Store generated charts or exported images here when the report uses external assets.
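Flattening the ledger into results.csv could look like the sketch below. The column list and the function names are assumptions made for illustration; the columns actually written by scripts/render_report.py are not specified here.

```python
import csv
import io
import json

# Column choice is an assumption; the real results.csv may differ.
COLUMNS = ["id", "timestamp", "metric_name", "direction",
           "median", "mean", "min", "max", "checks", "disposition"]


def flatten_record(record: dict) -> dict:
    """Lift the nested summary fields to top-level columns."""
    row = {key: record.get(key, "") for key in COLUMNS}
    for key, value in record.get("summary", {}).items():
        if key in COLUMNS:
            row[key] = value
    return row


def jsonl_to_csv(jsonl_text: str) -> str:
    """Convert ledger text (one JSON object per line) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for line in jsonl_text.splitlines():
        if line.strip():
            writer.writerow(flatten_record(json.loads(line)))
    return buffer.getvalue()
```

Because the CSV is derived entirely from the JSONL, it can be regenerated at any time and never needs to be edited by hand.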
Before each experiment, write a short experiment card into autoresearch.md or a temporary note:
- id
- hypothesis
- planned change
- why this should affect the metric
- files expected to change
- predicted direction and rough magnitude
- rollback plan

If you cannot explain why the change should help, do not run the experiment yet.
Use this default unless the user specifies something stricter:
1. Warm up first: 2 warmup trials that are not used for the decision.
2. Measure repeatedly: 5 measured trials for the baseline and for each serious candidate.
3. Summarize robustly
4. Apply a pre-declared threshold
5. Reset baseline when needed
keep
discard
checks_failed
crash
Never upgrade an ambiguous result to keep because it feels promising.
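A decision rule of this shape can be sketched in a few lines. The version below is an assumption-laden illustration, not the logic in scripts/log_experiment.py: it compares medians, uses a 1% default threshold, and treats a run with no measured trials as a crash. The function name `decide` and its parameters are hypothetical.

```python
import statistics
from typing import Optional


def decide(
    baseline_trials: list[float],
    candidate_trials: Optional[list[float]],
    direction: str,
    checks_passed: bool = True,
    min_improvement: float = 0.01,
) -> tuple[str, str]:
    """Return (disposition, reason) for one candidate experiment."""
    if not candidate_trials:
        return "crash", "No measured trials were produced."
    if not checks_passed:
        return "checks_failed", "Correctness checks failed."

    base = statistics.median(baseline_trials)
    cand = statistics.median(candidate_trials)
    if base == 0:
        raise ValueError("Baseline median is zero; relative gain is undefined.")
    # Positive gain always means "better", whichever direction is desired.
    gain = (base - cand) / base if direction == "lower" else (cand - base) / base

    if gain >= min_improvement:
        return "keep", f"Median improved by {gain:.1%}."
    return "discard", f"Median change of {gain:.1%} is below the {min_improvement:.0%} threshold."
```

This sketch does not account for run-to-run noise. A stricter rule might also require the improvement to exceed the spread of the baseline trials before returning keep.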
Run autonomously, but do not run forever without learning. Continue until one of these is true:
During the loop:
When discarding a candidate, restore the last committed good state in the isolated worktree before starting the next idea.
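One stop criterion mentioned later in this document is a plateau rule such as "Stop if 3 experiments in a row show <0.5% gain". A minimal sketch of that check follows. The function name `should_stop` and its parameter names are hypothetical, and the gains are assumed to be relative improvements over the best result known when each experiment ran.

```python
def should_stop(gains: list[float], patience: int = 3, min_gain: float = 0.005) -> bool:
    """True when the last `patience` experiments all gained less than `min_gain`.

    `gains` holds each experiment's relative improvement over the best result
    known at the time it ran (0.012 means 1.2% better), oldest first.
    """
    if len(gains) < patience:
        return False
    return all(gain < min_gain for gain in gains[-patience:])
```

For example, `should_stop([0.03, 0.001, 0.002, -0.01])` is true because the three most recent experiments each gained less than 0.5%.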
Autoresearch is disciplined empiricism, but several failure modes emerge in long-running optimization loops. Knowing these patterns helps prevent wasted experiments and broken workflows.
Symptom:
Root Cause:
Mitigation:
- autoresearch.md: stop after N consecutive experiments with no improvement, e.g., "Stop if 3 experiments in a row show <0.5% gain"
- autoresearch.md when exiting early

Symptom:
Root Cause:
Mitigation:
- autoresearch.md and explicitly write a "What We've Learned" section with 2–3 key insights. If that section is empty, you're spinning

Symptom:
- .autoresearch/ directory balloons to 10s or 100s of MB after 50+ experiments
- python3 scripts/render_report.py slows down measurably over time

Root Cause:
Mitigation:
- .autoresearch/archive/YYYYMMDD-run-N.jsonl and restart with a fresh results.jsonl
- autoresearch.sh does not emit verbose logs; capture only METRIC lines
- results.jsonl, record {"summary": {...}, "median": X} instead of all raw trial values if space is critical
- autoresearch.md: note when the ledger was archived and how to access previous runs
- [ "$(du -sm .autoresearch/ | cut -f1)" -gt 100 ] && echo "WARNING: artifact directory >100MB"

Symptom:
- .autoresearch/report.html after many experiments

Root Cause:

- render_report.py generates a single monolithic HTML file with all data inlined

Mitigation:

- report-01.html, report-02.html, etc.) with navigation links
- len(results) > 100, render a condensed report that includes:
- .autoresearch/data.json with the full result set and reference it from the HTML with client-side rendering, so the HTML itself stays under 1 MB
- render_report.py, if result count > 100, log: "WARNING: report contains >100 experiments; consider archiving old runs or using --summary-mode"

Generate or refresh .autoresearch/report.html periodically and always at the end of the session:
python3 scripts/render_report.py
The report should answer:
If rich charting is cumbersome, generate a simple static HTML file with inline SVG charts or lightweight local assets. The report matters more than polish.
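A static report with an inline SVG chart needs no charting library. The sketch below draws one series of medians as a polyline and wraps it in a bare HTML page. The function names, dimensions, and styling are all assumptions for illustration and do not describe what scripts/render_report.py produces.

```python
from html import escape


def svg_line_chart(values: list[float], width: int = 480, height: int = 160, pad: int = 20) -> str:
    """Render one metric series as a self-contained inline SVG polyline."""
    if not values:
        return f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}"></svg>'
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid dividing by zero on a flat series
    step = (width - 2 * pad) / max(len(values) - 1, 1)
    points = []
    for index, value in enumerate(values):
        x = pad + index * step
        # SVG's y axis points down, so larger values get smaller y.
        y = height - pad - (value - low) / span * (height - 2 * pad)
        points.append(f"{x:.1f},{y:.1f}")
    joined = " ".join(points)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        f'<polyline fill="none" stroke="black" points="{joined}"/>'
        "</svg>"
    )


def render_report(title: str, medians: list[float]) -> str:
    """Wrap the chart in a minimal static HTML page with no external assets."""
    return (
        '<!doctype html><html><head><meta charset="utf-8">'
        f"<title>{escape(title)}</title></head><body>"
        f"<h1>{escape(title)}</h1>{svg_line_chart(medians)}</body></html>"
    )
```

Keeping the chart inline means the report opens from the local filesystem without a server or network access.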
When resuming:
- autoresearch.md
- .autoresearch/results.jsonl
- .autoresearch/report.html if present

If the user sends a message while an experiment is running, finish the current measurement cycle if it is short and safe to do so, then respond. If the run is long or risky, stop at the nearest safe checkpoint and reply.
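When resuming, the current best result can be recovered from the ledger rather than from memory. The sketch below assumes each record carries `disposition` and `summary.median` as in the example record, and that the baseline was logged with a `keep` disposition. The function name `best_kept` is hypothetical.

```python
import json


def best_kept(jsonl_text: str, direction: str = "lower"):
    """Return the kept record with the best median, or None if nothing was kept."""
    kept = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("disposition") == "keep":
            kept.append(record)
    if not kept:
        return None
    # "lower" means smaller medians are better; anything else means larger is better.
    pick = min if direction == "lower" else max
    return pick(kept, key=lambda record: record["summary"]["median"])
```

Comparing the recovered record against the "Current Best Result" section of autoresearch.md is a quick way to notice a brief that has drifted out of date.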