autoresearch/SKILL.md
Autonomous experiment loop — modify code, measure, keep/discard, repeat forever. Based on Karpathy's autoresearch pattern. Use when there's working code + a measurable metric to optimize. Agent works while you sleep.
npx skillsauth add snqb/my-skills autoresearch
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Autonomous experiment loop. You have working code and a metric. You modify, measure, keep or discard, repeat. Forever. Until the human stops you.
Based on karpathy/autoresearch.
Before starting, the project MUST have:
- program.md in the project root: the research strategy

If any of these are missing, help the user create them first. Don't start the loop without all three.
This is the ONLY file the human writes. It tells you everything:
# Autoresearch: <Project Name>
## Scope
- Files you CAN modify: <list>
- Files you CANNOT modify: <list>
- You CANNOT install new dependencies
## Eval
Command: `<command that outputs the metric>`
Metric: `<name>` (lower is better | higher is better)
Time budget: <N> minutes per experiment
Parse: `grep "^<metric>:" run.log` or similar
## Strategy
<What to try. Research directions. Constraints. Philosophy.>
## Notes
<Anything else the agent should know>
A good program.md is specific. Bad: "make it faster". Good: "try SIMD for the hot loop in parse.rs, experiment with different batch sizes in process(), consider replacing HashMap with BTreeMap for sorted iteration".
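To make the Eval section concrete, here is a minimal sketch of how its fields could be read mechanically. The function name `parse_eval_section` and the returned keys are hypothetical, and the regular expressions assume the exact `Command:` / `Metric:` / `Time budget:` line shapes shown in the template; the skill itself does not prescribe a parser.

```python
import re

def parse_eval_section(program_md):
    """Extract command, metric name, direction, and time budget (hypothetical helper)."""
    command = re.search(r"^Command:\s*`(.+)`\s*$", program_md, re.MULTILINE)
    metric = re.search(
        r"^Metric:\s*`?([^`(\n]+?)`?\s*\((lower|higher) is better\)",
        program_md,
        re.MULTILINE,
    )
    budget = re.search(
        r"^Time budget:\s*(\d+(?:\.\d+)?)\s*minutes?", program_md, re.MULTILINE
    )
    if not (command and metric and budget):
        raise ValueError("program.md is missing Command, Metric, or Time budget")
    return {
        "command": command.group(1),
        "metric": metric.group(1).strip(),
        "lower_is_better": metric.group(2) == "lower",
        "budget_seconds": float(budget.group(1)) * 60,
    }

# Sample Eval section in the template's format.
EXAMPLE = """\
## Eval
Command: `uv run train.py`
Metric: val_bpb (lower is better)
Time budget: 5 minutes
Parse: `grep "^val_bpb:" run.log`
"""

print(parse_eval_section(EXAMPLE))
```

An unfilled template (with `lower is better | higher is better` still in place) fails the direction match and raises, which is a cheap way to catch a program.md the human has not finished.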
When the user says "start autoresearch" or points you at a program.md:
- Pick a short run tag (e.g. apr2, v3-perf)
- Create the branch: `git checkout -b autoresearch/<tag>`

Run in tmux (background, survives disconnects):
LOOP FOREVER:
1. Review state
- Current best metric
- What's been tried (scan results.jsonl)
- What directions remain from program.md
2. Form hypothesis
- One clear idea, one change
- Write it as a commit message BEFORE coding
3. Modify code
- Only files in scope
- Keep changes minimal and reviewable
4. Commit
git add -A && git commit -m "exp: <description>"
5. Run experiment
<eval command> > run.log 2>&1
- Redirect everything. Do NOT flood context with training output.
- If exceeds 2× time budget, kill it → treat as crash
6. Read result
Parse metric from run.log
If empty → crash. Read `tail -50 run.log` for stack trace.
7. Decide
IMPROVED (metric better):
→ Log as "keep" in results.jsonl
→ This commit becomes new baseline
→ Continue from here
WORSE OR EQUAL:
→ Log as "discard" in results.jsonl
→ git reset --hard HEAD~1
→ Back to previous best
CRASH:
→ If trivial fix (typo, import): fix and retry once
→ Otherwise: log as "crash", git reset --hard HEAD~1, move on
8. GOTO 1
NEVER STOP. NEVER ASK "should I continue?".
The human may be asleep. Work until interrupted.
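Steps 5 to 7 of the loop can be sketched in Python as follows. This is an illustration under stated assumptions, not the skill's implementation: the helper names `run_experiment`, `parse_metric`, and `decide` are hypothetical, the metric is assumed to appear in the log as `<metric>: <number>`, and the git commit and reset steps are left to the agent.

```python
import os
import re
import subprocess
import tempfile

NUMBER = r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"

def run_experiment(command, budget_seconds, log_path="run.log"):
    """Step 5: run the eval command with stdout and stderr redirected to the log.

    Returns False when the run exceeds 2x the time budget (treated as a crash).
    """
    with open(log_path, "w") as log:
        try:
            subprocess.run(command, shell=True, stdout=log,
                           stderr=subprocess.STDOUT, timeout=2 * budget_seconds)
        except subprocess.TimeoutExpired:
            # subprocess.run kills the shell on timeout; processes the shell
            # spawned may need extra cleanup in a real setup.
            return False
    return True

def parse_metric(log_path, metric):
    """Step 6: return the last `<metric>: <number>` value, or None if absent."""
    pattern = re.compile(rf"^{re.escape(metric)}:\s*({NUMBER})\s*$", re.MULTILINE)
    with open(log_path) as log:
        matches = pattern.findall(log.read())
    return float(matches[-1]) if matches else None

def decide(metric, best, lower_is_better):
    """Step 7: map a result to keep, discard, or crash."""
    if metric is None:
        return "crash"
    improved = metric < best if lower_is_better else metric > best
    return "keep" if improved else "discard"

# Demo with a stand-in eval command that just prints a metric line.
demo_log = os.path.join(tempfile.mkdtemp(), "run.log")
finished = run_experiment("echo val_bpb: 0.993", budget_seconds=5, log_path=demo_log)
metric = parse_metric(demo_log, "val_bpb") if finished else None
print(decide(metric, best=0.998, lower_is_better=True))
```

An equal metric maps to "discard", matching the WORSE OR EQUAL branch above, so a change has to earn its place.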
Append one JSON line per experiment to results.jsonl (untracked by git):
{"n":0,"commit":"a1b2c3d","metric":0.998,"status":"keep","description":"baseline","timestamp":"2026-04-02T03:10:00Z"}
{"n":1,"commit":"b2c3d4e","metric":0.993,"status":"keep","description":"increase LR to 0.04","timestamp":"2026-04-02T03:15:00Z"}
{"n":2,"commit":"c3d4e5f","metric":1.005,"status":"discard","description":"switch to GeLU activation","timestamp":"2026-04-02T03:20:00Z"}
{"n":3,"commit":"d4e5f6a","metric":null,"status":"crash","description":"double model width (OOM)","timestamp":"2026-04-02T03:25:00Z"}
Add results.jsonl and run.log to .gitignore if not already there.
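One way to write these records, sketched with only the Python standard library. The function name `log_result` is hypothetical; the field names and the UTC timestamp format are taken from the sample rows above.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def log_result(path, n, commit, metric, status, description):
    """Append one experiment record as a single JSON line (hypothetical helper)."""
    record = {
        "n": n,
        "commit": commit,
        "metric": metric,  # None serializes to null, as in the crash row
        "status": status,
        "description": description,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    # Append mode keeps earlier experiments intact.
    with open(path, "a") as f:
        f.write(json.dumps(record, separators=(",", ":")) + "\n")
    return record

# Demo against a throwaway file instead of the real results.jsonl.
demo_path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
log_result(demo_path, 0, "a1b2c3d", 0.998, "keep", "baseline")
log_result(demo_path, 1, "b2c3d4e", None, "crash", "double model width (OOM)")
with open(demo_path) as f:
    print(f.read(), end="")
```

Appending one line per experiment means a crash mid-run can lose at most the record being written, and the file stays readable with line-oriented tools such as `tail` and `jq`.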
From Karpathy:
All else being equal, simpler is better. A small improvement that adds ugly complexity is not worth it. Removing something and getting equal or better results is a great outcome. Weigh the complexity cost against the improvement magnitude.
If you run out of ideas, in order:
Do NOT stop. Think harder.
Start the loop in tmux so it survives terminal disconnects:
# Create or attach to tmux session
tmux new-session -d -s autoresearch 2>/dev/null || true
tmux send-keys -t autoresearch "cd <project-dir>" Enter
The agent runs inside the tmux session. User can:
- `tmux attach -t autoresearch`: watch live
- Read results.jsonl anytime: see progress
- Ctrl+C or kill the agent: stop the loop

When the user asks "how's it going?" or comes back in the morning:
# Summary
echo "Experiments: $(wc -l < results.jsonl)"
echo "Keeps: $(grep '"keep"' results.jsonl | wc -l)"
# Best assumes lower is better; for higher-is-better metrics use `sort -n | tail -1`
echo "Best: $(grep '"keep"' results.jsonl | jq -r '.metric' | sort -n | head -1)"
echo "---"
# Last 5 experiments
tail -5 results.jsonl | jq -r '"\(.n). \(.status) \(.metric // "crash") — \(.description)"'
Or a proper summary:
# Full experiment history
jq -r '"#\(.n) [\(.status)] \(.metric // "crash") — \(.description)"' results.jsonl
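Which "keep" counts as best depends on the metric's direction in program.md. The following sketch makes that explicit; the function name `summarize` and its `lower_is_better` parameter are hypothetical, and the sample rows are trimmed to the fields the summary reads.

```python
import json

def summarize(lines, lower_is_better=True):
    """Count experiments and keeps, and pick the best kept record (hypothetical helper)."""
    records = [json.loads(line) for line in lines if line.strip()]
    keeps = [r for r in records if r["status"] == "keep" and r["metric"] is not None]
    pick = min if lower_is_better else max
    best = pick(keeps, key=lambda r: r["metric"]) if keeps else None
    return {"experiments": len(records), "keeps": len(keeps), "best": best}

SAMPLE = [
    '{"n":0,"metric":0.998,"status":"keep","description":"baseline"}',
    '{"n":1,"metric":0.993,"status":"keep","description":"increase LR to 0.04"}',
    '{"n":2,"metric":1.005,"status":"discard","description":"switch to GeLU activation"}',
    '{"n":3,"metric":null,"status":"crash","description":"double model width (OOM)"}',
]

summary = summarize(SAMPLE, lower_is_better=True)
print(summary["experiments"], summary["keeps"], summary["best"]["metric"])  # 4 2 0.993
```

In practice the lines would come from reading results.jsonl, for example `summarize(open("results.jsonl"))`.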
## Scope
- CAN modify: train.py
- CANNOT modify: prepare.py, pyproject.toml
## Eval
Command: `uv run train.py`
Metric: val_bpb (lower is better)
Time budget: 5 minutes
Parse: `grep "^val_bpb:" run.log`
## Scope
- CAN modify: src/handlers/, src/db/queries/
## Eval
Command: `cargo build --release && wrk -t4 -c100 -d30s http://localhost:8080/api/search`
Metric: Requests/sec (higher is better)
Time budget: 2 minutes
Parse: `grep "^Requests/sec:" run.log | awk '{print $2}'`
## Scope
- CAN modify: prompts/system.txt
## Eval
Command: `python eval_prompt.py --dataset eval_set.jsonl`
Metric: accuracy (higher is better)
Time budget: 3 minutes
Parse: `grep "^accuracy:" run.log`
## Scope
- CAN modify: src/, package.json (deps only)
## Eval
Command: `npm run build && du -sb dist/ | cut -f1`
Metric: bytes (lower is better)
Time budget: 1 minute
Parse: `tail -1 run.log` (the build also prints to stdout, so only the last line is the number)
## Scope
- CAN modify: src/components/, src/styles/
## Eval
Command: `npm run build && npx lighthouse http://localhost:3000 --output=json --chrome-flags="--headless" | jq '.categories.performance.score'`
Metric: performance score (higher is better)
Time budget: 2 minutes
Parse: `tail -1 run.log` (build output precedes the score, so read only the last line)