skills/autoresearch/SKILL.md
This skill should be used when the user asks to "run autoresearch", "optimize X in a loop", "set up autonomous experiments", "start autoresearch", "optimize X overnight", or "experiment loop". Sets up and runs an autonomous experiment loop for any optimization target.
npx skillsauth add paulrberg/agent-skills autoresearchInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Autonomous experiment loop: try ideas, measure results, keep what works, discard what doesn't, never stop.
Works for any optimization target: test speed, bundle size, LLM training, build times, Lighthouse scores, binary size, latency, memory usage.
If autoresearch.md already exists in the working directory, skip setup and resume the loop — read autoresearch.md, autoresearch.jsonl, and git log, then continue experimenting.
Otherwise:
$ARGUMENTS and conversation) the Goal, Command to benchmark, Primary metric (name + direction), Files in scope, and Constraints.git checkout -b autoresearch/<goal>-<date> (e.g. autoresearch/test-speed-2026-03-21).autoresearch.md and autoresearch.sh (see templates below). If constraints require correctness validation (tests must pass, types must check), also create autoresearch.checks.sh. Commit all.autoresearch.mdThe heart of the session. A fresh agent with no context should be able to read this file alone and run the loop effectively. Invest time making it excellent.
# Autoresearch: <goal>
## Objective
<Specific description of what we're optimizing and the workload.>
## Metrics
- **Primary**: <name> (<unit>, lower/higher is better)
- **Secondary**: <name>, <name>, ...
## How to Run
`./autoresearch.sh` — outputs `METRIC name=value` lines.
## Files in Scope
<Every file the agent may modify, with a brief note on what it does.>
## Off Limits
<What must NOT be touched — evaluation harness, data prep, etc.>
## Constraints
<Hard rules: tests must pass, no new deps, fixed time budget, etc.>
## What's Been Tried
<Update this section as experiments accumulate. Note key wins, dead ends,
and architectural insights so the agent doesn't repeat failed approaches.>
Update autoresearch.md periodically — especially "What's Been Tried" — so resuming agents have full context.
autoresearch.shBash script that runs the benchmark and outputs structured metrics.
#!/bin/bash
set -euo pipefail
# Pre-checks (fast, <1s — catch syntax errors early)
python3 -c "import ast; ast.parse(open('train.py').read())"
# Run benchmark
uv run train.py > /tmp/autoresearch-output.log 2>&1
# Extract and output metrics as METRIC lines
val_bpb=$(grep "^val_bpb:" /tmp/autoresearch-output.log | awk '{print $2}')
echo "METRIC val_bpb=$val_bpb"
Rules:
set -euo pipefail.METRIC name=value lines to stdout (one per metric). The primary metric name must match what's documented in autoresearch.md.µ (e.g. val_bpb, total_µs, bundle.size_kb).autoresearch.checks.sh (optional)Backpressure checks: tests, types, lint. Only create when constraints require correctness validation.
#!/bin/bash
set -euo pipefail
pnpm test --run --reporter=dot 2>&1 | tail -50
pnpm typecheck 2>&1 | grep -i error || true
When this file exists:
checks_failed and revert.When this file does not exist, skip checks entirely.
LOOP FOREVER. Never ask "should I continue?" — the user expects autonomous work.
Each iteration:
autoresearch.ideas.md, choose what to try next.git add -A && git commit -m "<short description of what this experiment tries>"timeout 600 ./autoresearch.sh > run.log 2>&1
If the command times out or crashes, treat it as a failure.METRIC lines from the output:
grep '^METRIC ' run.log
If no METRIC lines found, the run crashed — read tail -50 run.log for the error.autoresearch.checks.sh exists and benchmark passed):
timeout 300 ./autoresearch.checks.sh > checks.log 2>&1
keep. The commit stays.discard. Revert: stage autoresearch files first, then reset.crash. Fix if trivial, otherwise revert and move on.checks_failed. Revert.autoresearch.jsonl:
{"run":1,"commit":"a1b2c3d","metric":0.9979,"metrics":{"val_bpb":0.9979,"peak_vram_mb":45060.2},"status":"keep","description":"baseline","timestamp":1711036800000,"confidence":null}
# Preserve autoresearch session files, revert everything else
git add autoresearch.jsonl autoresearch.md autoresearch.sh autoresearch.ideas.md autoresearch.checks.sh 2>/dev/null || true
git checkout -- .
git clean -fd
bash "$(dirname "$(readlink -f "$0")")/scripts/confidence.sh"
Interpret the score:
autoresearch.md "What's Been Tried" section and run the summary script to review progress.Repeat forever until interrupted.
Each line in autoresearch.jsonl is a JSON object:
| Field | Type | Description |
| ------------- | -------------- | ---------------------------------------------- |
| run | number | 1-indexed experiment count |
| commit | string | Short git SHA (7 chars) |
| metric | number | Primary metric value |
| metrics | object | All metrics dict (primary + secondary) |
| status | string | keep, discard, crash, or checks_failed |
| description | string | What this experiment tried |
| timestamp | number | Unix timestamp (ms) |
| confidence | number or null | MAD-based confidence score (null if <3 runs) |
When autoresearch.md exists in the working directory:
autoresearch.md for full context (objective, what's been tried, constraints).autoresearch.jsonl to reconstruct state (best metric, run count, last segment).git log --oneline -20 for recent commit history.autoresearch.ideas.md if it exists — prune stale entries, experiment with promising ones.When you discover complex but promising optimizations you won't pursue right now, append them as bullets to autoresearch.ideas.md. Don't let good ideas get lost.
On resume, check this file — prune stale/tried entries, experiment with the rest. When all paths are exhausted, delete the file and write a final summary to autoresearch.md.
See references/loop-rules.md for the full reference. Key rules:
If the user sends a message while an experiment is running, finish the current run-evaluate-log cycle first, then incorporate their feedback in the next iteration.
development
This skill should be used when the user asks to "debrief", "debrief this task", "debrief the session", "save findings", "save analysis", "save this as a report", "create an HTML report from the transcript", or wants to persist the current task's findings as a self-contained interactive HTML playground at `./.ai/reports/<slug>/index.html`. Flag: --md emits a plain Markdown report at `./.ai/reports/<slug>/index.md` and skips the playground dependency.
documentation
This skill should be used when the user asks to create or update a GitHub PR, file or update an issue, post a comment, or start a discussion. Trigger phrases include "create PR", "open PR", "file an issue", "update issue", "yeet a PR/issue/discussion", "comment on an issue".
development
This skill should be used when the user asks to resolve an EVM chain name or chain ID; find chain metadata such as a default public RPC, native currency symbol, or block explorer URL; determine whether a chain is supported by RouteMesh; or read on-chain account data for any EVM chain — "check ETH balance", "query ERC-20 balance", "get wallet balance", "check token holdings", "fetch NFT transfers", "ERC-721 or ERC-1155 transfer history", "transaction history", "find first funding transaction", "trace fund origin", "who funded this address", "query Etherscan", "query Blockscout", or "look up a chain on Chainscout". It routes each data query through Etherscan API V2 (preferred) or the Blockscout/Chainscout APIs (fallback for chains Etherscan doesn't serve), with direct JSON-RPC as a last resort. Also use it for chain resolution before fetching data from or interacting with an EVM chain.
development
This skill should be used when the user asks to commit changes, craft a commit message, or run a commit workflow. Creates atomic git commits with conventional-commit formatting and optional deep analysis or push. Flags: --all, --deep, --close, --push.