skills/autoresearch/SKILL.md
Use for autoresearch, autonomous experiments, optimization loops, "optimize X overnight/in a loop", or "experiment loop"; sets up iterative trials for an optimization target.
npx skillsauth add paulrberg/dot-agents autoresearch

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Autonomous experiment loop: try ideas, measure results, keep what works, discard what doesn't, never stop.
Works for any optimization target: test speed, bundle size, LLM training, build times, Lighthouse scores, binary size, latency, memory usage.
If autoresearch.md already exists in the working directory, skip setup and resume the loop — read autoresearch.md, autoresearch.jsonl, and git log, then continue experimenting.
Otherwise:
1. Gather (from $ARGUMENTS and conversation) the Goal, Command to benchmark, Primary metric (name + direction), Files in scope, and Constraints.
2. Create a branch: git checkout -b autoresearch/<goal>-<date> (e.g. autoresearch/test-speed-2026-03-21).
3. Write autoresearch.md and autoresearch.sh (see templates below). If constraints require correctness validation (tests must pass, types must check), also create autoresearch.checks.sh. Commit all.

autoresearch.md

The heart of the session. A fresh agent with no context should be able to read this file alone and run the loop effectively. Invest time making it excellent.
# Autoresearch: <goal>
## Objective
<Specific description of what we're optimizing and the workload.>
## Metrics
- **Primary**: <name> (<unit>, lower/higher is better)
- **Secondary**: <name>, <name>, ...
## How to Run
`./autoresearch.sh` — outputs `METRIC name=value` lines.
## Files in Scope
<Every file the agent may modify, with a brief note on what it does.>
## Off Limits
<What must NOT be touched — evaluation harness, data prep, etc.>
## Constraints
<Hard rules: tests must pass, no new deps, fixed time budget, etc.>
## What's Been Tried
<Update this section as experiments accumulate. Note key wins, dead ends,
and architectural insights so the agent doesn't repeat failed approaches.>
Update autoresearch.md periodically — especially "What's Been Tried" — so resuming agents have full context.
autoresearch.sh

Bash script that runs the benchmark and outputs structured metrics.
#!/bin/bash
set -euo pipefail
# Pre-checks (fast, <1s — catch syntax errors early)
python3 -c "import ast; ast.parse(open('train.py').read())"
# Run benchmark
uv run train.py > /tmp/autoresearch-output.log 2>&1
# Extract and output metrics as METRIC lines
val_bpb=$(grep "^val_bpb:" /tmp/autoresearch-output.log | awk '{print $2}')
echo "METRIC val_bpb=$val_bpb"
Rules:
- Use set -euo pipefail.
- Output METRIC name=value lines to stdout (one per metric). The primary metric name must match what's documented in autoresearch.md.
- Metric names may contain underscores, dots, and µ (e.g. val_bpb, total_µs, bundle.size_kb).

autoresearch.checks.sh (optional)

Backpressure checks: tests, types, lint. Only create when constraints require correctness validation.
#!/bin/bash
set -euo pipefail
pnpm test --run --reporter=dot 2>&1 | tail -50
pnpm typecheck 2>&1 | grep -i error || true
When this file exists:
- If checks fail, log the run as checks_failed and revert.

When this file does not exist, skip checks entirely.
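The gating described above could be sketched as a small shell function. This is a minimal sketch, not the skill's own code: the helper name `run_checks` and the `skipped`/`passed` labels are hypothetical, and it assumes `timeout` from GNU coreutils is available. The 300-second limit mirrors the one used in the loop.

```shell
# Hypothetical helper (not part of the skill): run the optional checks script
# in the given directory and print the outcome the loop should act on.
run_checks() {
  local dir="${1:-.}"
  # No checks file: skip checks entirely.
  if [ ! -f "$dir/autoresearch.checks.sh" ]; then
    echo "skipped"
    return 0
  fi
  # Checks file present: a non-zero exit or a timeout counts as a failure.
  if (cd "$dir" && timeout 300 bash ./autoresearch.checks.sh > checks.log 2>&1); then
    echo "passed"
  else
    echo "checks_failed"
  fi
}
```

A caller might write `status=$(run_checks .)` and revert when the result is `checks_failed`.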
LOOP FOREVER. Never ask "should I continue?" — the user expects autonomous work.
Each iteration:
1. Consult autoresearch.ideas.md and choose what to try next.
2. Commit the experiment:
   git add -A && git commit -m "<short description of what this experiment tries>"
3. Run the benchmark:
   timeout 600 ./autoresearch.sh > run.log 2>&1
   If the command times out or crashes, treat it as a failure.
4. Parse METRIC lines from the output:
   grep '^METRIC ' run.log
   If no METRIC lines are found, the run crashed; read tail -50 run.log for the error.
5. Run checks (if autoresearch.checks.sh exists and benchmark passed):
   timeout 300 ./autoresearch.checks.sh > checks.log 2>&1
6. Evaluate the result:
   - Metric improved: keep. The commit stays.
   - Metric did not improve: discard. Revert: stage autoresearch files first, then reset.
   - Run crashed: crash. Fix if trivial, otherwise revert and move on.
   - Checks failed: checks_failed. Revert.
7. Log the result to autoresearch.jsonl:
{"run":1,"commit":"a1b2c3d","metric":0.9979,"metrics":{"val_bpb":0.9979,"peak_vram_mb":45060.2},"status":"keep","description":"baseline","timestamp":1711036800000,"confidence":null}
# Preserve autoresearch session files, revert everything else
git add autoresearch.jsonl autoresearch.md autoresearch.sh autoresearch.ideas.md autoresearch.checks.sh 2>/dev/null || true
git checkout -- .
git clean -fd
bash "$(dirname "$(readlink -f "$0")")/scripts/confidence.sh"
Interpret the score:
Update the autoresearch.md "What's Been Tried" section and run the summary script to review progress.

Repeat forever until interrupted.
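The parse and evaluate steps of the loop can be sketched with two small helpers. This is an illustrative sketch under stated assumptions: the function names `parse_metric` and `decide_status` are hypothetical, and treating a tie as `discard` is an assumption, not something the skill specifies.

```shell
# Hypothetical helpers, not part of the skill itself.

# Print the value of "METRIC <name>=<value>" from a run log (last match wins).
# Prints nothing if the metric is absent, which the loop treats as a crash.
parse_metric() {
  local name="$1" log="$2"
  awk -v key="$name" '
    $1 == "METRIC" {
      eq = index($2, "=")
      if (eq > 0 && substr($2, 1, eq - 1) == key) value = substr($2, eq + 1)
    }
    END { if (value != "") print value }
  ' "$log"
}

# Print keep or discard given the new value, the best value so far,
# and the direction ("lower" or "higher" is better).
# Assumption: an equal value is not an improvement and is discarded.
decide_status() {
  local new="$1" best="$2" direction="$3"
  awk -v n="$new" -v b="$best" -v d="$direction" 'BEGIN {
    improved = (d == "lower") ? (n + 0 < b + 0) : (n + 0 > b + 0)
    print (improved ? "keep" : "discard")
  }'
}
```

For example, `decide_status "$(parse_metric val_bpb run.log)" 0.9979 lower` would print `keep` only when the new `val_bpb` is below 0.9979.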
Each line in autoresearch.jsonl is a JSON object:
| Field | Type | Description |
| ------------- | -------------- | ---------------------------------------------- |
| run | number | 1-indexed experiment count |
| commit | string | Short git SHA (7 chars) |
| metric | number | Primary metric value |
| metrics | object | All metrics dict (primary + secondary) |
| status | string | keep, discard, crash, or checks_failed |
| description | string | What this experiment tried |
| timestamp | number | Unix timestamp (ms) |
| confidence | number or null | MAD-based confidence score (null if <3 runs) |
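Appending a record with these fields might look like the sketch below. The helper name `log_run` is hypothetical. It assumes the description contains no double quotes or backslashes (no JSON escaping is done) and that the `metrics` argument is already a well-formed JSON object.

```shell
# Hypothetical helper: append one record to a JSONL log using the fields above.
# Arguments: file run commit metric metrics-json status description [confidence]
log_run() {
  local file="$1" run="$2" commit="$3" metric="$4" metrics="$5" status="$6" desc="$7"
  local confidence="${8:-null}"          # null until enough runs exist
  local ts=$(( $(date +%s) * 1000 ))     # Unix timestamp in milliseconds
  printf '{"run":%s,"commit":"%s","metric":%s,"metrics":%s,"status":"%s","description":"%s","timestamp":%s,"confidence":%s}\n' \
    "$run" "$commit" "$metric" "$metrics" "$status" "$desc" "$ts" "$confidence" >> "$file"
}
```

Usage: `log_run autoresearch.jsonl 1 a1b2c3d 0.9979 '{"val_bpb":0.9979}' keep baseline`.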
When autoresearch.md exists in the working directory:
1. Read autoresearch.md for full context (objective, what's been tried, constraints).
2. Read autoresearch.jsonl to reconstruct state (best metric, run count, last segment).
3. Run git log --oneline -20 for recent commit history.
4. Check autoresearch.ideas.md if it exists: prune stale entries, experiment with promising ones.

When you discover complex but promising optimizations you won't pursue right now, append them as bullets to autoresearch.ideas.md. Don't let good ideas get lost.
On resume, check this file — prune stale/tried entries, experiment with the rest. When all paths are exhausted, delete the file and write a final summary to autoresearch.md.
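Reconstructing the run count and best metric from autoresearch.jsonl on resume could be sketched as follows. The helper names `run_count` and `best_metric` are hypothetical. The sketch relies on plain text matching, so it assumes each record sits on one line in the compact form shown in the example record (no spaces after colons); a JSON-aware tool would be more robust.

```shell
# Hypothetical helpers for rebuilding loop state from a JSONL log.

# Number of runs logged so far.
run_count() {
  grep -c '"run":' "$1"
}

# Best primary metric among kept runs; direction is "lower" or "higher".
# Prints nothing if no run has status keep.
best_metric() {
  local file="$1" direction="$2"
  grep '"status":"keep"' "$file" \
    | sed 's/.*"metric":\([^,]*\),.*/\1/' \
    | awk -v d="$direction" '
        NR == 1 || (d == "lower" ? $1 + 0 < best + 0 : $1 + 0 > best + 0) { best = $1 }
        END { if (NR > 0) print best }
      '
}
```

The next run number would then be `$(( $(run_count autoresearch.jsonl) + 1 ))`, and new results are compared against `best_metric autoresearch.jsonl lower` (or `higher`, depending on the primary metric's direction).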
See references/loop-rules.md for the full reference. Key rules:
If the user sends a message while an experiment is running, finish the current run-evaluate-log cycle first, then incorporate their feedback in the next iteration.