.claude/skills/ml-experiment-loop/SKILL.md
Autonomous fixed-budget ML experiment loop — setup, iterative hypothesis testing, git-based keep/discard tracking, and indefinite autonomous execution. Implements the karpathy/autoresearch protocol.
Complete this phase once before starting the experiment loop.
Propose a run tag based on today's date (e.g., mar14). The branch autoresearch/<tag> must NOT already exist — this is a fresh run.
git branch --list "autoresearch/*"
git checkout -b autoresearch/<tag>
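The tag and freshness check above can be sketched in Python. This is a minimal illustration, not part of the protocol itself: the helper names `run_tag`, `branch_name`, and `branch_exists` are hypothetical, and the unpadded day format (`mar5` rather than `mar05`) is an assumption, since the source only shows `mar14`.

```python
from datetime import date

# Explicit month list so the tag does not depend on the system locale.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def run_tag(today: date) -> str:
    """Build a tag such as 'mar14' (assumption: day is not zero-padded)."""
    return f"{MONTHS[today.month - 1]}{today.day}"


def branch_name(tag: str) -> str:
    """Return the branch name used for a run tag."""
    return f"autoresearch/{tag}"


def branch_exists(listing: str, branch: str) -> bool:
    """Check a branch against the text printed by `git branch --list`.

    Each output line may be indented or prefixed with '*' (current branch)
    or '+' (checked out in another worktree), so those are stripped first.
    """
    names = {line.lstrip("*+ ").strip() for line in listing.splitlines()}
    return branch in names


if __name__ == "__main__":
    tag = run_tag(date.today())
    print(branch_name(tag))
```

If `branch_exists` returns true for the proposed name, pick a different tag before running `git checkout -b`.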
Read these three files for full context before touching anything:
- README.md: repository context and goals
- prepare.py: fixed constants, data prep, tokenizer, dataloader, evaluation. DO NOT MODIFY.
- train.py: the only file you modify. Architecture, optimizer, hyperparameters, training loop.

ls ~/.cache/autoresearch/
If the cache directory does not exist or is empty, stop and tell the human to run uv run prepare.py first.
uv sync
Create results.tsv with just the header row. This file stays untracked by git throughout the run.
echo -e "commit\tval_bpb\tmemory_gb\tstatus\tdescription" > results.tsv
Your very first run MUST be the unmodified baseline. Do not edit train.py yet. Run the experiment as-is (see Phase 2) to establish the baseline metric. Record it in results.tsv.
This loop runs indefinitely until the human manually interrupts it. NEVER ask the human if you should continue. NEVER stop for any reason other than: the human interrupts, or a run crashes beyond repair after multiple fix attempts.
WHILE TRUE:
1. Look at git state (current branch/commit)
2. Formulate an experimental hypothesis
3. Edit train.py
4. git commit
5. Run the experiment (redirect ALL output to file)
6. Extract the metric via grep
7. Evaluate: crash? improve? equal? worse?
8. Log to results.tsv
9. Keep (advance branch) or discard (git reset)
10. Repeat from step 2
git log --oneline -5
git status
Pick ONE focused idea to test. Examples:
If you have run out of obvious ideas:
- Re-read train.py from scratch for angles you missed
- Re-read prepare.py for constraints you may not have noticed
- Review results.tsv

You will not ask the human for ideas. You generate ideas yourself.
Edit train.py: apply only the changes needed for this single hypothesis. Keep the diff minimal and reviewable.
Constraints (from prepare.py — cannot change):
- evaluate_bpb function: this is the ground truth metric

What you CAN change in train.py: architecture, optimizer, hyperparameters, and the training loop.

VRAM constraint: Large VRAM increases are acceptable only for meaningful metric gains.
git add train.py
git commit -m "experiment: <one-line description of what you changed>"
Redirect ALL output to a log file. NEVER let training output stream directly into your context. Streaming training logs will flood your context window and crash the session.
uv run train.py > run.log 2>&1
This will run for approximately 5 minutes. If it has not finished after 10 minutes, kill it:
kill %1 # only if the run was started as background job 1 in this shell; otherwise kill the process by PID
A 10-minute timeout is treated as a crash — discard and revert.
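One way to enforce the redirect and the 10-minute limit together is to launch the run from Python. This is a sketch under stated assumptions: `run_with_timeout` is a hypothetical helper, not part of the repository, and it relies only on the standard-library `subprocess.run`, which kills the direct child process when the timeout expires. Processes spawned by that child may need separate cleanup.

```python
import subprocess
import sys


def run_with_timeout(cmd, log_path="run.log", timeout_s=600):
    """Run cmd with stdout and stderr written to log_path.

    Returns "ok" for exit code 0, "crash" for a non-zero exit code,
    and "timeout" if the run exceeds timeout_s seconds.
    """
    with open(log_path, "w") as log:
        try:
            proc = subprocess.run(
                cmd,
                stdout=log,                # nothing streams into the caller
                stderr=subprocess.STDOUT,  # same effect as `2>&1`
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return "timeout"
    return "ok" if proc.returncode == 0 else "crash"


if __name__ == "__main__":
    # Stand-in command for demonstration; the real run would be
    # ["uv", "run", "train.py"] with the default 600-second limit.
    print(run_with_timeout([sys.executable, "-c", "print('val_bpb: 1.0')"]))
```

A "timeout" result is handled the same way as "crash": discard and revert.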
DO NOT cat run.log. DO NOT tail -n 500 run.log.
Extract only the key metrics:
grep "^val_bpb:\|^peak_vram_mb:" run.log
Expected output when successful:
val_bpb: 0.997900
peak_vram_mb: 45060.2
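The same extraction can be done in Python when the values are needed as numbers. The sketch below mirrors the grep pattern above (lines beginning with `val_bpb:` or `peak_vram_mb:`). The function name `extract_metrics` is hypothetical, and treating a missing or non-numeric value as "no metric" is an assumption.

```python
import re

# Matches whole lines such as "val_bpb: 0.997900" or "peak_vram_mb: 45060.2".
METRIC_RE = re.compile(
    r"^(val_bpb|peak_vram_mb):\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$",
    re.MULTILINE,
)


def extract_metrics(log_text: str) -> dict[str, float]:
    """Return the metrics found in the log text; later lines win.

    An empty dict means no metric lines were found, which the loop
    treats as a crash.
    """
    metrics: dict[str, float] = {}
    for name, value in METRIC_RE.findall(log_text):
        metrics[name] = float(value)
    return metrics


if __name__ == "__main__":
    sample = "step 100 loss 2.31\nval_bpb: 0.997900\npeak_vram_mb: 45060.2\n"
    print(extract_metrics(sample))
```

Only the two matching lines are ever read into memory as results, so the rest of the log stays out of context.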
tail -n 50 run.log
Read the Python stack trace. Decide:
If you cannot fix a crash after 2 attempts, give up on the idea.
If val_bpb improved (lower than the current baseline), keep the commit. The branch now "advances": this commit becomes the new baseline.
Update your internal baseline value.
Simplicity criterion: Before keeping a win, weigh it:
If val_bpb is equal or worse, discard immediately. Do NOT try to "fix" a bad idea.
git reset --hard HEAD~1
This reverts train.py to the previous baseline commit.
Record the experiment. Use TAB separators (not commas — commas break descriptions).
COMMIT=$(git rev-parse --short HEAD)
# Fill in values from the grep output and your decision
echo -e "${COMMIT}\t0.997900\t44.0\tkeep\tincrease LR to 0.04" >> results.tsv
TSV schema:
| Column | Type | Example | Notes |
| ----------- | ------ | --------------------- | ---------------------------------------------------------------- |
| commit | string | a1b2c3d | 7-char short hash |
| val_bpb | float | 0.997900 | Use 0.000000 for crashes |
| memory_gb | float | 44.0 | peak_vram_mb / 1024, round to 1 decimal. Use 0.0 for crashes |
| status | enum | keep | keep, discard, or crash |
| description | string | increase LR to 0.04 | Short text, no tabs |
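A row that follows this schema can be built with a small helper. This is an illustrative sketch: `format_row` is a hypothetical name, and collapsing tabs and newlines in the description to single spaces is an assumption made to satisfy the "no tabs" note.

```python
VALID_STATUSES = {"keep", "discard", "crash"}


def format_row(commit, val_bpb, peak_vram_mb, status, description):
    """Format one results.tsv row (without a trailing newline).

    Crashes, or runs with missing metrics, are logged as 0.000000 and 0.0.
    memory_gb is peak_vram_mb / 1024, rounded to one decimal place.
    """
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    if status == "crash" or val_bpb is None or peak_vram_mb is None:
        val, mem_gb = 0.0, 0.0
    else:
        val, mem_gb = val_bpb, peak_vram_mb / 1024
    # Tabs or newlines inside the description would corrupt the row.
    desc = " ".join(description.split())
    return f"{commit}\t{val:.6f}\t{mem_gb:.1f}\t{status}\t{desc}"


if __name__ == "__main__":
    row = format_row("a1b2c3d", 0.9979, 45060.2, "keep", "baseline")
    print(row)
```

Appending the returned string plus a newline to results.tsv has the same effect as the `echo -e` command shown earlier.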
Example results.tsv:
commit val_bpb memory_gb status description
a1b2c3d 0.997900 44.0 keep baseline
b2c3d4e 0.993200 44.2 keep increase LR to 0.04
c3d4e5f 1.005000 44.0 discard switch to GeLU activation
d4e5f6a 0.000000 0.0 crash double model width (OOM)
IMPORTANT: Do NOT git add results.tsv. Leave it untracked. It tracks all experiments across keeps and discards on this branch.
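Because results.tsv survives resets, it can also be used to recover the current baseline after an interruption. The sketch below assumes the most recent `keep` row is the baseline, since every kept commit becomes the new baseline. The function name `current_baseline` is hypothetical.

```python
import csv
import io


def current_baseline(tsv_text: str):
    """Return val_bpb of the most recent 'keep' row, or None if there is none."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # descriptions may contain quote characters
    )
    kept = [float(row["val_bpb"]) for row in reader if row["status"] == "keep"]
    return kept[-1] if kept else None


if __name__ == "__main__":
    example = (
        "commit\tval_bpb\tmemory_gb\tstatus\tdescription\n"
        "a1b2c3d\t0.997900\t44.0\tkeep\tbaseline\n"
        "b2c3d4e\t0.993200\t44.2\tkeep\tincrease LR to 0.04\n"
        "c3d4e5f\t1.005000\t44.0\tdiscard\tswitch to GeLU activation\n"
    )
    print(current_baseline(example))
```

In practice the text would come from reading results.tsv in the working directory.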
When evaluating whether to keep a change, apply this framework:
| Improvement           | Complexity change        | Decision                  |
| --------------------- | ------------------------ | ------------------------- |
| > 0.005 val_bpb lower | Reasonable               | Keep                      |
| 0.001–0.005 lower     | Minimal                  | Keep                      |
| 0.001–0.005 lower     | Major (20+ lines, hacky) | Discard                   |
| ≈ 0                   | Simpler (fewer lines)    | Keep (simplification win) |
| ≈ 0                   | Equal complexity         | Discard                   |
| 0 or worse            | Any                      | Discard                   |
Goal: the lowest val_bpb in the cleanest code. Complexity is a debt that compounds.
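The framework can be written down as a small function, which makes the edge cases explicit. This is one possible reading, not a definitive rule set. The complexity labels (`"simpler"`, `"equal"`, `"minimal"`, `"major"`) are hypothetical, "≈ 0" is assumed to mean a change smaller than 0.001 in either direction, the simplification row is assumed to take precedence over the "0 or worse" row, and gains above 0.005 are assumed to be kept even with major complexity, which the table does not state.

```python
SMALL_GAIN = 0.001  # lower edge of the "0.001-0.005 lower" band
LARGE_GAIN = 0.005  # gains above this are kept

COMPLEXITY_LEVELS = {"simpler", "equal", "minimal", "major"}


def decide(baseline_bpb: float, new_bpb: float, complexity: str) -> str:
    """Return "keep" or "discard" for a finished (non-crashed) run."""
    if complexity not in COMPLEXITY_LEVELS:
        raise ValueError(f"unknown complexity: {complexity!r}")
    gain = baseline_bpb - new_bpb  # positive means val_bpb went down

    if gain > LARGE_GAIN:
        # Assumption: a large gain justifies the added complexity.
        return "keep"
    if gain >= SMALL_GAIN:
        # Small gain: keep only if the change is not a major one.
        return "discard" if complexity == "major" else "keep"
    if gain > -SMALL_GAIN and complexity == "simpler":
        # Roughly equal metric with less code: simplification win.
        return "keep"
    # Roughly equal without simplification, or clearly worse.
    return "discard"


if __name__ == "__main__":
    print(decide(0.9979, 0.9932, "minimal"))
```

Keeping the thresholds as named constants makes it easy to adjust them without touching the branching logic.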
If you've exhausted your idea backlog, work through these categories:
Re-reading train.py and prepare.py from scratch often surfaces new angles.
- Always redirect output: uv run train.py > run.log 2>&1. Streaming output floods context and crashes the session.
- Use only grep "^val_bpb:\|^peak_vram_mb:" run.log for metrics.
- Never modify prepare.py. It is read-only; the evaluation protocol is fixed.
- Discard failed experiments with git reset --hard HEAD~1. Never try to iterate on a failed idea.
- Never git add results.tsv. It records all experiments across the branch.
- pyproject.toml.

| Anti-Pattern | Why It Fails | Correct Approach |
| ---------------------------------------------- | --------------------------------------------------------------------- | ------------------------------------------------------------ |
| cat run.log or tail -n 500 run.log | Floods context with gigabytes of training logs; crashes session | grep "^val_bpb:\|^peak_vram_mb:" run.log only |
| Asking "should I keep going?" | Human is likely asleep; defeats the purpose of autonomous research | NEVER STOP. Continue indefinitely until manually interrupted |
| Trying to "fix" a failed hypothesis | Most bad ideas are fundamentally wrong, not implementation bugs | git reset --hard HEAD~1 and move to next idea |
| Running multiple hypotheses in one experiment | Impossible to attribute wins or losses to specific changes | One hypothesis per commit, one commit per experiment |
| Modifying prepare.py | Corrupts evaluation protocol; results become incomparable | Never touch prepare.py. It is fixed. |
| Forgetting to redirect output | Training stdout floods agent context mid-experiment | Always uv run train.py > run.log 2>&1 |
| git add results.tsv | Clutters git history; results span all experiments including discards | Never track results.tsv in git |
| Using commas in results.tsv | Commas inside description field break CSV parsers | Always use TAB separators in results.tsv |
| Waiting 10+ minutes for a stuck run | OOM or infinite loops hang silently | Kill any run exceeding 10 minutes; treat as crash |
| Keeping a tiny win with major complexity added | Complexity accumulates; future experiments suffer | Apply simplicity criterion: tiny gain + ugly code = discard |
Before starting:
node .claude/lib/memory/memory-search.cjs "ml experiment loop autonomous training"
Read .claude/context/memory/learnings.md
After completing a session:
- .claude/context/memory/learnings.md
- .claude/context/memory/issues.md
- .claude/context/memory/decisions.md

ASSUME INTERRUPTION: If it's not in memory or results.tsv, it didn't happen.
- ai-ml-expert: Deep PyTorch and ML domain knowledge for hypothesis generation
- modern-python: uv/ruff/ty tooling for Python project management
- git-expert: Advanced git operations for branch and reset workflows