skills/autoresearch/SKILL.md
# autoresearch skill

> Adapted from [karpathy/autoresearch](https://github.com/karpathy/autoresearch) program.md.
> This skill teaches Research Loop's Empirical agent how to run autonomous
> nanochat/GPT training experiments using the autoresearch setup.

## What this skill is for

You are the Empirical Agent operating on the `karpathy/autoresearch` codebase.
Your job is to autonomously experiment with `train.py` to minimize `val_bpb`
(validation bits per byte; lower is better).

## Repository
- `prepare.py`: fixed constants, data prep, tokenizer, dataloader, evaluation. DO NOT MODIFY.
- `train.py`: the ONLY file you edit. Model architecture, optimizer, hyperparameters.
- `program.md`: agent instructions (this skill supersedes it)
- `run.log`: benchmark output (written by: `uv run train.py > run.log 2>&1`)
- `results.tsv`: your experiment log (tab-separated, not tracked by git)
You CAN:

- Edit `train.py`: this is the only file you touch
- Tune hyperparameters in `train.py`: all constants at the top are fair game

You CANNOT:

- Modify `prepare.py`: it is read-only (evaluation harness, dataloader, constants)
- Modify `pyproject.toml`
- Modify the `evaluate_bpb` function: it is the ground truth metric

Run each experiment with `uv run train.py > run.log 2>&1`.
Training runs for a fixed 5-minute wall-clock time budget regardless of what you change.
Read the result:

    grep "^val_bpb:\|^peak_vram_mb:" run.log

The summary block looks like:

    ---
    val_bpb: 0.997900
    training_seconds: 300.1
    total_seconds: 325.9
    peak_vram_mb: 45060.2
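If you would rather read the summary from Python than from `grep`, a minimal sketch follows. It assumes the `key: number` layout shown in the sample above; `parse_summary` is a hypothetical helper name, not part of the autoresearch repository.

```python
import re


def parse_summary(log_text: str) -> dict[str, float]:
    """Collect `key: number` lines (e.g. `val_bpb: 0.997900`) from run.log text.

    Assumes each metric sits on its own line, as in the sample summary block.
    Lines that do not match (the `---` separator, tracebacks) are ignored.
    """
    pattern = re.compile(
        r"^(\w+):[ \t]+([-+]?\d+(?:\.\d+)?)[ \t]*$", re.MULTILINE
    )
    return {key: float(value) for key, value in pattern.findall(log_text)}


# The sample summary block from this skill, used as test input.
sample = """---
val_bpb: 0.997900
training_seconds: 300.1
total_seconds: 325.9
peak_vram_mb: 45060.2
"""

metrics = parse_summary(sample)
print(metrics["val_bpb"], metrics["peak_vram_mb"])
```

In this sketch, a log with no `val_bpb` key in the returned dictionary would be treated as a crashed run.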
When proposing a mutation, always specify:

- the change to `train.py`
- the expected effect on `val_bpb`

Good first experiments (in rough priority order):
1. Learning rates (`MATRIX_LR`, `EMBEDDING_LR`): high leverage, low risk
2. `DEPTH` increase (8 → 10 or 12): more capacity, higher VRAM
3. `WARMDOWN_RATIO` adjustment: often undertuned
4. `WINDOW_PATTERN` change (e.g. `"SSLL"` or `"L"`): architectural
5. `TOTAL_BATCH_SIZE` increase: may improve generalization
6. `WEIGHT_DECAY` tuning: regularization
7. `ADAM_BETAS`, `SCALAR_LR`

| Delta | Action |
|-------|--------|
| `val_bpb` improved (lower) | Keep: advance the branch |
| `val_bpb` equal or worse | Discard: `git reset --hard HEAD` |
| Crash (OOM / NaN / exit 1) | Discard: check `tail -n 50 run.log` for the error |
Simplicity criterion: A 0.001 improvement that adds 20 lines of hacky code is probably not worth it. A 0.001 improvement from deleting code is always worth it.
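The keep/discard table can be sketched as a small function. This is only an illustration: `decide` is a hypothetical name, and it assumes a crashed run is represented by `None` or NaN. The simplicity criterion is a judgment call and is deliberately left out.

```python
import math
from typing import Optional


def decide(new_bpb: Optional[float], best_bpb: float) -> str:
    """Map a run's val_bpb to 'keep', 'discard', or 'crash'.

    new_bpb  -- val_bpb of the run just finished, or None/NaN if it crashed
    best_bpb -- best val_bpb kept so far on the branch
    """
    if new_bpb is None or math.isnan(new_bpb):
        return "crash"
    # Strictly lower is an improvement; equal or worse is discarded.
    return "keep" if new_bpb < best_bpb else "discard"


print(decide(0.9950, 0.9979))  # keep
print(decide(0.9979, 0.9979))  # discard
print(decide(None, 0.9979))    # crash
```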
After every run (keep or discard), append to `results.tsv` (tab-separated):

    commit val_bpb memory_gb status description
- `commit`: 7-char git hash
- `val_bpb`: metric value (0.000000 for crashes)
- `memory_gb`: `peak_vram_mb` / 1024 rounded to 1 decimal (0.0 for crashes)
- `status`: `keep`, `discard`, or `crash`
- `description`: short description of what you tried

The default `train.py` requires an NVIDIA H100. For macOS (MPS) or smaller GPUs:

- Set `DEPTH` to 4, `TOTAL_BATCH_SIZE` to `2**14`, `DEVICE_BATCH_SIZE` to 16
- Set `WINDOW_PATTERN = "L"` (banded attention is slow on non-CUDA)
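Returning to the `results.tsv` log described earlier, the five-column row and the `memory_gb` rounding can be sketched as below. `format_result_row` is a hypothetical helper, and the commit hash and descriptions in the usage lines are made-up placeholders rather than real results.

```python
def format_result_row(commit: str, val_bpb: float, peak_vram_mb: float,
                      status: str, description: str) -> str:
    """Build one tab-separated results.tsv row:
    commit, val_bpb, memory_gb, status, description.
    """
    if status not in {"keep", "discard", "crash"}:
        raise ValueError(f"unknown status: {status!r}")
    if status == "crash":
        # Crashes are logged with zeroed metrics.
        val_bpb, peak_vram_mb = 0.0, 0.0
    memory_gb = round(peak_vram_mb / 1024, 1)
    # Collapse tabs and newlines so the description cannot break the row.
    clean_description = " ".join(description.split())
    return "\t".join([
        commit[:7],            # 7-char git hash
        f"{val_bpb:.6f}",
        f"{memory_gb:.1f}",
        status,
        clean_description,
    ])


# Placeholder values for illustration only.
row = format_result_row("a1b2c3d4e5", 0.9979, 45060.2, "keep", "raise MATRIX_LR")
crash_row = format_result_row("a1b2c3d4e5", 1.2, 50000.0, "crash", "OOM at DEPTH 12")
print(row)
print(crash_row)
```

To log a row, open `results.tsv` in append mode and write it followed by a newline.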