autoresearch/SKILL.md
Autonomous research loop inspired by karpathy/autoresearch. Iteratively modify a target file (config, prompt, code), run an experiment with a fixed evaluation metric, keep improvements, discard regressions, and log everything to a TSV. Use when optimizing prompts, tuning hyperparameters, evolving agent configurations, or running overnight autonomous improvement loops.
npx skillsauth add jswortz/my-skills autoresearch
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Autonomous experiment loop for iterative research. Inspired by karpathy/autoresearch: modify code, run experiment, measure metric, keep or discard, repeat.
LOOP:
1. Hypothesize — form an idea based on past results
2. Modify — edit the target file(s)
3. Commit — snapshot the change
4. Run — execute the experiment (fixed budget)
5. Measure — extract the key metric
6. Decide — keep (metric improved) or revert (metric worse/equal)
7. Log — record result in results.tsv
GOTO 1
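The loop above can be sketched as a small driver. This is a minimal sketch, not the skill's actual implementation: every callable (`propose`, `apply_and_commit`, `run_and_measure`, `revert`, `log`) is a hypothetical hook you would supply, and `max_iters` is an assumption added so the sketch terminates, whereas the skill itself runs until interrupted.

```python
from typing import Callable, List, Optional, Tuple

Row = Tuple[str, Optional[float], str, str]


def run_loop(
    propose: Callable[[List[Row]], str],
    apply_and_commit: Callable[[str], str],
    run_and_measure: Callable[[], Optional[float]],
    revert: Callable[[], None],
    log: Callable[[str, Optional[float], str, str], None],
    baseline: float,
    lower_is_better: bool,
    max_iters: int,
) -> float:
    """Run the hypothesize/modify/commit/run/measure/decide/log cycle."""
    best = baseline
    history: List[Row] = []
    for _ in range(max_iters):
        idea = propose(history)            # 1. Hypothesize from past results
        commit = apply_and_commit(idea)    # 2-3. Modify and commit
        metric = run_and_measure()         # 4-5. Run and measure (None = crash)
        if metric is None:
            status = "crash"
        elif (metric < best) if lower_is_better else (metric > best):
            status = "keep"
        else:
            status = "discard"             # worse or equal
        if status == "keep":
            best = metric                  # 6. Decide: keep the improvement
        else:
            revert()                       # 6. Decide: undo the commit
        log(commit, metric, status, idea)  # 7. Log
        history.append((commit, metric, status, idea))
    return best
```

Passing the steps in as callables keeps the decision logic separate from git, the run command, and the metric extraction, so each piece can be swapped per project.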
Before starting the loop, establish:
1. Target file(s) (e.g. train.py, prompt.md, config.yaml, swarm_manager.py)
2. Run command (e.g. uv run train.py, novastorm deploy --epochs 5)
3. Metric and its direction (e.g. val_bpb lower-is-better, avg_score higher-is-better, bootstrapping_rate higher-is-better)
4. Extraction command (e.g. grep "^val_bpb:" run.log)
5. Branch: an autoresearch/<tag> branch for the run

Create results.tsv with header:
commit metric status description
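Creating the log file can be scripted so that an existing log is never overwritten. A minimal sketch, assuming the four column names above are joined with tabs; the function name `init_results` is hypothetical.

```python
from pathlib import Path

HEADER = ("commit", "metric", "status", "description")


def init_results(path: str = "results.tsv") -> bool:
    """Write the TSV header if the file is missing or empty.

    Returns True when the header was written, False when a non-empty
    log already exists and was left untouched.
    """
    p = Path(path)
    if p.exists() and p.stat().st_size > 0:
        return False
    p.write_text("\t".join(HEADER) + "\n", encoding="utf-8")
    return True
```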
Review past results in results.tsv. Consider:
Make ONE focused change per experiment. Examples:
git add <target-file>
git commit -m "experiment: <brief description>"
<run-command> > run.log 2>&1
Redirect all output — do NOT flood context. If using a long-running pipeline, use background execution and poll for completion.
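The redirect can also be done from Python when the loop is driven by a script. This is a minimal sketch that blocks until the command finishes or a timeout expires; it does not show the background-and-poll variant. The function name and the `timeout_s` parameter are assumptions, not part of the skill.

```python
import subprocess
from typing import Optional, Sequence


def run_experiment(
    cmd: Sequence[str],
    log_path: str = "run.log",
    timeout_s: Optional[float] = None,
) -> Optional[int]:
    """Run cmd with stdout and stderr redirected to log_path.

    Returns the exit code, or None if the run exceeded timeout_s
    (subprocess.run kills the child in that case).
    """
    with open(log_path, "wb") as log:
        try:
            proc = subprocess.run(
                cmd, stdout=log, stderr=subprocess.STDOUT, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            return None
    return proc.returncode
```

A call such as `run_experiment(["uv", "run", "train.py"], timeout_s=1800)` would enforce a fixed budget per experiment; the 1800-second value is only an illustration.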
<extraction-command>
If empty output → experiment crashed. Run tail -n 50 run.log to diagnose.
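The extraction and crash check can be sketched in Python as well. This assumes the log prints the metric as `name: value` at the start of a line, as in the `grep "^val_bpb:"` example; taking the last match is a choice made here, since grep would print every matching line.

```python
import re
from typing import Optional


def extract_metric(log_text: str, name: str = "val_bpb") -> Optional[float]:
    """Return the last 'name: value' found at a line start, or None.

    None plays the role of empty grep output: treat it as a crash.
    """
    pattern = re.compile(rf"^{re.escape(name)}:[ \t]*(\S+)", re.MULTILINE)
    matches = pattern.findall(log_text)
    if not matches:
        return None
    try:
        return float(matches[-1])
    except ValueError:
        return None


def tail(log_text: str, n: int = 50) -> str:
    """Return the last n lines of the log, like tail -n."""
    return "\n".join(log_text.splitlines()[-n:])
```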
If the metric improved, keep the commit. If it is worse or equal, revert:
git reset --hard HEAD~1
Simplicity criterion (from Karpathy): A tiny improvement that adds ugly complexity is not worth it. Removing something and getting equal or better results is a great outcome.
Append to results.tsv:
<commit> <metric> <keep|discard|crash> <description>
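Appending a row can be wrapped in a helper that keeps the file well-formed. A minimal sketch under stated assumptions: the three-decimal metric format and the replacement of commas with semicolons are choices made here, and `append_result` is a hypothetical name.

```python
ALLOWED_STATUS = {"keep", "discard", "crash"}


def append_result(
    path: str, commit: str, metric: float, status: str, description: str
) -> None:
    """Append one tab-separated row to the results log."""
    if status not in ALLOWED_STATUS:
        raise ValueError(f"unknown status: {status!r}")
    # Collapse tabs and newlines so the description stays in one column.
    clean = " ".join(description.split()).replace(",", ";")
    row = [commit, f"{metric:.3f}", status, clean]
    with open(path, "a", encoding="utf-8") as f:
        f.write("\t".join(row) + "\n")
```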
NEVER STOP unless manually interrupted. If out of ideas:
Target: src/swarm_manager.py (or .env)
Run: novastorm deploy --epochs 5 --concurrency 10
Metric: bootstrapping_rate (higher is better), avg_score (higher is better)
Extract: Parse progress.json from GCS after pipeline completes
Budget: ~30 minutes per experiment
Variables to tune:
RANDOM_MUTATION_RATE (currently 0.25)
ELITE_RETENTION_RATE (currently 0.2)
FITNESS_SCALING_FACTOR (currently 0.08)
CROSSOVER_RATE (currently 0.25)
ADAPTIVE_THRESHOLD_FLOOR (currently 0.25)
DIVERSITY_SCALING (currently 0.4)

Target: src/swarm_manager.py (skill_architect prompt)
Run: novastorm deploy --epochs 5 --concurrency 10
Metric: tool_diversity (% of insights using 3+ distinct tools), bootstrapping_rate
Extract: Parse JSONL logs from GCS
Budget: ~30 minutes per experiment
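The tool_diversity metric can be computed once the JSONL logs are downloaded. This sketch assumes a hypothetical schema in which each line is one insight with a `tools` list; the real log format is not given here, so adjust the field name to match.

```python
import json


def tool_diversity(jsonl_text: str, min_tools: int = 3) -> float:
    """Percentage of insights that used at least min_tools distinct tools."""
    total = 0
    diverse = 0
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        total += 1
        # "tools" is an assumed field name, not a documented one.
        if len(set(record.get("tools", []))) >= min_tools:
            diverse += 1
    return 100.0 * diverse / total if total else 0.0
```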
Target: src/critic_agent.py
Run: novastorm deploy --epochs 5 --concurrency 10
Metric: score_variance (lower variance = more consistent), threshold_convergence
Extract: Parse JSONL logs for score distribution
Budget: ~30 minutes per experiment
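Score variance can be computed from the same kind of log. This sketch assumes each JSONL line carries a numeric `score` field, which is a hypothetical name, and uses the population variance; whether the pipeline intends population or sample variance is not stated.

```python
import json
import statistics
from typing import Optional


def score_variance(jsonl_text: str) -> Optional[float]:
    """Population variance of the 'score' field, or None if no scores."""
    scores = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # "score" is an assumed field name, not a documented one.
        if "score" in record:
            scores.append(float(record["score"]))
    if not scores:
        return None
    return statistics.pvariance(scores)
```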
For quick iterations without deploying a full pipeline:
Target: src/swarm_manager.py
Run: uv run pytest tests/test_swarm_manager.py -v
Metric: test pass rate, phase diversity in test output
Extract: pytest exit code + grep test output
Budget: ~30 seconds per experiment
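A test pass rate can be pulled from pytest's final summary line. This is a sketch that assumes the usual summary shape, such as `== 3 passed, 1 failed in 0.12s ==`, and counts only passed, failed, and error entries; skipped tests and warnings are ignored by choice.

```python
import re
from typing import Optional


def pytest_pass_rate(output: str) -> Optional[float]:
    """Fraction of passed tests from the last pytest summary line."""
    for line in reversed(output.splitlines()):
        if line.startswith("=") and " in " in line:
            counts = {
                kind: int(n)
                for n, kind in re.findall(r"(\d+) (passed|failed|errors?)", line)
            }
            passed = counts.get("passed", 0)
            errors = counts.get("error", 0) + counts.get("errors", 0)
            total = passed + counts.get("failed", 0) + errors
            return passed / total if total else None
    return None
```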
Tab-separated. Do NOT use commas in descriptions.
commit metric status description
a1b2c3d 0.380 keep baseline: 5 epochs default params
b2c3d4e 0.420 keep increase MUTATION_RATE from 0.25 to 0.35
c3d4e5f 0.390 discard decrease ELITE_RETENTION to 0.05
d4e5f6g 0.000 crash set CROSSOVER_RATE to 1.0 (division by zero)
e5f6g7h 0.570 keep combined: MUTATION=0.35 + SCALING=0.06
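When reviewing past results, it helps to find the best kept experiment so far. A minimal sketch, assuming the file is tab-separated with the header shown above; `best_kept` is a hypothetical helper name.

```python
from typing import Dict, Optional


def best_kept(tsv_text: str, lower_is_better: bool = False) -> Optional[Dict[str, str]]:
    """Return the 'keep' row with the best metric, or None if there is none."""
    lines = [ln for ln in tsv_text.splitlines() if ln.strip()]
    if not lines:
        return None
    header = lines[0].split("\t")
    rows = [
        dict(zip(header, ln.split("\t", len(header) - 1)))
        for ln in lines[1:]
    ]
    kept = [r for r in rows if r.get("status") == "keep"]
    if not kept:
        return None
    pick = min if lower_is_better else max
    return pick(kept, key=lambda r: float(r["metric"]))
```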