moralespanitz/skills/autoresearch

Name: skills/autoresearch
Author: moralespanitz

skills/autoresearch/SKILL.md

npx skillsauth add moralespanitz/research-loop skills/autoresearch

autoresearch skill

Adapted from karpathy/autoresearch program.md. This skill teaches Research Loop's Empirical agent how to run autonomous nanochat/GPT training experiments using the autoresearch setup.

What this skill is for

You are the Empirical Agent operating on the karpathy/autoresearch codebase. Your job is to autonomously experiment with train.py to minimize val_bpb (validation bits per byte — lower is better).

Repository layout

prepare.py   — fixed constants, data prep, tokenizer, dataloader, evaluation. DO NOT MODIFY.
train.py     — the ONLY file you edit. Model architecture, optimizer, hyperparameters.
program.md   — agent instructions (this skill supersedes it)
run.log      — benchmark output (written by: uv run train.py > run.log 2>&1)
results.tsv  — your experiment log (tab-separated, not tracked by git)

The rules

You CAN:

Modify train.py — this is the only file you touch
Change model architecture, optimizer, hyperparameters, batch size, model size
Change anything in train.py — all constants at the top are fair game

You CANNOT:

Modify prepare.py — it is read-only (evaluation harness, dataloader, constants)
Install new packages — only what's in pyproject.toml
Modify the evaluate_bpb function — it is the ground truth metric

Running an experiment

uv run train.py > run.log 2>&1

Training runs for a fixed 5-minute wall-clock time budget regardless of what you change.

Read the result:

grep "^val_bpb:\|^peak_vram_mb:" run.log

The summary block looks like:

---
val_bpb:          0.997900
training_seconds: 300.1
total_seconds:    325.9
peak_vram_mb:     45060.2
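The summary block can also be read programmatically. The following is a minimal sketch, assuming the block keeps the `name: number` layout shown above; `parse_summary` is a hypothetical helper, not part of the autoresearch codebase.

```python
import re

# Matches summary lines such as "val_bpb:          0.997900".
SUMMARY_LINE = re.compile(r"^(\w+):\s+(-?\d+(?:\.\d+)?)\s*$")


def parse_summary(log_text: str) -> dict[str, float]:
    """Collect every `name: number` line of the log into a dict.

    Hypothetical helper: any other log line with the same shape is
    collected too, so look up only the keys you need.
    """
    metrics: dict[str, float] = {}
    for line in log_text.splitlines():
        match = SUMMARY_LINE.match(line)
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


# Sample text copied from the summary block above.
sample = """---
val_bpb:          0.997900
training_seconds: 300.1
total_seconds:    325.9
peak_vram_mb:     45060.2
"""

metrics = parse_summary(sample)
print(metrics.get("val_bpb"), metrics.get("peak_vram_mb"))
```

A crashed run usually has no summary block, so `metrics.get("val_bpb")` returns `None` and the run can be treated as a crash. A non-numeric value such as `nan` does not match the pattern and is treated as missing in the same way.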

Proposing mutations

When proposing a mutation, always specify:

What to change: exact constant name(s) or code block in train.py
Why: the theoretical reason it should improve val_bpb
Risk: VRAM impact, instability risk

Good first experiments (in rough priority order):

Learning rate tuning (MATRIX_LR, EMBEDDING_LR) — high leverage, low risk
DEPTH increase (8 → 10 or 12) — more capacity, higher VRAM
WARMDOWN_RATIO adjustment — often undertuned
WINDOW_PATTERN change (e.g. "SSLL" or "L") — architectural
TOTAL_BATCH_SIZE increase — may improve generalization
WEIGHT_DECAY tuning — regularization
Optimizer hyperparameters (ADAM_BETAS, SCALAR_LR)
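Most of the experiments above change a single constant at the top of train.py. The following is a minimal sketch of such an edit, assuming the constants are plain top-level `NAME = value` assignments; `set_constant` is a hypothetical helper, and the sample source text and its values are illustrative, not copied from the real train.py.

```python
import re


def set_constant(source: str, name: str, value: str) -> str:
    """Replace the right-hand side of the first top-level `NAME = ...` line.

    Hypothetical helper. A trailing comment on the line is preserved.
    Values that themselves contain '#' are not handled.
    """
    pattern = re.compile(
        rf"^({re.escape(name)}[ \t]*=[ \t]*)([^#\n]*?)([ \t]*(?:#.*)?)$",
        re.MULTILINE,
    )
    # A function replacement avoids backslash escapes in `value` being interpreted.
    new_source, count = pattern.subn(
        lambda m: f"{m.group(1)}{value}{m.group(3)}", source, count=1
    )
    if count == 0:
        raise KeyError(f"constant {name!r} not found")
    return new_source


# Illustrative stand-in for the top of train.py (values are assumptions).
sample = 'DEPTH = 8  # number of layers\nWINDOW_PATTERN = "SSSL"\n'

updated = set_constant(sample, "DEPTH", "10")
print(updated)
```

Editing one constant per experiment keeps the "What to change" part of a proposal exact and makes each entry in results.tsv attributable to a single change.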

Deciding keep vs discard

| Delta | Action |
|-------|--------|
| val_bpb improved (lower) | Keep: advance the branch |
| val_bpb equal or worse | Discard: git reset --hard HEAD |
| Crash (OOM / NaN / exit 1) | Discard: check tail -n 50 run.log for the error |

Simplicity criterion: A 0.001 improvement that adds 20 lines of hacky code is probably not worth it. A 0.001 improvement from deleting code is always worth it.
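The numeric part of the decision can be sketched as a small function. This is a hypothetical helper, assuming a crashed run is represented by a missing or NaN val_bpb. The simplicity criterion remains a judgment call and is not encoded here.

```python
import math
from typing import Optional


def decide(best_val_bpb: float, new_val_bpb: Optional[float]) -> str:
    """Return the results.tsv status for a finished run.

    Hypothetical helper: `None` or NaN stands for a crashed run.
    """
    if new_val_bpb is None or math.isnan(new_val_bpb):
        return "crash"
    if new_val_bpb < best_val_bpb:
        return "keep"  # strictly lower val_bpb: advance the branch
    return "discard"  # equal or worse: reset the working tree


print(decide(0.9979, 0.9950))
print(decide(0.9979, 0.9979))
print(decide(0.9979, None))
```

The comparison is strict, so a run that only ties the current best is discarded, matching the "equal or worse" row of the table.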

Logging to results.tsv

After every run (keep or discard), append to results.tsv (tab-separated):

commit	val_bpb	memory_gb	status	description

commit: 7-char git hash
val_bpb: metric value (0.000000 for crashes)
memory_gb: peak_vram_mb / 1024 rounded to 1 decimal (0.0 for crashes)
status: keep, discard, or crash
description: short description of what you tried
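The field rules above can be sketched as follows. `format_row` and `append_row` are hypothetical helpers, and the commit hash in the example is made up for illustration.

```python
def format_row(commit: str, val_bpb: float, peak_vram_mb: float,
               status: str, description: str) -> str:
    """Build one tab-separated results.tsv row from the rules above."""
    if status not in {"keep", "discard", "crash"}:
        raise ValueError(f"unknown status: {status!r}")
    if status == "crash":
        # Crashes are logged with zeroed metrics.
        val_bpb, peak_vram_mb = 0.0, 0.0
    memory_gb = peak_vram_mb / 1024
    # Tabs or newlines inside the description would break the TSV layout.
    clean = description.replace("\t", " ").replace("\n", " ")
    return f"{commit[:7]}\t{val_bpb:.6f}\t{memory_gb:.1f}\t{status}\t{clean}"


def append_row(path: str, row: str) -> None:
    """Append a row to the log file, creating the file if needed."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(row + "\n")


# Hypothetical commit hash; metrics taken from the sample summary block.
row = format_row("a1b2c3d4e5", 0.9979, 45060.2, "keep", "raise DEPTH from 8 to 10")
print(row)
```

Calling `append_row("results.tsv", row)` after each run keeps the log append-only, so discarded and crashed experiments stay visible alongside the kept ones.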

macOS / small GPU notes

The default train.py requires an NVIDIA H100. For macOS (MPS) or smaller GPUs:

Use the miolini/autoresearch-macos or trevin-creator/autoresearch-mlx fork
Lower DEPTH to 4, TOTAL_BATCH_SIZE to 2**14, and DEVICE_BATCH_SIZE to 16
Use WINDOW_PATTERN = "L" (banded attention is slow on non-CUDA backends)
Consider the TinyStories dataset for faster convergence on small models

4 stars · development · Updated Apr 20, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add moralespanitz/research-loop skills/autoresearch

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each entry below.

Trivy, container and dependency vulnerability scanner: Clean, 95%
Semgrep, static code analysis for vulnerabilities: Clean, 95%
mcp-scan (Snyk), Model Context Protocol security validation: Clean, 95%
Snyk (dep), open source security scanning: Skipped, 50%
Socket.dev, supply chain security analysis: Skipped, 50%
VirusTotal, multi-engine malware detection: Skipped, 50%
CrowdStrike, advanced threat intelligence: Skipped, 50%
OSV-Scanner, Open Source Vulnerability database check: Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 20, 2026, 3:03 PM (13.2s, 1 file scanned)

Related Skills

moralespanitz/replication
testing · Verified · Trusted · Community · 4 · SKILL.md · Updated May 5, 2026
Plan and execute a structured replication workflow for a paper, claim, or benchmark with environment selection and integrity checks.

moralespanitz/paper-pipeline
testing · Verified · Trusted · Community · 4 · SKILL.md · Updated May 5, 2026
End-to-end paper generation pipeline ported from AutoResearchClaw (Aiming Lab). 14 phases covering topic initiation through export/publish, with human-in-the-loop gates and quality gating at each handoff. Use this when the user wants a full paper pipeline run, from topic to submission-ready manuscript. Delegates to researcher/reviewer/writer/verifier subagents for stage execution and to autonomous-iteration for experiment optimization loops.

moralespanitz/literature-review
testing · Verified · Trusted · Community · 4 · SKILL.md · Updated May 5, 2026
Run a structured literature review on a topic using parallel search, evidence tables with quality scoring, and primary-source synthesis.

moralespanitz/figure-agent
development · Verified · Trusted · Community · 4 · SKILL.md · Updated May 5, 2026
Publication-quality figure generation for research papers. A decision agent selects the figure type (code plot vs architecture diagram) and generates Matplotlib/Seaborn code for quantitative figures with an iterative improvement loop. Style-matches conference templates (NeurIPS, ICML, ICLR). Use when the paper-pipeline reaches the figure generation phase, or when a user requests figures for an existing draft.

Install manually

# Clone the repo
git clone https://github.com/moralespanitz/research-loop.git

# Copy into Claude Code skills folder (global); create the folder first if it does not exist
mkdir -p ~/.claude/skills
cp -r research-loop/skills/autoresearch ~/.claude/skills/

Repository: moralespanitz/research-loop

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT