snqb/autoresearch

Name: autoresearch
Author: snqb

autoresearch/SKILL.md

npx skillsauth add snqb/my-skills autoresearch

Autoresearch

Autonomous experiment loop. You have working code and a metric. You modify, measure, keep or discard, repeat. Forever. Until the human stops you.

Based on karpathy/autoresearch.

Requirements

Before starting, the project MUST have:

- Working code that produces a measurable result
- An eval command that outputs a number (whether lower or higher is better is defined in program.md)
- A program.md in the project root: the research strategy

If any of these are missing, help the user create them first. Don't start the loop without all three.

program.md

This is the ONLY file the human writes. It tells you everything:

# Autoresearch: <Project Name>

## Scope
- Files you CAN modify: <list>
- Files you CANNOT modify: <list>
- You CANNOT install new dependencies

## Eval
Command: `<command that outputs the metric>`
Metric: `<name>` (lower is better | higher is better)
Time budget: <N> minutes per experiment
Parse: `grep "^<metric>:" run.log` or similar

## Strategy
<What to try. Research directions. Constraints. Philosophy.>

## Notes
<Anything else the agent should know>

A good program.md is specific. Bad: "make it faster". Good: "try SIMD for the hot loop in parse.rs, experiment with different batch sizes in process(), consider replacing HashMap with BTreeMap for sorted iteration".

Setup

When the user says "start autoresearch" or points you at a program.md:

1. Read program.md: understand scope, eval, strategy
2. Read all in-scope files for full context
3. Agree on a run tag with the user (e.g. apr2, v3-perf)
4. Create a branch: git checkout -b autoresearch/<tag>
5. Run the baseline: execute the eval command on unmodified code and record the result
6. Initialize results.jsonl with the baseline as its first entry
7. Confirm with the user: show the baseline and confirm everything works
8. Start the loop. From here on, fully autonomous

The Loop

Run in tmux (background, survives disconnects):

LOOP FOREVER:

1. Review state
   - Current best metric
   - What's been tried (scan results.jsonl)
   - What directions remain from program.md

2. Form hypothesis
   - One clear idea, one change
   - Write it as a commit message BEFORE coding

3. Modify code
   - Only files in scope
   - Keep changes minimal and reviewable

4. Commit
   git add -A && git commit -m "exp: <description>"

5. Run experiment
   <eval command> > run.log 2>&1
   - Redirect everything. Do NOT flood context with training output.
   - If it exceeds 2× time budget, kill it → treat as crash

6. Read result
   Parse metric from run.log
   If empty → crash. Read `tail -50 run.log` for stack trace.

7. Decide
   IMPROVED (metric better):
     → Log as "keep" in results.jsonl
     → This commit becomes new baseline
     → Continue from here

   WORSE OR EQUAL:
     → Log as "discard" in results.jsonl
     → git reset --hard HEAD~1
     → Back to previous best

   CRASH:
     → If trivial fix (typo, import): fix and retry once
     → Otherwise: log as "crash", git reset --hard HEAD~1, move on

8. GOTO 1

NEVER STOP. NEVER ASK "should I continue?".
The human may be asleep. Work until interrupted.
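Steps 5 to 7 can be sketched in Python as follows. This is an illustration under stated assumptions, not the skill's implementation: the function names are hypothetical, the metric is assumed to appear in run.log as `<metric>: <number>`, and the git commands from steps 4 and 7 are left out.

```python
import re
import subprocess
from pathlib import Path
from typing import Optional


def run_experiment(command: str, budget_minutes: float, log_path: Path) -> bool:
    """Run the eval command with stdout and stderr redirected to log_path.

    Returns False if the command was killed for exceeding 2x the time budget.
    """
    with log_path.open("w") as log:
        try:
            subprocess.run(
                command,
                shell=True,
                stdout=log,
                stderr=subprocess.STDOUT,
                timeout=2 * budget_minutes * 60,
            )
        except subprocess.TimeoutExpired:
            return False
    return True


def parse_metric(log_path: Path, metric: str) -> Optional[float]:
    """Return the last '<metric>: <number>' value in the log, or None."""
    pattern = re.compile(
        rf"^{re.escape(metric)}:\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)",
        re.MULTILINE,
    )
    matches = pattern.findall(log_path.read_text(errors="replace"))
    return float(matches[-1]) if matches else None


def decide(best: float, new: Optional[float], lower_is_better: bool) -> str:
    """Map a result onto the three outcomes of step 7."""
    if new is None:
        # No metric in the log (or a timeout) counts as a crash.
        return "crash"
    improved = new < best if lower_is_better else new > best
    # Equal results are discarded, matching "WORSE OR EQUAL".
    return "keep" if improved else "discard"
```

A caller would treat a `False` return from `run_experiment` as a crash without parsing the log. With `shell=True`, the timeout kills the shell itself, so processes it spawned may need separate cleanup.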

Logging

Append one JSON line per experiment to results.jsonl (untracked by git):

{"n":0,"commit":"a1b2c3d","metric":0.998,"status":"keep","description":"baseline","timestamp":"2026-04-02T03:10:00Z"}
{"n":1,"commit":"b2c3d4e","metric":0.993,"status":"keep","description":"increase LR to 0.04","timestamp":"2026-04-02T03:15:00Z"}
{"n":2,"commit":"c3d4e5f","metric":1.005,"status":"discard","description":"switch to GeLU activation","timestamp":"2026-04-02T03:20:00Z"}
{"n":3,"commit":"d4e5f6a","metric":null,"status":"crash","description":"double model width (OOM)","timestamp":"2026-04-02T03:25:00Z"}

Add results.jsonl and run.log to .gitignore if not already there.
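One way to write these records is sketched below in Python. `append_result` is a hypothetical helper, and deriving `n` from the number of lines already in the file is an assumption rather than something the skill prescribes.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def append_result(path: Path, commit: str, metric: Optional[float],
                  status: str, description: str) -> dict:
    """Append one experiment record to results.jsonl and return it."""
    if status not in {"keep", "discard", "crash"}:
        raise ValueError(f"unknown status: {status}")

    # n continues from the records already logged; the baseline is n=0.
    n = 0
    if path.exists():
        with path.open() as existing:
            n = sum(1 for line in existing if line.strip())

    record = {
        "n": n,
        "commit": commit,
        "metric": metric,  # None is serialized as JSON null for crashes
        "status": status,
        "description": description,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    with path.open("a") as out:
        out.write(json.dumps(record, separators=(",", ":")) + "\n")
    return record
```

Opening the file in append mode keeps earlier entries untouched.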

Simplicity Criterion

From Karpathy:

All else being equal, simpler is better. A small improvement that adds ugly complexity is not worth it. Removing something and getting equal or better results is a great outcome. Weigh the complexity cost against the improvement magnitude.

0.001 improvement + 20 lines of hack → probably not worth it
0.001 improvement from DELETING code → definitely keep
~0 change but simpler code → keep
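The three rules of thumb above could be encoded as a rough heuristic. The Python sketch below is purely illustrative: `worth_keeping`, the `min_gain_per_line` threshold, and the `noise` tolerance are all hypothetical and would need tuning per project. The source gives a judgment call, not a formula.

```python
def worth_keeping(improvement: float, lines_added: int,
                  min_gain_per_line: float = 0.001,
                  noise: float = 1e-4) -> bool:
    """Weigh metric improvement against added complexity.

    improvement > 0 means the metric got better (in whichever direction
    program.md defines); lines_added < 0 means code was deleted.
    """
    if lines_added <= 0:
        # Simpler or same-size code: keep unless the metric clearly regressed.
        return improvement >= -noise
    # More code: every added line has to pay for itself.
    return improvement / lines_added >= min_gain_per_line


# The three cases from the list above:
print(worth_keeping(0.001, 20))   # small gain, 20 lines of hack
print(worth_keeping(0.001, -5))   # same gain from deleting code
print(worth_keeping(0.0, -3))     # no change, simpler code
```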

When Stuck

If you run out of ideas, in order:

1. Re-read the program.md strategy section
2. Re-read the in-scope files and look for patterns you missed
3. Try combining two previous near-misses
4. Try the opposite of what's been working
5. Try something radical: a different algorithm, a different approach entirely
6. Review discarded experiments: was something promising that you gave up on too early?

Do NOT stop. Think harder.

Running in tmux

Start the loop in tmux so it survives terminal disconnects:

# Create a detached tmux session (no-op if it already exists)
tmux new-session -d -s autoresearch 2>/dev/null || true
tmux send-keys -t autoresearch "cd <project-dir>" Enter

The agent runs inside the tmux session. User can:

- `tmux attach -t autoresearch`: watch live
- Check results.jsonl anytime to see progress
- Press Ctrl+C or kill the agent to stop the loop

Checking Progress

When the user asks "how's it going?" or comes back in the morning:

# Summary
echo "Experiments: $(wc -l < results.jsonl)"
echo "Keeps: $(grep -c '"keep"' results.jsonl)"
# Best kept metric. `head -1` assumes lower is better; use `tail -1` when higher is better.
echo "Best: $(grep '"keep"' results.jsonl | jq -r '.metric' | sort -n | head -1)"
echo "---"
# Last 5 experiments
tail -5 results.jsonl | jq -r '"\(.n). \(.status) \(.metric // "crash") — \(.description)"'

Or a proper summary:

# Full experiment history
jq -r '"#\(.n) [\(.status)] \(.metric // "crash") — \(.description)"' results.jsonl
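The same summary can be computed in a way that respects the metric direction. The Python sketch below assumes the record fields shown in the Logging section. The function names are hypothetical, and `lower_is_better` has to be supplied by the caller from program.md.

```python
import json
from pathlib import Path
from typing import List, Optional


def load_results(path: Path) -> List[dict]:
    """Read results.jsonl, skipping blank lines."""
    return [json.loads(line) for line in path.read_text().splitlines() if line.strip()]


def best_metric(records: List[dict], lower_is_better: bool) -> Optional[float]:
    """Best metric among kept experiments, honouring the metric direction."""
    kept = [r["metric"] for r in records
            if r["status"] == "keep" and r["metric"] is not None]
    if not kept:
        return None
    return min(kept) if lower_is_better else max(kept)


def summary_lines(records: List[dict], lower_is_better: bool, last: int = 5) -> List[str]:
    """Build the same report as the shell snippet, one string per line."""
    keeps = sum(1 for r in records if r["status"] == "keep")
    lines = [
        f"Experiments: {len(records)}",
        f"Keeps: {keeps}",
        f"Best: {best_metric(records, lower_is_better)}",
        "---",
    ]
    for r in records[-last:]:
        # Crashed runs have a null metric; show the word "crash" instead.
        metric = "crash" if r["metric"] is None else r["metric"]
        lines.append(f"{r['n']}. {r['status']} {metric} - {r['description']}")
    return lines
```

Calling `print("\n".join(summary_lines(load_results(Path("results.jsonl")), lower_is_better=True)))` prints the report.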

Examples

ML Training (Karpathy's original)

## Scope
- CAN modify: train.py
- CANNOT modify: prepare.py, pyproject.toml

## Eval
Command: `uv run train.py`
Metric: val_bpb (lower is better)
Time budget: 5 minutes
Parse: `grep "^val_bpb:" run.log`

API Latency

## Scope
- CAN modify: src/handlers/, src/db/queries/

## Eval
Command: `cargo build --release && wrk -t4 -c100 -d30s http://localhost:8080/api/search`
Metric: Requests/sec (higher is better)
Time budget: 2 minutes
Parse: `grep "^Requests/sec:" run.log | awk '{print $2}'`

Prompt Optimization

## Scope
- CAN modify: prompts/system.txt

## Eval
Command: `python eval_prompt.py --dataset eval_set.jsonl`
Metric: accuracy (higher is better)
Time budget: 3 minutes
Parse: `grep "^accuracy:" run.log`
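For the Parse line to work, the eval script has to print the metric on a line of its own, starting with the metric name. The Python sketch below shows one shape such a script could take. It is not the actual eval_prompt.py: the `input` and `expected` field names, the exact-match scoring, and both function names are assumptions made for illustration.

```python
from typing import Callable, Iterable


def accuracy(examples: Iterable[dict], predict: Callable[[str], str]) -> float:
    """Fraction of examples whose prediction exactly matches the expected answer."""
    examples = list(examples)
    if not examples:
        raise ValueError("eval set is empty")
    correct = sum(
        1 for ex in examples
        if predict(ex["input"]).strip() == ex["expected"].strip()
    )
    return correct / len(examples)


def format_metric(name: str, value: float) -> str:
    """Render the line that `grep "^<name>:" run.log` is meant to find."""
    return f"{name}: {value:.4f}"


if __name__ == "__main__":
    # Toy stand-ins for the dataset and the model call.
    toy_set = [
        {"input": "2+2", "expected": "4"},
        {"input": "3+3", "expected": "6"},
    ]
    answers = {"2+2": "4", "3+3": "7"}
    print(format_metric("accuracy", accuracy(toy_set, lambda q: answers[q])))
```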

Bundle Size

## Scope
- CAN modify: src/, package.json (deps only)

## Eval
Command: `npm run build && du -sb dist/ | cut -f1`
Metric: bytes (lower is better)
Time budget: 1 minute
Parse: `tail -1 run.log` (the build also writes to run.log, so the number is the last line)

Lighthouse

## Scope
- CAN modify: src/components/, src/styles/

## Eval
Command: `npm run build && npx lighthouse http://localhost:3000 --output=json --chrome-flags="--headless" | jq '.categories.performance.score'`
Metric: performance score (higher is better)
Time budget: 2 minutes
Parse: `tail -1 run.log` (build and Lighthouse status output also land in run.log, so the number is the last line)

Autonomous experiment loop — modify code, measure, keep/discard, repeat forever. Based on Karpathy's autoresearch pattern. Use when there's working code + a measurable metric to optimize. Agent works while you sleep.

1 star

development

Updated Apr 14, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add snqb/my-skills autoresearch

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

- Trivy: container and dependency vulnerability scanner. Clean (95%)
- Semgrep: static code analysis for vulnerabilities. Clean (95%)
- mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%)
- Snyk (dep): open source security scanning. Skipped (50%)
- Socket.dev: supply chain security analysis. Skipped (50%)
- VirusTotal: multi-engine malware detection. Skipped (50%)
- CrowdStrike: advanced threat intelligence. Skipped (50%)
- OSV-Scanner: Open Source Vulnerability database check. Skipped (50%)
- OWASP Dep-Check. Skipped (50%)

Last scanned: Apr 14, 2026, 2:30 AM (29.8s, 1 file scanned)

SKILL.md

name: autoresearch
description: Autonomous experiment loop: modify code, measure, keep/discard, repeat forever. Based on Karpathy's autoresearch pattern. Use when there's working code + a measurable metric to optimize. Agent works while you sleep.
user-invocable: true
argument-hint: [path/to/program.md]

Install manually

# Clone the repo
git clone https://github.com/snqb/my-skills.git

# Copy into the Claude Code skills folder (global), creating it if needed
mkdir -p ~/.claude/skills
cp -r my-skills/autoresearch ~/.claude/skills/

Repository: snqb/my-skills (1 star)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT