skills/autoresearch/SKILL.md
Autonomously optimize any Claude Code skill by running it repeatedly, scoring outputs against binary evals, mutating the prompt, and keeping improvements. Based on Karpathy's autoresearch methodology. Use when: optimize this skill, improve this skill, run autoresearch on, make this skill better, self-improve skill, benchmark skill, eval my skill, run evals on. Outputs: an improved SKILL.md, a results log, and a changelog of every mutation tried.
npx skillsauth add pedronauck/skills autoresearch
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Most skills work about 70% of the time. The other 30% you get garbage. The fix isn't to rewrite the skill from scratch. It's to let an agent run it dozens of times, score every output, and tighten the prompt until that 30% disappears.
This skill adapts Andrej Karpathy's autoresearch methodology (autonomous experimentation loops) to Claude Code skills. Instead of optimizing ML training code, we optimize skill prompts.
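Take any existing skill, define what "good output" looks like as binary yes/no checks, then run an autonomous loop that runs the skill repeatedly, scores every output against those checks, mutates the prompt, and keeps only the changes that improve the score.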
Output: An improved SKILL.md + results.tsv log + changelog.md of every mutation attempted + a live HTML dashboard you can watch in your browser.
STOP. Do not run any experiments until all fields below are confirmed with the user. Ask for any missing fields before proceeding.
Before changing anything, read and understand the target skill completely.
references/ that the skill links to
Do NOT skip this. You need to understand what the skill does before you can improve it.
Convert the user's eval criteria into a structured test. Every check must be binary — pass or fail, no scales.
Format each eval as:
EVAL [number]: [Short name]
Question: [Yes/no question about the output]
Pass condition: [What "yes" looks like — be specific]
Fail condition: [What triggers a "no"]
Rules for good evals:
See references/eval-guide.md for detailed examples of good vs bad evals.
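The eval format above maps naturally onto a small record type. The following is a minimal sketch, not part of the skill itself: the `Eval` class, its `render` method, and the sample question and conditions are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Eval:
    """One binary check, mirroring the four-line eval format (hypothetical type)."""
    number: int
    name: str
    question: str
    pass_condition: str
    fail_condition: str

    def render(self) -> str:
        # Produce the text layout used in the eval format.
        return "\n".join([
            f"EVAL {self.number}: {self.name}",
            f"Question: {self.question}",
            f"Pass condition: {self.pass_condition}",
            f"Fail condition: {self.fail_condition}",
        ])


# Illustrative values only; real evals come from the user's criteria.
legibility = Eval(
    number=1,
    name="Text legibility",
    question="Is every label in the output readable?",
    pass_condition="All labels can be read without zooming",
    fail_condition="Any label is too small or overlaps another element",
)
print(legibility.render())
```

Keeping every field as plain text keeps the check binary: the scorer answers the question with yes or no and nothing else.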
Max score calculation:
max_score = [number of evals] × [runs per experiment]
Example: 4 evals × 5 runs = max score of 20.
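The same arithmetic as a short sketch. The function names `max_score` and `pass_rate` are illustrative, and rounding the pass rate to one decimal place is an assumption based on the values shown in the log examples.

```python
def max_score(num_evals: int, runs_per_experiment: int) -> int:
    """Each run can pass each eval once, so the ceiling is the product."""
    return num_evals * runs_per_experiment


def pass_rate(score: int, ceiling: int) -> float:
    """Pass rate as a percentage, rounded to one decimal place (assumed)."""
    return round(100.0 * score / ceiling, 1)


ceiling = max_score(num_evals=4, runs_per_experiment=5)
print(ceiling)                 # 20
print(pass_rate(14, ceiling))  # 70.0
```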
Before running any experiments, create a live HTML dashboard at autoresearch-[skill-name]/dashboard.html and open it in the browser.
The dashboard must:
Generate the dashboard as a single self-contained HTML file with inline CSS and JavaScript. Use Chart.js loaded from CDN for the line chart. The JS should fetch results.json (which you update after each experiment alongside results.tsv) and re-render.
Open it immediately after creating it: open dashboard.html (macOS) so the user can see it in their browser.
Update results.json after every experiment so the dashboard stays current. The JSON format:
{
"skill_name": "[name]",
"status": "running",
"current_experiment": 3,
"baseline_score": 70.0,
"best_score": 90.0,
"experiments": [
{
"id": 0,
"score": 14,
"max_score": 20,
"pass_rate": 70.0,
"status": "baseline",
"description": "original skill — no changes"
}
],
"eval_breakdown": [
{ "name": "Text legibility", "pass_count": 8, "total": 10 },
{ "name": "Pastel colors", "pass_count": 9, "total": 10 }
]
}
When the run finishes (user stops it or ceiling hit), update status to "complete" so the dashboard shows a "Done" state with final summary.
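One way to keep results.json current is a small helper that appends an experiment and refreshes the summary fields. This is a sketch under stated assumptions: `record_experiment` and `mark_complete` are hypothetical helpers, `best_score` is assumed to ignore discarded experiments, and `eval_breakdown` is left untouched here because its update depends on how outputs are scored.

```python
import json
import tempfile
from pathlib import Path


def record_experiment(path, skill_name, exp_id, score, max_score, status, description):
    """Append one experiment to results.json and refresh the summary fields."""
    if path.exists():
        data = json.loads(path.read_text(encoding="utf-8"))
    else:
        data = {
            "skill_name": skill_name,
            "status": "running",
            "current_experiment": 0,
            "baseline_score": None,
            "best_score": None,
            "experiments": [],
            "eval_breakdown": [],
        }

    rate = round(100.0 * score / max_score, 1)
    data["experiments"].append({
        "id": exp_id,
        "score": score,
        "max_score": max_score,
        "pass_rate": rate,
        "status": status,
        "description": description,
    })
    data["current_experiment"] = exp_id

    if status == "baseline":
        data["baseline_score"] = rate
    # Assumption: a discarded experiment never counts toward best_score.
    if status != "discard" and (data["best_score"] is None or rate > data["best_score"]):
        data["best_score"] = rate

    path.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
    return data


def mark_complete(path):
    """Flip status to "complete" when the run finishes."""
    data = json.loads(path.read_text(encoding="utf-8"))
    data["status"] = "complete"
    path.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
    return data


# Demo in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp) / "results.json"
    record_experiment(results, "diagram-generator", 0, 14, 20, "baseline", "original skill, no changes")
    data = record_experiment(results, "diagram-generator", 1, 16, 20, "keep", "added explicit instruction")
    print(data["baseline_score"], data["best_score"])  # 70.0 80.0
```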
Run the skill AS-IS before changing anything. This is experiment #0.
autoresearch-[skill-name]/ inside the skill's folder
results.tsv with the header row
results.json and dashboard.html, then open the dashboard in the browser
SKILL.md.baseline
results.tsv format (tab-separated):
experiment score max_score pass_rate status description
0 14 20 70.0% baseline original skill — no changes
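A minimal sketch of writing rows in this format with Python's standard `csv` module. The helper name `append_result` and the `HEADER` constant are illustrative; the column order follows the header row shown above.

```python
import csv
import tempfile
from pathlib import Path

HEADER = ["experiment", "score", "max_score", "pass_rate", "status", "description"]


def append_result(tsv_path, experiment, score, max_score, status, description):
    """Append one tab-separated row, writing the header first if the file is new."""
    is_new = not tsv_path.exists()
    with tsv_path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        if is_new:
            writer.writerow(HEADER)
        rate = f"{100.0 * score / max_score:.1f}%"
        writer.writerow([experiment, score, max_score, rate, status, description])


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "results.tsv"
    append_result(log, 0, 14, 20, "baseline", "original skill, no changes")
    print(log.read_text(encoding="utf-8"))
```

Appending rather than rewriting the file means an interrupted run still leaves every completed experiment on disk.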
IMPORTANT: After establishing baseline, confirm the score with the user before proceeding. If baseline is already 90%+, the skill may not need optimization — ask the user if they want to continue.
This is the core autoresearch loop. Once started, run autonomously until stopped.
LOOP:
1. Analyze failures. Look at which evals are failing most. Read the actual outputs that failed. Identify the pattern: is it a formatting issue? A missing instruction? An ambiguous directive?
2. Form a hypothesis. Pick ONE thing to change. Don't change 5 things at once, or you won't know what helped.
   Good mutations:
   Bad mutations:
3. Make the change. Edit SKILL.md with ONE targeted mutation.
4. Run the experiment. Execute the skill [N] times with the same test inputs.
5. Score it. Run every output through every eval. Calculate the total score.
6. Decide: keep or discard.
7. Log the result in results.tsv.
8. Repeat. Go back to step 1 of the loop.
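The keep-or-discard step can be sketched as a comparison against the best score so far. The exact rule is an assumption here: this sketch keeps a mutation only on a strict improvement, which matches the pattern in the results.tsv example in this document (ties are discarded). A mutation that simplifies the skill while holding the score may still be worth keeping, and this sketch does not model that case. The names `decide` and `replay` are hypothetical.

```python
def decide(candidate_score: int, best_score: int) -> str:
    """Assumed rule: keep only when the candidate strictly beats the best so far."""
    return "keep" if candidate_score > best_score else "discard"


def replay(baseline_score: int, scores: list[int]) -> list[str]:
    """Apply the rule to a sequence of experiment scores, updating the best on each keep."""
    best = baseline_score
    decisions = []
    for score in scores:
        decision = decide(score, best)
        if decision == "keep":
            best = score  # the kept mutation becomes the new reference point
        decisions.append(decision)
    return decisions


# Scores taken from the results.tsv example (baseline 14, then experiments 1 to 5).
print(replay(14, [16, 16, 18, 18, 19]))
# ['keep', 'discard', 'keep', 'discard', 'keep']
```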
NEVER STOP. Once the loop starts, do not pause to ask the user if you should continue. They may be away from the computer. Run autonomously until:
If you run out of ideas: Re-read the failing outputs. Try combining two previous near-miss mutations. Try a completely different approach to the same problem. Try removing things instead of adding them. Simplification that maintains the score is a win.
After each experiment (whether kept or discarded), append to changelog.md:
## Experiment [N] — [keep/discard]
**Score:** [X]/[max] ([percent]%)
**Change:** [One sentence describing what was changed]
**Reasoning:** [Why this change was expected to help]
**Result:** [What actually happened — which evals improved/declined]
**Failing outputs:** [Brief description of what still fails, if anything]
This changelog is the most valuable artifact. It's a research log that any future agent (or smarter future model) can pick up and continue from.
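A sketch of producing entries in the changelog format above. `format_entry` and `append_entry` are hypothetical helpers, the one-decimal percentage is an assumption, and the sample reasoning text is illustrative rather than taken from a real run.

```python
from pathlib import Path


def format_entry(n, decision, score, max_score, change, reasoning, result, failing):
    """Render one changelog entry following the template fields."""
    percent = 100.0 * score / max_score
    return "\n".join([
        # \u2014 is the dash character used in the template heading.
        f"## Experiment {n} \u2014 {decision}",
        f"**Score:** {score}/{max_score} ({percent:.1f}%)",
        f"**Change:** {change}",
        f"**Reasoning:** {reasoning}",
        f"**Result:** {result}",
        f"**Failing outputs:** {failing}",
    ]) + "\n\n"


def append_entry(changelog_path: Path, entry: str) -> None:
    """Append to changelog.md so earlier entries are never overwritten."""
    with changelog_path.open("a", encoding="utf-8") as f:
        f.write(entry)


entry = format_entry(
    n=1,
    decision="keep",
    score=35,
    max_score=40,
    change="Added an anti-pattern banning step numbers in diagrams.",
    reasoning="Numbered steps were a frequent failure in the baseline outputs.",  # illustrative
    result="Numbering failures dropped; other evals held steady.",
    failing="Some diagrams still have small, hard-to-read text.",
)
print(entry)
```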
When the user returns or the loop stops, present:
The skill produces five files in autoresearch-[skill-name]/:
autoresearch-[skill-name]/
├── dashboard.html # live browser dashboard (auto-refreshes)
├── results.json # data file powering the dashboard
├── results.tsv # score log for every experiment
├── changelog.md # detailed mutation log
└── SKILL.md.baseline # original skill before optimization
Plus the improved SKILL.md saved back to its original location.
results.tsv example:
experiment score max_score pass_rate status description
0 14 20 70.0% baseline original skill — no changes
1 16 20 80.0% keep added explicit instruction to avoid numbering in diagrams
2 16 20 80.0% discard tried enforcing left-to-right layout — no improvement
3 18 20 90.0% keep added color palette hex codes instead of vague "pastel" description
4 18 20 90.0% discard added anti-pattern for neon colors — no improvement
5 19 20 95.0% keep added worked example showing correct label formatting
Context gathered:
~/.claude/skills/diagram-generator/SKILL.md
Baseline run (experiment 0): Generated 10 diagrams. Scored each against 4 evals. Result: 32/40 (80%). Common failures: 3 diagrams had numbered steps, 2 had bright red elements, 3 had illegible small text.
Experiment 1 — KEEP (35/40, 87.5%): Change: Added "NEVER include step numbers, ordinal numbers (1st, 2nd), or any numerical ordering in diagrams" to the anti-patterns section. Result: Numbering failures dropped from 3 to 1. Other evals held steady.
Experiment 2 — DISCARD (34/40, 85%): Change: Added "All text must be minimum 14px font size." Result: Legibility improved by 1, but color compliance dropped by 2. Reverted.
Experiment 3 — KEEP (37/40, 92.5%):
Change: Replaced vague "pastel colors" instruction with specific hex codes: #A8D8EA, #AA96DA, #FCBAD3, #FFFFD2, #B5EAD7.
Result: Color eval went from 8/10 to 10/10. Other evals held.
Experiment 4 — DISCARD (37/40, 92.5%): Change: Added anti-pattern "Do NOT use red (#FF0000), orange (#FF8C00), or neon green (#39FF14)." Result: No change. The hex codes from experiment 3 already solved the color problem. Reverted to keep skill simpler.
Experiment 5 — KEEP (39/40, 97.5%): Change: Added a worked example showing a correct diagram with properly formatted labels (no numbers, pastel fills, left-to-right flow, legible text). Result: Hit 39/40. One remaining failure: a complex diagram with overlapping labels. Diminishing returns — stopped.
Final delivery:
What feeds into autoresearch:
What autoresearch feeds into:
A good autoresearch run:
If the skill "passes" all evals but the actual output quality hasn't improved — the evals are bad, not the skill. Go back to step 2 and write better evals.