codex/skills/karpathy-loop/SKILL.md
Use when the user wants to improve, optimize, debug, test, or iterate on a prompt, agent instruction, Claude Skill, or workflow through a measured eval loop. Runs baseline tests, creates binary success checks, changes one thing at a time, retests, keeps/reverts changes, and returns an optimized final prompt or skill.
npx skillsauth add tkersey/dotfiles karpathy-loop
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Use this skill to improve a target prompt, agent instruction, skill, or workflow by running controlled experiments.
The loop is:
Baseline → Diagnose → Change one thing → Retest → Keep or reject → Repeat → Validate
The purpose is to improve measured behavior, not to rewrite by vibes.
When the user asks to “run a Karpathy loop,” “optimize this prompt,” “improve this skill,” or similar, proceed with this default workflow unless the user gives different constraints.
If the user provides too little information, make reasonable synthetic test cases and label them synthetic. Do not block on clarification unless the task, target, or success condition is genuinely ambiguous.
Prefer this structure, but adapt to whatever the user provides.
Target:
[prompt, skill, agent instruction, or workflow to improve]
Goal:
[what the target should reliably accomplish]
Test cases:
[example inputs and what good outputs should satisfy]
Success checks:
[3–6 yes/no checks]
Budget:
[number of experiments]
If Test cases are missing, create 5 synthetic cases.
If Success checks are missing, create 4 checks based on the goal.
If Budget is missing, run 3 experiments.
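The three defaults above can be sketched in a few lines of Python. This is only an illustration: `LoopSetup`, `apply_defaults`, and the `make_cases` / `make_checks` callables are hypothetical names, not part of the skill, and how synthetic cases and checks are actually generated is left to the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Defaults stated in the text: 5 synthetic cases, 4 checks, 3 experiments.
DEFAULT_CASE_COUNT = 5
DEFAULT_CHECK_COUNT = 4
DEFAULT_BUDGET = 3


@dataclass
class LoopSetup:
    """Hypothetical container for the five setup fields."""
    target: str
    goal: str
    test_cases: List[str] = field(default_factory=list)
    success_checks: List[str] = field(default_factory=list)
    budget: Optional[int] = None
    synthetic_cases: bool = False  # True when cases were generated, not supplied


def apply_defaults(
    setup: LoopSetup,
    make_cases: Callable[[str, int], List[str]],
    make_checks: Callable[[str, int], List[str]],
) -> LoopSetup:
    """Fill in whichever of test cases, success checks, and budget are missing."""
    if not setup.test_cases:
        setup.test_cases = make_cases(setup.goal, DEFAULT_CASE_COUNT)
        setup.synthetic_cases = True  # generated cases must be labeled synthetic
    if not setup.success_checks:
        setup.success_checks = make_checks(setup.goal, DEFAULT_CHECK_COUNT)
    if setup.budget is None:
        setup.budget = DEFAULT_BUDGET
    return setup
```

Keeping a `synthetic_cases` flag on the setup makes it easy to carry the "synthetic" label through to the final report.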
For normal user-facing runs, produce this final structure:
# Karpathy Loop Result
## Setup
- Target:
- Goal:
- Test cases:
- Success checks:
- Budget:
## Baseline Score
...
## Experiments
### Experiment 1
- Hypothesis:
- Change made:
- Before:
- After:
- Decision: KEEP / REJECT
- Reason:
## Final Score
...
## Changes Kept
1.
## Changes Rejected
1.
## Remaining Weaknesses
...
## Final Optimized Target
[paste full final prompt, skill, or instruction]
When the user asks for a downloadable package, create files using the companion templates in templates/ and examples in examples/.
Success checks must be binary, observable, and specific.
Good checks:
Does the output follow the requested format?
Does the output answer the actual user request?
Does the output avoid unsupported claims?
Does the output include exactly one concrete next step?
Does the output stay under the word limit?
Bad checks:
Is it good?
Is it high quality?
Is it compelling?
Score each check as:
PASS = 1
FAIL = 0
UNCERTAIN = 0
Treat uncertainty as failure unless the user explicitly asks for a softer scoring method.
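As a minimal sketch of the strict scoring rule, the mapping below gives UNCERTAIN the same value as FAIL. The names `POINTS` and `points_for` are illustrative assumptions; a softer scoring method, if the user asks for one, would replace the table.

```python
# Strict default from the text: only a clear PASS earns a point.
POINTS = {"PASS": 1, "FAIL": 0, "UNCERTAIN": 0}


def points_for(verdict: str) -> int:
    """Return 1 for PASS and 0 for FAIL or UNCERTAIN; reject anything else."""
    key = verdict.strip().upper()
    if key not in POINTS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    return POINTS[key]
```

Raising on an unknown verdict, rather than silently scoring it as zero, keeps typos in a results file from looking like genuine failures.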
Before editing the target:
Use this table:
Case | Check 1 | Check 2 | Check 3 | Check 4 | Score | Notes
Calculate:
score = passed_checks / total_checks
Example:
Baseline: 14 / 25 = 56%
Do not change the target until the baseline is scored.
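The baseline calculation can be sketched as follows, assuming the scoring table is held as a mapping from case name to one boolean per check. The function names and the sample grid are made up for illustration; the grid is simply arranged to reproduce the 14 / 25 example.

```python
def score_table(results):
    """results maps case name -> list of booleans, one per check (True = PASS).

    Returns (passed_checks, total_checks, score) with score as a fraction.
    """
    passed = sum(sum(1 for ok in checks if ok) for checks in results.values())
    total = sum(len(checks) for checks in results.values())
    if total == 0:
        raise ValueError("no checks were scored")
    return passed, total, passed / total


def format_score(label, passed, total):
    """Render a score line in the 'Baseline: 14 / 25 = 56%' style."""
    return f"{label}: {passed} / {total} = {passed / total:.0%}"


# Illustrative data only: 5 cases x 5 checks with 14 passes in total.
baseline = {
    "case 1": [True, True, True, False, False],
    "case 2": [True, True, False, False, True],
    "case 3": [True, False, True, True, False],
    "case 4": [True, True, False, False, False],
    "case 5": [True, True, True, False, False],
}

passed, total, score = score_table(baseline)
print(format_score("Baseline", passed, total))  # Baseline: 14 / 25 = 56%
```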
Before each mutation, write:
Main failure:
Evidence:
Likely cause:
Proposed one-change fix:
Risk of the fix:
Prioritize failures by:
Each experiment may make one meaningful change.
Allowed changes:
Avoid:
After a mutation:
Use this decision rule:
KEEP if the candidate improves the score and introduces no serious regression.
REJECT if the score is flat, worse, noisy, overfit, bloated, or less safe.
Default keep threshold:
At least +10 percentage points for small test sets, or at least +2 passed checks.
If scores tie, keep the simpler version.
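One possible reading of the decision rule and default threshold, as a sketch: the candidate is kept when it gains at least 10 percentage points or at least 2 passed checks and no serious regression was flagged. The function name and parameters are assumptions, whether a regression counts as "serious" remains a human judgment passed in as a flag, and the simpler-version tie-break is not modeled here.

```python
def decide(before_passed, after_passed, total, serious_regression=False,
           min_point_gain=10.0, min_check_gain=2):
    """Return "KEEP" or "REJECT" for one experiment scored on the same checks."""
    if total <= 0:
        raise ValueError("total must be positive")
    check_gain = after_passed - before_passed
    point_gain = 100.0 * check_gain / total  # gain in percentage points
    meets_threshold = (
        point_gain >= min_point_gain or check_gain >= min_check_gain
    )
    if meets_threshold and not serious_regression:
        return "KEEP"
    # Flat, worse, below-threshold, or regressing candidates are rejected.
    return "REJECT"
```

For example, going from 14 to 17 passed checks out of 25 is a gain of 12 points and returns `"KEEP"`, while 14 to 15 is a gain of 4 points and returns `"REJECT"`.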
If there are at least 7 total cases, reserve 20–30% as holdout.
Rules:
Default acceptable gap:
holdout_score is within 10 percentage points of optimization_score
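The holdout split and gap check might look like the sketch below. It assumes a seeded random split and scores expressed as fractions between 0 and 1; `split_holdout` and `gap_is_acceptable` are hypothetical names, and the 25% default is just one value inside the stated 20–30% range.

```python
import random


def split_holdout(cases, fraction=0.25, min_cases=7, seed=0):
    """Return (optimization_cases, holdout_cases).

    With fewer than min_cases cases, nothing is held out.
    """
    if not 0.2 <= fraction <= 0.3:
        raise ValueError("fraction should stay within the 20-30% range")
    cases = list(cases)
    if len(cases) < min_cases:
        return cases, []
    shuffled = cases[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_holdout = max(1, round(len(shuffled) * fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def gap_is_acceptable(optimization_score, holdout_score, max_gap_points=10.0):
    """True when the two scores differ by at most max_gap_points points."""
    gap_points = abs(optimization_score - holdout_score) * 100
    # Round away floating-point noise so an exact 10-point gap still passes.
    return round(gap_points, 6) <= max_gap_points
```

Fixing the seed keeps the same cases in the holdout set across experiments, so the holdout score stays comparable from one run to the next.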
For each experiment, use:
## Experiment N
### Hypothesis
[what this change should improve]
### Change Made
[the one change]
### Score
Before: X / Y = Z%
After: X / Y = Z%
### Improvements
...
### Regressions
...
### Decision
KEEP or REJECT
### Lesson
...
The final answer must include:
Be concise enough that the user can immediately copy and use the final target.
Read examples/simple-invocation.md when the user wants the easiest way to use the skill.
Read examples/sales-email-example.md when the user wants a concrete business prompt optimization example.
Read examples/skill-optimization-example.md when the target itself is a Claude Skill or long agent instruction.
Use these when the user wants artifacts, repeatable process files, or a downloadable skill package:
templates/cases.yaml
templates/evals.yaml
templates/experiment-log.md
templates/final-report.md
templates/results.tsv
Every improvement must connect to this chain:
test case → observed failure → one prompt change → retest → measured improvement
If that chain is missing, do not claim the prompt improved.