tkersey/karpathy-loop

Name: karpathy-loop
Author: tkersey

codex/skills/karpathy-loop/SKILL.md

npx skillsauth add tkersey/dotfiles karpathy-loop

Karpathy Loop

Use this skill to improve a target prompt, agent instruction, skill, or workflow by running controlled experiments.

The loop is:

Baseline → Diagnose → Change one thing → Retest → Keep or reject → Repeat → Validate

The purpose is to improve measured behavior, not to rewrite by vibes.

Fast Path

When the user asks to “run a Karpathy loop,” “optimize this prompt,” “improve this skill,” or similar, proceed with this default workflow unless the user gives different constraints.

1. Identify the target prompt, instruction, skill, or workflow.
2. Define what success means in one paragraph.
3. Create or use 3–6 binary success checks.
4. Create or use 5–10 test cases.
5. Reserve 1–3 holdout cases if there are enough cases.
6. Score the original target as the baseline.
7. Run the requested number of experiments, defaulting to 3.
8. In each experiment, change exactly one meaningful thing.
9. Keep the change only if it improves the score without meaningful regressions.
10. Return the final optimized target and a concise experiment report.

If the user provides too little information, make reasonable synthetic test cases and label them synthetic. Do not block on clarification unless the task, target, or success condition is genuinely ambiguous.
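The Fast Path can be sketched as a small driver loop. This is a minimal illustration, not the skill's actual implementation: `run_loop`, `mutate`, and `evaluate` are hypothetical names, `evaluate` is assumed to return the number of passed checks for a target, and the `min_gain` default of 2 borrows the "+2 passed checks" threshold described later in the skill.

```python
from typing import Callable


def run_loop(
    target: str,
    mutate: Callable[[str, int], str],   # hypothetical: returns a candidate with one change
    evaluate: Callable[[str], int],      # hypothetical: returns the number of passed checks
    budget: int = 3,                     # default number of experiments
    min_gain: int = 2,                   # assumed keep threshold in passed checks
) -> tuple[str, list[dict]]:
    """Score the baseline, then run `budget` single-change experiments."""
    best, best_passed = target, evaluate(target)  # the baseline is scored before any edit
    log = []
    for n in range(1, budget + 1):
        candidate = mutate(best, n)               # exactly one change per experiment
        passed = evaluate(candidate)
        keep = passed - best_passed >= min_gain
        log.append({
            "experiment": n,
            "before": best_passed,
            "after": passed,
            "decision": "KEEP" if keep else "REJECT",
        })
        if keep:
            best, best_passed = candidate, passed  # later experiments build on the kept version
    return best, log
```

Rejected candidates are discarded, so each experiment always starts from the current best version rather than from the previous attempt.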

Inputs

Prefer this structure, but adapt to whatever the user provides.

Target:
[prompt, skill, agent instruction, or workflow to improve]

Goal:
[what the target should reliably accomplish]

Test cases:
[example inputs and what good outputs should satisfy]

Success checks:
[3–6 yes/no checks]

Budget:
[number of experiments]

If Test cases are missing, create 5 synthetic cases.

If Success checks are missing, create 4 checks based on the goal.

If Budget is missing, run 3 experiments.
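The three defaults above could be applied as follows. This is a sketch under assumptions: the `LoopInputs` class, its field names, and the placeholder strings are illustrative and are not part of the skill's files.

```python
from dataclasses import dataclass, field


@dataclass
class LoopInputs:
    """Hypothetical container for the five inputs listed above."""
    target: str
    goal: str
    test_cases: list[str] = field(default_factory=list)
    success_checks: list[str] = field(default_factory=list)
    budget: int = 3  # missing budget defaults to 3 experiments


def fill_defaults(inputs: LoopInputs) -> LoopInputs:
    """Fill in missing test cases and success checks with labeled placeholders."""
    if not inputs.test_cases:
        # Five synthetic cases, labeled as synthetic so they are never mistaken for real data.
        inputs.test_cases = [f"[synthetic] case {i}" for i in range(1, 6)]
    if not inputs.success_checks:
        # Four placeholder checks; in practice each would be a yes/no question derived from the goal.
        inputs.success_checks = [f"[draft] check {i} for goal: {inputs.goal}" for i in range(1, 5)]
    return inputs
```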

Output Contract

For normal user-facing runs, produce this final structure:

# Karpathy Loop Result

## Setup
- Target:
- Goal:
- Test cases:
- Success checks:
- Budget:

## Baseline Score
...

## Experiments
### Experiment 1
- Hypothesis:
- Change made:
- Before:
- After:
- Decision: KEEP / REJECT
- Reason:

## Final Score
...

## Changes Kept
1.

## Changes Rejected
1.

## Remaining Weaknesses
...

## Final Optimized Target
[paste full final prompt, skill, or instruction]

When the user asks for a downloadable package, create files using the companion templates in templates/ and examples in examples/.

Success Checks

Success checks must be binary, observable, and specific.

Good checks:

Does the output follow the requested format?
Does the output answer the actual user request?
Does the output avoid unsupported claims?
Does the output include exactly one concrete next step?
Does the output stay under the word limit?

Bad checks:

Is it good?
Is it high quality?
Is it compelling?

Score each check as:

PASS = 1
FAIL = 0
UNCERTAIN = 0

Treat uncertainty as failure unless the user explicitly asks for a softer scoring method.
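The scoring rule above is small enough to state in code. A minimal sketch, where the `Verdict` enum and `points` function are illustrative names:

```python
from enum import Enum


class Verdict(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    UNCERTAIN = "UNCERTAIN"


def points(verdict: Verdict) -> int:
    """Only a clear PASS earns a point; UNCERTAIN is scored like FAIL."""
    return 1 if verdict is Verdict.PASS else 0
```

Scoring uncertainty as zero keeps the total conservative: a check only counts when the grader can point to evidence that it passed.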

Baseline Procedure

Before editing the target:

1. Run the original target against every optimization test case.
2. Score each output against every success check.
3. Calculate the baseline score.
4. Identify the largest recurring failure.

Use this table:

Case | Check 1 | Check 2 | Check 3 | Check 4 | Score | Notes

Calculate:

score = passed_checks / total_checks

Example:

Baseline: 14 / 25 = 56%

Do not change the target until the baseline is scored.
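The baseline arithmetic can be reproduced with a short helper. This sketch assumes each case's results are stored as a list of booleans, one per check; the function names and the sample data are illustrative, chosen so that the total matches the 14 / 25 example.

```python
def score(results: dict[str, list[bool]]) -> tuple[int, int, float]:
    """Return (passed_checks, total_checks, percentage) across all cases."""
    passed = sum(sum(checks) for checks in results.values())
    total = sum(len(checks) for checks in results.values())
    if total == 0:
        raise ValueError("no checks were scored")
    return passed, total, 100 * passed / total


def most_failed_check(results: dict[str, list[bool]]) -> int:
    """Return the index of the check that fails in the most cases (first one on ties)."""
    rows = list(results.values())
    failures = [sum(not row[i] for row in rows) for i in range(len(rows[0]))]
    return failures.index(max(failures))


# Illustrative data: 5 cases x 5 checks, 14 passes in total.
baseline = {
    "case 1": [True, True, True, False, False],
    "case 2": [True, True, True, False, False],
    "case 3": [True, True, True, False, False],
    "case 4": [True, True, True, False, False],
    "case 5": [True, True, False, False, False],
}
passed, total, pct = score(baseline)
print(f"Baseline: {passed} / {total} = {pct:.0f}%")  # Baseline: 14 / 25 = 56%
```

`most_failed_check(baseline)` returns 3 here, pointing at the fourth check as the largest recurring failure to diagnose first.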

Failure Analysis

Before each mutation, write:

Main failure:
Evidence:
Likely cause:
Proposed one-change fix:
Risk of the fix:

Prioritize failures by:

1. Safety, factuality, or compliance failures.
2. Format failures that break usability.
3. Missing required content.
4. Generic or low-specificity content.
5. Tone, polish, or concision issues.

Mutation Rules

Each experiment may make one meaningful change.

Allowed changes:

Add one missing constraint.
Clarify one ambiguous instruction.
Add or tighten an output format.
Add one high-signal example.
Remove one conflicting instruction.
Move an important rule earlier.
Add one “do not” rule for a repeated failure.
Add one self-check before final output.
Simplify a confusing section.

Avoid:

Rewriting the whole target at once.
Adding many rules in one experiment.
Changing test cases after seeing failures.
Changing success checks to make the score look better.
Copying test cases into the prompt as examples.
Making the target much longer without a measurable benefit.
Removing safety, accuracy, or compliance constraints.

Retest Procedure

After a mutation:

1. Run the candidate against the same optimization cases.
2. Score with the same success checks.
3. Compare against the current best version.
4. Note improvements and regressions.

Use this decision rule:

KEEP if the candidate improves the score and introduces no serious regression.
REJECT if the score is flat, worse, noisy, overfit, bloated, or less safe.

Default keep threshold:

At least +10 percentage points for small test sets, or at least +2 passed checks.

If scores tie, keep the simpler version.
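One way to read the decision rule and threshold as code is sketched below. Several points are assumptions rather than statements from the skill: either threshold (+10 percentage points or +2 passed checks) is treated as sufficient, "serious regression" is supplied by the caller as a judgment flag, and a tie is kept only when the candidate is the simpler version.

```python
def decide(
    best_passed: int,
    candidate_passed: int,
    total_checks: int,
    serious_regression: bool = False,   # caller's judgment, not computed here
    candidate_is_simpler: bool = False,  # used only to break ties
    min_gain_pp: float = 10.0,
    min_gain_checks: int = 2,
) -> str:
    """Return "KEEP" or "REJECT" for a candidate compared with the current best."""
    if total_checks <= 0:
        raise ValueError("total_checks must be positive")
    if serious_regression:
        return "REJECT"
    gain_checks = candidate_passed - best_passed
    gain_pp = 100 * gain_checks / total_checks
    if gain_checks == 0:
        # Tie: keep whichever version is simpler.
        return "KEEP" if candidate_is_simpler else "REJECT"
    if gain_pp >= min_gain_pp or gain_checks >= min_gain_checks:
        return "KEEP"
    return "REJECT"
```

For example, moving from 14 to 15 passes out of 25 is only a 4-point gain and a single check, so `decide(14, 15, 25)` rejects the change as likely noise.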

Holdout Validation

If there are at least 7 total cases, reserve 20–30% as holdout.

Rules:

Do not use holdout failures to design early mutations.
Run holdout after the final kept change or after every 3 kept changes.
If the holdout score is much worse than the optimization score, warn that the target may be overfit.
Do not silently tune to the holdout set.

Default acceptable gap:

holdout_score is within 10 percentage points of optimization_score
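The holdout split and the gap check might look like the sketch below. The function names, the 25% default fraction (a value inside the 20–30% range), the fixed seed, and the one-sided reading of the gap (only a holdout score that is lower counts against the target) are all assumptions made for illustration.

```python
import random


def split_holdout(cases: list, fraction: float = 0.25, seed: int = 0) -> tuple[list, list]:
    """Return (optimization_cases, holdout_cases); no holdout below 7 total cases."""
    if len(cases) < 7:
        return list(cases), []
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_holdout = max(1, round(len(shuffled) * fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def may_be_overfit(optimization_pct: float, holdout_pct: float, max_gap_pp: float = 10.0) -> bool:
    """True when the holdout score trails the optimization score by more than the allowed gap."""
    return optimization_pct - holdout_pct > max_gap_pp
```

Fixing the split before the first experiment, and never inspecting holdout failures while designing mutations, is what keeps the holdout score an honest check.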

Experiment Log Format

For each experiment, use:

## Experiment N

### Hypothesis
[what this change should improve]

### Change Made
[the one change]

### Score
Before: X / Y = Z%
After: X / Y = Z%

### Improvements
...

### Regressions
...

### Decision
KEEP or REJECT

### Lesson
...

Final Report Rules

The final answer must include:

Baseline score.
Final score.
Number of experiments run.
Kept changes.
Rejected changes.
Remaining weaknesses.
The full final optimized target.

Be concise enough that the user can immediately copy and use the final target.

Practical Invocation Examples

Read examples/simple-invocation.md when the user wants the easiest way to use the skill.

Read examples/sales-email-example.md when the user wants a concrete business prompt optimization example.

Read examples/skill-optimization-example.md when the target itself is a Claude Skill or long agent instruction.

Template Files

Use these when the user wants artifacts, repeatable process files, or a downloadable skill package:

templates/cases.yaml
templates/evals.yaml
templates/experiment-log.md
templates/final-report.md
templates/results.tsv

Core Principle

Every improvement must connect to this chain:

test case → observed failure → one prompt change → retest → measured improvement

If that chain is missing, do not claim the prompt improved.

Use when the user wants to improve, optimize, debug, test, or iterate on a prompt, agent instruction, Claude Skill, or workflow through a measured eval loop. Runs baseline tests, creates binary success checks, changes one thing at a time, retests, keeps/reverts changes, and returns an optimized final prompt or skill.

51 stars

development

Updated Apr 19, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add tkersey/dotfiles karpathy-loop

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanner | Description | Status | Percentage
Trivy | Container and dependency vulnerability scanner | Clean | 95%
Semgrep | Static code analysis for vulnerabilities | Clean | 95%
mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95%
Snyk (dep) | Open source security scanning | Skipped | 50%
Socket.dev | Supply chain security analysis | Skipped | 50%
VirusTotal | Multi-engine malware detection | Skipped | 50%
CrowdStrike | Advanced threat intelligence | Skipped | 50%
OSV-Scanner | Open Source Vulnerability database check | Skipped | 50%
OWASP Dep-Check | | Skipped | 50%

Last scanned: Apr 19, 2026, 2:06 AM · 27.7s · 11 files scanned

Install manually

# Clone the repo
git clone https://github.com/tkersey/dotfiles.git

# Copy into Claude Code skills folder (global), creating it first if needed
mkdir -p ~/.claude/skills
cp -r dotfiles/codex/skills/karpathy-loop ~/.claude/skills/

Repository

tkersey/dotfiles

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT