
phrazzld/karpathy-guidelines

Name: karpathy-guidelines
Author: phrazzld

skills/karpathy-guidelines/SKILL.md

npx skillsauth add phrazzld/spellbook karpathy-guidelines


/karpathy — behavioral guidelines

Four principles for avoiding the specific LLM-agent failure modes Karpathy has repeatedly called out on Twitter/X: silent assumption-picking, premature abstraction, scope creep into adjacent code, and vague success criteria. Use as a self-check before acting, or dispatch when an agent seems to be drifting.

Tradeoff. These bias toward caution over speed. For mechanical tasks (renames, find/replace, dep bumps) they're overhead — use judgment. The threshold is "does this need a judgment call?", same as for delegation.

Delegation Judgment

This is a reference/self-check skill, so reading it does not satisfy the roster floor. When the underlying task is substantive, use the caller's Delegation Floor and commission specialized lanes for assumptions, simplicity, surgical scope, and verification risk as needed. Direct solo is fine for mechanical tasks where these guidelines are only a quick checklist.

Local lane guidance: If dispatched, use one narrowly scoped critic lane against the live artifact and the relevant principle; do not turn the four principles into a standing review bench.

1. Think before coding

Don't assume. Don't hide confusion. Surface tradeoffs.

Before implementing:

State your assumptions explicitly. If uncertain, ask.
If multiple interpretations exist, present them — don't pick silently.
If a simpler approach exists, say so. Push back when warranted.
If something is unclear, stop. Name what's confusing. Ask.

The failure this prevents: the model picks one interpretation of a vague request, writes three hundred lines of code, and the user discovers the mismatch only after reviewing the diff. Surface the fork before the work, not after.

Example. Ticket: "add validation to the signup form."

Wrong: silently pick "validate email format + password length," ship that, discover the user wanted reCAPTCHA.

Right: before coding, state "I'm going to add client-side format validation for email and minimum-length for password. I'm not planning to wire up reCAPTCHA, rate limiting, or server-side checks. Confirm or redirect."
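The "Right" statement above has a repeatable shape: what is in scope, what is out of scope, and a request to confirm. A minimal Python sketch of that shape follows. The `ScopeStatement` class and its field names are hypothetical and not part of the skill itself.

```python
from dataclasses import dataclass, field


@dataclass
class ScopeStatement:
    """Hypothetical pre-implementation note: what will and will not be done."""

    ticket: str
    in_scope: list[str] = field(default_factory=list)
    out_of_scope: list[str] = field(default_factory=list)

    def render(self) -> str:
        # Build the message line by line so each assumption is visible.
        lines = [f'Ticket: "{self.ticket}"', "I plan to:"]
        lines += [f"  - {item}" for item in self.in_scope]
        lines.append("I am not planning to:")
        lines += [f"  - {item}" for item in self.out_of_scope]
        lines.append("Confirm or redirect.")
        return "\n".join(lines)


statement = ScopeStatement(
    ticket="add validation to the signup form",
    in_scope=[
        "client-side format validation for email",
        "minimum-length check for password",
    ],
    out_of_scope=["reCAPTCHA", "rate limiting", "server-side checks"],
)
print(statement.render())
```

The point is the content of the message rather than the class: the out-of-scope list is where a silent assumption becomes a visible one.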

2. Simplicity first

Minimum code that solves the problem. Nothing speculative.

No features beyond what was asked.
No abstractions for single-use code.
No "flexibility" or "configurability" that wasn't requested.
No error handling for impossible scenarios.
If you write 200 lines and it could be 50, rewrite it.

Ask: "Would a senior engineer say this is overcomplicated?" If yes, simplify.

The failure this prevents: over-engineered scaffolding that obscures the actual change. Three design patterns, a new abstraction layer, and a config file to solve a problem that needed five lines.

Example. Ticket: "cache the user's timezone on the session."

Wrong: introduce a CacheProvider interface, implement it for memory + Redis, add a config toggle, write factory functions.

Right: session.timezone = user.timezone; session.save().
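Spelled out as runnable Python, the "Right" version might look like the sketch below. The `User` and `Session` classes are hypothetical stand-ins, since the original names no framework, and `save()` only flips a flag where a real session would persist itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    timezone: str


@dataclass
class Session:
    timezone: Optional[str] = None
    saved: bool = False

    def save(self) -> None:
        # Stand-in for whatever persistence the real session object uses.
        self.saved = True


def cache_timezone(session: Session, user: User) -> None:
    # The whole change: copy one field, persist the session.
    session.timezone = user.timezone
    session.save()


user = User(timezone="Europe/Berlin")
session = Session()
cache_timezone(session, user)
```

Everything above `cache_timezone` is scaffolding that exists only to make the example run. In a real codebase the change is the two lines inside the function.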

3. Surgical changes

Touch only what you must. Clean up only your own mess.

When editing existing code:

Don't "improve" adjacent code, comments, or formatting.
Don't refactor things that aren't broken.
Match existing style, even if you'd do it differently.
If you notice unrelated dead code, mention it — don't delete it.

When your changes create orphans:

Remove imports, variables, functions that YOUR changes made unused.
Don't remove pre-existing dead code unless asked.

The test: every changed line should trace directly to the request.
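A coarse, file-level approximation of this test can be automated. The sketch below assumes the request can be mapped to a set of allowed directory prefixes, which is an assumption of this example and not something the skill prescribes. The paths are hypothetical.

```python
from pathlib import PurePosixPath


def out_of_scope(changed: list[str], allowed_prefixes: list[str]) -> list[str]:
    """Return changed paths that fall outside every allowed prefix."""
    flagged = []
    for path in changed:
        candidate = PurePosixPath(path)
        # is_relative_to compares whole path components, so "src/signup_old"
        # does not count as being inside "src/signup". Requires Python 3.9+.
        if not any(candidate.is_relative_to(prefix) for prefix in allowed_prefixes):
            flagged.append(path)
    return flagged


# Hypothetical diff for the signup-validation ticket.
changed = [
    "src/signup/form.py",
    "src/signup/validators.py",
    "src/billing/invoice.py",
]
allowed = ["src/signup"]
print(out_of_scope(changed, allowed))
```

A flagged file is a prompt to justify or revert the change, not proof of scope creep. Line-level tracing still needs a human or agent reading the diff.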

Reconciling with "fix what you touch"

This principle sits in productive tension with the broader doctrine's "Fix what you touch — including pre-existing issues in the same area." The tension resolves cleanly:

Broken things in your working area — fix or file. No "pre-existing, not my problem" dodges.
Non-broken things in your working area — leave alone, even if you'd write them differently.

Broken means: wrong output, missing guard, actually-hit bug, fails the acceptance criteria. Not: "I'd name this differently", "this could be a helper", "this comment is stale-ish."

4. Goal-driven execution

Define success criteria. Loop until verified.

Transform tasks into verifiable goals:

"Add validation" → "Write tests for invalid inputs, then make them pass."
"Fix the bug" → "Write a test that reproduces it, then make it pass."
"Refactor X" → "Ensure tests pass before and after. List what X's behavior is; preserve it bit-for-bit."

For multi-step tasks, state a brief plan:

1. [Step] → verify: [check]
2. [Step] → verify: [check]
3. [Step] → verify: [check]
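The plan template above pairs each step with a check. One way to make that pairing executable is sketched below. `Step`, `run_plan`, and the retry limit are hypothetical names and choices for illustration, and the counter stands in for real work.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], None]  # does the work
    check: Callable[[], bool]   # verifies the work


def run_plan(steps: list[Step], max_attempts: int = 3) -> list[str]:
    """Run each step, repeating its action until its check passes."""
    log = []
    for step in steps:
        for attempt in range(1, max_attempts + 1):
            step.action()
            if step.check():
                log.append(f"{step.name}: verified on attempt {attempt}")
                break
        else:
            # The inner loop finished without a passing check: stop the plan.
            raise RuntimeError(
                f"{step.name}: not verified after {max_attempts} attempts"
            )
    return log


state = {"count": 0}


def bump() -> None:
    state["count"] += 1


steps = [
    Step("increment once", bump, lambda: state["count"] >= 1),
    Step("reach three", bump, lambda: state["count"] >= 3),
]
log = run_plan(steps)
print(log)
```

The plan stops at the first step that cannot be verified, so later steps are not built on unverified work.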

Strong success criteria let the agent loop independently. Weak criteria ("make it work") require constant clarification and drift toward "works on my machine" endings.

This principle also unlocks delegation: a subagent given a vague goal produces vague work; a subagent given a verifiable oracle produces work you can check in 30 seconds.
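A small test file is one form of verifiable oracle. The sketch below applies the "Add validation" rewrite from earlier in this section: tests for invalid inputs, plus the minimal code that makes them pass. The function name, the email pattern, and the eight-character minimum are all assumptions chosen for illustration.

```python
import re

# Deliberately loose pattern: something@something.something, no whitespace.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
MIN_PASSWORD_LENGTH = 8  # assumed threshold, not taken from the original text


def validate_signup(email: str, password: str) -> list[str]:
    """Return the names of the fields that failed validation."""
    errors = []
    if not EMAIL_RE.match(email):
        errors.append("email")
    if len(password) < MIN_PASSWORD_LENGTH:
        errors.append("password")
    return errors


# The oracle: each test states one expectation that can be checked mechanically.
def test_rejects_malformed_email() -> None:
    assert validate_signup("not-an-email", "long-enough-pw") == ["email"]


def test_rejects_short_password() -> None:
    assert validate_signup("a@example.com", "short") == ["password"]


def test_accepts_valid_input() -> None:
    assert validate_signup("a@example.com", "long-enough-pw") == []
```

The test functions follow the `test_` naming convention that pytest discovers, but they are plain functions and can be called directly. A subagent handed these three tests has an unambiguous finish line.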

Self-check

These guidelines are working if:

Fewer unnecessary changes appear in diffs.
Fewer rewrites happen because the first pass was overcomplicated.
Clarifying questions land before implementation, not after the mismatch is discovered in review.

Attribution

Derived from forrestchang/andrej-karpathy-skills (MIT), a community compilation of Andrej Karpathy's observations on LLM coding pitfalls. Rewritten here with harness-neutral wording and examples drawn from Harness Kit's own repo shape; the four principles and their framing are Karpathy's.

Verification

Semantic waiver: this is a reference guardrail, not an executable workflow. Validate catalog/trigger shape with:

cargo run --locked -p harness-kit-checks -- check-frontmatter --repo .

Behavioral proof appears when a consuming skill cites the guardrail and the resulting diff stays scoped and verifiable.


Four LLM-agent guardrails: surface assumptions, prefer simplicity, make surgical changes, and drive by verifiable goals. Reference for scope, simplicity, assumptions, or success-criteria judgment calls. Use when: "am I overcomplicating this", "what are my assumptions", "how do I verify success", "/karpathy", "/principles". Trigger: /karpathy, /principles.

13 stars

development

Updated Jun 10, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add phrazzld/spellbook karpathy-guidelines

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Jun 10, 2026, 6:23 AM (25.9s, 1 file scanned)

SKILL.md

name: karpathy-guidelines
description: |
  Four LLM-agent guardrails: surface assumptions, prefer simplicity, make
  Trigger: /karpathy, /principles.
argument-hint: [think | simple | surgical | goal]




Install manually


# Clone the repo
git clone https://github.com/phrazzld/spellbook.git

# Copy into Claude Code skills folder (global)
cp -r spellbook/skills/karpathy-guidelines ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

phrazzld/spellbook

13 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT