Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

phrazzld/code-review

Name: code-review
Author: phrazzld

skills/code-review/SKILL.md

npx skillsauth add phrazzld/spellbook code-review

/code-review

You are the marshal. Read the diff, dispatch diverse fresh-context reviewers, synthesize, fix blockers, loop until clean. The authoring agent never ships on its own review — that is the one hard rule.

Dispatch

1. Scope = the diff. `git diff <base>...HEAD` (default base: the repo's default branch). Classify what changed: API, UI, security surface, data model, infra, docs.
2. Learnings audit. Grep matched learnings before dispatch: `rg -n --glob '*.md' '^(title|tags|applies_when):|<module>|<failure-mode>' docs/solutions`. For each applicable match, the synthesis includes `followed|violated <learning title> <file:line> <why>`; no file:line means the verdict is not anchored.
3. Choose the risk tier. Use harnesses/shared/references/quality-system.md to decide whether the diff needs a tiny, substantive, high-stakes, or Mode B review topology. Scale to the failure cost, not to habit.
4. Fan out in parallel, decorrelated.
   - Native subagents for focused lenses (pick 2–4 that fit the diff: correctness, security, simplicity, tests); peer harness CLIs (/roster) for cross-model judgment, because a different model family has decorrelated failure modes.
   - If the harness can run a large-scale background orchestration where reviewers adversarially cross-check each other's findings before reporting, a substantive diff is a natural fit. That scale costs tokens, so routine diffs don't get it.
   - Reviewers get the diff, the acceptance oracle, and a risk lens, never the author's reasoning trail. When the delivery logged deviations, hand reviewers the deviation sites (where the plan bent is where plausible-but-wrong concentrates), but never the author's justifications for them.
   - Add harnesses/shared/references/works-critique.md when the diff touches public API, CLI, UI, performance, compatibility, migration, or operator workflow.
   - Add harnesses/shared/references/delete-first.md when the diff adds abstraction, automation, dependencies, modes, or optimization; pair it with the synced Ponytail skill (skills/.external/dietrich-ponytail/SKILL.md) when the main risk is bloat, boilerplate, or speculative engineering.
   - Add the synced Thermo-Nuclear skill (skills/.external/cursor-thermo-nuclear-code-quality-review/SKILL.md) whenever the diff changes meaningful implementation structure, grows large files, adds wrappers, or risks spaghetti branching; this is the default harsh maintainability lens, not a last resort.
   - Add harnesses/shared/references/verification-system-first.md when the diff's proof story is missing, weak, eval/benchmark-shaped, or depends on QA/manual judgment.
5. Aim reviewers at production embarrassment, not nitpicks. Tell each one what to ignore (style, naming, speculative "consider…") as explicitly as what to find.
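
The "Add ... when ..." rules in the fan-out step read like a lookup from diff traits to extra reviewer references. A minimal sketch of that lookup, assuming the marshal has already tagged the diff with trait labels: the trait names, `RULES`, and `extra_references` are hypothetical illustrations, and only the file paths come from the skill text.

```python
# Reference paths as named in the skill text.
WORKS_CRITIQUE = "harnesses/shared/references/works-critique.md"
DELETE_FIRST = "harnesses/shared/references/delete-first.md"
PONYTAIL = "skills/.external/dietrich-ponytail/SKILL.md"
THERMO_NUCLEAR = "skills/.external/cursor-thermo-nuclear-code-quality-review/SKILL.md"
VERIFICATION_FIRST = "harnesses/shared/references/verification-system-first.md"

# Hypothetical trait labels; the skill does not define a tagging vocabulary.
RULES = [
    ({"public-api", "cli", "ui", "performance", "compatibility",
      "migration", "operator-workflow"}, WORKS_CRITIQUE),
    ({"abstraction", "automation", "dependencies", "modes",
      "optimization"}, DELETE_FIRST),
    ({"structure-change", "large-file-growth", "wrappers",
      "spaghetti-branching"}, THERMO_NUCLEAR),
    ({"weak-proof", "eval-shaped", "manual-qa"}, VERIFICATION_FIRST),
]


def extra_references(traits: set[str], main_risk_is_bloat: bool = False) -> list[str]:
    """Return the add-on references whose trigger traits overlap the diff's traits."""
    refs = [ref for triggers, ref in RULES if traits & triggers]
    # Ponytail only rides along with delete-first, and only when bloat is the main risk.
    if DELETE_FIRST in refs and main_risk_is_bloat:
        refs.insert(refs.index(DELETE_FIRST) + 1, PONYTAIL)
    return refs


if __name__ == "__main__":
    for path in extra_references({"cli", "wrappers"}):
        print(path)
```

Keeping the rules as data rather than branching logic makes it cheap to see which lens fired for a given diff, which helps when a reviewer set looks heavier than the failure cost justifies.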

What reviewers hunt

Plausible-but-wrong is the failure mode of model-written code:

Stub or specification-shaped implementations that pass tests but don't work
Wrong complexity (O(n²) hiding behind a clean interface)
Tests that never invoke the changed entrypoint (adjacent green lanes)
Missing verification system: no claim, falsifier, driver, grader, evidence packet, or cadence for a substantive change
Missing invariant checks that only matter at scale or under concurrency
Unnecessary abstraction — wrappers, modes, layers that don't earn their keep
Swallowed errors, magic fallbacks, internal mocks

If the diff adds or changes an executable path (CLI, script, migration, job), someone must run it once or cite the gate that does — otherwise it's an unverified runtime path and blocks Ship. If the diff touches a visual or user-facing surface, at least one reviewer exercises it live. If the diff claims eval, benchmark, QA, or agent-behavior improvement, reviewers must inspect the driver and grader, not just the report prose.

Synthesize and verdict

Dedupe across reviewers; rank blocking (correctness, security, unverified runtime path) > important (architecture, test strength) > advisory (everything else). Blocking findings get fixed and the fix re-reviewed — full pass, not a spot-check. Max 3 fix-review iterations, then escalate to the operator with the open findings. Ship / Don't-ship is the lead's call on the reviewers' evidence; advisory findings never block. When the receipt pile is large, use julius-caveman compression for the synthesis only. Findings must stay precise; PR comments and code suggestions stay normal English unless the operator explicitly asks otherwise.
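
The synthesis rules above can be sketched as a small loop. This is a rough illustration under stated assumptions, not the skill's implementation: `Finding`, `synthesize`, `review_loop`, and the callback signatures are hypothetical, and deduplication by location plus summary is a simplification.

```python
from dataclasses import dataclass

SEVERITY_ORDER = {"blocking": 0, "important": 1, "advisory": 2}
MAX_FIX_ITERATIONS = 3  # "Max 3 fix-review iterations" from the skill text


@dataclass(frozen=True)
class Finding:
    severity: str  # "blocking", "important", or "advisory"
    location: str  # for example "src/app.py:42"
    summary: str


def synthesize(reports):
    """Merge reviewer reports, dedupe, and rank blocking > important > advisory."""
    best = {}
    for report in reports:
        for finding in report:
            key = (finding.location, finding.summary)
            kept = best.get(key)
            # When reviewers disagree on severity, keep the most severe copy.
            if kept is None or SEVERITY_ORDER[finding.severity] < SEVERITY_ORDER[kept.severity]:
                best[key] = finding
    return sorted(best.values(), key=lambda f: (SEVERITY_ORDER[f.severity], f.location))


def review_loop(run_reviewers, apply_fixes):
    """Fix blockers and re-run the full review, escalating after the iteration cap."""
    findings = synthesize(run_reviewers())
    for _ in range(MAX_FIX_ITERATIONS):
        blockers = [f for f in findings if f.severity == "blocking"]
        if not blockers:
            return "clean", findings
        apply_fixes(blockers)
        findings = synthesize(run_reviewers())  # full pass, not a spot-check
    if any(f.severity == "blocking" for f in findings):
        return "escalate", findings
    return "clean", findings
```

The loop returns "clean" rather than "ship" on purpose: the skill leaves the Ship / Don't-ship call to the lead, and advisory findings stay in the returned list without blocking.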

Gotchas

Monoculture. Same-model subagents alone are groupthink with extra steps. Substantive diffs get at least one other model family.
Reviewing the repo instead of the diff. Scope discipline keeps findings actionable.
Treating all findings equally. Severity ranking is the marshal's job; a wall of undifferentiated comments is review theater.
Skipping re-review after fixes. A fix can introduce the next bug; blockers get a fresh pass.

Dispatch-shaped code review: fan the diff out to fresh-context reviewers across diverse providers and model families, synthesize, fix blockers, re-review until clean. Use when: "review this", "code review", "is this ready to ship", "second-model review". Trigger: /code-review, /review.

13 stars

development

Updated Jul 5, 2026

npx skillsauth add phrazzld/spellbook code-review

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean, 95%
Semgrep: Static code analysis for vulnerabilities. Clean, 95%
mcp-scan (Snyk): Model Context Protocol security validation. Clean, 95%
Snyk (dep): Open source security scanning. Skipped, 50%
Socket.dev: Supply chain security analysis. Skipped, 50%
VirusTotal: Multi-engine malware detection. Skipped, 50%
CrowdStrike: Advanced threat intelligence. Skipped, 50%
OSV-Scanner: Open Source Vulnerability database check. Skipped, 50%
OWASP Dep-Check. Skipped, 50%

Last scanned: Jul 5, 2026, 5:16 AM | 54.6s | 2 files scanned

SKILL.md

name: code-review
description: |
  Dispatch-shaped code review: fan the diff out to fresh-context reviewers
  across diverse providers and model families, synthesize, fix blockers,
  re-review until clean. Use when: "review this", "code review", "is this
  ready to ship", "second-model review". Trigger: /code-review, /review.
argument-hint: [branch|diff|files]

Related Skills

phrazzld/compound

testing

Verified | Trusted | Community

Capture one compounding repo-technical learning while a solved problem is still fresh. Use when: after a bug fix, diagnosis, delivery, review, or incident reveals a reusable pattern worth adding to `docs/solutions/`. Trigger: /compound, /capture-learning, /learning.

13 | SKILL.md | Updated Jul 5, 2026

phrazzld/factory-apps

testing

Verified | Trusted | Community

Route Misty Step factory application capabilities. Use when choosing, auditing, integrating, or operating Canary, Powder, Landmark, Aesthetic, or Bitterblossom: production observability, incidents, health checks, error logging, backlog/work-card state, release intelligence, UI/UX system adoption, or supervised/unsupervised agent dispatch. Trigger: /factory-apps, /factory-stack.

13 | SKILL.md | Updated Jul 4, 2026

phrazzld/skill-eval

testing

Verified | Trusted | Community

Prove a skill beats no-skill with a falsifiable A/B eval, or retire it. Design, generate, run, and maintain a skill-specific eval: name the one claim the skill must earn, run it skill-on vs raw same-model, grade blind with objective checks first, return a keep/adapt/cut verdict. Use when: "eval this skill", "does this skill help", "prove the skill beats no skill", "write an eval for", "benchmark a skill", "is this skill worth it", "skill A/B", "skill regression test", "generate skill evals". Trigger: /skill-eval, /eval-skill, /prove-skill.

13 | SKILL.md | Updated Jul 2, 2026

phrazzld/skills/harness-engineering/templates/repo-local-skill

tools

Verified | Trusted | Community

> Template. Copy to `<target-repo>/.agents/skills/<repo>-<domain>/SKILL.md` > and fill every bracketed placeholder from the live target repo. Delete this > line and every other `> ` guidance line before committing. See > `../../references/repo-local-skill-generation.md` for the full process. --- name: <repo>-<domain> description: | [One paragraph: what this skill verifies/runs/operates for <repo>, stated in terms of the repo's real shape (service/CLI/library/etc.), not generic process. En

13 | SKILL.md | Updated Jul 2, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/phrazzld/spellbook.git

# Copy into Claude Code skills folder (global)
cp -r spellbook/skills/code-review ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

phrazzld/spellbook

13 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT