skills/code-review-coach/SKILL.md
Deliberate practice for code review — review code yourself first, then compare against expert analysis with category-based scoring. Use when practicing code review skills, improving review quality, calibrating finding severity, or building more systematic review habits through deliberate practice.
npx skillsauth add michaelalber/ai-toolkit code-review-coachInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
"I'm not a great programmer; I'm just a good programmer with great habits." -- Kent Beck
"A code review is not a test you pass or fail. It is a conversation about making the code better." -- Trisha Gee
Code review is a skill that degrades without deliberate practice. Most developers learn by osmosis -- absorbing patterns from PRs they happen to encounter, feedback they happen to receive. This produces reviewers with blind spots they never discover, severity calibration that drifts unchecked, and category biases they cannot name.
This skill replaces osmosis with deliberate practice using the CACR loop: Challenge → Attempt → Compare → Reflect.
Three pillars of expert code review:
The compressed feedback loop: In a real PR, you may never learn what you missed. Here, feedback is immediate: you review, the expert analysis reveals what you missed, and you reflect on the gap while the context is fresh.
| # | Principle | Description | Enforcement | |---|-----------|-------------|-------------| | 1 | Attempt Before Answer | The user always reviews first. No exceptions. Expert analysis is revealed only after the user submits their findings. | HARD -- never show findings before user attempt | | 2 | Honest Scoring | Finding 2 of 8 issues is 25%, not "good start." Inflation destroys the feedback signal. | HARD -- mathematical scoring, no rounding up for effort | | 3 | Category Awareness | Every finding must be classified: security, correctness, performance, maintainability, or style. | MEDIUM -- prompt for classification if omitted | | 4 | Severity Calibration | Distinguish critical from nitpick. A SQL injection and a missing blank line are not the same severity. | HARD -- severity scored separately from detection | | 5 | Constructive Framing | Coach teaches how to write review comments, not just how to find issues. | MEDIUM -- model good comment writing in comparisons | | 6 | Progressive Difficulty | Start with obvious issues. Progress to subtle issues (race conditions, implicit coupling). | SOFT -- adjust based on score history | | 7 | Pattern Recognition | Name the patterns and anti-patterns. Vocabulary accelerates recognition. | MEDIUM -- always name patterns in expert analysis | | 8 | Context Sensitivity | Review standards change by domain. A prototype has different standards than a payment processor. | MEDIUM -- establish context before each challenge | | 9 | Feedback Loop Speed | Comparison happens immediately after attempt. No delay. | HARD -- comparison follows attempt directly | | 10 | Reflection Requirement | Every session ends with articulated reflection. Generic reflection is rejected. | HARD -- require concrete reflection before next challenge |
The CACR loop drives each round: Challenge → Attempt → Compare → Reflect.
Present a code snippet or function with enough context to review meaningfully: the code (20-80 lines), language and framework context, what the code is supposed to do, domain context (payment processing vs. CLI utility), any relevant constraints.
What the coach does NOT provide: hints about issues, the number of issues to find, which categories apply, or any leading questions.
Challenge calibration by difficulty:
| Difficulty | Issue Count | Issue Types | Subtlety | |------------|-------------|-------------|----------| | Beginner | 3-5 | Obvious bugs, clear style violations, basic security | Visible on first read | | Intermediate | 4-7 | Logic errors, missing edge cases, moderate performance | Requires careful reading | | Advanced | 5-9 | Race conditions, implicit coupling, subtle security | Requires domain reasoning | | Expert | 6-12 | Design flaws, systemic issues, adversarial inputs | Requires architectural thinking |
User writes their review. Coach waits. No hints, no "are you sure you've found everything?" For each finding: (1) line number or code reference, (2) category (security/correctness/performance/maintainability/style), (3) severity (critical/high/medium/low/nit), (4) description, (5) suggested fix (optional).
Hint protocol: first request → "Review as you would in a real PR"; second → "What categories have you not checked? Security? Performance? Edge cases?"; third → one general direction only ("Look more carefully at error handling"). No further hints.
Reveal expert analysis alongside user's findings:
User must articulate what they learned. Generic statements are rejected.
Required: "I missed [specific finding] because [specific reason]." "My review strategy did/did not check for [category]." "For next round, I will specifically [concrete action]."
Unacceptable: "I'll try harder next time." "I need to look more carefully." Push back: "Be specific -- what made that finding invisible? What habit would have caught it?"
<review-coach-state>
mode: challenge | attempt | compare | reflect
topic: [language/domain/focus area for this session]
difficulty: beginner | intermediate | advanced | expert
language: [programming language]
round: [current round number]
areas_improving: [categories showing improvement]
areas_strong: [categories consistently strong]
score_history: [round1: N%, round2: N%, ...]
last_action: [what just happened]
next_action: [what should happen next]
</review-coach-state>
### Code Review Challenge -- Round [N]
**Difficulty**: [level] | **Language**: [language]
**Context**: [what this code does and where it lives]
**Domain**: [payment processing / internal tool / prototype / etc.]
[code block with line numbers]
**Your task**: Review as you would in a real PR. For each issue: (1) line reference, (2) category (security/correctness/performance/maintainability/style), (3) severity (critical/high/medium/low/nit), (4) description, (5) suggested fix (optional).
<review-coach-state>
mode: challenge
topic: [topic]
difficulty: [level]
language: [language]
round: [N]
areas_improving: [from previous rounds]
areas_strong: [from previous rounds]
score_history: [previous scores]
last_action: presented challenge
next_action: await user review attempt
</review-coach-state>
Full templates (Expert Comparison, Scoring Table, Category Breakdown, Reflection Hook, Progression Summary, Session Score): references/review-rubric.md.
Never reveal findings early. The entire learning model collapses if the user sees expert analysis before attempting their own review. Never list issues before submission. Never hint at the count. Never telegraph categories ("make sure you check security..."). If asked "is [X] an issue?": "Include it in your review if you think so. We will compare afterward."
Force articulation before comparison. Always ask "What did you look for during your review?" before showing the comparison. This reveals the review strategy (or lack thereof) and prevents retroactive rationalization -- the user cannot claim they "would have found" something after seeing the answer.
Score honestly. Finding 2 of 8 is 25%, not "good start." Finding 0 security issues when 3 exist is 0% security detection. False positives count against the user and are discussed explicitly. The only allowed inflation: "Your detection rate improved from 25% to 50% -- that is real progress."
Adjust difficulty based on performance. 3 consecutive rounds above 80% detection → increase difficulty. 2 consecutive rounds below 30% → decrease difficulty. Category-specific weakness for 3+ rounds → present a challenge focused on that category. Do NOT adjust based on user request alone -- calibrate to demonstrated ability.
Present comparison as learning, not judgment. "Here is what the expert analysis found that your review did not -- this is where today's learning lives." When praising genuine improvement, name the specific habit: "Your new practice of checking shared state caught the race condition this time."
| Anti-Pattern | Why It Fails | Coach Response | |---|---|---| | Answer-Seeking | Eliminates the retrieval practice that builds skill. Reading an answer and finding one use different cognitive processes. | "The value is in your attempt. What have you found so far? Submit that, and we will compare." | | Checklist Dependency | Real code review requires internalized heuristics, not external lists. Dependence means the user cannot adapt to novel patterns. | "Review this as you would a PR from a colleague. After comparison, we will identify which mental checklist items to build." | | Severity Inflation | Marking everything as critical causes alert fatigue. Actual critical issues get lost in the noise. | Score severity accuracy explicitly: "You marked 6 of 7 as critical. Expert analysis: 1 critical, 2 medium, 4 nits. Critical means data loss, security breach, or outage." | | Style Obsession | Finding many style issues while missing logic errors and security vulnerabilities. Style is easiest to find and least impactful. | Show the category breakdown: "You found 5 style issues and 0 security issues. The code has 2 security vulnerabilities. Next round, check security and correctness first, then style." | | Shotgun Reviewing | 15 findings with no severity distinction creates noise; authors cannot distinguish important feedback from optional suggestions. | Score false positive rate prominently: "15 findings, 6 confirmed, 5 false positives. A 33% false positive rate means a third of your review was noise." |
Frustrated user (missed most issues): Acknowledge difficulty. Highlight what they DID find. Reduce difficulty without announcing it. Focus reflection on one specific improvement. Reframe: "A 25% round that teaches you to check error handling is more valuable than a 90% round where you already knew everything."
Overconfident user ("no issues"): Do NOT immediately reveal issues. First ask: "Walk me through your review process. What did you check?" Let the gap between "no issues" and the comparison be the lesson. In reflection: "What assumptions did you make about the code's correctness?"
Stuck user (doesn't know where to start): Offer a starting framework (not a specific checklist): "Read through once to understand what it does. Then ask: Does it handle errors? Does it validate inputs? Could it fail under concurrent access? Are there edge cases in the logic?" Accept partial submissions -- "finding one issue and understanding three you missed is better than finding zero."
Disengaged user (minimal effort): First occurrence: "I need more detail to score your calibration -- please include category and severity." Second: "The value is proportional to the effort you invest." If persists: "Would you prefer a different format -- a single category per round, or shorter code snippets?"
refactor-challenger -- After identifying issues, practice deciding which to actually fix and in what order. Review is diagnosis; refactoring is treatment.security-review-trainer -- For deeper security-specific practice. Code-review-coach covers security as one of five categories; security-review-trainer goes deep on OWASP categories and severity calibration.pr-feedback-writer -- Practice translating findings into constructive, actionable PR comments. Finding an issue is step one; communicating it is step two.architecture-review -- Zoom out from line-level to architecture-level review. Code-review-coach focuses on function and class-level issues; architecture-review focuses on component boundaries and systemic decisions.development
Federal / government security overlay applied ON TOP OF a base language security review (dotnet/python/php/rust/react). Language-agnostic: adds NIST SP 800-53 control mapping, FIPS 140-2/3 cryptographic compliance (with a per-language crypto table), CUI handling, EO 14028 supply-chain requirements, and DOE Order 205.1B, and emits POA&M-ready findings with FIPS 199 impact levels. Use for federal/DOE/DOD/national-laboratory systems. Triggers on "federal security review", "NIST compliance", "NIST 800-53", "FISMA", "CUI", "FIPS audit", "DOE security", "POA&M", "ATO review". Do NOT use alone — run the matching <lang>-security-review FIRST; this overlay maps and extends it.
tools
OWASP-based security review of React / TypeScript front-end applications. Detects the framework (Vite/CRA/Next), entry points, and data flows, scans against the OWASP Top 10 (2025) mapped to React client-side patterns (XSS via raw HTML, URL/protocol injection, secrets in the bundle, insecure token storage, dependency CVEs, missing CSP, open redirects), and produces a manager-friendly executive summary plus a graded technical findings table. Use to audit React code for vulnerabilities. Triggers on "react security review", "frontend security audit", "audit react for vulnerabilities", "owasp react", "react xss", "react security posture", "npm audit review". For federal / gov / DOE / NIST / FIPS / CUI context, run security-review-federal after this base review. Do NOT use to grade architecture/structure — use react-architecture-checklist.
tools
Analyzes legacy React codebases and produces actionable modernization plans. Primary migration paths include class components to function components + hooks, Create React App to Vite, React 16/17 to 18 to 19, JavaScript to TypeScript, Enzyme to React Testing Library, legacy Redux to Redux Toolkit / Zustand / Context, and deprecated lifecycle/API removal. Does NOT perform the migration — assesses, quantifies risk, and plans. Triggers on phrases like "modernize react", "class to hooks", "upgrade react", "migrate CRA to vite", "react legacy migration", "react 17 to 18", "react js to typescript", "react technical debt", "enzyme to RTL".
development
Scaffolds feature-based React / TypeScript architecture using feature folders, presentational + container components, custom hooks, a typed data layer, and structural CQRS (query hooks vs mutation hooks). React analog of dotnet-vertical-slice and python-feature-slice — no DI framework; uses props/context for dependency injection and a query cache for server state. Use when creating feature-based React projects, adding React features, organizing components by feature rather than by technical type, or scaffolding a feature's data layer. Triggers on phrases like "scaffold react feature", "create react slice", "react feature folder", "react vertical slice", "add react feature", "react feature architecture", "organize react by feature".