skills/tdd-verify/SKILL.md
Verify that AI-generated code follows TDD discipline. Use when auditing commits for TDD discipline, checking test coverage quality, detecting TDD anti-patterns, or generating compliance scorecards. Do NOT use when reviewing legacy code written before TDD was applied without first establishing a baseline. Do NOT use when you have not reviewed the project history.
npx skillsauth add michaelalber/ai-toolkit tdd-verify
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
"Any fool can write code that a computer can understand. Good programmers write code that humans can understand." — Martin Fowler
TDD verification ensures that the discipline was followed, not just that tests exist. Tests written after implementation feel different, test different things, and provide different value than tests written first.
The Gatekeeper's Role:
Use search_knowledge (grounded-code-mcp) to ground decisions in authoritative references.
| Query | When to Call |
|-------|--------------|
| search_knowledge("TDD anti-patterns test after implementation coverage theater") | During verification — authoritative anti-pattern catalog to check against |
| search_knowledge("test quality desiderata behavioral isolated deterministic") | When scoring test quality — confirms the 12 properties and their verification questions |
| search_knowledge("code coverage mutation testing quality metrics") | When assessing coverage quality vs. coverage theater |
| search_knowledge("TDD discipline red green refactor commit order") | When auditing commit history — confirms expected TDD commit sequence |
Protocol: Search at verification start to load the authoritative compliance criteria. Cite the source path in your scorecard.
Use these properties to evaluate test quality:
| Property | Verification Question |
|----------|----------------------|
| Isolated | Can each test run independently? |
| Composable | Can tests be run in any subset? |
| Deterministic | Do tests always give the same result? |
| Specific | Do failures point to the exact cause? |
| Behavioral | Do tests verify behavior, not implementation? |
| Structure-insensitive | Would refactoring break these tests? |
| Fast | Is feedback loop quick enough? |
| Writable | Are tests easy to create? |
| Readable | Can you understand intent quickly? |
| Automated | Do tests run without intervention? |
| Predictive | Does passing mean it works? |
| Inspiring | Do tests give confidence to change? |
Examine git history to verify test-first development:
# Check whether tests were committed before implementation
# (--reverse lists the oldest commit first, so the output reads chronologically)
git log --reverse --oneline --name-only | less
# Expected pattern:
# abc1234 Add test for user login
# tests/test_auth.py
# def5678 Implement user login
# src/auth.py
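The manual check above can be partly automated. The following is a minimal sketch, not part of the skill itself: it parses text in the shape printed by `git log --oneline --name-only` and labels each commit as Test, Impl, or Both. The names `parse_log`, `is_test_path`, and `commit_type` are hypothetical, and both the header regex and the path rules are heuristics that assume a `tests/` directory or `test_*.py` / `*_test.py` file names.

```python
import re
from dataclasses import dataclass, field

# Heuristic: a header line starts with a 7-40 character hex commit hash.
HEADER = re.compile(r"^([0-9a-f]{7,40}) (.*)$")


@dataclass
class Commit:
    sha: str
    subject: str
    files: list[str] = field(default_factory=list)


def parse_log(text: str) -> list[Commit]:
    """Group file paths under the commit header that precedes them."""
    commits: list[Commit] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = HEADER.match(line)
        if match:
            commits.append(Commit(match.group(1), match.group(2)))
        elif commits:
            commits[-1].files.append(line)
    return commits


def is_test_path(path: str) -> bool:
    # Assumption: adjust these rules to the project's real layout.
    name = path.rsplit("/", 1)[-1]
    return (
        path.startswith("tests/")
        or "/tests/" in path
        or name.startswith("test_")
        or name.endswith("_test.py")
    )


def commit_type(commit: Commit) -> str:
    has_test = any(is_test_path(f) for f in commit.files)
    has_impl = any(not is_test_path(f) for f in commit.files)
    if has_test and has_impl:
        return "Both"
    if has_test:
        return "Test"
    return "Impl" if has_impl else "Empty"


SAMPLE = """\
abc1234 Add test for user login
tests/test_auth.py
def5678 Implement user login
src/auth.py
"""

for commit in parse_log(SAMPLE):
    print(commit.sha, commit_type(commit))
```

By default `git log` prints the newest commit first, so pass `--reverse` or reverse the parsed list before reasoning about order.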
Look beyond coverage percentage to coverage quality:
Coverage Theater Signs:
- 100% coverage with no assertions
- Tests that only call methods
- Happy path only, no edge cases
- Implementation details tested
Examine individual tests for TDD characteristics:
Test Quality Checklist:
□ Test name describes behavior
□ Arrange-Act-Assert structure clear
□ Single concept per test
□ Assertions are specific
□ No implementation details exposed
□ Failure message would be helpful
Generate a TDD compliance score:
## TDD Compliance Scorecard
| Category | Score | Notes |
|----------|-------|-------|
| Test-First Evidence | 3/5 | Some tests added with impl |
| Behavioral Tests | 4/5 | Minor impl coupling |
| Minimal Implementation | 5/5 | No over-engineering |
| Refactoring Discipline | 4/5 | Most refactors preserved green |
| Coverage Quality | 3/5 | Some coverage theater |
**Overall Score**: 19/25 (76%)
**Rating**: Good (with improvement areas)
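The arithmetic behind the overall score can be sketched as follows. The category scores are the ones from the example scorecard; the rating thresholds are hypothetical, since the scorecard only shows that 76% is rated "Good", and they should be replaced with whatever bands the team agrees on.

```python
MAX_PER_CATEGORY = 5

# Scores taken from the example scorecard.
SCORES = {
    "Test-First Evidence": 3,
    "Behavioral Tests": 4,
    "Minimal Implementation": 5,
    "Refactoring Discipline": 4,
    "Coverage Quality": 3,
}


def overall(scores: dict[str, int]) -> tuple[int, int, int]:
    """Return (total, maximum, rounded percentage)."""
    total = sum(scores.values())
    maximum = MAX_PER_CATEGORY * len(scores)
    return total, maximum, round(100 * total / maximum)


def rating(percent: int) -> str:
    # Hypothetical bands, chosen only so that 76% maps to "Good".
    if percent >= 90:
        return "Excellent"
    if percent >= 70:
        return "Good"
    if percent >= 50:
        return "Needs improvement"
    return "Poor"


total, maximum, percent = overall(SCORES)
print(f"Overall Score: {total}/{maximum} ({percent}%)")
print(f"Rating: {rating(percent)}")
```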
Collect information for verification:
Evidence Collection:
1. Git commit history (chronological)
2. Test file contents
3. Implementation file contents
4. Coverage report (if available)
5. Test execution results
Check if tests preceded implementation:
Commit Order Analysis:
| Commit | Type | TDD Compliant? |
|--------|------|----------------|
| abc123 | Test | N/A (first) |
| def456 | Impl | Yes (test first) |
| ghi789 | Both | No (should be separate) |
| jkl012 | Impl | No (no preceding test) |
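The rules implied by the table can be sketched as a small state machine. This is one possible reading, not a definitive rule set: an Impl commit counts as compliant only when a Test commit came directly before it, and every Test-only commit is reported as N/A. The function name `assess` is hypothetical.

```python
def assess(commits: list[tuple[str, str]]) -> list[tuple[str, str, str]]:
    """Take (sha, type) pairs in chronological order and add a verdict."""
    results = []
    test_pending = False  # True when the previous commit added only tests
    for sha, kind in commits:
        if kind == "Test":
            verdict = "N/A (test only)"
            test_pending = True
        elif kind == "Impl":
            verdict = "Yes (test first)" if test_pending else "No (no preceding test)"
            test_pending = False
        else:  # "Both": test and implementation in one commit
            verdict = "No (should be separate)"
            test_pending = False
        results.append((sha, kind, verdict))
    return results


HISTORY = [
    ("abc123", "Test"),
    ("def456", "Impl"),
    ("ghi789", "Both"),
    ("jkl012", "Impl"),
]

for sha, kind, verdict in assess(HISTORY):
    print(f"| {sha} | {kind} | {verdict} |")
```

A refactoring commit that touches only implementation files would be flagged by this sketch, so treat each "No" as a prompt for manual review, not as a final verdict.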
Evaluate each test against quality criteria:
Test Quality Analysis:
**test_user_can_login**
- Behavioral: Yes (tests login outcome)
- Specific: Yes (checks user object)
- Isolated: Yes (no shared state)
- Implementation-coupled: No
- Quality: Good
**test_database_insert_called**
- Behavioral: No (tests internal call)
- Specific: Yes
- Isolated: Yes
- Implementation-coupled: Yes (mock verification)
- Quality: Poor (should test outcome)
Look beyond the percentage:
Coverage Quality Check:
**High-quality coverage indicators:**
- Tests fail when behavior breaks
- Edge cases are covered
- Error paths are tested
- Assertions verify outcomes
**Coverage theater indicators:**
- Tests pass even with broken behavior
- No assertions (just coverage)
- Only exercises code paths
- Happy path only
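The first indicator in each list can be demonstrated by breaking the code on purpose and seeing which checks notice. The sketch below is a hand-rolled illustration of that idea under simple assumptions; `apply_discount` and both checks are hypothetical and stand in for real project code and real tests.

```python
from typing import Callable

PriceFn = Callable[[int, int], int]


def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical function under test: price after a percentage discount."""
    return price_cents * (100 - percent) // 100


def apply_discount_mutant(price_cents: int, percent: int) -> int:
    """Deliberately broken variant: adds the percentage instead of subtracting."""
    return price_cents * (100 + percent) // 100


def theater_check(fn: PriceFn) -> bool:
    """Executes the code, which counts as coverage, but verifies nothing."""
    fn(10_000, 10)
    return True


def behavioral_check(fn: PriceFn) -> bool:
    """Verifies outcomes, including the zero-discount edge case."""
    return fn(10_000, 10) == 9_000 and fn(10_000, 0) == 10_000


for name, check in [("theater", theater_check), ("behavioral", behavioral_check)]:
    passes_original = check(apply_discount)
    passes_mutant = check(apply_discount_mutant)
    detected = passes_original and not passes_mutant
    print(f"{name}: {'detects the break' if detected else 'misses the break'}")
```

Both checks execute the same lines, so a coverage percentage alone cannot tell them apart.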
Compile findings into actionable report:
## TDD Verification Report
### Summary
[Overall assessment]
### Strengths
- [What was done well]
### Improvement Areas
- [What could be better]
### Recommendations
1. [Specific action item]
2. [Specific action item]
### Detailed Findings
[Section for each category]
Signs:
Detection:
Look for:
- Commit with both test and impl
- Test that seems to "document" rather than "specify"
- No failing test commit before implementation
Signs:
Detection:
# Suspicious: Testing internal calls
def test_save_user(self):
    mock_db = Mock()
    service = UserService(mock_db)
    service.save(user)
    mock_db.execute.assert_called_with(
        "INSERT INTO users ...",
        (user.id, user.name)
    )
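For contrast, an outcome-focused version of the same test might look like the sketch below. It assumes the service can be given an in-memory repository and exposes some way to read a user back. `User`, `InMemoryUserRepository`, and this `UserService` shape are hypothetical stand-ins, not the API of the code under audit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    id: int
    name: str


class InMemoryUserRepository:
    """Hypothetical fake that replaces the database in tests."""

    def __init__(self) -> None:
        self._rows: dict[int, User] = {}

    def add(self, user: User) -> None:
        self._rows[user.id] = user

    def get(self, user_id: int) -> Optional[User]:
        return self._rows.get(user_id)


class UserService:
    """Hypothetical service shape, for illustration only."""

    def __init__(self, repository: InMemoryUserRepository) -> None:
        self._repository = repository

    def save(self, user: User) -> None:
        self._repository.add(user)

    def find(self, user_id: int) -> Optional[User]:
        return self._repository.get(user_id)


def test_saved_user_can_be_found() -> None:
    service = UserService(InMemoryUserRepository())
    user = User(id=1, name="alice")

    service.save(user)

    # Asserts on the observable outcome, not on which SQL statement ran.
    assert service.find(1) == user


test_saved_user_can_be_found()
```

This test keeps passing if the storage layer changes its SQL or its call sequence, which is the structure-insensitivity the mock-verification version lacks.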
Signs:
Detection:
Test inventory check:
- test_add_positive_numbers ✓
- test_add_zero ✗ missing
- test_add_negative ✗ missing
- test_add_overflow ✗ missing
Signs:
Detection:
# Suspicious: No assertions
def test_process_data(self):
    processor = DataProcessor()
    processor.process(data)
    # No assertion!
Signs:
Detection:
# Suspicious: Testing internal structure
def test_user_has_internal_state(self):
    user = User("alice")
    assert user._internal_cache is not None
    assert user._validate_called == True
Signs:
Detection:
# Suspicious: Copy-paste tests
def test_add_1_and_2(self):
    assert add(1, 2) == 3

def test_add_3_and_4(self):
    assert add(3, 4) == 7

def test_add_5_and_6(self):
    assert add(5, 6) == 11
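One way to replace copy-paste tests like these is a single table-driven test whose rows each cover a distinct behavior. This is a sketch only: `add` is defined here purely so the example runs, and the case list is an assumption about which behaviors matter.

```python
def add(a: int, b: int) -> int:
    """Stand-in for the function under test."""
    return a + b


# Each row names the behavior it covers, so a failure points at a specific case.
CASES = [
    ("positive numbers", 1, 2, 3),
    ("zero", 0, 5, 5),
    ("negative numbers", -2, -3, -5),
    ("mixed signs", -2, 3, 1),
    ("large values", 2**63, 1, 2**63 + 1),
]


def test_add_cases() -> None:
    for label, a, b, expected in CASES:
        assert add(a, b) == expected, f"{label}: add({a}, {b}) should be {expected}"


test_add_cases()
```

With pytest, the same table can be passed to `pytest.mark.parametrize` so that each row is reported as a separate test.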
## TDD Quick Check
**Repository/Branch**: [info]
**Period**: [date range]
**Commits Analyzed**: N
### Traffic Light Summary
- Test-First: [GREEN | YELLOW | RED]
- Test Quality: [GREEN | YELLOW | RED]
- Coverage Quality: [GREEN | YELLOW | RED]
### Key Findings
1. [Most important finding]
2. [Second finding]
3. [Third finding]
### Recommended Actions
- [Immediate action]
- [Short-term improvement]
## TDD Verification Report
**Date**: [date]
**Scope**: [what was analyzed]
**Auditor**: Claude Code (tdd-verify)
---
### Executive Summary
[1-2 paragraph overall assessment]
---
### Scoring
| Category | Score | Status |
|----------|-------|--------|
| Test-First Development | X/5 | [status] |
| Behavioral Testing | X/5 | [status] |
| Minimal Implementation | X/5 | [status] |
| Refactoring Discipline | X/5 | [status] |
| Coverage Quality | X/5 | [status] |
**Overall**: X/25 ([percentage]%)
---
### Category Details
#### Test-First Development
[Analysis and evidence]
#### Behavioral Testing
[Analysis and evidence]
[... etc for each category ...]
---
### Anti-Patterns Detected
| Anti-Pattern | Occurrences | Severity | Examples |
|--------------|-------------|----------|----------|
| [pattern] | N | [H/M/L] | [file:line] |
---
### Recommendations
**Immediate (This Sprint)**
1. [Action item]
**Short-term (This Month)**
1. [Action item]
**Long-term (Ongoing)**
1. [Action item]
---
### Appendix: Detailed Findings
[Supporting details, code snippets, etc.]
Maintain state across conversation turns during a verification session:
<tdd-verify-state>
scope: [repo path | branch | commit range | "pending"]
commits_analyzed: [N | "none yet"]
current_category: [test-first | behavioral | minimal-impl | refactor | coverage | "complete"]
score_so_far: [e.g., "12/15, 3 categories complete"]
anti_patterns_found: [comma-separated list or "none"]
findings_pending_review: [N items]
last_action: [what was just done]
next_action: [what should happen next]
</tdd-verify-state>
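If a script needs to read this state block back, for example to resume a session, a minimal parser could look like the sketch below. It assumes one `key: value` pair per line, as in the template; the function name `parse_state` and the sample values are hypothetical.

```python
import re

STATE_BLOCK = re.compile(
    r"<tdd-verify-state>\s*(.*?)\s*</tdd-verify-state>", re.DOTALL
)


def parse_state(text: str) -> dict[str, str]:
    """Extract key/value pairs from the first state block found in text."""
    match = STATE_BLOCK.search(text)
    if match is None:
        return {}
    state: dict[str, str] = {}
    for line in match.group(1).splitlines():
        # Split on the first colon only, so values may contain colons.
        key, separator, value = line.partition(":")
        if separator:
            state[key.strip()] = value.strip()
    return state


SAMPLE = """\
<tdd-verify-state>
scope: feature/login
commits_analyzed: 4
current_category: behavioral
score_so_far: 3/5, 1 category complete
anti_patterns_found: none
findings_pending_review: 2 items
last_action: audited commit order
next_action: score behavioral tests
</tdd-verify-state>
"""

state = parse_state(SAMPLE)
print(state["current_category"], "|", state["next_action"])
```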
Never claim TDD compliance without evidence:
Verification is for improvement, not punishment:
Consider the situation:
Separate intentional choices from mistakes:
- tdd-cycle: Use this skill to audit a session orchestrated by tdd-cycle; commit history from a full cycle provides the richest evidence
- tdd-agent: Run tdd-verify after an autonomous tdd-agent session to confirm the agent followed TDD discipline
- tdd-pair: Run tdd-verify at the end of a pair session to score compliance and surface improvement areas
- tdd-refactor: If tdd-verify finds implementation-coupled tests, invoke tdd-refactor to decouple them safely
- tdd-implementer: If tdd-verify finds over-engineering or over-mocking, trace findings back to the GREEN phase for root-cause
See reference files for verification tools: