skills/team/tdd-agent/SKILL.md
Fully autonomous TDD with strict guardrails. Use when you want the AI to drive the entire RED-GREEN-REFACTOR cycle independently while maintaining TDD discipline.
npx skillsauth add michaelalber/ai-toolkit tdd-agent
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
"Make it work, make it right, make it fast — in that order." — Kent Beck
The TDD Agent operates autonomously through the complete TDD cycle. Unlike pair programming, the AI drives all phases. Stricter guardrails apply because there's no human catching mistakes in real-time.
Non-Negotiable Constraints:
Use search_knowledge (grounded-code-mcp) to ground decisions in authoritative references.
| Query | When to Call |
|-------|--------------|
| search_knowledge("TDD autonomous red green refactor cycle strict discipline") | At session start — load authoritative TDD cycle constraints before any code generation |
| search_knowledge("test-first development failing test implementation minimum") | Before each RED phase — confirms the test-first sequence |
| search_knowledge("refactoring code smells catalog extract method") | During REFACTOR phase — load smell catalog and refactoring mechanics |
| search_knowledge("Python test pytest fixtures best practices") | For Python projects — authoritative pytest patterns |
| search_knowledge("C# xUnit test patterns FluentAssertions NSubstitute") | For .NET projects — authoritative xUnit/FluentAssertions patterns |
| search_knowledge("unit test naming conventions behavior specification") | When naming tests — confirms behavioral naming standards |
Protocol: Search before each phase transition (RED→GREEN→REFACTOR). Cite the source path in phase logs.
| Property | Agent Responsibility |
|----------|---------------------|
| Isolated | Tests don't share state; verify no side effects |
| Deterministic | Same results every run; no flaky tests |
| Specific | Failures point to exact cause |
| Automated | No manual intervention required |
| Predictive | Passing tests = working code |
| Fast | Maintain quick feedback loop |
RED Protocol:
1. Identify smallest testable behavior
2. Write test for that behavior
3. RUN the test suite
4. VERIFY the new test fails
5. VERIFY failure is for the expected reason
6. Only then, proceed to GREEN
Mandatory Logging:
### RED Phase — Iteration N
**Behavior to test**: [description]
**Test written**: `test_name` in `file`
**Verification**: [actual test output showing failure]
**Failure reason**: [e.g., "NameError: Calculator not defined"]
Proceeding to GREEN phase.
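Steps 3 to 5 of the RED protocol can be sketched as a small gate that refuses to proceed unless the suite failed for the expected reason. This is a minimal sketch, not part of the skill itself: `check_red`, `run_suite`, and `RedCheck` are hypothetical names, and matching on a substring of the test output is an assumed way of confirming the failure reason.

```python
from __future__ import annotations

import subprocess
import sys
from dataclasses import dataclass


@dataclass
class RedCheck:
    """Outcome of a RED verification (hypothetical helper type)."""
    verified: bool
    reason: str


def run_suite(args: list[str]) -> tuple[int, str]:
    """Run pytest in a subprocess and return (exit code, combined output)."""
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", *args],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout + proc.stderr


def check_red(returncode: int, output: str, expected_fragment: str) -> RedCheck:
    """Decide whether a test run counts as failure proof for the RED phase."""
    # Exit code 0 means the whole suite passed, so the new test proves nothing yet.
    if returncode == 0:
        return RedCheck(False, "suite passed; the new test does not fail")
    # Assumption: the expected failure is recognizable by a substring of the output.
    if expected_fragment not in output:
        return RedCheck(False, "suite failed, but not for the expected reason")
    return RedCheck(True, "failed for the expected reason")
```

A caller would pass the result of `run_suite([...])` into `check_red` and move to GREEN only when `verified` is true, pasting the captured output into the phase log.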
GREEN Protocol:
1. Review the failing test
2. Identify minimal code to pass
3. Implement ONLY what's needed
4. RUN the test suite
5. VERIFY all tests pass
6. Only then, proceed to REFACTOR
Mandatory Logging:
### GREEN Phase — Iteration N
**Test to satisfy**: `test_name`
**Implementation strategy**: [Fake It | Obvious | Triangulation]
**Code written**: [implementation code]
**Verification**: [actual test output showing all pass]
All tests passing. Proceeding to REFACTOR phase.
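The three implementation strategies named in the log template can be contrasted on one small behavior. The sketch below is illustrative only: `FakeItCalculator`, `check_add_two_numbers`, and `check_add_other_numbers` are hypothetical names, and the checks return booleans instead of using pytest so the file runs on its own.

```python
class FakeItCalculator:
    """Fake It: return the constant that the single existing test expects."""

    def add(self, a, b):
        return 5  # satisfies add(2, 3) == 5 and nothing else


class Calculator:
    """Obvious Implementation: write the real logic when it is trivial."""

    def add(self, a, b):
        return a + b


def check_add_two_numbers(calc) -> bool:
    """The first test: the only behavior required so far."""
    return calc.add(2, 3) == 5


def check_add_other_numbers(calc) -> bool:
    """Triangulation: a second example that a hardcoded constant cannot satisfy."""
    return calc.add(10, 4) == 14
```

With only the first check in place, both classes pass, and Fake It is the smaller step. Adding the second check is what forces the generalization to `a + b`.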
REFACTOR Protocol:
1. Confirm all tests pass
2. Identify ONE improvement
3. Make the change
4. RUN the test suite
5. VERIFY all tests still pass
6. If red, REVERT immediately
7. Repeat or proceed to next RED
Mandatory Logging:
### REFACTOR Phase — Iteration N
**Starting state**: All tests passing (N tests)
**Improvement identified**: [e.g., "Extract duplicate calculation"]
**Change made**: [brief description]
**Verification**: [actual test output]
Refactoring complete. [Continue refactoring | Ready for next feature]
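The REFACTOR protocol, including the immediate revert on red, can be sketched as one guarded step. This is a sketch under stated assumptions: `refactor_step` and its three callbacks are hypothetical, and how a change is applied or reverted (for example through version control) is left to the caller.

```python
from typing import Callable


def refactor_step(
    apply_change: Callable[[], None],
    revert_change: Callable[[], None],
    suite_passes: Callable[[], bool],
) -> str:
    """Apply one refactoring and keep it only if the suite stays green."""
    # Step 1: never start refactoring from a red suite.
    if not suite_passes():
        return "blocked"
    # Steps 2-4: make exactly one change, then run the suite.
    apply_change()
    if suite_passes():
        return "kept"
    # Step 6: revert immediately rather than fixing forward.
    revert_change()
    if not suite_passes():
        # The revert did not restore green: stop and report instead of guessing.
        raise RuntimeError("suite still red after revert")
    return "reverted"
```

A result of `"reverted"` is the cue to retry with a smaller change; `"blocked"` means the session is not actually in REFACTOR.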
Run these checklists at each phase transition. Stop and correct if any item fails.
RED Self-Check:
GREEN Self-Check:
REFACTOR Self-Check:
Guardrail 1 — No Implementation Without Failure Proof: Before writing any implementation code, verify a test exists for the behavior, it was just run, output shows failure, and the failure is logged. If any item is missing, stop.
Guardrail 2 — Verify Before Claiming: Never claim tests pass or fail without running them and showing actual output. Show [actual test output] All 15 tests pass — not "the test should now pass."
Guardrail 3 — Minimal Means Minimal: During GREEN, ask: can I make this simpler? Am I adding anything the test doesn't require? Would a hardcoded value pass? If yes to any, simplify.
Guardrail 4 — Rollback on Red: If tests fail during REFACTOR, stop, revert the change immediately, verify tests pass again, then try a smaller step. Never fix a broken refactor forward.
<tdd-state>
phase: [RED | GREEN | REFACTOR]
iteration: N
feature: [description]
current_test: [test name or none]
tests_passing: [true | false]
test_count: N
failing_count: N
last_verified: [timestamp or "just now"]
</tdd-state>
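One way to keep the state block consistent between iterations is to render it from a small record instead of typing it by hand. The following is a sketch only: `TddState` and `render` are hypothetical, and the consistency rule (tests_passing is true exactly when failing_count is 0) is an assumption about how the fields relate, not something the skill states.

```python
from dataclasses import dataclass

PHASES = ("RED", "GREEN", "REFACTOR")


@dataclass
class TddState:
    phase: str
    iteration: int
    feature: str
    current_test: str
    tests_passing: bool
    test_count: int
    failing_count: int
    last_verified: str

    def __post_init__(self) -> None:
        if self.phase not in PHASES:
            raise ValueError(f"unknown phase: {self.phase}")
        # Assumed invariant: the boolean and the failure count must agree.
        if self.tests_passing != (self.failing_count == 0):
            raise ValueError("tests_passing contradicts failing_count")

    def render(self) -> str:
        """Return the state in the <tdd-state> block layout."""
        lines = [
            "<tdd-state>",
            f"phase: {self.phase}",
            f"iteration: {self.iteration}",
            f"feature: {self.feature}",
            f"current_test: {self.current_test}",
            f"tests_passing: {str(self.tests_passing).lower()}",
            f"test_count: {self.test_count}",
            f"failing_count: {self.failing_count}",
            f"last_verified: {self.last_verified}",
            "</tdd-state>",
        ]
        return "\n".join(lines)
```

For example, `TddState("GREEN", 1, "Calculator.add", "test_add_two_numbers", False, 1, 1, "just now").render()` produces a block with `phase: GREEN` and `tests_passing: false`.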
At each decision point, log options, reasoning, and choice before acting:
**Decision Point**: How to implement Calculator.add(2, 3)?
**Options**:
1. Return hardcoded 5 (Fake It)
2. Return a + b (Obvious Implementation)
**Reasoning**: Single test for addition. Obvious implementation is safe — trivial algorithm, no edge cases.
**Choice**: Obvious Implementation
RED → GREEN → REFACTOR for Calculator.add() (Python):
RED: Write test_add_two_numbers, run pytest — NameError: Calculator not defined. Log failure. Update state: phase: GREEN, iteration: 1.
GREEN: Implement class Calculator: def add(self, a, b): return a + b. Run pytest — 1 passed. Update state: phase: REFACTOR.
REFACTOR: No smells detected. Code is minimal. No refactoring needed. Update state: phase: RED, iteration: 2.
Each iteration closes with an updated <tdd-state> block and a mandatory phase log entry.
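The first iteration of the walkthrough can be reproduced in a single file. This is a minimal sketch: in a real session the test and the implementation would live in separate files and pytest would run them, whereas here the test is called directly so that the RED failure and the GREEN pass are both visible. The variable `red_failure` is a hypothetical name.

```python
def test_add_two_numbers():
    calc = Calculator()
    assert calc.add(2, 3) == 5


# RED: Calculator does not exist yet, so the test fails with a NameError.
try:
    test_add_two_numbers()
    red_failure = None
except NameError as exc:
    red_failure = str(exc)  # the failure reason to record in the phase log


# GREEN: the Obvious Implementation chosen at the decision point.
class Calculator:
    def add(self, a, b):
        return a + b


# The same test now passes; a failure here would raise AssertionError.
test_add_two_numbers()
```

REFACTOR then has nothing to change, which matches the walkthrough: the class is already minimal.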
See Guardrails Reference for detailed phase log templates. See Autonomous Protocol for extended workflow examples.
CRITICAL: Trust Nothing Without Verification — Don't assume tests pass, don't assume tests fail, don't assume code works. Run and verify everything.
CRITICAL: Be Boringly Predictable — Follow the protocol exactly. Log everything explicitly. Never skip steps. Never combine steps.
CRITICAL: Fail Loudly — If something unexpected happens: stop immediately, report the anomaly, ask for guidance. Do not work around it.
CRITICAL: Prefer Smaller Steps — When in doubt: smaller test, simpler implementation, one refactoring at a time, more iterations over fewer.
This skill is an operating mode of the canonical tdd loop, not a replacement for it.
tdd: The canonical inner loop this mode drives. Defines the two critical test properties (behavioral, structure-insensitive), the per-cycle self-check, the GREEN strategies (Fake It / Obvious / Triangulation, with per-language idioms in its references/), and the REFACTOR smell catalog (references/code-smells.md, references/refactoring-catalog.md). Load those on demand during GREEN/REFACTOR.
evaluate-tests: Run after the session to audit test quality and TDD compliance (commit-history scorecard, anti-pattern detection).
tdd-pair: The alternative mode; use when a human partner drives and the AI navigates.
Tests Won't Run: Check syntax and imports in the test file. Fix infrastructure issues first. Do not write any implementation until the test suite runs cleanly.
Wrong Test Failure: Examine the actual error message — not the expected one. Fix the test if it has bugs. Proceed only when the failure is the expected one.
Can't Make Test Pass: Re-read the test carefully. Check for typos in expectations. Verify setup and assertions. Ask for help if stuck — do not guess at the implementation.
State Confusion: Run the full test suite. If all pass: begin REFACTOR or new RED. If one fails: you are in GREEN. Reconstruct the state block from observed evidence.
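The State Confusion rule can be expressed as a function of the observed test counts. This is a sketch with assumptions: `infer_phase` is a hypothetical name, and treating an empty suite or more than one failing test as an anomaly to report is an interpretation drawn from the Fail Loudly principle, not an explicit rule in this skill.

```python
def infer_phase(test_count: int, failing_count: int) -> str:
    """Reconstruct the current phase from a fresh full-suite run."""
    if test_count == 0:
        # Assumption: an empty suite is an infrastructure problem, not a phase.
        raise RuntimeError("no tests ran; fix the test setup first")
    if failing_count == 0:
        return "REFACTOR or new RED"
    if failing_count == 1:
        return "GREEN"
    # Assumption: several failures at once fall outside the protocol.
    raise RuntimeError(f"{failing_count} tests failing; stop and report")
```

The returned value feeds the `phase` field of a rebuilt state block; either exception is a signal to stop and ask for guidance.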