skills/testing-quality/SKILL.md
Use when bugs keep slipping through despite high test coverage, when suspecting tests are giving false confidence, before a major refactor that will depend on the existing test suite, or when coverage metrics don't match incident rates. User phrases like "do these tests actually catch bugs?", "is this suite any good?", "why didn't the tests catch this?".
Audit test suites for real effectiveness, not vanity metrics. Identify tests that provide false confidence and missing corner cases. Create Tasks for improvements.
Core principle: Tests must catch bugs, not inflate coverage metrics. Coverage measures execution, not assertion quality.
Announce at start: "I'm using gambit:testing-quality to audit these tests with SRE-level scrutiny."
MEDIUM FREEDOM — Follow analysis phases exactly. RED/YELLOW/GREEN criteria are rigid. Corner case discovery adapts to the codebase.
| Phase | Action | Output |
|-------|--------|--------|
| 1 | Inventory all test files | Test catalog |
| 2 | Read production code | Context for analysis |
| 3 | Categorize (skeptical default) | RED/YELLOW/GREEN per test |
| 4 | Self-review all classifications | Validated categories |
| 5 | Discover missing corner cases | Gap analysis |
| 6 | Prioritize by business impact | Priority matrix |
| 7 | Create Tasks for improvements | Tracked improvement plan |
Iron Law: Read production code BEFORE categorizing ANY test.
CRITICAL MINDSET: Assume tests were written by junior engineers optimizing for coverage metrics. A test is RED or YELLOW until proven GREEN.
Don't use when:
gambit:test-driven-development
Create a complete catalog of the tests to analyze. Use Glob and Grep to find all test files and count tests per module. Adapt file patterns to the language.
MANDATORY before categorizing ANY test.
For each test file:
Why: Without reading production code, you WILL miscategorize tests as GREEN when they're YELLOW or RED. Junior engineers commonly create test utilities and test THOSE instead of production code, or set up mocks that determine test outcomes.
Assume every test is RED or YELLOW until you have concrete evidence it's GREEN.
For EACH test, answer these four questions:
!= nil, testing fixtures → weak)
Tests that pass by definition or test mocks instead of production code:
See REFERENCE.md for detailed code examples of each RED pattern.
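As a rough illustration of one RED pattern (asserting on a mock instead of production code), consider the sketch below. The `apply_discount` function and its `rate_client` dependency are hypothetical stand-ins for production code, not taken from REFERENCE.md.

```python
from unittest.mock import Mock


def apply_discount(price_cents: int, rate_client) -> int:
    """Hypothetical production code: price after a percentage discount."""
    percent = rate_client.get_discount_percent()
    return price_cents - (price_cents * percent) // 100


def test_red_asserts_on_mock():
    # RED: the assertion only checks what the mock was configured to return.
    # apply_discount() is never called, so no production bug can fail this test.
    client = Mock()
    client.get_discount_percent.return_value = 20
    assert client.get_discount_percent() == 20


def test_green_asserts_on_production_behavior():
    # The mock supplies an input; the assertion checks the exact computed result.
    client = Mock()
    client.get_discount_percent.return_value = 20
    assert apply_discount(1000, client) == 800
```

Both tests use the same mock. The difference is whether the assertion depends on anything `apply_discount` does: if the subtraction were changed to an addition, only the second test would fail.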
Tests with real value but significant gaps:
!= nil or > 0 when exact values are available
See REFERENCE.md for detailed code examples of each YELLOW pattern.
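A minimal sketch of the weak-assertion YELLOW pattern and one way to strengthen it follows. `parse_port` is a hypothetical production function invented for this example.

```python
def parse_port(value: str) -> int:
    """Hypothetical production code: parse a TCP port from text."""
    port = int(value.strip())
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


def test_yellow_weak_assertion():
    # YELLOW: passes for any positive result, so a bug that returned 80
    # for every input would go unnoticed.
    assert parse_port(" 8080 ") > 0


def test_strengthened_exact_value():
    # Strengthened: pins the exact value the input should produce.
    assert parse_port(" 8080 ") == 8080


def test_strengthened_rejects_out_of_range():
    # Adds a corner case the weak test never exercised.
    try:
        parse_port("70000")
    except ValueError:
        return
    raise AssertionError("expected ValueError for out-of-range port")
```

The weak test has real value (it does call production code), which is why it is YELLOW rather than RED. It simply cannot distinguish a correct result from most incorrect ones.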
GREEN is the EXCEPTION, not the rule. A test is GREEN only if ALL four conditions are true:
!= nil)
Before marking ANY test GREEN, you MUST state:
If you cannot fill in those blanks, the test is YELLOW at best.
Before finalizing ANY categorization, verify:
For each GREEN test:
For each YELLOW test:
If you have ANY doubt about a GREEN, downgrade to YELLOW.
MANDATORY for every RED or YELLOW classification.
This forces verification that your classification is correct by explaining exactly WHY the test is problematic.
Required format:
### [Test Name] - RED/YELLOW
**Test code (file:lines):**
- Line X: `code` - [what this line does]
- Line Y: `assertion` - [what this asserts]
**Production code it claims to test (file:lines):**
- [Brief description of what production code does]
**Why RED/YELLOW:**
- [Specific reason with line references]
- [What bug could slip through despite this test passing]
If you cannot write this justification, you haven't done the analysis properly.
For each module, identify missing corner case tests across these categories:
See REFERENCE.md for the complete corner case tables with specific examples and recommended test names.
| Priority | Criteria | Action Timeline |
|----------|----------|-----------------|
| P0 - Critical | Auth, payments, data integrity | This sprint |
| P1 - High | Core business logic, user-facing | Next sprint |
| P2 - Medium | Internal tools, admin features | Backlog |
| P3 - Low | Utilities, non-critical paths | As time permits |
Create epic Task for test quality improvement, then subtasks for each action group (remove RED tests, strengthen YELLOW tests, add missing corner cases).
Each subtask must be:
Set dependencies so removal happens before additions.
See REFERENCE.md for epic and subtask templates.
Present results as a structured report. See REFERENCE.md for the complete output template.
Executive summary table:
| Metric | Count | % |
|--------|-------|---|
| Total tests analyzed | N | 100% |
| RED (remove/replace) | N | X% |
| YELLOW (strengthen) | N | X% |
| GREEN (keep) | N | X% |
| Missing corner cases | N | - |
Overall Assessment: CRITICAL / NEEDS WORK / ACCEPTABLE / GOOD
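The counts and percentages in the summary can be derived mechanically once every test has a classification. The sketch below assumes the classifications are held in a plain mapping from test name to label; the `summarize` helper is hypothetical. It deliberately does not compute the overall assessment or the missing-corner-case count, which come from the analysis itself.

```python
from collections import Counter

# Labels and their actions, as used in the executive summary rows.
ACTIONS = (("RED", "remove/replace"), ("YELLOW", "strengthen"), ("GREEN", "keep"))


def summarize(classifications: dict[str, str]) -> list[tuple[str, int, str]]:
    """Return (metric, count, percent) rows for the RED/YELLOW/GREEN summary."""
    valid = {label for label, _ in ACTIONS}
    unknown = set(classifications.values()) - valid
    if unknown:
        raise ValueError(f"unknown classification labels: {sorted(unknown)}")

    counts = Counter(classifications.values())
    total = len(classifications)
    rows = [("Total tests analyzed", total, "100%" if total else "-")]
    for label, action in ACTIONS:
        n = counts[label]
        pct = f"{round(100 * n / total)}%" if total else "-"
        rows.append((f"{label} ({action})", n, pct))
    return rows


if __name__ == "__main__":
    example = {"test_a": "RED", "test_b": "YELLOW", "test_c": "YELLOW", "test_d": "GREEN"}
    for metric, count, pct in summarize(example):
        print(f"| {metric} | {count} | {pct} |")
```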
| Excuse | Reality |
|--------|---------|
| "Test looks reasonable" | Looking reasonable ≠ catching bugs. Read production code. |
| "High coverage = good tests" | Coverage measures execution, not assertion quality |
| "Mock is necessary here" | Mock is fine, but assert on production behavior, not mock returns |
| "Test exercises the function" | Calling a function without meaningful assertions is a line hitter |
| "It would catch obvious bugs" | Name the specific bug. If you can't, it's YELLOW at best. |
| "Too many tests to justify each" | Unjustified classifications are wrong classifications |
Don't:
Do:
Analysis Quality (MANDATORY):
Per module:
Task Integration:
Called by:
/gambit:testing-quality
Creates:
Workflow:
gambit:testing-quality → Analyze → Create improvement Tasks
gambit:executing-plans → Implement improvements with TDD
gambit:verification → Verify improvements complete