skills/fixing-tests/SKILL.md
Use when tests themselves are broken, test quality is poor, or user wants to fix/improve tests. Triggers: 'test is broken', 'test is wrong', 'test is flaky', 'make tests pass', 'tests need updating', 'green mirage', 'tests pass but shouldn't', 'audit report findings', 'run and fix tests'. NOT for: bugs in production code caught by correct tests (use debugging).
| Input | Required | Description |
|-------|----------|-------------|
| test_output | No | Test failure output to analyze (for run_and_fix mode) |
| audit_report | No | Green mirage audit findings with patterns and YAML block |
| target_tests | No | Specific test files or functions to fix (for general_instructions mode) |
| test_command | No | Command to run tests; defaults to project standard |
Detect mode from user input, build work items accordingly.
| Mode | Detection | Action |
|------|-----------|--------|
| audit_report | Structured findings with patterns 1-10, "GREEN MIRAGE" verdicts, YAML block | Parse YAML, extract findings. Read patterns/assertion-quality-standard.md for assertion quality gate and Full Assertion Principle. |
| general_instructions | "Fix tests in X", "test_foo is broken", specific test references | Extract target tests/files |
| run_and_fix | "Run tests and fix failures", "get suite green" | Run tests, parse failures |
If unclear: ask user to clarify target.
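The detection table can be approximated with a keyword heuristic. This is a minimal sketch, not the skill's actual detection logic: `detect_mode` and its regular expressions are hypothetical, and real input will often need judgment rather than pattern matching.

```python
import re
from typing import Optional

# Hypothetical patterns derived from the "Detection" column of the table.
AUDIT_SIGNALS = re.compile(r"green mirage|pattern\s+(?:10|[1-9])\b")
RUN_SIGNALS = re.compile(r"run (?:the )?tests|get (?:the )?suite green")
TARGET_SIGNALS = re.compile(r"\btest_\w+")


def detect_mode(user_input: str) -> Optional[str]:
    """Return the input mode, or None when the user should be asked to clarify."""
    text = user_input.lower()
    if AUDIT_SIGNALS.search(text):
        return "audit_report"
    if RUN_SIGNALS.search(text):
        return "run_and_fix"
    if TARGET_SIGNALS.search(text):
        return "general_instructions"
    return None
```

Audit signals are checked first because an audit report may also name specific tests. A `None` result corresponds to the "ask user to clarify target" branch.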
interface WorkItem {
  id: string; // "finding-1", "failure-1", etc.
  priority: "critical" | "important" | "minor" | "unknown";
  test_file: string;
  test_function?: string;
  line_number?: number;
  pattern?: number; // 1-10 from green mirage
  pattern_name?: string;
  current_code?: string; // Problematic test code
  blind_spot?: string; // What broken code would pass
  suggested_fix?: string; // From audit report
  production_file?: string; // Related production code
  error_type?: "assertion" | "exception" | "timeout" | "skip";
  error_message?: string;
  expected?: string;
  actual?: string;
}
<analysis>
Before each phase, identify: inputs available, gaps in understanding, classification decisions needed (input mode, error type, production bug vs test issue).
</analysis>
Dispatch subagent (Task tool) with /fix-tests-parse command. Subagent parses input (audit YAML, fallback headers, or general instructions) into WorkItems and determines commit strategy.
Run the test suite only in run_and_fix mode; skip this step for audit_report and general_instructions modes.
pytest --tb=short 2>&1 || npm test 2>&1 || cargo test 2>&1
Parse failures into WorkItems with error_type, message, stack trace, expected/actual. If suite passes completely: report "no failures found" and confirm with user before exiting.
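As a rough illustration of the parsing step, the sketch below turns the `FAILED path::test - message` lines of pytest's short test summary into work items. The `WorkItem` dataclass is a trimmed, hypothetical Python mirror of the interface above; `parse_failures` and `classify` are illustrative names, and the classification rules are assumptions. Skips and full stack traces are not handled.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WorkItem:
    """Trimmed, hypothetical mirror of the WorkItem interface."""
    id: str
    priority: str
    test_file: str
    test_function: Optional[str] = None
    error_type: Optional[str] = None
    error_message: Optional[str] = None


# Matches summary lines such as:
#   FAILED tests/test_auth.py::test_login - AssertionError: assert 1 == 2
SUMMARY_LINE = re.compile(
    r"^(?:FAILED|ERROR) (?P<file>[^\s:]+)(?:::(?P<func>\S+))?(?: - (?P<msg>.*))?$"
)


def classify(message: str) -> str:
    """Guess error_type from the summary message (assumed heuristic)."""
    lowered = message.lower()
    if "timeout" in lowered or "timed out" in lowered:
        return "timeout"
    if lowered.startswith("assert"):
        return "assertion"
    return "exception"


def parse_failures(output: str) -> List[WorkItem]:
    """Build one WorkItem per FAILED/ERROR line in the summary."""
    items: List[WorkItem] = []
    for line in output.splitlines():
        match = SUMMARY_LINE.match(line.strip())
        if match is None:
            continue
        message = match.group("msg") or ""
        items.append(
            WorkItem(
                id=f"failure-{len(items) + 1}",
                priority="unknown",  # run output carries no priority
                test_file=match.group("file"),
                test_function=match.group("func"),
                error_type=classify(message),
                error_message=message,
            )
        )
    return items
```

An empty result from `parse_failures` corresponds to the "no failures found" case, which still needs confirmation from the user.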
Dispatch subagent (Task tool). Subagent MUST read the referenced files:
First, read these files to understand the quality requirements:
- Read the fix-tests-execute command file for the fix execution protocol
- Read patterns/assertion-quality-standard.md for the complete Assertion Strength Ladder and Full Assertion Principle
Then execute the fix protocol on these work items: [work items]
[Copy in the full Test Writer Template from skills/dispatching-parallel-agents/SKILL.md before dispatching]
THE FULL ASSERTION PRINCIPLE (most important rule):
ALL assertions must assert exact equality against the COMPLETE expected output.
This applies to ALL output -- static, dynamic, or partially dynamic.
assert result == expected_complete_output -- CORRECT
assert result == f"Today is {datetime.date.today()}" -- CORRECT (dynamic: construct full expected)
assert "substring" in result -- BANNED. ALWAYS. NO EXCEPTIONS.
assert dynamic_value in result -- BANNED. Dynamic content is no excuse for partial check.
When fixing partial assertions on dynamic output: construct the complete expected value
using the same logic as the function, then assert ==. Prefer construct-then-compare
over normalization. Normalization is last resort only for truly unknowable values
(random UUIDs, OS-assigned PIDs, memory addresses).
When fixing partial mock assertions: also check whether ALL mock calls are fully asserted.
Assert EVERY call with ALL args; verify call count. NEVER use mock.ANY -- construct
the expected argument dynamically if it is dynamic.
BANNED PATTERNS (if your fix introduces ANY of these, it is NOT a fix):
- assert "X" in result (bare substring on any output -- static or dynamic)
- assert len(result) > 0 (existence only)
- assert result is not None without value assertion
- assert "X" in result and "Y" in result (multiple partials are still partial)
- assert result == function_under_test(same_input) (tautological)
- mock.ANY in any call assertion
- assert_called() or assert_called_once() without argument verification
- Asserting only some mock calls
Every assertion must be Level 4+ on the Assertion Strength Ladder.
Replacing a Level 1 assertion with a Level 2 assertion is NOT a fix.
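A short sketch of what the rules above look like in a test file. `greeting` and `notify` are hypothetical functions under test, invented for illustration; only the assertion style is the point.

```python
import datetime
from unittest import mock


def greeting() -> str:
    """Hypothetical function under test with partially dynamic output."""
    return f"Today is {datetime.date.today()}"


def notify(client, user: str) -> None:
    """Hypothetical function under test that makes two calls on a collaborator."""
    client.send(user, subject="Daily report", body=greeting())
    client.log("sent", user=user)


def test_greeting_full_output():
    # Construct the complete expected value, then compare with ==.
    expected = f"Today is {datetime.date.today()}"
    assert greeting() == expected


def test_notify_asserts_every_call():
    client = mock.Mock()
    notify(client, "ada")
    # The dynamic argument is constructed, not replaced with mock.ANY.
    expected_body = f"Today is {datetime.date.today()}"
    # List equality checks every call, every argument, the order, and the count.
    assert client.mock_calls == [
        mock.call.send("ada", subject="Daily report", body=expected_body),
        mock.call.log("sent", user="ada"),
    ]
```

Comparing `mock_calls` against a complete list covers call count and arguments in one assertion, so no call can go unasserted.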
patterns/assertion-quality-standard.md - the Full Assertion Principle and Assertion Strength Ladder
PRODUCTION BUG DETECTED
Test: [test_function]
Expected behavior: [what test expects]
Actual behavior: [what code does]
This is not a test issue - production code has a bug.
Options:
A) Fix production bug (then test will pass)
B) Update test to match buggy behavior (not recommended)
C) Skip test, create issue for bug
Your choice: ___
Do NOT silently fix production bugs as "test fixes." </CRITICAL>
FOR priority IN [critical, important, minor]:
    FOR item IN work_items[priority]:
        Execute Phase 2
        IF stuck after 2 attempts:
            Add to stuck_items[]
            Continue to next item
## Stuck Items
### [item.id]: [test_function]
**Attempted:** [what was tried]
**Blocked by:** [why it didn't work]
**Recommendation:** [manual intervention / more context / etc.]
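The loop and the stuck-item report can be sketched in Python as follows. This is an illustration only: `process`, `render_stuck`, the `fix` callback, and the `attempted`, `blocked_by`, and `recommendation` keys are hypothetical names, and the two-attempt limit is taken from the pseudocode.

```python
from typing import Callable, Dict, List, Tuple

PRIORITY_ORDER = ["critical", "important", "minor"]
MAX_ATTEMPTS = 2


def process(
    work_items: List[Dict], fix: Callable[[Dict, int], bool]
) -> Tuple[List[Dict], List[Dict]]:
    """Try each item in priority order; collect items still failing after two attempts."""
    fixed: List[Dict] = []
    stuck: List[Dict] = []
    for priority in PRIORITY_ORDER:
        for item in [i for i in work_items if i["priority"] == priority]:
            # any() stops at the first successful attempt.
            if any(fix(item, attempt) for attempt in range(1, MAX_ATTEMPTS + 1)):
                fixed.append(item)
            else:
                stuck.append(item)  # record it and continue with the next item
    return fixed, stuck


def render_stuck(stuck: List[Dict]) -> str:
    """Render stuck items using the report template."""
    lines = ["## Stuck Items"]
    for item in stuck:
        lines.append(f"### {item['id']}: {item['test_function']}")
        lines.append(f"**Attempted:** {item.get('attempted', '')}")
        lines.append(f"**Blocked by:** {item.get('blocked_by', '')}")
        lines.append(f"**Recommendation:** {item.get('recommendation', '')}")
    return "\n".join(lines)
```

Like the pseudocode, the sketch visits only the three listed priorities, so items marked `unknown` would need a priority assigned before they are processed.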
Dispatch subagent (Task tool):
First, read these files to understand the quality requirements:
- Read patterns/assertion-quality-standard.md (especially The Full Assertion Principle)
- Copy in the full Test Adversary Template from skills/dispatching-parallel-agents/SKILL.md
ROLE: Test Adversary. Your job is to BREAK the new/modified test assertions.
## Context
- Modified test files: [list of files changed during fix phase]
- Git diff of changes: [paste or reference the diff]
- Production files under test: [paths]
## Mandatory Checks
1. IMMEDIATE REJECTION: Flag any assertion that is:
   - assert "X" in result on deterministic output (BANNED)
   - assert len(x) > 0 or assert x is not None (BANNED)
   - A fix that replaced one BANNED pattern with another (Pattern 10)
2. For each new/modified assertion:
   - Classify on Assertion Strength Ladder (must be Level 4+)
   - Determine if function under test is deterministic
   - If deterministic: only Level 5 (exact equality) is acceptable
   - Construct a plausible broken implementation that still passes
   - Verdict: KILLED or SURVIVED
3. Overall verdict:
   - Any SURVIVED or BANNED assertion: FAIL (list required re-fixes)
   - All KILLED + Level 4+: PASS
Return: Per-assertion verdicts and overall PASS/FAIL.
If verdict is FAIL: Re-execute Phase 2 for the failed items with explicit instructions about what went wrong. Do NOT skip re-review.
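The adversary's KILLED/SURVIVED check can be illustrated with a tiny example. Everything here is hypothetical (`format_user`, the broken variant, and the two assertion helpers); it only shows how a plausible broken implementation survives a partial assertion and is killed by exact equality.

```python
from typing import Callable

Formatter = Callable[[str, str], str]


def format_user(name: str, role: str) -> str:
    """Hypothetical function under test."""
    return f"{name} ({role})"


def broken_format_user(name: str, role: str) -> str:
    """Plausible broken implementation: silently drops the role."""
    return f"{name} ()"


def weak_assertion(fn: Formatter) -> bool:
    # Partial check (BANNED): passes as long as the name appears somewhere.
    return "ada" in fn("ada", "admin")


def strong_assertion(fn: Formatter) -> bool:
    # Exact equality against the complete expected output.
    return fn("ada", "admin") == "ada (admin)"


def verdict(assertion: Callable[[Formatter], bool], broken_fn: Formatter) -> str:
    """SURVIVED if the broken implementation still passes the assertion."""
    return "SURVIVED" if assertion(broken_fn) else "KILLED"
```

Here `verdict(weak_assertion, broken_format_user)` yields SURVIVED, so the fix would FAIL review, while `verdict(strong_assertion, broken_format_user)` yields KILLED.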
pytest -v # or appropriate test command
## Fix Tests Summary
### Input Mode
[audit_report / general_instructions / run_and_fix]
### Metrics
| Metric | Value |
|--------|-------|
| Total items | N |
| Fixed | X |
| Stuck | Y |
| Production bugs | Z |
### Fixes Applied
| Test | File | Issue | Fix | Commit |
|------|------|-------|-----|--------|
| test_foo | test_auth.py | Pattern 2 | Strengthened to full object match | abc123 |
### Test Suite Status
- Before: X passing, Y failing
- After: X passing, Y failing
### Stuck Items (if any)
[List with recommendations]
### Production Bugs Found (if any)
[List with recommended actions]
Fixes complete. Re-run audit-green-mirage to verify no new mirages?
A) Yes, audit fixed files
B) No, satisfied with fixes
Flaky tests: Identify non-determinism source (time, random, ordering, external state). Mock or control it. Use deterministic waits, not sleep-and-hope.
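One way to control a time-based source of non-determinism is to inject the clock. The sketch below assumes a hypothetical `is_expired` function; the `now` parameter is an illustrative design choice, not something the skill prescribes.

```python
import datetime
from typing import Callable


def is_expired(
    expires_at: datetime.datetime,
    now: Callable[[], datetime.datetime] = datetime.datetime.now,
) -> bool:
    """Hypothetical function under test; the clock is injectable."""
    return now() >= expires_at


def test_is_expired_at_boundary():
    frozen = datetime.datetime(2024, 1, 15, 12, 0, 0)

    def clock() -> datetime.datetime:
        return frozen  # the test controls time; no sleeping, no wall clock

    assert is_expired(frozen, now=clock) is True
    assert is_expired(frozen + datetime.timedelta(seconds=1), now=clock) is False
```

With the clock fixed, the boundary case is checked exactly and the result does not depend on when or how fast the test runs.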
Implementation-coupled tests: Identify the BEHAVIOR the test should verify. Rewrite to test through the public interface. Remove mocks of the unit under test's own internals; do not remove mocks of external services.
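A minimal sketch of the rewrite, assuming a hypothetical `Cart` class with an external price service. The test goes through `add` and `total` only, leaves the internal `_line_total` helper unmocked, and mocks just the external dependency.

```python
from unittest import mock


class Cart:
    """Hypothetical unit under test."""

    def __init__(self, price_service):
        self._price_service = price_service  # external service
        self._items = []

    def add(self, sku: str, quantity: int) -> None:
        self._items.append((sku, quantity))

    def total(self) -> int:
        return sum(self._line_total(sku, qty) for sku, qty in self._items)

    def _line_total(self, sku: str, qty: int) -> int:
        # Internal detail: tests should not mock or call this directly.
        return self._price_service.price_in_cents(sku) * qty


def test_total_through_public_interface():
    prices = {"apple": 50, "pear": 75}
    price_service = mock.Mock()
    price_service.price_in_cents.side_effect = lambda sku: prices[sku]

    cart = Cart(price_service)
    cart.add("apple", 2)
    cart.add("pear", 1)

    # Behavior: the total reflects price times quantity for every item.
    assert cart.total() == 175
    # The external service mock stays, and every call to it is asserted.
    assert price_service.mock_calls == [
        mock.call.price_in_cents("apple"),
        mock.call.price_in_cents("pear"),
    ]
```

Because nothing inside `Cart` is mocked, `_line_total` can be renamed or inlined without breaking the test, while a wrong total still fails it.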
Missing tests entirely: Read production code. Identify key behaviors. Write tests following existing test file patterns in the codebase. Ensure tests would catch real failures.
Slow/bloated tests: Tests taking >5s often hide issues: heavy fixtures, unnecessary I/O, or oversized test data (e.g., 1024x1024 matrix where 4x4 suffices). Separate slow tests with marks (@pytest.mark.slow, @pytest.mark.integration, etc.). Shrink test inputs to the minimum that exercises the behavior. Move real I/O to integration tier. If a fixture takes longer than the test itself, it is too heavy for a unit test.
<RULE>Before completing, ALL boxes must be checked. If ANY unchecked: STOP and fix.</RULE>
patterns/assertion-quality-standard.md)
<FINAL_EMPHASIS> Tests exist to catch bugs. Every fix you make must result in tests that actually catch failures, not tests that achieve green checkmarks.
Fix it. Prove it works. Move on. No over-engineering. No under-testing. </FINAL_EMPHASIS>