skills/debugging/test-guided-bug-detector/SKILL.md
Uses failing test results as signals to guide bug search and narrow down candidate fault locations. Use when one or more tests are failing and the user wants to understand what's broken, when CI reports failures, or when triaging a batch of test failures after a change.
Install this skill globally with one command: `npx skillsauth add santosomar/general-secure-coding-agent-skills test-guided-bug-detector`. Works with Claude Code, Cursor, and Windsurf.
When tests fail, the failure set itself is a signal. One failure tells you where to look; the pattern across many failures tells you what kind of thing broke.
| Failure pattern | Most likely cause | First move |
| -------------------------------------------- | ----------------------------------------------------- | ----------------------------------------------- |
| One test fails | Localized bug in the code that test covers | Read the assertion; → bug-localization |
| Many tests fail with the same error | Shared dependency broke (fixture, helper, import) | Find the shared thing — not the individual tests |
| Many tests fail with different errors | Environment/infra (DB down, fixture not loading) | Check setup/teardown logs, not test bodies |
| All tests in one file fail | Module-level import/fixture in that file | Check the file's top-level, not the tests |
| Tests fail only in CI, not locally | Env difference: version, path, timezone, locale, parallelism | Diff CI env vs local env, not the code |
| Tests fail only when run together | Test pollution — one test mutates shared state | Bisect the test order; find the polluter |
| Same tests intermittently fail | Flake — timing, network, randomness | Do NOT chase the code — stabilize the test |
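The "bisect the test order" move for test pollution can be sketched as below. This is a minimal illustration, not a prescribed implementation: it assumes a single polluter and a deterministic victim. `victim_fails` is a hypothetical callable that runs the given tests followed by the victim and reports whether the victim failed; in practice it would shell out to the test runner. The test names and `fake_run` are invented for the demo.

```python
from typing import Callable, List, Optional, Sequence


def find_polluter(
    candidates: Sequence[str],
    victim_fails: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Bisect the tests that run before a victim test to find the polluter.

    Assumes exactly one polluter and a deterministic failure (assumption).
    """
    suspects: List[str] = list(candidates)
    if not suspects or not victim_fails(suspects):
        return None  # victim passes after all candidates: no polluter among them
    while len(suspects) > 1:
        mid = len(suspects) // 2
        first_half = suspects[:mid]
        # If the victim still fails after only the first half, the polluter is in it;
        # otherwise it must be in the second half.
        suspects = first_half if victim_fails(first_half) else suspects[mid:]
    return suspects[0]


# Hypothetical stand-in for a real test run: the victim fails whenever
# "test_sets_global_tz" ran before it.
def fake_run(prefix: Sequence[str]) -> bool:
    return "test_sets_global_tz" in prefix


order = ["test_a", "test_b", "test_sets_global_tz", "test_c", "test_d"]
polluter = find_polluter(order, fake_run)
```

Each iteration halves the suspect list, so a suite of N preceding tests needs roughly log2(N) runs instead of N.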
The classic move: code executed by failing tests but not by passing tests is suspicious.
`fail_hits / sqrt(total_fails × (fail_hits + pass_hits))`

This is mechanical but surprisingly effective. You need ≥3 failing and ≥3 passing tests for the signal to separate from noise.
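The score above has the form usually called the Ochiai coefficient in spectrum-based fault localization. A minimal sketch of computing it from per-test coverage follows. The shape of the input (a dict from test name to the set of `(file, line)` pairs it executed) is an assumption, and the file names and line numbers in the demo are hypothetical.

```python
import math
from collections import defaultdict


def suspiciousness(coverage, failed):
    """Rank lines by fail_hits / sqrt(total_fails * (fail_hits + pass_hits)).

    coverage: dict mapping test name -> set of (file, line) executed by that test.
    failed:   set of failing test names.
    """
    total_fails = len(failed)
    fail_hits = defaultdict(int)
    pass_hits = defaultdict(int)
    for test, lines in coverage.items():
        bucket = fail_hits if test in failed else pass_hits
        for line in lines:
            bucket[line] += 1

    scores = {}
    for line in set(fail_hits) | set(pass_hits):
        f, p = fail_hits[line], pass_hits[line]
        denom = math.sqrt(total_fails * (f + p))
        scores[line] = f / denom if denom else 0.0
    # Most suspicious lines first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical coverage data: three failing tests (t1-t3), three passing (t4-t6).
coverage = {
    "t1": {("billing.py", 10), ("billing.py", 42)},
    "t2": {("billing.py", 10), ("billing.py", 42)},
    "t3": {("billing.py", 42)},
    "t4": {("billing.py", 10)},
    "t5": {("billing.py", 10)},
    "t6": {("billing.py", 10), ("billing.py", 7)},
}
ranked = suspiciousness(coverage, failed={"t1", "t2", "t3"})
```

In this demo, line 42 is hit by every failing test and no passing test, so it ranks first; line 10 is hit by both groups and scores lower; line 7 is hit only by a passing test and scores zero.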
Before debugging, group failures that share a root cause. Debugging 20 failures that are secretly 1 bug is 19× wasted effort.
Cluster by, in order:
Pick the largest cluster. Fix it. Re-run. Repeat.
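A minimal sketch of the clustering loop, under stated assumptions: it groups only by a normalized error message, which is one plausible clustering key and not necessarily the full ordered criteria this skill intends. The `(test_id, error_message)` input shape and the test names are hypothetical.

```python
import re
from collections import defaultdict


def signature(error_message):
    """Normalize an error message so equivalent failures compare equal.

    Hex addresses and numbers tend to vary per test, so they are masked.
    """
    sig = re.sub(r"0x[0-9a-fA-F]+", "<addr>", error_message)
    sig = re.sub(r"\d+", "<n>", sig)
    return sig.strip()


def cluster_failures(failures):
    """Group (test_id, error_message) pairs by normalized error signature.

    Returns (signature, [test_ids]) tuples, largest cluster first.
    """
    clusters = defaultdict(list)
    for test_id, message in failures:
        clusters[signature(message)].append(test_id)
    return sorted(clusters.items(), key=lambda kv: len(kv[1]), reverse=True)


# Hypothetical failure list for illustration.
failures = [
    ("test_auth.py::test_login", "KeyError: 'tenant_id'"),
    ("test_auth.py::test_logout", "KeyError: 'tenant_id'"),
    ("test_billing.py::test_total", "AssertionError: 108 != 100"),
    ("test_api.py::test_ping", "KeyError: 'tenant_id'"),
]
clusters = cluster_failures(failures)
```

The first entry of `clusters` is the largest group, which is the one to fix before re-running.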
Input: 47 tests failing after a merge.
Triage:
- `KeyError: 'tenant_id'` → same error → one cluster
- `test_billing.py` → file-local → one cluster
- `ConnectionRefused` → infra → ignore for now

Cluster 1 (41 tests): All 41 use `@with_authenticated_user` fixture. Fixture source: creates a `User` dict. Grep the diff: `tenant_id` was added as a required field in `User.__init__` but the fixture wasn't updated.
Root cause: One line in conftest.py. 41 failures → 1 bug.
Cluster 2 (5 tests): After fixing cluster 1, re-run. 3 of the 5 now pass (they were also blocked by the fixture). 2 remain. Both assert on a dollar amount that's off by exactly the tax rate. The merge also changed tax calculation.
47 → 2 root causes.
- `conftest.py`/`setup.py` level, or the test DB didn't come up.
- `True == True` or similar tautology: The test itself is broken. Either pytest collected an accidentally-named non-test function, or someone committed an `assert True  # TODO` placeholder.

## Clusters
1. <N> failures — <shared root: exception/fixture/file>
Suspected fault: <file:line> (<how you narrowed it>)
2. ...
## Recommended order
Fix cluster <N> first (<reason: biggest / blocks others / fastest>)
## Quarantine
- <test name>: flaky, <mechanism> — do not chase