skills/bughunt/SKILL.md
Continuous speculative bug-hunting — audit codebase against project invariants and risk patterns via parallel agents, form hypothesis, prove with failing test, fix at root cause, hand off to /ship, run /retro, then loop. Use when invoked as `/bughunt`. Loops by default; pass `--once` for a single iteration. Pass `--auto` to merge automatically once CI is green; default is to wait for the user to merge in the GitHub UI. Project-agnostic; reads risk dimensions from CLAUDE.md key-invariants and project memory.
npx skillsauth add eumemic/dev-skills bughuntInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
You're driving an autonomous loop that scans the codebase for risk patterns, forms specific bug hypotheses, proves them with red tests, fixes at root cause, ships, retros, then either merges or waits.
This skill is a thin specialization over the shared loop-driver framework. Read dev-skills:loop-driver for the shared phases (branching, /ship handoff, /retro, merge, idle escalation, flag passthrough). This file owns:
Everything else — flags, branching, /ship invocation, /retro pass, merge handling, idle escalation — lives in loop-driver.
(See loop-driver for the shared invariants. These are additional to those.)
CLAUDE.md (especially "Key invariants" / "How to approach changes") and project memory before scoping risk patterns. Project-stated invariants are usually the highest-yield audit dimensions because the project has already named them.Dispatch multiple agents in parallel (single message, multiple Agent tool calls), each scanning a distinct risk dimension. Each agent returns ranked candidate hypotheses; you synthesize.
Read CLAUDE.md for "Key invariants", "Invariants", "Correctness rules", or similar. Read project memory for feedback_* and project_* files mentioning past bugs (e.g., "phase 2 lessons", "incident postmortem"). Past bugs cluster — the same shape often recurs in adjacent code. Pass these excerpts into each agent's prompt.
The standard set (pick what fits the codebase):
gather() losing partial results on one failure.try without finally; tasks created without registration in the cleanup registry.Any boundaries — places where mypy infers Any (dict accesses on JSON, raw fetches, untyped third-party returns) and downstream code assumes a specific shape.except Exception: blocks that don't re-raise; try/except around code where a failure should propagate; broad excepts in projects with a "fail hard" stance.now() called inside a transaction returning a different value than elsewhere.CLAUDE.md invariants and feedback_* memory excerpts.(reachability × severity × provability) ÷ fix-scope scores, file:line references, and the test sketch per finding.Use subagent_type: general-purpose. Run all agents parallel via a single message with multiple Agent tool calls. If the loop was invoked with --model=<value> (see loop-driver flags), include model: "<value>" on each Agent call so every audit subagent runs on the chosen Claude generation; otherwise omit model: and let them inherit the session's model.
Collect ranked lists. Cross-agent corroboration amplifies score. Provability is a hard filter at this stage — anything below ~0.6 doesn't make the synthesis cut. Speculation without a path to a red test is wasted iteration.
If the audit produces no hypothesis with provability ≥ 0.6, fall through to loop-driver Phase 8's empty-iteration branch.
Apply the loop-driver's quality-gate principle. For bughunt, the bar is:
A candidate clears the gate only if all are true:
If a candidate is borderline ("input feels reachable but I'd need to confirm"), present 2–3 candidates to the user via AskUserQuestion with reachability evidence for each.
If no candidate clears the gate: report what was found and why each was rejected, then proceed to loop-driver Phase 8 as if the iteration were empty. Don't ship a speculative fix to a theoretically-reachable but practically-impossible bug.
Write a test that reaches the suspect code path with the input that triggers the hypothesis, and asserts the correct outcome. Run it. It must fail.
Confirm the failure mode is the predicted one. A test that fails for the wrong reason is worse than no test:
ImportError → test is broken; fix the test.KeyError upstream of the bug → input shape is wrong; adjust until it reaches the suspect line.If the test passes — the bug doesn't exist (or the hypothesis was wrong). Discard the candidate. Skip silently and pick the next one from the synthesis. Don't write a tracking issue, don't surface to the user.
If you've burned through three candidates this iteration without a red test, the audit is mis-calibrated. Surface via AskUserQuestion: "Three candidates this iteration didn't produce red tests — keep going at deeper audit, switch dimensions, or stop?"
With the test red for the right reason, find the smallest fix that flips it green without rewriting unrelated logic.
/kaizen candidates.Re-run the failing test. Confirm it passes. Run the full test suite to confirm no other tests went red.
If the fix breaks other tests, investigate before adjusting them. A test relying on the buggy behavior is itself a bug; fix it properly.
/ship defaultsWhen invoking dev-skills:ship (loop-driver Phase 4):
--commit-type=fix (almost always; refactor only if the fix is structural with no observable behavior delta).--issue=<N> if the bug was already filed (search first: gh issue list --search "<key symptom>").(Additive to loop-driver's shared boundaries.)
AskUserQuestion when (in addition to loop-driver's shared triggers):
/ship. (Possible deeper root cause; revisit Phase 3 fix step.)When in doubt, ask. A wrong fix is worse than a missed bug.
testing
This skill should be used when the user asks to "test this feature", "verify this works", "run tests", "check if this is working", "test thoroughly", or mentions testing a specific feature or fix.
tools
Drive a GitHub `shovel-ready` issue queue end-to-end — pick highest-ROI, TDD, /ship, /retro, merge; when empty, audit closed work and unlabeled candidates to refill it; otherwise wait for new labels. Use when the user invokes `/shovel-ready`, asks to "work the shovel-ready queue", "drain shovel-ready", "drive issues to merge", "autopilot on labeled issues", "issue-label queue", or sets up an autonomous-development loop on a GitHub-labeled issue tracker. GitHub-specific (uses `gh` CLI); skip for GitLab/Jira.
development
Drive a code change from working tree to a green PR — runs project checks, /simplify, code-review subagent, commits, opens PR, monitors CI. Stops when CI is green; caller handles merge. Use when invoked as `/ship`, or when another skill (e.g., `/shovel-ready`, `/kaizen`, `/bughunt`) hands off a working tree of changes to be shipped. Project-agnostic; reads check commands from CLAUDE.md.
development
Reflect on a slice of the current session (a single iteration, the whole session, or a natural pause point) to identify durable, codifiable learnings about workflow OR dev infrastructure — and ship them only if they clear a quality bar. Most invocations produce nothing; that's the point. Use when the user asks to "run a retro", "do a retrospective", "reflect on the session", "what did you learn", "improve skills based on this session", "codify learnings"; when invoked from a loop driver (`/shovel-ready`, `/kaizen`, `/bughunt`) with `--scope=iteration`; or **proactively at pause points** when the agent is waiting on async work (CI, autodev runs, monitor events) and has cycles to think meta.