skills/mutation-testing/SKILL.md
Validates Go test suite quality through mutation testing using go-gremlins/gremlins. Mutates production code, runs the test suite against each mutant, and reports which mutants the tests fail to kill — exposing weak assertions that line coverage cannot detect. Use when evaluating test effectiveness, validating newly written tests, or improving test quality for mission-critical code (consensus, channel state, payment flows, crypto). Triggers: "mutation test", "are these tests strong", "validate test quality", "/mutation-testing".
npx skillsauth add roasbeef/claude-files mutation-testing
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Mutation testing evaluates test quality by introducing small, deliberate bugs into production code (mutants) and checking whether the test suite fails. A test that passes on a mutant did not actually verify the behavior the mutant changed.
This skill is a thin orchestrator over go-gremlins/gremlins — a maintained Go mutation testing tool. The skill provides install, run, and analysis wrappers that produce machine-readable JSON for downstream tooling (notably the test-refine skill).
A test suite can hit 100% line coverage and still be useless: tests can execute code without asserting on its results, or assert only on fields that are irrelevant to the behavior under test. Mutation testing closes this gap by checking whether the test suite distinguishes the original code from a mutant. See references/coverage-pitfalls.md (in the test-refine skill) for the broader context.
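As a minimal sketch of that gap, consider a boundary check and the mutant a conditionals-boundary mutation would produce. The function and check names here are hypothetical and not part of the skill or of gremlins. A check that only exercises a value far from the boundary passes on both versions, while a check at the boundary tells them apart.

```go
package main

import "fmt"

// isAdult is the original production code (hypothetical example).
func isAdult(age int) bool { return age >= 18 }

// isAdultMutant is what a conditionals-boundary mutation would produce:
// ">=" has been replaced with ">".
func isAdultMutant(age int) bool { return age > 18 }

// weakCheck executes the function (full line coverage) but only asserts
// on a value far from the boundary.
func weakCheck(f func(int) bool) bool { return f(30) }

// strongCheck also asserts on the boundary value and the value just below it.
func strongCheck(f func(int) bool) bool {
	return f(30) && f(18) && !f(17)
}

func main() {
	fmt.Println("weak check passes on original:", weakCheck(isAdult))
	fmt.Println("weak check passes on mutant:  ", weakCheck(isAdultMutant)) // true: the mutant lives
	fmt.Println("strong check passes on original:", strongCheck(isAdult))
	fmt.Println("strong check passes on mutant:  ", strongCheck(isAdultMutant)) // false: the mutant is killed
}
```

Both checks cover every line of isAdult, but only the strong one would show up as a KILLED mutant in a report.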
- test-forge or by hand: verify they have real assertions.
- test-refine: survivors map to weak-assertion findings.

Target efficacy (gremlins terminology: test_efficacy = killed / (killed + lived)):
| Code class | Target |
|---|---|
| Mission-critical (consensus, wallet, channel, crypto) | 90%+ |
| Core business logic | 80–90% |
| General code | 70–80% |
| Trivial/glue code | run only if cheap |
~/.claude/skills/mutation-testing/scripts/install-gremlins.sh
The script pins to a known-good version (override with GREMLINS_VERSION=...). Requires go on PATH and $(go env GOPATH)/bin on PATH.
# Default: cwd, JSON to .reviews/mutations/<slug>.json
~/.claude/skills/mutation-testing/scripts/unleash.sh
# Targeted package
~/.claude/skills/mutation-testing/scripts/unleash.sh \
--pkg ./internal/wallet \
--output .reviews/mutations/wallet.json
# With integration tests and a config file
~/.claude/skills/mutation-testing/scripts/unleash.sh \
--pkg ./internal/channel \
--integration \
--config .gremlins.yaml \
--silent
~/.claude/skills/mutation-testing/scripts/analyze-survivors.sh \
--input .reviews/mutations/wallet.json \
--output .reviews/mutations/wallet.md
Produces a markdown report with: efficacy/coverage summary, survivors ranked by file (consensus/channel/wallet paths bubble to the top), and mutator-type breakdown.
gremlins unleash --output <file> emits a single JSON document:
{
"go_module": "github.com/example/foo",
"test_efficacy": 91.11,
"mutations_coverage": 90.00,
"mutants_total": 100,
"mutants_killed": 82,
"mutants_lived": 8,
"mutants_not_viable": 2,
"mutants_not_covered": 10,
"elapsed_time": 123.456,
"files": [
{
"file_name": "wallet.go",
"mutations": [
{ "line": 42, "column": 8, "type": "CONDITIONALS_NEGATION", "status": "KILLED" }
]
}
]
}
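For downstream tooling, a document of this shape can be decoded with the standard library. The sketch below is illustrative only: the Go type names and the sample data are made up for this example, and the structs cover only the fields the sketch reads, not the full schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Mutation mirrors one entry in a file's "mutations" array.
type Mutation struct {
	Line   int    `json:"line"`
	Column int    `json:"column"`
	Type   string `json:"type"`
	Status string `json:"status"`
}

// FileResult mirrors one entry in the top-level "files" array.
type FileResult struct {
	FileName  string     `json:"file_name"`
	Mutations []Mutation `json:"mutations"`
}

// Report holds only the fields this sketch needs.
type Report struct {
	GoModule string       `json:"go_module"`
	Files    []FileResult `json:"files"`
}

// survivors returns "file:line:column TYPE" for every LIVED mutant.
func survivors(data []byte) ([]string, error) {
	var r Report
	if err := json.Unmarshal(data, &r); err != nil {
		return nil, err
	}
	var out []string
	for _, f := range r.Files {
		for _, m := range f.Mutations {
			if m.Status == "LIVED" {
				out = append(out, fmt.Sprintf("%s:%d:%d %s", f.FileName, m.Line, m.Column, m.Type))
			}
		}
	}
	return out, nil
}

// sample is made-up data in the documented shape, not real gremlins output.
const sample = `{
  "go_module": "github.com/example/foo",
  "files": [
    {
      "file_name": "wallet.go",
      "mutations": [
        {"line": 42, "column": 8, "type": "CONDITIONALS_NEGATION", "status": "KILLED"},
        {"line": 57, "column": 12, "type": "CONDITIONALS_BOUNDARY", "status": "LIVED"}
      ]
    }
  ]
}`

func main() {
	lived, err := survivors([]byte(sample))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, s := range lived {
		fmt.Println(s) // wallet.go:57:12 CONDITIONALS_BOUNDARY
	}
}
```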
Mutation status values:
| Status | Meaning | Action |
|---|---|---|
| KILLED | Test suite caught the mutation | Good — no action |
| LIVED | Tests passed despite mutation | Survivor — strengthen tests |
| NOT COVERED | Mutation in code no test exercises | Add a test for that path |
| TIMED OUT | Tests timed out — implicit kill | Investigate (might be perf bug) |
| NOT VIABLE | Mutation produced uncompilable code | Excluded from score |
| RUNNABLE | Dry-run only; would be tested | (only in --dry-run) |
Key metrics:
- test_efficacy = killed / (killed + lived): quality of assertions on covered code.
- mutations_coverage = (killed + lived) / (killed + lived + not_covered): how much code is exercised at all.

A high mutations_coverage with low test_efficacy means tests run code without verifying its behavior: the classic "100% line coverage, 0% real testing" failure mode.
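The two formulas can be worked through on a small example. This is a sketch with hypothetical counts, chosen to show the high-coverage, low-efficacy case. The Counts type and the choice to return 0 for an empty run are assumptions of this example, not gremlins behavior.

```go
package main

import "fmt"

// Counts holds per-status mutant totals (hypothetical type for this sketch).
type Counts struct {
	Killed, Lived, NotCovered int
}

// efficacy computes killed / (killed + lived) as a percentage.
// Returning 0 when nothing was covered is a choice made for this sketch.
func efficacy(c Counts) float64 {
	covered := c.Killed + c.Lived
	if covered == 0 {
		return 0
	}
	return 100 * float64(c.Killed) / float64(covered)
}

// coverage computes (killed + lived) / (killed + lived + not_covered)
// as a percentage.
func coverage(c Counts) float64 {
	covered := c.Killed + c.Lived
	total := covered + c.NotCovered
	if total == 0 {
		return 0
	}
	return 100 * float64(covered) / float64(total)
}

func main() {
	// Hypothetical run: most mutants are reached by tests, few are killed.
	c := Counts{Killed: 20, Lived: 60, NotCovered: 20}
	fmt.Printf("test_efficacy=%.2f mutations_coverage=%.2f\n", efficacy(c), coverage(c))
	// prints: test_efficacy=25.00 mutations_coverage=80.00
}
```

Here 80% of mutants sit in code the tests execute, yet only a quarter of those are detected.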
Gremlins is configured via .gremlins.yaml (or --config <path>). Mutators ship default-on for safe operators and default-off for aggressive ones.
Default-on mutators (always enabled):
- arithmetic-base: `+ - * / %`
- conditionals-boundary: `< <= > >=`
- conditionals-negation: `== !=`, boolean conditions
- increment-decrement: `++ --`
- invert-negatives: `-x` ↔ `+x`

Default-off mutators (enable for critical packages):

- invert-assignments: `+= -= *= /=` etc. swaps
- invert-bitwise: `& | ^` swaps
- invert-bwassign: `&= |= ^=` swaps
- invert-logical: `&&` ↔ `||` (security-critical: catches auth bypass mutations)
- invert-loopctrl: `break` ↔ `continue`
- remove-self-assignments: drop `x = x op y` updates

Recommended config for consensus/wallet/payment code:
silent: false
unleash:
  workers: 0                 # use all CPUs
  test-cpu: 0                # no per-test CPU pinning
  threshold:
    efficacy: 90             # fail if below 90%
    mutant-coverage: 85
mutants:
  arithmetic-base: { enabled: true }
  conditionals-boundary: { enabled: true }
  conditionals-negation: { enabled: true }
  increment-decrement: { enabled: true }
  invert-negatives: { enabled: true }
  invert-assignments: { enabled: true }
  invert-bitwise: { enabled: true }
  invert-bwassign: { enabled: true }
  invert-logical: { enabled: true }   # critical for && / || in auth
  invert-loopctrl: { enabled: true }
  remove-self-assignments: { enabled: true }
See gremlins.dev configuration docs for the full schema.
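To illustrate why invert-logical is worth enabling for authorization code, here is a sketch of the kind of change that mutator makes and a truth-table check that detects it. The canWithdraw function is a hypothetical example, and the mutant is written out by hand rather than generated by gremlins.

```go
package main

import "fmt"

// canWithdraw is a hypothetical authorization check.
func canWithdraw(authenticated, ownsAccount bool) bool {
	return authenticated && ownsAccount
}

// canWithdrawMutant is the invert-logical mutation of canWithdraw:
// "&&" has been replaced with "||".
func canWithdrawMutant(authenticated, ownsAccount bool) bool {
	return authenticated || ownsAccount
}

// truthTable lists every input combination with the expected result.
var truthTable = []struct {
	authenticated, ownsAccount, want bool
}{
	{false, false, false},
	{false, true, false},
	{true, false, false},
	{true, true, true},
}

// failures counts the table rows where f disagrees with the expected result.
func failures(f func(bool, bool) bool) int {
	n := 0
	for _, tc := range truthTable {
		if f(tc.authenticated, tc.ownsAccount) != tc.want {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println("original failures:", failures(canWithdraw))     // 0
	fmt.Println("mutant failures:  ", failures(canWithdrawMutant)) // 2
}
```

A test that only tried the all-true and all-false rows would pass on both versions, which is exactly the auth-bypass survivor this mutator is meant to surface.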
For CI, use --silent and set thresholds in config or via env vars:
gremlins unleash --silent --output mutations.json ./...
# Exit nonzero if efficacy < threshold.
The unleash.threshold.efficacy and unleash.threshold.mutant-coverage keys cause gremlins to exit nonzero when the run falls below the configured percentages — wire this into your PR check.
- test-refine: The test-refine skill consumes gremlins JSON to identify weak-assertion zones (smell S12: mutation-survivor). When invoked with --use-mutations, it calls unleash.sh and cross-references LIVED mutants with the AST smell scan.
- test-forge: After test-forge generates tests, run mutation testing to validate them. LIVED mutants are direct evidence of weak assertions in the generated tests.
- code-review: Include the test_efficacy delta in PR review. A regression of more than 5% in covered code is a strong signal of weakening test quality.
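The efficacy-delta check for PR review could look like the following sketch. It assumes the 5% figure means percentage points of test_efficacy between the base branch and the PR head, which the text does not state explicitly. The function and constant names are hypothetical.

```go
package main

import "fmt"

// regressionThreshold is the drop in test_efficacy that flags a PR.
// Assumption: measured in percentage points, not as a relative change.
const regressionThreshold = 5.0

// efficacyRegressed reports whether efficacy fell by more than the
// threshold between the base report and the head report.
func efficacyRegressed(base, head float64) bool {
	return base-head > regressionThreshold
}

func main() {
	fmt.Println(efficacyRegressed(91.0, 84.5)) // drop of 6.5 points: true
	fmt.Println(efficacyRegressed(91.0, 88.0)) // drop of 3.0 points: false
}
```

The two inputs would come from the test_efficacy field of a report generated on each branch.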
High efficacy (≥90%): Tests have strong assertions. Focus remaining work on NOT COVERED mutants (uncovered code paths).
Medium (75–90%): Tests cover main paths. Survivors usually indicate boundary or error-path gaps.
Low (<75%): Significant gaps — tests likely run code without checking outputs. Pair with test-refine to identify the specific smells.
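These bands are easy to encode when post-processing a report. The sketch below uses the thresholds stated above. The band function and its labels are hypothetical names for illustration.

```go
package main

import "fmt"

// band maps a test_efficacy percentage to the interpretation bands above:
// high at 90 and over, medium from 75 up to 90, low under 75.
func band(efficacy float64) string {
	switch {
	case efficacy >= 90:
		return "high"
	case efficacy >= 75:
		return "medium"
	default:
		return "low"
	}
}

func main() {
	for _, e := range []float64{95, 82, 60} {
		fmt.Printf("%.0f%% -> %s\n", e, band(e))
	}
}
```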
Mutator breakdown tells you the kind of weakness:
- conditionals-boundary LIVED → missing edge tests at thresholds.
- invert-logical LIVED → missing truth-table coverage for &&/||.
- arithmetic-base LIVED → tests don't verify calculation results.
- remove-self-assignments LIVED → state mutations not asserted.

Some LIVED mutants are semantically equivalent to the original, so no test could kill them.
When you identify an equivalent mutant, document it (e.g., a comment near the mutation site, or a project-level EQUIVALENT_MUTANTS.md) so reviewers don't waste time on it. Gremlins doesn't filter equivalents automatically.
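One illustrative case of an equivalent mutant, using a hypothetical clamp function that is not taken from any project: a conditionals-boundary mutation changes `<` to `<=`, but at the boundary value both branches return the same result, so the mutant behaves identically for every input.

```go
package main

import "fmt"

// clampNonNegative returns n, or 0 when n is negative (hypothetical example).
func clampNonNegative(n int) int {
	if n < 0 {
		return 0
	}
	return n
}

// clampMutant is the conditionals-boundary mutation: "<" became "<=".
// For n == 0 it returns the literal 0 instead of n, which is the same value,
// so no test can distinguish it from the original.
func clampMutant(n int) int {
	if n <= 0 {
		return 0
	}
	return n
}

// firstDifference searches [lo, hi] for an input where the two versions differ.
func firstDifference(lo, hi int) (int, bool) {
	for n := lo; n <= hi; n++ {
		if clampNonNegative(n) != clampMutant(n) {
			return n, true
		}
	}
	return 0, false
}

func main() {
	_, found := firstDifference(-1000, 1000)
	fmt.Println("distinguishing input found:", found) // false
}
```

A survivor like this is a candidate for the equivalent-mutant note rather than for a new test.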
From the upstream README: gremlins targets smallish Go modules (microservices). On very large modules, runs can take hours. Mitigations:
- --pkg ./internal/wallet. Don't pass ./... on a 500k-LOC monorepo.
- --workers to bound parallelism if memory is tight.
- --dry-run first to preview the mutation count and skip if it's too large.

- references/mutation_operators.md: gremlins mutator catalog with examples.
- references/best_practices.md: patterns for boundary, security, and state-machine testing.