skills/test-refine/SKILL.md
Refines an existing Go test suite — removes trivial/duplicate tests, strengthens weak assertions, reshapes tests around invariants, and closes branch-coverage gaps. Uses code-guided coverage and (when available) gremlins mutation-testing survivor data rather than relying on line coverage alone. Use when test quality is uneven, after a test-generation pass, before opening a PR, or as a quality gate on critical paths (consensus, channel state, payment flows). Triggers: "refine these tests", "tests are bloated", "tighten assertions", "remove trivial tests", "audit test quality", "/test-refine".
A test suite can be voluminous and yet weak: tests that overlap, assert nothing meaningful, run code without checking outputs, or miss the branches that actually matter. Line coverage doesn't catch this — a 100% line-covered suite can have zero real assertions. This skill operates on an existing test suite and refines it.
This is orthogonal to test generation skills (test-forge, property-based-testing):
- `test-forge` generates new tests.
- `property-based-testing` designs PBT for new or existing functions.
- `mutation-testing` validates whether existing tests catch behavior changes.
- `test-refine` (this skill) consumes the above signals and changes the test suite: strengthening assertions, removing dead weight, reshaping for invariants, closing branch-coverage gaps.

- After a `test-forge` pass: sharpen the generated tests.

The skill runs read-only triage first, produces a markdown report, then applies fixes only after the user reviews and approves.
# Default: package in cwd. JSON intermediates land in /tmp; markdown in
# .reviews/test-refinement/<date>-<scope>.md.
~/.claude/skills/test-refine/scripts/triage.sh
# Pin scope to a single file
~/.claude/skills/test-refine/scripts/triage.sh --scope file --target ./internal/wallet/wallet_test.go
# Diff-scoped (changed test files in current branch vs auto-detected
# default branch — origin/HEAD, then main, then master).
~/.claude/skills/test-refine/scripts/triage.sh --scope diff
# Whole repo (slow)
~/.claude/skills/test-refine/scripts/triage.sh --scope repo
# With mutation testing for the strongest signal. Diff scope fans out
# across the unique set of affected packages (capped by
# MUTATION_FANOUT_CAP=5 by default; override via env var). Repo scope
# hard-fails — narrow before re-trying.
~/.claude/skills/test-refine/scripts/triage.sh --scope package --use-mutations
~/.claude/skills/test-refine/scripts/triage.sh --scope diff --use-mutations
The triage script:
1. Runs `go test -cover -covermode=atomic` to capture per-function statement+branch coverage.
2. With `--use-mutations`, invokes `~/.claude/skills/mutation-testing/scripts/unleash.sh` for LIVED mutant data.
3. Runs the detectors (`detect-smells.go`, `detect-duplicates.go`, `domain-checks.go`).
4. Ranks findings with `score.go` (composite priority, see below).

The user reviews the report. Each finding has a checkbox. The user checks boxes for the fixes they approve, then:
~/.claude/skills/test-refine/scripts/apply-fixes.sh \
--report .reviews/test-refinement/2026-05-06-wallet.md
The script applies only the checked items:
- […] `rapid` for flagged cases).

After fixes, it re-runs `go test ./... -race -count=1`, appends an "After" metrics section to the same report, and surfaces the diff.
Safety rule: a test is never removed unless its checkbox in the report is explicitly checked. This honors the global rule "never remove/skip tests without asking" from `~/.claude/CLAUDE.md`.
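The checkbox gate described above can be sketched as a small parser that keeps only explicitly checked items. This is a minimal illustration, not the actual `apply-fixes.sh` logic: the function name `ApprovedItems` and the exact `- [x] ...` line shape are assumptions made for the example.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// checkedItem matches a markdown task-list line whose box is ticked.
// The "- [x] <text>" shape is an assumption about the report format.
var checkedItem = regexp.MustCompile(`^\s*- \[[xX]\] (.+)$`)

// ApprovedItems returns the text of every checked item in the report.
// Unchecked items ("- [ ] ...") are never returned, so nothing is
// applied or removed without an explicit tick.
func ApprovedItems(report string) []string {
	var approved []string
	sc := bufio.NewScanner(strings.NewReader(report))
	for sc.Scan() {
		if m := checkedItem.FindStringSubmatch(sc.Text()); m != nil {
			approved = append(approved, strings.TrimSpace(m[1]))
		}
	}
	return approved
}

func main() {
	report := "- [x] Remove TestFoo\n- [ ] Remove TestBar\n- [X] Strengthen TestBaz\n"
	for _, item := range ApprovedItems(report) {
		fmt.Println(item)
	}
}
```

Only `Remove TestFoo` and `Strengthen TestBaz` are returned here; the unchecked `Remove TestBar` is left alone.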
When a triage produces dozens of findings, ranking matters. The composite score combines three signals:
priority = w_risk × risk_score(file_path)
+ w_severity × severity(smell_id)
+ w_gap × branch_gap(function)
| Component | Range | Source |
|---|---|---|
| risk_score | 0.2–1.0 | File-path heuristic: `consensus\|channel\|commit\|payment\|crypto\|sign\|verify\|wallet\|htlc` → 1.0; `internal/` → 0.7; `cmd/` → 0.4; `test/` helpers → 0.2 |
| severity | 0.3 (L) / 0.6 (M) / 1.0 (H) | Smell catalog severity (references/smell-catalog.md) |
| branch_gap | 0.0–1.0 | Uncovered branches in the target function / total branches |
Default weights: 0.5 / 0.3 / 0.2. Override via --weights risk=0.6,severity=0.3,gap=0.1.
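As a worked example of the formula, the sketch below computes the composite priority with the default weights. It is not the skill's `score.go`: the type and function names are hypothetical, and two details are assumptions (keyword matches take precedence over directory rules, and unmatched paths fall back to 0.2).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Weights holds the three coefficients of the composite score.
type Weights struct {
	Risk, Severity, Gap float64
}

// DefaultWeights mirrors the documented defaults (0.5 / 0.3 / 0.2).
var DefaultWeights = Weights{Risk: 0.5, Severity: 0.3, Gap: 0.2}

// criticalPath matches the high-risk keywords listed in the table.
var criticalPath = regexp.MustCompile(
	`consensus|channel|commit|payment|crypto|sign|verify|wallet|htlc`)

// RiskScore applies the file-path heuristic. Checking keywords first and
// falling back to 0.2 for unmatched paths are assumptions of this sketch.
func RiskScore(path string) float64 {
	switch {
	case criticalPath.MatchString(path):
		return 1.0
	case strings.Contains(path, "internal/"):
		return 0.7
	case strings.Contains(path, "cmd/"):
		return 0.4
	default:
		return 0.2
	}
}

// BranchGap is uncovered branches over total branches, or 0 when the
// function has no branches.
func BranchGap(uncovered, total int) float64 {
	if total <= 0 {
		return 0
	}
	return float64(uncovered) / float64(total)
}

// Priority combines the three signals into one ranking value.
func Priority(w Weights, path string, severity float64, uncovered, total int) float64 {
	return w.Risk*RiskScore(path) +
		w.Severity*severity +
		w.Gap*BranchGap(uncovered, total)
}

func main() {
	// A High-severity smell (1.0) in a wallet test with 1 of 4 branches uncovered:
	// 0.5*1.0 + 0.3*1.0 + 0.2*0.25.
	p := Priority(DefaultWeights, "internal/wallet/wallet_test.go", 1.0, 1, 4)
	fmt.Printf("%.2f\n", p) // prints 0.85
}
```

With the default weights, file-path risk dominates: the same smell in a `cmd/` file with the same branch gap would score 0.55.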
Full catalog with Go examples in references/smell-catalog.md.
| ID | Smell | Severity |
|---|---|---|
| S01 | No assertions at all (recognises testify, helper-named functions, and t.Run subtests) | High |
| S02 | Tautological assertion (x == x) | High |
| S03 | Getter/setter trivial test | Medium |
| S04 | Asserts no panic only | High |
| S05 | Unchecked error from SUT | High |
| S06 | Sensitive equality on rendered text — fmt.Sprint* and non-canonical .String(). Skipped for canonical types (chainhash.Hash, UUID, big.Int, time.Time, OutPoint, etc.) | Medium |
| S07 | Conditional/skipped assertion | Medium |
| S08 | Duplicate test body (semantic) | Medium |
| S09 | Assertion roulette — only fires when ≥4 bare asserts share the same call+RHS shape and the test has ≥8 bare asserts; confidence 0.4 (advisory) | Low |
| S10 | Expect-the-expected (want derived from got) | High |
| S11 | Side-effect not asserted | Medium |
| S12 | Mutation-survivor zone (gremlins data required) | High |
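To make one catalog entry concrete, here is a minimal sketch of how an S02-style check (tautological assertion) could be written with the standard `go/ast` package. It is an illustration under assumptions, not the skill's `detect-smells.go`: the function `FindTautologies` is hypothetical, and it only compares operands textually, so it would also flag deliberate `x != x` NaN checks.

```go
package main

import (
	"bytes"
	"fmt"
	"go/ast"
	"go/parser"
	"go/printer"
	"go/token"
)

// sample is a tiny test file containing one tautological comparison.
const sample = `package demo

import "testing"

func TestSum(t *testing.T) {
	got := 1 + 2
	if got != got {
		t.Fatal("unreachable")
	}
	if got != 3 {
		t.Fatal("wrong sum")
	}
}
`

// FindTautologies returns the line numbers of == and != expressions whose
// two operands print identically (for example x == x).
func FindTautologies(src string) ([]int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "sample_test.go", src, 0)
	if err != nil {
		return nil, err
	}
	var lines []int
	ast.Inspect(file, func(n ast.Node) bool {
		be, ok := n.(*ast.BinaryExpr)
		if !ok || (be.Op != token.EQL && be.Op != token.NEQ) {
			return true
		}
		if render(fset, be.X) == render(fset, be.Y) {
			lines = append(lines, fset.Position(be.Pos()).Line)
		}
		return true
	})
	return lines, nil
}

// render prints an expression back to source text for comparison.
func render(fset *token.FileSet, e ast.Expr) string {
	var buf bytes.Buffer
	_ = printer.Fprint(&buf, fset, e)
	return buf.String()
}

func main() {
	lines, err := FindTautologies(sample)
	fmt.Println(lines, err) // the only hit is the `got != got` on line 7
}
```

A production detector would likely also need to recognise assertion helpers (for instance testify calls with identical arguments), which this sketch does not attempt.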
For systems / distributed / Bitcoin / networking code, "good test" goes beyond standard assertion strength. The skill also checks four dimensions detailed in references/domain-checks.md:
- Concurrency: […] `-race`-friendly patterns and exercise concurrent calls. A sequential test of concurrent code is flagged.
- Failure modes: […] `error` or accepts `context.Context`? Tests must cover the error path, cancellation, and timeout. For network/disk SUTs, missing fault-injection tests are flagged. Inspired by Jepsen nemesis patterns.
- Invariants: […] `rapid` PBT (see `references/reshape-to-invariants.md`).
- Determinism: `time.Now()`, `rand.*` without seeds, `os.Getenv`, and goroutine-ordering assumptions in tests are flagged. Suggests injectable clocks/RNGs (DST-style).

Findings flagged for removal candidacy (still always require user approval):
- […] `test_efficacy`. The test catches no mutants the rest of the suite doesn't already catch.

Test function names must not contain underscores. Use `TestEncodeTxRoundtrip`, not `TestEncodeTx_Roundtrip`. For variants of the same logical test, use `t.Run("subtest name", func(t *testing.T) {...})`. The reshape pass enforces this: any rewrite the skill proposes uses subtests, not underscored function names. See `references/strengthening-patterns.md`.
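The determinism check above suggests injectable clocks in place of direct `time.Now()` calls. A minimal sketch of that pattern follows; the `Clock` interface, `FakeClock`, and the `Invoice` type are all hypothetical names chosen for illustration, not part of the skill or of any library.

```go
package main

import (
	"fmt"
	"time"
)

// Clock is a hypothetical seam that lets tests control time.
type Clock interface {
	Now() time.Time
}

// RealClock is what production code would pass in.
type RealClock struct{}

func (RealClock) Now() time.Time { return time.Now() }

// FakeClock returns a fixed instant that tests advance explicitly.
type FakeClock struct {
	t time.Time
}

func (f *FakeClock) Now() time.Time          { return f.t }
func (f *FakeClock) Advance(d time.Duration) { f.t = f.t.Add(d) }

// Invoice is an illustrative type with time-dependent behavior.
type Invoice struct {
	CreatedAt time.Time
	TTL       time.Duration
}

// Expired reports whether the invoice has reached or passed its deadline
// according to the supplied clock.
func (i Invoice) Expired(c Clock) bool {
	return !c.Now().Before(i.CreatedAt.Add(i.TTL))
}

// ExpiryDemo exercises both sides of the deadline deterministically.
func ExpiryDemo() (before, after bool) {
	start := time.Date(2026, 5, 6, 0, 0, 0, 0, time.UTC)
	clock := &FakeClock{t: start}
	inv := Invoice{CreatedAt: start, TTL: time.Hour}

	clock.Advance(59 * time.Minute)
	before = inv.Expired(clock)

	clock.Advance(time.Minute)
	after = inv.Expired(clock)
	return before, after
}

func main() {
	before, after := ExpiryDemo()
	fmt.Println(before, after) // prints false true
}
```

Because the test owns the clock, the boundary at exactly one hour can be asserted without sleeping or tolerating flakiness.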
The user opted into "aggressive — reshape tests for invariants". Examples:
- `Marshal(x) == fixed_bytes` for a single value → reshape into a `rapid` roundtrip property: `Unmarshal(Marshal(x)) == x` for arbitrary `x`.
- […] `rapid.StateMachine` covering all transitions.

Every reshape proposal in the report shows the original test verbatim alongside the proposed rewrite. If the user rejects the reshape, no change is made.
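The roundtrip reshape can be sketched as follows. The skill proposes `rapid` properties; to keep this example free of third-party dependencies it swaps in the standard library's `testing/quick` instead, and the property body would carry over to a `rapid` check largely unchanged. The `Header` type and its `Marshal`/`Unmarshal` functions are hypothetical stand-ins for the code under test.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"testing/quick"
)

// Header is an illustrative fixed-width record.
type Header struct {
	Version uint32
	Amount  uint64
}

// Marshal encodes a Header as 12 big-endian bytes.
func Marshal(h Header) []byte {
	buf := make([]byte, 12)
	binary.BigEndian.PutUint32(buf[0:4], h.Version)
	binary.BigEndian.PutUint64(buf[4:12], h.Amount)
	return buf
}

// Unmarshal decodes a Header and rejects inputs of the wrong length.
func Unmarshal(b []byte) (Header, error) {
	if len(b) != 12 {
		return Header{}, errors.New("header: want 12 bytes")
	}
	return Header{
		Version: binary.BigEndian.Uint32(b[0:4]),
		Amount:  binary.BigEndian.Uint64(b[4:12]),
	}, nil
}

// CheckRoundTrip asserts Unmarshal(Marshal(h)) == h for randomly generated
// headers, instead of comparing one value against fixed bytes.
func CheckRoundTrip() error {
	property := func(h Header) bool {
		got, err := Unmarshal(Marshal(h))
		return err == nil && got == h
	}
	return quick.Check(property, nil)
}

func main() {
	if err := CheckRoundTrip(); err != nil {
		fmt.Println("property failed:", err)
		return
	}
	fmt.Println("roundtrip property holds")
}
```

In a real suite the same property would sit inside a `Test...` function, and a fixed-bytes example may still be worth keeping alongside it, since a roundtrip alone cannot detect an encoder and decoder that are wrong in matching ways.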
- […] (`- [ ] Remove TestX`) is the only way a test is removed. Bulk approval is not a thing.
- […] `gremlins unleash` on touched packages to verify the reshape didn't weaken the mutation score. Pass `--verify-mutations`.
- […] `property-based-testing/references/strategies.md`.
- […] `LIVED`, the AST is checked for equivalence patterns (mutated value immediately overwritten, mutation in unreachable code, associative no-op) before flagging as S12.

The committed markdown report lives at:
.reviews/test-refinement/<YYYY-MM-DD>-<scope-slug>.md
Contents:
- […] `test_efficacy` if mutation data available.
- […] `Apply fix` checkbox, a `False positive — won't fix` checkbox (so disagreement is recordable across re-runs), and a collapsible `<details>` block with the test function body so the reviewer can sanity-check without context-switching to the source file.

- […] `rapid`.
- […] `property-based-testing` skill's references for PBT conversion templates.
- […] `mutation-testing` skill output (gremlins JSON) when `--use-mutations` is set.
- […] `/code-review` and `/pre-pr-review` flows as a sub-step.

- `references/smell-catalog.md`: full smell catalog with Go detection logic.
- `references/coverage-pitfalls.md`: why line coverage misleads; branch / MC/DC / mutation testing context.
- `references/strengthening-patterns.md`: weak-to-strong assertion rewrites.
- `references/domain-checks.md`: concurrency, failure-mode, determinism for distributed/Bitcoin code.
- `references/reshape-to-invariants.md`: converting example tests to `rapid` properties.
- `references/workflow.md`: phase-by-phase walkthrough with examples.