posit-dev/review-testing/SKILL.md
Review test code for quality, design, and completeness after implementing a feature or fixing a bug. Use when the user asks to "review my tests", "check my test quality", "are these tests good enough", "review testing", or after completing a feature implementation that includes tests. Also use when tests feel brittle, flaky, or superficial. Cross-references production code to find coverage gaps.
npx skillsauth add posit-dev/skills review-testingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
You are reviewing test code written alongside a feature implementation or bug fix. Ensure the tests are well-designed, thorough, and maintainable — not just that they pass. Tests that merely mirror implementation details create false confidence and become a maintenance burden during refactoring.
Identify what to review:
Read test files first, before production code. If you can infer the feature's requirements and edge cases from the tests alone, that's a sign the tests are well-written. If you need to read the implementation to understand what the tests are doing, that's a finding worth reporting.
Weigh each test against four qualities:
Regression protection — Does this test actually catch bugs? A test that exercises trivial code or skips complex branches protects against nothing. Check: does the test touch business-critical logic, or only verify the happy path of a simple getter?
Refactoring resilience — Will this test break when someone restructures code without changing behavior? Tests coupled to internal method names, call sequences, or private state punish every cleanup with false failures, eroding trust in the suite.
Fast feedback — Unit tests should run in milliseconds. If a test hits the filesystem, network, or database unnecessarily, that's a design issue. But don't confuse speed with value — an integration test verifying a real database query is better than a fast unit test that mocks everything and verifies nothing.
Maintainability — Can someone unfamiliar with this code read the test and understand what it verifies and why? Tests with sprawling setup, cryptic names, or deeply nested mocking fail this check.
The most common weakness in generated tests: asserting only the obvious output and missing the full "blast radius" of a state change.
When a test triggers an action, ask what else changed. If a test adds an item to a cart, does it only check the item count? Or does it also verify the price calculation, subtotal update, and that other items are unaffected?
Flag when:
Each test should have exactly one Arrange-Act-Assert cycle. If a test acts and asserts multiple times in sequence, it's testing multiple behaviors and should be split.
Flag when:
How test data is created determines whether the suite is maintainable at scale. Inline setup (data created in the test body) is fine for simple tests but painful when constructor signatures change across many tests. Implicit setup (shared beforeEach/setUp blocks) eliminates duplication but obscures what each test actually depends on. Delegated setup (factory functions or builders called explicitly) keeps tests readable while centralizing construction logic.
Flag when:
Mocks are essential for isolation, but overuse turns tests into mirrors of the implementation.
The key principle: mock at architectural boundaries, not at every function call. Use stubs (canned data) for query-type dependencies and mocks (interaction verification) only for commands with side effects like sending emails or writing to external systems.
Respect the codebase's existing mocking convention. Flag inconsistency within the project, not deviation from a universal rule.
Flag when:
| Smell | What It Looks Like | Why It Matters |
|---|---|---|
| Assertion Roulette | Multiple assertions with no failure messages | Can't tell which assertion broke without debugging |
| Eager Test | One test exercises several unrelated methods | Failures are ambiguous — which behavior broke? |
| Lazy Test | Multiple tests call the same method with identical inputs | Redundant maintenance cost, no coverage gain |
| Sleepy Test | Hard-coded sleep()/Sys.sleep()/setTimeout() | Flaky in CI, slow everywhere. Use polling or explicit waits. |
| Rotten Green | Assertions inside try/tryCatch or conditional branches | Test always passes because the assertion is never reached |
| Sensitive Equality | Asserting against toString()/print() output | Breaks on formatting changes; assert structural properties instead |
| Print Statement | print()/console.log() instead of assertions | Debugging leftovers that verify nothing |
| Snapshot Abuse | Snapshots as a substitute for behavioral assertions | Any change triggers failure, developers blindly update. Good uses: one snapshot of an HTML component's structure (not every prop combination), error message text, CLI output. Bad: snapshotting entire objects or rendering every variant. |
| Implementation Mirror | Expected values computed using the same logic as production code | Test and production code will always agree — even when both are wrong. Hardcode expected values from a known-good source. |
Test names should describe behavior, not implementation. A well-named test suite reads like a feature specification.
Flag when:
test_processData_returns_true) — break on renametest1, test_it_works, test_basic)Prefer behavioral names: test_expired_subscription_blocks_access, delivery_with_past_date_is_invalid, empty_cart_shows_zero_total.
Cross-reference production code changes against the test suite:
Walk through the implementation and note every decision point — each if, match, switch, error handler, or early return. Check whether the test suite exercises both sides of that decision.
When reviewing R tests using testthat, check if the testing-r-packages skill is available and invoke it for R-specific conventions and patterns.
## Summary
[Overall assessment: How well do these tests protect the codebase?]
## Critical Issues (Blocking)
[Tests that provide false confidence or will cause real problems.]
## Required Changes
[Design problems that weaken the test suite.]
## Strong Suggestions
[Improvements to test quality and maintainability.]
## Noted
[Minor style or convention issues. Mention once, then move on.]
## Verdict
Request Changes | Needs Discussion | Approve
## Next Steps
[Options for proceeding]
Use file:line references for every finding. Quote the specific test code that demonstrates the issue and show what better code looks like.
At the end of the review, offer the user these options:
Discuss and address findings: Use the AskUserQuestion tool to walk through the issues. Group by severity or topic, offer resolution options, and mark the recommended choice.
Fix the issues: Offer to apply fixes directly in priority order — blocking issues first, then required changes, then suggestions. Confirm before continuing after each group.
Add to a pull request: When reviewing in context of a PR, offer to post the review as a PR comment. Include attribution: "Review assisted by the review-testing skill."
If operating as a subagent, skip the next steps and output only the review findings.
tools
Bulk resolve unresolved PR review threads on the current branch’s PR — typically after threads have been addressed manually or via /pr-threads-address
development
Address PR review feedback by systematically working through every unresolved PR review thread on the current branch's PR - analyze each comment, make the requested code changes (with tests where useful), commit, and optionally reply and resolve.
tools
Build modern Shiny dashboards and applications using bslib (Bootstrap 5). Use when creating new Shiny apps, modernizing legacy apps (fluidPage, fluidRow/column, tabsetPanel, wellPanel, shinythemes), or working with bslib page layouts, grid systems, cards, value boxes, navigation, sidebars, filling layouts, theming, accordions, tooltips, popovers, toasts, or bslib inputs. Assumes familiarity with basic Shiny.
tools
Guide for drafting issue closure and decline responses as an open-source package maintainer. Use when helping compose a reply that says "no" to a feature request, closes an issue as won't-fix, redirects a user to a different package, explains why a design choice is intentional, or otherwise declines or closes a community contribution. Also use when the maintainer needs to explain a deprecation, point out a user misunderstanding, or communicate an effort/scope tradeoff to a contributor.