.stencila/skills/software-test-review/SKILL.md
Evaluate the quality of TDD tests against slice acceptance criteria, codebase conventions, and Red-phase execution results, producing a structured review with Accept or Revise recommendations. Use when tests written during the Red phase of red-green-refactor need quality review — checking coverage of acceptance criteria, conformance with codebase test conventions, test quality (naming, assertions, isolation, readability, triviality), edge-case and error-path coverage, and whether Red-phase failures indicate correctly missing implementation. Flags trivial low-value tests that add more maintenance cost than testing value. Discovers codebase conventions independently and produces an actionable review report.
npx skillsauth add stencila/stencila software-test-reviewInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Review tests written during the Red phase of a TDD cycle. This skill evaluates tests across multiple quality dimensions: acceptance-criteria coverage, codebase convention conformance, test quality (including detection of trivial low-value tests), edge-case handling, and whether Red-phase failures indicate correctly missing implementation. The output is a structured review report with an Accept or Revise recommendation.
This skill does not write or modify any code or test files. It only reads, evaluates, and reports.
The review answers one central question: Are these tests good enough to drive the Green phase of implementation? Tests that are accepted should be a reliable specification of the slice's expected behavior. Tests that need revision should come back with specific, actionable feedback so the test-creation agent can fix them efficiently.
A corollary question is equally important: Does every test earn its keep? Each test carries ongoing maintenance cost — it must be loaded, run, kept passing through refactors, and understood by future developers. A test is only worth that cost if it provides meaningful confidence that a real behavior works correctly. Trivial tests that merely restate the code, assert tautologies, or verify something the type system already guarantees add net-negative value: they cost time to maintain and provide no real safety net. This review must actively identify and flag such tests.
The reviewer must distinguish between acceptance criteria that should drive automated tests and criteria that are better satisfied by implementation or human inspection. Documentation requirements, doc comments, README edits, changelog entries, and similar non-executable deliverables are usually not reasons to demand source-inspecting unit tests. Their absence from the test suite is not automatically a coverage gap.
This skill requires the following information to operate:
| Input | Required | Description | |----------------------|----------|-------------------------------------------------------------| | Slice name | Yes | Name or identifier of the current slice | | Slice scope | Yes | Concise description of what the slice covers | | Acceptance criteria | Yes | The criteria the tests must verify | | Target packages | Yes | Packages, modules, or directories involved | | Test files | Yes | List of test file paths to review | | Test command | No | Command used to run the tests | | Test execution results | No | Structured output from running the tests (pass/fail counts, failure details) |
When used standalone, these inputs come from the user or the agent's prompt. When used within a workflow, the workflow's stage prompt will specify how to obtain them.
After completing its work, this skill reports:
| Output | Description | |----------------|--------------------------------------------------------------------------| | Recommendation | Whether the tests should be accepted or revised (Accept / Revise) | | Review report | Structured report with coverage matrix, findings, and recommendations |
Ensure the required inputs are available:
.stencila/plans/ as a standalone convenience — look for criteria related to the current slice name. In workflow use, the stage prompt should always provide acceptance criteria explicitly.If test execution results are available:
read_file to load each test file listed in the inputsIndependently discover the codebase's test conventions to evaluate conformance. Do not assume conventions from the test files being reviewed — those files may deviate from the codebase's norms.
glob to search for test files in the target package directories:
**/*test*, **/*spec*, **/tests/**, **/__tests__/**, **/test/**read_file to examine existing tests and learn:
tests/ directory, or in a parallel source tree?For each acceptance criterion:
Produce a coverage matrix:
| Acceptance Criterion | Covered By | Status |
|---|---|---|
| Criterion text | test_function_name | ✅ Covered / ❌ Missing / ⚠️ Weak / ➖ Non-test deliverable |
A criterion is Weak if a test exists but does not adequately verify the criterion (e.g., it asserts the wrong property, uses a trivial input, or only tests the happy path when the criterion includes error handling). A criterion is Non-test deliverable when it is better satisfied by implementation and review than by an automated test.
Compare the test files against the conventions discovered in Step 4:
Convention deviations are lower severity than coverage gaps but should still be flagged because they create maintenance burden and inconsistency.
Assess each test across these dimensions:
test_parse_returns_malformed_token_error_for_empty_stringtest1, test_parse, test_it_worksEvery test has a maintenance cost: it must be kept loadable, execute successfully, stay green through refactors, and be understood by future developers. A test is only worth that cost if it provides meaningful confidence that the system behaves correctly under conditions that could realistically fail. Actively look for and flag trivial tests that add more maintenance cost than testing value.
Flag a test as trivial if it matches any of these anti-patterns:
42 and then asserting the field is 42 (this tests the language's assignment, not the code)Optional[str] produces an Optional)toString(), __str__(), Display, or similar string-conversion output matches a hardcoded string when the implementation is a trivial format string with no conditional logicWhen evaluating whether a test is trivial, ask: "What bug would this test catch that would not already be caught by the compiler, type checker, linter, or another test?" If the answer is "none" or "only if someone makes a typo in a trivial one-liner," the test is trivial.
Trivial tests are not just useless — they are actively harmful because they:
Severity: Flag individual trivial tests as Medium. If the majority of the test suite consists of trivial tests (more than half), elevate this to a High finding because the suite provides inadequate real coverage despite appearing to have tests. A suite full of trivial tests is worse than a suite with fewer but meaningful tests.
If test execution results are available:
If no execution results are available, skip this step and note the limitation.
Follow the Report Format below.
Recommend Accept when:
Recommend Revise when:
When recommending Revise, the review report serves as feedback for the test-creation agent. Make findings specific and actionable so the test creator knows exactly what to fix. For trivial tests, explicitly recommend removing or replacing them — do not just note they are trivial but leave them in place. Specify what meaningful test (if any) should take their place, or state clearly that the test should be deleted with no replacement.
When in doubt, recommend Revise — it is safer to improve tests before driving implementation than to accept weak tests that lead to incorrect Green-phase code.
One to three sentences summarizing the test suite's quality and the most important finding. State the recommendation (Accept or Revise) and the primary reason.
A short bullet list of what the tests do well. Recognizing strengths helps the test creator know what to preserve during revision.
The coverage matrix from Step 5. This is the most important section — missing coverage is the primary reason for a Revise recommendation.
Group findings under these headings when relevant (omit headings with no findings):
For each finding:
Severity guidelines:
A numbered list of concrete improvements in priority order. Each recommendation should say what to change, where, and why. When useful, suggest specific test names, assertion patterns, or restructuring approaches.
Slice scope: "Token validation for auth module"
Acceptance criteria: AuthToken type has sub, exp, iat, roles fields; AuthError.MalformedToken returned for empty string; expired tokens rejected
Test execution: 3 tests, all failed with resolution errors indicating AuthToken cannot be found — correct Red-phase behavior.
Review:
Overall Assessment
The test suite covers all three acceptance criteria with clear, well-named tests that fail correctly in the Red phase. Accept — tests are ready to drive the Green phase.
Strengths
- Each acceptance criterion maps to a dedicated test
- Test names describe the expected behavior precisely
- Failure messages indicate missing implementation, not test bugs
Acceptance-Criteria Coverage
| Acceptance Criterion | Covered By | Status | |---|---|---| |
AuthTokenhas required fields |test_auth_token_has_required_fields| ✅ Covered | |MalformedTokenfor empty string |test_parse_returns_malformed_token_for_empty_string| ✅ Covered | | Expired tokens rejected |test_expired_token_is_rejected| ✅ Covered |Findings
Convention conformance
- Low:
test_auth_token_has_required_fieldsuses a loose truthiness check for field existence, while existing tests in the package use equality assertions for field verification. Consistency is preferred but not blocking.Recommendations
- Consider switching field assertions to equality checks to match existing conventions in the
authpackage — this is minor and does not block acceptance
Slice scope: "CSV parser error handling"
Acceptance criteria: parse_csv raises CsvError::EmptyFile for empty input; parse_csv skips malformed rows and returns partial results; parse_csv raises CsvError::EncodingError for non-UTF-8 input
Test execution: 2 tests failed with SyntaxError: invalid syntax — test file has Python syntax errors.
Review:
Overall Assessment
The test suite is missing coverage for one of three acceptance criteria, and the test file has a syntax error that prevents execution. Revise — fix the syntax error and add the missing encoding-error test.
Strengths
- Test names are descriptive and follow the project's
test_<behavior>_<condition>convention- The empty-file test has a clear assertion against the specific error type
Acceptance-Criteria Coverage
| Acceptance Criterion | Covered By | Status | |---|---|---| |
EmptyFilefor empty input |test_parse_csv_raises_empty_file_for_empty_input| ✅ Covered | | Skip malformed rows, return partial |test_parse_csv_skips_malformed_rows| ⚠️ Weak | |EncodingErrorfor non-UTF-8 | — | ❌ Missing |Findings
Red-phase validation
- High: Test file has a syntax error on line 23 — missing closing parenthesis in the
test_parse_csv_skips_malformed_rowsfunction. Tests fail due toSyntaxError, not due to missing implementation.Test quality
- Medium:
test_parse_csv_skips_malformed_rowsonly checks that the function returns a list, but does not assert the partial results contain the valid rows. The assertion should verify the content of the returned data, not just its type.Edge cases and error paths
- High: No test for
CsvError::EncodingErrorwhen given non-UTF-8 input. This is an explicit acceptance criterion.Recommendations
- Fix the syntax error on line 23 of
tests/test_parser.py— add the missing closing parenthesis- Add a test
test_parse_csv_raises_encoding_error_for_non_utf8that passes binary non-UTF-8 data and assertsCsvError.EncodingErroris raised- Strengthen
test_parse_csv_skips_malformed_rowsto assert the returned list contains only the valid rows, not just that it is a list
Slice scope: "Add Config class with validation for the settings module"
Acceptance criteria: Config constructor validates that port is 1–65535; Config constructor validates that host is non-empty; constructor raises ConfigError.InvalidPort and ConfigError.InvalidHost respectively
Test suite has 8 tests. Test execution: 8 tests, all failed with import/resolution errors indicating Config cannot be found — correct Red-phase behavior.
Review:
Overall Assessment
While all acceptance criteria are covered, 5 of 8 tests are trivial — they test constructor field assignment, type identity, and default values with no validation logic. The suite gives a false sense of thoroughness. Revise — remove the 5 trivial tests and keep the 3 meaningful validation tests.
Strengths
- The three validation tests (
test_new_rejects_port_zero,test_new_rejects_empty_host,test_new_accepts_valid_config) are well-named and assert the right error types- Red-phase failures are correct — all failures indicate missing implementation
Acceptance-Criteria Coverage
| Acceptance Criterion | Covered By | Status | |---|---|---| | Port validated 1–65535 |
test_new_rejects_port_zero,test_new_rejects_port_above_65535| ✅ Covered | | Host validated non-empty |test_new_rejects_empty_host| ✅ Covered | |InvalidPorterror returned |test_new_rejects_port_zero| ✅ Covered | |InvalidHosterror returned |test_new_rejects_empty_host| ✅ Covered |Findings
Test quality — Triviality
- High: 5 of 8 tests (63%) are trivial, making the majority of the suite maintenance cost with no testing value:
- Medium:
test_config_has_port_field— creates aConfigwithport: 8080and assertsconfig.port == 8080. This is a tautological constructor echo test; it tests that assignment works, not that the code is correct.- Medium:
test_config_has_host_field— same pattern: setshostto"localhost"and asserts it equals"localhost". No logic is exercised.- Medium:
test_config_default_port— asserts that a default-constructedConfighasport == 0. Unless the default value has business significance specified in the acceptance criteria, this tests the default mechanism, not application logic.- Medium:
test_config_is_debug— asserts that the string representation of aConfigcontains"Config". This tests the auto-generated debug/string representation, which is guaranteed by the language framework.- Medium:
test_config_clone— clones aConfigand asserts the clone equals the original. This tests the language's copy/clone mechanism, not application logic.Recommendations
- Delete
test_config_has_port_field,test_config_has_host_field,test_config_default_port,test_config_is_debug, andtest_config_clone— they catch no bugs that the language toolchain would miss and will need updating on every refactor- Keep
test_new_rejects_port_zero,test_new_rejects_port_above_65535,test_new_rejects_empty_host, andtest_new_accepts_valid_config— these test real validation logic- The resulting 3-test suite will have 100% meaningful coverage of the acceptance criteria with zero maintenance waste
➖ Non-test deliverable unless the project already has an established automated check for them and the criterion clearly intends that. Do not recommend Revise solely because there is no unit test asserting docs exist.documentation
An agent skill providing instructions for AI agents.
testing
Critically review a Stencila workflow and suggest improvements. Use when asked to review, audit, critique, evaluate, or improve a workflow directory or WORKFLOW.md file. Covers frontmatter validation, DOT pipeline quality, workflow structure, agent selection quality, discovery metadata, ephemeral workflow conventions, workflow composition, and adherence to Stencila workflow patterns.
development
Create a new Stencila workflow. Use when asked to create, write, scaffold, or set up a workflow directory or WORKFLOW.md file. Covers workflow discovery, duplicate-name checks, ephemeral workflows, WORKFLOW.md frontmatter, DOT pipeline authoring, goals, agents, branching, composition, and validation.
development
Critically review an existing or proposed Stencila theme artifact for correctness, token usage, target coverage, cross-target portability, dark-mode handling, maintainability, and approval readiness. Use when asked to review, critique, assess, audit, or validate a theme.css file, theme patch, theme plan, site theme, document theme, plot theme, print or PDF theme, check design tokens, assess DOCX or email behavior, review dark mode support, or validate with stencila themes validate.