Writing Tests

Invocation Notice

Inform the user when this skill is being invoked by name: writing-tests.

When to Use

Writing or reviewing tests in any language or framework.
Choosing what kind of test to write (unit, integration, E2E).
Diagnosing a suite that is slow, brittle, flaky, or hard to maintain.
Reviewing test structure, naming, or test double usage.
Unsure what test coverage a change needs.

When NOT to use:

Python-specific practices (fixtures, async, multi-version, free-threaded): see python-testing.

Quick Reference

Test observable behavior — outputs, state transitions, emitted effects — not implementation internals.
One behavior per test. Split when you can't write a single coherent name covering all assertions.
Write the regression test before the fix; a test you never saw fail proves nothing.
Keep tests order-independent and deterministic. Inject clocks and randomness; never sleep.
Prefer DAMP over DRY in test code: readable duplication beats abstraction that hides the story.
Mock and stub at collaboration boundaries only. Prefer real collaborators for domain logic.

Test Structure (AAA)

Every test follows the same three-phase structure — called Arrange/Act/Assert, Given/When/Then, or Setup/Exercise/Verify depending on tradition. All three describe identical phases:

| Phase | AAA | Given/When/Then | Meszaros | | ----- | ------- | --------------- | -------- | | 1 | Arrange | Given | Setup | | 2 | Act | When | Exercise | | 3 | Assert | Then | Verify | | 4 | — | — | Teardown |

Separate phases with a blank line — that is the minimum required structure. Section-label comments (# Arrange / # Act / # Assert, or # Given / # When / # Then) are acceptable when the team finds them helpful; they are not required. The examples in this document include labels as a teaching aid.

// Arrange
cart = new_cart()
cart.add(item("widget", price=9.99), quantity=3)

// Act
total = cart.total()

// Assert
assert total == 29.97

Teardown is required when a test acquires real resources (files, connections). Prefer automatic cleanup via framework hooks over manual teardown in the test body.

Test organization (file layout, grouping by class vs. flat functions, describe blocks) follows your testing framework's conventions. The AAA structure applies at the individual test function level, regardless of how tests are grouped.

Violations that signal a structural problem:

Interleaved Act/Assert cycles — the test covers multiple behaviors; split it.
No Assert — "testing that nothing crashes" is not testing.
Giant Arrange that buries the Act — extract a named Creation Method.

See references/test-structure.md for AAA in depth, DAMP vs. DRY, and GWT vs. AAA.

Naming

Test names are the first thing a reader sees on a failure. They must communicate the scenario and the expected outcome without entering the test body. The name is the test's specification.

Three-part form: <Behavior>_<Condition>_<ExpectedOutcome>

Casing and separators follow your language convention — the goal (communicating scenario and expected outcome) is universal; the style is not.

// snake_case (Python, Ruby)
withdraw_money__insufficient_funds__raises_error
cart_with_no_items__total_is_zero

// camelCase segments (Java, C#, JS)
withdrawMoney_insufficientFunds_raisesError
cartWithNoItems_totalIsZero

Alternative — full behavioral statement (readable as documentation):

should_reject_transfer_when_balance_is_negative
returns_empty_list_for_unknown_category

Note: some frameworks require a specific prefix for test discovery (e.g., test_ in pytest, Test in Go). That prefix is a framework requirement, not a naming-quality choice — it does not substitute for a descriptive name. A function named test_withdraw is as opaque as testWithdraw; test_withdraw__insufficient_funds__raises_error communicates what the test verifies.

Avoid: numbered suffixes (test_2, test_order_2), names that state only the entry point without a condition or expected outcome, names that describe input type rather than scenario.

See references/test-structure.md for naming conventions in depth.

Comments

Don't write comments that explain what the test does — the name and structure should. Write a comment only when:

A non-obvious business rule drives the assertion and the test name cannot carry the full context.
A deliberate deviation from convention would otherwise look like a mistake.

A test that requires comments to be understandable has a naming or structure problem. Fix the test.

Test Portfolio

Shape the suite toward many fast unit tests, fewer integration tests, and a minimal set of end-to-end tests — the testing pyramid. Treat the ratios as a sanity check against a slow, expensive, inverted pyramid, not a quota. Let contract risk and the natural behavioral unit drive the actual mix.

Match test scope to the natural behavioral unit of the code:

Pure function with no collaborators → unit test.
Component coordinating several well-tested sub-components → integration test at the component boundary.
Service or module with a defined API → integration test treating the service as the unit, using real or in-process infrastructure.
User-facing workflow spanning multiple services → E2E test (keep to a minimum; critical paths only).

Coverage expectations for each change:

Happy path and the primary alternative paths.
Validation failures and error paths.
Boundary inputs and edge cases.
A regression test for every bug fixed — write it before the fix.
Cleanup, cancellation, and resource-release paths when the code touches I/O or long-running work.

Multiple write-sites: when the same contract outcome can be produced by more than one code path (canonical path, dedup shortcut, cache hit, idempotency early-return), each path needs its own test. A test of the canonical path does not prove the shortcut is correct.

// Test the canonical path
result = repo.save(record, mode=CANONICAL)
assert result.id != null

// Also test the shortcut — it may skip validation or use a different write order
result = repo.save(record, mode=DEDUP)
assert result.id != null
assert result.normalized_url == normalize(record.url)

Derived-pair invariants: when a spec asserts field_a == f(field_b) and the two fields are written by different producers, neither field-local tests nor per-producer tests prove the invariant holds after composition. Write a post-composition test.

// Each producer may be correct in isolation but the pair must be verified together
manifest = merge(resolver_output, fetcher_output)
assert manifest.normalized_url == normalize(manifest.source_url)

See references/test-portfolio.md for the pyramid vs. trophy vs. honeycomb debate, scope decision questions, property-based testing, and mutation testing.

Test Doubles

"Mock" is not a synonym for "test double." Use Meszaros's taxonomy:

| Type | Definition | Use when | | --------- | ------------------------------ | ------------------------------------------------------------------ | | Dummy | Passed but never used | A parameter is required but irrelevant to this test | | Fake | Simplified real implementation | In-memory database, fake queue, in-process event bus | | Stub | Returns canned values | Providing indirect inputs: HTTP responses, clocks, config | | Spy | Stub that records calls | Verifying a side-effect occurred; assert post-hoc on the recording | | Mock | Pre-programmed expectations | The interaction itself is the behavior under test |

Prefer state verification (assert on observable outcomes) over behavior verification (assert on call sequences). Tests that assert call sequences are structure-sensitive: they break when implementation is refactored without changing behavior.

When to use doubles: external I/O (network, filesystem, email), non-determinism (clocks, randomness), operations that must not execute in tests (charge a card, send a real notification). Prefer real collaborators for domain logic — test domain objects with their actual peers rather than stubbing business logic. A Fake (in-memory implementation) is a test double and is appropriate for external infrastructure; using a FakeRepo does not violate this principle.

Stub at the boundary, not deep inside. Replace the dependency at the seam where your module uses it. Deep mock chains that mirror the internal call graph are a design smell — they indicate I/O has not been properly decoupled from logic.

Common mistake: asserting that a mock was called without asserting on observable output. Call-count assertions prove the mock was invoked, not that the behavior was correct.

See references/test-doubles.md for the full taxonomy with examples, the classical vs. mockist distinction, and common misuse patterns.

TDD

Red-green-refactor is an ordering discipline: write a failing test, write minimum code to pass, then refactor. The principles in this skill apply regardless of whether tests are written before or after code.

One rule from TDD that applies universally: when fixing a bug, write the failing test that reproduces it before touching production code. A test you never saw fail doesn't prove the bug is gone. For failures that can't be reproduced locally (race conditions, load-dependent), write a characterization test capturing the known behavior and confirm the failure reproduces in CI before patching.

Common Mistakes

Fragile Test — breaks when unrelated code changes; caused by asserting on internals (call sequences, private state, string representations of objects).
Assertion Roulette — multiple assertions with no identifying messages; when one fails the reader can't tell which.
Eager Test — verifies multiple behaviors in one test; split when the name can't coherently cover all assertions.
Mystery Guest — fixture set up outside the test body (external file, shared database, implicit setup); the reader can't understand cause and effect without finding the external resource.
Erratic Test — non-deterministic; primary causes: shared mutable state that bleeds between tests, ordering dependencies (test B silently relies on test A having run first), timing assertions using sleep instead of condition-based waits, unordered collections in equality checks, unseeded randomness, and resource leaks from missing teardown. Diagnose ordering issues by running with a randomized seed. Fix with per-test setup/teardown hooks, condition-based waits, sorted comparisons, and seeded randomness.
Slow Tests — unit tests touching real databases, networks, or filesystems; replace with fakes or in-process substitutes.
Testing the mock — asserting call counts without asserting on observable output; proves the mock was invoked, not that the behavior was correct.
Missing regression test — fixing a bug without first writing a failing test that reproduces it.

References

references/test-structure.md — AAA in depth, naming conventions, DAMP vs. DRY, GWT vs. AAA
references/test-portfolio.md — pyramid, trophy, honeycomb, scope decision guide, property-based testing, mutation testing
references/test-doubles.md — full taxonomy with pseudocode, classical vs. mockist school, when to use each type

Writing Tests

Invocation Notice

Inform the user when this skill is being invoked by name: writing-tests.

When to Use

Writing or reviewing tests in any language or framework.
Choosing what kind of test to write (unit, integration, E2E).
Diagnosing a suite that is slow, brittle, flaky, or hard to maintain.
Reviewing test structure, naming, or test double usage.
Unsure what test coverage a change needs.

When NOT to use:

Python-specific practices (fixtures, async, multi-version, free-threaded): see python-testing.

Quick Reference

Test observable behavior — outputs, state transitions, emitted effects — not implementation internals.
One behavior per test. Split when you can't write a single coherent name covering all assertions.
Write the regression test before the fix; a test you never saw fail proves nothing.
Keep tests order-independent and deterministic. Inject clocks and randomness; never sleep.
Prefer DAMP over DRY in test code: readable duplication beats abstraction that hides the story.
Mock and stub at collaboration boundaries only. Prefer real collaborators for domain logic.

Test Structure (AAA)

Every test follows the same three-phase structure — called Arrange/Act/Assert, Given/When/Then, or Setup/Exercise/Verify depending on tradition. All three describe identical phases:

// Arrange
cart = new_cart()
cart.add(item("widget", price=9.99), quantity=3)

// Act
total = cart.total()

// Assert
assert total == 29.97

Teardown is required when a test acquires real resources (files, connections). Prefer automatic cleanup via framework hooks over manual teardown in the test body.

Violations that signal a structural problem:

Interleaved Act/Assert cycles — the test covers multiple behaviors; split it.
No Assert — "testing that nothing crashes" is not testing.
Giant Arrange that buries the Act — extract a named Creation Method.

See references/test-structure.md for AAA in depth, DAMP vs. DRY, and GWT vs. AAA.

Naming

Test names are the first thing a reader sees on a failure. They must communicate the scenario and the expected outcome without entering the test body. The name is the test's specification.

Three-part form: <Behavior>_<Condition>_<ExpectedOutcome>

Casing and separators follow your language convention — the goal (communicating scenario and expected outcome) is universal; the style is not.

// snake_case (Python, Ruby)
withdraw_money__insufficient_funds__raises_error
cart_with_no_items__total_is_zero

// camelCase segments (Java, C#, JS)
withdrawMoney_insufficientFunds_raisesError
cartWithNoItems_totalIsZero

Alternative — full behavioral statement (readable as documentation):

should_reject_transfer_when_balance_is_negative
returns_empty_list_for_unknown_category

Avoid: numbered suffixes (test_2, test_order_2), names that state only the entry point without a condition or expected outcome, names that describe input type rather than scenario.

See references/test-structure.md for naming conventions in depth.

Comments

Don't write comments that explain what the test does — the name and structure should. Write a comment only when:

A non-obvious business rule drives the assertion and the test name cannot carry the full context.
A deliberate deviation from convention would otherwise look like a mistake.

A test that requires comments to be understandable has a naming or structure problem. Fix the test.

Test Portfolio

Match test scope to the natural behavioral unit of the code:

Pure function with no collaborators → unit test.
Component coordinating several well-tested sub-components → integration test at the component boundary.
Service or module with a defined API → integration test treating the service as the unit, using real or in-process infrastructure.
User-facing workflow spanning multiple services → E2E test (keep to a minimum; critical paths only).

Coverage expectations for each change:

Happy path and the primary alternative paths.
Validation failures and error paths.
Boundary inputs and edge cases.
A regression test for every bug fixed — write it before the fix.
Cleanup, cancellation, and resource-release paths when the code touches I/O or long-running work.

// Test the canonical path
result = repo.save(record, mode=CANONICAL)
assert result.id != null

// Also test the shortcut — it may skip validation or use a different write order
result = repo.save(record, mode=DEDUP)
assert result.id != null
assert result.normalized_url == normalize(record.url)

// Each producer may be correct in isolation but the pair must be verified together
manifest = merge(resolver_output, fetcher_output)
assert manifest.normalized_url == normalize(manifest.source_url)

See references/test-portfolio.md for the pyramid vs. trophy vs. honeycomb debate, scope decision questions, property-based testing, and mutation testing.

Test Doubles

"Mock" is not a synonym for "test double." Use Meszaros's taxonomy:

Common mistake: asserting that a mock was called without asserting on observable output. Call-count assertions prove the mock was invoked, not that the behavior was correct.

See references/test-doubles.md for the full taxonomy with examples, the classical vs. mockist distinction, and common misuse patterns.

TDD

Common Mistakes

Fragile Test — breaks when unrelated code changes; caused by asserting on internals (call sequences, private state, string representations of objects).
Assertion Roulette — multiple assertions with no identifying messages; when one fails the reader can't tell which.
Eager Test — verifies multiple behaviors in one test; split when the name can't coherently cover all assertions.
Mystery Guest — fixture set up outside the test body (external file, shared database, implicit setup); the reader can't understand cause and effect without finding the external resource.
Erratic Test — non-deterministic; primary causes: shared mutable state that bleeds between tests, ordering dependencies (test B silently relies on test A having run first), timing assertions using sleep instead of condition-based waits, unordered collections in equality checks, unseeded randomness, and resource leaks from missing teardown. Diagnose ordering issues by running with a randomized seed. Fix with per-test setup/teardown hooks, condition-based waits, sorted comparisons, and seeded randomness.
Slow Tests — unit tests touching real databases, networks, or filesystems; replace with fakes or in-process substitutes.
Testing the mock — asserting call counts without asserting on observable output; proves the mock was invoked, not that the behavior was correct.
Missing regression test — fixing a bug without first writing a failing test that reproduces it.

References

references/test-structure.md — AAA in depth, naming conventions, DAMP vs. DRY, GWT vs. AAA
references/test-portfolio.md — pyramid, trophy, honeycomb, scope decision guide, property-based testing, mutation testing
references/test-doubles.md — full taxonomy with pseudocode, classical vs. mockist school, when to use each type

Adoption

ahgraber/writing-tests

$ install --global

Security Scan Results

SKILL.md

Writing Tests

Invocation Notice

When to Use

Quick Reference

Test Structure (AAA)

Naming

Comments

Test Portfolio

Test Doubles

TDD

Common Mistakes

References

Related Skills

ahgraber/editorial-review

ahgraber/synthesize

ahgraber/strudel

ahgraber/skills/debugging

ahgraber/writing-tests

$ install --global

Security Scan Results

SKILL.md

Writing Tests

Invocation Notice

When to Use

Quick Reference

Test Structure (AAA)

Naming

Comments

Test Portfolio

Test Doubles

TDD

Common Mistakes

References

Related Skills

ahgraber/editorial-review

ahgraber/synthesize

ahgraber/strudel

ahgraber/skills/debugging