Writing Tests

Core Principle

Tests prove behavior works. A test that can't fail is worthless. A test that tests mocks instead of real code is theater.

Writing Good Tests

One behavior per test

Each test should verify exactly one thing. If the test name needs "and" in it, split it into two tests.

Good:  "creates user with valid email"
Good:  "rejects user with duplicate email"
Bad:   "creates user and sends welcome email and updates counter"

Derive test cases from three sources

Build test coverage from three independent sources and verify every item maps to at least one test:

User requirements -- what was requested (spec, issue, conversation)
Features implemented -- what the code actually does (scan the diff)
Claims in the response -- what you're about to tell the user works

Anything that appears in any source but has no corresponding test is a coverage gap. This catches the common failure mode where implemented features work but aren't tested, or where claimed behavior isn't verified.

For each source, enumerate user journeys: "As a [role], I want to [action], so that [benefit]." Generate test cases from each journey -- this ensures tests cover user-visible behavior, not implementation details.

DAMP over DRY in tests

Each test should be independently readable without chasing shared setup through helper functions. Duplication in tests is acceptable -- even desirable -- when it makes the test's intent obvious at a glance. Extract shared setup only when it genuinely reduces noise without hiding what the test does.

Test pyramid

For API/web projects, aim for ~80% unit, ~15% integration, ~5% E2E. Adjust ratios based on project risk profile -- data pipelines may need heavier integration coverage, CLI tools may need minimal E2E.

Unit tests (~80%): fast, isolated, test one behavior per test. Run in milliseconds. No database, no network, no filesystem. These form the foundation -- cheap to write, cheap to run, fast feedback.
Integration tests (~15%): verify component boundaries -- API endpoints hitting a real test database, service layers wired to real dependencies, queue producers and consumers working together. Slower than unit tests but catch wiring bugs that mocks hide.
E2E tests (~5%): validate critical user paths end-to-end through the real system. Expensive to write, slow to run, brittle to maintain. Limit to high-value flows (signup, checkout, core workflow). Every E2E test must justify its maintenance cost.

Name tests by expected behavior

The test name should describe what happens, not what's being called.

Good:  "returns 404 when user does not exist"
Bad:   "test getUserById"
Good:  "sends notification after order is placed"
Bad:   "test processOrder"

Use real objects when practical

Mocks should be a last resort, not a first choice. Every mock is an assumption about behavior that may drift from reality.

| Use real objects for | Use mocks/fakes for |
|---------------------|---------------------|
| Database queries (use test DB) | External HTTP APIs |
| Internal services and classes | Payment gateways |
| File system operations (use temp dirs) | Email/SMS delivery |
| Business logic and transformations | Third-party SDKs with rate limits |

Exception: framework-provided test doubles. When a framework offers dedicated faking mechanisms (Laravel Queue::fake(), Event::fake(); React test providers and vi.mock for API layers), use them -- they are the idiomatic approach and maintained alongside the framework. The principle is: avoid hand-rolled mocks that drift, not framework-blessed test utilities.
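A minimal sketch of the "real objects" side of this guidance, assuming a hypothetical `UserStore` class and `export_emails` helper (neither is from the original). An in-memory SQLite database stands in for the test DB, and `tempfile.TemporaryDirectory` for the temp-dir approach.

```python
import sqlite3
import tempfile
from pathlib import Path


class UserStore:
    """Hypothetical data-access class; illustrative only."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS users (email TEXT PRIMARY KEY)"
        )

    def create(self, email: str) -> None:
        self.conn.execute("INSERT INTO users (email) VALUES (?)", (email,))

    def exists(self, email: str) -> bool:
        row = self.conn.execute(
            "SELECT 1 FROM users WHERE email = ?", (email,)
        ).fetchone()
        return row is not None


def export_emails(store: UserStore, path: Path) -> None:
    """Hypothetical export helper: writes one email per line, sorted."""
    rows = store.conn.execute("SELECT email FROM users ORDER BY email")
    path.write_text("\n".join(email for (email,) in rows))


def test_creates_user_with_valid_email() -> None:
    # Real SQL runs against a throwaway in-memory database, not a mock.
    store = UserStore(sqlite3.connect(":memory:"))
    store.create("a@example.com")
    assert store.exists("a@example.com")


def test_exports_emails_to_file() -> None:
    store = UserStore(sqlite3.connect(":memory:"))
    store.create("b@example.com")
    store.create("a@example.com")
    # A temp dir gives real filesystem behavior and is removed on exit.
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "emails.txt"
        export_emails(store, out)
        assert out.read_text() == "a@example.com\nb@example.com"
```

Each test builds its own connection, so nothing leaks between tests and no mock has to guess how SQL or the filesystem behaves.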

Tests expose bugs, not the reverse

If a test uncovers broken or buggy behavior, fix the source code -- never adjust the test to match incorrect behavior. A test that passes against a bug is worse than no test at all.

Assert on outcomes, not implementation

Good:  assert user exists in database after create
Bad:   assert repository.save() was called once
Good:  assert response body contains expected fields
Bad:   assert serializer.serialize() was called with user
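The contrast above can be sketched as follows, assuming a hypothetical `register` function and `InMemoryUserRepo` fake (both invented for illustration). The first test only echoes the implementation; the second pins the observable outcome.

```python
from unittest.mock import MagicMock


class InMemoryUserRepo:
    """Hypothetical fake with real storage semantics."""

    def __init__(self) -> None:
        self.emails = set()  # emails stored so far

    def save(self, email: str) -> None:
        self.emails.add(email)


def register(repo, email: str) -> None:
    """Hypothetical function under test: normalizes, then stores."""
    repo.save(email.strip().lower())


def test_implementation_echo() -> None:
    # Weak: only proves save() was invoked. It would still pass if
    # register() stored the raw, un-normalized email.
    repo = MagicMock()
    register(repo, "  Ada@Example.com ")
    repo.save.assert_called_once()


def test_stores_normalized_email() -> None:
    # Stronger: asserts the state a caller can observe afterward.
    repo = InMemoryUserRepo()
    register(repo, "  Ada@Example.com ")
    assert repo.emails == {"ada@example.com"}
```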

Test edge cases

For every feature, consider:

Empty input / null / undefined
Boundary values (0, 1, max, max+1)
Invalid types (string where number expected)
Concurrent access (if applicable)
Error paths (network failure, timeout, permission denied)
Unicode and special characters in string inputs
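As an illustration of the boundary and invalid-type items in this list, here is a small sketch around a hypothetical `parse_quantity` validator (the function, its range, and the `raises` helper are assumptions, not part of the original).

```python
def parse_quantity(raw, maximum=100):
    """Hypothetical validator: accepts integers in 1..maximum."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(raw, bool) or not isinstance(raw, int):
        raise TypeError("quantity must be an integer")
    if raw < 1 or raw > maximum:
        raise ValueError("quantity out of range")
    return raw


def raises(exc_type, fn, *args) -> bool:
    """Return True if fn(*args) raises exc_type, False if it returns."""
    try:
        fn(*args)
    except exc_type:
        return True
    return False


def test_accepts_lower_and_upper_bounds() -> None:
    assert parse_quantity(1) == 1
    assert parse_quantity(100) == 100


def test_rejects_values_just_outside_bounds() -> None:
    assert raises(ValueError, parse_quantity, 0)
    assert raises(ValueError, parse_quantity, 101)


def test_rejects_invalid_types() -> None:
    assert raises(TypeError, parse_quantity, "5")
    assert raises(TypeError, parse_quantity, None)
    assert raises(TypeError, parse_quantity, True)
```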

Silent failure coverage

Tests must detect silent failures, not just happy paths. For every code path that catches, logs, or short-circuits on error, add an assertion that proves the failure was observable. Hunt targets during test writing:

Empty catch blocks (try { ... } catch {}) — add a test that triggers the error and asserts the logger (or equivalent signal) was called with the original exception.
Swallowed rejections (.catch(() => []), .catch(() => null)) — add a test that triggers the rejection and asserts the caller sees a distinguishable signal (specific return value, logged error, re-thrown).
Converted errors (catch (e) { return defaultValue; }) — assert both the return value AND that the error was recorded somewhere the operator can find it.
Missing async handling — assert that a rejected promise inside the function causes a surfaced failure (not just an unhandled-rejection warning).
No rollback around transactional work — assert that a failure mid-transaction leaves no partial state (row counts before and after match, queue stays unchanged, etc.).

Assertion pattern: instead of expect(result).toBe(null) (which passes for both "handled gracefully" and "silent drop"), prefer expect(logger.error).toHaveBeenCalledWith(expect.any(DatabaseError)) — make the observable signal part of the contract.
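The same assertion pattern, sketched in Python under stated assumptions: `load_orders`, `DatabaseError`, and the `orders` logger name are hypothetical, and a small list-backed `logging.Handler` captures records so the test can inspect them.

```python
import logging

logger = logging.getLogger("orders")


class DatabaseError(Exception):
    """Hypothetical error type standing in for a real driver error."""


def load_orders(fetch):
    """Hypothetical function that converts an error into a default."""
    try:
        return fetch()
    except DatabaseError:
        # The log call is the observable signal the test pins down.
        logger.exception("failed to load orders")
        return []


class ListHandler(logging.Handler):
    """Collects log records in memory for inspection."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


def test_logs_database_error_when_fetch_fails() -> None:
    handler = ListHandler()
    logger.addHandler(handler)
    try:
        def failing_fetch():
            raise DatabaseError("connection lost")

        result = load_orders(failing_fetch)
    finally:
        logger.removeHandler(handler)

    # The default alone cannot distinguish "handled" from "silently dropped".
    assert result == []
    # The logged exception is what makes the failure observable.
    assert len(handler.records) == 1
    record = handler.records[0]
    assert record.levelno == logging.ERROR
    assert isinstance(record.exc_info[1], DatabaseError)
```

If the `logger.exception` line were removed from `load_orders`, the return-value assertion would still pass and only the record assertions would fail.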

Red-Green-Refactor (When It Applies)

Tests written first answer "what should this do?" Tests written afterward answer "what does this do?" The distinction matters: tests written after implementation are biased toward verifying what you built, not what's required.

For bug fixes, writing the failing test first is genuinely valuable -- it proves the bug exists and proves the fix works. For new features, the order is less critical than the quality.

Bug fixes: prove-it pattern

The failing test is proof the bug exists. The passing test is proof the fix works. Without both halves, there is no proof -- just coincidence.

1. Write a test that reproduces the bug
2. Run it and watch it fail -- confirm it fails for the right reason. A test that fails due to a typo or import error hasn't captured the bug. The failure message should describe the buggy behavior.
3. Apply the fix
4. Run it and watch it pass -- confirm the fix addresses the specific failure AND other tests still pass. A fix that breaks something else isn't a fix.
5. If the test passes immediately without a fix, the test is verifying existing behavior, not the bug. Go back to step 1.

This is non-negotiable for bugs -- a fix without a regression test is a fix that will break again. The two-run sequence (fail then pass) is the proof. Skipping the first run means the test might pass for reasons unrelated to the fix.
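The fail-then-pass sequence can be illustrated with a toy example. Everything here is hypothetical: an off-by-one pagination bug, a buggy and a fixed implementation kept side by side only so that both runs fit in one snippet, and a `fails` helper that reports whether the check raised.

```python
def total_pages_buggy(item_count: int, per_page: int) -> int:
    """Hypothetical bug: floor division drops a partial last page."""
    return item_count // per_page


def total_pages_fixed(item_count: int, per_page: int) -> int:
    """Hypothetical fix: round up so a partial page is counted."""
    return (item_count + per_page - 1) // per_page


def check_counts_partial_last_page(total_pages) -> None:
    """The regression test, parameterized over the implementation."""
    actual = total_pages(11, 10)
    # The message describes the buggy behavior, not a typo or import error.
    assert actual == 2, f"expected 2 pages for 11 items, got {actual}"


def fails(check, implementation) -> bool:
    """Return True if the check raises AssertionError."""
    try:
        check(implementation)
    except AssertionError:
        return True
    return False


# First run: the test fails against the buggy code (the bug is proven).
first_run_failed = fails(check_counts_partial_last_page, total_pages_buggy)
# Second run: the same test passes against the fix (the fix is proven).
second_run_failed = fails(check_counts_partial_last_page, total_pages_fixed)
```

In a real repository there is only one implementation; the two runs happen before and after the fix is applied.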

New features: test alongside

Write tests as you build, not after. "I'll add tests later" means "I won't add tests."

The goal: by the time the feature is done, tests exist and pass. Whether you wrote the test 5 minutes before or 5 minutes after the code matters less than whether the test exists and is good.

Minimum viability during green phase: When making a test pass, write the simplest code that satisfies it. Not the abstraction you think is "right," not the feature you imagine you'll need next. The simplest thing. Refactor only after the test is green.

Anti-Patterns

Testing mock behavior instead of real behavior

Symptom: Test passes but production breaks. Tests assert that mocks were called correctly, not that the actual system works.

Fix: Replace mocks with real objects for internal code. Only mock at system boundaries (external APIs, email, payment).

Test-only methods in production code

Symptom: Methods like reset(), clearState(), setTestMode() that exist only because tests need them.

Fix: If tests need to reset state, the code has a design problem. Refactor to make state explicit and injectable.
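One possible shape for "explicit and injectable" state, sketched with a hypothetical `RateLimiter`. The class takes its counter store as a constructor argument, so tests build a fresh instance or inject a prepared state, and no `reset()` method is needed.

```python
class RateLimiter:
    """Hypothetical limiter whose state is passed in, not hidden."""

    def __init__(self, limit: int, counts=None) -> None:
        self.limit = limit
        # State is injectable: callers may supply an existing mapping.
        self.counts = {} if counts is None else counts

    def allow(self, key: str) -> bool:
        used = self.counts.get(key, 0)
        if used >= self.limit:
            return False
        self.counts[key] = used + 1
        return True


def test_allows_up_to_limit() -> None:
    limiter = RateLimiter(limit=2)  # fresh state per test
    assert limiter.allow("ada")
    assert limiter.allow("ada")
    assert not limiter.allow("ada")


def test_blocks_when_state_is_at_limit() -> None:
    # Inject the starting state instead of calling allow() in a loop.
    limiter = RateLimiter(limit=2, counts={"ada": 2})
    assert not limiter.allow("ada")
    assert limiter.allow("grace")
```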

Snapshot tests as the only test

Symptom: All tests are snapshots that get bulk-updated whenever anything changes.

Fix: Snapshots catch unintended changes but don't verify correctness. Add behavioral assertions alongside snapshots.

Testing the framework

Symptom: Tests verify that the ORM saves records, the router routes requests, or the framework does what its docs say.

Fix: Trust the framework. Test YOUR logic -- the business rules, transformations, and decisions your code makes.

Incomplete mocks

Symptom: Mock only includes the fields the test author knows about. Downstream code consumes other fields and gets undefined.

Fix: Mock the COMPLETE data structure as it exists in reality, not just the fields the immediate test uses. Before creating a mock response, check what fields the real API/type contains -- include ALL fields the system might consume downstream. Use real objects or factory-generated fixtures with all fields populated. If you must mock, generate from the real type/schema.
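A factory-generated fixture might look like the sketch below. The `User` dataclass, `make_user` factory, and `greeting` function are hypothetical; the idea is that every field of the real type is always populated and a test overrides only what it cares about.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class User:
    """Hypothetical real type that production code consumes."""

    id: int
    email: str
    display_name: str
    is_active: bool


def make_user(**overrides) -> User:
    """Build a fully populated User, then apply per-test overrides."""
    base = User(
        id=1,
        email="ada@example.com",
        display_name="Ada",
        is_active=True,
    )
    # replace() raises TypeError for a field the real type does not have,
    # so a misspelled or stale field name fails loudly.
    return replace(base, **overrides)


def greeting(user: User) -> str:
    """Hypothetical downstream code that reads more than one field."""
    if not user.is_active:
        return "Account disabled"
    return f"Hello, {user.display_name}"


def test_greets_active_user() -> None:
    assert greeting(make_user(display_name="Grace")) == "Hello, Grace"


def test_disabled_account_message() -> None:
    assert greeting(make_user(is_active=False)) == "Account disabled"
```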

Mocking without understanding

Before mocking any method, ask: (1) What side effects does the real method have? (2) Does this test depend on any of those side effects? Then mock at the lowest level that removes the slow or external part, and no higher.

AI-generated test smells

Tests written by LLMs (including your own) tend to share a specific class of failures. Scan for these before committing:

Mock of the system under test — mocking the very function being tested, so the test asserts what the mock returned. Always a mistake. Delete the mock; call the real function.
Circular assertion — computing the expected value the same way the code computes the actual value (expect(sum(a,b)).toBe(a+b)). The test passes even when both are wrong. Replace with a hand-computed expected value or a known fixture.
Snapshot of unreviewed output — first-run snapshot committed without reading it. The snapshot enshrines whatever the code happened to emit, bugs included. Hand-write the first snapshot or diff it line by line before accepting.
Assertion-free exercise — test calls the function, checks nothing, passes because nothing threw. Every test needs at least one expect(...) / assert ... tied to the behavior under test.
Over-broad matchers — expect(result).toBeTruthy() on a function that returns an object. Passes for {}, true, "anything", all equally. Pin to the specific shape.
Implementation-echo assertions — expect(repo.save).toHaveBeenCalledTimes(1) when the real contract is "the user exists in the database afterward." Assert on outcomes, not call counts.
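Two of the smells above, circular assertions and over-broad matchers, translated into a Python sketch. `apply_discount` and `describe_order` are hypothetical; the expected value 1700 was computed by hand (1999 * 15 = 29985, floor-divided by 100 is 299, and 1999 - 299 = 1700).

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical function under test; the discount is rounded down."""
    return price_cents - price_cents * percent // 100


def describe_order(price_cents: int, percent: int) -> dict:
    """Hypothetical function returning an object-like result."""
    return {
        "total_cents": apply_discount(price_cents, percent),
        "currency": "USD",
    }


def test_circular_assertion() -> None:
    # Smell: the expectation repeats the implementation's formula, so a
    # wrong formula would be wrong on both sides and still pass.
    assert apply_discount(1999, 15) == 1999 - 1999 * 15 // 100


def test_hand_computed_expectation() -> None:
    # Better: the expected value is worked out independently.
    assert apply_discount(1999, 15) == 1700


def test_over_broad_matcher() -> None:
    # Smell: any non-empty dict satisfies this.
    assert describe_order(1999, 15)


def test_pinned_shape() -> None:
    # Better: pin the exact shape and values.
    assert describe_order(1999, 15) == {"total_cents": 1700, "currency": "USD"}
```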

Persistent test infrastructure state contamination

Symptom: Integration tests fail with row-count multipliers — expected 2 rows, got 8; expected 3 jobs dispatched, got 12. The numbers look like a code bug ("the loop runs N times instead of once"), but they're clean integer multiples of the expected value, and the same test passes in CI on a fresh container.

Root cause: Persistent test infrastructure (long-running docker compose up, a shared local database, a volume left between iterations) accumulates state across test runs. The current run's data sits on top of the previous run's data; assertions counting rows or jobs see the sum.

Diagnostic shortcut: if expected vs. actual differs by a clean integer multiple (2x, 3x, 4x...), state contamination is more likely than a logic bug. Real logic bugs rarely produce uniform multipliers across unrelated assertions.

Fix: reset infrastructure state between runs. In order of preference:

Ephemeral containers per test session (testcontainers, pytest-postgresql, or docker compose run --rm <service> for one-shot runs) — slowest to start, strongest isolation. Default for CI.
Fixture-driven TRUNCATE / DROP DATABASE in a session-scoped or per-test fixture — fast, requires careful coverage of every stateful table.
Volume teardown between iterations (docker compose down -v before each run) when running locally — manual but reliable.

Never rely on tests "cleaning up after themselves." If a previous run errored mid-test, the cleanup didn't run, and the next run inherits the partial state.
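A small sketch of the per-test isolation idea, using an in-memory SQLite database as a stand-in for an ephemeral container. The `fresh_db` context manager, the `jobs` table, and `enqueue_welcome_jobs` are all hypothetical.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def fresh_db():
    """Yield a brand-new database; nothing survives between uses."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (name TEXT)")
    try:
        yield conn
    finally:
        # Closing discards the in-memory database even if the test errored.
        conn.close()


def enqueue_welcome_jobs(conn: sqlite3.Connection) -> None:
    """Hypothetical code under test: dispatches two jobs."""
    conn.executemany(
        "INSERT INTO jobs (name) VALUES (?)", [("email",), ("sms",)]
    )


def job_count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]


def test_enqueues_two_jobs() -> None:
    with fresh_db() as conn:
        enqueue_welcome_jobs(conn)
        assert job_count(conn) == 2
```

Running `enqueue_welcome_jobs` twice against one shared connection yields 4 rows, the clean 2x multiplier described above; with `fresh_db` each run starts from zero.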

Vacuous forall over an empty collection

Symptom: A test asserts a forall-style predicate over a model's child collection (every, all, .iter().all(), a manual loop returning true) and it passes — but the factory that built the parent never attached children. Every such operator returns true over an empty collection, so the assertion is vacuously satisfied and never exercises the production shape where children exist.

Fix: When an assertion's truth depends on a child collection being non-empty, attach a realistic child set explicitly (->has(...), hand-attach via the relationship) — don't rely on the factory's minimal default. Confirm the predicate actually flips for at least one populated case, so the test could fail.
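The vacuous case is easy to reproduce in Python, where `all()` over an empty iterable returns `True`. The order shape and `all_items_shipped` below are hypothetical.

```python
def all_items_shipped(order: dict) -> bool:
    """Hypothetical predicate over a parent's child collection."""
    return all(item["shipped"] for item in order["items"])


def test_vacuous_forall() -> None:
    # Smell: the "factory" attached no children, so this passes without
    # ever evaluating the predicate on a real item.
    order = {"items": []}
    assert all_items_shipped(order)


def test_forall_over_populated_collection() -> None:
    shipped = {"items": [{"shipped": True}, {"shipped": True}]}
    partial = {"items": [{"shipped": True}, {"shipped": False}]}
    # Guard against the vacuous case before trusting the predicate.
    assert shipped["items"] and partial["items"]
    assert all_items_shipped(shipped)
    # The predicate flips for a populated case, so the test can fail.
    assert not all_items_shipped(partial)
```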

Constructing the object-under-test below the layer that transforms it

Symptom: The fix guards or transforms a field in an upstream layer (a parser, normalizer, from_api_response constructor, serializer), but the test builds the object directly via the leaf constructor — Model(field=x), new T(...), the raw initializer — injecting the already-correct value. The upstream strip/transform never runs, the guard never fires, and the test is green while production is still broken.

Fix: Enter through the same entry point production uses — drive the test from the input the upstream layer actually receives (the raw API payload, the unparsed dict) so the transform under test executes. If a test must construct the leaf form directly, it is not covering the transform; add a separate test that feeds the pre-transform input.
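A sketch of the difference, assuming a hypothetical `Account` model whose `from_api_response` constructor strips whitespace from the username. Only the second test executes the transform.

```python
from dataclasses import dataclass


@dataclass
class Account:
    """Hypothetical model; the upstream constructor does the cleanup."""

    username: str

    @classmethod
    def from_api_response(cls, payload: dict) -> "Account":
        # The transform under test: strip stray whitespace from the API.
        return cls(username=payload["username"].strip())


def test_leaf_constructor_skips_transform() -> None:
    # Smell: the already-clean value is injected, so strip() never runs.
    # This stays green even if from_api_response() is broken.
    account = Account(username="ada")
    assert account.username == "ada"


def test_username_is_stripped_from_raw_payload() -> None:
    # Enter through the production entry point with pre-transform input.
    account = Account.from_api_response({"username": "  ada\n"})
    assert account.username == "ada"
```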

Synchronous adapters hide timing-dependent races

Symptom: A test fires two or more parallel requests through a mock/adapter that resolves synchronously (a promise that settles in the same microtask, an in-memory fake with zero latency) and asserts a coalescing/dedup/single-flight guard held. It passes — but only because every call observed the shared in-flight state before any reset ran. Under real wire latency the staggered arrivals miss the window, and the guard spawns N operations instead of one.

Fix: Don't treat same-tick microtask concurrency as a proxy for production burst behavior. For dedup/coalescing logic, inject controllable latency (fake timers, deferred resolution staggered across ticks) so a later arrival lands after the reset. Assert the guard holds for arrival-staggered bursts, not just same-tick ones.
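One way to stagger arrivals is sketched below with `asyncio`. The `SingleFlight` guard is a deliberately naive, hypothetical implementation that resets as soon as the shared call settles, and `count_calls` is an assumed test helper with controllable latency and per-caller arrival delays. Real sleeps are used for brevity; fake timers would make this faster and fully deterministic.

```python
import asyncio


class SingleFlight:
    """Hypothetical naive guard: shares one in-flight call, then resets."""

    def __init__(self, operation) -> None:
        self._operation = operation
        self._inflight = None

    async def run(self):
        if self._inflight is None:
            self._inflight = asyncio.create_task(self._call())
        return await self._inflight

    async def _call(self):
        try:
            return await self._operation()
        finally:
            self._inflight = None  # the reset


async def count_calls(arrival_delays, latency: float) -> int:
    """Fire one caller per delay; return how many operations started."""
    calls = 0

    async def operation():
        nonlocal calls
        calls += 1
        await asyncio.sleep(latency)  # injected "wire" latency
        return "fresh"

    guard = SingleFlight(operation)

    async def arrive(delay: float):
        await asyncio.sleep(delay)  # stagger this caller's arrival
        return await guard.run()

    await asyncio.gather(*(arrive(d) for d in arrival_delays))
    return calls


def test_same_tick_burst_coalesces() -> None:
    # Every caller sees the in-flight task before the reset runs.
    assert asyncio.run(count_calls([0, 0, 0], latency=0)) == 1


def test_staggered_burst_starts_second_operation() -> None:
    # The late caller arrives after the reset, so this guard starts again.
    assert asyncio.run(count_calls([0, 0.05], latency=0.01)) == 2
```

The second test characterizes the naive guard rather than endorsing it: if the contract is one operation per burst, the assertion should be `== 1`, and it would fail here, which the same-tick test alone cannot reveal.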

Asserting only presence, never absence

Symptom: Tests for a payload builder, serializer, or DTO assert positively on the fields that should be there (assert payload["id"] == x) and never assert that unexpected fields are absent. When a shared builder is reused across verbs/contexts (CREATE vs UPDATE, the same serializer on multiple endpoints), a new field leaking into the wrong payload passes every existing test.

Fix: For any payload/serializer whose field set is a contract, pin absence as well as presence: assert "proof_document_id" not in payload. The negative assertion is the contract guard — the positive one passes whether or not extra fields leak in.
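A sketch of pinning absence, assuming a hypothetical shared `build_payload` builder used for both create and update. The `proof_document_id` field name comes from the text above; the rest is illustrative.

```python
RECORD = {"id": 7, "name": "Ada", "proof_document_id": "doc-42"}


def build_payload(action: str, record: dict) -> dict:
    """Hypothetical shared builder reused across verbs."""
    payload = {"id": record["id"], "name": record["name"]}
    if action == "create":
        # Only CREATE is allowed to carry the proof document reference.
        payload["proof_document_id"] = record["proof_document_id"]
    return payload


def test_update_payload_presence_only() -> None:
    # Weak: passes whether or not extra fields leak into the payload.
    payload = build_payload("update", RECORD)
    assert payload["id"] == 7


def test_update_payload_excludes_proof_document() -> None:
    payload = build_payload("update", RECORD)
    # Negative assertion: the field must not leak into UPDATE.
    assert "proof_document_id" not in payload
    # Stricter variant: pin the whole field set as the contract.
    assert set(payload) == {"id", "name"}


def test_create_payload_includes_proof_document() -> None:
    payload = build_payload("create", RECORD)
    assert payload["proof_document_id"] == "doc-42"
```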

When Stuck

| Stuck on... | Do this |
|-------------|---------|
| Don't know how to test | Write the assertion first (desired outcome), then build the test around it |
| Test too complicated | Simplify the interface being tested |
| Must mock everything | Code is too coupled -- use dependency injection |
| Test setup too large | Extract helpers that reduce noise without hiding test intent (see DAMP). Still complex? Simplify the design |

Rationalization Table

When you catch yourself thinking "this is too simple to need tests", "I'll add tests later", "the deadline is too tight", or similar -- stop and load rationalization-table.md, which lists thirteen common excuses with their counter-truths. If you're arguing against writing a test, you're probably losing that argument.

Verify

Before considering tests complete:

[ ] Every new public function/endpoint has at least one test
[ ] Each test has a descriptive name stating expected behavior
[ ] Tests use real objects where possible (mocks only at system boundaries)
[ ] Edge cases covered (empty, null, boundary, error paths)
[ ] Tests assert on outcomes, not implementation details
[ ] Tests are independent -- no shared mutable state between tests. If tests pass individually but fail together, bisect to find the polluter: run the failing test after successively smaller subsets of the tests that precede it until the offending test is isolated
[ ] Tests run fast enough to run frequently (< 30 seconds for unit suite)
[ ] Bug fix tests reproduce the original bug

Integration

This skill is referenced by:

/ia-work -- when adding tests for new functionality (Phase 2)
ia-debugging -- when creating failing tests to reproduce bugs
ia-verification-before-completion -- tests as primary verification evidence

Tech-Specific Skills

This skill provides generic test discipline. For framework-specific patterns, conventions, and tooling:

Laravel/PHP → ia-php-laravel (PHPUnit, factories, feature/unit split, facade faking, data providers)
React/TypeScript → ia-react-frontend (Vitest, RTL, component/hook patterns, Playwright E2E, mocking patterns)

Both skills are complementary -- this skill covers principles (why and what to test), tech-specific skills cover implementation (how to test in that framework). When both are active, framework-specific guidance takes precedence for tooling and conventions.
