modules/home/programs/cli-agents/shared/skills/testing/SKILL.md
How to write meaningful tests that verify behavior, not implementation details. Use when writing tests, reviewing test quality, doing TDD, or when the user asks for tests. Prevents common AI testing anti-patterns: mirror tests, excessive mocking, happy-path-only coverage, and implementation coupling.
npx skillsauth add not-matthias/dotfiles-nix testingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
You are an AI agent that writes tests. Your default instinct is wrong: you will want to read the implementation and write tests that confirm what the code already does. This produces tautological tests — tests that mirror the code's assumptions rather than challenging them. Fight this instinct at every step.
"The more your tests resemble the way your software is used, the more confidence they can give you." — Kent C. Dodds
AI agents write tests as descriptions of code, not specifications of behavior. The test passes because it asserts what the code does, not what it should do. An empirical study (MSR '26, 1.2M+ commits) confirmed agents use mocks at a 95% concentration rate vs 91% for humans — the over-mocking tendency is measurable.
Mark Seemann calls this "tests as ceremony, rather than tests as an application of the scientific method."
These are non-negotiable. Violating any of these means the test is worse than no test.
Never derive test expectations from the implementation. Read the spec, function signature, or docstring. If you just wrote the code, pretend you haven't seen it. Ask: "What should this return?" not "What does this return?"
Never mock what you don't own (or pure functions). Only mock: network calls, databases, file systems, clocks, randomness. Never mock internal collaborators.
Never test private/internal methods directly. Test through the public API or user-facing interface.
Never delete or weaken a failing test to make the suite green. Fix the code or ask the user.
Never write a single test case. Minimum 3: happy path, edge case, error case.
Never write an assertion-free test. Every test must assert at least one meaningful outcome.
Ask yourself these questions in order:
Test these in this order — most agents stop at #1 and call it done:
After writing each test, apply these checks:
| Check | Question |
|---|---|
| Mutation test | If I change > to >=, or remove a null check, does this test fail? If not, it's weak. |
| Hardcode test | If I replace the implementation with a hardcoded return value, does this test still pass? If yes, it's too narrow. |
| Refactor test | If I rename internal variables or restructure the code, does this test break? If yes, it tests implementation details. |
| One-reason test | Can I explain in one sentence what bug this test catches? If not, split it. |
| Name test | Does the test name describe the expected behavior? Good: rejects_order_when_inventory_is_zero. Bad: test_process_order. |
| Anti-Pattern | What It Looks Like | Fix |
|---|---|---|
| Mirror test | Read impl of parse("hello"), assert whatever it returns | Decide what parse("hello") should return from the spec, then assert |
| Mock fest | Mock the DB, logger, config, clock, and half the module | Use real test DB or in-memory equivalent; only mock true external boundaries |
| Implementation coupling | expect(spy).toHaveBeenCalledWith('_internalMethod') | Assert on observable output or side effects instead |
| Happy path only | Only test add(2, 3) == 5 | Add: add(0, 0), add(-1, 1), add(MAX, 1), invalid input |
| Kitchen sink | One test with 15 assertions checking everything | One behavior per test, one clear reason to fail |
| Snapshot addiction | toMatchSnapshot() on everything | Assert specific values; snapshots hide what matters |
| Test the framework | Assert that useState updates state | Test your code's behavior, not library internals |
Test-first is recommended because it prevents tautological tests by design — you can't mirror an implementation that doesn't exist yet.
When fixing a bug: always write a failing test that reproduces the bug before writing the fix. This is non-negotiable even if you're not doing TDD otherwise.
When writing tests after implementation: consciously ignore the implementation. Read only the function signature, types, and documentation. Derive expected values from the specification, not from running the code in your head.
For complex pure functions (parsers, validators, serializers, math), consider property-based tests instead of (or alongside) example-based tests. They catch edge cases you wouldn't think to write by generating hundreds of random inputs.
When to use:
parse(serialize(x)) == xWhen NOT to use:
| Situation | Test Type | |---|---| | Pure function with complex logic | Unit tests (many inputs) + property-based if fitting | | API endpoint / route handler | Integration test with real-ish DB | | UI component | Integration test (render + interact + assert DOM) | | Critical user workflow | E2E test (signup, checkout, etc.) | | Bug fix | Regression test at lowest level that reproduces it | | Glue code / simple delegation | Usually don't test — integration tests cover it |
Cover use cases, not lines.
No numeric coverage target. Instead, for every public function/endpoint/component, require:
If a function is too trivial to warrant three tests (getter, simple delegation), it probably doesn't need a dedicated test — integration tests will cover it.
rstest for parameterized tests — collapse boundary cases into one testproptest/quickcheck for property-based testing on complex pure functionsResult::Err variants explicitly — don't just test the Ok path@pytest.mark.parametrize to collapse boundary caseshypothesis for property-based testing on complex pure functionspytest.raises(ExceptionType, match="...") for error pathsuserEvent over fireEventmsw for API mocking instead of mocking fetch directlytoMatchSnapshot is almost never the right choiceThe implementation is hidden — you only see the contract. This is how you should approach testing.
pub struct Cart { /* hidden */ }
impl Cart {
pub fn new() -> Self;
/// Quantity must be 1..=99
pub fn add(&mut self, item_id: &str, price: f64, qty: u32) -> Result<()>;
/// Errors if item not in cart
pub fn remove(&mut self, item_id: &str) -> Result<()>;
/// Always >= 0.0, includes discounts
pub fn total(&self) -> f64;
/// Single-use. Errors if invalid or already used.
pub fn apply_discount(&mut self, code: &str) -> Result<()>;
}
#[test]
fn test_add() {
let mut cart = Cart::new();
cart.add("hat", 25.0, 1).unwrap();
assert_eq!(cart.items.len(), 1); // internal field!
assert_eq!(cart.items[0].qty, 1); // internal field!
}
#[test]
fn test_total() {
let mut cart = Cart::new();
cart.add("hat", 25.0, 2).unwrap();
assert_eq!(cart.total(), 50.0); // just mirrors 25*2
}
test_add reaches into cart.items — breaks if storage changes from Vec to HashMap. test_total only checks the obvious multiplication, derived from reading the impl. Neither test catches real bugs.
fn cart_with(item_id: &str, price: f64, qty: u32) -> Cart {
let mut c = Cart::new();
c.add(item_id, price, qty).unwrap();
c
}
#[rstest]
#[case(0, true)] // below range
#[case(1, false)] // lower bound
#[case(99, false)] // upper bound
#[case(100, true)] // above range
fn add_rejects_invalid_quantity(
#[case] qty: u32,
#[case] should_err: bool,
) {
let result = Cart::new().add("hat", 10.0, qty);
assert_eq!(result.is_err(), should_err);
}
#[test]
fn remove_nonexistent_item_errors() {
assert!(Cart::new().remove("nope").is_err());
}
#[test]
fn discount_code_is_single_use() {
let mut cart = cart_with("hat", 100.0, 1);
cart.apply_discount("SAVE10").unwrap();
assert!(cart.apply_discount("SAVE10").is_err());
}
#[test]
fn total_never_negative_after_large_discount() {
let mut cart = cart_with("hat", 5.0, 1);
let _ = cart.apply_discount("HALF_OFF");
assert!(cart.total() >= 0.0);
}
4 tests. Each targets a non-obvious spec constraint: quantity boundaries (parameterized), missing-item error, single-use invariant, non-negative invariant. The happy path is covered implicitly by cart_with in the other tests — no dedicated test needed.
Parameterized tests shine when testing a single function against many inputs. Use two test functions (positive and negative) instead of a boolean parameter — it's more readable and the test names are self-documenting.
#[test]
fn basic_java_command() {
assert!(command_has_executable("java -jar bench.jar", &["java"]));
}
#[test]
fn java_with_absolute_path() {
assert!(command_has_executable("/usr/bin/java -jar bench.jar", &["java"]));
}
#[test]
fn java_with_env_prefix() {
assert!(command_has_executable("FOO=bar java -jar bench.jar", &["java"]));
}
#[test]
fn gradle_chained_with_and() {
assert!(command_has_executable("cd /app && gradle bench", &["gradle"]));
}
#[test]
fn javascript_must_not_match_java() {
assert!(!command_has_executable("javascript-runtime run", &["java"]));
}
#[test]
fn javascript_path_must_not_match_java() {
assert!(!command_has_executable("/home/user/javascript/run.sh", &["java"]));
}
#[test]
fn scargoship_must_not_match_cargo() {
assert!(!command_has_executable("scargoship build", &["cargo"]));
}
7 tests, lots of boilerplate. Adding a new case means copy-pasting an entire function. The positive/negative intent is buried in assert! vs assert!.
use rstest::rstest;
#[rstest]
#[case("java -jar bench.jar", &["java"])]
#[case("/usr/bin/java -jar bench.jar", &["java"])]
#[case("FOO=bar java -jar bench.jar", &["java"])]
#[case("cd /app && gradle bench", &["gradle"])]
#[case("cat file | python script.py", &["python"])]
#[case("sudo java -jar bench.jar", &["java"])]
#[case("(cd /app && java -jar bench.jar)", &["java"])]
#[case("setup.sh; java -jar bench.jar", &["java"])]
#[case("try_first || java -jar bench.jar", &["java"])]
fn matches(#[case] command: &str, #[case] names: &[&str]) {
assert!(command_has_executable(command, names));
}
#[rstest]
#[case("javascript-runtime run", &["java"])]
#[case("/home/user/javascript/run.sh", &["java"])]
#[case("scargoship build", &["cargo"])]
#[case("node index.js", &["gradle", "java", "maven", "mvn"])]
fn does_not_match(#[case] command: &str, #[case] names: &[&str]) {
assert!(!command_has_executable(command, names));
}
Same coverage, adding a case is one line. The function names (matches / does_not_match) make the intent obvious without reading the assertion.
tools
Spawn the pi coding agent with a specific model/provider. Use when asked to run pi with a particular model, switch pi's model, use DeepSeek V4 Flash/Pro in pi, or look up pi's --model/--provider/--models CLI flags and thinking-level shorthand.
development
Navigate to directories using zoxide (frecency-based directory jumper). Use when the user says "go to", "navigate to", "cd to", "jump to" a project or directory by nickname/partial name (e.g. "go to my dotfiles", "jump to dot").
tools
Use when manipulating Zellij sessions, creating tabs or panes, sending commands to panes, capturing output, or looking up Zellij CLI commands for terminal multiplexer operations
development
Emulates not-matthias's technical blog writing style. Use when writing blog posts, technical articles, README content, or any long-form technical prose. Produces investigation-driven, first-person narratives with dry humor, practical code examples, and concrete takeaways.