skills/testing/SKILL.md
Testing strategy, patterns, and evaluation for software and LLM/AI systems. Use when: writing tests, choosing test boundaries, designing test data, structuring test suites, evaluating LLM outputs, building evaluation pipelines, setting coverage thresholds, auditing test coverage gaps in existing projects, or improving test quality and structure.
npx skillsauth add michaelsvanbeek/personal-agent-skills testing

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
| Level | Scope | Speed | Count |
|-------|-------|-------|-------|
| Unit | Single function/class, no I/O | <10ms each | Many (70%+) |
| Integration | Module boundaries, real DB/API calls | <1s each | Moderate (20%) |
| E2E | Full user workflow through the system | Seconds | Few (10%) |
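As a rough illustration of the first two rows, the sketch below tests a hypothetical `apply_discount` function at the unit level (pure computation, no I/O) and a hypothetical `save_order` helper at the integration level against a real in-memory SQLite database. Both function names and the `orders` schema are assumptions made up for this example.

```python
import sqlite3


def apply_discount(total: float, percent: float) -> float:
    """Hypothetical pure function: no I/O, so it belongs at the unit level."""
    return round(total * (1 - percent / 100), 2)


def save_order(conn: sqlite3.Connection, order_id: str, total: float) -> None:
    """Hypothetical persistence helper: crosses the module/database boundary."""
    conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total))
    conn.commit()


# Unit: single function, no I/O.
def test_applies_percentage_discount_to_total():
    assert apply_discount(200.0, 10) == 180.0


# Integration: exercises a real (in-memory) database through the boundary.
def test_saved_order_can_be_read_back():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
    save_order(conn, "ord_1", 180.0)
    row = conn.execute(
        "SELECT total FROM orders WHERE id = ?", ("ord_1",)
    ).fetchone()
    assert row == (180.0,)
    conn.close()
```

Keeping the two kinds of test in separate functions (or separate directories) makes it easy to run only the fast unit tier during local development.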
Use pytest fixtures and unittest.mock.patch in Python. Use vi.mock / vi.fn in Vitest.

Tests should read as specifications. Name them by behavior, not implementation:
```python
# Good
class TestOrderProcessing:
    def test_applies_discount_when_coupon_is_valid(self): ...
    def test_rejects_order_when_inventory_is_insufficient(self): ...
    def test_returns_empty_list_when_no_orders_exist(self): ...


# Bad
class TestOrderService:
    def test_process(self): ...
    def test_error(self): ...
```
```typescript
// Good
describe("OrderService", () => {
  it("applies discount when coupon is valid", () => { /* ... */ });
  it("rejects order when inventory is insufficient", () => { /* ... */ });
});
```
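To show what one of these behavior-named tests might look like once filled in, here is a minimal sketch. `OrderService`, `place_order`, `InsufficientInventoryError`, and the injected inventory client are hypothetical names invented for this example; the inventory dependency is replaced with `unittest.mock.Mock` so the test stays at the unit level.

```python
from unittest.mock import Mock


class InsufficientInventoryError(Exception):
    """Hypothetical domain error for this example."""


class OrderService:
    """Hypothetical service; the inventory client is injected so tests can replace it."""

    def __init__(self, inventory):
        self._inventory = inventory

    def place_order(self, sku: str, quantity: int) -> dict:
        if self._inventory.available(sku) < quantity:
            raise InsufficientInventoryError(sku)
        self._inventory.reserve(sku, quantity)
        return {"sku": sku, "quantity": quantity, "status": "placed"}


def test_rejects_order_when_inventory_is_insufficient():
    # Arrange: a mock inventory that reports only 1 unit in stock.
    inventory = Mock()
    inventory.available.return_value = 1
    service = OrderService(inventory)

    # Act / Assert: ordering 5 units must fail and reserve nothing.
    try:
        service.place_order("sku_123", 5)
    except InsufficientInventoryError:
        pass
    else:
        raise AssertionError("expected InsufficientInventoryError")
    inventory.reserve.assert_not_called()
```

Under pytest the `try`/`except` block would normally be written as `with pytest.raises(InsufficientInventoryError):`; the plain form is used here only to keep the sketch dependency-free.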
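```python
def make_user(**overrides) -> User:
    # User is the application's model class, assumed to be imported in the test module.
    defaults = {
        "id": "usr_test",
        "name": "Test User",
        "email": "test@example.com",  # placeholder address
        "role": "viewer",
    }
    # Dict union (|) requires Python 3.9+; overrides win over defaults.
    return User(**(defaults | overrides))


# Usage
admin = make_user(role="admin")
```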
```typescript
function makeUser(overrides: Partial<User> = {}): User {
  return {
    id: "usr_test",
    name: "Test User",
    email: "test@example.com", // placeholder address
    role: "viewer",
    ...overrides, // spread last so overrides win over defaults
  };
}
```
- Use pytest's tmp_path for file system tests. Never write to real paths.
- Use freezegun or time_machine in Python for time-dependent tests.

Tests play a critical gate role at release time (see release-management skill for full workflow):
v2.0.0-beta.1) should run the same test gates as stable releases.

Performance testing is a specialized discipline with its own tools and strategies:
Integrate performance tests into CI using the tiered approach from the performance-testing skill: smoke tests on every PR, full load tests on release tags.
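One way to express that tiering in a CI helper script is sketched below. The function name, the tier labels, and the GitHub-Actions-style `event` and `ref` values are all assumptions for illustration; the performance-testing skill remains the reference for the actual workflow.

```python
import re

# Assumed tag convention: refs/tags/v<major>.<minor>.<patch>, optionally
# followed by a pre-release suffix such as -beta.1.
RELEASE_TAG = re.compile(r"^refs/tags/v\d+\.\d+\.\d+")


def select_performance_tier(event: str, ref: str) -> str:
    """Map a CI trigger to a performance-test tier (hypothetical labels)."""
    if event == "pull_request":
        return "smoke"  # quick checks on every PR
    if event == "push" and RELEASE_TAG.match(ref):
        return "full-load"  # full load tests on release tags
    return "skip"


print(select_performance_tier("pull_request", "refs/pull/7/merge"))
print(select_performance_tier("push", "refs/tags/v2.0.0-beta.1"))
```

Because the pattern only anchors the start of the tag, pre-release tags fall into the same tier as stable releases.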
Testing LLM-based systems requires a different approach because outputs are non-deterministic and correctness is often fuzzy.
| Level | What it tests | How |
|-------|--------------|-----|
| Unit | Individual tools, parsers, formatters | Standard unit tests; these are deterministic |
| Component | Single LLM call with prompt + expected behavior | Assertion-based eval with LLM-as-judge fallback |
| Agent | Multi-step workflow end-to-end | Scenario-based eval with success criteria |
| Regression | Known failure cases don't regress | Golden dataset of input/expected-output pairs |
Score LLM outputs across multiple dimensions, not just "correct/incorrect":
| Dimension | What it measures | How to evaluate |
|-----------|-----------------|-----------------|
| Correctness | Factual accuracy of the output | Compare against golden answer; LLM-as-judge |
| Relevance | Does the output address the input? | LLM-as-judge or embedding similarity |
| Completeness | Are all required elements present? | Checklist of required fields/topics |
| Conciseness | No unnecessary content | Token count vs. golden answer length |
| Format compliance | Matches expected structure | Schema validation (JSON schema, regex) |
| Safety | No harmful, biased, or leaked content | Content classifier + keyword blocklist |
| Latency | Response time | Wall-clock measurement |
| Token efficiency | Cost per evaluation | Input + output token count |
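Several of these dimensions can be scored without a model in the loop. The sketch below covers format compliance, completeness, and conciseness; the function names and the 0.0 to 1.0 scale are assumptions for this example, and whitespace-separated words stand in for real token counts.

```python
import json


def score_format_compliance(output: str, required_keys: set[str]) -> float:
    """1.0 if the output is a JSON object containing every required key, else 0.0."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    return 1.0 if required_keys.issubset(parsed) else 0.0


def score_completeness(output: str, required_topics: list[str]) -> float:
    """Fraction of required topics mentioned (case-insensitive substring match)."""
    if not required_topics:
        return 1.0
    lowered = output.lower()
    hits = sum(1 for topic in required_topics if topic.lower() in lowered)
    return hits / len(required_topics)


def score_conciseness(output: str, golden: str) -> float:
    """1.0 when the output is no longer than the golden answer, lower as it grows.

    Words approximate tokens here; swap in a real tokenizer for exact counts.
    """
    golden_len = max(len(golden.split()), 1)
    output_len = max(len(output.split()), 1)
    return min(1.0, golden_len / output_len)
```

Reporting each dimension as its own number, rather than collapsing them into one score, keeps regressions in a single dimension visible.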
Build and maintain a curated evaluation dataset:
```python
eval_cases = [
    {
        "id": "order-summary-001",
        "input": "Summarize order #12345",
        "context": {"order": {...}},  # relevant context provided to the LLM
        "expected": "Order #12345: 3 items, total $47.50, shipped March 18",
        "criteria": {
            "must_contain": ["#12345", "$47.50"],
            "must_not_contain": ["credit card", "password"],
            "max_tokens": 100,
        },
    },
]
```
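The `criteria` block of each case can be checked programmatically. The following is a minimal sketch of such a checker; `check_criteria` is a hypothetical helper, and it approximates `max_tokens` by counting whitespace-separated words rather than using a real tokenizer.

```python
def check_criteria(actual: str, criteria: dict) -> list[str]:
    """Return human-readable failures; an empty list means the case passed."""
    failures = []
    for needle in criteria.get("must_contain", []):
        if needle not in actual:
            failures.append(f"missing required text: {needle!r}")
    for needle in criteria.get("must_not_contain", []):
        # Forbidden phrases are matched case-insensitively to be conservative.
        if needle.lower() in actual.lower():
            failures.append(f"contains forbidden text: {needle!r}")
    max_tokens = criteria.get("max_tokens")
    # Words approximate tokens; a real tokenizer would be more exact.
    if max_tokens is not None and len(actual.split()) > max_tokens:
        failures.append(f"too long: more than {max_tokens} tokens")
    return failures


criteria = {
    "must_contain": ["#12345", "$47.50"],
    "must_not_contain": ["credit card", "password"],
    "max_tokens": 100,
}
print(check_criteria("Order #12345: 3 items, total $47.50, shipped March 18", criteria))
# → []
```

Returning a list of failures instead of a single boolean makes it easier to see which criterion a regression broke.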
For dimensions that are hard to evaluate programmatically, use a capable model as a judge:
```python
# Template intended for str.format(); the literal JSON braces are doubled
# so they survive formatting.
judge_prompt = """
Rate the following response on a scale of 1-5 for correctness and relevance.

Input: {input}
Expected: {expected}
Actual: {actual}

Return JSON: {{"correctness": <1-5>, "relevance": <1-5>, "reasoning": "<brief explanation>"}}
"""
```
For functions with broad input domains, use property-based testing to generate many random inputs and verify invariants:
```python
from hypothesis import given, strategies as st


@given(st.lists(st.integers()))
def test_sort_preserves_length(items):
    assert len(sorted(items)) == len(items)


@given(st.lists(st.integers(), min_size=1))
def test_sort_first_element_is_minimum(items):
    assert sorted(items)[0] == min(items)
```
Use when:
For configuring pytest and Vitest test runners in VS Code / Cursor, including Test Explorer integration, debug configurations, and local task automation, see the ide-setup skill.
For iOS-specific testing patterns including Swift Testing framework (@Suite, @Test, #expect), ViewInspector for SwiftUI view testing, URLProtocol network mocking, SwiftData test containers, and protocol-based mock infrastructure, see the ios-testing skill.