plugins/autonomous-dev/skills/testing-guide/SKILL.md
GenAI-first testing with structural assertions, congruence validation, and tier-based test structure. Use when writing tests, setting up test infrastructure, or validating coverage. TRIGGER when: test, pytest, coverage, TDD, test patterns, congruence, validation. DO NOT TRIGGER when: production code implementation, documentation, config-only changes.
npx skillsauth add akaszubski/autonomous-dev testing-guideInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
What to test, how to test it, and what NOT to test — for a plugin made of prompt files, Python glue, and configuration.
Traditional unit tests work for deterministic logic. But most bugs in this project are drift — docs diverge from code, agents contradict commands, component counts go stale. GenAI congruence tests catch these. Unit tests don't.
Decision rule: Can you write assert x == y and it won't break next week? → Unit test. Otherwise → GenAI test or structural test.
An LLM evaluates one artifact against criteria. Use for: doc completeness, security posture, architectural intent.
pytestmark = [pytest.mark.genai]
def test_agents_documented_in_claude_md(self, genai):
agents_on_disk = list_agents()
claude_md = Path("CLAUDE.md").read_text()
result = genai.judge(
question="Does CLAUDE.md document all active agents?",
context=f"Agents on disk: {agents_on_disk}\nCLAUDE.md:\n{claude_md[:3000]}",
criteria="All active agents should be referenced. Score by coverage %."
)
assert result["score"] >= 5, f"Gap: {result['reasoning']}"
The most valuable pattern. An LLM checks two files that should agree. Use for: command↔agent alignment, FORBIDDEN lists, config↔reality.
def test_implement_and_implementer_share_forbidden_list(self, genai):
implement = Path("commands/implement.md").read_text()
implementer = Path("agents/implementer.md").read_text()
result = genai.judge(
question="Do these files have matching FORBIDDEN behavior lists?",
context=f"implement.md:\n{implement[:5000]}\nimplementer.md:\n{implementer[:5000]}",
criteria="Both should define same enforcement gates. Score 10=identical, 0=contradictory."
)
assert result["score"] >= 5
More reliable than holistic scoring. Each criterion is evaluated independently with a binary MET/UNMET judgment. Use for: security posture, enforcement quality, multi-faceted assessments.
def test_security_posture_analytic(self, genai):
result = genai.judge_analytic(
question="Evaluate the security posture of this codebase",
context=f"Hook samples:\n{hook_content[:5000]}",
criteria=[
{"name": "No hardcoded secrets", "description": "No real API keys or tokens in source", "max_points": 1},
{"name": "Named exit codes", "description": "Hooks use named constants, not bare numbers", "max_points": 1},
{"name": "Path validation", "description": "File operations validate paths", "max_points": 1},
],
)
assert result["total_score"] >= 2, f"{result['total_score']}/{result['max_score']}: {result['reasoning']}"
Return value: {"criteria_results": [...], "total_score": N, "max_score": N, "pass": bool, "band": str, "reasoning": str}
When to use: Multi-faceted evaluations where you need to know which specific criteria passed or failed. Each criterion gets its own LLM call for independent judgment.
For high-stakes judgments where a single LLM evaluation might be unreliable. Runs multiple rounds and checks for agreement. Uses median score as the final result.
def test_pipeline_completeness_consistent(self, genai):
result = genai.judge_consistent(
question="Does implement.md define a complete SDLC pipeline?",
context=f"implement.md:\n{content[:6000]}",
criteria="Pipeline should have research, plan, test, implement, review, security, docs steps.",
rounds=3,
)
assert result["final_score"] >= 7, f"median={result['final_score']}, agreement={result['agreement']}"
Return value: {"rounds": [...], "agreement": bool, "scores": [...], "final_score": median, "pass": bool, "band": str, "reasoning": str}
When to use: Critical assessments where false positives/negatives are costly. Agreement=False signals the evaluation needs human review.
All ask() calls default to temperature=0 for deterministic, reproducible judging. Override only when you need creative/diverse outputs:
# Default: temperature=0 (deterministic judging)
response = genai.ask("Evaluate this code", temperature=0)
# Override for creative tasks like edge case generation
response = genai.ask("Generate unusual test inputs", temperature=0.7)
No LLM needed. When two configs/files must stay in sync, read both and compare directly. Catches the #1 recurring bug class: adding something to one place but not the other.
def test_policy_and_hook_in_sync(self):
"""Policy always_allowed and hook NATIVE_TOOLS must be identical."""
policy_tools = set(json.load(open(POLICY_FILE))["tools"]["always_allowed"])
hook_tools = hook.NATIVE_TOOLS
# Check BOTH directions
assert policy_tools - hook_tools == set(), f"In policy not hook: {policy_tools - hook_tools}"
assert hook_tools - policy_tools == set(), f"In hook not policy: {hook_tools - policy_tools}"
When to use: Any time two files define overlapping data — permissions↔hook, manifest↔disk, config↔worktree copy, command frontmatter↔policy. Key principle: Read both sources dynamically. Never hardcode expected values in the test itself.
No LLM needed. Discover components dynamically and assert structural properties. Use for: component existence, manifest sync, skill loading.
def test_all_active_skills_have_content(self):
skills_dir = Path("plugins/autonomous-dev/skills")
for skill in skills_dir.iterdir():
if skill.name == "archived" or not skill.is_dir():
continue
skill_md = skill / "SKILL.md"
assert skill_md.exists(), f"Skill {skill.name} missing SKILL.md"
assert len(skill_md.read_text()) > 100, f"Skill {skill.name} is a hollow shell"
Define properties that must always hold, instead of testing specific examples. Catches 23-37% more bugs than example-based tests. Use for: pure functions, serialization, data transformations, parsers.
from hypothesis import given, strategies as st
@given(st.lists(st.integers()))
def test_sort_preserves_elements(arr):
"""Invariant: sorting never loses or adds elements."""
result = sorted(arr)
assert set(result) == set(arr)
assert len(result) == len(arr)
@given(st.dictionaries(st.text(min_size=1), st.text()))
def test_config_roundtrip(config):
"""Invariant: serialize → deserialize = identity."""
assert json.loads(json.dumps(config)) == config
When to use: Pure functions, roundtrips, idempotent operations, parsers. When NOT to use: Agent prompts (use GenAI judge), filesystem checks (use structural).
Good candidates (pure functions with testable invariants):
Bad candidates (avoid PBT for these):
Strategy rules:
.filter() instead of assume() for filtering invalid inputs@example() decorators with known edge cases alongside @given()Configure profiles via HYPOTHESIS_PROFILE environment variable:
# Default (local development): 50 examples per test
pytest tests/property/ -v
# CI mode: 200 examples per test, no deadline
HYPOTHESIS_PROFILE=ci pytest tests/property/ -v
Profiles are registered in tests/property/conftest.py:
default: max_examples=50 (fast local iteration)ci: max_examples=200, deadline=None (thorough CI runs)When code invokes external operations whose correctness depends on runtime context (CWD, environment variables, credentials, file permissions), assert on the kwargs, not just the cmd-list or return value. Static-shape tests (asserting cmd == ["claude", "-p", prompt]) silently pass even when cwd= or env= is wrong — and shipped Issue #1064 through a full 8-agent pipeline.
The principle: capture the full kwargs dict via monkeypatch, then assert on the specific runtime variable that affects correctness.
def test_subprocess_passes_cwd_to_avoid_context_bleed(monkeypatch) -> None:
"""Regression test: must pass cwd= so the spawned subprocess does not
inherit the parent's CWD."""
captured_kwargs: dict = {}
def fake_run(cmd, **kwargs):
captured_kwargs.update(kwargs)
return subprocess.CompletedProcess(
args=cmd, returncode=0, stdout="...", stderr=""
)
monkeypatch.setattr("mymodule.subprocess.run", fake_run)
my_function_under_test(...)
# Assert on the RUNTIME KWARG, not just the cmd-list
assert "cwd" in captured_kwargs, "subprocess.run must receive cwd="
assert captured_kwargs["cwd"] == str(Path.home()), (
f"cwd must be Path.home() (got {captured_kwargs['cwd']!r})"
)
When to use this pattern:
subprocess.run / subprocess.Popen whose correctness depends on cwd, env, timeout, input, or stdin.Path.cwd() or relative-path resolution.Anti-pattern to avoid:
# BAD — passes even when cwd is wrong, which is the bug Issue #1064 was
def test_subprocess_invokes_claude(monkeypatch):
def fake_run(cmd, **kwargs):
return subprocess.CompletedProcess(args=cmd, returncode=0, stdout="ok", stderr="")
monkeypatch.setattr("mymodule.subprocess.run", fake_run)
result = my_function_under_test(...)
# ↓ This static-shape assertion misses the entire CWD bug class.
assert result.returncode == 0
Reference example: tests/unit/scripts/test_extract_and_label_intent_corpus.py::test_call_claude_p_judge_passes_cwd_to_avoid_project_context (regression test for Issue #1064, the bug that motivated this pattern).
Pipeline integration: The plan-critic agent's Operational Integration Test axis enforces this pattern at plan time. Plans introducing subprocess/network/fs calls must specify either a kwarg-assertion test (this pattern) or an integration smoke test — score 1 if neither is specified.
# BAD — breaks every time a component is added/removed
assert len(agents) == 14
assert hook_count == 17
# GOOD — minimum thresholds + structural checks
assert len(agents) >= 8, "Pipeline needs at least 8 agents"
assert "implementer.md" in agent_names, "Core agent missing"
# BAD — test has its OWN copy of expected data, drifts from both real sources
VALID_TOOLS = {"Read", "Write", "Edit"} # stale copy in test
EXPECTED_COMMANDS = {"implement.md": {"Read", "Write"}} # another stale copy
assert actual_tools == VALID_TOOLS # passes even when BOTH sources are wrong
# GOOD — cross-validate real sources directly against each other
policy_tools = set(json.load(open(POLICY_FILE))["tools"]["always_allowed"])
hook_tools = hook.NATIVE_TOOLS
assert policy_tools == hook_tools, f"Drift: policy-only={policy_tools - hook_tools}"
# BEST — add GenAI test to catch gaps in BOTH sources
result = genai.judge(
question="Are any known tools missing from this list?",
context=json.dumps(sorted(hook_tools)),
criteria="Check against known Claude Code native tools..."
)
Rule: When two configs must stay in sync, read both dynamically and compare. Never create a third copy in the test — that's three things that can drift instead of two.
# BAD — breaks on every config update
assert settings["version"] == "3.51.0"
# GOOD — test structure, not values
assert "version" in settings
assert re.match(r"\d+\.\d+\.\d+", settings["version"])
# BAD — breaks on renames/moves
assert Path("plugins/autonomous-dev/lib/old_name.py").exists()
# GOOD — use glob discovery
assert any(Path("plugins/autonomous-dev/lib").glob("*skill*"))
Rule: If the test itself is the thing that needs updating most often, delete it.
Rule: When an audit produces an "affected files" list and some files are intentionally excluded after manual review (false positives), write a parametrized negative-assertion test that locks the exclusion. Without it, the next audit cycle re-flags the same files and the team re-litigates the decision.
Trigger when: an issue body lists files audited but intentionally not modified.
The test asserts the audit's marker string is absent from each excluded file, so any future "fix" to one of these files trips the test instead of slipping through. Each parametrize entry MUST be accompanied by a comment (or docstring section) explaining WHY this file is excluded — the lock is only useful if a future reader can quickly verify whether the exclusion still holds.
# GOOD — negative-assertion scope lock with per-file rationale
@pytest.mark.parametrize(
"file_path",
[
"plugins/autonomous-dev/lib/error_analyzer.py", # Excluded: ErrorAnalyzer != GenAIAnalyzer
"plugins/autonomous-dev/lib/codebase_analyzer.py", # Excluded: CodebaseAnalyzer != GenAIAnalyzer
],
)
def test_audit_false_positive_files_unchanged(file_path: str) -> None:
"""Lock Issue #1007 audit: these files do NOT call GenAIAnalyzer.
Each file is excluded because its name superficially matches the audit
pattern but the code does not. Catches future re-flagging.
"""
content = (REPO_ROOT / file_path).read_text()
assert "from genai_utils import GenAIAnalyzer" not in content, (
f"{file_path} now imports GenAIAnalyzer — re-verify Issue #1007 audit."
)
# BAD — comment-only exclusion (drifts; no enforcement)
# Issue #1007 audit: skip error_analyzer.py, codebase_analyzer.py (false positives).
# (Next agent ignores the comment and "fixes" one of them.)
Reference: tests/unit/lib/test_phase3_wrap_adoption.py::test_phase3_false_positive_files_unchanged and the Negative-Assertion Scope Locks subsection in docs/TESTING-STRATEGY.md (Layer 2).
No manual @pytest.mark needed — directory location determines tier. Source of truth: plugins/autonomous-dev/lib/tier_registry.py.
| Tier | Lifecycle | Directory | Markers | Max Duration |
|------|-----------|-----------|---------|--------------|
| T0 | permanent | tests/genai/ | genai, acceptance | - |
| T0 | permanent | tests/regression/smoke/ | smoke | 5s |
| T1 | stable | tests/e2e/ | e2e, slow | 5min |
| T1 | stable | tests/integration/ | integration | 30s |
| T2 | semi-stable | tests/regression/regression/ | regression | 30s |
| T2 | semi-stable | tests/regression/extended/ | extended, slow | 5min |
| T2 | semi-stable | tests/property/ | property, slow | 5min |
| T3 | ephemeral | tests/regression/progression/ | progression, tdd_red | - |
| T3 | ephemeral | tests/unit/ | unit | 1s |
| T3 | ephemeral | tests/hooks/ | hooks, unit | 1s |
| T3 | ephemeral | tests/security/ | unit | 1s |
Lifecycle definitions:
Where to put a new test:
regression/smoke/regression/regression/unit/integration/e2e/genai/Run commands:
pytest -m smoke # CI gate (T0)
pytest -m "smoke or regression" # Feature protection (T0+T2)
pytest -m "not slow" # Fast tests only
pytest tests/genai/ --genai # GenAI validation (opt-in, T0)
# tests/genai/conftest.py provides two fixtures:
# - genai: Gemini Flash via OpenRouter (cheap, fast)
# - genai_smart: Haiku 4.5 via OpenRouter (complex reasoning)
# Requires: OPENROUTER_API_KEY env var + --genai pytest flag
# Cost: ~$0.02 per full run with 24h response caching
Scaffold for any repo: /scaffold-genai-uat generates the full tests/genai/ setup with portable client, universal tests, and project-specific congruence tests auto-discovered by GenAI.
| Test This | With This | Not This | |-----------|-----------|----------| | Pure Python functions | Unit tests | — | | Component interactions | Integration tests | — | | Doc ↔ code alignment | GenAI congruence | Hardcoded string matching | | Two configs in sync | Cross-validation | Hardcoded intermediary list | | Component existence | Structural (glob) | Hardcoded counts | | FORBIDDEN list sync | GenAI congruence | Manual comparison | | Security posture | GenAI judge | Regex scanning | | Config structure | Structural | Config values | | Agent output quality | GenAI judge | Output string matching |
Link tests to GitHub issues for traceability. The TestIssueTracer library (plugins/autonomous-dev/lib/test_issue_tracer.py) scans for these patterns automatically.
| Pattern | Example | Type |
|---------|---------|------|
| Class name | class TestIssue656: | class_name |
| Function name | def test_issue_589_regression(): | function_name |
| Docstring | """Regression for #656""" | docstring |
| Comment | # Issue: #656 | comment |
| GH shorthand | GH-42 | gh_shorthand |
| Pytest marker | @pytest.mark.issue(656) | marker |
class TestIssue656 or # Fixes #656)"""Implements #675""")# Quick check: does an issue have a test?
from test_issue_tracer import TestIssueTracer
tracer = TestIssueTracer(Path('.'))
tracer.check_issue_has_test(675) # True/False
# Full analysis report
report = tracer.analyze()
print(report.format_table())
Run via /audit --test-tracing for a full tracing report.
An independent agent writes behavioral tests from the spec/acceptance criteria ONLY, without seeing the implementation code or implementer output. This catches cases where the implementation satisfies its own tests but drifts from the original specification.
Isolation rules:
tests/spec_validation/ (separate from unit tests in tests/unit/)Test placement: tests/spec_validation/test_spec_{feature_name}.py
Complementarity with other test types:
The spec-validator adds value because it is structurally blind to implementation details. Even if the implementer writes comprehensive unit tests, those tests are influenced by HOW the code was written. The spec-validator tests WHAT the spec requires.
Verdict: Binary only. SPEC-VALIDATOR-VERDICT: PASS or SPEC-VALIDATOR-VERDICT: FAIL. No partial credit.
Mutation testing validates that your tests actually catch real bugs, not just exercise code paths. Coverage metrics give false confidence; mutation testing proves test quality.
mutmut introduces small code changes (mutants) — flipping < to <=, True to False, + to - — and checks if your tests detect them. If a test suite still passes after a mutation, that mutant "survived" and your tests have a gap.
# Run against three critical files (default)
bash scripts/run_mutation_tests.sh
# Run against a single file
bash scripts/run_mutation_tests.sh --file plugins/autonomous-dev/lib/pipeline_state.py
# Run in CI mode (summary output, non-blocking)
bash scripts/run_mutation_tests.sh --ci
# Run against all of lib/
bash scripts/run_mutation_tests.sh --all
pipeline_state.py, tool_validator.py, settings_generator.py)Not all surviving mutants indicate test gaps. Some mutations produce functionally equivalent code:
Low-value (skip these):
"error" to "XXerrorXX") — rarely affects behaviorHigh-value (kill these):
< to <=, == to !=) — missing boundary tests+ to -, * to /) — missing calculation testsTrue to False, and to or) — missing logic testsMutation testing is orthogonal to the test tier system. It measures test quality (do tests catch bugs?) rather than test coverage (does code run?). Use it as a quality-of-tests metric:
@pytest.mark.skip is forbidden for new code. Fix it or adjust expectations.test_regression_issue_NNN_description.--genai flag required, no surprise API costs.hypothesis invariants over hardcoded input/output pairs where applicable.development
One topic, one home. Routes content to its canonical store (CLAUDE.md, PROJECT.md, MEMORY.md, docs/, memory/) and audits for duplication. TRIGGER when: auditing CLAUDE.md/PROJECT.md/MEMORY.md sizes, deduplicating docs, applying the content-allocation pattern to a new repo, running /align --content. DO NOT TRIGGER when: implementing features, writing tests, routine code edits, debugging.
testing
Prompt engineering patterns for writing agent prompts and skill files — constraint budgets, register shifting, HARD GATE patterns, anti-personas. Use when writing or reviewing agents/*.md or skills/*/SKILL.md. TRIGGER when: agent prompt, skill file, prompt engineering, model-tier compensation, HARD GATE, prompt quality. DO NOT TRIGGER when: user-facing docs, README, CHANGELOG, config files.
testing
7-step planning workflow for pre-implementation design. Enforced by plan_gate hook, critiqued by plan-critic agent. Use when creating plans, design documents, or architecture decisions before implementation. TRIGGER when: plan, planning, /plan, design document, architecture decision. DO NOT TRIGGER when: implementation, coding, testing.
testing
JSON persistence, atomic writes, file locking, crash recovery, and state versioning patterns. Use when implementing stateful libraries or features requiring persistent state. TRIGGER when: state persistence, atomic write, file locking, crash recovery, checkpoint. DO NOT TRIGGER when: stateless utilities, pure functions, config reads, documentation.