akaszubski/testing-guide

Name: testing-guide
Author: akaszubski

plugins/autonomous-dev/skills/testing-guide/SKILL.md

npx skillsauth add akaszubski/anyclaude-local testing-guide

Complete Testing Guide

Purpose: Unified testing methodology covering TDD, progression tracking, and regression prevention.

Auto-activates when: Keywords like "test", "tdd", "coverage", "baseline", "progression", "regression" appear.

Three-Layer Testing Strategy ⭐ UPDATED

Critical Insight

Layer 1 (pytest) validates STRUCTURE and BEHAVIOR.
Layer 2 (GenAI) validates INTENT and MEANING.
Layer 3 (Meta-analysis) validates SYSTEM PERFORMANCE and OPTIMIZATION.

All three layers are needed for complete autonomous system coverage.

Testing Decision Matrix

Layer 1: When to Use Traditional Tests (pytest)

✅ BEST FOR:

Fast, deterministic checks (CI/CD, pre-commit)
Binary outcomes (file exists? count correct? function returns X?)
Regression prevention (catches obvious breaks)
Automated validation (no human needed)
Structural correctness (file format, API signature)
Performance benchmarks (< 100ms response time)
Coverage tracking (80%+ line coverage)

❌ NOT GOOD FOR:

Semantic understanding (does code match intent?)
Quality assessment (is this "good" code?)
Architectural alignment (does implementation serve goals?)
Context-aware decisions (does this fit the project?)

Examples:

# ✅ PERFECT for traditional tests
import time
from pathlib import Path

def test_user_creation():
    """Test user is created with email."""
    user = create_user("[email protected]")
    assert user.email == "[email protected]"  # Binary check

def test_file_exists():
    """Test config file exists."""
    assert Path(".env").exists()  # Clear yes/no

def test_api_response_time():
    """Test API responds in < 100ms."""
    start = time.perf_counter()  # Monotonic clock, suited to measuring intervals
    response = api.get("/users")
    duration = time.perf_counter() - start
    assert duration < 0.1  # Measurable benchmark
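
A single timing sample can be noisy on a shared CI runner, so a response-time check like the one above may fail intermittently. One possible way to steady it is to time several runs and assert on the median. The sketch below is only an illustration: `median_duration` and `fake_request` are hypothetical names, and `fake_request` stands in for the real `api.get("/users")` call.

```python
import statistics
import time
from typing import Callable


def median_duration(fn: Callable[[], object], runs: int = 5) -> float:
    """Return the median wall-clock duration of fn() over several runs, in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def fake_request() -> dict:
    """Hypothetical stand-in for api.get("/users")."""
    return {"users": []}


def test_api_response_time_median():
    """The median of five runs should stay under the 100ms budget."""
    assert median_duration(fake_request, runs=5) < 0.1
```

The median ignores a single slow outlier, which keeps the binary pass/fail character of a Layer 1 test while reducing flakiness.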

Layer 2: When to Use GenAI Validation (Claude)

✅ BEST FOR:

Semantic understanding (does implementation match documented intent?)
Behavioral validation (does agent DO what description says?)
Architectural drift detection (subtle changes in meaning)
Quality assessment (is code maintainable, clear, well-designed?)
Context-aware validation (does this align with PROJECT.md goals?)
Intent preservation (does WHY match WHAT?)
Complex reasoning (trade-offs, design decisions)

❌ NOT GOOD FOR:

Fast CI/CD checks (too slow, requires API calls)
Deterministic outcomes (slightly different answers each run)
Simple binary checks (overkill for "does file exist")
Automated regression tests (needs human review)

Examples:

# ✅ PERFECT for GenAI validation

## Validate Architectural Intent

Read orchestrator.md. Does it actually validate PROJECT.md before
starting work? Look for:

- File existence checks (bash if statements)
- Reading GOALS/SCOPE/CONSTRAINTS
- Blocking behavior if misaligned
- Clear rejection messages

Don't just check if "PROJECT.md" appears - validate the BEHAVIOR
matches the documented INTENT.

## Assess Code Quality

Review the authentication implementation. Evaluate:

- Is error handling comprehensive?
- Are security best practices followed?
- Is the code maintainable for future developers?
- Does it align with project's security constraints?

Provide contextual assessment with trade-offs.

Layer 3: When to Use System Performance Testing (Meta-analysis) ⭐ NEW

✅ BEST FOR:

Agent effectiveness tracking (which agents succeed? which fail?)
Model optimization (is Opus needed or can we use Haiku?)
Cost efficiency analysis (are we spending too much per feature?)
ROI measurement (what's the return on investment?)
System-level optimization (how can the autonomous system improve itself?)
Resource allocation (where should we invest more/less?)

❌ NOT GOOD FOR:

Individual feature validation (use Layer 1 or 2)
Fast feedback loops (takes time to collect data)
Binary pass/fail checks (provides metrics, not yes/no)

What it measures:

## Agent Performance

| Agent       | Invocations | Success Rate | Avg Time | Cost  |
| ----------- | ----------- | ------------ | -------- | ----- |
| researcher  | 1.8/feature | 100%         | 42s      | $0.09 |
| planner     | 1.0/feature | 100%         | 28s      | $0.07 |
| implementer | 1.2/feature | 95%          | 180s     | $0.45 |

## Model Optimization Opportunities

- reviewer: Sonnet → Haiku (save 92%)
- security-auditor: Already Haiku ✅
- doc-master: Already Haiku ✅

## ROI Tracking

- Total cost: $18.70 (22 features)
- Value delivered: $8,800 (88hr × $100/hr)
- ROI: 470× return on investment

## System Performance

- Average cost per feature: $0.85
- Average time per feature: 18 minutes
- Success rate: 95%
- Target: < $1.00/feature, < 20min, > 90%
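
The ROI and per-feature figures in the sample report follow from simple arithmetic on the totals. A minimal sketch of that calculation, assuming the report's inputs (total cost, feature count, hours saved, hourly rate) are available as plain numbers; the `SystemMetrics` class is hypothetical and not part of the `/test system-performance` command:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemMetrics:
    """Inputs taken from the sample report above."""
    total_cost_usd: float
    features: int
    hours_saved: float
    hourly_rate_usd: float

    @property
    def cost_per_feature(self) -> float:
        return self.total_cost_usd / self.features

    @property
    def value_delivered(self) -> float:
        return self.hours_saved * self.hourly_rate_usd

    @property
    def roi_multiple(self) -> float:
        # Value delivered divided by what was spent to deliver it
        return self.value_delivered / self.total_cost_usd


metrics = SystemMetrics(total_cost_usd=18.70, features=22, hours_saved=88, hourly_rate_usd=100)
print(f"cost per feature: ${metrics.cost_per_feature:.2f}")
print(f"value delivered: ${metrics.value_delivered:,.0f}")
print(f"ROI: {metrics.roi_multiple:.1f}x")
```

With the sample numbers this gives $0.85 per feature, $8,800 of value, and an ROI of roughly 470×, matching the report.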

Commands:

/test system-performance              # Run system performance analysis
/test system-performance --track-issues  # Auto-create optimization issues

When to run:

Weekly or monthly (not per-feature)
After major changes to agent pipeline
When reviewing system costs
During sprint retrospectives

See: SYSTEM-PERFORMANCE-GUIDE.md

Testing Layers: When to Use Which

Layer 1: Unit Tests (pytest, fast)

What: Individual functions in isolation
When: Every function with logic
Speed: < 1 second total
CI/CD: YES, run on every commit

Example:

# tests/unit/test_auth.py
def test_hash_password():
    """Test password is hashed."""
    hashed = hash_password("secret123")
    assert hashed != "secret123"  # Binary: not plaintext
    assert len(hashed) == 60  # bcrypt length

Validation Type: STRUCTURE

✅ Fast
✅ Deterministic
✅ Automated
❌ Doesn't check if hashing algorithm is secure

Layer 2: Integration Tests (pytest, medium)

What: Components working together
When: Testing workflows, API endpoints, database interactions
Speed: < 10 seconds total
CI/CD: YES, run before merge

Example:

# tests/integration/test_user_workflow.py
def test_user_registration_workflow(db):
    """Test complete user registration."""
    # Create user
    user = register_user("[email protected]", "password123")

    # Verify in database
    db_user = db.query(User).filter_by(email="[email protected]").first()
    assert db_user is not None

    # Verify password hashed
    assert db_user.password != "password123"

Validation Type: BEHAVIOR

✅ Tests real workflows
✅ Catches integration issues
✅ Automated
❌ Doesn't validate if workflow serves business goals

Layer 3: UAT Tests (pytest, slow)

What: End-to-end user workflows
When: Before release, testing complete features
Speed: < 60 seconds total
CI/CD: Optional, can run nightly

Example:

# tests/uat/test_user_journey.py
def test_complete_user_journey(tmp_path):
    """Test user journey: signup → login → create post → logout."""
    # Setup
    app = create_app(tmp_path)

    # Signup
    response = app.post("/signup", data={"email": "[email protected]"})
    assert response.status_code == 201

    # Login
    response = app.post("/login", data={"email": "[email protected]"})
    assert "session_token" in response.cookies

    # Create post
    response = app.post("/posts", json={"title": "Test"})
    assert response.status_code == 201

    # Logout
    response = app.post("/logout")
    assert "session_token" not in response.cookies

Validation Type: WORKS

✅ Tests complete user experience
✅ Catches workflow breaks
✅ Can be automated (but slow)
❌ Doesn't validate if feature aligns with strategic goals

Layer 4: GenAI Validation (Claude, comprehensive)

What: Architectural intent, semantic alignment, quality assessment
When: Before release, after major changes, monthly maintenance
Speed: 2-5 minutes (manual review)
CI/CD: NO, requires human review

Example:

# /validate-architecture

## Validate: Does orchestrator enforce PROJECT.md-first architecture?

Read agents/orchestrator.md and analyze:

1. **Intent (from ARCHITECTURE.md)**:
   "Prevent scope creep by validating alignment before work"

2. **Implementation Analysis**:
   - Does orchestrator check if PROJECT.md exists?
   - Does it read GOALS, SCOPE, CONSTRAINTS?
   - Does it validate feature aligns with goals?
   - Does it block work if misaligned?
   - Does it create PROJECT.md if missing?

3. **Behavioral Evidence**:
   - Line 20: `if [ ! -f PROJECT.md ]` ✓ Checks existence
   - Line 81-83: Reads GOALS/SCOPE/CONSTRAINTS ✓
   - Line 357-391: Displays rejection if misaligned ✓
   - Line 77: `exit 0` blocks work if PROJECT.md missing ✓

**Assessment**: ✅ ALIGNED
Implementation matches documented intent. Orchestrator actually
enforces PROJECT.md-first, not just mentions it.

**Why GenAI?**:
Static test could only check if "PROJECT.md" appears in file.
GenAI validates the BEHAVIOR matches the INTENT.

Validation Type: INTENT

✅ Validates semantic meaning
✅ Detects architectural drift
✅ Assesses quality and design
❌ Slow, requires human review
❌ Not deterministic (slight variations)

Recommended Testing Workflow

Development Phase

Write unit tests (pytest, fast feedback)
```
pytest tests/unit/test_feature.py -v
```
Write integration tests (pytest, workflow validation)
```
pytest tests/integration/test_feature.py -v
```
Run all tests locally (before commit)
```
pytest -v
```

Pre-Commit (Automated)

# .claude/hooks/pre-commit or CI/CD pipeline
pytest tests/unit/ tests/integration/ -v --tb=short
# Fast: < 10 seconds
# Automated: Catches obvious breaks

Pre-Release (Manual)

Run UAT tests (complete workflows)
```
pytest tests/uat/ -v
```
GenAI architectural validation (intent alignment)
```
/validate-architecture
```
Manual testing (critical user paths)

Hybrid Approach: Best of Both Worlds

Example: Testing Orchestrator

Static Tests (pytest) - Layer 1

# tests/test_architecture.py
from pathlib import Path

agents_dir = Path("agents")  # Directory that holds agent files such as orchestrator.md

def test_orchestrator_exists():
    """Test orchestrator agent file exists."""
    assert (agents_dir / "orchestrator.md").exists()

def test_orchestrator_has_task_tool():
    """Test orchestrator can coordinate agents."""
    content = (agents_dir / "orchestrator.md").read_text()
    assert "Task" in content

Catches: File deletion, obvious regressions
Misses: Whether orchestrator actually uses Task tool correctly

GenAI Validation (Claude) - Layer 4

Read orchestrator.md. Validate:

1. Does it use Task tool to coordinate agents?
2. Does it coordinate in correct order (researcher → planner → test-master → implementer)?
3. Does it enforce TDD (tests before code)?
4. Does it log to session files for context management?

Look for actual implementation code (bash scripts, tool invocations),
not just descriptions.

Catches: Behavioral drift, incorrect usage, missing logic
Provides: Contextual assessment of quality and alignment

Complete Testing Strategy

| Test Type   | Speed  | Automated   | Validates        | When to Use                   |
| ----------- | ------ | ----------- | ---------------- | ----------------------------- |
| Unit        | < 1s   | ✅ Yes      | Structure, logic | Every function                |
| Integration | < 10s  | ✅ Yes      | Workflows, APIs  | Component interaction         |
| UAT         | < 60s  | ⚠️ Optional | User journeys    | End-to-end flows              |
| GenAI       | 2-5min | ❌ No       | Intent, quality  | Before release, major changes |

Cost-Benefit Analysis

Traditional Tests (pytest):

Cost: Developer time to write tests
Benefit: Fast feedback, regression prevention
ROI: High (run thousands of times)

GenAI Validation:

Cost: API calls, human review time
Benefit: Architectural integrity, drift detection
ROI: High (catches subtle issues static tests miss)

Three Testing Approaches (Traditional)

1. TDD (Test-Driven Development)

Purpose: Write tests BEFORE implementation
When: New features, refactoring, coverage gaps
Location: tests/unit/ and tests/integration/

2. Progression Testing

Purpose: Track metrics over time, prevent regressions
When: Optimizing performance, training models
Location: tests/progression/

3. Regression Testing

Purpose: Ensure fixed bugs never return
When: After bug fixes
Location: tests/regression/

TDD Methodology

Core Principle

Red → Green → Refactor

Red: Write failing test (feature doesn't exist yet)
Green: Implement minimum code to make test pass
Refactor: Clean up code while keeping tests green

TDD Workflow

Step 1: Write Test First

# tests/unit/test_trainer.py

def test_train_lora_with_valid_params():
    """Test LoRA training succeeds with valid parameters."""
    result = train_lora(
        model="[model_repo]/Llama-3.2-1B-Instruct-4bit",
        data="data/sample.json",
        rank=8
    )

    assert result.success is True
    assert result.adapter_path.exists()

Step 2: Run Test (Should FAIL)

pytest tests/unit/test_trainer.py::test_train_lora_with_valid_params -v
# FAILED - NameError: name 'train_lora' is not defined ✅ Expected!

Step 3: Implement Code

# src/[project_name]/methods/lora.py

def train_lora(model: str, data: str, rank: int = 8):
    """Train LoRA adapter."""
    # Minimum implementation to pass test
    ...
    return Result(success=True, adapter_path=Path("adapter.npz"))

Step 4: Run Test (Should PASS)

pytest tests/unit/test_trainer.py::test_train_lora_with_valid_params -v
# PASSED ✅

Test Coverage Standards

Minimum: 80% coverage
Target: 90%+ coverage

Check coverage:

pytest --cov=src/[project_name] --cov-report=term-missing tests/

Test Patterns

Arrange-Act-Assert (AAA):

def test_example():
    # Arrange: Set up test data
    model = load_model()
    data = load_test_data()

    # Act: Execute the operation
    result = train_model(model, data)

    # Assert: Verify expected outcome
    assert result.success is True

Parametrize for multiple cases:

@pytest.mark.parametrize("rank,expected_params", [
    (8, 8 * 4096 * 2),
    (16, 16 * 4096 * 2),
    (32, 32 * 4096 * 2),
])
def test_lora_parameter_count(rank, expected_params):
    adapter = create_lora_adapter(rank=rank)
    assert adapter.num_parameters == expected_params
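
The expected values in the parametrized test have the form `rank * 4096 * 2`. This is consistent with the usual LoRA count for a single linear layer: a down-projection of shape (rank, d_in) plus an up-projection of shape (d_out, rank), so rank × (d_in + d_out) trainable parameters. The sketch below assumes one adapter on a square 4096 → 4096 layer, which is an inference from the numbers rather than something the guide states; `lora_parameter_count` is a hypothetical helper.

```python
def lora_parameter_count(rank: int, d_in: int = 4096, d_out: int = 4096) -> int:
    """Trainable parameters added by one LoRA adapter on a d_in -> d_out linear layer."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    # Matrix A has shape (rank, d_in); matrix B has shape (d_out, rank).
    return rank * d_in + d_out * rank


for rank in (8, 16, 32):
    print(rank, lora_parameter_count(rank))
```

Deriving the expected count from a formula like this, instead of hardcoding it, makes it easier to extend the parametrized cases to other ranks or layer sizes.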

Progression Testing

Purpose

Track metrics over time, automatically detect regressions.

How It Works

First Run: Establish baseline
Subsequent Runs: Compare to baseline
Regression: Test FAILS if metric drops >tolerance
Progression: Baseline updates if metric improves >2%

Baseline File Format

{
  "metric_name": "lora_accuracy",
  "baseline_value": 0.856,
  "tolerance": 0.05,
  "established_at": "2025-10-18T10:30:00",
  "history": [
    { "date": "2025-10-18", "value": 0.856, "change": "baseline established" },
    { "date": "2025-10-20", "value": 0.872, "change": "+1.9% improvement" }
  ]
}

Progression Test Template

# tests/progression/test_lora_accuracy.py

import json
import pytest
from pathlib import Path
from datetime import datetime

BASELINE_FILE = Path(__file__).parent / "baselines" / "lora_accuracy_baseline.json"
TOLERANCE = 0.05  # ±5%

class TestProgressionLoRAAccuracy:
    @pytest.fixture
    def baseline(self):
        if BASELINE_FILE.exists():
            return json.loads(BASELINE_FILE.read_text())
        return None

    def test_lora_accuracy_progression(self, baseline):
        # Measure current metric
        current_accuracy = self._train_and_evaluate()

        # First run - establish baseline
        if baseline is None:
            self._establish_baseline(current_accuracy)
            pytest.skip(f"Baseline established: {current_accuracy:.4f}")

        # Compare to baseline
        baseline_value = baseline["baseline_value"]
        diff_pct = ((current_accuracy - baseline_value) / baseline_value) * 100

        # Check for regression
        if diff_pct < -TOLERANCE * 100:
            pytest.fail(f"REGRESSION: {abs(diff_pct):.1f}% worse than baseline")

        # Check for progression
        if diff_pct > 2.0:
            self._update_baseline(current_accuracy, diff_pct, baseline)

        print(f"✅ Accuracy: {current_accuracy:.4f} ({diff_pct:+.1f}%)")

    def _train_and_evaluate(self) -> float:
        # Your measurement logic here; must return the metric as a float
        raise NotImplementedError

    def _establish_baseline(self, value: float):
        # Create baseline file
        raise NotImplementedError

    def _update_baseline(self, new_value: float, improvement: float, old_baseline: dict):
        # Update baseline file
        raise NotImplementedError
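
The two baseline helpers are left as stubs in the template. They could be filled in along the following lines, writing the same JSON layout as the baseline file format shown earlier. This is a sketch under assumptions: the standalone function names, the `path` argument, and the wording of the history entries are illustrative choices, not the skill's actual implementation.

```python
import json
from datetime import datetime
from pathlib import Path


def establish_baseline(path: Path, metric_name: str, value: float, tolerance: float = 0.05) -> dict:
    """Write a new baseline file and return its contents."""
    now = datetime.now()
    baseline = {
        "metric_name": metric_name,
        "baseline_value": value,
        "tolerance": tolerance,
        "established_at": now.isoformat(timespec="seconds"),
        "history": [
            {"date": now.date().isoformat(), "value": value, "change": "baseline established"}
        ],
    }
    path.parent.mkdir(parents=True, exist_ok=True)  # create baselines/ on first run
    path.write_text(json.dumps(baseline, indent=2))
    return baseline


def update_baseline(path: Path, new_value: float, improvement_pct: float) -> dict:
    """Raise the stored baseline to new_value and append a history entry."""
    baseline = json.loads(path.read_text())
    baseline["baseline_value"] = new_value
    baseline["history"].append(
        {
            "date": datetime.now().date().isoformat(),
            "value": new_value,
            "change": f"{improvement_pct:+.1f}% improvement",
        }
    )
    path.write_text(json.dumps(baseline, indent=2))
    return baseline
```

Inside the test class, `_establish_baseline` and `_update_baseline` could delegate to these functions with `BASELINE_FILE` as the path.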

Metrics to Track

| Metric            | Higher Better?       | Typical Tolerance |
| ----------------- | -------------------- | ----------------- |
| Accuracy          | ✅ Yes               | ±5%               |
| Loss              | ❌ No (lower better) | ±5%               |
| Training Speed    | ✅ Yes               | ±10%              |
| Memory Usage      | ❌ No (lower better) | ±5%               |
| Inference Latency | ❌ No (lower better) | ±10%              |
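
The progression template compares values as if higher were always better, which fits accuracy but not loss, memory usage, or latency. One way to handle both directions is to flip the sign of the percent change for lower-is-better metrics, so that a positive number always means improvement. The sketch below is illustrative: both function names are hypothetical, and the 2% progression threshold is taken from the "How It Works" list.

```python
def signed_improvement_pct(current: float, baseline: float, higher_is_better: bool) -> float:
    """Percent change relative to baseline, signed so that positive always means better."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    change = (current - baseline) / abs(baseline) * 100
    return change if higher_is_better else -change


def classify(
    current: float,
    baseline: float,
    higher_is_better: bool,
    tolerance: float = 0.05,
    progression_pct: float = 2.0,
) -> str:
    """Return 'regression', 'progression', or 'stable' for a measured metric."""
    pct = signed_improvement_pct(current, baseline, higher_is_better)
    if pct < -tolerance * 100:
        return "regression"
    if pct > progression_pct:
        return "progression"
    return "stable"


print(classify(0.80, 0.856, higher_is_better=True))   # accuracy dropped
print(classify(0.40, 0.45, higher_is_better=False))   # loss dropped
```

A progression test could then fail on "regression" and update the baseline on "progression", whichever direction the metric runs.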

Regression Testing

Purpose

Ensure fixed bugs never return.

When to Create

After fixing any bug
Issue closed with "Closes #123"
Commit contains "fix:"
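
The triggers above can be checked mechanically, for example in a hook that reminds the author to add a regression test. A minimal sketch, assuming Conventional Commits style `fix:` prefixes and GitHub's issue-closing keywords (close, fix, resolve, and their inflections); the function names are hypothetical:

```python
import re

# "fix:" or "fix(scope):" at the start of the subject line
FIX_PREFIX = re.compile(r"^fix(\([^)]*\))?!?:", re.IGNORECASE)
# Closing keyword followed by an issue reference, e.g. "Closes #123"
CLOSES_ISSUE = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)", re.IGNORECASE
)


def needs_regression_test(commit_message: str) -> bool:
    """True if the message looks like a bug fix that should ship with a regression test."""
    stripped = commit_message.strip()
    subject = stripped.splitlines()[0] if stripped else ""
    return bool(FIX_PREFIX.match(subject) or CLOSES_ISSUE.search(commit_message))


def referenced_issues(commit_message: str) -> list[int]:
    """Issue numbers mentioned with a closing keyword."""
    return [int(number) for number in CLOSES_ISSUE.findall(commit_message)]


message = "fix: use model.model.layers for nested structure\n\nCloses #47"
print(needs_regression_test(message), referenced_issues(message))
```

The extracted issue number can feed directly into the test name, as in `test_bug_47_...`.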

Regression Test Template

# tests/regression/test_regression_suite.py

import pytest

class TestMLXPatterns:
    """[FRAMEWORK]-specific regression tests."""

    def test_bug_47_mlx_nested_layers(self):
        """
        Regression test: [FRAMEWORK] model.layers AttributeError

        Bug: Code tried model.layers[i] (doesn't exist)
        Fix: Use model.model.layers[i] (nested structure)
        Date fixed: 2025-10-18
        Issue: #47

        Ensures bug never returns.
        """

        # Arrange
        model = create_mock_mlx_model()

        # Act & Assert: Correct way works
        layer = model.model.layers[0]
        assert layer is not None

        # Assert: Wrong way fails (bug would use this)
        with pytest.raises(AttributeError):
            _ = model.layers

Test Organization

# tests/regression/test_regression_suite.py

class TestMLXPatterns:
    """[FRAMEWORK]-specific bugs."""
    pass

class TestDataProcessing:
    """Data handling bugs."""
    pass

class TestAPIIntegration:
    """External API bugs."""
    pass

class TestErrorHandling:
    """Error handling bugs."""
    pass

Test Organization

Directory Structure

tests/
├── unit/                    # TDD unit tests
│   ├── test_trainer.py
│   ├── test_lora.py
│   └── test_dpo.py
├── integration/             # TDD integration tests
│   └── test_training_pipeline.py
├── progression/             # Baseline tracking
│   ├── test_lora_accuracy.py
│   ├── test_training_speed.py
│   └── baselines/
│       └── *.json
└── regression/              # Bug prevention
    ├── test_regression_suite.py
    └── README.md

Naming Conventions

Test Files: test_{module}.py
Test Functions: test_{what_is_being_tested}()
Test Classes: Test{Feature}

Examples:

test_trainer.py - Tests for trainer module
test_train_lora_with_valid_params() - Specific test
TestProgressionLoRAAccuracy - Progression test class

Best Practices

✅ DO

Write tests first (TDD)
Test one thing per test (focused)
Use descriptive names (test_train_lora_raises_error_on_invalid_model)
Arrange-Act-Assert pattern
Mock external dependencies (APIs, files)
Test edge cases (empty data, None, negative numbers)
Keep tests fast (<1 second each)
Maintain 80%+ coverage

❌ DON'T

Don't test implementation details (test behavior, not internal code)
Don't have test dependencies (tests should be isolated)
Don't hardcode paths (use fixtures, temporary directories)
Don't skip tests (fix them or delete them)
Don't write tests after implementation (that's not TDD!)
Don't test third-party libraries (trust they're tested)
Don't ignore failing tests (fix immediately)
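
Several of these practices can be combined in one short test: mock the external dependency, write to a temporary directory instead of a hardcoded path, and follow Arrange-Act-Assert. The sketch below uses only the standard library; `save_user_count` and its client interface are hypothetical code under test, invented for illustration.

```python
import json
import tempfile
from pathlib import Path
from unittest.mock import Mock


def save_user_count(client, out_dir: Path) -> Path:
    """Hypothetical code under test: fetch users from an API client and write the count."""
    users = client.get("/users")
    out_file = out_dir / "user_count.json"
    out_file.write_text(json.dumps({"count": len(users)}))
    return out_file


def test_save_user_count_writes_count():
    # Arrange: a mock stands in for the real API client, so no network call is made
    client = Mock()
    client.get.return_value = [{"id": 1}, {"id": 2}]

    with tempfile.TemporaryDirectory() as tmp:
        # Act
        out_file = save_user_count(client, Path(tmp))

        # Assert: behavior (file contents and the call made), not internals
        assert json.loads(out_file.read_text()) == {"count": 2}
        client.get.assert_called_once_with("/users")
```

Under pytest, the built-in `tmp_path` fixture can replace the explicit `tempfile.TemporaryDirectory()` block.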

Pytest Fixtures

Common Fixtures

# conftest.py

import pytest
from pathlib import Path
import tempfile

@pytest.fixture
def temp_dir():
    """Temporary directory for test files."""
    with tempfile.TemporaryDirectory() as tmpdir:
        yield Path(tmpdir)

@pytest.fixture
def sample_data():
    """Sample training data."""
    return {
        "messages": [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Hi there!"}
        ]
    }

@pytest.fixture
def mock_model():
    """Mock [FRAMEWORK] model."""
    class MockModel:
        def __init__(self):
            self.model = type('obj', (object,), {'layers': []})()
    return MockModel()

Coverage Targets

By Test Type

| Test Type   | Target Coverage       |
| ----------- | --------------------- |
| Unit        | 90%+                  |
| Integration | 70%+                  |
| Progression | N/A (metric tracking) |
| Regression  | N/A (bug prevention)  |

Overall Project

Minimum: 80% (enforced in CI/CD)
Target: 90%
Stretch: 95%
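
These tiers can be reported from a coverage run without extra tooling. The sketch below assumes a Cobertura-style XML report, the format written by `pytest --cov-report=xml`, whose root `coverage` element carries a `line-rate` attribute between 0 and 1. The function names and the trimmed sample report are illustrative.

```python
import xml.etree.ElementTree as ET


def coverage_tier(percent: float) -> str:
    """Map a line-coverage percentage onto the project tiers listed above."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    if percent >= 95.0:
        return "stretch"
    if percent >= 90.0:
        return "target"
    if percent >= 80.0:
        return "minimum"
    return "below minimum"


def line_coverage_percent(coverage_xml: str) -> float:
    """Read the overall line rate from Cobertura-style XML text."""
    root = ET.fromstring(coverage_xml)
    # Round to avoid floating-point noise such as 56.99999999999999
    return round(float(root.attrib["line-rate"]) * 100.0, 2)


# Trimmed, illustrative report; a real file also contains per-package details
sample_report = '<coverage line-rate="0.91" branch-rate="0"></coverage>'
print(coverage_tier(line_coverage_percent(sample_report)))
```

For a hard gate in CI, `--cov-fail-under=80` already enforces the minimum; a helper like this is only useful for reporting which tier a run reached.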

Check Coverage

# Run tests with coverage
pytest --cov=src/[project_name] --cov-report=html tests/

# Open coverage report (macOS; use xdg-open on Linux)
open htmlcov/index.html

# Show missing lines
pytest --cov=src/[project_name] --cov-report=term-missing tests/

CI/CD Integration

Pre-Push Hook

# Run all tests before push
pytest tests/ -v

# Check coverage
pytest --cov=src/[project_name] --cov-report=term --cov-fail-under=80 tests/

GitHub Actions

- name: Run Tests
  run: |
    pytest tests/ -v --cov=src/[project_name] --cov-report=xml

- name: Upload Coverage
  uses: codecov/codecov-action@v3
  with:
    files: ./coverage.xml

Quick Reference

Test Types Decision Matrix

| Scenario             | Test Type   | Location           |
| -------------------- | ----------- | ------------------ |
| New feature          | TDD         | tests/unit/        |
| Optimize performance | Progression | tests/progression/ |
| Fixed bug            | Regression  | tests/regression/  |
| Multiple components  | Integration | tests/integration/ |

Running Tests

# All tests
pytest tests/

# Specific file
pytest tests/unit/test_trainer.py

# Specific test
pytest tests/unit/test_trainer.py::test_train_lora

# With coverage
pytest --cov=src/[project_name] tests/

# Verbose output
pytest -v tests/

# Stop on first failure
pytest -x tests/

# Parallel execution (requires the pytest-xdist plugin)
pytest -n auto tests/

This skill provides a complete testing methodology for maintaining high-quality, well-tested code in [PROJECT_NAME].

Complete testing methodology - TDD, progression tracking, regression prevention, and test patterns

5 stars

testing

Updated Apr 2, 2026

npx skillsauth add akaszubski/anyclaude-local testing-guide

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner         | Description                                    | Status  | %   |
| --------------- | ---------------------------------------------- | ------- | --- |
| Trivy           | Container and dependency vulnerability scanner | Clean   | 95% |
| Semgrep         | Static code analysis for vulnerabilities       | Clean   | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation     | Clean   | 95% |
| Snyk (dep)      | Open source security scanning                  | Skipped | 50% |
| Socket.dev      | Supply chain security analysis                 | Skipped | 50% |
| VirusTotal      | Multi-engine malware detection                 | Skipped | 50% |
| CrowdStrike     | Advanced threat intelligence                   | Skipped | 50% |
| OSV-Scanner     | Open Source Vulnerability database check       | Skipped | 50% |
| OWASP Dep-Check |                                                | Skipped | 50% |

Last scanned: Apr 2, 2026, 7:27 PM (57.3s, 1 file scanned)

SKILL.md

name: testing-guide
type: knowledge
description: Complete testing methodology - TDD, progression tracking, regression prevention, and test patterns
keywords: test, testing, tdd, pytest, coverage, baseline, progression, regression, quality
auto_activate: true

Complete Testing Guide

Purpose: Unified testing methodology covering TDD, progression tracking, and regression prevention.

Auto-activates when: Keywords like "test", "tdd", "coverage", "baseline", "progression", "regression" appear.

Three-Layer Testing Strategy ⭐ UPDATED

Critical Insight

All three layers are needed for complete autonomous system coverage.

Testing Decision Matrix

Layer 1: When to Use Traditional Tests (pytest)

✅ BEST FOR:

Fast, deterministic checks (CI/CD, pre-commit)
Binary outcomes (file exists? count correct? function returns X?)
Regression prevention (catches obvious breaks)
Automated validation (no human needed)
Structural correctness (file format, API signature)
Performance benchmarks (< 100ms response time)
Coverage tracking (80%+ line coverage)

❌ NOT GOOD FOR:

Semantic understanding (does code match intent?)
Quality assessment (is this "good" code?)
Architectural alignment (does implementation serve goals?)
Context-aware decisions (does this fit the project?)

Examples:

# ✅ PERFECT for traditional tests
def test_user_creation():
    """Test user is created with email."""
    user = create_user("[email protected]")
    assert user.email == "[email protected]"  # Binary check

def test_file_exists():
    """Test config file exists."""
    assert Path(".env").exists()  # Clear yes/no

def test_api_response_time():
    """Test API responds in < 100ms."""
    start = time.time()
    response = api.get("/users")
    duration = time.time() - start
    assert duration < 0.1  # Measurable benchmark

Layer 2: When to Use GenAI Validation (Claude)

✅ BEST FOR:

Semantic understanding (does implementation match documented intent?)
Behavioral validation (does agent DO what description says?)
Architectural drift detection (subtle changes in meaning)
Quality assessment (is code maintainable, clear, well-designed?)
Context-aware validation (does this align with PROJECT.md goals?)
Intent preservation (does WHY match WHAT?)
Complex reasoning (trade-offs, design decisions)

❌ NOT GOOD FOR:

Fast CI/CD checks (too slow, requires API calls)
Deterministic outcomes (slightly different answers each run)
Simple binary checks (overkill for "does file exist")
Automated regression tests (needs human review)

Examples:

# ✅ PERFECT for GenAI validation

## Validate Architectural Intent

Read orchestrator.md. Does it actually validate PROJECT.md before
starting work? Look for:

- File existence checks (bash if statements)
- Reading GOALS/SCOPE/CONSTRAINTS
- Blocking behavior if misaligned
- Clear rejection messages

Don't just check if "PROJECT.md" appears - validate the BEHAVIOR
matches the documented INTENT.

## Assess Code Quality

Review the authentication implementation. Evaluate:

- Is error handling comprehensive?
- Are security best practices followed?
- Is the code maintainable for future developers?
- Does it align with project's security constraints?

Provide contextual assessment with trade-offs.

Layer 3: When to Use System Performance Testing (Meta-analysis) ⭐ NEW

✅ BEST FOR:

Agent effectiveness tracking (which agents succeed? which fail?)
Model optimization (is Opus needed or can we use Haiku?)
Cost efficiency analysis (are we spending too much per feature?)
ROI measurement (what's the return on investment?)
System-level optimization (how can the autonomous system improve itself?)
Resource allocation (where should we invest more/less?)

❌ NOT GOOD FOR:

Individual feature validation (use Layer 1 or 2)
Fast feedback loops (takes time to collect data)
Binary pass/fail checks (provides metrics, not yes/no)

What it measures:

## Agent Performance

| Agent       | Invocations | Success Rate | Avg Time | Cost  |
| ----------- | ----------- | ------------ | -------- | ----- |
| researcher  | 1.8/feature | 100%         | 42s      | $0.09 |
| planner     | 1.0/feature | 100%         | 28s      | $0.07 |
| implementer | 1.2/feature | 95%          | 180s     | $0.45 |

## Model Optimization Opportunities

- reviewer: Sonnet → Haiku (save 92%)
- security-auditor: Already Haiku ✅
- doc-master: Already Haiku ✅

## ROI Tracking

- Total cost: $18.70 (22 features)
- Value delivered: $8,800 (88hr × $100/hr)
- ROI: 470× return on investment

## System Performance

- Average cost per feature: $0.85
- Average time per feature: 18 minutes
- Success rate: 95%
- Target: < $1.00/feature, < 20min, > 90%

Commands:

/test system-performance              # Run system performance analysis
/test system-performance --track-issues  # Auto-create optimization issues

When to run:

Weekly or monthly (not per-feature)
After major changes to agent pipeline
When reviewing system costs
During sprint retrospectives

See: SYSTEM-PERFORMANCE-GUIDE.md

Testing Layers: When to Use Which

Layer 1: Unit Tests (pytest, fast)

What: Individual functions in isolation When: Every function with logic Speed: < 1 second total CI/CD: YES, run on every commit

Example:

# tests/unit/test_auth.py
def test_hash_password():
    """Test password is hashed."""
    hashed = hash_password("secret123")
    assert hashed != "secret123"  # Binary: not plaintext
    assert len(hashed) == 60  # bcrypt length

Validation Type: STRUCTURE

✅ Fast
✅ Deterministic
✅ Automated
❌ Doesn't check if hashing algorithm is secure

Layer 2: Integration Tests (pytest, medium)

What: Components working together When: Testing workflows, API endpoints, database interactions Speed: < 10 seconds total CI/CD: YES, run before merge

Example:

# tests/integration/test_user_workflow.py
def test_user_registration_workflow(db):
    """Test complete user registration."""
    # Create user
    user = register_user("[email protected]", "password123")

    # Verify in database
    db_user = db.query(User).filter_by(email="[email protected]").first()
    assert db_user is not None

    # Verify password hashed
    assert db_user.password != "password123"

Validation Type: BEHAVIOR

✅ Tests real workflows
✅ Catches integration issues
✅ Automated
❌ Doesn't validate if workflow serves business goals

Layer 3: UAT Tests (pytest, slow)

What: End-to-end user workflows When: Before release, testing complete features Speed: < 60 seconds total CI/CD: Optional, can run nightly

Example:

# tests/uat/test_user_journey.py
def test_complete_user_journey(tmp_path):
    """Test user journey: signup → login → create post → logout."""
    # Setup
    app = create_app(tmp_path)

    # Signup
    response = app.post("/signup", data={"email": "[email protected]"})
    assert response.status_code == 201

    # Login
    response = app.post("/login", data={"email": "[email protected]"})
    assert "session_token" in response.cookies

    # Create post
    response = app.post("/posts", json={"title": "Test"})
    assert response.status_code == 201

    # Logout
    response = app.post("/logout")
    assert "session_token" not in response.cookies

Validation Type: WORKS

✅ Tests complete user experience
✅ Catches workflow breaks
✅ Can be automated (but slow)
❌ Doesn't validate if feature aligns with strategic goals

Layer 4: GenAI Validation (Claude, comprehensive)

Example:

# /validate-architecture

## Validate: Does orchestrator enforce PROJECT.md-first architecture?

Read agents/orchestrator.md and analyze:

1. **Intent (from ARCHITECTURE.md)**:
   "Prevent scope creep by validating alignment before work"

2. **Implementation Analysis**:
   - Does orchestrator check if PROJECT.md exists?
   - Does it read GOALS, SCOPE, CONSTRAINTS?
   - Does it validate feature aligns with goals?
   - Does it block work if misaligned?
   - Does it create PROJECT.md if missing?

3. **Behavioral Evidence**:
   - Line 20: `if [ ! -f PROJECT.md ]` ✓ Checks existence
   - Line 81-83: Reads GOALS/SCOPE/CONSTRAINTS ✓
   - Line 357-391: Displays rejection if misaligned ✓
   - Line 77: `exit 0` blocks work if PROJECT.md missing ✓

**Assessment**: ✅ ALIGNED
Implementation matches documented intent. Orchestrator actually
enforces PROJECT.md-first, not just mentions it.

**Why GenAI?**:
Static test could only check if "PROJECT.md" appears in file.
GenAI validates the BEHAVIOR matches the INTENT.

Validation Type: INTENT

✅ Validates semantic meaning
✅ Detects architectural drift
✅ Assesses quality and design
❌ Slow, requires human review
❌ Not deterministic (slight variations)

Recommended Testing Workflow

Development Phase

Write unit tests (pytest, fast feedback)
```
pytest tests/unit/test_feature.py -v
```
Write integration tests (pytest, workflow validation)
```
pytest tests/integration/test_feature.py -v
```
Run all tests locally (before commit)
```
pytest -v
```

Pre-Commit (Automated)

# .claude/hooks/pre-commit or CI/CD pipeline
pytest tests/unit/ tests/integration/ -v --tb=short
# Fast: < 10 seconds
# Automated: Catches obvious breaks

Pre-Release (Manual)

Run UAT tests (complete workflows)
```
pytest tests/uat/ -v
```
GenAI architectural validation (intent alignment)
```
/validate-architecture
```
Manual testing (critical user paths)

Hybrid Approach: Best of Both Worlds

Example: Testing Orchestrator

Static Tests (pytest) - Layer 1

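# tests/test_architecture.py
from pathlib import Path

# Assumption: agent definitions live in ./agents relative to the repo root
agents_dir = Path("agents")
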
def test_orchestrator_exists():
    """Test orchestrator agent file exists."""
    assert (agents_dir / "orchestrator.md").exists()

def test_orchestrator_has_task_tool():
    """Test orchestrator can coordinate agents."""
    content = (agents_dir / "orchestrator.md").read_text()
    assert "Task" in content

Catches: File deletion, obvious regressions
Misses: Whether orchestrator actually uses Task tool correctly

GenAI Validation (Claude) - Layer 4

Read orchestrator.md. Validate:

1. Does it use Task tool to coordinate agents?
2. Does it coordinate in correct order (researcher → planner → test-master → implementer)?
3. Does it enforce TDD (tests before code)?
4. Does it log to session files for context management?

Look for actual implementation code (bash scripts, tool invocations),
not just descriptions.

Catches: Behavioral drift, incorrect usage, missing logic
Provides: Contextual assessment of quality and alignment

Complete Testing Strategy

Cost-Benefit Analysis

Traditional Tests (pytest):

Cost: Developer time to write tests
Benefit: Fast feedback, regression prevention
ROI: High (run thousands of times)

GenAI Validation:

Cost: API calls, human review time
Benefit: Architectural integrity, drift detection
ROI: High (catches subtle issues static tests miss)

Three Testing Approaches (Traditional)

1. TDD (Test-Driven Development)

Purpose: Write tests BEFORE implementation
When: New features, refactoring, coverage gaps
Location: tests/unit/ and tests/integration/

2. Progression Testing

Purpose: Track metrics over time, prevent regressions
When: Optimizing performance, training models
Location: tests/progression/

3. Regression Testing

Purpose: Ensure fixed bugs never return
When: After bug fixes
Location: tests/regression/

TDD Methodology

Core Principle

Red → Green → Refactor

Red: Write failing test (feature doesn't exist yet)
Green: Implement minimum code to make test pass
Refactor: Clean up code while keeping tests green

TDD Workflow

Step 1: Write Test First

# tests/unit/test_trainer.py

def test_train_lora_with_valid_params():
    """Test LoRA training succeeds with valid parameters."""
    result = train_lora(
        model="[model_repo]/Llama-3.2-1B-Instruct-4bit",
        data="data/sample.json",
        rank=8
    )

    assert result.success is True
    assert result.adapter_path.exists()

Step 2: Run Test (Should FAIL)

pytest tests/unit/test_trainer.py::test_train_lora_with_valid_params -v
# FAILED - NameError: name 'train_lora' is not defined ✅ Expected!

Step 3: Implement Code

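# src/[project_name]/methods/lora.py
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Result:  # assumed shape, inferred from the test above
    success: bool
    adapter_path: Path

def train_lora(model: str, data: str, rank: int = 8) -> Result:
    """Train LoRA adapter."""
    # Minimum implementation to pass test
    ...
    return Result(success=True, adapter_path=Path("adapter.npz"))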

Step 4: Run Test (Should PASS)

pytest tests/unit/test_trainer.py::test_train_lora_with_valid_params -v
# PASSED ✅

Test Coverage Standards

Minimum: 80% coverage
Target: 90%+ coverage

Check coverage (requires the pytest-cov plugin):

pytest --cov=src/[project_name] --cov-report=term-missing tests/

Test Patterns

Arrange-Act-Assert (AAA):

def test_example():
    # Arrange: Set up test data
    model = load_model()
    data = load_test_data()

    # Act: Execute the operation
    result = train_model(model, data)

    # Assert: Verify expected outcome
    assert result.success is True

Parametrize for multiple cases:

@pytest.mark.parametrize("rank,expected_params", [
    (8, 8 * 4096 * 2),
    (16, 16 * 4096 * 2),
    (32, 32 * 4096 * 2),
])
def test_lora_parameter_count(rank, expected_params):
    adapter = create_lora_adapter(rank=rank)
    assert adapter.num_parameters == expected_params
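
The expected values in the parametrized test follow from the usual LoRA parameter count: for a square d × d weight matrix, a rank-r adapter adds two low-rank factors of shapes d × r and r × d, giving 2·r·d parameters. A minimal sketch of that arithmetic, assuming a single adapted matrix with hidden size 4096 (the helper name `lora_param_count` is hypothetical and not part of the project):

```python
def lora_param_count(rank: int, hidden_size: int = 4096) -> int:
    """Parameters added by one LoRA adapter on a hidden_size x hidden_size matrix.

    Assumption: a single square weight matrix is adapted, so the two
    low-rank factors have shapes (hidden_size, rank) and (rank, hidden_size).
    """
    if rank <= 0:
        raise ValueError("rank must be a positive integer")
    return 2 * rank * hidden_size


# Same cases as the parametrized test
for rank in (8, 16, 32):
    assert lora_param_count(rank) == rank * 4096 * 2
```

If the real adapter covers several matrices or non-square projections, the expected counts in the test would need to change accordingly.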

Progression Testing

Purpose

Track metrics over time, automatically detect regressions.

How It Works

First Run: Establish baseline
Subsequent Runs: Compare to baseline
Regression: Test FAILS if metric drops >tolerance
Progression: Baseline updates if metric improves >2%
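
The four rules above can be sketched as a small pure function. This is a minimal illustration, assuming that higher metric values are better, that the tolerance is a relative fraction (0.05 means 5%), and that the 2% improvement threshold is also relative to the baseline. The name `classify_run` is hypothetical.

```python
from typing import Optional


def classify_run(
    current: float,
    baseline: Optional[float],
    tolerance: float = 0.05,
    improvement_threshold: float = 0.02,
) -> str:
    """Classify a metric measurement against a stored baseline.

    Returns one of: "baseline", "regression", "progression", "stable".
    Assumes higher values are better and the baseline is non-zero.
    """
    if baseline is None:
        return "baseline"  # first run: nothing to compare against
    change = (current - baseline) / baseline  # relative change
    if change < -tolerance:
        return "regression"  # dropped by more than the tolerance
    if change > improvement_threshold:
        return "progression"  # improved enough to move the baseline
    return "stable"
```

Keeping the decision separate from file I/O makes the thresholds easy to unit test without training anything.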

Baseline File Format

{
  "metric_name": "lora_accuracy",
  "baseline_value": 0.856,
  "tolerance": 0.05,
  "established_at": "2025-10-18T10:30:00",
  "history": [
    { "date": "2025-10-18", "value": 0.856, "change": "baseline established" },
    { "date": "2025-10-20", "value": 0.872, "change": "+1.9% improvement" }
  ]
}

Progression Test Template

# tests/progression/test_lora_accuracy.py

import json
import pytest
from pathlib import Path
from datetime import datetime

BASELINE_FILE = Path(__file__).parent / "baselines" / "lora_accuracy_baseline.json"
TOLERANCE = 0.05  # ±5%

class TestProgressionLoRAAccuracy:
    @pytest.fixture
    def baseline(self):
        if BASELINE_FILE.exists():
            return json.loads(BASELINE_FILE.read_text())
        return None

    def test_lora_accuracy_progression(self, baseline):
        # Measure current metric
        current_accuracy = self._train_and_evaluate()

        # First run - establish baseline
        if baseline is None:
            self._establish_baseline(current_accuracy)
            pytest.skip(f"Baseline established: {current_accuracy:.4f}")

        # Compare to baseline
        baseline_value = baseline["baseline_value"]
        diff_pct = ((current_accuracy - baseline_value) / baseline_value) * 100

        # Check for regression
        if diff_pct < -TOLERANCE * 100:
            pytest.fail(f"REGRESSION: {abs(diff_pct):.1f}% worse than baseline")

        # Check for progression
        if diff_pct > 2.0:
            self._update_baseline(current_accuracy, diff_pct, baseline)

        print(f"✅ Accuracy: {current_accuracy:.4f} ({diff_pct:+.1f}%)")

    def _train_and_evaluate(self) -> float:
        # Your measurement logic here
        raise NotImplementedError("return the measured metric as a float")

    def _establish_baseline(self, value: float):
        # Create baseline file
        pass

    def _update_baseline(self, new_value: float, improvement: float, old_baseline: dict):
        # Update baseline file
        pass
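
The two file helpers are left as stubs in the template. One possible way to fill them in, following the baseline file format shown earlier, is sketched below as standalone functions. The function names, the explicit `path` parameter, and the wording of the `change` strings are assumptions made for illustration, not the project's actual implementation.

```python
import json
from datetime import datetime
from pathlib import Path


def establish_baseline(path: Path, metric_name: str, value: float, tolerance: float = 0.05) -> dict:
    """Create a new baseline file with a single history entry."""
    now = datetime.now()
    baseline = {
        "metric_name": metric_name,
        "baseline_value": value,
        "tolerance": tolerance,
        "established_at": now.isoformat(timespec="seconds"),
        "history": [
            {"date": now.date().isoformat(), "value": value, "change": "baseline established"}
        ],
    }
    path.parent.mkdir(parents=True, exist_ok=True)  # create baselines/ if missing
    path.write_text(json.dumps(baseline, indent=2))
    return baseline


def update_baseline(path: Path, old_baseline: dict, new_value: float, improvement_pct: float) -> dict:
    """Raise the baseline to new_value and append a history entry."""
    updated = dict(old_baseline)  # shallow copy; the history list is rebuilt below
    updated["baseline_value"] = new_value
    updated["history"] = list(old_baseline.get("history", [])) + [
        {
            "date": datetime.now().date().isoformat(),
            "value": new_value,
            "change": f"{improvement_pct:+.1f}% improvement",
        }
    ]
    path.write_text(json.dumps(updated, indent=2))
    return updated
```

Inside the test class, `_establish_baseline` and `_update_baseline` could delegate to these functions with `BASELINE_FILE` as the path.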

Metrics to Track

Regression Testing

Purpose

Ensure fixed bugs never return.

When to Create

After fixing any bug
Issue closed with "Closes #123"
Commit contains "fix:"

Regression Test Template

# tests/regression/test_regression_suite.py

import pytest

class TestMLXPatterns:
    """[FRAMEWORK]-specific regression tests."""

    def test_bug_47_mlx_nested_layers(self):
        """
        Regression test: [FRAMEWORK] model.layers AttributeError

        Bug: Code tried model.layers[i] (doesn't exist)
        Fix: Use model.model.layers[i] (nested structure)
        Date fixed: 2025-10-18
        Issue: #47

        Ensures bug never returns.
        """

        # Arrange
        model = create_mock_mlx_model()

        # Act & Assert: Correct way works
        layer = model.model.layers[0]
        assert layer is not None

        # Assert: Wrong way fails (bug would use this)
        with pytest.raises(AttributeError):
            _ = model.layers
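
The template calls `create_mock_mlx_model()` without defining it. A minimal sketch of such a helper, assuming the only property the test relies on is the nested `model.model.layers` structure (the body below is hypothetical, not the project's implementation):

```python
from types import SimpleNamespace


def create_mock_mlx_model(num_layers: int = 2) -> SimpleNamespace:
    """Build a stand-in object with layers nested under `.model`.

    The outer object deliberately has no `layers` attribute, so
    `model.layers` raises AttributeError while `model.model.layers[i]` works.
    """
    inner = SimpleNamespace(layers=[SimpleNamespace(index=i) for i in range(num_layers)])
    return SimpleNamespace(model=inner)
```

Because the mock needs no framework import, the regression test stays fast and can run in CI without the real model.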

Test Organization

# tests/regression/test_regression_suite.py

class TestMLXPatterns:
    """[FRAMEWORK]-specific bugs."""
    pass

class TestDataProcessing:
    """Data handling bugs."""
    pass

class TestAPIIntegration:
    """External API bugs."""
    pass

class TestErrorHandling:
    """Error handling bugs."""
    pass

Test Organization

Directory Structure

tests/
├── unit/                    # TDD unit tests
│   ├── test_trainer.py
│   ├── test_lora.py
│   └── test_dpo.py
├── integration/             # TDD integration tests
│   └── test_training_pipeline.py
├── progression/             # Baseline tracking
│   ├── test_lora_accuracy.py
│   ├── test_training_speed.py
│   └── baselines/
│       └── *.json
└── regression/              # Bug prevention
    ├── test_regression_suite.py
    └── README.md

Naming Conventions

Test Files: test_{module}.py
Test Functions: test_{what_is_being_tested}()
Test Classes: Test{Feature}

Examples:

test_trainer.py - Tests for trainer module
test_train_lora_with_valid_params() - Specific test
TestProgressionLoRAAccuracy - Progression test class

Best Practices

✅ DO

Write tests first (TDD)
Test one thing per test (focused)
Use descriptive names (test_train_lora_raises_error_on_invalid_model)
Arrange-Act-Assert pattern
Mock external dependencies (APIs, files)
Test edge cases (empty data, None, negative numbers)
Keep tests fast (<1 second each)
Maintain 80%+ coverage

❌ DON'T

Don't test implementation details (test behavior, not internal code)
Don't have test dependencies (tests should be isolated)
Don't hardcode paths (use fixtures, temporary directories)
Don't skip tests (fix them or delete them)
Don't write tests after implementation (that's not TDD!)
Don't test third-party libraries (trust they're tested)
Don't ignore failing tests (fix immediately)
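
Two of the points above, mocking external dependencies and avoiding hardcoded paths, can be illustrated with the standard library alone. This is a sketch under stated assumptions: `fetch_and_save` and the `client.get_text` method are hypothetical names invented for the example.

```python
import tempfile
from pathlib import Path
from unittest.mock import Mock


def fetch_and_save(client, url: str, dest: Path) -> Path:
    """Hypothetical function under test: download text and write it to dest."""
    dest.write_text(client.get_text(url))
    return dest


def test_fetch_and_save_writes_response():
    # Arrange: mock the external API and use a temporary directory
    client = Mock()
    client.get_text.return_value = "hello"
    with tempfile.TemporaryDirectory() as tmpdir:
        dest = Path(tmpdir) / "out.txt"

        # Act
        result = fetch_and_save(client, "https://example.com/data", dest)

        # Assert: file content and the single outbound call
        assert result.read_text() == "hello"
        client.get_text.assert_called_once_with("https://example.com/data")
```

The test touches no network and leaves no files behind, so it stays isolated and fast.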

Pytest Fixtures

Common Fixtures

# conftest.py

import pytest
from pathlib import Path
import tempfile

@pytest.fixture
def temp_dir():
    """Temporary directory for test files."""
    with tempfile.TemporaryDirectory() as tmpdir:
        yield Path(tmpdir)

@pytest.fixture
def sample_data():
    """Sample training data."""
    return {
        "messages": [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Hi there!"}
        ]
    }

@pytest.fixture
def mock_model():
    """Mock [FRAMEWORK] model."""
    class MockModel:
        def __init__(self):
            self.model = type('obj', (object,), {'layers': []})()
    return MockModel()

Coverage Targets

By Test Type

| Test Type   | Target Coverage       |
| ----------- | --------------------- |
| Unit        | 90%+                  |
| Integration | 70%+                  |
| Progression | N/A (metric tracking) |
| Regression  | N/A (bug prevention)  |

Overall Project

Minimum: 80% (enforced in CI/CD)
Target: 90%
Stretch: 95%

Check Coverage

# Run tests with coverage
pytest --cov=src/[project_name] --cov-report=html tests/

# Open coverage report (macOS; use xdg-open on Linux)
open htmlcov/index.html

# Show missing lines
pytest --cov=src/[project_name] --cov-report=term-missing tests/

CI/CD Integration

Pre-Push Hook

# Run all tests before push
pytest tests/ -v

# Check coverage
pytest --cov=src/[project_name] --cov-report=term --cov-fail-under=80 tests/

GitHub Actions

- name: Run Tests
  run: |
    pytest tests/ -v --cov=src/[project_name] --cov-report=xml

- name: Upload Coverage
  uses: codecov/codecov-action@v3
  with:
    files: ./coverage.xml

Quick Reference

Test Types Decision Matrix

Running Tests

# All tests
pytest tests/

# Specific file
pytest tests/unit/test_trainer.py

# Specific test
pytest tests/unit/test_trainer.py::test_train_lora

# With coverage
pytest --cov=src/[project_name] tests/

# Verbose output
pytest -v tests/

# Stop on first failure
pytest -x tests/

# Parallel execution (requires the pytest-xdist plugin)
pytest -n auto tests/

This skill provides complete testing methodology for maintaining high-quality, well-tested code in [PROJECT_NAME].

