michaelalber/evaluate-tests

Name: evaluate-tests
Author: michaelalber

skills/team/evaluate-tests/SKILL.md

npx skillsauth add michaelalber/ai-toolkit evaluate-tests


Evaluate Tests

"If tests break when you rename a private method, your tests are describing the solution, not the problem." — Adapted from Kent Beck

Two Modes

| Mode | Input | Output | Use when |
|------|-------|--------|----------|
| 1 · Test-file quality | a test file or directory | per-test classification + prioritized rewrite list | auditing whether existing tests are behavioral and refactor-safe |
| 2 · TDD compliance | git commit history (+ test/impl files) | 0–25 compliance scorecard + anti-pattern findings | verifying test-first discipline was actually followed |

Pick the mode from the request. Run both for a full audit — Mode 1's findings feed the Behavioral and Coverage categories of Mode 2's scorecard.

Mode 1 — Test-File Quality

What This Skill Evaluates

For each test, assess two properties and one pattern:

1. Behavioral (does the test specify behavior?)

A behavioral test fails when the observable outcome breaks. An implementation-coupled test fails when internals change even if behavior is identical.

| Signal | Behavioral | Implementation-coupled |
|--------|-----------|----------------------|
| Assertion target | Return value, state visible through public API, side effect via public interface | Mock call count (`assert_called_with`), private field access (`_total`), internal call order |
| Refactor sensitivity | Survives extract-method, rename, move-class | Breaks on rename even when behavior is unchanged |
| Failure message | "Expected 42, got 0" (points to the what) | "Expected save() to be called once" (points to the how) |
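The contrast in the table can be sketched with a small Python example. `InMemoryRepo` and `register` are hypothetical names invented for illustration; they are not part of the skill. The first test asserts on a mock call (the how), the second on state visible through the public API (the what).

```python
from unittest.mock import Mock


class InMemoryRepo:
    """Hypothetical repository with a small public API (illustration only)."""

    def __init__(self):
        self._rows = {}

    def save(self, user_id, name):
        self._rows[user_id] = name

    def find(self, user_id):
        return self._rows.get(user_id)


def register(repo, user_id, name):
    """Hypothetical function under test: stores a new user."""
    repo.save(user_id, name)


def test_register_coupled():
    # Implementation-coupled: only checks that save() was called with
    # these arguments. Nothing verifies the user can be found afterwards.
    repo = Mock()
    register(repo, 1, "ada")
    repo.save.assert_called_with(1, "ada")


def test_register_behavioral():
    # Behavioral: asserts on the observable outcome via the public API.
    repo = InMemoryRepo()
    register(repo, 1, "ada")
    assert repo.find(1) == "ada"
```

If `register` were later rewritten to call a different persistence method, the first test would likely fail even though users are still stored, while the second would keep passing as long as `find` returns the user.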

2. Structure-insensitive (does the test survive internal refactors?)

Run this mental test: if I renamed every private method and field in this class, would this test break? If yes, it is structure-coupled.
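The rename test can be acted out in a few lines of Python. `Cart` and `RenamedCart` are hypothetical classes with identical behavior; only the private field name differs. This is a sketch of the idea, not a tool the skill ships.

```python
class Cart:
    """Hypothetical class; `_total` is a private detail."""

    def __init__(self):
        self._total = 0

    def add(self, price):
        self._total += price

    def total(self):
        return self._total


class RenamedCart:
    """Same observable behavior after renaming the private field."""

    def __init__(self):
        self._sum = 0

    def add(self, price):
        self._sum += price

    def total(self):
        return self._sum


def check_structure_coupled(cart_cls):
    # Reads the private field directly, so it depends on its name.
    cart = cart_cls()
    cart.add(42)
    return cart._total == 42


def check_structure_insensitive(cart_cls):
    # Uses only the public API, so a private rename cannot break it.
    cart = cart_cls()
    cart.add(42)
    return cart.total() == 42
```

`check_structure_insensitive` holds for both classes, whereas `check_structure_coupled(RenamedCart)` raises `AttributeError` although nothing about the cart's behavior changed.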

3. Horizontal slice pattern (were tests written without matching implementation?)

Signs in git history: a commit with 3+ test stubs followed by a commit with all implementations. Signs in test file: test methods that assert on code that clearly pre-existed the test (no trace of incremental growth in the test file).

Per-Test Classification

For each test, classify as one of:

| Class | Definition | Action |
|-------|-----------|--------|
| PASS | Behavioral + structure-insensitive | Keep |
| COUPLED | Asserts on implementation detail (call counts, private state, internal call order) | Rewrite: assert on observable outcome instead |
| FRAGILE | Survives behavior changes but breaks on structural refactors | Rewrite: remove internal references |
| THEATER | Passes trivially; no assertion or assertion always passes | Strengthen or delete |

Output Format

## Test Evaluation: [file or suite name]

### Summary
Tests evaluated: N | PASS: N | COUPLED: N | FRAGILE: N | THEATER: N

### Prioritized Rewrite List

| Priority | Test | Class | Reason | Suggested Fix |
|----------|------|-------|--------|---------------|
| 1 | `test_name` | COUPLED | Asserts `repo.save.called` — tests the how, not the what | Assert `repo.find(id) is not None` instead |
| 2 | `test_name` | FRAGILE | Reads `user._email` directly | Assert via `user.get_profile()["email"]` |
| 3 | `test_name` | THEATER | No assertion — only calls the method | Add assertion on return value or observable state |

### Horizontal Slice Signals
[commit evidence or "not detectable from test file alone — run Mode 2 (commit history)"]
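As a rough sketch, the Summary line of this template could be produced from a mapping of test names to classes. The function name `summary_line` and the input shape are assumptions made for illustration; the skill itself does not prescribe any code.

```python
from collections import Counter

# The four classes from the Per-Test Classification table.
CLASSES = ("PASS", "COUPLED", "FRAGILE", "THEATER")


def summary_line(classifications):
    """Build the Summary line from {test_name: class} (hypothetical helper)."""
    counts = Counter(classifications.values())
    unknown = set(counts) - set(CLASSES)
    if unknown:
        raise ValueError(f"unknown classes: {sorted(unknown)}")
    parts = [f"Tests evaluated: {len(classifications)}"]
    # Classes with no tests are reported as 0 rather than omitted.
    parts += [f"{name}: {counts[name]}" for name in CLASSES]
    return " | ".join(parts)
```

For example, `summary_line({"test_a": "PASS", "test_b": "COUPLED", "test_c": "PASS"})` yields `Tests evaluated: 3 | PASS: 2 | COUPLED: 1 | FRAGILE: 0 | THEATER: 0`.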

Scope (Mode 1)

Pass a single file: evaluate all tests in it. Pass a directory: evaluate the most-changed test files first; ask before evaluating more than 20 tests in one turn.
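One possible way to rank "most-changed" test files is to count how often each path appears in the output of `git log --name-only --pretty=format:`. The sketch below only parses that text; the function name and the `test_*.py` / `*_test.py` filename patterns are assumptions, and the skill does not state how it ranks files.

```python
from collections import Counter
from fnmatch import fnmatch


def most_changed_test_files(name_only_log, patterns=("test_*.py", "*_test.py")):
    """Rank test files by how many commits touched them.

    `name_only_log` is assumed to be the text printed by
    `git log --name-only --pretty=format:` (one path per line,
    blank lines between commits).
    """
    counts = Counter()
    for line in name_only_log.splitlines():
        path = line.strip()
        if not path:
            continue  # blank separator between commits
        filename = path.rsplit("/", 1)[-1]
        if any(fnmatch(filename, pattern) for pattern in patterns):
            counts[path] += 1
    # Most frequently changed first.
    return [path for path, _ in counts.most_common()]
```

Non-test paths are ignored, so the result can be walked from the top until the 20-test limit mentioned above is reached.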

For concrete rewrite examples by language: see Coupling Patterns.

Mode 2 — TDD Compliance (commit-history audit)

Verifies the discipline was followed, not just that tests exist. Tests written after implementation feel different, test different things, and provide different value than tests written first. This mode detects that from git history and scores it.

Workflow

1. Gather evidence: chronological git log, test + implementation file contents, coverage report (if available), test-run results.
2. Analyze commit order: did a failing-test commit precede each implementation commit? Flag commits containing both test and impl ("should be separate") and impl commits with no preceding test.

   | Commit | Type | TDD compliant? |
   |--------|------|----------------|
   | abc123 | Test | N/A (first) |
   | def456 | Impl | Yes (test first) |
   | ghi789 | Both | No (should be separate) |
   | jkl012 | Impl | No (no preceding test) |

3. Analyze test quality: run Mode 1's classification on the changed tests.
4. Check coverage quality: distinguish real coverage (tests fail when behavior breaks; edge and error paths covered) from theater (passes even with broken behavior; no assertions; happy path only).
5. Generate the scorecard.
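The commit-order step can be sketched as a small classifier. The rule used here, that an implementation commit counts as compliant only when the commit directly before it is test-only, is inferred from the example table and is an assumption, as are the `Commit` dataclass, the `audit` function, and the plain "N/A" verdict for later test-only commits.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    """Hypothetical summary of one commit, oldest first in the input list."""
    sha: str
    touches_tests: bool
    touches_impl: bool


def commit_type(commit):
    if commit.touches_tests and commit.touches_impl:
        return "Both"
    if commit.touches_tests:
        return "Test"
    if commit.touches_impl:
        return "Impl"
    return "Other"  # e.g. docs-only commits


def audit(commits):
    """Return (sha, type, verdict) rows in chronological order."""
    rows = []
    previous = None  # type of the last test/impl-related commit
    for commit in commits:
        kind = commit_type(commit)
        if kind == "Test":
            verdict = "N/A (first)" if previous is None else "N/A"
        elif kind == "Both":
            verdict = "No (should be separate)"
        elif kind == "Impl":
            verdict = "Yes (test first)" if previous == "Test" else "No (no preceding test)"
        else:
            verdict = "N/A"
        rows.append((commit.sha, kind, verdict))
        if kind != "Other":
            previous = kind  # unrelated commits do not break the chain
    return rows
```

Run on the four commits from the table, this reproduces the same Type and verdict columns. It cannot tell whether the test commit actually failed before the implementation landed; that still needs test-run evidence.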

AI Anti-Patterns to Detect

| Anti-Pattern | Detection Signal |
|---|---|
| Test-After Implementation | Both test and impl in one commit; no failing-test commit before the impl commit |
| Over-Mocking | `assert_called_with(...)` on implementation-internal methods |
| Happy Path Only | Test inventory missing zero / overflow / invalid-input cases |
| Assert-Free Tests | Zero assertion statements in the test body |
| Implementation Coupling | `_private_method` / `_internal_state` references in test assertions |
| Copy-Paste Tests | Test names like `test_X_1`, `test_X_2` with only value differences |

Compliance Scorecard

## TDD Compliance Scorecard: [repo/branch]
**Period**: [date range] | **Commits analyzed**: N

| Category | Score | Status |
|----------|-------|--------|
| Test-First Development | X/5 | GREEN/YELLOW/RED |
| Behavioral Testing | X/5 | GREEN/YELLOW/RED |
| Minimal Implementation | X/5 | GREEN/YELLOW/RED |
| Refactoring Discipline | X/5 | GREEN/YELLOW/RED |
| Coverage Quality | X/5 | GREEN/YELLOW/RED |
**Overall**: X/25 ([percentage]%)

Anti-patterns: [list or "none"]
Recommendations: Immediate: [...] | Short-term: [...] | Ongoing: [...]

Full scoring methodology: Compliance Scoring. AI-specific anti-pattern catalog: AI Anti-Patterns.
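The Overall line is simple arithmetic over the five category scores. The sketch below assumes integer scores from 0 to 5 per category and a rounded percentage; the function name is hypothetical, and the GREEN/YELLOW/RED thresholds are deliberately left out because this page does not define them.

```python
# Category names as they appear in the scorecard template.
CATEGORIES = (
    "Test-First Development",
    "Behavioral Testing",
    "Minimal Implementation",
    "Refactoring Discipline",
    "Coverage Quality",
)


def overall_line(scores):
    """Format the Overall row from {category: score out of 5}."""
    missing = [name for name in CATEGORIES if name not in scores]
    if missing:
        raise ValueError(f"missing categories: {missing}")
    for name in CATEGORIES:
        if not 0 <= scores[name] <= 5:
            raise ValueError(f"{name}: score must be between 0 and 5")
    total = sum(scores[name] for name in CATEGORIES)
    percentage = round(100 * total / 25)  # 5 categories x 5 points
    return f"**Overall**: {total}/25 ({percentage}%)"
```

For instance, scores of 4, 3, 5, 2, and 4 sum to 18, which gives `**Overall**: 18/25 (72%)`.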

Discipline rules: evidence-based only — never claim compliance without commit-history proof; be constructive (findings drive improvement, not punishment); account for context (legacy code, time pressure, learning curves) before scoring.


Audits existing tests in two modes: (1) test-file quality — grades tests against Beck's behavioral and structure-insensitive criteria, flagging implementation-coupled, fragile, and theater tests with a prioritized rewrite list; (2) TDD compliance — analyzes git history for test-first discipline, producing a 0-25 scorecard with AI anti-pattern findings. Use when auditing inherited suites, checking AI-generated tests before merge, prepping for safe refactoring, or verifying TDD was followed. Not for writing new tests (tdd), or when there are no tests yet.

1 star

development

Updated Jun 21, 2026


npx skillsauth add michaelalber/ai-toolkit evaluate-tests

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Reported % |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Jun 21, 2026, 6:19 AM · 21.3s · 4 files scanned

SKILL.md

name: evaluate-tests
audience: team
description: >
  Audits existing tests in two modes: (1) test-file quality: grades tests against Beck's


Related Skills

michaelalber/grilling (development)

Interviews the user relentlessly about a plan, decision, or idea, one question at a time, each with a recommended answer. Shared engine behind "grill-me" and "grill-with-docs". Use on any "grill" trigger phrase or to stress-test thinking. Do NOT use to build the plan; it ends at shared understanding, not implementation.

1 · SKILL.md · Updated Jul 23, 2026

michaelalber/grill-with-docs (testing)

Runs a relentless interview to sharpen a plan or design, capturing the decisions as ADRs and a glossary along the way. Use when the user wants to be grilled AND wants the session to leave durable domain documentation behind. Do NOT use for a throwaway stress-test with no artifacts; use grill-me instead.

1 · SKILL.md · Updated Jul 23, 2026

michaelalber/vue-security-review (tools)

OWASP-based security review of Vue/TypeScript front-ends. Detects framework (Vite/Vue CLI/Nuxt), entry points, and data flows; scans the OWASP Top 10 (2025) mapped to Vue client-side risks (raw-HTML XSS via v-html, URL/protocol injection, bundled secrets, insecure token storage, dependency CVEs, missing CSP, open redirects, router guard bypass); emits an exec summary plus graded findings. Use to audit Vue for vulnerabilities. Not for architecture grading (vue-architecture-checklist).

1 · SKILL.md · Updated Jul 20, 2026

michaelalber/vue-modernization-analyzer (tools)

Analyzes legacy Vue codebases and produces actionable modernization plans. Primary migration paths include Options API to Composition API, Vue 2 to Vue 3, Vue CLI to Vite, JavaScript to TypeScript, Vue Test Utils/Karma/Mocha to Vitest + Vue Testing Library, legacy Vuex to Pinia, and removed-in-Vue-3 pattern cleanup (filters, event bus, `$listeners`). Does NOT perform the migration; it assesses, quantifies risk, and plans.

1 · SKILL.md · Updated Jul 20, 2026


Install manually


# Clone the repo
git clone https://github.com/michaelalber/ai-toolkit.git

# Copy into Claude Code skills folder (global)
cp -r ai-toolkit/skills/team/evaluate-tests ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

michaelalber/ai-toolkit

1 star

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT