skills/eval-harness/SKILL.md
Build evaluation harnesses for AI-assisted implementations — capability evals, regression tests, graders (exact-match, LLM-judge, code-exec, rubric), and standardized metrics (pass@k, accuracy, latency, cost). Use when the user wants to set up or run evals, benchmark agents/skills, create regression test suites for prompts or AI features, compare model/prompt variants, or measure implementation quality. Triggers on "set up evals", "create eval harness", "benchmark this skill", "regression test the prompt", "run evals", "/eval-harness", or any request to systematically measure AI output quality.
npx skillsauth add mhylle/claude-skills-collection eval-harnessInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
A skill for building and running evaluation harnesses on AI features. Covers two eval types (capability, regression), three grader types (code, model, human), and the metrics that summarize them (pass@k, pass^k).
Evals are defined before or alongside implementation, not after. Success criteria become explicit, measurable, and testable from day one.
Core principles:
Benefits: clear success criteria (no ambiguity about what "done" means), early problem detection, confidence in changes, tests double as executable specifications, quantitative progress tracking.
The mindset shift:
Traditional: "Build it, then figure out if it works"
EDD: "Define what 'works' means, then build to pass those evals"
| Type | Purpose | When to use | |---|---|---| | Capability | Verify new functionality works | Adding features, improving functionality, testing edge cases | | Regression | Protect against unintended breakage | After code changes, before PR merge, after dependency updates |
Full structures and YAML examples → references/eval-types.md.
| Type | Strength | When to use | |---|---|---| | Code (exact match, regex, function output) | Fast, deterministic, cheap | Any codifiable correctness check | | Model (LLM judge with rubric) | Handles semantic variation, tone, quality | Natural-language outputs, multiple valid answers | | Human (Likert, binary, ranking) | Catches what machines miss | Creative content, safety-critical judgments, calibrating auto-graders |
Full configs, implementation sketches, and rubric templates → references/graders.md.
| Metric | Question answered | Use case | |---|---|---| | pass@1 | "Can it solve this at all?" | Single-shot capability | | pass@k | "Can it ever solve this?" | Best-of-k capability, retries cheap | | pass^k | "Can it always solve this?" | Consistency, reliability, production |
Formulas, implementations, and the picking-guide → references/metrics.md.
Four phases. Run them in order when bringing up a new eval suite; rerun phase 3 every time you want a fresh measurement.
Create the evaluation specification before or alongside implementation.
eval_suite:
name: "Feature X Evaluation Suite"
version: "1.0.0"
description: "Comprehensive evaluation for Feature X"
config:
parallel_execution: true
max_workers: 4
timeout_global: 3600
retry_failed: true
retry_count: 2
evals:
- $ref: "./evals/capability/cap-001.yaml"
- $ref: "./evals/capability/cap-002.yaml"
- $ref: "./evals/regression/reg-001.yaml"
reporting:
format: ["json", "markdown", "html"]
output_dir: "./eval-results"
include_details: true
notify_on_failure: true
Checklist:
Build the feature, using evals as guidance.
Implementation loop:
1. Write/modify code
2. Run relevant evals locally
3. Fix failures
4. Repeat until evals pass
5. Commit changes
Best practices:
Run the full evaluation suite and collect results.
# Run all evals
eval-harness run --suite ./eval-suite.yaml
# Run specific category
eval-harness run --suite ./eval-suite.yaml --category capability
# Run with verbose output
eval-harness run --suite ./eval-suite.yaml --verbose
# Run with specific grader override
eval-harness run --suite ./eval-suite.yaml --grader-type model
Execution flow: load suite → validate eval definitions → initialize graders (code/model/human queues) → execute evals (parallel where possible) → collect results → aggregate scores and metrics → generate reports.
Generate comprehensive evaluation reports.
eval_report:
metadata:
suite_name: "Feature X Evaluation Suite"
run_id: "eval-20240120-143052"
timestamp: "2024-01-20T14:30:52Z"
duration_seconds: 127
environment:
model: "claude-3-5-sonnet"
version: "1.2.3"
summary:
total_evals: 25
passed: 22
failed: 3
skipped: 0
pass_rate: 0.88
by_category:
capability: { total: 15, passed: 13, pass_rate: 0.867 }
regression: { total: 10, passed: 9, pass_rate: 0.90 }
by_grader:
code: { total: 18, passed: 17 }
model: { total: 5, passed: 4 }
human: { total: 2, passed: 1 }
metrics:
pass_at_1: 0.88
pass_at_3: 0.96
pass_hat_3: 0.68
failures:
- eval_id: "cap-003"
name: "Complex nested extraction"
reason: "Missed nested array handling"
- eval_id: "cap-011"
name: "Edge case: empty input"
reason: "Threw exception instead of returning empty"
- eval_id: "reg-007"
name: "API response time regression"
reason: "Response time 2.3s exceeds 2.0s threshold"
recommendations:
- "Improve nested data structure handling"
- "Add empty input validation"
- "Investigate response time regression"
Suites and evals can be authored as YAML (preferred) or JSON. Full structures, schemas, and directory layouts → references/file-format.md.
The eval harness integrates as an optional quality gate. Configure it in the plan's phase definition:
phase:
name: "Implement user authentication"
steps:
- type: "implement"
description: "Build authentication logic"
- type: "test"
description: "Run unit tests"
- type: "eval"
description: "Run capability evals"
config:
suite: "./evals/auth-capability.yaml"
required_pass_rate: 0.9
blocking: true # Fail phase if evals fail
metrics:
- "pass@1"
- "pass^3"
Triggering from implement-phase:
eval_trigger:
when: "after_tests_pass"
suite: "./evals/feature-x.yaml"
filters:
categories: ["capability"]
tags: ["critical", "feature-x"]
thresholds:
pass_at_1: 0.85
pass_hat_3: 0.70
on_failure:
action: "block"
notification: true
report_path: "./eval-results/phase-{phase_id}.md"
on_success:
action: "continue"
archive_results: true
In the phase completion report (auto-included when eval step ran):
### Eval Results
| Metric | Value | Threshold | Status |
|--------|-------|-----------|--------|
| pass@1 | 0.92 | 0.85 | PASS |
| pass^3 | 0.78 | 0.70 | PASS |
#### Capability Evals (12/13 passed)
- [x] cap-auth-001: Login flow
- [x] cap-auth-002: Token refresh
- [ ] cap-auth-003: Session timeout (FAILED)
- Expected: Session expires after 30 min
- Actual: Session persists indefinitely
#### Regression Evals (8/8 passed)
- [x] reg-auth-001: Password hashing unchanged
- [x] reg-auth-002: Token format stable
| Output type | Recommended grader | Rationale | |---|---|---| | Exact values | Code (exact match) | Fast, deterministic | | Formatted text | Code (regex) | Validates structure | | Code output | Code (function) | Tests correctness | | Natural language | Model | Handles variation | | Creative content | Human | Subjective judgment | | Safety-critical | Human + Model | Double verification |
| Scenario | Recommended metric | |---|---| | Code generation (any solution works) | pass@k | | Production system reliability | pass^k | | Single-shot capability | pass@1 | | Best-effort with retries | pass@5 or pass@10 | | Consistency requirements | pass^3 or pass^5 |
| Context | pass@1 | pass@k | pass^k | |---|---|---|---| | Development | 0.70 | 0.90 | 0.50 | | Staging | 0.80 | 0.95 | 0.65 | | Production | 0.90 | 0.99 | 0.80 | | Critical | 0.95 | 0.999 | 0.90 |
references/eval-types.md — capability and regression eval structures with YAML examplesreferences/graders.md — code / model / human grader configs, prompt templates, rubric examples, calibrationreferences/metrics.md — pass@k and pass^k formulas, Python implementations, picking guidereferences/file-format.md — full YAML/JSON schemas and recommended directory layouttesting
One-command issue-to-merge pipeline orchestrator. Drives a GitHub issue through nine stages (preflight, plan, implement, review, ci, cloud_review, deploy, e2e, logs) with two human gates, persisting all run state to files so a crashed or interrupted run resumes losslessly. Triggers on "/ship-issue" with an issue number or URL. User-invoked only.
tools
--- name: tt-workflow-build description: Tasktracker-native trigger for a PARALLEL build via the Claude Code Workflow tool. Thin by design — it does two things, then drives to done: (1) ensure a tasktracker project exists (use the existing one, or create one), then (2) start a dynamic `Workflow` that builds it, tracking the work in tasktracker and using the build + verify skills. It does NOT analyze parallelism up front, ask the user to choose a mode, hand back, or fall back to a sequential skil
tools
--- name: grumpy-reviewer description: A single grumpy, nitpicky structural code reviewer that runs as an isolated subagent and treats the code as third-party work submitted by a junior programmer for validation. It cares about exactly one thing — maintainability — judged through separation of concerns, service-oriented design, helper-method extraction, small files, and the rule of 7 (as any grouping nears 7 members, it pushes for sub-groupings). It is deliberately kept OUT of the implementation
development
--- name: tt-workflow-run description: Tasktracker-native autonomous build-loop orchestrator. Drives a first-class `workflow_run` end-to-end — create the run (Gate 1 lifecycle completeness + Gate 2 zero-defects-in), then loop while `getNextReadyTask(projectId)` returns a slice — `setActiveTask` → record a pre-slice `scanArchitectureDrift` baseline → delegate the slice to `/tt-implement-phase` (which does the code work, registers the architecture delta in-slice, and auto-logs defects/learnings/fr