skills/eval-harness/SKILL.md
# eval-harness: AI Agent Evaluation Skill

Automate evaluation (evals) for AI agents. Covers coding, conversational, research, computer-use, and sub-agent types. Implements Swiss Cheese Model (6 layers).

**Based on:** Anthropic Engineering, "Demystifying Evals for AI Agents" (2026-01-09)

---

## When This Skill Applies

Use when the user wants to:

- Set up an eval suite for an AI agent or task
- Run evals with pass@k or pass^k metrics
- Add graders (code-based, model-based, human)
- Generate C…

Before taking action, understand these terms:
| Term | Meaning |
|------|---------|
| Task | One test scenario (YAML file) with input + graders |
| Trial | One run of a task (non-deterministic; run k times) |
| Grader | Logic that scores one aspect of a trial output |
| Outcome | REAL environment state, not what the agent claims |
| pass@k | ≥1 of k trials passes; use for capability building |
| pass^k | ALL k trials pass; use for production reliability |
| Saturation | Score >80% consistently → need harder tasks |
| Regression | Score drops >5% vs previous run → alert |
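
The metric definitions in the table can be sketched in a few lines of Python. This is only an illustration, not the runner's actual code: the function names are hypothetical, "consistently" is assumed to mean every score in the window, and the 5% regression drop is assumed to be an absolute difference between scores on a 0 to 1 scale.

```python
from typing import Sequence


def pass_at_k(trials: Sequence[bool]) -> bool:
    """pass@k: at least one of the k trials passed."""
    return any(trials)


def pass_pow_k(trials: Sequence[bool]) -> bool:
    """pass^k: every one of the k trials passed (empty input counts as a fail)."""
    return len(trials) > 0 and all(trials)


def is_saturated(scores: Sequence[float], threshold: float = 0.80) -> bool:
    """Saturation (assumed reading): every score in the window is above the threshold."""
    return len(scores) > 0 and all(s > threshold for s in scores)


def is_regression(previous: float, current: float, max_drop: float = 0.05) -> bool:
    """Regression (assumed reading): score fell by more than max_drop in absolute terms."""
    return (previous - current) > max_drop
```

For example, the trial outcomes `[False, True, False]` satisfy pass@k but not pass^k, which is why the same agent can look capable under one metric and unreliable under the other.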
Key rules:
Scaffold eval suite for current project.
**Usage:** `/eval-harness init [--type coding|conv|research|cu|all]`

**What to do:**

1. Create the `evals/` directory with full structure
2. Generate `evals/eval.config.yaml` with defaults
3. Use the templates in `skills/eval-harness/templates/`
4. Create the `evals/runs/` dir (SQLite DB lives here)
5. Create the `evals/human-review/pending/` dir
6. Run `python -m runner.storage init --config evals/eval.config.yaml`
7. Tell the user: "Edit `evals/eval.config.yaml` then run `/eval-harness task create`"

`evals/` structure to create:
```
evals/
├── eval.config.yaml
├── tasks/
│   ├── coding/
│   ├── conversational/
│   ├── research/
│   └── computer-use/
├── graders/
│   └── rubrics/
├── runs/
├── reports/
└── human-review/
    └── pending/
```
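
As a rough sketch, the directory tree above could be created with `pathlib`. The helper name `scaffold_evals` and the `EVAL_DIRS` constant are hypothetical; the skill's real scaffolding logic is not shown in this document.

```python
from pathlib import Path

# Sub-directories taken from the tree above, relative to evals/.
EVAL_DIRS = [
    "tasks/coding",
    "tasks/conversational",
    "tasks/research",
    "tasks/computer-use",
    "graders/rubrics",
    "runs",
    "reports",
    "human-review/pending",
]


def scaffold_evals(root: Path) -> Path:
    """Create the evals/ tree under root; safe to call more than once."""
    evals = root / "evals"
    for rel in EVAL_DIRS:
        (evals / rel).mkdir(parents=True, exist_ok=True)
    return evals
```

Using `exist_ok=True` keeps the call idempotent, so re-running init on a partly scaffolded project would not fail.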
`eval.config.yaml` to generate:

```yaml
suite_name: "<project-name> Evals"
db_path: evals/runs/eval.db
default_model: claude-sonnet-4-6
grader_model: claude-sonnet-4-6
default_k: 3
default_metric: pass-at-k
default_timeout_seconds: 300
human_review_notify: true
```
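
One way to hold these defaults in code is a small dataclass with basic validation. This is a sketch under assumptions: the `EvalConfig` class is hypothetical, and the accepted metric names are taken from the config above plus the `--mode pass-at-k|pass-pow-k` flag of `/eval-harness run`; the real runner may accept other values.

```python
from dataclasses import dataclass

# Assumed set of valid metric names (see lead-in).
VALID_METRICS = {"pass-at-k", "pass-pow-k"}


@dataclass
class EvalConfig:
    """In-memory mirror of eval.config.yaml with the defaults listed above."""

    suite_name: str
    db_path: str = "evals/runs/eval.db"
    default_model: str = "claude-sonnet-4-6"
    grader_model: str = "claude-sonnet-4-6"
    default_k: int = 3
    default_metric: str = "pass-at-k"
    default_timeout_seconds: int = 300
    human_review_notify: bool = True

    def __post_init__(self) -> None:
        # Reject values that would make a run meaningless.
        if self.default_k < 1:
            raise ValueError("default_k must be at least 1")
        if self.default_metric not in VALID_METRICS:
            raise ValueError(f"unknown default_metric: {self.default_metric!r}")
        if self.default_timeout_seconds <= 0:
            raise ValueError("default_timeout_seconds must be positive")
```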
Create a new eval task from template.
**Usage:** `/eval-harness task create [--agent-type coding|conv|research|cu|sub-agent] [--name TASK_NAME]`

**What to do:**

1. Read the template `skills/eval-harness/templates/<type>.yaml`
2. Copy it to `evals/tasks/<type>/<task-name>.yaml`
3. Fill in `task_id`, `description`, `difficulty` based on user input
4. Tell the user: "Edit the `input.prompt` and `graders` sections, then run `/eval-harness task validate`"

Read `references/agent-types.md` for agent-type-specific guidance.
Check that a task is well-formed and not broken.
**Usage:** `/eval-harness task validate [TASK_ID|--all]`

**What to do:**

1. Run `python -m runner.core validate --task <task-id> --config evals/eval.config.yaml`
2. With `--all`: validate all tasks in `evals/tasks/`

**Critical:** "Frontier model 0% = task is broken, NOT agent is bad". Always explain this.
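
The "0% means the task is broken" rule can be applied mechanically to trial results. The sketch below assumes results are available as a mapping from task id to a list of pass/fail booleans; that shape, the function name, and the task ids in the usage note are all hypothetical.

```python
from typing import Dict, List


def find_suspect_tasks(results: Dict[str, List[bool]]) -> List[str]:
    """Return ids of tasks where every trial failed (0% pass rate).

    Tasks with no trials are skipped: there is no evidence either way.
    """
    return sorted(
        task_id
        for task_id, trials in results.items()
        if trials and not any(trials)
    )
```

Given `{"fix-auth-bug": [False, False, False], "add-endpoint": [True, False, True]}`, only `fix-auth-bug` is flagged, and the next step would be to review that task's prompt and graders before blaming the agent.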
Run eval suite or specific task.
**Usage:** `/eval-harness run [--suite NAME] [--task TASK_ID] [--k 3] [--mode pass-at-k|pass-pow-k]`

**What to do:**

1. Run `python -m runner.core run --config evals/eval.config.yaml [--task TASK_ID] [--k K]`
2. `--k` overrides the per-task `k:` value
3. Use `python -m runner.report generate --run-id <id>` for the full report

Read `references/metrics.md` for pass@k vs pass^k guidance.
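
The `--k` override described above suggests a simple precedence order. The following sketch assumes the CLI flag wins over the task's `k:` value, and that the config's `default_k` is the fallback when neither is set; the fallback and the function name `resolve_k` are assumptions, not documented behavior.

```python
from typing import Optional


def resolve_k(cli_k: Optional[int], task_k: Optional[int], default_k: int = 3) -> int:
    """Pick the number of trials: CLI flag, then the task's k:, then the config default."""
    for candidate in (cli_k, task_k):
        if candidate is not None:
            if candidate < 1:
                raise ValueError("k must be at least 1")
            return candidate
    return default_k
```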
Run graders on an existing run (or re-grade).
**Usage:** `/eval-harness grade [--run-id RUN_ID] [--grader code|model|human|all]`

**What to do:**

1. Run `python -m runner.core grade --run-id <id> --grader <type>`
   - `code` grader: runs synchronously (fast)
   - `model` grader: calls LLM with rubric (async, show progress)
   - `human` grader: enqueues to `evals/human-review/pending/` and does NOT block

Add a grader to an existing task.
**Usage:** `/eval-harness grader add --type code|model|human [--task TASK_ID]`

**What to do:**

1. Based on `--type`:
   - `code`: Ask "What to check?" → add a `type: unit_tests` or `type: state_check` block
   - `model`: Ask "What rubric criteria?" → create `rubrics/<task>-rubric.md`, add a `type: llm_rubric` block
   - `human`: Add a `type: human` block with `queue: review`
2. Set `weight:` and ensure all grader weights in the task sum to 1.0
3. Run `/eval-harness grader test --task <id>` to verify

Read `references/graders.md` for grader configuration patterns.
Test a grader with a sample transcript.
**Usage:** `/eval-harness grader test [--task TASK_ID] [--grader-id ID]`

**What to do:**

1. Run `python -m runner.graders.code_grader test --task <id>`

Generate evaluation report.
**Usage:** `/eval-harness report [--run-id RUN_ID] [--format md|json|html] [--compare RUN_ID]`

**What to do:**

1. Run `python -m runner.report generate --run-id <id> --format <fmt>`
2. Save the report to `evals/reports/YYYY-MM-DD-run-<id>.<fmt>`
3. Compare against another run when given `--compare`

View metrics over time.
**Usage:** `/eval-harness metrics [--suite NAME] [--trend] [--saturation-check]`

**What to do:**

1. Run `python -m runner.metrics summary --config evals/eval.config.yaml`
2. `--trend`: show score progression over time
3. `--saturation-check`: flag tasks with score >80% consistently

View trial transcript/trace.
**Usage:** `/eval-harness transcript [--run-id RUN_ID] [--task-id ID] [--failed-only]`

**What to do:**

1. Query the DB: `SELECT transcript_json FROM trials WHERE ...`
2. `--failed-only`: only show trials where `passed=false`

Create human review queue.
**Usage:** `/eval-harness human-review schedule [--sample N] [--reviewer NAME]`

**What to do:**

1. Run `python -m runner.graders.human_grader schedule --sample <N> --reviewer <name>`
2. Create `evals/human-review/pending/review-<id>.md` for each (see review file format in spec)
3. If `human_review_notify: true` in config: create a scheduled task reminder
4. Tell the user to run `/eval-harness human-review list` to see the pending reviews

List pending review files.
**Usage:** `/eval-harness human-review list`

**What to do:**

1. List the `.md` files in `evals/human-review/pending/`

Import human review results into DB.
**Usage:** `/eval-harness human-review submit [--review-file PATH]`

**What to do:**

1. Run `python -m runner.graders.human_grader submit --file <path>`
2. Record the result in the `human_reviews` table in DB
3. Move the review file from `pending/` to `completed/`

Generate CI adapter configuration.
**Usage:** `/eval-harness ci setup --platform github|gitlab|jenkins|generic`

**What to do:**

1. Read `references/ci-adapters.md` for platform templates
2. Generate the file for the chosen platform:
   - `github` → `.github/workflows/eval.yml`
   - `gitlab` → `.gitlab-ci.yml` (append eval job)
   - `jenkins` → `Jenkinsfile` (append eval stage)
   - `generic` → `scripts/run-evals.sh`

Required env vars: `ANTHROPIC_API_KEY`, `EVAL_DB_PATH`, `EVAL_SUITE`, `EVAL_MODEL`
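
A CI job will usually fail more clearly if it checks the required variables before starting a run. The sketch below is one possible pre-flight check; the helper name `missing_env_vars` is hypothetical and not part of the skill's documented tooling.

```python
import os
from typing import List, Mapping

# The four variables listed above.
REQUIRED_ENV_VARS = ("ANTHROPIC_API_KEY", "EVAL_DB_PATH", "EVAL_SUITE", "EVAL_MODEL")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or empty, in declaration order."""
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]
```

A CI script could call `missing_env_vars()` first and exit with a non-zero status and a readable message if the list is not empty.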
Run evals in CI context.
**Usage:** `/eval-harness ci run [--mode regression|capability] [--fail-threshold 80]`

**What to do:**

1. Run `python -m runner.core run --config evals/eval.config.yaml --mode <mode>`
2. `regression` mode: run only `eval_mode: regression` tasks

Production monitoring integration.
**Usage:** `/eval-harness monitor --source logs|api|feedback [--alert-on-drop]`

**What to do:**

1. Read `references/swiss-cheese.md` for monitoring layer guidance
2. Depending on `--source`:
   - `logs`: Show how to parse application logs for agent failure signals
   - `api`: Show how to hook into API response logging for monitoring
   - `feedback`: Show how to import user feedback ratings into DB
3. `--alert-on-drop`: configure threshold alerts

This is Swiss Cheese layer 2 (production monitoring). Provide guidance, not automation, because this is project-specific.
Show eval suite overview.
**Usage:** `/eval-harness status`

**What to do:**

1. Run `python -m runner.metrics status --config evals/eval.config.yaml`

Always verify weights sum to 1.0 when adding/editing graders. If they don't, redistribute proportionally.
```python
# Validation in runner
total = sum(g['weight'] for g in task['graders'])
assert abs(total - 1.0) < 0.01, f"Grader weights sum to {total}, must be 1.0"
```
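
"Redistribute proportionally" can be read as dividing each weight by the current total, which keeps the ratios between graders unchanged. The sketch below follows that reading; the function name `normalize_weights` is hypothetical, and the grader dicts are assumed to carry a numeric `weight` key as in the validation snippet above.

```python
from typing import Dict, List


def normalize_weights(graders: List[Dict]) -> List[Dict]:
    """Return copies of the graders with weights rescaled to sum to 1.0."""
    total = sum(g["weight"] for g in graders)
    if total <= 0:
        raise ValueError("Grader weights must sum to a positive number")
    # Dividing by the total preserves each grader's share relative to the others.
    return [{**g, "weight": g["weight"] / total} for g in graders]
```

For instance, weights of 0.5 and 0.3 (sum 0.8) become 0.625 and 0.375. The input list is left unmodified, so the caller can show the user a before/after comparison.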
- Trial timeout: `passed=false, error="timeout"`; do NOT retry automatically
- Trial exception: `passed=false, error=traceback`; show traceback to user
- Grader failure: `grader_score=null, error=msg`; do NOT fail the whole run

Reference files:

- `references/agent-types.md`: per-agent-type YAML patterns and grader guidance
- `references/graders.md`: code/model/human grader configuration
- `references/metrics.md`: pass@k vs pass^k, saturation, regression formulas
- `references/ci-adapters.md`: CI platform templates
- `references/swiss-cheese.md`: 6-layer monitoring guide