plugins/agentv-dev/skills/agentv-eval-review/SKILL.md
Use when reviewing eval YAML files for quality issues, linting eval files before committing, checking eval schema compliance, or when asked to "review these evals", "check eval quality", "lint eval files", or "validate eval structure". Do NOT use for writing evals (use agentv-eval-writer) or running evals (use agentv-bench).
npx skillsauth add entityprocess/agentv agentv-eval-review
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Lint and review AgentV eval YAML files for structural issues, schema compliance, and quality problems. Runs deterministic checks via script, then applies LLM judgment for semantic issues the script cannot catch.
Execute scripts/lint_eval.py against the target eval files:
python scripts/lint_eval.py <path-to-evals-dir-or-file> --json
The script checks:
- `.eval.yaml` extension
- `description` field present
- `id`, `input`, and at least one of `criteria`/`expected_output`/`assertions`
- `type: file` use leading `/`
- `assertions` blocks present (flags tests relying solely on `expected_output`)
- `expected_output` prose detection (flags "The agent should..." patterns)
- […] `input`)

Report the script findings grouped by severity (error > warning > info). For each finding, include the file path and a concrete fix.
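To make the shape of these deterministic checks concrete, here is a minimal sketch of a few of them, with findings ordered by severity. It is not the code in `scripts/lint_eval.py`. The names `Finding`, `check_filename`, `check_test`, and `order_by_severity` are hypothetical, the severity assigned to each check is an assumption, and the sketch takes an already-parsed test mapping because the Python standard library has no YAML parser.

```python
import re
from dataclasses import dataclass

# Severity ranking taken from the text: error > warning > info.
SEVERITY_ORDER = {"error": 0, "warning": 1, "info": 2}

# Assumed heuristic for prose-style expected_output ("The agent should...").
PROSE_PATTERN = re.compile(r"\s*the agent should\b", re.IGNORECASE)


@dataclass(frozen=True)
class Finding:
    """Hypothetical finding record: severity, file path, message, concrete fix."""
    severity: str
    path: str
    message: str
    fix: str


def check_filename(path: str) -> list[Finding]:
    """Flag files that do not use the .eval.yaml extension."""
    if path.endswith(".eval.yaml"):
        return []
    return [Finding("error", path, "file does not use the .eval.yaml extension",
                    "rename the file to <name>.eval.yaml")]


def check_test(path: str, test: dict) -> list[Finding]:
    """Run a few per-test checks on an already-parsed test mapping."""
    findings: list[Finding] = []
    label = test.get("id", "<missing id>")

    # Required fields: id and input.
    for key in ("id", "input"):
        if key not in test:
            findings.append(Finding("error", path, f"test {label}: missing '{key}'",
                                    f"add a '{key}' field to the test"))

    # At least one of criteria / expected_output / assertions.
    graded_by = [k for k in ("criteria", "expected_output", "assertions") if k in test]
    if not graded_by:
        findings.append(Finding("error", path, f"test {label}: nothing to grade against",
                                "add criteria, expected_output, or assertions"))

    # Tests relying solely on expected_output (severity is an assumption).
    if graded_by == ["expected_output"]:
        findings.append(Finding("warning", path,
                                f"test {label}: relies solely on expected_output",
                                "add an assertions block"))

    # Prose detection in expected_output (severity is an assumption).
    expected = test.get("expected_output")
    if isinstance(expected, str) and PROSE_PATTERN.match(expected):
        findings.append(Finding("warning", path,
                                f"test {label}: expected_output reads as prose",
                                "state the expected output itself, or move the prose to criteria"))
    return findings


def order_by_severity(findings: list[Finding]) -> list[Finding]:
    """Sort findings error > warning > info; unknown severities go last."""
    return sorted(findings, key=lambda f: SEVERITY_ORDER.get(f.severity, len(SEVERITY_ORDER)))


# Usage: one badly named file with one prose-style test.
sample_path = "evals/demo.yaml"
sample_test = {"id": "t1", "input": "hi", "expected_output": "The agent should greet the user."}
findings = check_test(sample_path, sample_test) + check_filename(sample_path)
for f in order_by_severity(findings):
    print(f"{f.severity}: {f.path}: {f.message} (fix: {f.fix})")
```

Keeping each check as a small function that returns a list of findings makes it straightforward to add checks and to report every finding with its file path and fix, as the text requires.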
The script catches structural issues but cannot assess:
Read the relevant SKILL.md files and cross-check against the eval content for these issues.
scripts/lint_eval.py: Deterministic eval linter (Python 3.11+, stdlib only)