plugins/agentv-dev/skills/agentv-trace-analyst/SKILL.md
Analyze AgentV evaluation traces and result JSONL files using `agentv inspect` and `agentv compare` CLI commands. Use when asked to inspect AgentV eval results, find regressions between AgentV evaluation runs, identify failure patterns in AgentV trace data, analyze tool trajectories, or compute cost/latency/score statistics from AgentV result files. Do NOT use for benchmarking skill trigger accuracy, analyzing skill-creator eval performance, or measuring skill description quality — those tasks belong to the skill-creator skill.
Analyze evaluation traces headlessly using agentv inspect primitives and jq.
# List result files (most recent first)
agentv inspect list [--limit N] [--format json|table]
# Show results with trace details
agentv inspect show <result-file> [--test-id <id>] [--tree] [--format json|table]
# Percentile statistics
agentv inspect stats <result-file> [--group-by target|suite|test-id] [--format json|table]
# A/B comparison between runs
agentv compare <baseline.jsonl> <candidate.jsonl> [--threshold 0.1] [--format json|table]
agentv inspect list
Pick the result file to analyze. Most recent is first.
agentv inspect stats <result-file>
Read the percentile table. Key signals:
# Tests scoring below 0.8, with their failed assertions and the tools they used
agentv inspect show <result-file> --format json \
| jq '[.[] | select(.score < 0.8) | {test_id, score, assertions: [(.assertions // [])[] | select(.passed | not)], trace: {tools: (.trace.tool_calls // {} | keys)}, duration_ms, cost_usd}]'
For each failing test, examine which assertions failed (passed: false).
# Flat view with trace summary
agentv inspect show <result-file> --test-id <id>
# Tree view (if output messages available)
agentv inspect show <result-file> --test-id <id> --tree
The tree view shows the agent's execution path — LLM calls interspersed with tool invocations. Look for:
agentv compare <baseline.jsonl> <candidate.jsonl>
Look for:
# By target provider
agentv inspect stats <result-file> --group-by target
# By suite
agentv inspect stats <result-file> --group-by suite
Compare providers side-by-side: which is cheaper, faster, more accurate?
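When the built-in grouping is not enough, the same side-by-side comparison can be sketched in Python over the JSON output. This is a minimal sketch and not part of AgentV. It assumes each record carries a `target` field (a guess based on the `--group-by target` option) alongside the `score`, `cost_usd`, and `duration_ms` fields that appear in the jq filter earlier. `summarize_by_target` is a hypothetical helper name and the sample records are made up.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_target(records):
    """Group result records by target and average score, cost, and duration.

    `records` is assumed to be the list parsed from
    `agentv inspect show <result-file> --format json`. The `target` field
    name is an assumption; adjust it to match the actual output.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec.get("target", "unknown")].append(rec)

    summary = {}
    for target, recs in groups.items():
        summary[target] = {
            "count": len(recs),
            "avg_score": mean(r["score"] for r in recs),
            "avg_cost_usd": mean(r["cost_usd"] for r in recs),
            "avg_duration_ms": mean(r["duration_ms"] for r in recs),
        }
    return summary


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"test_id": "t1", "target": "model-a", "score": 1.0, "cost_usd": 0.02, "duration_ms": 1000},
        {"test_id": "t2", "target": "model-a", "score": 0.5, "cost_usd": 0.04, "duration_ms": 3000},
        {"test_id": "t1", "target": "model-b", "score": 0.75, "cost_usd": 0.01, "duration_ms": 500},
    ]
    for target, stats in sorted(summarize_by_target(sample).items()):
        print(target, stats)
```

Averages are used here for brevity. For latency in particular, the percentile table from `agentv inspect stats` is usually the better signal.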
All commands support --format json for piping to jq:
# Top 3 most expensive tests
agentv inspect show <result-file> --format json \
| jq 'sort_by(-.cost_usd) | .[0:3] | .[] | {test_id, cost: .cost_usd, score}'
# Tests where token usage exceeds 10k
agentv inspect show <result-file> --format json \
| jq '[.[] | select(.token_usage.input + .token_usage.output > 10000) | {test_id, tokens: (.token_usage.input + .token_usage.output)}]'
# Score distribution by suite
agentv inspect show <result-file> --format json \
| jq 'group_by(.suite) | .[] | {suite: .[0].suite, count: length, avg_score: ([.[].score] | add / length)}'
# Tool usage frequency across all tests
agentv inspect show <result-file> --format json \
| jq '[.[].trace.tool_calls // {} | to_entries[]] | group_by(.key) | .[] | {tool: .[0].key, total_calls: ([.[].value] | add)}'
# Find regressions > 0.1 between two runs
agentv compare baseline.jsonl candidate.jsonl --format json \
| jq '.matched[] | select(.delta < -0.1) | {test_id: .testId, delta, from: .score1, to: .score2}'
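The same regression filter can be sketched in Python when the results need further processing. The sketch assumes the compare output has the shape implied by the jq filter above: a `matched` array whose entries have `testId`, `delta`, `score1`, and `score2`. `find_regressions` is a hypothetical helper and the sample data is made up. It also returns the mean delta, because an overall improvement can coexist with individual regressions.

```python
import json
from statistics import mean


def find_regressions(compare_output, threshold=0.1):
    """Return (mean_delta, regressions) from parsed `agentv compare --format json` output.

    A regression is a matched test whose delta is below -threshold.
    The field names mirror the jq filter and are otherwise unverified.
    """
    matched = compare_output.get("matched", [])
    if not matched:
        return 0.0, []
    mean_delta = mean(m["delta"] for m in matched)
    regressions = [
        {"test_id": m["testId"], "delta": m["delta"], "from": m["score1"], "to": m["score2"]}
        for m in matched
        if m["delta"] < -threshold
    ]
    # Worst regression first.
    regressions.sort(key=lambda r: r["delta"])
    return mean_delta, regressions


if __name__ == "__main__":
    # Made-up sample: two small wins and one loss of the same size.
    sample = json.loads("""
    {"matched": [
      {"testId": "a", "score1": 0.5, "score2": 0.75, "delta": 0.25},
      {"testId": "b", "score1": 0.5, "score2": 0.75, "delta": 0.25},
      {"testId": "c", "score1": 1.0, "score2": 0.75, "delta": -0.25}
    ]}
    """)
    mean_delta, regressions = find_regressions(sample)
    print(f"mean delta: {mean_delta:+.3f}")
    for reg in regressions:
        print(reg)
```

In the sample, the mean delta is positive even though test `c` regressed, which is why the per-test list matters more than the aggregate.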
When analyzing traces, think about:
Efficiency: Are tool calls/tokens proportional to task complexity? High tokens-per-tool may indicate verbose prompts or unnecessary context.
Error patterns: Do failures cluster by target, suite, or tool usage? Common patterns:
Cost optimization: Identify tests with high cost but acceptable scores — can they use a cheaper model? Compare --group-by target stats.
Latency distribution: P50 vs P99 spread indicates consistency. Large spread means unpredictable performance — investigate P99 outliers.
Regression detection: After a prompt/config change, compare before/after. Mean delta > 0 is good, but check individual test regressions — a few large losses can hide behind many small wins.
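Two of these heuristics reduce to simple arithmetic on the JSON records. The following is a rough sketch with hypothetical helper names. It uses the `duration_ms`, `token_usage`, and `trace.tool_calls` fields from the jq examples, and treats `tool_calls` as a map from tool name to call count. The nearest-rank percentile is an assumption and may differ from the method `agentv inspect stats` uses.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_spread(records):
    """Return (p50, p99, p99 / p50) over each record's duration_ms.

    The ratio is None when p50 is 0, to avoid dividing by zero.
    """
    durations = [r["duration_ms"] for r in records]
    p50 = percentile(durations, 50)
    p99 = percentile(durations, 99)
    return p50, p99, (p99 / p50 if p50 else None)


def tokens_per_tool_call(record):
    """Total tokens divided by total tool calls; None if no tools were called."""
    usage = record.get("token_usage") or {}
    tokens = usage.get("input", 0) + usage.get("output", 0)
    tool_calls = (record.get("trace") or {}).get("tool_calls") or {}
    calls = sum(tool_calls.values())
    return tokens / calls if calls else None


if __name__ == "__main__":
    # Made-up records: four quick tests and one slow outlier.
    sample = [{"duration_ms": d} for d in (100, 200, 300, 400, 5000)]
    print(latency_spread(sample))
    print(tokens_per_tool_call({
        "token_usage": {"input": 9000, "output": 1000},
        "trace": {"tool_calls": {"search": 3, "read_file": 2}},
    }))
```

A large P99/P50 ratio points at outliers worth opening with `agentv inspect show <result-file> --test-id <id>`, and an unusually high tokens-per-tool-call value flags tests to review for verbose prompts or unnecessary context.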