entityprocess/agentv-trace-analyst

Name: agentv-trace-analyst
Author: entityprocess

plugins/agentv-dev/skills/agentv-trace-analyst/SKILL.md

npx skillsauth add entityprocess/agentv agentv-trace-analyst

AgentV Trace Analyst

Analyze evaluation traces headlessly using agentv inspect primitives and jq.

Primitives

# List result files (most recent first)
agentv inspect list [--limit N] [--format json|table]

# Show results with trace details
agentv inspect show <result-file> [--test-id <id>] [--tree] [--format json|table]

# Percentile statistics
agentv inspect stats <result-file> [--group-by target|suite|test-id] [--format json|table]

# A/B comparison between runs
agentv compare <baseline.jsonl> <candidate.jsonl> [--threshold 0.1] [--format json|table]

Analysis Workflow

1. Discover results

agentv inspect list

Pick the result file to analyze; the most recent one is listed first.

2. Get overview

agentv inspect stats <result-file>

Read the percentile table. Key signals:

score p50 < 0.8: Significant quality issues
latency p90 > 30s: Performance bottleneck
cost p99 spike: Outlier cost tests to investigate
tool_calls p90 >> p50: Some tests are much chattier
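
The signals above can be sketched in a few lines of Python. This is only an illustration on made-up numbers: the nearest-rank percentile method, the function names, and the 3x cutoff used for "p90 >> p50" are assumptions, and agentv may compute its percentiles differently.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def key_signals(scores, latencies_s, tool_calls):
    """Return the overview signals that fire for one run (thresholds taken from the list above)."""
    signals = []
    if percentile(scores, 50) < 0.8:
        signals.append("score p50 < 0.8")
    if percentile(latencies_s, 90) > 30:
        signals.append("latency p90 > 30s")
    # The text does not define ">>"; 3x is an arbitrary, assumed cutoff.
    if percentile(tool_calls, 90) > 3 * percentile(tool_calls, 50):
        signals.append("tool_calls p90 >> p50")
    return signals

# Made-up values for five tests.
scores = [0.9, 0.7, 0.75, 1.0, 0.6]
latencies_s = [4.0, 6.5, 5.0, 41.0, 7.2]
tool_calls = [2, 3, 2, 14, 3]
print(key_signals(scores, latencies_s, tool_calls))
```

With this sample data all three checks fire, which would point to step 3 (investigating individual failures) as the next move.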

3. Investigate failures

agentv inspect show <result-file> --format json \
  | jq '[.[] | select(.score < 0.8) | {test_id, score, assertions: [.assertions[]? | select(.passed | not)], trace: {tools: (.trace.tool_calls // {} | keys)}, duration_ms, cost_usd}]'

For each failing test, examine:

assertions (failed): What criteria were not met? (filter for passed: false)
trace.tool_calls: Did the agent use expected tools?
duration_ms: Did it time out or run too long?
reasoning: Why did the grader score it low?
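
For readers who prefer to post-process the JSON in a script, the same filter might look like this in Python. The field names (test_id, score, assertions, passed, trace.tool_calls, duration_ms) are copied from the jq query above; the "text" key on each assertion and the sample records are hypothetical.

```python
def failing_tests(results, threshold=0.8):
    """Keep tests scoring below the threshold, with only their failed assertions."""
    summary = []
    for r in results:
        if r["score"] >= threshold:
            continue
        summary.append({
            "test_id": r["test_id"],
            "score": r["score"],
            # Tolerate missing or null fields instead of raising.
            "failed": [a for a in r.get("assertions") or [] if not a.get("passed")],
            "tools": sorted((r.get("trace") or {}).get("tool_calls") or {}),
            "duration_ms": r.get("duration_ms"),
        })
    return summary

# Hypothetical records shaped like the fields used in the jq query.
results = [
    {"test_id": "t1", "score": 1.0,
     "assertions": [{"text": "cites source", "passed": True}],
     "trace": {"tool_calls": {"search": 2}}, "duration_ms": 1200},
    {"test_id": "t2", "score": 0.5,
     "assertions": [{"text": "cites source", "passed": True},
                    {"text": "valid JSON", "passed": False}],
     "trace": {"tool_calls": {"search": 1, "read_file": 4}}, "duration_ms": 30500},
]
for row in failing_tests(results):
    print(row)
```

Only t2 is reported, with its single failed assertion and the names of the tools it called.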

4. Inspect specific tests

# Flat view with trace summary
agentv inspect show <result-file> --test-id <id>

# Tree view (if output messages available)
agentv inspect show <result-file> --test-id <id> --tree

The tree view shows the agent's execution path — LLM calls interspersed with tool invocations. Look for:

Excessive tool calls: Agent looping or exploring unnecessarily
Missing tools: Expected tool not called
Long durations: Specific tool calls that are slow
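
The three checks above can also be automated once an execution path is available as data. The sketch below assumes a hypothetical flat list of steps with "type", "name", and "duration_ms" keys; the real tree output of agentv may be structured differently, and both limits are arbitrary defaults.

```python
from collections import Counter

def review_path(steps, expected_tools, max_calls_per_tool=5, slow_ms=10_000):
    """Flag excessive, missing, and slow tool calls in one execution path."""
    tool_steps = [s for s in steps if s["type"] == "tool"]
    counts = Counter(s["name"] for s in tool_steps)
    return {
        # Tools called more often than the (assumed) per-tool limit.
        "excessive": sorted(n for n, c in counts.items() if c > max_calls_per_tool),
        # Expected tools that never appear in the path.
        "missing": sorted(set(expected_tools) - set(counts)),
        # Individual tool calls slower than the (assumed) duration limit.
        "slow": [s["name"] for s in tool_steps if s["duration_ms"] > slow_ms],
    }

# Hypothetical path: one LLM call, six grep calls, one slow read_file, one LLM call.
steps = (
    [{"type": "llm", "duration_ms": 900}]
    + [{"type": "tool", "name": "grep", "duration_ms": 150} for _ in range(6)]
    + [{"type": "tool", "name": "read_file", "duration_ms": 12_000},
       {"type": "llm", "duration_ms": 1_100}]
)
print(review_path(steps, expected_tools=["grep", "run_tests"]))
```

Here grep is flagged as excessive, run_tests as missing, and read_file as slow.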

5. Compare runs

agentv compare <baseline.jsonl> <candidate.jsonl>

Look for:

Wins vs losses: Net improvement or regression?
Mean delta: Overall direction of change
Per-test deltas: Which tests regressed?
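
As a rough illustration of what these three numbers mean, here is a small Python sketch that derives them from two score maps. Treating a delta beyond the threshold as a win or loss is an assumption about how --threshold behaves, and the function name and sample scores are hypothetical.

```python
def compare_scores(baseline, candidate, threshold=0.1):
    """Compare per-test scores for tests present in both runs."""
    shared = sorted(set(baseline) & set(candidate))
    deltas = {t: candidate[t] - baseline[t] for t in shared}
    return {
        # Assumed rule: only changes larger than the threshold count.
        "wins": [t for t, d in deltas.items() if d > threshold],
        "losses": [t for t, d in deltas.items() if d < -threshold],
        "mean_delta": sum(deltas.values()) / len(deltas) if deltas else 0.0,
        "deltas": deltas,
    }

# Made-up scores keyed by test id.
baseline = {"a": 0.5, "b": 0.9, "c": 0.8}
candidate = {"a": 0.75, "b": 0.5, "c": 0.8}
report = compare_scores(baseline, candidate)
print(report["wins"], report["losses"], round(report["mean_delta"], 3))
```

In this sample there is one win and one loss, but the loss is larger, so the mean delta is negative and the candidate is a net regression.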

6. Group analysis

# By target provider
agentv inspect stats <result-file> --group-by target

# By suite
agentv inspect stats <result-file> --group-by suite

Compare providers side-by-side: which is cheaper, faster, more accurate?
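
The grouping idea can be sketched as follows. This version reports plain averages rather than the percentiles that agentv inspect stats prints, and the target names and record fields beyond those used in the queries in this document are made up for the example.

```python
from collections import defaultdict
from statistics import mean

def stats_by(results, key="target"):
    """Average score, cost, and duration per group (for example per target or suite)."""
    groups = defaultdict(list)
    for r in results:
        groups[r[key]].append(r)
    return {
        name: {
            "count": len(rows),
            "avg_score": mean(r["score"] for r in rows),
            "avg_cost_usd": mean(r["cost_usd"] for r in rows),
            "avg_duration_ms": mean(r["duration_ms"] for r in rows),
        }
        for name, rows in groups.items()
    }

# Hypothetical results for two targets.
results = [
    {"target": "provider-a", "score": 0.5, "cost_usd": 0.25, "duration_ms": 1000},
    {"target": "provider-a", "score": 1.0, "cost_usd": 0.75, "duration_ms": 3000},
    {"target": "provider-b", "score": 1.0, "cost_usd": 0.5, "duration_ms": 2000},
]
for target, row in stats_by(results).items():
    print(target, row)
```

Passing key="suite" would give the per-suite view, provided each record carries a suite field.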

Advanced Queries with jq

All commands support --format json for piping to jq:

# Top 3 most expensive tests
agentv inspect show <result-file> --format json \
  | jq 'sort_by(-.cost_usd) | .[0:3] | .[] | {test_id, cost: .cost_usd, score}'

# Tests where token usage exceeds 10k
agentv inspect show <result-file> --format json \
  | jq '[.[] | select(.token_usage.input + .token_usage.output > 10000) | {test_id, tokens: (.token_usage.input + .token_usage.output)}]'

# Score distribution by suite
agentv inspect show <result-file> --format json \
  | jq 'group_by(.suite) | .[] | {suite: .[0].suite, count: length, avg_score: ([.[].score] | add / length)}'

# Tool usage frequency across all tests
agentv inspect show <result-file> --format json \
  | jq '[.[].trace.tool_calls // {} | to_entries[]] | group_by(.key) | .[] | {tool: .[0].key, total_calls: ([.[].value] | add)}'

# Find regressions > 0.1 between two runs
agentv compare baseline.jsonl candidate.jsonl --format json \
  | jq '.matched[] | select(.delta < -0.1) | {test_id: .testId, delta, from: .score1, to: .score2}'

Reasoning Patterns

When analyzing traces, think about:

Efficiency: Are tool calls/tokens proportional to task complexity? High tokens-per-tool may indicate verbose prompts or unnecessary context.
Error patterns: Do failures cluster by target, suite, or tool usage? Common patterns:
- Tool errors → agent can't access required resources
- High LLM calls with low tool calls → agent stuck in reasoning loop
- Missing tool calls → wrong tool routing
Cost optimization: Identify tests with high cost but acceptable scores — can they use a cheaper model? Compare --group-by target stats.
Latency distribution: P50 vs P99 spread indicates consistency. Large spread means unpredictable performance — investigate P99 outliers.
Regression detection: After a prompt/config change, compare before/after. Mean delta > 0 is good, but check individual test regressions — a few large losses can hide behind many small wins.
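
The last point is easy to see with a small worked example. The deltas below are invented, and the 0.1 loss threshold simply reuses the value from the compare examples; this is a sketch, not how agentv itself reports regressions.

```python
def hidden_regressions(deltas, loss_threshold=0.1):
    """Return the mean delta together with the tests that regressed beyond the threshold."""
    mean_delta = sum(deltas.values()) / len(deltas)
    regressed = {t: d for t, d in deltas.items() if d < -loss_threshold}
    return mean_delta, regressed

# Four small wins and one large loss (made-up per-test score deltas).
deltas = {"t1": 0.125, "t2": 0.125, "t3": 0.125, "t4": 0.125, "t5": -0.25}
mean_delta, regressed = hidden_regressions(deltas)
print(mean_delta > 0, regressed)
```

The mean delta is positive, yet t5 lost a quarter of a point, which is why the per-test deltas deserve a look even when the aggregate improves.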

Analyze AgentV evaluation traces and result JSONL files using `agentv inspect` and `agentv compare` CLI commands. Use when asked to inspect AgentV eval results, find regressions between AgentV evaluation runs, identify failure patterns in AgentV trace data, analyze tool trajectories, or compute cost/latency/score statistics from AgentV result files. Do NOT use for benchmarking skill trigger accuracy, analyzing skill-creator eval performance, or measuring skill description quality — those tasks belong to the skill-creator skill.

12 stars

tools

Updated Apr 17, 2026

npx skillsauth add entityprocess/agentv agentv-trace-analyst

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean (95%)
Semgrep: Static code analysis for vulnerabilities. Clean (95%)
mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%)
Snyk (dep): Open source security scanning. Skipped (50%)
Socket.dev: Supply chain security analysis. Skipped (50%)
VirusTotal: Multi-engine malware detection. Skipped (50%)
CrowdStrike: Advanced threat intelligence. Skipped (50%)
OSV-Scanner: Open Source Vulnerability database check. Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: Apr 17, 2026, 12:25 PM · 42.3s · 1 file scanned

Related Skills

entityprocess/agentv-governance (development)

Author, edit, and lint `governance:` blocks in `*.eval.yaml` files. Use when creating or updating evaluation suites that carry AI-governance metadata (OWASP LLM Top 10, OWASP Agentic Top 10, MITRE ATLAS, EU AI Act, ISO 42001). Also use non-interactively (e.g., from a GitHub Action) to lint changed eval files and report violations against the rules in `references/lint-rules.md`. Do NOT use for running evals or benchmarking; that belongs to agentv-bench.

entityprocess/agentv-eval-writer (development)

Write, edit, review, and validate AgentV EVAL.yaml / .eval.yaml evaluation files. Use when asked to create new eval files, update or fix existing ones, add or remove test cases, configure graders (`llm-grader`, `code-grader`, `rubrics`), review whether an eval is correct or complete, convert between EVAL.yaml and evals.json using `agentv convert`, or generate eval test cases from chat transcripts (markdown conversation or JSON messages). Do NOT use for creating SKILL.md files, writing skill definitions, or running evals; running and benchmarking belongs to agentv-bench.

entityprocess/agentv-eval-review (development)

Use when reviewing eval YAML files for quality issues, linting eval files before committing, checking eval schema compliance, or when asked to "review these evals", "check eval quality", "lint eval files", or "validate eval structure". Do NOT use for writing evals (use agentv-eval-writer) or running evals (use agentv-bench).

Install manually

# Clone the repo
git clone https://github.com/entityprocess/agentv.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into it
cp -r agentv/plugins/agentv-dev/skills/agentv-trace-analyst ~/.claude/skills/

Repository: entityprocess/agentv (12 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT