Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

jmagly/eval-report

Name: eval-report
Author: jmagly

agentic/code/addons/aiwg-evals/skills/eval-report/SKILL.md

npx skillsauth add jmagly/aiwg eval-report

Evaluation Report

Generate a quality report from accumulated evaluation results.

Research Foundation

REF-001: BP-9 - Continuous evaluation of agent performance
REF-002: KAMI benchmark methodology for real agentic task evaluation

Usage

/eval-report
/eval-report --output .aiwg/reports/quality-report.md
/eval-report --compare previous-report.json
/eval-report --mode sdlc --format json

Options

| Option | Default | Description |
|--------|---------|-------------|
| --output | stdout | Output file path |
| --compare | none | Previous report to diff against |
| --mode | all | Agent category: sdlc, marketing, forensics, all |
| --format | markdown | Output format: markdown, json |
| --since | none | Only include results after this date (ISO 8601) |
| --threshold | 0.85 | Score below this triggers a warning |
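The `--since` option can be pictured as a date filter over the collected results. The following is a minimal sketch under stated assumptions, not the skill's actual code: the `timestamp` field and the helper names `parse_iso` and `filter_since` are hypothetical, and values without a time zone are assumed to be UTC.

```python
from datetime import datetime, timezone


def parse_iso(value):
    """Parse an ISO 8601 date or timestamp; naive values are treated as UTC (assumption)."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def filter_since(records, since=None):
    """Keep records strictly newer than `since`; None (the documented default) keeps all."""
    if since is None:
        return list(records)
    cutoff = parse_iso(since)
    # "timestamp" is a hypothetical field name; the real eval-*.json schema is not documented here.
    return [r for r in records if parse_iso(r["timestamp"]) > cutoff]
```

Treating the cutoff as exclusive follows the wording "after this date"; an inclusive comparison would be an equally plausible reading.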

Process

1. Collect Results: Read all eval-*.json files from .aiwg/reports/
2. Aggregate Scores: Compute per-agent and per-archetype scores
3. Detect Regressions: Compare against --compare baseline if provided
4. Rank Agents: Sort by overall score, flag below-threshold agents
5. Build Recommendations: Surface specific agents and archetypes needing attention
6. Output Report: Write markdown or JSON to --output or stdout
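The collect, aggregate, and rank steps above can be sketched in Python. This is an illustration only, not the skill's implementation: the record fields `agent`, `archetype`, and `score`, the one-JSON-list-per-file layout, the lowest-first sort order, and all function names are assumptions.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean


def collect_results(reports_dir=".aiwg/reports"):
    """Load every eval-*.json file; each is assumed to hold one JSON list of records."""
    records = []
    for path in sorted(Path(reports_dir).glob("eval-*.json")):
        records.extend(json.loads(path.read_text(encoding="utf-8")))
    return records


def aggregate(records, key):
    """Average the assumed 'score' field, grouped by `key` ('agent' or 'archetype')."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record[key]].append(record["score"])
    return {name: round(mean(scores), 4) for name, scores in buckets.items()}


def rank(agent_scores, threshold=0.85):
    """Sort agents lowest score first and flag those under the threshold."""
    ordered = sorted(agent_scores.items(), key=lambda item: item[1])
    return [(agent, score, score < threshold) for agent, score in ordered]
```

Calling `rank(aggregate(collect_results(), "agent"))` would chain the three steps. The default of 0.85 mirrors the documented `--threshold` default.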

Report Sections

Summary Dashboard

Overall health at a glance — total agents tested, aggregate score, regression count.

By Archetype

Pass rates per Roig (2025) failure archetype across all agents.

Agents Needing Attention

Agents below the --threshold, with consecutive-failure streaks flagged.
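One way to derive this section is to count the failures at the end of each agent's run history and keep only agents under the threshold. The sketch below assumes a hypothetical input shape (a `score` plus a chronological `runs` list of pass/fail booleans per agent); the output keys are borrowed from the JSON output format documented for this skill.

```python
def failure_streak(outcomes):
    """Count consecutive failures at the end of a chronological pass/fail list."""
    streak = 0
    for passed in reversed(outcomes):
        if passed:
            break
        streak += 1
    return streak


def needs_attention(agents, threshold=0.85):
    """Return below-threshold agents, worst first, with their current failure streak."""
    rows = []
    for name, info in agents.items():
        # "score" and "runs" are assumed field names, not a documented schema.
        if info["score"] < threshold:
            rows.append({
                "agent": name,
                "score": info["score"],
                "consecutive_failures": failure_streak(info["runs"]),
            })
    return sorted(rows, key=lambda row: row["score"])
```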

Regression Analysis

When --compare is provided: agents whose scores dropped since the baseline.
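The comparison can be sketched as a per-agent score diff. This assumes both the current run and the baseline can be reduced to a mapping from agent name to score. That shape is hypothetical, as are the `find_regressions` name and the `tolerance` parameter.

```python
def find_regressions(current, baseline, tolerance=0.0):
    """List agents whose score dropped by more than `tolerance` since the baseline."""
    regressions = []
    for agent, old_score in baseline.items():
        new_score = current.get(agent)
        # Agents missing from the current run are skipped rather than counted as regressions.
        if new_score is not None and old_score - new_score > tolerance:
            regressions.append({
                "agent": agent,
                "baseline": old_score,
                "current": new_score,
                "delta": round(new_score - old_score, 4),
            })
    # Largest drop first.
    return sorted(regressions, key=lambda row: row["delta"])
```

A nonzero `tolerance` would suppress noise from small run-to-run score fluctuations.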

Recommendations

Prioritized action list: which agents to review, which archetypes to harden.

Output Format (Markdown)

# Agent Quality Report

**Generated**: 2026-04-01T10:30:00Z
**Agents Tested**: 58
**Overall Score**: 87%
**Regressions**: 2

## By Archetype

| Archetype | Pass Rate | Trend |
|-----------|-----------|-------|
| #1 Grounding | 92% | ↑ |
| #2 Substitution | 88% | → |
| #3 Distractor | 78% | ↓ |
| #4 Recovery | 90% | ↑ |

## Agents Needing Attention

| Agent | Score | Consecutive Failures | Issue |
|-------|-------|---------------------|-------|
| data-analyst | 72% | 3 | distractor-test |
| api-designer | 79% | 1 | latency regression (+40%) |

## Recommendations

1. Review `data-analyst` context filtering — failed distractor-test 3 consecutive runs
2. Investigate `api-designer` tool selection — latency regression
3. Increase distractor-test scenarios for marketing agents (78% pass rate below 80% target)

Output Format (JSON)

{
  "generated": "2026-04-01T10:30:00Z",
  "summary": {
    "agents_tested": 58,
    "overall_score": 0.87,
    "regressions": 2
  },
  "by_archetype": {
    "grounding": 0.92,
    "substitution": 0.88,
    "distractor": 0.78,
    "recovery": 0.90
  },
  "agents_needing_attention": [
    {"agent": "data-analyst", "score": 0.72, "consecutive_failures": 3, "issue": "distractor-test"}
  ],
  "recommendations": [
    "Review data-analyst context filtering"
  ]
}
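For CI consumption, a small gate script could read this JSON and turn it into an exit code. The sketch below uses only the `summary` fields shown above; the pass/fail policy (fail on a low overall score or on any regression) and the function name `ci_exit_code` are assumptions, not documented behavior of the skill.

```python
import json


def ci_exit_code(report_text, threshold=0.80):
    """Return 1 if the report should fail the build, else 0 (assumed policy)."""
    report = json.loads(report_text)
    summary = report["summary"]
    failed = summary["overall_score"] < threshold or summary["regressions"] > 0
    return 1 if failed else 0
```

A wrapper could pass the contents of a file written with `--format json --output` to this function and hand the result to `sys.exit`.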

Examples

# Standard report to stdout
/eval-report

# Save to file
/eval-report --output .aiwg/reports/quality-$(date +%Y%m%d).md

# Compare against baseline
/eval-report --compare .aiwg/reports/quality-20260301.json

# JSON for CI consumption
/eval-report --format json --threshold 0.80

# SDLC agents only
/eval-report --mode sdlc

Related Commands

/eval-agent - Test individual agents
/eval-workflow - Test multi-agent workflows
aiwg lint agents - Static validation

Generate evaluation report: $ARGUMENTS

References

@$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon overview
@$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete threshold and scoring requirements
@$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC framework context for agent evaluation scope
@$AIWG_ROOT/docs/cli-reference.md — CLI reference for evaluation-related commands

Generate an aggregate agent quality report from evaluation results, showing scores, regressions, and recommendations

127 stars

testing

Updated May 2, 2026

Install globally

npx skillsauth add jmagly/aiwg eval-report

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percent shown |
|---------|-------------|--------|---------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: May 2, 2026, 6:53 AM (148.6s, 1 file scanned)

SKILL.md

namespace: aiwg
name: eval-report
platforms: [all]
description: Generate an aggregate agent quality report from evaluation results, showing scores, regressions, and recommendations

Related Skills

jmagly/radar-status

data-ai · Verified, Trusted, Community

Report which research-corpus radar sidecars are overdue for refresh. Computes staleness (days since last refresh vs the cadence window) for every radar, sorted most-overdue-first. Runs via `aiwg corpus radar-status`.

140 · SKILL.md · Updated May 28, 2026

jmagly/radar-report

data-ai · Verified, Trusted, Community

Aggregate research-corpus radar sidecars into a corpus or per-cluster freshness report: totals, overdue count, per-cluster / per-GRADE / per-trajectory breakdowns, an overdue table, and per-radar rationale snippets. Runs via `aiwg corpus radar-report`.

140 · SKILL.md · Updated May 28, 2026

jmagly/radar-init

testing · Verified, Trusted, Community

Scaffold radar/freshness sidecars for research-corpus REFs. Pulls title/authors from the citation sidecar and GRADE from the analysis doc, defaults the refresh cadence from GRADE and the cluster from a corpus-local map, and stamps documentation/radar/REF-XXX-radar.md. Runs via `aiwg corpus radar-init`.

140 · SKILL.md · Updated May 28, 2026

jmagly/profile-temporal

data-ai · Verified, Trusted, Community

Compute an entity's publication trajectory: per-year paper counts, topic drift, hot-streak detection (≥3 consecutive A-grade years), and career phase. Runs via `aiwg corpus profile-temporal`.

140 · SKILL.md · Updated May 28, 2026

Install manually

# Clone the repo
git clone https://github.com/jmagly/aiwg.git

# Copy into the global Claude Code skills folder, creating it first if it does not exist
mkdir -p ~/.claude/skills
cp -r aiwg/agentic/code/addons/aiwg-evals/skills/eval-report ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository

jmagly/aiwg

127 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT