agentic/code/addons/aiwg-evals/skills/eval-report/SKILL.md
Generate an aggregate agent quality report from evaluation results, showing scores, regressions, and recommendations
npx skillsauth add jmagly/aiwg eval-reportInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Generate a quality report from accumulated evaluation results.
/eval-report
/eval-report --output .aiwg/reports/quality-report.md
/eval-report --compare previous-report.json
/eval-report --mode sdlc --format json
| Option | Default | Description | |--------|---------|-------------| | --output | stdout | Output file path | | --compare | none | Previous report to diff against | | --mode | all | Agent category: sdlc, marketing, forensics, all | | --format | markdown | Output format: markdown, json | | --since | none | Only include results after this date (ISO 8601) | | --threshold | 0.85 | Score below this triggers a warning |
eval-*.json files from .aiwg/reports/Overall health at a glance — total agents tested, aggregate score, regression count.
Pass rates per Roig (2025) failure archetype across all agents.
Agents below the --threshold, with consecutive-failure streaks flagged.
When --compare is provided: agents whose scores dropped since the baseline.
Prioritized action list: which agents to review, which archetypes to harden.
# Agent Quality Report
**Generated**: 2026-04-01T10:30:00Z
**Agents Tested**: 58
**Overall Score**: 87%
**Regressions**: 2
## By Archetype
| Archetype | Pass Rate | Trend |
|-----------|-----------|-------|
| #1 Grounding | 92% | ↑ |
| #2 Substitution | 88% | → |
| #3 Distractor | 78% | ↓ |
| #4 Recovery | 90% | ↑ |
## Agents Needing Attention
| Agent | Score | Consecutive Failures | Issue |
|-------|-------|---------------------|-------|
| data-analyst | 72% | 3 | distractor-test |
| api-designer | 79% | 1 | latency regression (+40%) |
## Recommendations
1. Review `data-analyst` context filtering — failed distractor-test 3 consecutive runs
2. Investigate `api-designer` tool selection — latency regression
3. Increase distractor-test scenarios for marketing agents (78% pass rate below 80% target)
{
"generated": "2026-04-01T10:30:00Z",
"summary": {
"agents_tested": 58,
"overall_score": 0.87,
"regressions": 2
},
"by_archetype": {
"grounding": 0.92,
"substitution": 0.88,
"distractor": 0.78,
"recovery": 0.90
},
"agents_needing_attention": [
{"agent": "data-analyst", "score": 0.72, "consecutive_failures": 3, "issue": "distractor-test"}
],
"recommendations": [
"Review data-analyst context filtering"
]
}
# Standard report to stdout
/eval-report
# Save to file
/eval-report --output .aiwg/reports/quality-$(date +%Y%m%d).md
# Compare against baseline
/eval-report --compare .aiwg/reports/quality-20260301.json
# JSON for CI consumption
/eval-report --format json --threshold 0.80
# SDLC agents only
/eval-report --mode sdlc
/eval-agent - Test individual agents/eval-workflow - Test multi-agent workflowsaiwg lint agents - Static validationGenerate evaluation report: $ARGUMENTS
data-ai
Report which research-corpus radar sidecars are overdue for refresh. Computes staleness (days since last refresh vs the cadence window) for every radar, sorted most-overdue-first. Runs via `aiwg corpus radar-status`.
data-ai
Aggregate research-corpus radar sidecars into a corpus or per-cluster freshness report — totals, overdue count, per-cluster / per-GRADE / per-trajectory breakdowns, an overdue table, and per-radar rationale snippets. Runs via `aiwg corpus radar-report`.
testing
Scaffold radar/freshness sidecars for research-corpus REFs. Pulls title/authors from the citation sidecar and GRADE from the analysis doc, defaults the refresh cadence from GRADE and the cluster from a corpus-local map, and stamps documentation/radar/REF-XXX-radar.md. Runs via `aiwg corpus radar-init`.
data-ai
Compute an entity's publication trajectory — per-year paper counts, topic drift, hot-streak detection (≥3 consecutive A-grade years), and career phase. Runs via `aiwg corpus profile-temporal`.