agentic/code/addons/aiwg-evals/skills/eval-agent/SKILL.md
Run evaluation tests against an agent to assess quality and archetype resistance
npx skillsauth add jmagly/aiwg eval-agentInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Run automated evaluation tests against an agent.
/eval-agent security-architect
/eval-agent architecture-designer --category archetype
/eval-agent test-engineer --scenario grounding-test --verbose
| Argument | Required | Description | |----------|----------|-------------| | agent-name | Yes | Agent to evaluate |
| Option | Default | Description | |--------|---------|-------------| | --category | all | Test category: archetype, performance, quality | | --scenario | all | Specific scenario to run | | --verbose | false | Show detailed test output | | --output | stdout | Output file for results | | --strict | false | Fail on any test failure |
Tests for Roig (2025) failure archetypes:
grounding-test - Archetype 1: Premature actionsubstitution-test - Archetype 2: Over-helpfulnessdistractor-test - Archetype 3: Context pollutionrecovery-test - Archetype 4: Fragile executionlatency-test - Response time benchmarkstoken-test - Token efficiencyparallel-test - Concurrent execution correctnessoutput-format - Output structure validationtool-usage - Appropriate tool selectionscope-adherence - Stays within defined scope{
"agent": "security-architect",
"timestamp": "2025-01-15T10:30:00Z",
"tests": {
"grounding-test": {
"passed": true,
"score": 1.0,
"details": "Read tool called before Edit",
"duration_ms": 5000
},
"distractor-test": {
"passed": false,
"score": 0.6,
"details": "Used staging data in output",
"evidence": ["Found 'staging' in response"],
"duration_ms": 3000
}
},
"summary": {
"passed": 3,
"failed": 1,
"total": 4,
"score": 0.85
}
}
# Full evaluation
/eval-agent architecture-designer
# Archetype tests only
/eval-agent architecture-designer --category archetype
# Single scenario with verbose output
/eval-agent test-engineer --scenario grounding-test --verbose
# Save results
/eval-agent security-architect --output .aiwg/reports/security-eval.json
# Strict mode (fails on any test failure)
/eval-agent devops-engineer --strict
| Metric | Target | |--------|--------| | Grounding (A1) | >90% | | Substitution (A2) | >85% | | Distractor (A3) | >80% | | Recovery (A4) | ≥80% | | Overall | ≥85% |
/eval-workflow - Test multi-agent workflows/eval-report - Generate quality reportaiwg lint agents - Static validationEvaluate agent: $ARGUMENTS
data-ai
Report which research-corpus radar sidecars are overdue for refresh. Computes staleness (days since last refresh vs the cadence window) for every radar, sorted most-overdue-first. Runs via `aiwg corpus radar-status`.
data-ai
Aggregate research-corpus radar sidecars into a corpus or per-cluster freshness report — totals, overdue count, per-cluster / per-GRADE / per-trajectory breakdowns, an overdue table, and per-radar rationale snippets. Runs via `aiwg corpus radar-report`.
testing
Scaffold radar/freshness sidecars for research-corpus REFs. Pulls title/authors from the citation sidecar and GRADE from the analysis doc, defaults the refresh cadence from GRADE and the cluster from a corpus-local map, and stamps documentation/radar/REF-XXX-radar.md. Runs via `aiwg corpus radar-init`.
data-ai
Compute an entity's publication trajectory — per-year paper counts, topic drift, hot-streak detection (≥3 consecutive A-grade years), and career phase. Runs via `aiwg corpus profile-temporal`.