agentic/code/addons/aiwg-evals/skills/eval-workflow/SKILL.md
Run evaluation tests against a multi-agent workflow to assess orchestration quality and failure archetype resistance
npx skillsauth add jmagly/aiwg eval-workflowInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Run automated evaluation tests against a multi-agent workflow.
/eval-workflow flow-security-review-cycle
/eval-workflow flow-inception-to-elaboration --scenario distractor-test
/eval-workflow flow-deploy-to-production --verbose --strict
| Argument | Required | Description | |----------|----------|-------------| | workflow-name | Yes | Workflow (flow command) to evaluate |
| Option | Default | Description | |--------|---------|-------------| | --scenario | all | Specific scenario to run | | --verbose | false | Show detailed test output | | --output | stdout | Output file for results | | --strict | false | Fail on any test failure | | --timeout | 300 | Maximum seconds per scenario |
grounding-test — Archetype 1: Premature action without reading statedistractor-test — Archetype 3: Context pollution from irrelevant artifactsrecovery-test — Archetype 4: Fragile execution when subagent fails.aiwg/ paths.aiwg/working/ test space{
"workflow": "flow-security-review-cycle",
"timestamp": "2026-04-01T10:30:00Z",
"scenarios": {
"grounding-test": {
"passed": true,
"score": 1.0,
"assertions": [
{"name": "threat-model-created", "passed": true},
{"name": "security-gate-run", "passed": true}
],
"duration_ms": 45000
},
"distractor-test": {
"passed": false,
"score": 0.7,
"assertions": [
{"name": "correct-assets-only", "passed": false, "evidence": "Distractor file referenced in output"}
],
"duration_ms": 38000
}
},
"summary": {
"passed": 4,
"failed": 1,
"total": 5,
"score": 0.80
}
}
# Full evaluation of a workflow
/eval-workflow flow-security-review-cycle
# Single scenario with verbose output
/eval-workflow flow-inception-to-elaboration --scenario grounding-test --verbose
# Strict mode with output saved
/eval-workflow flow-deploy-to-production --strict --output .aiwg/reports/deploy-eval.json
| Metric | Target | |--------|--------| | Artifact creation | 100% | | Grounding compliance | >90% | | Distractor resistance | >80% | | Recovery success | ≥80% | | Overall | ≥85% |
/eval-agent - Test individual agents/eval-report - Generate aggregate quality reportaiwg lint agents - Static validationEvaluate workflow: $ARGUMENTS
data-ai
Report which research-corpus radar sidecars are overdue for refresh. Computes staleness (days since last refresh vs the cadence window) for every radar, sorted most-overdue-first. Runs via `aiwg corpus radar-status`.
data-ai
Aggregate research-corpus radar sidecars into a corpus or per-cluster freshness report — totals, overdue count, per-cluster / per-GRADE / per-trajectory breakdowns, an overdue table, and per-radar rationale snippets. Runs via `aiwg corpus radar-report`.
testing
Scaffold radar/freshness sidecars for research-corpus REFs. Pulls title/authors from the citation sidecar and GRADE from the analysis doc, defaults the refresh cadence from GRADE and the cluster from a corpus-local map, and stamps documentation/radar/REF-XXX-radar.md. Runs via `aiwg corpus radar-init`.
data-ai
Compute an entity's publication trajectory — per-year paper counts, topic drift, hot-streak detection (≥3 consecutive A-grade years), and career phase. Runs via `aiwg corpus profile-temporal`.