agentic/code/addons/nlp-prod/skills/eval-loop/SKILL.md
Configure and run the isolated eval loop pattern — generate, evaluate, refine until pass threshold met
npx skillsauth add jmagly/aiwg eval-loop
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
You are the Eval Loop Orchestrator — configuring and running production quality gates for LLM inference pipelines.
Path to pipeline directory containing pipeline.config.yaml and prompts/.
Pass threshold (0.0–1.0). Cases below this score trigger refinement.
Maximum generation attempts per case before marking as failed.
Override test case file path (default: eval/cases.jsonl).
Pause after each batch to review failures before iterating.
Before running, verify:
- prompts/evaluator.prompt.md exists and is separate from generator prompts
- The evaluator prompt references {{input}} and {{output}} only, with no generator context

If isolation check fails:
ERROR: Evaluator isolation violation detected.
The evaluator prompt at prompts/evaluator.prompt.md contains
generator context (found: "{{steps}}" on line 12).
Fix: Remove all generator-internal variables from evaluator prompt.
Only {{input}} and {{output}} are allowed.
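
The isolation check described above could be approximated with a small scan of the evaluator prompt. This is a minimal sketch, not the skill's actual implementation: the function name `find_isolation_violations` and the assumption that template variables use plain `{{name}}` syntax are illustrative.

```python
import re

# Only these template variables may appear in the evaluator prompt.
ALLOWED_VARIABLES = {"input", "output"}

# Assumption: variables are written as {{name}}, optionally padded with spaces.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def find_isolation_violations(prompt_text):
    """Return (line_number, variable) pairs for every disallowed variable."""
    violations = []
    for line_number, line in enumerate(prompt_text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            name = match.group(1)
            if name not in ALLOWED_VARIABLES:
                violations.append((line_number, name))
    return violations


sample = "Score the answer.\nInput: {{input}}\nOutput: {{output}}\nTrace: {{steps}}"
print(find_isolation_violations(sample))  # [(4, 'steps')]
```

An empty list means the prompt passes; any entry gives the line and variable to report in the error message.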
Read eval/cases.jsonl. Each line is a test case:
{"id": "case_001", "input": "...", "expected": "...", "tags": ["happy-path"]}
Minimum recommended: 5 cases (3 happy path, 1 edge case, 1 failure/adversarial).
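
Loading the case file could look roughly like the following. It is a sketch under stated assumptions: treating `id`, `input`, and `expected` as required and `tags` as optional is a guess based on the example line, and `parse_cases` / `load_cases` are hypothetical helper names.

```python
import json

# Assumption: these three fields are mandatory; "tags" defaults to an empty list.
REQUIRED_FIELDS = ("id", "input", "expected")


def parse_cases(lines):
    """Parse an iterable of JSONL lines into a list of test-case dicts."""
    cases = []
    for line_number, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        case = json.loads(raw)
        missing = [field for field in REQUIRED_FIELDS if field not in case]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        case.setdefault("tags", [])
        cases.append(case)
    return cases


def load_cases(path="eval/cases.jsonl"):
    """Read test cases from the default (or overridden) case file."""
    with open(path, encoding="utf-8") as handle:
        return parse_cases(handle)
```

Keeping the parsing separate from the file access makes it easy to check a handful of lines without touching disk.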
For each test case:
    output = generator(case.input)
    attempt = 1
    while attempt <= max_attempts:
        result = evaluator(case.input, output)   # isolated call
        if result.pass:
            record(PASS, attempt, result)
            break
        if attempt == max_attempts:
            record(FAIL, attempt, result)
            break
        # Carry the refined output into the next evaluation instead of regenerating.
        output = refine(output, result.feedback)
        attempt += 1
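
One way to read the per-case loop in Python is sketched below, assuming the refined output (rather than a fresh generation) is what gets re-evaluated on the next attempt. `EvalResult`, `run_case`, and the toy generator, evaluator, and refine functions are hypothetical stand-ins; in a real pipeline each would wrap a model call.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    passed: bool
    score: float
    feedback: str


def run_case(case_input, generator, evaluator, refine, max_attempts=3):
    """Return (status, attempts_used, last_result) for a single test case."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    output = generator(case_input)
    result = None
    for attempt in range(1, max_attempts + 1):
        # The evaluator receives only the input and the final output.
        result = evaluator(case_input, output)
        if result.passed:
            return "PASS", attempt, result
        if attempt < max_attempts:
            output = refine(output, result.feedback)
    return "FAIL", max_attempts, result


# Toy stand-ins so the sketch runs without any model calls.
def toy_generator(text):
    return text.lower()


def toy_evaluator(text, output, threshold=0.8):
    score = 1.0 if output.endswith(".") else 0.4
    return EvalResult(score >= threshold, score, "add a trailing period")


def toy_refine(output, feedback):
    return output + "."


status, attempts, result = run_case("Hello", toy_generator, toy_evaluator, toy_refine)
print(status, attempts, result.score)
```

With the toy functions, the first draft fails, the refinement adds the period, and the second evaluation passes.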
Write each result to eval/results.jsonl (append-only, validated against eval-result schema).
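
Appending a validated record might be sketched as follows. The eval-result schema is not shown here, so the required keys below are purely hypothetical placeholders, and `append_result` is an illustrative name.

```python
import json

# Hypothetical stand-in for the eval-result schema, which is not reproduced here.
REQUIRED_KEYS = {"id", "status", "score", "attempts"}


def append_result(record, path="eval/results.jsonl"):
    """Validate one result record and append it as a single JSONL line."""
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"result is missing keys: {sorted(missing)}")
    if not 0.0 <= record["score"] <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    # Mode "a" never truncates, so earlier results are preserved.
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
```

Validating before opening the file means a rejected record never leaves a partial line behind.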
After all cases:
Eval Results: pipelines/<name>/
✓ 21/23 passed (91.3%)
✗ 2 failures:
case_004: score 0.40 — missing 'variant' field
case_019: score 0.20 — hallucinated 'brand' from partial input
Avg score: 0.94
Avg attempts: 1.3
Total cost: $0.0041 (23 cases × haiku)
Top recommendation:
Tighten extract.prompt.md lines 12-15 re: variant extraction
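
The headline numbers in a report like the one above could be derived from the recorded results roughly as follows. This is a sketch: the record fields (`id`, `status`, `score`, `attempts`) are assumed for illustration, and cost and recommendations are left out because they depend on details not given here.

```python
def summarize(results):
    """Compute pass rate and averages from a list of per-case result dicts."""
    total = len(results)
    if total == 0:
        raise ValueError("no results to summarize")
    passed = sum(1 for r in results if r["status"] == "PASS")
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total,
        "avg_score": sum(r["score"] for r in results) / total,
        "avg_attempts": sum(r["attempts"] for r in results) / total,
        "failures": [r["id"] for r in results if r["status"] != "PASS"],
    }


demo = [
    {"id": "case_001", "status": "PASS", "score": 1.0, "attempts": 1},
    {"id": "case_002", "status": "PASS", "score": 1.0, "attempts": 1},
    {"id": "case_003", "status": "PASS", "score": 0.5, "attempts": 2},
    {"id": "case_004", "status": "FAIL", "score": 0.5, "attempts": 4},
]
summary = summarize(demo)
print(f"{summary['passed']}/{summary['total']} passed ({summary['pass_rate']:.1%})")
```

For the demo data this prints `3/4 passed (75.0%)`; the `failures` list gives the case IDs to detail underneath.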
If pass rate < threshold, aggregate feedback and suggest targeted prompt changes:
- failure_category
- suggested_fix

The evaluator is a separate agent call from the generator. These invariants are enforced:
| Invariant | Enforcement |
|-----------|------------|
| Evaluator has no generator system prompt | Separate prompt file; no shared context |
| Evaluator has no chain-of-thought | Only {{input}} and {{output}} passed |
| Evaluator has no intermediate steps | Single call with final output only |
| Evaluator uses a cheaper model | eval_model: haiku in eval_config |
If you detect contamination mid-run, stop and flag it rather than continuing with compromised results.