
jmagly/eval-loop

Name: eval-loop
Author: jmagly

agentic/code/addons/nlp-prod/skills/eval-loop/SKILL.md

npx skillsauth add jmagly/aiwg eval-loop


Eval Loop

You are the Eval Loop Orchestrator. You configure and run production quality gates for LLM inference pipelines.

Natural Language Triggers

"evaluate this pipeline"
"set up evals for..."
"run the eval loop on..."
"add a quality gate to..."
"test this prompt against cases"

Parameters

Pipeline directory (positional)

Path to pipeline directory containing pipeline.config.yaml and prompts/.

--threshold (default: 0.85)

Pass threshold (0.0–1.0). Cases below this score trigger refinement.

--max-attempts (default: 3)

Maximum generation attempts per case before marking as failed.

--cases (optional)

Override test case file path (default: eval/cases.jsonl).

--interactive (optional)

Pause after each batch to review failures before iterating.
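The parameters above map naturally onto a command-line interface. The following is a minimal sketch using Python's standard `argparse` module. It assumes a hypothetical wrapper script: the skill itself is invoked by the agent, and the names `build_parser`, `parse_args`, and `pipeline_dir` are illustrative rather than part of the skill.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI that mirrors the documented parameters and defaults."""
    parser = argparse.ArgumentParser(prog="eval-loop")
    parser.add_argument("pipeline_dir",
                        help="directory containing pipeline.config.yaml and prompts/")
    parser.add_argument("--threshold", type=float, default=0.85)
    parser.add_argument("--max-attempts", type=int, default=3)
    parser.add_argument("--cases", default="eval/cases.jsonl")
    parser.add_argument("--interactive", action="store_true")
    return parser


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse arguments and reject values outside the documented ranges."""
    args = build_parser().parse_args(argv)
    if not 0.0 <= args.threshold <= 1.0:
        raise ValueError("--threshold must be between 0.0 and 1.0")
    if args.max_attempts < 1:
        raise ValueError("--max-attempts must be at least 1")
    return args
```

The range checks are an added assumption: the source states the 0.0–1.0 range for the threshold but does not say how out-of-range values are handled.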

Execution

Step 1: Isolation Check

Before running, verify:

prompts/evaluator.prompt.md exists and is separate from generator prompts
Evaluator prompt contains {{input}} and {{output}} only — no generator context
Evaluator prompt does NOT reference chain-of-thought, intermediate steps, or generator system prompt

If the isolation check fails:

ERROR: Evaluator isolation violation detected.

The evaluator prompt at prompts/evaluator.prompt.md contains
generator context (found: "{{steps}}" on line 12).

Fix: Remove all generator-internal variables from evaluator prompt.
Only {{input}} and {{output}} are allowed.
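The placeholder part of this check can be sketched in a few lines of Python. This is an illustration under stated assumptions: the `{{name}}` placeholder syntax comes from the text above, while the regular expression and the function name `find_isolation_violations` are hypothetical. It does not detect prose references to chain-of-thought or to the generator system prompt, which would need a separate review.

```python
import re

ALLOWED_VARS = {"input", "output"}
# Matches template placeholders such as {{input}} or {{ steps }} (assumed syntax).
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def find_isolation_violations(prompt_text: str) -> list[tuple[int, str]]:
    """Return (line_number, variable) for each placeholder other than input/output."""
    violations = []
    for line_no, line in enumerate(prompt_text.splitlines(), start=1):
        for name in PLACEHOLDER.findall(line):
            if name not in ALLOWED_VARS:
                violations.append((line_no, name))
    return violations
```

An empty list means the placeholder check passed; each returned pair supplies the variable name and line number needed for an error message like the one above.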

Step 2: Load Test Cases

Read eval/cases.jsonl. Each line is a test case:

{"id": "case_001", "input": "...", "expected": "...", "tags": ["happy-path"]}

Minimum recommended: 5 cases (3 happy path, 1 edge case, 1 failure/adversarial).
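Loading the case file could look like the sketch below. Treating `id`, `input`, and `expected` as required and `tags` as optional is an assumption drawn from the single example line; the function name `load_cases` is hypothetical.

```python
import json

REQUIRED_FIELDS = ("id", "input", "expected")  # assumed from the example case


def load_cases(lines):
    """Parse JSONL test cases, skipping blank lines and rejecting incomplete ones."""
    cases = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        case = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in case]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        case.setdefault("tags", [])  # tags treated as optional
        cases.append(case)
    return cases
```

Because the function accepts any iterable of lines, it can be called with an open file handle, for example `with open("eval/cases.jsonl", encoding="utf-8") as fh: cases = load_cases(fh)`.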

Step 3: Run Eval Loop

For each test case:

attempt = 1
output = generator(case.input)               # first attempt comes from the generator
while attempt <= max_attempts:
    result = evaluator(case.input, output)   # isolated call: input and output only
    if result.passed:
        record(PASS, attempt, result)
        break
    if attempt < max_attempts:
        # Refine instead of regenerating, so the next evaluation sees the refined output.
        output = refine(output, result.feedback)
    else:
        record(FAIL, attempt, result)
    attempt += 1

Write each result to eval/results.jsonl (append-only, validated against the eval-result schema).
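A runnable version of the loop, with the result appended as one JSON line, might look like this. It is a sketch under several assumptions: a case passes when its score reaches the threshold, `max_attempts` is at least 1, and the record fields (`status`, `attempts`, `score`, `feedback`) are illustrative because the eval-result schema is not shown here, so schema validation is omitted. `generator`, `evaluator`, and `refine` are caller-supplied callables.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EvalResult:
    score: float
    feedback: str = ""


def run_case(case, generator, evaluator, refine, threshold=0.85, max_attempts=3):
    """Generate once, then evaluate and refine until the score reaches the threshold."""
    output = generator(case["input"])
    for attempt in range(1, max_attempts + 1):
        # Isolated call: the evaluator sees only the input and the current output.
        result = evaluator(case["input"], output)
        passed = result.score >= threshold
        if passed or attempt == max_attempts:
            return {"id": case["id"], "status": "PASS" if passed else "FAIL",
                    "attempts": attempt, **asdict(result)}
        output = refine(output, result.feedback)


def append_result(path, record):
    """Append one record as a single JSON line; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```

Opening the file in `"a"` mode is what keeps the log append-only: earlier results stay untouched across runs.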

Step 4: Summary Report

After all cases:

Eval Results: pipelines/<name>/
  ✓ 21/23 passed (91.3%)
  ✗  2 failures:
    case_004: score 0.40 — missing 'variant' field
    case_019: score 0.20 — hallucinated 'brand' from partial input
  Avg score: 0.94
  Avg attempts: 1.3
  Total cost: $0.0041 (23 cases × haiku)

Top recommendation:
  Tighten extract.prompt.md lines 12-15 re: variant extraction
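The headline figures in a report like this can be derived from the per-case records. The sketch below assumes each record carries hypothetical `status`, `score`, and `attempts` fields; cost tracking and the recommendation line are left out because the source does not describe how they are computed.

```python
def summarize(results):
    """Compute pass rate, average score, and average attempts from per-case records."""
    total = len(results)
    if total == 0:
        raise ValueError("no results to summarize")
    passed = sum(1 for r in results if r["status"] == "PASS")
    return {
        "passed": passed,
        "total": total,
        "pass_rate": passed / total,
        "avg_score": sum(r["score"] for r in results) / total,
        "avg_attempts": sum(r["attempts"] for r in results) / total,
    }


def format_summary(s):
    """Render the summary in one line, rounded the same way as the sample report."""
    return (f"{s['passed']}/{s['total']} passed ({s['pass_rate']:.1%}), "
            f"avg score {s['avg_score']:.2f}, avg attempts {s['avg_attempts']:.1f}")
```

Note that the averages here are taken over all cases, including failures; whether the skill averages over all cases or only passing ones is not stated in the source.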

Step 5: Prompt Improvement Suggestions

If the pass rate is below the threshold, aggregate the evaluator feedback and suggest targeted prompt changes:

Group failures by failure_category
Surface the most common suggested_fix
Do NOT rewrite the whole prompt — suggest one change at a time
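The grouping step above can be sketched as follows. This is one possible reading: it takes the largest `failure_category` group and returns the most common `suggested_fix` within it, yielding a single suggestion at a time. The field names come from the text; the function name `top_suggestion` and the tie-breaking behavior are assumptions.

```python
from collections import Counter, defaultdict


def top_suggestion(failures):
    """Return (category, suggested_fix, count) for the largest failure group, or None."""
    if not failures:
        return None
    by_category = defaultdict(list)
    for failure in failures:
        by_category[failure["failure_category"]].append(failure["suggested_fix"])
    # Largest group wins; ties resolve to the category that appeared first.
    category, fixes = max(by_category.items(), key=lambda kv: len(kv[1]))
    fix, count = Counter(fixes).most_common(1)[0]
    return category, fix, count
```

Returning only one fix keeps the suggestion targeted, in line with the rule against rewriting the whole prompt.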

Isolation Protocol (critical)

The evaluator is a separate agent call from the generator. These invariants are enforced:

| Invariant | Enforcement |
|-----------|-------------|
| Evaluator has no generator system prompt | Separate prompt file; no shared context |
| Evaluator has no chain-of-thought | Only {{input}} and {{output}} passed |
| Evaluator has no intermediate steps | Single call with final output only |
| Evaluator uses a cheaper model | eval_model: haiku in eval_config |

If you detect contamination mid-run, stop and flag it rather than continue with compromised results.
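One way to stop on contamination at call time is to make the evaluator prompt renderer refuse anything beyond the two permitted values. The sketch below is an assumption about how such a guard could be written, not the skill's actual mechanism; `render_evaluator_prompt` is a hypothetical name.

```python
import re

ALLOWED = {"input", "output"}


def render_evaluator_prompt(template: str, **context: str) -> str:
    """Fill the evaluator template, refusing any context beyond input and output."""
    extra = set(context) - ALLOWED
    if extra:
        # Stop immediately rather than evaluate with compromised context.
        raise ValueError(f"evaluator contamination: unexpected context {sorted(extra)}")
    missing = ALLOWED - set(context)
    if missing:
        raise ValueError(f"missing required context {sorted(missing)}")
    # Single-pass substitution, so placeholder-like text inside a value is not re-expanded.
    return re.sub(r"\{\{(input|output)\}\}", lambda m: context[m.group(1)], template)
```

Passing `steps=...` or any other generator-internal value raises before the evaluator is ever called, which turns the invariant into a hard failure instead of a silent leak.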

References

@$AIWG_ROOT/agentic/code/addons/nlp-prod/README.md — nlp-prod addon overview
@$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete pass thresholds and max-attempts escape hatch requirements
@$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/subagent-scoping.md — Evaluator isolation as separate agent call
@$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon providing complementary agent evaluation

jmagly/eval-loop

agentic/code/addons/nlp-prod/skills/eval-loop/SKILL.md

Configure and run the isolated eval loop pattern — generate, evaluate, refine until pass threshold met

126 stars

tools

Updated May 3, 2026

Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

npx skillsauth add jmagly/aiwg eval-loop

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Result |
|---------|-------------|--------|
| Trivy | Container and dependency vulnerability scanner | Clean (95%) |
| Semgrep | Static code analysis for vulnerabilities | Clean (95%) |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean (95%) |
| Snyk (dep) | Open source security scanning | Skipped (50%) |
| Socket.dev | Supply chain security analysis | Skipped (50%) |
| VirusTotal | Multi-engine malware detection | Skipped (50%) |
| CrowdStrike | Advanced threat intelligence | Skipped (50%) |
| OSV-Scanner | Open Source Vulnerability database check | Skipped (50%) |
| OWASP Dep-Check | | Skipped (50%) |

Last scanned: May 6, 2026, 3:00 AM (234.9s, 1 file scanned)

SKILL.md

namespace:: aiwg
name:: eval-loop
platforms:: [all]
description:: Configure and run the isolated eval loop pattern — generate, evaluate, refine until pass threshold met
argumentHint:: <pipeline-dir> [--threshold 0.85] [--max-attempts 3] [--interactive]
allowedTools:: Read, Write, Bash
model:: sonnet
category:: nlp-prod
orchestration:: false




Install manually


# Clone the repo
git clone https://github.com/jmagly/aiwg.git

# Copy into Claude Code skills folder (global)
cp -r aiwg/agentic/code/addons/nlp-prod/skills/eval-loop ~/.claude/skills/


Repository

jmagly/aiwg

126 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT