
maragudk/llm-as-a-judge

Name: llm-as-a-judge
Author: maragudk

skills/llm-as-a-judge/SKILL.md

npx skillsauth add maragudk/evals-skills llm-as-a-judge

LLM-as-a-Judge

Build reliable automated evaluators that use an LLM to judge the outputs of another LLM pipeline. Each judge targets a single, binary (Pass/Fail) failure mode identified during error analysis.

When to Use LLM-as-Judge vs. Code

Choose the right evaluator type for each failure mode:

Use code-based evaluators when the failure is objective and deterministic:

- JSON/SQL syntax validity, regex/string matching, structural constraints, execution errors, logical checks.
- These are fast, cheap, deterministic, and interpretable.
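
As a minimal sketch of a code-based evaluator, the check below tests JSON syntax validity with Python's standard library only. The function name `is_valid_json` and the convention of returning the strings "Pass" and "Fail" are illustrative assumptions, not part of the skill.

```python
import json


def is_valid_json(output: str) -> str:
    """Return "Pass" if `output` parses as JSON, otherwise "Fail"."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return "Fail"
    return "Pass"


print(is_valid_json('{"answer": "Pass"}'))  # Pass
print(is_valid_json('{"answer": Pass}'))    # Fail (unquoted value)
```

The check needs no model call and gives the same result on every run, which is why failures of this kind are better handled in code than by a judge.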

Use LLM-as-Judge when the failure requires interpretation or nuance:

- Tone appropriateness, summary faithfulness, response helpfulness, explanation clarity, creative quality.
- These require a separate LLM (distinct from the application) to judge outputs.

Each failure mode gets its own dedicated evaluator. Never combine multiple criteria into a single judge prompt—this introduces ambiguity and makes diagnosis harder.

The Full Workflow

1. Write Prompt Template
2. Split Labeled Data (Train / Dev / Test)
3. Iteratively Refine Prompt (measure TPR/TNR on Dev)
4. Estimate & Correct Success Rate (on Test + Unlabeled)

Step 1: Write the Judge Prompt

A well-structured judge prompt has four essential components. Read references/prompt-template.md for a complete annotated example.

1. Clear Task and Evaluation Criterion

Focus on ONE well-scoped failure mode. Vague tasks lead to unreliable judgments.

❌ "Is this email good?"
✅ "Is the tone appropriate for a luxury buyer persona?"

2. Precise Pass/Fail Definitions

Define what counts as Pass (failure absent) and Fail (failure present), grounded in the failure descriptions from error analysis. Be specific about boundary conditions.

3. Few-Shot Examples

Include labeled examples that clearly Pass and clearly Fail. These calibrate the judge's decision boundary and are best drawn from human-labeled traces.

- Use clear-cut cases, not edge cases, for initial examples.
- For binary judgments, include at least one Pass and one Fail example.
- If using finer-grained scales (e.g., 1–3 severity), include examples for every point on the scale.
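
One way to turn labeled examples into prompt text is sketched below. The `LabeledExample` type, the `render_few_shot` helper, the layout of each example, and the two sample emails are all hypothetical; the skill's own template lives in references/prompt-template.md and may look different.

```python
from dataclasses import dataclass


@dataclass
class LabeledExample:
    output: str     # the pipeline output being judged
    label: str      # human label: "Pass" or "Fail"
    reasoning: str  # short human explanation of the label


def render_few_shot(examples: list[LabeledExample]) -> str:
    """Format labeled examples as a text block for the judge prompt."""
    labels = {ex.label for ex in examples}
    if not {"Pass", "Fail"} <= labels:
        raise ValueError("include at least one Pass and one Fail example")
    blocks = []
    for i, ex in enumerate(examples, start=1):
        blocks.append(
            f"Example {i}\n"
            f"Output: {ex.output}\n"
            f"Reasoning: {ex.reasoning}\n"
            f"Answer: {ex.label}"
        )
    return "\n\n".join(blocks)


# Made-up examples for the "luxury buyer persona" criterion.
examples = [
    LabeledExample("We would be delighted to arrange a private viewing.",
                   "Pass", "Formal, attentive tone."),
    LabeledExample("Hey!! Crazy cheap deal, grab it now!!!",
                   "Fail", "Casual, discount-driven tone."),
]
print(render_few_shot(examples))
```

Placing the reasoning before the answer in each example mirrors the output format the judge itself is asked to produce.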

4. Structured Output Format

The judge responds in a consistent, machine-readable format:

{
  "reasoning": "1-2 sentence explanation for the decision.",
  "answer": "Pass"
}

The reasoning field comes first—this induces chain-of-thought before the verdict, improving accuracy.
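
A machine-readable format only helps if the reply is validated before it is counted. The sketch below parses a reply in the format shown above; `parse_verdict` is a hypothetical helper name, and treating anything other than exactly "Pass" or "Fail" as an error is an assumption about how strict you want to be.

```python
import json


def parse_verdict(raw: str) -> tuple[str, str]:
    """Parse the judge's JSON reply into (answer, reasoning).

    Raises ValueError if the reply is not a JSON object or the answer
    is not exactly "Pass" or "Fail".
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    answer = data.get("answer")
    if answer not in ("Pass", "Fail"):
        raise ValueError(f"unexpected answer: {answer!r}")
    return answer, str(data.get("reasoning", ""))


reply = '{"reasoning": "Tone is formal and attentive.", "answer": "Pass"}'
print(parse_verdict(reply))  # ('Pass', 'Tone is formal and attentive.')
```

Raising on malformed replies, rather than defaulting to Pass or Fail, keeps parsing failures from silently biasing the measured rates.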

Step 2: Split Labeled Data

Designing a judge resembles training a classifier, except "training" happens through prompt engineering. Split your human-labeled traces into three disjoint sets:

| Set | Purpose | Typical Allocation |
|---|---|---|
| Training | Pool of candidates for few-shot examples in the prompt | 10–20% |
| Dev | Iteratively refine the prompt; measure agreement with human labels | 40–45% |
| Test | Final, unbiased measurement of judge accuracy (TPR/TNR) | 40–45% |

Key rules:

- Dev examples must never appear in the prompt. Otherwise Dev agreement measures memorization rather than generalization.
- Test examples are held out until the prompt is finalized. Never look at them during development.
- In-context learning typically saturates after 1–8 well-chosen examples, so allocate more data to evaluation than to the prompt.
- Both Dev and Test should contain enough Pass and Fail examples, ideally 30–50 of each.
- Reusing examples across splits leads to overfitting and inflated accuracy.

If you have ~100 labeled traces (50 Pass, 50 Fail), a reasonable split is 10 training, 40 dev, and 50 test.
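
The 10/40/50 split above can be sketched as follows. The function `split_traces`, the `(trace_id, label)` tuple shape, and the choice to split each class separately so that every set keeps a similar Pass/Fail balance are assumptions made for illustration.

```python
import random


def split_traces(traces, n_train, n_dev, seed=0):
    """Split labeled traces into disjoint train/dev/test lists.

    `traces` is a list of (trace_id, label) pairs with label "Pass" or
    "Fail". Each class is shuffled and divided separately; whatever is
    left after train and dev goes to test.
    """
    rng = random.Random(seed)
    train, dev, test = [], [], []
    total = len(traces)
    for label in ("Pass", "Fail"):
        group = [t for t in traces if t[1] == label]
        rng.shuffle(group)
        # Give this class its proportional share of each split.
        k_train = round(n_train * len(group) / total)
        k_dev = round(n_dev * len(group) / total)
        train += group[:k_train]
        dev += group[k_train:k_train + k_dev]
        test += group[k_train + k_dev:]
    return train, dev, test


traces = [(i, "Pass" if i < 50 else "Fail") for i in range(100)]
train, dev, test = split_traces(traces, n_train=10, n_dev=40)
print(len(train), len(dev), len(test))  # 10 40 50
```

Fixing the seed makes the split reproducible, which matters because a trace that moves between sets across runs would break the disjointness rule.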

Step 3: Iteratively Refine the Prompt

This is the core loop. Think of it as tuning a classifier, but by revising text instead of adjusting parameters.

The Refinement Loop

1. Write a baseline prompt using the four components above, with a few examples from the Training set.
2. Run the judge on the Dev set. Compare each judgment to human ground truth.
3. Measure agreement using TPR and TNR:
   - TPR = (actual Passes correctly judged Pass) / (total actual Passes)
   - TNR = (actual Fails correctly judged Fail) / (total actual Fails)
4. Inspect disagreements. Review false passes (judge said Pass, human said Fail) and false fails (judge said Fail, human said Pass). Identify ambiguous criteria or missing edge cases.
5. Refine the prompt: clarify Pass/Fail definitions, swap in better few-shot examples from Training, add representative edge cases.
6. Repeat until TPR and TNR stabilize at acceptable levels.
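
The measurement and inspection steps of the loop can be sketched as below, with Pass as the positive class to match the TPR and TNR definitions above. The function names and the list-of-strings input format are illustrative assumptions.

```python
def tpr_tnr(human, judge):
    """Compute (TPR, TNR) with Pass as the positive class.

    `human` and `judge` are equal-length lists of "Pass"/"Fail" labels.
    """
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    passes = [j for h, j in zip(human, judge) if h == "Pass"]
    fails = [j for h, j in zip(human, judge) if h == "Fail"]
    if not passes or not fails:
        raise ValueError("need at least one human Pass and one human Fail")
    tpr = sum(j == "Pass" for j in passes) / len(passes)
    tnr = sum(j == "Fail" for j in fails) / len(fails)
    return tpr, tnr


def disagreements(human, judge):
    """Return indices where the judge and the human label differ."""
    return [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]


human = ["Pass", "Pass", "Pass", "Pass", "Fail", "Fail", "Fail", "Fail"]
judge = ["Pass", "Pass", "Pass", "Fail", "Fail", "Fail", "Pass", "Fail"]
print(tpr_tnr(human, judge))        # (0.75, 0.75)
print(disagreements(human, judge))  # [3, 6]
```

Here index 3 is a false fail and index 6 is a false pass; those are the traces to read before revising the prompt.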

Why TPR and TNR (Not Precision/Recall)

The end goal is estimating the true pass rate of the pipeline. A judge can only mis-estimate this in two ways: missing real Passes (lowers the observed rate) or passing real Fails (inflates it). TPR and TNR capture these two error modes directly.
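
A small worked example makes the two error modes concrete. If the true pass rate is θ, the judge is expected to report Pass on θ·TPR + (1 − θ)·(1 − TNR) of traces. The function name and the numbers below are chosen for illustration only.

```python
def expected_observed_rate(true_rate, tpr, tnr):
    """Pass rate the judge is expected to report for a given true pass rate."""
    # True Passes judged Pass, plus true Fails wrongly judged Pass.
    return true_rate * tpr + (1 - true_rate) * (1 - tnr)


# Missing real Passes (TPR < 1) lowers the observed rate below 0.80.
print(round(expected_observed_rate(0.80, tpr=0.90, tnr=1.00), 2))  # 0.72
# Passing real Fails (TNR < 1) inflates it above 0.80.
print(round(expected_observed_rate(0.80, tpr=1.00, tnr=0.90), 2))  # 0.82
```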

When to Stop

Stop when TPR and TNR reach satisfactory levels (typically >90%). Missing a real failure may be costlier than flagging a false one—adjust thresholds to your application's risk tolerance.

If Alignment Stalls

- Use a more capable LLM: a larger model may resolve subtle errors.
- Decompose the criterion: break a complex failure into smaller, atomic checks.
- Improve labeled data: add diverse, high-quality examples, especially edge cases.
- Verify label quality: sometimes the issue is inconsistent or incorrect human labels.

Manual iteration is recommended before automation (e.g., DSPy). It builds intuition about both the failure mode and the judge's behavior. Writing the prompt forces you to externalize your specification.

Step 4: Estimate True Success Rates

After finalizing the prompt, freeze it and run it on the Test set to get TPR and TNR. Then apply the judge to unlabeled production traces and correct the observed pass rate for judge bias.

Read references/success-rate-estimation.md for the full procedure, formula, Python code, and confidence interval calculation.

Quick Reference

1. Measure judge accuracy on the Test set → TPR, TNR
2. Observe the raw success rate on unlabeled data → p_obs = k/m
3. Correct for bias using the Rogan-Gladen formula:

   θ̂ = (p_obs + TNR - 1) / (TPR + TNR - 1), clipped to [0, 1]

4. Bootstrap a confidence interval: resample the Test set labels B times, recompute the corrected rate each time, and take the 2.5th/97.5th percentiles.

If TPR + TNR - 1 ≈ 0, the judge is no better than random chance and correction is invalid.
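
The quick reference can be sketched in standard-library Python as below. This is not the code from references/success-rate-estimation.md; the function names, the default of 2000 resamples, and the decision to hold p_obs fixed while resampling only the Test set are assumptions.

```python
import random


def corrected_rate(p_obs, tpr, tnr):
    """Rogan-Gladen corrected pass rate, clipped to [0, 1]."""
    denom = tpr + tnr - 1
    if abs(denom) < 1e-9:
        raise ValueError("TPR + TNR - 1 is ~0: judge is no better than chance")
    return min(1.0, max(0.0, (p_obs + tnr - 1) / denom))


def bootstrap_ci(human, judge, p_obs, n_boot=2000, seed=0):
    """95% percentile interval from resampling the Test set with replacement."""
    rng = random.Random(seed)
    pairs = list(zip(human, judge))
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        passes = [j for h, j in sample if h == "Pass"]
        fails = [j for h, j in sample if h == "Fail"]
        if not passes or not fails:
            continue  # this resample lost a class; skip it
        tpr = sum(j == "Pass" for j in passes) / len(passes)
        tnr = sum(j == "Fail" for j in fails) / len(fails)
        if abs(tpr + tnr - 1) < 1e-9:
            continue  # correction undefined for this resample
        estimates.append(corrected_rate(p_obs, tpr, tnr))
    if not estimates:
        raise ValueError("no usable bootstrap resamples")
    estimates.sort()
    lo = estimates[int(0.025 * (len(estimates) - 1))]
    hi = estimates[int(0.975 * (len(estimates) - 1))]
    return lo, hi


# Judge reports 75% Pass on unlabeled data; Test set gave TPR 0.90, TNR 0.85.
print(round(corrected_rate(0.75, tpr=0.90, tnr=0.85), 3))  # 0.8
```

In this example the judge under-reports: the observed 75% corresponds to a corrected estimate of 80%, because the judge misses real Passes more often than it lets real Fails through at this pass rate.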

Key Insight

Improving TPR (the judge's ability to identify true successes) narrows the confidence interval the most. Judge errors mainly inflate uncertainty rather than shifting the corrected estimate.

Common Pitfalls

- Omitting examples from the prompt. Without concrete examples, the judge lacks grounding. This is the most common mistake.
- Evaluating multiple criteria in a single prompt. Break complex metrics into narrower, specific prompts for better alignment and diagnosability.
- Skipping alignment validation. Don't assume the judge "just works." Domain-specific criteria require prompt refinement and human-labeled validation.
- Overfitting to labeled traces. If few-shot examples also appear in the evaluation set, TPR/TNR will be inflated. Any trace used in the prompt must be excluded from Dev and Test.
- Never revisiting the judge. Production data drifts, new failure modes emerge, and LLM updates shift behavior. Periodically re-validate.
- Not pinning the judge model version. In CI pipelines, pin the exact model version (e.g., claude-sonnet-4-5-20250929) to prevent results from fluctuating due to unannounced updates.

Long-Document Considerations

When judging outputs from long-document pipelines:

- Don't feed the full document into the judge; use only the relevant portion (e.g., the source paragraph a summary came from).
- Consider chunk-level evaluation with aggregated per-chunk judgments.
- Make rubrics especially clear about what "correct" means since the judge won't see the full context.
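
Aggregating per-chunk judgments can be as simple as the sketch below. The rule used here, that a document fails if any chunk fails, is an assumption; the text above does not prescribe an aggregation rule, and a threshold or majority rule may suit other criteria better.

```python
def aggregate_chunk_verdicts(verdicts):
    """Combine per-chunk "Pass"/"Fail" verdicts into one document verdict.

    Assumed rule: the document fails if any chunk fails. Also returns
    the indices of failing chunks so they can be inspected.
    """
    if not verdicts:
        raise ValueError("no chunk verdicts to aggregate")
    failed = [i for i, v in enumerate(verdicts) if v == "Fail"]
    return ("Fail" if failed else "Pass"), failed


print(aggregate_chunk_verdicts(["Pass", "Fail", "Pass"]))  # ('Fail', [1])
print(aggregate_chunk_verdicts(["Pass", "Pass"]))          # ('Pass', [])
```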

CI Integration

For continuous integration, build a golden dataset of curated input examples with reference outputs. On each pipeline change:

- Run all golden inputs through the pipeline.
- Evaluate outputs with your suite of automated evaluators (code-based + LLM-as-Judge).
- Pin the judge model version to prevent CI flicker.
- Include examples covering core features, known failure modes, and edge cases.

This catches regressions but does not predict overall production accuracy — its purpose is stability as the pipeline evolves.
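
A regression gate over a golden dataset might look like the sketch below. `run_golden_suite`, the callable-based `pipeline` and `evaluators` arguments, and the stand-in pipeline are all hypothetical; a real suite would call the actual pipeline and a judge with a pinned model version, and could also compare against the reference outputs, which this sketch ignores.

```python
def run_golden_suite(golden_inputs, pipeline, evaluators):
    """Run every golden input through the pipeline and all evaluators.

    `evaluators` maps a name to a function returning "Pass" or "Fail".
    Returns a list of (input, evaluator_name) pairs that did not pass.
    """
    failures = []
    for item in golden_inputs:
        output = pipeline(item)
        for name, evaluate in evaluators.items():
            if evaluate(output) != "Pass":
                failures.append((item, name))
    return failures


# Stand-ins so the sketch runs without any external service.
def fake_pipeline(text):
    return text.strip().upper()


evaluators = {
    "non_empty": lambda out: "Pass" if out else "Fail",
    "max_length": lambda out: "Pass" if len(out) <= 20 else "Fail",
}

failures = run_golden_suite(["hello", "   ", "short text"], fake_pipeline, evaluators)
print(failures)  # [('   ', 'non_empty')]
```

In CI, a non-empty `failures` list would typically fail the build, and the recorded evaluator names show which failure mode regressed.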

Build, validate, and deploy LLM-as-Judge evaluators for automated quality assessment of LLM pipeline outputs. Use this skill whenever the user wants to: create an automated evaluator for subjective or nuanced failure modes, write a judge prompt for Pass/Fail assessment, split labeled data for judge development, measure judge alignment (TPR/TNR), estimate true success rates with bias correction, or set up CI evaluation pipelines. Also trigger when the user mentions "judge prompt", "automated eval", "LLM evaluator", "grading prompt", "alignment metrics", "true positive rate", or wants to move from manual trace review to automated evaluation. This skill covers the full lifecycle: prompt design → data splitting → iterative refinement → success rate estimation.

8 stars

development

Updated Apr 7, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Score |
|---|---|---|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 7, 2026, 2:45 AM (39.7s, 3 files scanned)

Install manually

# Clone the repo
git clone https://github.com/maragudk/evals-skills.git

# Copy into Claude Code skills folder (global)
cp -r evals-skills/skills/llm-as-a-judge ~/.claude/skills/

Repository: maragudk/evals-skills (8 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT