skills/llm-as-a-judge/SKILL.md
Build, validate, and deploy LLM-as-Judge evaluators for automated quality assessment of LLM pipeline outputs. Use this skill whenever the user wants to: create an automated evaluator for subjective or nuanced failure modes, write a judge prompt for Pass/Fail assessment, split labeled data for judge development, measure judge alignment (TPR/TNR), estimate true success rates with bias correction, or set up CI evaluation pipelines. Also trigger when the user mentions "judge prompt", "automated eval", "LLM evaluator", "grading prompt", "alignment metrics", "true positive rate", or wants to move from manual trace review to automated evaluation. This skill covers the full lifecycle: prompt design → data splitting → iterative refinement → success rate estimation.
Build reliable automated evaluators that use an LLM to judge the outputs of another LLM pipeline. Each judge targets a single, binary (Pass/Fail) failure mode identified during error analysis.
Choose the right evaluator type for each failure mode:
Use code-based evaluators when the failure is objective and deterministic:
Use LLM-as-Judge when the failure requires interpretation or nuance:
Each failure mode gets its own dedicated evaluator. Never combine multiple criteria into a single judge prompt—this introduces ambiguity and makes diagnosis harder.
1. Write Prompt Template
2. Split Labeled Data (Train / Dev / Test)
3. Iteratively Refine Prompt (measure TPR/TNR on Dev)
4. Estimate & Correct Success Rate (on Test + Unlabeled)
A well-structured judge prompt has four essential components. Read references/prompt-template.md for a complete annotated example.
Focus on ONE well-scoped failure mode. Vague tasks lead to unreliable judgments.
Define what counts as Pass (failure absent) and Fail (failure present), grounded in the failure descriptions from error analysis. Be specific about boundary conditions.
Include labeled examples that clearly Pass and clearly Fail. These calibrate the judge's decision boundary. Best drawn from human-labeled traces.
The judge responds in a consistent, machine-readable format:
{
  "reasoning": "1-2 sentence explanation for the decision.",
  "answer": "Pass"
}
The reasoning field comes first—this induces chain-of-thought before the verdict, improving accuracy.
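Because downstream metrics depend on every verdict being exactly "Pass" or "Fail", it helps to validate the judge's response before counting it. The following is a minimal sketch, assuming the judge returns the JSON object shown above as a plain string; the function name `parse_judge_output` is hypothetical and not part of this skill.

```python
import json

# Assumption: the only valid verdicts are the two binary labels used in this skill.
VALID_ANSWERS = {"Pass", "Fail"}


def parse_judge_output(raw: str) -> tuple[str, str]:
    """Parse a judge response and return (answer, reasoning).

    Raises ValueError if the response is not the expected JSON object,
    so malformed verdicts are never silently counted as Pass or Fail.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge output is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")

    answer = data.get("answer")
    reasoning = data.get("reasoning")

    if not isinstance(answer, str) or answer not in VALID_ANSWERS:
        raise ValueError(f"unexpected answer: {answer!r}")
    if not isinstance(reasoning, str) or not reasoning.strip():
        raise ValueError("missing or empty reasoning")

    return answer, reasoning
```

Rejecting malformed responses, rather than defaulting them to one label, keeps parsing failures visible instead of biasing the measured pass rate.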
Designing a judge resembles training a classifier, except "training" happens through prompt engineering. Split your human-labeled traces into three disjoint sets:
| Set | Purpose | Typical Allocation |
|---|---|---|
| Training | Pool of candidates for few-shot examples in the prompt | 10–20% |
| Dev | Iteratively refine the prompt; measure agreement with human labels | 40–45% |
| Test | Final, unbiased measurement of judge accuracy (TPR/TNR) | 40–45% |
Key rules:
If you have ~100 labeled traces (50 Pass, 50 Fail), a reasonable split: 10 training, 40 dev, 50 test.
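One way to produce such a split is sketched below. It assumes each trace is a dict with a `label` key set to "Pass" or "Fail", and it stratifies by label so each set keeps the same Pass/Fail balance; both the data shape and the stratification are assumptions of this sketch, and `split_labeled_traces` is a hypothetical helper.

```python
import random


def split_labeled_traces(traces, train_frac=0.10, dev_frac=0.40, seed=0):
    """Split human-labeled traces into disjoint train/dev/test sets.

    Each label group is shuffled and sliced separately (stratified),
    so no trace can land in more than one set.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    splits = {"train": [], "dev": [], "test": []}

    for label in ("Pass", "Fail"):
        group = [t for t in traces if t["label"] == label]
        rng.shuffle(group)
        n_train = round(len(group) * train_frac)
        n_dev = round(len(group) * dev_frac)
        splits["train"] += group[:n_train]
        splits["dev"] += group[n_train:n_train + n_dev]
        splits["test"] += group[n_train + n_dev:]  # remainder goes to test

    return splits
```

With 50 Pass and 50 Fail traces, the default fractions yield 10 training, 40 dev, and 50 test traces, matching the example above.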
This is the core loop. Think of it as tuning a classifier, but by revising text instead of adjusting parameters.
The end goal is estimating the true pass rate of the pipeline. A judge can only mis-estimate this in two ways: missing real Passes (lowers the observed rate) or passing real Fails (inflates it). TPR and TNR capture these two error modes directly.
Stop when TPR and TNR reach satisfactory levels (typically >90%). Missing a real failure may be costlier than flagging a false one—adjust thresholds to your application's risk tolerance.
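The two metrics can be computed directly from paired human and judge labels. This is a minimal sketch, treating Pass as the positive class (consistent with TPR measuring how well the judge identifies true successes); `alignment_metrics` is a hypothetical name.

```python
def alignment_metrics(human_labels, judge_labels):
    """Return (TPR, TNR) of the judge against human labels.

    TPR: fraction of human-labeled Pass traces the judge also marks Pass.
    TNR: fraction of human-labeled Fail traces the judge also marks Fail.
    """
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(human_labels, judge_labels))
    n_pass = sum(h == "Pass" for h, _ in pairs)
    n_fail = sum(h == "Fail" for h, _ in pairs)
    if n_pass == 0 or n_fail == 0:
        raise ValueError("need at least one human Pass and one human Fail")

    true_pos = sum(h == "Pass" and j == "Pass" for h, j in pairs)
    true_neg = sum(h == "Fail" and j == "Fail" for h, j in pairs)
    return true_pos / n_pass, true_neg / n_fail


human = ["Pass", "Pass", "Pass", "Pass", "Fail", "Fail", "Fail", "Fail"]
judge = ["Pass", "Pass", "Pass", "Fail", "Fail", "Fail", "Pass", "Fail"]
tpr, tnr = alignment_metrics(human, judge)  # (0.75, 0.75)
```

Inspecting the disagreeing traces after each run is usually more informative than the numbers alone, since those traces show which part of the prompt to revise.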
Manual iteration is recommended before automation (e.g., DSPy). It builds intuition about both the failure mode and the judge's behavior. Writing the prompt forces you to externalize your specification.
After finalizing the prompt, freeze it and run it once on the Test set to get TPR and TNR. Then run the judge on unlabeled production traces and apply bias correction to the observed pass rate.
Read references/success-rate-estimation.md for the full procedure, formula, Python code, and confidence interval calculation.
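θ̂ = (p_obs + TNR - 1) / (TPR + TNR - 1), clipped to [0, 1]
Here p_obs is the fraction of unlabeled traces the judge marks Pass, and θ̂ is the corrected estimate of the true pass rate.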
If TPR + TNR - 1 ≈ 0, the judge is no better than random chance and correction is invalid.
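The correction formula above translates into a few lines of Python. This is a sketch only, not the code from references/success-rate-estimation.md; the function name and the `min_discrimination` cutoff (how close to zero the denominator may get before the correction is refused) are assumptions made for illustration.

```python
def corrected_success_rate(p_obs, tpr, tnr, min_discrimination=0.05):
    """Bias-correct the judge's observed pass rate.

    p_obs: fraction of unlabeled traces the judge marked Pass.
    tpr, tnr: judge alignment measured on the held-out Test set.
    min_discrimination: hypothetical cutoff below which the judge is
        treated as too close to random for the correction to be valid.
    """
    denom = tpr + tnr - 1
    if denom <= min_discrimination:
        raise ValueError("judge is near chance level; correction is invalid")

    theta = (p_obs + tnr - 1) / denom
    return min(1.0, max(0.0, theta))  # clip to [0, 1]


# Worked example: the judge passes 80% of traces, with TPR = TNR = 0.90.
estimate = corrected_success_rate(p_obs=0.80, tpr=0.90, tnr=0.90)
```

In this example the corrected estimate is about 0.875, higher than the observed 0.80, because a judge with TPR below 1 misses some real Passes.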
Improving TPR (the judge's ability to identify true successes) narrows the confidence interval the most. Judge errors mainly inflate uncertainty rather than shifting the corrected estimate.
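To see how judge errors widen the interval, one option is a percentile bootstrap over both the Test set and the unlabeled judge verdicts. The sketch below is an assumption about method, not necessarily the procedure in references/success-rate-estimation.md, and `bootstrap_success_rate_ci` is a hypothetical name. It assumes all labels are exactly "Pass" or "Fail".

```python
import random


def bootstrap_success_rate_ci(test_human, test_judge, unlabeled_judge,
                              n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the corrected success rate."""
    rng = random.Random(seed)
    pairs = list(zip(test_human, test_judge))
    estimates = []

    for _ in range(n_boot):
        # Resample the Test set to reflect uncertainty in TPR and TNR.
        sample = rng.choices(pairs, k=len(pairs))
        n_pass = sum(h == "Pass" for h, _ in sample)
        n_fail = len(sample) - n_pass
        if n_pass == 0 or n_fail == 0:
            continue  # resample has only one class; metrics undefined
        tpr = sum(h == "Pass" and j == "Pass" for h, j in sample) / n_pass
        tnr = sum(h == "Fail" and j == "Fail" for h, j in sample) / n_fail
        denom = tpr + tnr - 1
        if denom <= 0:
            continue  # judge no better than chance in this resample

        # Resample the unlabeled verdicts to reflect uncertainty in p_obs.
        preds = rng.choices(unlabeled_judge, k=len(unlabeled_judge))
        p_obs = sum(j == "Pass" for j in preds) / len(preds)

        theta = (p_obs + tnr - 1) / denom
        estimates.append(min(1.0, max(0.0, theta)))

    if not estimates:
        raise ValueError("no valid bootstrap resamples")
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[min(len(estimates) - 1,
                          int((1 - alpha / 2) * len(estimates)))]
    return lower, upper
```

Rerunning this with a larger Test set or a better-aligned judge shows directly how much each change narrows the interval.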
Omitting examples from the prompt. Without concrete examples, the judge lacks grounding. This is the most common mistake.
Evaluating multiple criteria in a single prompt. Break complex metrics into narrower, specific prompts for better alignment and diagnosability.
Skipping alignment validation. Don't assume the judge "just works." Domain-specific criteria require prompt refinement and human-labeled validation.
Overfitting to labeled traces. If few-shot examples also appear in the evaluation set, TPR/TNR will be inflated. Any trace used in the prompt must be excluded from Dev and Test.
Never revisiting the judge. Production data drifts, new failure modes emerge, and LLM updates shift behavior. Periodically re-validate.
Not pinning the judge model version. In CI pipelines, pin the exact model version (e.g., claude-sonnet-4-5-20250929) to prevent results from fluctuating due to unannounced updates.
When judging outputs from long-document pipelines:
For continuous integration, build a golden dataset of curated input examples with reference outputs. On each pipeline change:
This catches regressions but does not predict overall production accuracy — its purpose is stability as the pipeline evolves.
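A regression check of this kind might look like the sketch below. Everything here is hypothetical: the golden example shape (`id`, `input`, `reference`), the `run_pipeline` and `run_judge` callables, and the `min_pass_rate` threshold are placeholders for whatever your pipeline and judge actually expose.

```python
def check_golden_dataset(golden_examples, run_pipeline, run_judge,
                         min_pass_rate=1.0):
    """Run the pipeline and judge over a golden dataset.

    Returns (ok, failed_ids), where ok is False if the pass rate
    falls below min_pass_rate.
    """
    if not golden_examples:
        raise ValueError("golden dataset is empty")

    failed_ids = []
    for example in golden_examples:
        output = run_pipeline(example["input"])
        # Assumed judge signature: (input, output, reference) -> "Pass" or "Fail".
        verdict = run_judge(example["input"], output, example["reference"])
        if verdict != "Pass":
            failed_ids.append(example["id"])

    pass_rate = 1 - len(failed_ids) / len(golden_examples)
    return pass_rate >= min_pass_rate, failed_ids
```

A CI job could call this and exit non-zero when `ok` is False, printing `failed_ids` so the regression can be traced to specific examples.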