skills/llm-as-judge-zheng-2023/SKILL.md
Using LLMs as automated evaluators for comparing chatbot responses with Chatbot Arena methodology
npx skillsauth add curiositech/windags-skills llm-as-judge-zheng-2023Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
license: Apache-2.0
metadata:
name: llm-as-a-judge
version: 1.0
source: "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., NeurIPS 2023)"
description: >
Empirically-grounded principles for designing automated evaluation systems
where LLMs judge other LLMs. Covers bias taxonomy, mitigation patterns,
judge selection, benchmark design, and failure mode prevention.
activation_triggers:
- "evaluation pipeline" OR "eval system" OR "quality assessment"
- "LLM judge" OR "automated grading" OR "model comparison"
- designing agent systems with routing, scoring, or selection components
- "how do I know if output X is better than output Y"
- building reward models, rankers, or preference systems
- "position bias" OR "verbosity bias" OR "self-enhancement bias"
- chatbot arena, MT-bench, or preference benchmarking discussions
- any multi-agent system where one agent evaluates another's output
Load this skill when the task involves:
Do NOT use this skill for: training ML models, RLHF pipeline implementation, or general prompt engineering unrelated to evaluation.
Small model quality improvements yield dramatic evaluation reliability gains — this relationship is not linear. GPT-4-class models achieve ~80%+ agreement with human judgments. GPT-3.5-class models introduce systematic distortions that corrupt rankings. Weaker still, and the judge is worse than useless — it adds confident noise.
Implication: Judge agent selection is not a cost-optimization decision. Assigning a weak model to evaluation roles corrupts the entire system's feedback loop. The judge must be at or above the capability level of the models it evaluates.
Without active mitigation, LLM judges exhibit four empirically-documented biases:
| Bias | Mechanism | Severity | |------|-----------|----------| | Position bias | Favors whichever response appears first (or second, by model) | Extreme — Claude-v1 only 23.8% consistent | | Verbosity bias | Longer responses rated higher regardless of quality | Near-universal — 91.3% failure rate under attack | | Self-enhancement bias | Models prefer their own outputs when judging | Measurable but model-dependent | | Reasoning contamination | Judge's reasoning hijacked by the answer in context | Causes failures even when judge "knows" correct answer |
These are not failure modes to watch for — they are the default state of an uncorrected LLM judge.
A model that can correctly solve a problem does not reliably judge whether another solution to that problem is correct. GPT-4 can solve math problems but still rates incorrect answers as correct when those answers appear in context. The cognitive task of generating a correct answer and evaluating a presented answer are distinct capabilities that don't automatically co-transfer.
Implication: Never assume that a model's task performance predicts its judging performance on the same task class.
Capability benchmarks (MMLU-style: knowledge, reasoning under standardized conditions) and preference benchmarks (Chatbot Arena-style: human-judged open-ended quality) measure different things. A model can rank highly on one while underperforming on the other. Systems that route or evaluate based on a single benchmark type will systematically misevaluate certain model classes.
Implication: Any serious evaluation system needs both axes. Aggregate metrics can mask within-category variation that matters for routing decisions.
Human majority votes (~90% agreement with ground truth) outperform individual human raters (~81%). The same principle applies to LLM judges. A single judge evaluation is a noisy signal. Ensembled evaluations — either through multiple judge runs, position-swapped dual evaluation, or multi-judge panels — produce more reliable outputs.
Implication: Design evaluation components to output distributions or ensembled verdicts, not single-shot scores.
IF evaluation results will influence production decisions
→ Use GPT-4-class model minimum; validate against human agreement rate
→ Never use the same model family as the model being judged (self-enhancement)
IF you cannot afford GPT-4-class for every eval
→ Use strong judge for calibration samples; weaker judge for bulk screening
→ Measure agreement rate of cheap judge against strong judge on calibration set
IF judging outputs from a model stronger than your judge
→ Results are unreliable by default; explicitly flag this in system design
IF evaluating pairwise comparisons (A vs B)
→ ALWAYS run both orderings (A,B) and (B,A) and check consistency
→ Inconsistent verdict = no verdict; require re-evaluation or human review
IF judging on tasks with objective correct answers (math, code, logic)
→ Use reference-guided judging: give judge the correct answer/solution
→ This reduces math grading failures from ~70% to ~15%
→ Add chain-of-thought pre-solving: require judge to solve before seeing responses
IF verbosity is a risk (open-ended generation tasks)
→ Explicitly instruct judge to ignore length; reward conciseness
→ Include rubric criterion: "Do not favor longer responses"
→ Adversarially test with repetitive list attack to verify resistance
IF judging a model with same family as judge
→ Add self-enhancement mitigation: use judge from different model family
→ Or: blind the judge to model identity (already standard, but verify)
IF building a new automated eval pipeline
→ Start with: What is my human agreement baseline? (measure it)
→ Instrument for: position consistency rate, length-score correlation
→ Include adversarial probes in test suite (position swap, repetitive list)
IF aggregate metrics look surprisingly flat between model versions
→ Disaggregate by task category — aggregate can mask within-category wins
→ Run direct pairwise comparison, not just score comparison
IF evaluation is being used for routing decisions in a multi-agent system
→ Treat judge as a component with a known reliability profile
→ Define acceptable consistency threshold before routing goes live
→ Build fallback path for low-confidence judge verdicts
IF position swap test yields consistent verdicts → moderate confidence
IF position swap test is inconsistent → discard verdict, escalate
IF judging math/code/logic without reference answer → low confidence; use reference
IF judging subjective quality (writing, tone) with strong model → reasonable confidence
IF task type is outside judge's training distribution → treat as unreliable
Load these on demand based on the specific sub-problem being addressed:
| File | When to Load |
|------|-------------|
| references/llm-judge-bias-taxonomy.md | Need precise definitions, severity data, or detection methods for any of the four biases; designing bias audits |
| references/judge-vs-solve-capability-gap.md | Task involves math, code, or logic evaluation; debugging cases where a capable model still misjudges; understanding reasoning contamination |
| references/benchmark-complementarity-evaluation-design.md | Designing a benchmark suite; comparing models across task types; results from single benchmark seem misleading |
| references/scalable-explainable-evaluation-architecture.md | Building production evaluation pipelines; need explainability in judge outputs; scaling evaluation cost-effectively |
| references/crowdsourced-vs-controlled-evaluation.md | Deciding between expert annotation and crowdsourced judgment; designing human evaluation ground truth; understanding human agreement rates |
| references/prompting-patterns-for-reliable-llm-judgment.md | Writing judge system prompts; implementing position swap, reference-guided judging, or chain-of-thought mitigation; need validated prompt templates |
| references/the-quality-gap-between-raw-capability-and-aligned-output.md | Confused by MMLU vs. preference benchmark divergence; understanding what fine-tuning changes (and doesn't); selecting models for conversational quality |
| references/failure-modes-in-automated-evaluation-systems.md | Auditing an existing eval system; building adversarial test suite; system producing unexpected or gameable results; pre-launch evaluation checklist |
These are the mistakes this research specifically identifies and warns against:
🚫 Treating evaluation as an afterthought Evaluation infrastructure requires the same design rigor as worker agents. A weak or biased judge corrupts the entire system's learning signal.
🚫 Using weaker models as judges to save cost The cost of an unreliable judge is systematic corruption of rankings. The bias isn't random noise — it's directional, making worse outputs look better in predictable ways.
🚫 Single-pass, single-order pairwise evaluation Running (A vs B) once and taking the verdict is known to be unreliable. Position bias alone makes this unacceptable. Always run both orderings.
🚫 Assuming capability = judging ability "GPT-4 is smart, so it will be a good judge" is not sufficient reasoning. Judging is a distinct capability. Validate empirically against human agreement rates for your specific task types.
🚫 Using a single aggregate benchmark to route decisions Aggregate scores mask category-level variation. A model that wins on aggregate math can still fail specific math subtypes that matter for your use case.
🚫 Judging math/code without reference answers This is the most documented failure mode in the paper. Correct answers in context corrupt the judge's reasoning. Always use reference-guided judging for tasks with objective answers.
🚫 Not instrumenting for bias If you aren't measuring position consistency rate and length-score correlation in your eval system, you don't know if it's working. These are required monitoring metrics, not optional.
🚫 Treating human judgment as a perfect gold standard Individual human agreement is ~81%. The target is human majority judgment (~90%), not any individual rater. Design accordingly — ensemble human judgments just as you would ensemble LLM judgments.
How to tell if someone has actually internalized this book vs. skimmed it:
They have internalized it if they:
They haven't internalized it if they:
This skill covers the Zheng et al. NeurIPS 2023 paper as canonical reference. For adjacent topics — RLHF training, constitutional AI, reward model design — load separate skills.
tools
Building resilient distributed systems with circuit breakers, retries with full-jitter exponential backoff, retry budgets (per-request 3-attempt + per-client 10% ratio per Google SRE), deadline propagation, and the cascading-failure math (4 layers × 3 retries = 64x amplification). Grounded in Resilience4j, Microsoft Cloud Patterns, AWS Architecture Blog (Marc Brooker), and Google SRE Book.
testing
Designing HTTP cache headers that work correctly across browsers, CDNs, and shared proxies — `Cache-Control` directives per RFC 9111, `stale-while-revalidate` and `stale-if-error` per RFC 5861, the Vary header for varying responses, and surrogate keys for tag-based purging. Grounded in IETF RFCs and Cloudflare/Fastly docs.
development
Use when designing or fixing a Content Security Policy on a real site, choosing between nonce-based and hash-based CSP, adding strict-dynamic, debugging "Refused to execute inline script" errors, deploying CSP in report-only mode first, configuring report-to / report-uri, or auditing an existing policy for unsafe-inline / unsafe-eval / wildcards. Triggers: "CSP blocks legitimate inline script", strict-dynamic, nonce-{RANDOM}, sha256-{HASH}, object-src none, base-uri none, frame-ancestors, Trusted Types, X-Content-Security-Policy obsolete, report-only vs enforced. NOT for general HTTP security headers (HSTS, COOP/COEP), Trusted Types deep dive, CORS configuration, or building a WAF.
tools
Choosing and operating an HTTP API versioning strategy that doesn't break clients — Stripe's date-based pinned versions, the Deprecation/Sunset header pair (RFC 9745 + RFC 8594), URI vs header vs media-type approaches, and the version-transformer pattern. Grounded in Stripe's published architecture and IETF RFCs.