skills/evals/llm-as-judge/SKILL.md
Use an LLM as an evaluator for open-ended outputs — rubrics, pairwise comparison, calibration with human labels, bias mitigation. Covers when LLM-judge works, when it fails, and how to trust its scores. Use this skill when evaluating generative outputs at scale, building eval pipelines, or replacing expensive human review for non-critical judgments. Activate when: LLM as judge, LLM evaluator, automated evaluation, pairwise comparison, rubric evaluation, eval model.
Use a strong LLM to evaluate another LLM's output. Done right, it's fast, cheap, and correlates with human judgment. Done wrong, it's biased, inconsistent, and misleading.
If a deterministic check such as exact_match can score the output, it is cheaper than a judge.

Rubric scoring: the judge rates one output against explicit criteria on a 1-5 scale.
const prompt = `You are evaluating a response. Rate it 1-5 on each criterion.
<user_query>${query}</user_query>
<response>${response}</response>
Criteria:
- accuracy: factually correct?
- helpfulness: addresses what the user asked?
- conciseness: no unnecessary verbosity?
Return JSON: {"accuracy": N, "helpfulness": N, "conciseness": N, "reasoning": "..."}`;
const judgment = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 500,
  messages: [{ role: "user", content: prompt }],
});
Use a stronger model as judge than the one you're evaluating. Opus judges Sonnet; Sonnet judges Haiku.
Show the judge two outputs and have it pick the better one. This is the most reliable pattern.
const prompt = `Compare two responses to the same query. Pick which is better overall.
<query>${query}</query>
<response_A>${responseA}</response_A>
<response_B>${responseB}</response_B>
Return JSON: {"winner": "A" | "B" | "tie", "reasoning": "..."}`;
To control for position bias, run each pair twice with the order swapped. Map the second verdict back to the original labels, then average the two judgments.
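Averaging two categorical verdicts needs a numeric convention. A minimal sketch in Python, assuming a win for A scores 1, a tie 0.5, and a win for B 0. The names `pair_score` and `win_rate` and the scoring convention are illustrative assumptions, not part of the skill.

```python
# Assumed convention: score each verdict from response A's point of view.
SCORE_FOR_A = {"A": 1.0, "tie": 0.5, "B": 0.0}

# In the swapped run the original A is shown as "B", so flip the label back.
UNSWAP = {"A": "B", "B": "A", "tie": "tie"}


def pair_score(original_order: str, swapped_order: str) -> float:
    """Average two verdicts for one pair judged in both orders.

    original_order: verdict when A was shown first ("A", "B", or "tie").
    swapped_order:  verdict when B was shown first, in the judge's own labels.
    Returns A's score in [0, 1]; 0.5 means no consistent preference.
    """
    first = SCORE_FOR_A[original_order]
    second = SCORE_FOR_A[UNSWAP[swapped_order]]
    return (first + second) / 2


def win_rate(pair_scores: list[float]) -> float:
    """Mean score for A across all pairs."""
    if not pair_scores:
        raise ValueError("need at least one pair")
    return sum(pair_scores) / len(pair_scores)
```

A judge that picks the first slot both times (`pair_score("A", "A")`) lands on 0.5, so pure position bias cancels out instead of counting as a win.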
Compare output to a gold-standard reference:
const prompt = `Is the generated answer equivalent to the reference answer?
<reference>${reference}</reference>
<generated>${generated}</generated>
"Equivalent" means factually consistent — wording can differ.
Return: {"equivalent": true|false, "reasoning": "..."}`;
Cheaper than rubric but requires good references.
| Bias | Description | Mitigation |
|---|---|---|
| Position bias | Judge prefers first or second option | Randomize; run pairs twice with swapped order |
| Length bias | Judge prefers longer responses | Include "conciseness" in rubric; normalize by length |
| Self-preference | Judge prefers its own model's style | Use a DIFFERENT model family as judge |
| Verbosity bias | Judge prefers confident/flowery language | Rubric explicitly penalizes vagueness |
| Format bias | Prefers markdown/bullets over prose | Rubric targets content, not format |
Name the biases in your judge prompt — it reduces them: "Do not prefer longer responses; judge only on accuracy."
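Don't trust judge scores in isolation. Calibrate against human labels by measuring agreement:
from sklearn.metrics import cohen_kappa_score
# human_labels and judge_labels: parallel sequences of labels for the same items
kappa = cohen_kappa_score(human_labels, judge_labels)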
Re-calibrate quarterly or whenever you change judge prompts or models.
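To see what the kappa number means, here is a dependency-free sketch of the same statistic: observed agreement corrected for the agreement two raters would reach by chance. The function name `cohen_kappa` and the sample labels in the usage note are illustrative only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two parallel label sequences.

    kappa = (p_observed - p_expected) / (1 - p_expected)
    Returns NaN when both raters used a single identical label,
    because chance agreement is then 1 and kappa is undefined.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)

    # Fraction of items where the two raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        return float("nan")
    return (observed - expected) / (1 - expected)
```

For example, human labels `[1, 1, 0, 0]` against judge labels `[1, 1, 0, 1]` agree on 75% of items, but chance agreement is 50%, so kappa is 0.5. For ordinal 1-5 rubric scores, `cohen_kappa_score` also accepts `weights="linear"` or `weights="quadratic"`, which penalize near misses less than large disagreements.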
Get judgments as JSON so you can aggregate:
const judgment = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 500,
  messages: [
    { role: "user", content: judgePrompt },
    { role: "assistant", content: "{" },
  ],
});
const parsed = JSON.parse("{" + judgment.content[0].text);
Prefilling "{" nudges the model toward valid JSON but does not guarantee it. Validate against a schema before aggregating.
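One way to do that validation, sketched in Python with only the standard library. The field names follow the rubric prompt shown earlier; the function name `parse_rubric_judgment` is hypothetical, and a schema library would work just as well.

```python
import json

# Criteria taken from the rubric prompt; each must be an integer from 1 to 5.
CRITERIA = ("accuracy", "helpfulness", "conciseness")


def parse_rubric_judgment(raw_text: str) -> dict:
    """Parse one judge reply and check its shape.

    Raises ValueError if the text is not a JSON object, a score is
    missing or out of range, or the reasoning is not a string.
    """
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    for name in CRITERIA:
        score = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{name} must be an integer from 1 to 5, got {score!r}")

    if not isinstance(data.get("reasoning"), str):
        raise ValueError("reasoning must be a string")

    return data
```

Counting the rejected replies is worth doing too: a rising rejection rate is an early sign that a prompt or model change broke the judge.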
Per dataset:
Report confidence intervals (bootstrap) — a 2% score gap on 100 samples is likely noise.
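A minimal percentile-bootstrap sketch for the mean judge score, assuming `scores` is a list of per-sample numbers (for example 0/1 pass flags or 1-5 ratings). The function name, the default of 2000 resamples, and the fixed seed are illustrative choices.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`.

    Resamples the scores with replacement, records the mean of each
    resample, and returns the (lower, upper) percentile bounds.
    """
    if not scores:
        raise ValueError("need at least one score")
    rng = random.Random(seed)  # fixed seed keeps reports reproducible
    n = len(scores)

    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )

    alpha = (1 - confidence) / 2
    lower_index = int(alpha * n_resamples)
    upper_index = min(n_resamples - 1, int((1 - alpha) * n_resamples))
    return means[lower_index], means[upper_index]
```

Run it on 100 binary pass/fail scores and the interval comes out far wider than two percentage points, which is why a 2% gap at that sample size should not be read as a real difference.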
Judging is expensive. Reduce cost:
Prompt caching is one option (see prompt-caching-ttl).