
latestaiagents/llm-as-judge

Name: llm-as-judge
Author: latestaiagents

skills/evals/llm-as-judge/SKILL.md

npx skillsauth add latestaiagents/agent-skills llm-as-judge


LLM-as-Judge

Use a strong LLM to evaluate another LLM's output. Done right, it's fast, cheaper than human review, and correlates with human judgment. Done wrong, it's biased, inconsistent, and misleading.

When to Use

Scaling eval beyond what humans can review
Measuring open-ended outputs (summaries, code quality, helpfulness) where rule-based metrics fail
Pairwise model comparison (A vs B on the same input)
CI checks on agent outputs

When NOT to Use

High-stakes decisions (medical, legal) — need humans
When the judge is the same model as the generator — biased toward its own style
Very short outputs where a rule can decide — exact_match is cheaper
Tasks the judge can't do itself — if it can't write good code, it can't judge code well
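For the short-output case, a rule-based check can be as small as the sketch below. The names `normalizeAnswer` and `exactMatch` and the normalization steps (trim, lowercase, collapse whitespace) are assumptions for illustration; adjust them to the task.

```typescript
// Hypothetical helper: the normalization rules are an assumption, not a standard.
function normalizeAnswer(text: string): string {
  return text.trim().toLowerCase().replace(/\s+/g, " ");
}

// True when the output equals the expected answer after normalization.
function exactMatch(output: string, expected: string): boolean {
  return normalizeAnswer(output) === normalizeAnswer(expected);
}
```

If a check like this can decide the case, no judge call is needed.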

Three Common Patterns

1. Rubric Scoring

Judge rates one output against explicit criteria on a 1-5 scale.

import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment.
const client = new Anthropic();

// query: the user's input; response: the output under evaluation.
const prompt = `You are evaluating a response. Rate it 1-5 on each criterion.

<user_query>${query}</user_query>
<response>${response}</response>

Criteria:
- accuracy: factually correct?
- helpfulness: addresses what the user asked?
- conciseness: no unnecessary verbosity?

Return JSON: {"accuracy": N, "helpfulness": N, "conciseness": N, "reasoning": "..."}`;

// Ask the judge model for per-criterion scores.
const judgment = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 500,
  messages: [{ role: "user", content: prompt }],
});

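Use a stronger model as judge than the one you're evaluating: Opus judges Sonnet; Sonnet judges Haiku. A judge from the same family can still favor its own style, so where that matters, pick a judge from a different model family (see Known Biases).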

2. Pairwise Comparison

Show two outputs; the judge picks which is better. Usually the most reliable of the three patterns.

const prompt = `Compare two responses to the same query. Pick which is better overall.

<query>${query}</query>
<response_A>${responseA}</response_A>
<response_B>${responseB}</response_B>

Return JSON: {"winner": "A" | "B" | "tie", "reasoning": "..."}`;

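To control for position bias, run each pair TWICE with the order swapped, then average the two judgments (for example, win = 1, tie = 0.5, loss = 0). A verdict that flips with the order then averages out to a tie.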
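One way to combine the two runs, sketched under assumptions: the swapped run puts response B in the `<response_A>` slot, so its verdict has to be mapped back before scoring. The helper names and the 1 / 0.5 / 0 scoring are illustrative choices, not part of the skill.

```typescript
type Verdict = "A" | "B" | "tie";

// Score from response A's point of view: win = 1, tie = 0.5, loss = 0.
function scoreForA(verdict: Verdict): number {
  if (verdict === "A") return 1;
  if (verdict === "B") return 0;
  return 0.5;
}

// In the swapped run the labels are reversed, so map the verdict back
// to the original labelling.
function unswap(verdict: Verdict): Verdict {
  if (verdict === "A") return "B";
  if (verdict === "B") return "A";
  return "tie";
}

// Hypothetical helper: average the original-order and swapped-order verdicts.
function combineSwapped(original: Verdict, swapped: Verdict): number {
  return (scoreForA(original) + scoreForA(unswap(swapped))) / 2;
}
```

A judge that always picks the first slot returns "A" in both runs, which combines to 0.5: the position preference cancels out instead of counting as a win.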

3. Reference-Based

Compare output to a gold-standard reference:

const prompt = `Is the generated answer equivalent to the reference answer?

<reference>${reference}</reference>
<generated>${generated}</generated>

"Equivalent" means factually consistent — wording can differ.
Return: {"equivalent": true|false, "reasoning": "..."}`;

Cheaper than rubric but requires good references.

Known Biases

| Bias | Description | Mitigation |
|---|---|---|
| Position bias | Judge prefers first or second option | Randomize; run pairs twice with swapped order |
| Length bias | Judge prefers longer responses | Include "conciseness" in rubric; normalize by length |
| Self-preference | Judge prefers its own model's style | Use a DIFFERENT model family as judge |
| Verbosity bias | Judge prefers confident/flowery language | Rubric explicitly penalizes vagueness |
| Format bias | Prefers markdown/bullets over prose | Rubric targets content, not format |

Name the biases in your judge prompt — it reduces them: "Do not prefer longer responses; judge only on accuracy."

Calibrating Against Humans

Don't trust judge scores in isolation. Calibrate:

Sample 50-200 outputs. Have humans label them.
Run the judge on the same set. Collect its scores.
Compute Cohen's kappa (for categorical labels) or Pearson correlation (for numeric scores) between human and judge.
If kappa is above 0.6, treat the judge as reliable for this task. If it is below 0.4, rewrite the prompt.

from sklearn.metrics import cohen_kappa_score

# human_labels and judge_labels: equal-length lists of labels for the same outputs
kappa = cohen_kappa_score(human_labels, judge_labels)

Re-calibrate quarterly or whenever you change judge prompts or models.
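For a pipeline that stays in TypeScript, unweighted Cohen's kappa is short enough to compute directly. This is a sketch; the function name `cohenKappa` and the choice to return 1 when both raters use a single identical label are assumptions.

```typescript
// Unweighted Cohen's kappa for two raters over categorical labels.
// kappa = (observed agreement - chance agreement) / (1 - chance agreement)
function cohenKappa(human: string[], judge: string[]): number {
  if (human.length !== judge.length || human.length === 0) {
    throw new Error("label arrays must be non-empty and the same length");
  }
  const n = human.length;
  const humanCounts: Record<string, number> = {};
  const judgeCounts: Record<string, number> = {};
  let agreements = 0;
  for (let i = 0; i < n; i++) {
    if (human[i] === judge[i]) agreements++;
    humanCounts[human[i]] = (humanCounts[human[i]] ?? 0) + 1;
    judgeCounts[judge[i]] = (judgeCounts[judge[i]] ?? 0) + 1;
  }
  const observed = agreements / n;
  // Chance agreement: sum over labels of the product of each rater's label frequency.
  let expected = 0;
  for (const label of Object.keys(humanCounts)) {
    expected += (humanCounts[label] / n) * ((judgeCounts[label] ?? 0) / n);
  }
  // Assumption: if both raters used one identical label throughout, report 1.
  if (expected === 1) return 1;
  return (observed - expected) / (1 - expected);
}
```

With human labels `["pass", "pass", "fail", "fail"]` and judge labels `["pass", "pass", "fail", "pass"]`, observed agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5: in the gap between the 0.4 and 0.6 thresholds.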

Prompt Design for Judges

Role first — "You are a code review expert"
Criteria explicit — don't leave "quality" undefined
Output format constrained — JSON or a fixed label set
Reasoning field included — forces the model to justify, reduces careless judgments
Examples — 2-3 few-shots of correct judgments on your criteria
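The five points above can be assembled mechanically. A minimal sketch, assuming a rubric-style judge; `buildJudgePrompt`, its parameter shapes, and the XML-style tag names are hypothetical.

```typescript
interface Criterion {
  name: string;
  definition: string;
}

interface FewShot {
  response: string;
  judgment: string; // the correct judgment, as the JSON the judge should emit
}

// Hypothetical builder: role first, explicit criteria, few-shot examples,
// then a constrained JSON output format that includes a reasoning field.
function buildJudgePrompt(
  role: string,
  criteria: Criterion[],
  examples: FewShot[],
  candidate: string
): string {
  const criteriaLines = criteria.map((c) => `- ${c.name}: ${c.definition}`).join("\n");
  const scoreFields = criteria.map((c) => `"${c.name}": N`).join(", ");
  const exampleBlocks = examples
    .map((e) => `<example>\n<response>${e.response}</response>\n<judgment>${e.judgment}</judgment>\n</example>`)
    .join("\n");
  return [
    `You are ${role}. Rate the response 1-5 on each criterion.`,
    `Criteria:\n${criteriaLines}`,
    exampleBlocks,
    `<response>${candidate}</response>`,
    `Return JSON: {${scoreFields}, "reasoning": "..."}`,
  ]
    .filter((part) => part.length > 0) // drop the examples section when none are given
    .join("\n\n");
}
```

Generating the output format from the same criteria list keeps the rubric and the expected JSON keys from drifting apart.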

Structured Output

Get judgments as JSON so you can aggregate:

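// Prefill the assistant turn with "{" so the reply continues a JSON object.
const judgment = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 500,
  messages: [
    { role: "user", content: judgePrompt },
    { role: "assistant", content: "{" },
  ],
});
// The reply omits the prefilled "{", so add it back before parsing.
const parsed = JSON.parse("{" + judgment.content[0].text);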

Prefilling "{" nudges the model toward valid JSON. Validate with a schema before aggregating.
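A hand-written validator is enough to sketch the validation step; a schema library could replace it. The `RubricJudgment` shape mirrors the rubric prompt's JSON keys, and `parseRubricJudgment` is a hypothetical name.

```typescript
interface RubricJudgment {
  accuracy: number;
  helpfulness: number;
  conciseness: number;
  reasoning: string;
}

const SCORE_KEYS = ["accuracy", "helpfulness", "conciseness"];

// Hypothetical validator: parse the judge's raw text and reject anything
// that is not an object with integer 1-5 scores and a reasoning string.
function parseRubricJudgment(raw: string): RubricJudgment {
  const data: unknown = JSON.parse(raw); // throws on malformed JSON
  if (typeof data !== "object" || data === null) {
    throw new Error("judgment must be a JSON object");
  }
  const record = data as Record<string, unknown>;
  for (const key of SCORE_KEYS) {
    const value = record[key];
    if (typeof value !== "number" || Math.floor(value) !== value || value < 1 || value > 5) {
      throw new Error(`${key} must be an integer from 1 to 5`);
    }
  }
  if (typeof record["reasoning"] !== "string") {
    throw new Error("reasoning must be a string");
  }
  return {
    accuracy: record["accuracy"] as number,
    helpfulness: record["helpfulness"] as number,
    conciseness: record["conciseness"] as number,
    reasoning: record["reasoning"] as string,
  };
}
```

Rejected judgments are better counted and reported than silently dropped, since a high rejection rate is itself a sign the judge prompt needs work.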

Aggregating at Scale

Per dataset:

Mean rubric score per criterion — easy but loses variance
Win rate in pairwise — e.g., "model B wins 62% of the time"
Score distribution — detect regressions in the tail (a few catastrophic outputs even if mean is OK)
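The three aggregates above, as a sketch. The function names are hypothetical, every item is assumed to carry every criterion, and ties are counted as non-wins in the win rate.

```typescript
type Scores = Record<string, number>;
type Verdict = "A" | "B" | "tie";

// Mean score per criterion across all judged items.
// Assumes each item has a score for every criterion.
function meanPerCriterion(items: Scores[]): Scores {
  const sums: Scores = {};
  for (const item of items) {
    for (const key of Object.keys(item)) {
      sums[key] = (sums[key] ?? 0) + item[key];
    }
  }
  const means: Scores = {};
  for (const key of Object.keys(sums)) {
    means[key] = sums[key] / items.length;
  }
  return means;
}

// Fraction of pairwise verdicts won by the given side (ties are not wins).
function winRate(verdicts: Verdict[], side: "A" | "B"): number {
  return verdicts.filter((v) => v === side).length / verdicts.length;
}

// Tail check: how many scores fall at or below a threshold.
function countAtOrBelow(scores: number[], threshold: number): number {
  return scores.filter((s) => s <= threshold).length;
}
```

Tracking the tail count next to the mean catches the case where a handful of outputs score 1 while the average barely moves.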

Report confidence intervals (bootstrap) — a 2% score gap on 100 samples is likely noise.
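A percentile bootstrap for the mean score can be sketched as follows. The seeded generator is a simple linear congruential generator, included only so runs are reproducible; `bootstrapMeanCI`, its defaults (2000 resamples, 95% interval, seed 42), and `seededRandom` are assumptions.

```typescript
// Simple linear congruential generator returning values in [0, 1).
// Used here only for reproducibility; Math.random would also work.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Percentile bootstrap: resample with replacement, take the mean each time,
// then read the interval off the sorted resampled means.
function bootstrapMeanCI(
  values: number[],
  resamples: number = 2000,
  alpha: number = 0.05,
  seed: number = 42
): [number, number] {
  if (values.length === 0) throw new Error("need at least one value");
  const rand = seededRandom(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < values.length; i++) {
      sum += values[Math.floor(rand() * values.length)];
    }
    means.push(sum / values.length);
  }
  means.sort((a, b) => a - b);
  const lowerIndex = Math.floor((alpha / 2) * resamples);
  const upperIndex = Math.min(resamples - 1, Math.floor((1 - alpha / 2) * resamples));
  return [means[lowerIndex], means[upperIndex]];
}
```

When two models' intervals overlap heavily, the gap between their means is not yet evidence of a real difference; judge more items before drawing a conclusion.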

Cost Management

Judging is expensive. Reduce cost:

Sample — judge 200 items, not 10,000
Cheap judge first pass, expensive judge for disagreements
Cache judgments — if input didn't change, reuse the last score
Prompt caching on the judge prompt (see prompt-caching-ttl)
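Judgment caching can be sketched as a small keyed store. The class name `JudgmentCache` and the choice of key (judge model, prompt version, and input) are assumptions; the idea is to key on everything that could change the score, so a new judge prompt or model never reuses a stale judgment.

```typescript
// Hypothetical in-memory cache. T can be a plain score or a Promise of one,
// so the same class works for async judge calls.
class JudgmentCache<T> {
  private store: Record<string, T> = {};
  public hits = 0;

  // Key on everything that can change the judgment.
  private key(model: string, promptVersion: string, input: string): string {
    return JSON.stringify([model, promptVersion, input]);
  }

  // Return the cached judgment, or run compute() once and remember the result.
  getOrCompute(model: string, promptVersion: string, input: string, compute: () => T): T {
    const k = this.key(model, promptVersion, input);
    if (Object.prototype.hasOwnProperty.call(this.store, k)) {
      this.hits++;
      return this.store[k];
    }
    const value = compute();
    this.store[k] = value;
    return value;
  }
}
```

If `compute` returns a Promise, a rejected call would stay cached in this sketch; a real pipeline would want to evict failures and persist the store between runs.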

Anti-Patterns

Judge = generator — model evaluates itself; huge positive bias
No calibration — you're reporting numbers with no grounding
No position-bias control in pairwise
Vague rubric — "rate quality 1-5" — scores will be noisy
Single-run judgments — use multiple runs for critical evals; measure variance
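Measuring run-to-run variance needs only the repeated scores for one item. A sketch, with `scoreSpread` as a hypothetical name and the sample standard deviation as the spread measure:

```typescript
// Mean and sample standard deviation of one item's scores across repeated judge runs.
function scoreSpread(runs: number[]): { mean: number; stdDev: number } {
  if (runs.length < 2) {
    throw new Error("need at least two runs to measure variance");
  }
  const mean = runs.reduce((sum, x) => sum + x, 0) / runs.length;
  const squaredDiffs = runs.reduce((sum, x) => sum + (x - mean) ** 2, 0);
  // Divide by n - 1 for the sample standard deviation.
  return { mean, stdDev: Math.sqrt(squaredDiffs / (runs.length - 1)) };
}
```

A large spread on the same input suggests the rubric is too vague for the judge to apply consistently.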

Best Practices

Use a stronger model from a different family as judge
Pairwise comparison > rubric when you have two outputs to compare
Always control for position bias — swap and average
Calibrate against human labels; target Cohen's kappa > 0.6
Name biases in the prompt to mitigate them
Return structured JSON with a reasoning field
Report confidence intervals, not just means
Cache and sample to keep costs bounded

Use an LLM as an evaluator for open-ended outputs — rubrics, pairwise comparison, calibration with human labels, bias mitigation. Covers when LLM-judge works, when it fails, and how to trust its scores. Use this skill when evaluating generative outputs at scale, building eval pipelines, or replacing expensive human review for non-critical judgments. Activate when: LLM as judge, LLM evaluator, automated evaluation, pairwise comparison, rubric evaluation, eval model.

2 stars

development

Updated Apr 23, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add latestaiagents/agent-skills llm-as-judge

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---|---|---|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 24, 2026, 2:55 AM (7.1s, 1 file scanned)


Related Skills

latestaiagents/skill-testing

development

Verified · Trusted · Community

Test skills for correct activation, content quality, and regression — both automated checks (frontmatter validity, lint) and manual verification (query-suite activation testing). Covers CI integration and how to catch skill regressions before users do. Use this skill when adding skills to a repo, setting up CI for a skill library, or debugging "the skill exists but doesn't work". Activate when: test skills, validate skills, skill CI, skill linting, skill activation test, skill regression.

2 · SKILL.md · Updated Apr 23, 2026

latestaiagents/skill-frontmatter

documentation

Verified · Trusted · Community

Write the YAML frontmatter for a SKILL.md file so it activates reliably — name, description, and activation keywords that the model matches against. Covers length, tone, and the most common frontmatter mistakes. Use this skill when authoring a new skill, fixing a skill that isn't auto-activating, or reviewing skills for publication. Activate when: SKILL.md frontmatter, skill description, skill activation, skill YAML, write a skill, author a skill.

2 · SKILL.md · Updated Apr 23, 2026

latestaiagents/skill-activation-patterns

development

Verified · Trusted · Community

Design skills that fire at the right moment — neither over-eager (noise) nor under-eager (silent). Covers activation specificity, trigger phrases, disambiguation between overlapping skills, and debugging activation. Use this skill when multiple skills could fire on the same query, a skill never fires, or a skill fires too often. Activate when: skill won't activate, skill over-activates, overlapping skills, skill triggers, skill selection, skill disambiguation.

2 · SKILL.md · Updated Apr 23, 2026

latestaiagents/progressive-disclosure

development

Verified · Trusted · Community

Structure SKILL.md content so the model reads just enough — concise summary up front, progressively deeper detail, examples on demand. Covers section ordering, length budgets, when to split into multiple skills. Use this skill when writing or refactoring a skill body, one skill has grown too long, or a skill is wordy but not useful. Activate when: SKILL.md structure, skill content, skill too long, split skill, progressive disclosure, skill body.

2 · SKILL.md · Updated Apr 23, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Install manually

# Clone the repo
git clone https://github.com/latestaiagents/agent-skills.git

# Copy into Claude Code skills folder (global), creating it if needed
mkdir -p ~/.claude/skills
cp -r agent-skills/skills/evals/llm-as-judge ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

latestaiagents/agent-skills

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT