Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

hamelsmu/write-judge-prompt

Name: write-judge-prompt
Author: hamelsmu

skills/write-judge-prompt/SKILL.md

npx skillsauth add hamelsmu/evals-skills write-judge-prompt

Write LLM-as-Judge Prompt

Design a binary Pass/Fail LLM-as-Judge evaluator for one specific failure mode. Each judge checks exactly one thing.

Prerequisites

- Error analysis is complete. The failure mode is identified.
- You have human-labeled traces for this failure mode (at least 20 Pass and 20 Fail examples).
- A code-based evaluator cannot check this failure mode. Exhaust code-based options before reaching for a judge: many failure modes that seem subjective reduce to keyword checks, regex, or API calls when you understand the domain. Example: detecting whether an AI interviewing coach suggests "general" questions (asking about typical behavior instead of a specific past event) seems to require semantic understanding, but in practice a keyword check for words like "usually," "typical," and "normally" could work quite well.
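As a rough illustration of the keyword check described in the last prerequisite, the sketch below flags coach outputs that contain "general" markers. The word list and the function name `suggests_general_question` are assumptions made for this example; a real list would come from your own error analysis.

```python
import re

# Hypothetical keyword list, seeded from the example words in the text above.
GENERAL_MARKERS = ("usually", "typical", "typically", "normally")

# Word-boundary match so a marker only fires as a whole word.
_PATTERN = re.compile(r"\b(" + "|".join(GENERAL_MARKERS) + r")\b", re.IGNORECASE)


def suggests_general_question(coach_output: str) -> bool:
    """Return True if the coach's suggestion contains a 'general' marker word."""
    return _PATTERN.search(coach_output) is not None
```

If a check like this agrees with your human labels well enough, no LLM judge is needed for this failure mode.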

The Four Components

Every judge prompt requires exactly four components:

1. Task and Evaluation Criterion

State what the judge evaluates. One failure mode per judge.

You are an evaluator assessing whether a real estate assistant's email
uses the appropriate tone for the client's persona.

Not: "Evaluate whether the email is good" or "Rate the email quality from 1-5."

2. Pass/Fail Definitions

Outcomes are strictly binary: Pass or Fail. No Likert scales, no letter grades, no partial credit. Define exactly what constitutes Pass and Fail. These definitions come from your error analysis failure mode descriptions.

## Definitions

PASS: The email matches the expected communication style for the client persona:
- Luxury Buyers: formal language, emphasis on exclusive features, premium
  market positioning, no casual slang
- First-Time Homebuyers: warm and encouraging tone, educational explanations,
  avoids jargon, patient and supportive
- Investors: data-driven language, ROI-focused, market analytics, concise
  and professional

FAIL: The email uses a tone mismatched to the client persona. Examples:
- Using casual slang ("hey, check out this pad!") for a luxury buyer
- Using heavy financial jargon for a first-time homebuyer
- Using overly emotional language for an investor

3. Few-Shot Examples

Include labeled Pass and Fail examples from your human-labeled data.

## Examples

### Example 1: PASS
Client Persona: Luxury Buyer
Email: "Dear Mr. Harrington, I am pleased to present an exclusive listing
at 1200 Pacific Heights Drive. This distinguished property features..."
Critique: The email opens with a formal salutation and uses language
consistent with luxury positioning — "exclusive listing," "distinguished
property." No casual slang or informal phrasing. The tone matches the
luxury buyer persona throughout.
Result: Pass

### Example 2: FAIL
Client Persona: Luxury Buyer
Email: "Hey! Just found this awesome place you might like. It's got a
pool and stuff, super cool neighborhood..."
Critique: The greeting "Hey!" is informal. Phrases like "awesome place,"
"got a pool and stuff," and "super cool" are casual slang inappropriate
for a luxury buyer. The email reads like a text message, not a
professional communication for a high-end client.
Result: Fail

### Example 3: PASS (borderline)
Client Persona: First-Time Homebuyer
Email: "Hi Sarah, I found a property that might be a great fit for your
first home. The neighborhood has good schools nearby, and the monthly
payment would be similar to what you're currently paying in rent..."
Critique: The greeting is warm but not overly casual. The email explains
the property in relatable terms — comparing mortgage to rent, mentioning
schools — which is educational without being condescending. It avoids
jargon like "amortization" or "LTV ratio." While not deeply technical,
this matches the supportive tone expected for a first-time buyer.
Result: Pass

Rules for selecting examples:

- Include at least one clear Pass, one clear Fail, and one borderline case. Borderline examples are the most valuable because they teach nuance.
- Draw examples from the training split (10-20% of labeled data set aside for this purpose).
- Any example used in the judge prompt must be excluded from dev and test sets. Using dev/test examples is data leakage.
- 2-4 examples is typical. Performance plateaus after 4-8.
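One way to keep few-shot examples out of the dev and test sets is to split the labeled traces once, up front, and check for overlap before running the judge. The sketch below is a minimal version under stated assumptions: the `LabeledTrace` type, the function names, and the 40% dev fraction are all hypothetical; only the 10-20% training range comes from the rules above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledTrace:
    trace_id: str
    label: str  # "Pass" or "Fail", assigned by a human reviewer


def split_traces(traces, train_frac=0.15, dev_frac=0.40, seed=0):
    """Shuffle once, then cut into disjoint train/dev/test lists."""
    shuffled = list(traces)
    random.Random(seed).shuffle(shuffled)
    n_train = max(1, round(len(shuffled) * train_frac))
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


def assert_no_leakage(few_shot, dev, test):
    """Raise if any few-shot example also appears in dev or test."""
    held_out = {t.trace_id for t in dev} | {t.trace_id for t in test}
    leaked = [t.trace_id for t in few_shot if t.trace_id in held_out]
    if leaked:
        raise ValueError(f"few-shot examples leaked into dev/test: {leaked}")
```

Few-shot examples would then be picked by hand from `train` only, and `assert_no_leakage` can run as a guard before each evaluation.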

4. Structured Output Format

Enforce structured output using your LLM provider's schema enforcement (e.g., response_format in OpenAI, tool definitions in Anthropic) or a library like Instructor or Outlines. If the provider doesn't support schema enforcement, specify the JSON schema in the prompt.

The output must include a critique before the verdict. Placing the critique first forces the judge to articulate its assessment before committing to a decision.

{
  "critique": "string — detailed assessment of the output against the criterion",
  "result": "Pass or Fail"
}

Critiques must be detailed, not terse. A good critique explains what specifically was correct or incorrect and references concrete evidence from the output. The critiques in your few-shot examples set the bar for the level of detail the judge will produce.
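When the provider cannot enforce a schema, the judge's reply can be checked after the fact. The sketch below parses the JSON shape shown above using only the standard library. The `JudgeVerdict` type, the function name, and the 40-character minimum for a critique are assumptions for illustration, not part of the skill.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeVerdict:
    critique: str
    passed: bool


def parse_judge_output(raw: str, min_critique_chars: int = 40) -> JudgeVerdict:
    """Parse the judge's JSON reply and reject anything off-schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict) or set(data) != {"critique", "result"}:
        raise ValueError("expected exactly the keys 'critique' and 'result'")
    # The critique must come first so the reasoning precedes the verdict.
    if list(data)[0] != "critique":
        raise ValueError("'critique' must appear before 'result'")
    critique, result = data["critique"], data["result"]
    # Hypothetical length threshold standing in for "detailed, not terse".
    if not isinstance(critique, str) or len(critique.strip()) < min_critique_chars:
        raise ValueError("critique is missing or too terse")
    if result not in ("Pass", "Fail"):
        raise ValueError(f"result must be 'Pass' or 'Fail', got {result!r}")
    return JudgeVerdict(critique=critique.strip(), passed=(result == "Pass"))
```

Rejecting off-schema replies outright, instead of coercing them, keeps values such as "Mostly pass" or a 1-5 score from slipping into the results.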

Choosing What to Pass to the Judge

Feed only what the judge needs for an accurate decision:

| Failure Mode | What the Judge Needs |
|--------------|----------------------|
| Tone mismatch | Client persona + generated email |
| Answer faithfulness | Retrieved context + generated answer |
| SQL correctness | User query + generated SQL + schema |
| Instruction following | System prompt rules + generated response |
| Tool call justification | Conversation history + tool call + tool result |

For long documents, feed only the relevant snippet, not the entire document.
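A small lookup keyed by failure mode is one way to pass each judge only the fields it needs. The sketch below covers three rows of the table above. The field names, the `JUDGE_INPUTS` mapping, and the heading-style rendering are illustrative assumptions about how a trace might be stored.

```python
# Hypothetical mapping from failure mode to the trace fields its judge needs.
# The field names are assumptions, not a fixed schema.
JUDGE_INPUTS = {
    "tone_mismatch": ("client_persona", "generated_email"),
    "answer_faithfulness": ("retrieved_context", "generated_answer"),
    "sql_correctness": ("user_query", "generated_sql", "schema"),
}


def build_judge_input(failure_mode: str, trace: dict) -> str:
    """Render only the fields the judge needs, in a fixed order."""
    fields = JUDGE_INPUTS[failure_mode]
    missing = [f for f in fields if f not in trace]
    if missing:
        raise KeyError(f"trace is missing required fields: {missing}")
    # Any other keys in the trace (logs, metadata) are deliberately left out.
    return "\n\n".join(f"## {name}\n{trace[name]}" for name in fields)
```

Failing on a missing field, instead of sending a partial input, avoids verdicts based on incomplete evidence.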

Model Selection

Start with the most capable model available. The model used for the main task can also serve as the judge, because judging is a different, narrower task. Optimize for cost only after alignment with human labels is confirmed.

Anti-Patterns

- Vague criteria like "is this helpful?" Target a specific, observable failure mode from error analysis.
- Holistic judge for the entire trace. A single judge covering multiple dimensions produces unactionable verdicts.
- No few-shot examples. Without examples, the model won't know what counts as a failure in your application.
- Dev/test examples used as few-shot. This is data leakage. Use only the training split.
- Likert scales (1-5, letter grades, etc.). Binary pass/fail only. Likert scales produce scores that sound precise but can't be calibrated: annotators disagree on the difference between a 3 and a 4, and the judge inherits that noise. Binary forces you to define a clear decision boundary upfront, which makes inter-annotator agreement measurable and the judge's errors actionable. If you need to capture severity, use multiple binary judges (e.g., "factually wrong" and "dangerously wrong") rather than one ordinal scale.
- Skipping validation. Measure alignment with human labels using validate-evaluator before trusting the judge.
- Judges for specification failures without fixing the prompt first. If the prompt never asked for the behavior, add the instruction before building an evaluator. For critical requirements, a judge can still serve as a regression guard.
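To make the "skipping validation" point concrete, the sketch below computes how often the judge agrees with human labels on each class. It is not the validate-evaluator skill's implementation. The function name is hypothetical, and treating Pass as the positive class is an assumption made for this example.

```python
def judge_alignment(human_labels, judge_labels):
    """Return (TPR, TNR) of the judge against human labels, with Pass as positive."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    pairs = list(zip(human_labels, judge_labels))
    # Count agreements separately for each human-assigned class.
    tp = sum(h == "Pass" and j == "Pass" for h, j in pairs)
    tn = sum(h == "Fail" and j == "Fail" for h, j in pairs)
    n_pass = sum(h == "Pass" for h, _ in pairs)
    n_fail = sum(h == "Fail" for h, _ in pairs)
    if n_pass == 0 or n_fail == 0:
        raise ValueError("need at least one human Pass and one human Fail")
    return tp / n_pass, tn / n_fail
```

Reporting the two rates separately shows whether the judge is too lenient or too strict, which a single accuracy number would hide.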

Design LLM-as-Judge evaluators for subjective criteria that code-based checks cannot handle. Use when a failure mode requires interpretation (tone, faithfulness, relevance, completeness). Do NOT use when the failure mode can be checked with code (regex, schema validation, execution tests). Do NOT use when you need to validate or calibrate the judge — use validate-evaluator instead.

900 stars

development

Updated Mar 20, 2026

npx skillsauth add hamelsmu/evals-skills write-judge-prompt

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

- Trivy (container and dependency vulnerability scanner): Clean (95%)
- Semgrep (static code analysis for vulnerabilities): Clean (95%)
- mcp-scan (Snyk) (Model Context Protocol security validation): Clean (95%)
- VirusTotal (multi-engine malware detection): Error (70%)
- Snyk (dep) (open source security scanning): Skipped (50%)
- Socket.dev (supply chain security analysis): Skipped (50%)
- CrowdStrike (advanced threat intelligence): Skipped (50%)
- OSV-Scanner (Open Source Vulnerability database check): Skipped (50%)
- OWASP Dep-Check: Skipped (50%)

Last scanned: Mar 20, 2026, 5:25 PM (250.4s, 1 file scanned)

Related Skills

hamelsmu/validate-evaluator

development

Verified, Trusted, Community

Calibrate an LLM judge against human labels using data splits, TPR/TNR, and bias correction. Use after writing a judge prompt (write-judge-prompt) when you need to verify alignment before trusting its outputs. Do NOT use for code-based evaluators (those are deterministic; test with standard unit tests).

900 stars, SKILL.md, updated Mar 20, 2026

hamelsmu/generate-synthetic-data

testing

Verified, Trusted, Community

Create diverse synthetic test inputs for LLM pipeline evaluation using dimension-based tuple generation. Use when bootstrapping an eval dataset, when real user data is sparse, or when stress-testing specific failure hypotheses. Do NOT use when you already have 100+ representative real traces (use stratified sampling instead), or when the task is collecting production logs.

900 stars, SKILL.md, updated Mar 20, 2026

hamelsmu/evaluate-rag

testing

Verified, Trusted, Community

Guides evaluation of RAG pipeline retrieval and generation quality. Use when evaluating a retrieval-augmented generation system, measuring retrieval quality, assessing generation faithfulness or relevance, generating synthetic QA pairs for retrieval testing, or optimizing chunking strategies.

900 stars, SKILL.md, updated Mar 20, 2026

hamelsmu/eval-audit

development

Verified, Trusted, Community

Audit an LLM eval pipeline and surface problems: missing error analysis, unvalidated judges, vanity metrics, etc. Use when inheriting an eval system, when unsure whether evals are trustworthy, or as a starting point when no eval infrastructure exists. Do NOT use when the goal is to build a new evaluator from scratch (use error-analysis, write-judge-prompt, or validate-evaluator instead).

900 stars, SKILL.md, updated Mar 20, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Install manually

# Clone the repo
git clone https://github.com/hamelsmu/evals-skills.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy into Claude Code skills folder (global)
cp -r evals-skills/skills/write-judge-prompt ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

hamelsmu/evals-skills

900 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT