skills/llm-evaluation/SKILL.md
Implement comprehensive evaluation strategies for LLM applications using automated metrics, LLM-as-judge, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, comparing prompts/models, or establishing evaluation frameworks. Covers RAGAS for RAG pipelines, evals-as-code CI/CD integration, and modern 2025/2026 practices including structured output evaluation and agentic task success measurement.
Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.
Fast, repeatable, scalable evaluation using computed scores.
Text Generation:
Classification:
Retrieval (RAG):
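As an illustration of the retrieval category, here is a minimal sketch of three common ranking metrics: precision@k, recall@k, and mean reciprocal rank (MRR). The function names and the use of string document IDs are assumptions made for this example, not part of any particular library.

```python
from typing import List, Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant IDs that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in set(retrieved[:k]) if doc_id in relevant)
    return hits / len(relevant)


def mean_reciprocal_rank(rankings: List[Sequence[str]], relevants: List[Set[str]]) -> float:
    """Average of 1/rank of the first relevant hit per query (0 if none is found)."""
    if not rankings:
        return 0.0
    total = 0.0
    for retrieved, relevant in zip(rankings, relevants):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)


retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
p2 = precision_at_k(retrieved, relevant, k=2)
r4 = recall_at_k(retrieved, relevant, k=4)
mrr = mean_reciprocal_rank([retrieved], [relevant])
```

Precision@k divides by `k` even when fewer than `k` documents are returned, which penalizes short result lists; some implementations divide by the number actually retrieved instead.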
Manual assessment for quality aspects difficult to automate.
Dimensions:
Use a stronger LLM to evaluate the outputs of a weaker model. As of 2025/2026 this is a widely used approach for open-ended tasks.
Approaches:
Key challenges:
For Retrieval-Augmented Generation pipelines, use RAGAS metrics:
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,        # Is the answer grounded in the retrieved context?
    answer_relevancy,    # Does the answer address the question?
    context_precision,   # Is the retrieved context relevant?
    context_recall,      # Was all the necessary information retrieved?
)

# Prepare the evaluation dataset.
# Column names here follow an older RAGAS dataset schema; newer releases may
# expect different names, so check the docs for your installed version.
data = {
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris is a city in France. It is the capital."]],
    "ground_truth": ["Paris"],
}
dataset = Dataset.from_dict(data)

# The RAGAS metrics are themselves LLM-based, so this call needs model credentials
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
```
RAGAS metric interpretation:
For agents that execute multi-step tasks:
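```python
class AgentTaskEvaluator:
    """Evaluate agentic task completion."""

    def __init__(self, optimal_path_lengths=None, compare_states=None):
        # Illustrative hooks (assumptions): a mapping of task -> optimal step
        # count, and a callable that scores a final state against the target.
        self.optimal_path_lengths = optimal_path_lengths or {}
        self.compare_states = compare_states or (lambda actual, expected: actual == expected)

    def evaluate_task(self, task, agent_trajectory, expected_result):
        return {
            "task_success": self._check_task_success(agent_trajectory, expected_result),
            "tool_use_accuracy": self._check_tool_selection(agent_trajectory),
            "step_efficiency": self._measure_step_efficiency(agent_trajectory),
            "hallucination_rate": self._check_for_hallucinations(agent_trajectory),
        }

    def _check_task_success(self, trajectory, expected):
        # Did the agent achieve the goal? (binary or partial credit)
        final_state = trajectory[-1]["state"]
        return self.compare_states(final_state, expected)

    def _measure_step_efficiency(self, trajectory):
        # Ratio of optimal to actual steps; 1.0 = optimal (also used when no optimum is known)
        actual_steps = len(trajectory)
        optimal_steps = self.optimal_path_lengths.get(trajectory[0]["task"], actual_steps)
        return optimal_steps / actual_steps

    def _check_tool_selection(self, trajectory):
        # Fraction of steps where the agent picked the correct tool
        correct_tools = sum(1 for step in trajectory if step["tool_correct"])
        return correct_tools / len(trajectory)

    def _check_for_hallucinations(self, trajectory):
        # Fraction of steps flagged upstream as hallucinated (the flag is an assumption)
        flagged = sum(1 for step in trajectory if step.get("hallucinated", False))
        return flagged / len(trajectory)
```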
Key agentic metrics:
```python
# NOTE: `llm_eval` is an illustrative interface rather than a specific published
# package; `your_model` and `check_groundedness` are placeholders you supply.
from llm_eval import EvaluationSuite, Metric

# Define evaluation suite
suite = EvaluationSuite([
    Metric.accuracy(),
    Metric.bleu(),
    Metric.bertscore(),
    Metric.custom(name="groundedness", fn=check_groundedness),
])

# Prepare test cases
test_cases = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "context": "France is a country in Europe. Paris is its capital.",
    },
    # ... more test cases
]

# Run evaluation
results = suite.evaluate(
    model=your_model,
    test_cases=test_cases,
)

print(f"Overall Accuracy: {results.metrics['accuracy']}")
print(f"BLEU Score: {results.metrics['bleu']}")
```
```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def calculate_bleu(reference, hypothesis):
    """Calculate BLEU score between reference and hypothesis."""
    smoothie = SmoothingFunction().method4
    return sentence_bleu(
        [reference.split()],
        hypothesis.split(),
        smoothing_function=smoothie,
    )


# Usage
bleu = calculate_bleu(
    reference="The cat sat on the mat",
    hypothesis="A cat is sitting on the mat",
)
```
```python
from rouge_score import rouge_scorer


def calculate_rouge(reference, hypothesis):
    """Calculate ROUGE scores."""
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, hypothesis)
    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure,
    }
```
```python
from bert_score import score


def calculate_bertscore(references, hypotheses):
    """Calculate BERTScore using a pre-trained DeBERTa model."""
    # Candidates (hypotheses) come first, references second
    P, R, F1 = score(
        hypotheses,
        references,
        lang='en',
        model_type='microsoft/deberta-xlarge-mnli',
    )
    return {
        'precision': P.mean().item(),
        'recall': R.mean().item(),
        'f1': F1.mean().item(),
    }
```
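```python
from transformers import pipeline

# Load the NLI model once; reloading it on every call is slow
_nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")


def calculate_groundedness(response, context):
    """Check if response is grounded in provided context."""
    # Pass premise (context) and hypothesis (response) as a sentence pair
    result = _nli({"text": context, "text_pair": response})
    if isinstance(result, list):  # return shape can vary across transformers versions
        result = result[0]
    # Return confidence that the response is entailed by the context
    return result["score"] if result["label"].upper() == "ENTAILMENT" else 0.0
```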
```python
from detoxify import Detoxify


def calculate_toxicity(text):
    """Measure toxicity in generated text."""
    results = Detoxify('original').predict(text)
    return max(results.values())  # Return the highest score across toxicity categories
```

```python
def calculate_factuality(claim, knowledge_base):
    """Verify factual claims against knowledge base."""
    # Implementation depends on your knowledge base.
    # Could use retrieval + NLI, or a fact-checking API.
    raise NotImplementedError("Supply a knowledge-base-specific implementation")
```
```python
from openai import OpenAI
import json

client = OpenAI()


def llm_judge_quality(response, question):
    """Use GPT-4.1 to judge response quality with structured output."""
    prompt = f"""You are an impartial evaluator. Rate the following response on a scale of 1-10 for each criterion.

**Criteria:**
1. Accuracy (1=many factual errors, 10=completely correct)
2. Helpfulness (1=doesn't address question, 10=fully resolves question)
3. Clarity (1=confusing/unclear, 10=perfectly clear and well-structured)

**Question:** {question}
**Response:** {response}

Evaluate objectively. Provide ratings in JSON format:
{{
"accuracy": <1-10>,
"helpfulness": <1-10>,
"clarity": <1-10>,
"reasoning": "<2-3 sentence justification>",
"overall": <1-10>
}}
"""
    result = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        # JSON mode: ensures syntactically valid JSON, not a particular schema
        response_format={"type": "json_object"},
    )
    return json.loads(result.choices[0].message.content)
```
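Because the `json_object` response format yields valid JSON but does not by itself enforce the expected keys or score ranges, it can help to validate the judge's output before aggregating it. A minimal sketch, assuming the 1-10 rubric above; `validate_judge_scores` and `SCORE_KEYS` are hypothetical names introduced for this example.

```python
# Keys the judge prompt asks for (taken from the rubric; adjust to your own prompt)
SCORE_KEYS = ("accuracy", "helpfulness", "clarity", "overall")


def validate_judge_scores(raw: dict, low: int = 1, high: int = 10) -> dict:
    """Return a cleaned copy of the judge output, or raise ValueError."""
    cleaned = {}
    for key in SCORE_KEYS:
        if key not in raw:
            raise ValueError(f"missing score: {key}")
        value = raw[key]
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"non-numeric score for {key}: {value!r}")
        if not low <= value <= high:
            raise ValueError(f"{key}={value} is outside [{low}, {high}]")
        cleaned[key] = value
    # Reasoning is free text; default to an empty string when absent
    cleaned["reasoning"] = str(raw.get("reasoning", ""))
    return cleaned


checked = validate_judge_scores(
    {"accuracy": 8, "helpfulness": 7, "clarity": 9, "overall": 8, "reasoning": "Clear."}
)
```

Rejecting malformed judgments, rather than silently coercing them, keeps bad judge outputs from skewing aggregate scores; whether to retry or drop such cases is a project-level choice.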
### Pairwise Comparison (with position bias mitigation)

```python
# Reuses `client` and `json` from the previous example
def compare_responses(question, response_a, response_b):
    """Compare two responses using LLM judge with position bias mitigation."""

    def _judge(q, r1, r2, label1, label2):
        prompt = f"""Compare these two responses to the question. Which is better?

Question: {q}

Response {label1}: {r1}
Response {label2}: {r2}

Which response is better and why? Consider accuracy, helpfulness, and clarity.
Answer with JSON:
{{
"winner": "{label1}" or "{label2}" or "tie",
"reasoning": "<explanation>",
"confidence": <1-10>
}}
"""
        result = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            response_format={"type": "json_object"},
        )
        return json.loads(result.choices[0].message.content)

    # Run twice with the presentation order swapped to detect position bias
    result_ab = _judge(question, response_a, response_b, "A", "B")
    result_ba = _judge(question, response_b, response_a, "B", "A")  # swapped order

    # Labels travel with the responses ("A" is always response_a), so the two
    # verdicts can be compared directly without remapping
    consistent = result_ab["winner"] == result_ba["winner"]
    if not consistent:
        final_winner = "tie"  # Disagree = call it a tie
    else:
        final_winner = result_ab["winner"]

    return {
        "winner": final_winner,
        "consistent": consistent,
        "reasoning_ab": result_ab["reasoning"],
        "reasoning_ba": result_ba["reasoning"],
    }
```
```python
class AnnotationTask:
    """Structure for human annotation task."""

    def __init__(self, response, question, context=None):
        self.response = response
        self.question = question
        self.context = context

    def get_annotation_form(self):
        return {
            "question": self.question,
            "context": self.context,
            "response": self.response,
            "ratings": {
                "accuracy": {
                    "scale": "1-5",
                    "description": "Is the response factually correct?",
                },
                "relevance": {
                    "scale": "1-5",
                    "description": "Does it answer the question?",
                },
                "coherence": {
                    "scale": "1-5",
                    "description": "Is it logically consistent?",
                },
            },
            "issues": {
                "factual_error": False,
                "hallucination": False,
                "off_topic": False,
                "unsafe_content": False,
            },
            "feedback": "",
        }
```
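```python
from sklearn.metrics import cohen_kappa_score


def calculate_agreement(rater1_scores, rater2_scores):
    """Calculate inter-rater agreement (Cohen's kappa) between two raters."""
    kappa = cohen_kappa_score(rater1_scores, rater2_scores)

    # (upper bound, label) pairs checked in ascending order; a dict keyed on
    # boolean comparisons would collapse to a single True/False entry
    bands = [
        (0.0, "Poor"),
        (0.2, "Slight"),
        (0.4, "Fair"),
        (0.6, "Moderate"),
        (0.8, "Substantial"),
    ]
    interpretation = "Almost Perfect"
    for upper, label in bands:
        if kappa < upper:
            interpretation = label
            break

    return {
        "kappa": kappa,
        "interpretation": interpretation,
    }
```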
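For intuition about what `cohen_kappa_score` measures: kappa compares the observed agreement p_o with the agreement p_e expected by chance, as kappa = (p_o - p_e) / (1 - p_e). The pure-Python sketch below computes unweighted kappa for two raters. The function name is illustrative, and it is meant as an aid to understanding rather than a replacement for the scikit-learn call.

```python
from collections import Counter


def cohens_kappa(rater1, rater2):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(rater1) != len(rater2) or len(rater1) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater1)

    # Observed agreement: share of items where both raters chose the same label
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n

    # Chance agreement: for each label, the product of the two raters' marginal rates
    counts1, counts2 = Counter(rater1), Counter(rater2)
    p_e = sum(counts1[label] * counts2[label] for label in counts1) / (n * n)

    if p_e == 1.0:
        # Degenerate case: both raters used one identical label throughout.
        # Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

In this example the raters agree on 3 of 4 items (p_o = 0.75) while chance agreement is 0.5, so only half of the possible above-chance agreement is achieved.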
```python
from scipy import stats
import numpy as np


class ABTest:
    def __init__(self, variant_a_name="A", variant_b_name="B"):
        self.variant_a = {"name": variant_a_name, "scores": []}
        self.variant_b = {"name": variant_b_name, "scores": []}

    def add_result(self, variant, score):
        """Add evaluation result for a variant ("A"/"B" or the variant's name)."""
        if variant in ("A", self.variant_a["name"]):
            self.variant_a["scores"].append(score)
        elif variant in ("B", self.variant_b["name"]):
            self.variant_b["scores"].append(score)
        else:
            raise ValueError(f"Unknown variant: {variant!r}")

    def analyze(self, alpha=0.05):
        """Perform statistical analysis."""
        a_scores = np.asarray(self.variant_a["scores"], dtype=float)
        b_scores = np.asarray(self.variant_b["scores"], dtype=float)
        mean_a, mean_b = a_scores.mean(), b_scores.mean()

        # Welch's t-test (does not assume equal variances)
        t_stat, p_value = stats.ttest_ind(a_scores, b_scores, equal_var=False)

        # Effect size (Cohen's d) using sample standard deviations (ddof=1)
        pooled_std = np.sqrt((a_scores.std(ddof=1) ** 2 + b_scores.std(ddof=1) ** 2) / 2)
        cohens_d = (mean_b - mean_a) / pooled_std if pooled_std > 0 else 0.0

        return {
            "variant_a_mean": mean_a,
            "variant_b_mean": mean_b,
            "difference": mean_b - mean_a,
            # Undefined when the baseline mean is zero
            "relative_improvement": (mean_b - mean_a) / mean_a if mean_a != 0 else None,
            "p_value": p_value,
            "statistically_significant": p_value < alpha,
            "cohens_d": cohens_d,
            "effect_size": self.interpret_cohens_d(cohens_d),
            # Higher mean only; check "statistically_significant" before acting on it
            "winner": "B" if mean_b > mean_a else "A",
        }

    @staticmethod
    def interpret_cohens_d(d):
        """Interpret Cohen's d effect size."""
        abs_d = abs(d)
        if abs_d < 0.2:
            return "negligible"
        elif abs_d < 0.5:
            return "small"
        elif abs_d < 0.8:
            return "medium"
        else:
            return "large"
```
```python
class RegressionDetector:
    def __init__(self, baseline_results, threshold=0.05):
        self.baseline = baseline_results
        self.threshold = threshold

    def check_for_regression(self, new_results):
        """Detect if new results show regression (assumes higher scores are better)."""
        regressions = []

        for metric, baseline_score in self.baseline.items():
            new_score = new_results.get(metric)

            # Skip metrics missing from the new run, and zero baselines,
            # for which relative change is undefined
            if new_score is None or baseline_score == 0:
                continue

            # Calculate relative change
            relative_change = (new_score - baseline_score) / abs(baseline_score)

            # Flag if the decrease exceeds the threshold
            if relative_change < -self.threshold:
                regressions.append({
                    "metric": metric,
                    "baseline": baseline_score,
                    "current": new_score,
                    "change": relative_change,
                })

        return {
            "has_regression": len(regressions) > 0,
            "regressions": regressions,
        }
```
```python
import numpy as np


class BenchmarkRunner:
    def __init__(self, benchmark_dataset):
        self.dataset = benchmark_dataset

    def run_benchmark(self, model, metrics):
        """Run model on benchmark and calculate metrics."""
        results = {metric.name: [] for metric in metrics}

        for example in self.dataset:
            # Generate prediction
            prediction = model.predict(example["input"])

            # Calculate each metric
            for metric in metrics:
                score = metric.calculate(
                    prediction=prediction,
                    reference=example["reference"],
                    context=example.get("context"),
                )
                results[metric.name].append(score)

        # Aggregate results
        return {
            metric: {
                "mean": np.mean(scores),
                "std": np.std(scores),
                "min": min(scores),
                "max": max(scores),
            }
            for metric, scores in results.items()
        }
```
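`BenchmarkRunner` relies on an implicit interface: each metric exposes `name` and `calculate(prediction, reference, context)`, and the model exposes `predict(input)`. The sketch below shows one way to satisfy that interface; `ExactMatch` and `EchoModel` are hypothetical names used only for illustration.

```python
class ExactMatch:
    """Metric object with the `name` / `calculate` shape the runner expects."""

    name = "exact_match"

    def calculate(self, prediction, reference, context=None):
        # 1.0 when the case- and whitespace-normalized strings match, else 0.0.
        # `context` is accepted for interface compatibility but unused here.
        return float(prediction.strip().lower() == reference.strip().lower())


class EchoModel:
    """Stand-in model that returns canned answers for known inputs."""

    def __init__(self, answers):
        self.answers = answers

    def predict(self, text):
        # Unknown inputs produce an empty prediction
        return self.answers.get(text, "")


dataset = [
    {"input": "Capital of France?", "reference": "Paris"},
    {"input": "Capital of Japan?", "reference": "Tokyo"},
]
model = EchoModel({"Capital of France?": "paris "})
metric = ExactMatch()

scores = [
    metric.calculate(
        prediction=model.predict(example["input"]),
        reference=example["reference"],
        context=example.get("context"),
    )
    for example in dataset
]
```

A stub model like this is mainly useful for testing the evaluation harness itself before spending API calls on a real model.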
Run evaluations automatically in your CI/CD pipeline:
```yaml
# .github/workflows/eval.yml
name: LLM Evaluation

on:
  pull_request:
    paths: ['prompts/**', 'src/llm/**']

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run evaluation suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          pip install -r requirements-eval.txt
          python scripts/run_evaluations.py --baseline main --compare HEAD

      - name: Check for regression
        run: |
          python scripts/check_regression.py \
            --threshold 0.05 \
            --fail-on-regression
```
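```python
# scripts/run_evaluations.py
import argparse
import json
import subprocess
from pathlib import Path


def run_eval_suite(model_fn, test_cases, metrics):
    """Run complete evaluation suite and return results."""
    if not test_cases:
        raise ValueError("No test cases provided")

    results = []
    for case in test_cases:
        prediction = model_fn(case["input"])
        scores = {m.name: m.calculate(prediction, case["reference"]) for m in metrics}
        results.append({"case": case["id"], "scores": scores})

    # Average each metric across all test cases
    aggregated = {
        metric: sum(r["scores"][metric] for r in results) / len(results)
        for metric in results[0]["scores"]
    }
    return aggregated


def main():
    # The workflow passes --baseline and --compare; how they select prompts or
    # models is project-specific, so they are only parsed here
    parser = argparse.ArgumentParser()
    parser.add_argument("--baseline", default="main")
    parser.add_argument("--compare", default="HEAD")
    parser.parse_args()

    # Load test cases
    test_cases = json.loads(Path("evals/test_cases.json").read_text())

    # Run evaluation (`your_model` and `your_metrics` are placeholders you supply)
    results = run_eval_suite(your_model, test_cases, your_metrics)

    # Save results with git commit hash
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    output = {"commit": commit, "metrics": results}
    out_dir = Path("eval_results")
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{commit[:8]}.json").write_text(json.dumps(output, indent=2))
    print(json.dumps(results, indent=2))


if __name__ == "__main__":
    main()
```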
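The workflow calls `scripts/check_regression.py` with `--threshold` and `--fail-on-regression`, but the script itself is not shown. One possible sketch follows. The `--baseline-file` and `--current-file` arguments, their default paths, and the `{"metrics": {...}}` file layout are assumptions chosen to match what `run_evaluations.py` writes, and every metric is assumed to be higher-is-better.

```python
# scripts/check_regression.py (illustrative sketch, not the original script)
import argparse
import json
from pathlib import Path


def find_regressions(baseline, current, threshold):
    """Return metrics whose relative drop versus baseline exceeds `threshold`."""
    regressions = []
    for metric, base in baseline.items():
        new = current.get(metric)
        if new is None or base == 0:
            continue  # relative change is undefined; skip
        change = (new - base) / abs(base)
        if change < -threshold:
            regressions.append(
                {"metric": metric, "baseline": base, "current": new, "change": change}
            )
    return regressions


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--threshold", type=float, default=0.05)
    parser.add_argument("--fail-on-regression", action="store_true")
    # Hypothetical arguments: where the two result files live
    parser.add_argument("--baseline-file", default="eval_results/baseline.json")
    parser.add_argument("--current-file", default="eval_results/current.json")
    args = parser.parse_args(argv)

    baseline = json.loads(Path(args.baseline_file).read_text())["metrics"]
    current = json.loads(Path(args.current_file).read_text())["metrics"]

    regressions = find_regressions(baseline, current, args.threshold)
    print(json.dumps(regressions, indent=2))

    # A non-zero exit code fails the CI step
    return 1 if regressions and args.fail_on_regression else 0


# When saved as a script, finish with:
#     if __name__ == "__main__":
#         raise SystemExit(main())
```

Returning the exit code from `main` instead of exiting inside it keeps the comparison logic easy to unit test.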
For applications that require structured outputs (JSON schemas, function calls):
```python
from pydantic import BaseModel, ValidationError


class ExpectedOutput(BaseModel):
    name: str
    age: int
    email: str


def evaluate_structured_output(model_response: str, expected: dict) -> dict:
    """Evaluate whether model output conforms to schema and matches expected values."""
    # 1. Schema compliance
    try:
        parsed = ExpectedOutput.model_validate_json(model_response)
    except ValidationError as e:
        return {"schema_valid": False, "error": str(e), "field_accuracy": 0}

    # 2. Field accuracy
    expected_obj = ExpectedOutput(**expected)
    fields = ExpectedOutput.model_fields.keys()
    correct = sum(1 for f in fields if getattr(parsed, f) == getattr(expected_obj, f))

    return {
        "schema_valid": True,
        "field_accuracy": correct / len(fields),
        "fields_correct": correct,
        "fields_total": len(fields),
    }
```
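The same idea extends to function calls. The sketch below scores a predicted tool call against an expected one by function name and per-argument match. The `{"name": ..., "arguments": {...}}` shape and the helper name `evaluate_function_call` are assumptions for illustration, not a specific provider's format.

```python
def evaluate_function_call(predicted: dict, expected: dict) -> dict:
    """Compare a predicted function call with the expected call."""
    name_match = predicted.get("name") == expected.get("name")
    expected_args = expected.get("arguments", {})
    predicted_args = predicted.get("arguments", {})

    # Count expected arguments that were supplied with exactly the right value
    correct = sum(
        1
        for key, value in expected_args.items()
        if key in predicted_args and predicted_args[key] == value
    )
    # Arguments the model supplied that were not expected at all
    unexpected = sorted(set(predicted_args) - set(expected_args))

    arg_accuracy = correct / len(expected_args) if expected_args else 1.0
    return {
        "name_match": name_match,
        # Argument accuracy is only meaningful when the right function was chosen
        "argument_accuracy": arg_accuracy if name_match else 0.0,
        "unexpected_arguments": unexpected,
        "exact_match": name_match and correct == len(expected_args) and not unexpected,
    }


report = evaluate_function_call(
    predicted={"name": "get_weather", "arguments": {"city": "Paris", "unit": "F"}},
    expected={"name": "get_weather", "arguments": {"city": "Paris", "unit": "C"}},
)
```

Strict equality on argument values is deliberately conservative; for free-text arguments a normalized or semantic comparison may be more appropriate.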