skills/ai-scoring/SKILL.md
Score, grade, or evaluate things using AI against a rubric. Use when grading essays, scoring code reviews, rating candidate responses, auditing support quality, evaluating compliance, building a quality rubric, running QA checks against criteria, assessing performance, rating content quality, or any task where you need numeric scores with justifications. Also use when building an LLM as a judge, automated grading system, AI rubric scoring, code review scoring automation, quality assessment automation, compliance scoring, NPS analysis with AI, performance review scoring, score and rank with explanations, build a rating system with AI, automated QA scoring, or judge AI outputs programmatically.
npx skillsauth add lebsral/dspy-programming-not-prompting-lms-skills ai-scoring
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Guide the user through building AI that scores, grades, or evaluates work against defined criteria. The pattern: define a rubric, score each criterion independently, calibrate with examples, and validate scorer quality.
Ask the user:
A good rubric has:
import dspy
from pydantic import BaseModel, Field

class CriterionScore(BaseModel):
    criterion: str = Field(description="Name of the criterion being scored")
    score: int = Field(ge=1, le=5, description="Score from 1 (poor) to 5 (excellent)")
    justification: str = Field(description="Evidence from the input that supports this score")

class ScoringResult(BaseModel):
    criterion_scores: list[CriterionScore] = Field(description="Score for each criterion")
    overall_score: float = Field(ge=1.0, le=5.0, description="Weighted overall score")
    summary: str = Field(description="Brief overall assessment")
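To see what the range constraint buys you, the following minimal sketch repeats the `CriterionScore` model and tries one valid and one invalid score. The helper `try_build` and the sample justification are hypothetical, introduced only for illustration.

```python
from pydantic import BaseModel, Field, ValidationError


class CriterionScore(BaseModel):
    criterion: str = Field(description="Name of the criterion being scored")
    score: int = Field(ge=1, le=5, description="Score from 1 (poor) to 5 (excellent)")
    justification: str = Field(description="Evidence from the input that supports this score")


def try_build(score: int) -> str:
    """Hypothetical helper: return 'ok' if the score validates, else 'rejected'."""
    try:
        CriterionScore(
            criterion="clarity",
            score=score,
            justification="Cites two concrete figures.",
        )
        return "ok"
    except ValidationError:
        return "rejected"


in_range = try_build(4)      # inside the 1-5 scale
out_of_range = try_build(7)  # violates le=5, so validation fails
```

Failing at construction time means an out-of-range score never reaches the weighted average.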
Define what's being scored and the criteria:
CRITERIA = [
    "clarity: Is the writing clear and easy to follow? (1=confusing, 5=crystal clear)",
    "argument: Is the argument well-structured and logical? (1=no structure, 5=compelling)",
    "evidence: Does the writing cite relevant evidence? (1=no evidence, 5=strong support)",
]

class ScoreCriterion(dspy.Signature):
    """Score the submission on a single criterion. Be specific: cite evidence from the text."""
    submission: str = dspy.InputField(desc="The work being evaluated")
    criterion: str = dspy.InputField(desc="The criterion to score, including scale description")
    score: int = dspy.OutputField(desc="Score from 1 to 5")
    justification: str = dspy.OutputField(desc="Specific evidence from the submission supporting this score")
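Depending on the adapter, the `score` output may arrive as a string rather than an int, so it can help to normalize it before doing arithmetic. The sketch below uses a hypothetical helper, `coerce_score`, which is not part of DSPy; it assumes scores come back as ints or plain numeric strings.

```python
def coerce_score(raw, low: int = 1, high: int = 5) -> int:
    """Convert a raw score (int or numeric string) to an int within [low, high]."""
    value = int(str(raw).strip())  # raises ValueError for text such as "four"
    if not low <= value <= high:
        raise ValueError(f"score {value} is outside the {low}-{high} scale")
    return value


# A string score cannot be ordered against an int directly in Python 3.
try:
    "3" > 2
    comparison_raised = False
except TypeError:
    comparison_raised = True

parsed = coerce_score("3")
```

Calling this helper (or plain `int(...)`) on every model output keeps later comparisons and weighted sums from failing on type mismatches.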
Scoring all criteria at once invites a "halo effect": a strong first impression biases every score. Instead, score each criterion in its own call:
class RubricScorer(dspy.Module):
    def __init__(self, criteria: list[str], weights: list[float] = None):
        super().__init__()
        self.criteria = criteria
        # Default to equal weights that sum to 1 so the overall stays on the 1-5 scale
        self.weights = weights or [1.0 / len(criteria)] * len(criteria)
        self.score_criterion = dspy.ChainOfThought(ScoreCriterion)

    def forward(self, submission: str):
        criterion_scores = []
        # One call per criterion, so each score is judged independently
        for criterion in self.criteria:
            result = self.score_criterion(
                submission=submission,
                criterion=criterion,
            )
            criterion_scores.append(CriterionScore(
                criterion=criterion.split(":")[0],
                score=int(result.score),  # the adapter may return the score as a string
                justification=result.justification,
            ))
        # Weighted sum of the per-criterion scores
        overall = sum(
            cs.score * w for cs, w in zip(criterion_scores, self.weights)
        )
        return dspy.Prediction(
            criterion_scores=criterion_scores,
            overall_score=round(overall, 2),
        )
Using ChainOfThought here is important — reasoning through the evidence before assigning a score produces more calibrated results than jumping straight to a number.
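The weighted overall is plain arithmetic, and a worked example makes the effect of the weights concrete. The sketch below is standalone Python with no model calls; `weighted_overall` is a hypothetical helper that, unlike the module above, divides by the total weight so that weights which do not sum to 1 cannot push the result off the 1-5 scale.

```python
def weighted_overall(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted mean of criterion scores, normalized by the total weight."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return round(weighted_sum / total_weight, 2)


scores = {"clarity": 4, "argument": 3, "evidence": 5}

# Equal weights: (4 + 3 + 5) / 3 = 4.0
equal = weighted_overall(scores, {"clarity": 1, "argument": 1, "evidence": 1})

# Argument counts most: 4*0.2 + 3*0.5 + 5*0.3 = 3.8
skewed = weighted_overall(scores, {"clarity": 0.2, "argument": 0.5, "evidence": 0.3})
```

The same per-criterion scores give 4.0 or 3.8 depending on the weights, so agree on the weights before comparing overall scores across rubrics.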
Without anchors, the scorer doesn't know what a "2" vs a "4" looks like. Provide reference examples at each level:
ANCHORS = """
Score 2 example for clarity: "The thing with the data is that it does stuff and the results are what they are."
→ Vague language, no specific referents, reader can't follow what's being described.
Score 4 example for clarity: "The customer churn model reduced false positives by 30% compared to the rule-based approach, though it still struggles with seasonal patterns."
→ Specific claims with numbers, clear comparison, one caveat noted.
"""

class ScoreCriterionCalibrated(dspy.Signature):
    """Score the submission on a single criterion. Use the anchor examples to calibrate your scoring."""
    submission: str = dspy.InputField(desc="The work being evaluated")
    criterion: str = dspy.InputField(desc="The criterion to score, including scale description")
    anchors: str = dspy.InputField(desc="Reference examples showing what different score levels look like")
    score: int = dspy.OutputField(desc="Score from 1 to 5")
    justification: str = dspy.OutputField(desc="Specific evidence from the submission supporting this score")
Then pass anchors per criterion:
class CalibratedScorer(dspy.Module):
    def __init__(self, criteria: list[str], anchors: dict[str, str], weights: list[float] = None):
        super().__init__()
        self.criteria = criteria
        self.anchors = anchors  # maps criterion name (text before ":") to its anchor examples
        self.weights = weights or [1.0 / len(criteria)] * len(criteria)
        self.score_criterion = dspy.ChainOfThought(ScoreCriterionCalibrated)

    def forward(self, submission: str):
        criterion_scores = []
        for criterion in self.criteria:
            criterion_name = criterion.split(":")[0]
            result = self.score_criterion(
                submission=submission,
                criterion=criterion,
                anchors=self.anchors.get(criterion_name, "No anchors provided."),
            )
            criterion_scores.append(CriterionScore(
                criterion=criterion_name,
                score=int(result.score),  # the adapter may return the score as a string
                justification=result.justification,
            ))
        # Weighted sum of the per-criterion scores
        overall = sum(cs.score * w for cs, w in zip(criterion_scores, self.weights))
        return dspy.Prediction(
            criterion_scores=criterion_scores,
            overall_score=round(overall, 2),
        )
Writing good anchors takes effort, but it's the single biggest lever for scoring quality. Start with 2-3 anchors per criterion at the low, mid, and high ends of the scale.
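`CalibratedScorer` looks anchors up by criterion name, so the anchors need to live in a dict keyed by the text before the colon in each criterion string. One way to build that dict is sketched below; `format_anchors` and `ANCHORS_BY_CRITERION` are hypothetical names, and the two examples are the clarity anchors shown earlier.

```python
def format_anchors(criterion: str, examples: list[tuple[int, str, str]]) -> str:
    """Render (score, example text, rationale) tuples as an anchor string, lowest score first."""
    lines = []
    for score, text, rationale in sorted(examples):
        lines.append(f'Score {score} example for {criterion}: "{text}"')
        lines.append(f"-> {rationale}")
    return "\n".join(lines)


# Keys must match criterion.split(":")[0] for each entry in the criteria list.
ANCHORS_BY_CRITERION = {
    "clarity": format_anchors("clarity", [
        (4, "The customer churn model reduced false positives by 30% compared to "
            "the rule-based approach, though it still struggles with seasonal patterns.",
         "Specific claims with numbers, clear comparison, one caveat noted."),
        (2, "The thing with the data is that it does stuff and the results are what they are.",
         "Vague language, no specific referents."),
    ]),
}

clarity_anchors = ANCHORS_BY_CRITERION["clarity"]
```

A criterion missing from the dict falls back to "No anchors provided." in the scorer, so a typo in a key silently disables calibration for that criterion.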
The overall score should be consistent with per-criterion scores:
def validate_scores(criterion_scores, weights, overall_score):
    expected = sum(cs.score * w for cs, w in zip(criterion_scores, weights))
    if abs(expected - overall_score) >= 0.1:
        raise ValueError(
            f"Overall score {overall_score} doesn't match weighted criteria ({expected:.2f})"
        )
Some criteria don't apply to every submission:
class CriterionScoreOptional(BaseModel):
    criterion: str
    score: int = Field(ge=0, le=5, description="Score 1-5, or 0 if not applicable")
    justification: str
    applicable: bool = Field(description="Whether this criterion applies to this submission")
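A score of 0 for a non-applicable criterion should not drag the overall down. One possible way to handle this, sketched under the assumption that weights are keyed by criterion name, is to drop non-applicable criteria and renormalize the remaining weights. `Scored` and `overall_applicable` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scored:
    criterion: str
    score: int        # 1-5, or 0 when not applicable
    applicable: bool


def overall_applicable(items: list[Scored], weights: dict[str, float]) -> Optional[float]:
    """Weighted mean over applicable criteria only; None if nothing applies."""
    pairs = [(it.score, weights[it.criterion]) for it in items if it.applicable]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return None
    return round(sum(s * w for s, w in pairs) / total_weight, 2)


weights = {"clarity": 0.4, "argument": 0.4, "evidence": 0.2}
items = [
    Scored("clarity", 4, True),
    Scored("argument", 2, True),
    Scored("evidence", 0, False),  # not applicable, so excluded from the average
]

# (4*0.4 + 2*0.4) / 0.8 = 3.0, versus 2.4 if the 0 were averaged in
overall = overall_applicable(items, weights)
```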
def pass_fail(overall_score: float, threshold: float = 3.0) -> str:
    if overall_score >= threshold:
        return "pass"
    return "fail"

# Or with a "needs review" band
def tiered_decision(overall_score: float) -> str:
    if overall_score >= 4.0:
        return "pass"
    elif overall_score >= 2.5:
        return "needs_review"
    return "fail"
For high-stakes scoring, run multiple independent scorers and flag disagreements:
class EnsembleScorer(dspy.Module):
    def __init__(self, criteria, anchors, num_raters=3, weights=None):
        super().__init__()
        self.raters = [
            CalibratedScorer(criteria, anchors, weights)
            for _ in range(num_raters)
        ]

    def forward(self, submission: str):
        all_results = [rater(submission=submission) for rater in self.raters]
        # Check for disagreement per criterion
        flagged = []
        for i, criterion in enumerate(self.raters[0].criteria):
            criterion_name = criterion.split(":")[0]
            scores = [r.criterion_scores[i].score for r in all_results]
            spread = max(scores) - min(scores)
            if spread > 1:
                flagged.append({
                    "criterion": criterion_name,
                    "scores": scores,
                    "spread": spread,
                })
        # Average the overall scores
        avg_overall = sum(r.overall_score for r in all_results) / len(all_results)
        return dspy.Prediction(
            overall_score=round(avg_overall, 2),
            all_results=all_results,
            flagged_disagreements=flagged,
            needs_human_review=len(flagged) > 0,
        )
When raters disagree by more than 1 point on any criterion, flag it for human review. This catches the submissions that are genuinely ambiguous — exactly where human judgment matters most.
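The disagreement rule can be checked in isolation, without any model calls. The sketch below applies the same spread-greater-than-1 test to hand-written rater scores; `summarize_raters` is a hypothetical helper, and reporting the median per criterion is an assumption of this sketch rather than something the ensemble above does.

```python
from statistics import median


def summarize_raters(scores_by_criterion: dict[str, list[int]], max_spread: int = 1) -> dict:
    """Per criterion: median score, max-min spread, and whether the spread exceeds max_spread."""
    summary = {}
    for name, scores in scores_by_criterion.items():
        spread = max(scores) - min(scores)
        summary[name] = {
            "median": median(scores),
            "spread": spread,
            "flagged": spread > max_spread,
        }
    return summary


report = summarize_raters({
    "clarity": [4, 4, 5],   # spread 1: raters agree closely
    "evidence": [2, 4, 5],  # spread 3: flag for human review
})
needs_human_review = any(entry["flagged"] for entry in report.values())
```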
You need human-scored examples to evaluate your AI scorer:
scored_examples = [
    dspy.Example(
        submission="...",
        gold_scores={"clarity": 4, "argument": 3, "evidence": 5},
        gold_overall=4.0,
    ).with_inputs("submission"),
    # 20-50+ scored examples
]
def scoring_metric(example, prediction, trace=None):
    """Measures how close AI scores are to human gold scores."""
    errors = []
    for cs in prediction.criterion_scores:
        gold = example.gold_scores.get(cs.criterion)
        if gold is not None:
            errors.append(abs(cs.score - gold))
    if not errors:
        return 0.0
    mae = sum(errors) / len(errors)
    # Convert to 0-1 scale (0 error = 1.0, 4 error = 0.0)
    return max(0.0, 1.0 - mae / 4.0)

def agreement_metric(example, prediction, trace=None):
    """Score is 1.0 if all criteria are within 1 point of gold."""
    for cs in prediction.criterion_scores:
        gold = example.gold_scores.get(cs.criterion)
        if gold is not None and abs(cs.score - gold) > 1:
            return 0.0
    return 1.0
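The two metrics can disagree on the same prediction, which is worth seeing once with concrete numbers. The sketch below repeats both metrics so it runs on its own and feeds them stand-in objects built with `SimpleNamespace` instead of real `dspy.Example` and `dspy.Prediction` instances; the sample scores are made up for illustration.

```python
from types import SimpleNamespace


def scoring_metric(example, prediction, trace=None):
    """Closeness to gold: 1.0 at zero mean absolute error, 0.0 at an error of 4."""
    errors = []
    for cs in prediction.criterion_scores:
        gold = example.gold_scores.get(cs.criterion)
        if gold is not None:
            errors.append(abs(cs.score - gold))
    if not errors:
        return 0.0
    mae = sum(errors) / len(errors)
    return max(0.0, 1.0 - mae / 4.0)


def agreement_metric(example, prediction, trace=None):
    """1.0 only if every criterion is within 1 point of gold."""
    for cs in prediction.criterion_scores:
        gold = example.gold_scores.get(cs.criterion)
        if gold is not None and abs(cs.score - gold) > 1:
            return 0.0
    return 1.0


example = SimpleNamespace(gold_scores={"clarity": 4, "argument": 3, "evidence": 5})
prediction = SimpleNamespace(criterion_scores=[
    SimpleNamespace(criterion="clarity", score=4),   # error 0
    SimpleNamespace(criterion="argument", score=5),  # error 2
    SimpleNamespace(criterion="evidence", score=4),  # error 1
])

closeness = scoring_metric(example, prediction)    # MAE = 1.0, so 1 - 1/4 = 0.75
agreement = agreement_metric(example, prediction)  # argument is off by 2, so 0.0
```

A prediction can look decent on average closeness while still failing the stricter within-1-point check, so it may be worth reporting both.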
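from dspy.evaluate import Evaluate

# Assumes `scorer` is an instance of one of the scorer modules above and that
# `trainset` is a list of gold examples kept separate from `scored_examples`.
evaluator = Evaluate(devset=scored_examples, metric=scoring_metric, num_threads=4)
baseline = evaluator(scorer)
optimizer = dspy.MIPROv2(metric=scoring_metric, auto="medium")
optimized_scorer = optimizer.compile(scorer, trainset=trainset)
optimized_score = evaluator(optimized_scorer)
# Depending on the DSPy version, Evaluate returns either a float or a result
# object exposing `.score`; in the latter case, format `baseline.score` instead.
print(f"Baseline agreement: {baseline:.1f}%")
print(f"Optimized agreement: {optimized_score:.1f}%")
# Typical improvement: 50-65% agreement -> 75-90% after MIPROv2 with 30+ gold examples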
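Optimizing and evaluating on the same gold examples tends to overstate the improvement, so it is safer to split them first. A minimal sketch of a reproducible split follows; `split_examples`, the 30% dev fraction, and the integer stand-ins for gold examples are all assumptions for illustration.

```python
import random


def split_examples(examples, dev_fraction: float = 0.3, seed: int = 0):
    """Shuffle a copy of the examples and return (trainset, devset)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Integers stand in for 40 human-scored examples.
gold = list(range(40))
trainset, devset = split_examples(gold)
```

Pass `trainset` to `optimizer.compile` and build the evaluator on `devset`, so the reported score comes from examples the optimizer never saw.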
If you only need to sort items into categories, use /ai-sorting instead. Scoring adds complexity when you just need buckets.
| Approach | Best for | Tradeoffs |
|----------|----------|-----------|
| Single rater + Predict | Low-stakes, high-volume screening | Fast and cheap, but less calibrated |
| Single rater + ChainOfThought | Most scoring tasks | Better calibration, ~2x cost of Predict |
| Calibrated rater (with anchors) | Tasks with established standards | Best single-rater quality, requires anchor examples |
| Multi-rater ensemble (3 raters) | High-stakes decisions (hiring, compliance) | 3x cost, but catches ambiguous cases |
| BestOfN with scoring metric | When you have a reward function | Picks best of N attempts, not multi-perspective |
- Field(ge=1, le=5) enforces valid score ranges automatically.
- Score each criterion in its own ChainOfThought call, even though it costs more.
- To keep justifications from being vague, set Field(min_length=20) on the justification field, or wrap the scorer with dspy.Refine and a reward function that penalizes vague justifications.
- result.score can be a string instead of an int depending on the adapter. Always use int(result.score) or Pydantic Field(ge=1, le=5) to enforce the type. In Python 3, comparing the string "3" with the int 2 using > raises a TypeError, while "3" == 3 silently evaluates to False.
Install any skill:
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>
Related skills: /ai-sorting, /ai-checking-outputs, /ai-improving-accuracy, /dspy-chain-of-thought, /dspy-refine
Install /ai-do if you do not have it. It routes any AI problem to the right skill and is the fastest way to work:
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do