lebsral/dspy-evaluate

Name: dspy-evaluate
Author: lebsral

skills/dspy-evaluate/SKILL.md

npx skillsauth add lebsral/dspy-programming-not-prompting-lms-skills dspy-evaluate

Evaluate Your DSPy Program

Guide the user through measuring AI quality with DSPy's Evaluate class. The pattern: pick a metric, prepare a devset, run the evaluator, interpret results, then feed the same metric into an optimizer.

What is dspy.Evaluate

dspy.Evaluate runs your program on every devset example, scores each prediction with a metric, and reports the aggregate score. It handles threading and progress display. Calling the evaluator returns an EvaluationResult whose .score is the aggregate as a percentage (0-100).
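As a rough mental model of the loop described above, the following sketch scores every example and averages the results into a percentage. It is a simplified stand-in written for illustration, not DSPy's implementation: toy_evaluate, echo_program, and the SimpleNamespace examples are hypothetical, and threading and progress display are omitted.

```python
from types import SimpleNamespace

def toy_evaluate(program, devset, metric):
    """Score every example and return (percentage, per-example rows)."""
    rows = []
    for example in devset:
        prediction = program(example)
        rows.append((example, prediction, float(metric(example, prediction))))
    total = sum(score for _, _, score in rows)
    percentage = 100.0 * total / len(rows) if rows else 0.0
    return round(percentage, 2), rows

def exact_match(example, prediction, trace=None):
    # Case-insensitive comparison after trimming whitespace.
    return example.answer.strip().lower() == prediction.answer.strip().lower()

def echo_program(example):
    # Hypothetical "program": always answers "paris".
    return SimpleNamespace(answer="paris")

devset = [
    SimpleNamespace(question="Capital of France?", answer="Paris"),
    SimpleNamespace(question="Capital of Italy?", answer="Rome"),
]

score, rows = toy_evaluate(echo_program, devset, exact_match)
print(score)  # 50.0
```

One of the two stand-in examples matches, so the aggregate is 50.0. The per-example rows have the same (example, prediction, score) shape that the rest of this guide uses.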

Built-in metrics

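DSPy provides answer_exact_match (normalized string equality) and answer_passage_match (substring check). answer_exact_match expects an answer field on both the example and the prediction; answer_passage_match checks whether the example's answer appears in the passages in the prediction's context field.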

SemanticF1

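Uses an LM to estimate how much of the reference answer's key content the prediction covers (recall) and how much of the prediction is supported by the reference (precision), then combines the two into an F1 score. More forgiving than exact match, it gives partial credit for answers that are close but not identical:

from dspy.evaluate import Evaluate, SemanticF1

semantic_f1 = SemanticF1()

evaluator = Evaluate(devset=devset, metric=semantic_f1, num_threads=4)
result = evaluator(my_program)  # EvaluationResult; result.score is the aggregate percentage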

SemanticF1 is a good default metric for open-ended QA tasks where exact match is too strict. It expects a response field on both the example and the prediction (not answer), and a question field on the example. Constructor: SemanticF1(threshold=0.66, decompositional=False). During optimization (when trace is set), it returns whether the score meets the threshold instead of the raw score.
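If a program's signature uses answer instead of response, one option is a thin adapter around the metric. The sketch below assumes the wrapped metric only reads question and response through attribute access, as described above. The names as_response_metric and toy_metric are hypothetical, and toy_metric stands in for an LM-based metric so the snippet runs without an LM.

```python
from types import SimpleNamespace

def as_response_metric(metric, source_field="answer"):
    """Wrap a metric that reads `.response` so it accepts `.answer` fields."""
    def wrapped(example, prediction, trace=None):
        adapted_example = SimpleNamespace(
            question=getattr(example, "question", ""),
            response=getattr(example, source_field),
        )
        adapted_prediction = SimpleNamespace(
            response=getattr(prediction, source_field),
        )
        return metric(adapted_example, adapted_prediction, trace=trace)
    return wrapped

def toy_metric(example, prediction, trace=None):
    # Stand-in for an LM-based metric: exact comparison on `.response`.
    return float(example.response == prediction.response)

metric = as_response_metric(toy_metric)
example = SimpleNamespace(question="2 + 2?", answer="4")
prediction = SimpleNamespace(answer="4")
print(metric(example, prediction))  # 1.0
```

With DSPy installed, a SemanticF1() instance would presumably take the place of toy_metric. Renaming the signature field to response avoids the adapter entirely.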

CompleteAndGrounded

Checks whether the predicted answer is both complete (covers the key claims in the gold answer) and grounded (makes no claims that the retrieved context does not support):

from dspy.evaluate import CompleteAndGrounded, Evaluate

complete_and_grounded = CompleteAndGrounded()

evaluator = Evaluate(devset=devset, metric=complete_and_grounded, num_threads=4)
result = evaluator(my_program)  # EvaluationResult; result.score is the aggregate percentage

This is an LM-based metric — it uses the configured LM to judge completeness and groundedness. It expects response and context fields on the prediction, and question and response on the example. Constructor: CompleteAndGrounded(threshold=0.66). Useful for RAG tasks where you care about both recall and precision of facts.

Custom metrics

A metric is a function def metric(example, prediction, trace=None) that returns a bool, int, or float. The trace parameter is None during evaluation but set during optimization, so a metric can apply stricter requirements during training.
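Because the return type is part of the contract, it can help to enforce it while developing a metric. The decorator below is a minimal sketch of that idea. The names checked, length_ok, and broken are hypothetical helpers written for this example and are not part of DSPy, and SimpleNamespace stands in for a real prediction object.

```python
from types import SimpleNamespace

def checked(metric):
    """Wrap a metric and fail loudly if it returns a non-numeric value."""
    def wrapped(example, prediction, trace=None):
        value = metric(example, prediction, trace)
        if not isinstance(value, (bool, int, float)):
            raise TypeError(
                f"metric returned {type(value).__name__}, expected bool, int, or float"
            )
        return value
    return wrapped

@checked
def length_ok(example, prediction, trace=None):
    # Passes when the answer is at most 10 words long.
    return len(prediction.answer.split()) <= 10

@checked
def broken(example, prediction, trace=None):
    return "yes"  # a string: the wrapper rejects it

pred = SimpleNamespace(answer="Paris")
print(length_ok(None, pred))  # True
try:
    broken(None, pred)
except TypeError as err:
    print(err)  # metric returned str, expected bool, int, or float
```

Wrapping a metric this way turns a quiet scoring problem into an immediate error on the first bad example.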

Multi-field scoring

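def metric(example, prediction, trace=None):
    # Fraction of fields that match after trimming whitespace and lowercasing
    fields = ["name", "email", "phone"]
    correct = sum(
        1 for f in fields
        # `or ""` guards against fields that are present but set to None
        if (getattr(prediction, f, None) or "").strip().lower()
        == (getattr(example, f, None) or "").strip().lower()
    )
    return correct / len(fields)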

LM-as-judge

For open-ended tasks (summaries, creative writing, complex QA), use an LM to judge quality. Define a signature for the judge, then call it inside your metric:

import dspy

class AssessAnswer(dspy.Signature):
    """Assess if the predicted answer correctly addresses the question."""
    question: str = dspy.InputField()
    gold_answer: str = dspy.InputField(desc="The reference answer")
    predicted_answer: str = dspy.InputField(desc="The answer to evaluate")
    is_correct: bool = dspy.OutputField(desc="True if the prediction is correct and complete")

def llm_judge_metric(example, prediction, trace=None):
    judge = dspy.Predict(AssessAnswer)
    result = judge(
        question=example.question,
        gold_answer=example.answer,
        predicted_answer=prediction.answer,
    )
    return result.is_correct

Use a separate LM for the judge

To avoid the model grading its own work, use a different (often stronger) LM for the judge:

judge_lm = dspy.LM("openai/gpt-4o")  # or "anthropic/claude-sonnet-4-5-20250929", etc.

def llm_judge_metric(example, prediction, trace=None):
    judge = dspy.Predict(AssessAnswer)
    with dspy.context(lm=judge_lm):
        result = judge(
            question=example.question,
            gold_answer=example.answer,
            predicted_answer=prediction.answer,
        )
    return result.is_correct

Graded judge (float scores)

Return a float instead of a bool for partial credit:

class GradeAnswer(dspy.Signature):
    """Grade the predicted answer on a scale of 0 to 5."""
    question: str = dspy.InputField()
    gold_answer: str = dspy.InputField()
    predicted_answer: str = dspy.InputField()
    score: int = dspy.OutputField(desc="Score from 0 (completely wrong) to 5 (perfect)")
    justification: str = dspy.OutputField(desc="Why this score was given")

def graded_metric(example, prediction, trace=None):
    judge = dspy.ChainOfThought(GradeAnswer)
    result = judge(
        question=example.question,
        gold_answer=example.answer,
        predicted_answer=prediction.answer,
    )
    # Clamp in case the judge returns a value outside 0-5, then normalize to 0.0-1.0
    return min(max(result.score, 0), 5) / 5.0

Composite metrics

Combine multiple signals into a single score with weights:

def composite_metric(example, prediction, trace=None):
    # Correctness (primary signal)
    correct = float(prediction.answer.strip().lower() == example.answer.strip().lower())

    # Conciseness (prefer shorter answers)
    concise = float(len(prediction.answer.split()) < 50)

    # Has reasoning (check that the model explained its thinking)
    has_reasoning = float(len(getattr(prediction, "reasoning", "")) > 20)

    # Weighted combination
    return 0.7 * correct + 0.2 * concise + 0.1 * has_reasoning

Mixing exact checks with LM judges

def hybrid_metric(example, prediction, trace=None):
    # Fast exact check
    if prediction.answer.strip().lower() == example.answer.strip().lower():
        return 1.0

    # Fall back to LM judge for partial credit
    judge = dspy.Predict(AssessAnswer)
    result = judge(
        question=example.question,
        gold_answer=example.answer,
        predicted_answer=prediction.answer,
    )
    return 0.5 if result.is_correct else 0.0

Debugging with per-example scores

Evaluate returns an EvaluationResult with .score (aggregate percentage) and .results (list of (example, prediction, score) tuples). Use .results to find failing examples and understand failure patterns.
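A short helper can pull out the rows worth reading first. This is a sketch under assumptions: the results list below is hand-written stand-in data in the (example, prediction, score) shape described above, SimpleNamespace replaces real example and prediction objects, and failures is a hypothetical helper name.

```python
from types import SimpleNamespace

# Stand-in for `result.results`: (example, prediction, score) tuples.
results = [
    (SimpleNamespace(question="Capital of France?", answer="Paris"),
     SimpleNamespace(answer="Paris"), 1.0),
    (SimpleNamespace(question="Capital of Italy?", answer="Rome"),
     SimpleNamespace(answer="Milan"), 0.0),
    (SimpleNamespace(question="Capital of Spain?", answer="Madrid"),
     SimpleNamespace(answer="Madrid, Spain"), 0.5),
]

def failures(results, threshold=1.0):
    """Return rows scoring below `threshold`, worst first."""
    failing = [row for row in results if float(row[2]) < threshold]
    return sorted(failing, key=lambda row: float(row[2]))

for example, prediction, score in failures(results):
    print(f"{score:.1f}  {example.question}  "
          f"expected={example.answer!r} got={prediction.answer!r}")
```

Sorting worst-first puts outright misses ahead of partial-credit cases, which tends to surface systematic failure patterns sooner.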

Common patterns

Trace-aware metrics for optimization

The trace parameter is None during evaluation but set during optimization. Use this to apply stricter requirements during training:

def metric(example, prediction, trace=None):
    correct = prediction.answer.strip().lower() == example.answer.strip().lower()
    if trace is not None:
        # During optimization: also require good reasoning
        has_reasoning = len(getattr(prediction, "reasoning", "")) > 50
        return correct and has_reasoning
    # During evaluation: only check correctness
    return correct

This makes the optimizer keep only traces where the model both got the answer right and showed its work. The resulting few-shot demonstrations tend to be more robust.

Before-and-after comparison

A common workflow for measuring the impact of optimization:

import dspy
from dspy.evaluate import Evaluate

evaluator = Evaluate(devset=devset, metric=metric, num_threads=4, display_table=5)

# Baseline
baseline = evaluator(my_program)

# Optimize
optimizer = dspy.MIPROv2(metric=metric, auto="medium")
optimized = optimizer.compile(my_program, trainset=trainset)

# Compare
optimized_result = evaluator(optimized)
print(f"Baseline:  {baseline.score:.1f}%")
print(f"Optimized: {optimized_result.score:.1f}%")
print(f"Delta:     {optimized_result.score - baseline.score:+.1f} pts")
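An aggregate delta can hide individual examples that got worse. The sketch below compares two sets of per-example rows and lists the regressions. It assumes rows in the (example, prediction, score) shape described for .results and that each example has a unique question field. The name regressions and the stand-in data are hypothetical.

```python
from types import SimpleNamespace

def regressions(baseline_rows, optimized_rows, key=lambda ex: ex.question):
    """Return (key, before, after) for examples whose score dropped."""
    before = {key(ex): float(score) for ex, _, score in baseline_rows}
    dropped = []
    for ex, _, score in optimized_rows:
        k = key(ex)
        if k in before and float(score) < before[k]:
            dropped.append((k, before[k], float(score)))
    return dropped

q1 = SimpleNamespace(question="q1")
q2 = SimpleNamespace(question="q2")
baseline_rows = [(q1, None, 1.0), (q2, None, 0.0)]
optimized_rows = [(q1, None, 0.0), (q2, None, 1.0)]
print(regressions(baseline_rows, optimized_rows))  # [('q1', 1.0, 0.0)]
```

In this stand-in data both runs score 50%, yet q1 regressed, which the aggregate comparison alone would not show.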

Gotchas

Claude uses return_all_scores=True or return_outputs=True which no longer exist. Evaluate now returns an EvaluationResult object. Access .score for the aggregate percentage and .results for per-example (example, prediction, score) tuples. Do not pass return_all_scores or return_outputs.
Claude uses SemanticF1 with answer fields but it expects response. SemanticF1 looks for response on both the example and prediction, not answer. If your signature uses answer, either rename the field or write a wrapper metric.
Claude uses CompleteAndGrounded without providing context. CompleteAndGrounded expects response and context on the prediction, and question and response on the example. Without context, it cannot check groundedness.
Metrics must return a bool, int, or float, not a string -- returning a string silently breaks scoring.
Small dev sets (<30 examples) give unreliable scores -- results can swing 10-20 percentage points between runs. Aim for 50+ examples for stable evaluation.
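The dev set size guidance can be sanity-checked with the binomial standard error. The sketch below assumes a pass/fail metric and independent examples, and the 0.7 accuracy is an illustrative value and not a measurement. It covers only sampling error from the dev set, not run-to-run variation in LM outputs. The name margin_of_error is hypothetical.

```python
import math

def margin_of_error(accuracy, n, z=1.96):
    """Approximate 95% margin (in percentage points) for a pass/fail metric."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return 100.0 * z * standard_error

for n in (30, 50, 200):
    print(n, round(margin_of_error(0.7, n), 1))
```

Under these assumptions the margin is roughly 16 points at 30 examples, about 13 at 50, and about 6 at 200, so small score differences on small dev sets should be read cautiously.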

Additional resources

dspy.Evaluate API docs
dspy.SemanticF1 API docs
dspy.CompleteAndGrounded API docs
For constructor signatures and method reference, see reference.md
For worked examples (exact match, LM judge, composite), see examples.md

Cross-references

Install any skill: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>

Need to prepare training and evaluation data? Use /dspy-data
Ready to optimize with few-shot examples? Use /dspy-bootstrap-few-shot
Want the best prompt optimization? Use /dspy-miprov2
For the full measure-improve-verify loop, see /ai-improving-accuracy
For decomposed RAG evaluation (faithfulness, context precision/recall) see /dspy-ragas
Install /ai-do if you do not have it — it routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do

Use when you need to measure how well your DSPy program performs — writing metrics, scoring against a dev set, or comparing before/after optimization. Common scenarios - measuring accuracy before and after optimization, writing custom metrics for your task, scoring a program against a held-out dev set, comparing two prompt strategies, building a test suite for AI quality, or running regression tests on AI outputs. Related - ai-improving-accuracy, ai-scoring, ai-monitoring. Also used for dspy.Evaluate, dspy.evaluate, write DSPy metric function, measure AI accuracy, evaluate DSPy program, dev set evaluation, before and after optimization comparison, custom scoring function, test AI quality systematically, AI regression testing, metric-driven development, how to know if my DSPy program improved, score predictions against labels, evaluation harness for LLM, CI/CD for AI quality.

5 stars

development

Updated May 7, 2026

Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

npx skillsauth add lebsral/dspy-programming-not-prompting-lms-skills dspy-evaluate

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: May 7, 2026, 6:59 AM · 185.1s · 5 files scanned

Related Skills

lebsral/ai-watching-optimization

tools

See what is happening during optimizer.compile() instead of waiting blind. Use when you want to watch optimization progress, see scores as they come in, know if your optimizer is working, check if optimization is stuck, understand why optimization is taking too long, get live progress during compile, monitor convergence, detect overfitting during optimization, interpret optimization results, or pick the right tool for watching optimization. Also used for optimizer progress bar, is my optimizer doing anything, optimization seems stuck, how long will optimization take, watch GEPA run, watch MIPROv2 run, live optimization dashboard, optimizer not improving, scores not going up, optimization taking forever, see what optimizer is doing, debug slow optimization, optimization visibility, optimizer metrics, track compile progress, optimization observability.

6 · SKILL.md · Updated May 31, 2026

lebsral/dspy-miprov2

testing

Use when you want the highest-quality prompt optimization DSPy offers — jointly optimizes instructions and few-shot demos, with auto=light/medium/heavy presets. Common scenarios - you want the best possible accuracy from prompt optimization, jointly tuning instructions and few-shot demonstrations, using auto presets for different compute budgets, or when COPRO or BootstrapFewShot alone are not reaching your accuracy target. Related - ai-improving-accuracy, dspy-copro, dspy-bootstrap-few-shot. Also used for dspy.MIPROv2, best DSPy optimizer, highest quality optimization, auto=light medium heavy, joint instruction and demo optimization, most powerful prompt optimizer, MIPROv2 vs COPRO vs BootstrapFewShot, which optimizer should I use, state of the art prompt optimization, when to use MIPROv2, optimize both instructions and examples, heavy optimization for production, best optimizer for accuracy.

6 · SKILL.md · Updated Apr 27, 2026

lebsral/dspy-langwatch

testing

Use LangWatch for DSPy auto-tracing and real-time optimizer progress. Use when you want to set up LangWatch, langwatch.dspy.init, auto-tracing DSPy, real-time optimization dashboard, optimizer progress tracking, app.langwatch.ai, or DSPy optimizer dashboard. Also used for langwatch setup, pip install langwatch, langwatch trace, optimizer progress, real-time optimization, watch optimizer run, LangWatch self-hosted, langwatch docker, langwatch vs langtrace, langwatch autotrack_dspy.

6 · SKILL.md · Updated Apr 27, 2026

lebsral/dspy-gepa

data-ai

Use when you want to optimize instructions without few-shot examples — a lightweight alternative to COPRO when you do not have or do not want to use demonstrations. Common scenarios - optimizing instructions when you do not have or do not want to use few-shot demonstrations, lightweight instruction search as a first step, tasks where examples in the prompt confuse the model, or when you want fast instruction optimization without the cost of COPRO. Related - ai-improving-accuracy, dspy-copro, dspy-miprov2. Also used for dspy.GEPA, instruction optimization without demos, lightweight prompt optimization, optimize instructions only, no few-shot examples needed, GEPA vs COPRO, quick instruction search, when demonstrations hurt performance, zero-shot optimization, instruction-only optimizer, simplest instruction tuner, fast prompt optimization, skip few-shot and just tune instructions, optimize Pydantic field descriptions, GEPA structured output, GEPA does not optimize field desc.

6 · SKILL.md · Updated Apr 27, 2026

Install manually

# Clone the repo
git clone https://github.com/lebsral/dspy-programming-not-prompting-lms-skills.git

# Copy into Claude Code skills folder (global)
cp -r dspy-programming-not-prompting-lms-skills/skills/dspy-evaluate ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

lebsral/dspy-programming-not-prompting-lms-skills

5 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT