skills/dspy-gepa/SKILL.md
Use when you want to optimize instructions without few-shot examples — a lightweight alternative to COPRO when you do not have or do not want to use demonstrations. Common scenarios - optimizing instructions when you do not have or do not want to use few-shot demonstrations, lightweight instruction search as a first step, tasks where examples in the prompt confuse the model, or when you want fast instruction optimization without the cost of COPRO. Related - ai-improving-accuracy, dspy-copro, dspy-miprov2. Also used for dspy.GEPA, instruction optimization without demos, lightweight prompt optimization, optimize instructions only, no few-shot examples needed, GEPA vs COPRO, quick instruction search, when demonstrations hurt performance, zero-shot optimization, instruction-only optimizer, simplest instruction tuner, fast prompt optimization, skip few-shot and just tune instructions, optimize Pydantic field descriptions, GEPA structured output, GEPA does not optimize field desc.
Guide the user through using dspy.GEPA to automatically discover better instructions for their DSPy programs through reflective evolution.
dspy.GEPA is a DSPy optimizer that evolves the instruction text in your program's predictors. Rather than adding few-shot examples (like BootstrapFewShot) or tuning model weights (like BootstrapFinetune), GEPA iteratively proposes, evaluates, and refines the natural-language instructions that guide each LM call.
Benchmark results from the GEPA paper (arxiv 2507.19457) show strong performance: the paper reports GEPA outperforming MIPROv2 by 10+ percentage points across six benchmark tasks.
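Key properties:
- Tunes instructions only, so prompts stay compact (no few-shot demos)
- Uses textual feedback from the metric, not just scalar scores
- Maintains a Pareto frontier of candidates instead of a single best one
- Works with 20-100 examples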
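Use dspy.GEPA when:
- You have fewer than 200 examples
- You want compact prompts without demonstrations
- Your metric can provide rich textual feedback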
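Do not use GEPA when one of these fits the task better:
- dspy.BootstrapFewShot, which adds few-shot examples
- dspy.MIPROv2, which tunes instructions and few-shot demos together
- dspy.BootstrapFinetune, which tunes model weights
- dspy.Predict or dspy.ChainOfThought used directly, with no optimizer

Three things are needed: a DSPy program, a feedback metric, and a training set.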
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # or "anthropic/claude-sonnet-4-5-20250929", etc.

# 1. Define your program
classify = dspy.ChainOfThought("text -> label")

# 2. Define a feedback metric
# GEPA metrics can return a float OR a dict with score + feedback text
def metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    score = float(pred.label == gold.label)
    feedback = "" if score == 1.0 else f"Expected '{gold.label}', got '{pred.label}'."
    return {"score": score, "feedback": feedback}

# 3. Prepare training data
trainset = [
    dspy.Example(text="Great product!", label="positive").with_inputs("text"),
    dspy.Example(text="Terrible service.", label="negative").with_inputs("text"),
    # ... 20-100 examples
]

# 4. Optimize
gepa = dspy.GEPA(
    metric=metric,
    reflection_lm=dspy.LM("openai/gpt-4o", temperature=1.0, max_tokens=4096),  # use a strong model for reflection
    auto="light",
)
optimized = gepa.compile(classify, trainset=trainset)

# 5. Use the optimized program
result = optimized(text="This exceeded my expectations!")
print(result.label)

# 6. Save for later
optimized.save("optimized_classifier.json")
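Measuring the program before and after optimization is the only way to tell whether GEPA helped. The sketch below is a minimal stand-in for that comparison, not dspy.Evaluate itself: `average_score`, `baseline_program`, and `optimized_program` are hypothetical names, and the two "programs" are plain functions so the example runs without an LM.

```python
from types import SimpleNamespace


def average_score(program, examples, metric):
    """Average metric score of `program` over `examples`.

    Accepts metrics that return a float or a {"score", "feedback"} dict.
    """
    total = 0.0
    for example in examples:
        pred = program(text=example.text)
        result = metric(example, pred)
        total += result["score"] if isinstance(result, dict) else float(result)
    return total / len(examples)


def exact_match(gold, pred, trace=None, pred_name=None, pred_trace=None):
    score = float(pred.label == gold.label)
    feedback = "" if score == 1.0 else f"Expected '{gold.label}', got '{pred.label}'."
    return {"score": score, "feedback": feedback}


# Hypothetical stand-ins for an unoptimized and an optimized program.
def baseline_program(text):
    return SimpleNamespace(label="positive")  # always guesses "positive"


def optimized_program(text):
    label = "negative" if "terrible" in text.lower() else "positive"
    return SimpleNamespace(label=label)


devset = [
    SimpleNamespace(text="Great product!", label="positive"),
    SimpleNamespace(text="Terrible service.", label="negative"),
]

baseline = average_score(baseline_program, devset, exact_match)
improved = average_score(optimized_program, devset, exact_match)
print(f"baseline={baseline:.2f} optimized={improved:.2f}")
```

In a real run the same held-out set would be scored with the unoptimized and the optimized DSPy program, and only the difference between the two numbers would be attributed to GEPA.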
dspy.GEPA(
    metric,                                  # GEPAFeedbackMetric (required)
    *,
    auto=None,                               # "light", "medium", or "heavy"
    max_full_evals=None,                     # int -- full validation passes allowed
    max_metric_calls=None,                   # int -- total metric invocations allowed
    reflection_lm=None,                      # LM for proposing new instructions
    reflection_minibatch_size=3,             # examples per reflection step
    candidate_selection_strategy="pareto",   # "pareto" or "current_best"
    skip_perfect_score=True,                 # skip examples already scoring perfectly
    add_format_failure_as_feedback=False,    # include format errors in feedback
    instruction_proposer=None,               # custom proposal function
    component_selector="round_robin",        # which predictor to improve next
    use_merge=True,                          # merge successful variants
    max_merge_invocations=5,                 # merge attempt limit
    num_threads=None,                        # parallel evaluation threads
    failure_score=0.0,                       # score for failed examples
    perfect_score=1.0,                       # score that counts as perfect
    log_dir=None,                            # directory for optimization logs
    track_stats=False,                       # return detailed metadata
    track_best_outputs=False,                # retain best outputs per task
    seed=0,                                  # reproducibility seed
)
| Parameter | Default | Purpose |
|-----------|---------|---------|
| metric | required | Feedback function -- returns float or {"score": float, "feedback": str} |
| auto | None | Budget preset: "light" (fast), "medium" (balanced), "heavy" (thorough) |
| reflection_lm | None | LM that proposes new instructions. Use a strong model (e.g., GPT-4o, Claude Sonnet). Required unless you provide a custom instruction_proposer |
| reflection_minibatch_size | 3 | How many examples the reflection LM sees per iteration. Larger = better proposals but more cost |
| candidate_selection_strategy | "pareto" | "pareto" maintains diverse candidates; "current_best" always mutates the top scorer |
| use_merge | True | After evolving candidates, merge the best modules from different lineages |
| max_merge_invocations | 5 | Cap on merge attempts to control cost |
| skip_perfect_score | True | Do not waste budget on examples already scoring perfect_score |
| track_stats | False | When True, attach optimization metadata to optimized.detailed_results |
Exactly one of these three must be set:
- auto -- preset budget ("light", "medium", "heavy")
- max_full_evals -- number of full passes over the validation set
- max_metric_calls -- total number of metric invocations

Start with auto="light" for quick experiments, then move to "medium" or "heavy" for production.
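The "exactly one" rule can be checked before constructing the optimizer. This is a small illustrative helper, not part of DSPy: `resolve_budget` is a hypothetical name, and it only mirrors the rule stated above.

```python
def resolve_budget(auto=None, max_full_evals=None, max_metric_calls=None):
    """Return the single budget option that was set, or raise ValueError."""
    options = {
        "auto": auto,
        "max_full_evals": max_full_evals,
        "max_metric_calls": max_metric_calls,
    }
    chosen = {name: value for name, value in options.items() if value is not None}
    if len(chosen) != 1:
        raise ValueError(
            f"Set exactly one of {sorted(options)}; got {sorted(chosen) or 'none'}."
        )
    if auto is not None and auto not in ("light", "medium", "heavy"):
        raise ValueError(f"Unknown auto preset: {auto!r}")
    return chosen


budget = resolve_budget(auto="light")
print(budget)
```

The returned dict can be unpacked into the constructor call, for example `dspy.GEPA(metric=metric, reflection_lm=reflection_lm, **budget)`.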
GEPA metrics are more expressive than standard DSPy metrics. They accept additional keyword arguments for trace-level feedback:
def metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """
    Args:
        gold: The expected Example (ground truth)
        pred: The model's Prediction
        trace: Full program execution trace (optional)
        pred_name: Name of the predictor being optimized (optional)
        pred_trace: Sub-trace for just this predictor (optional)

    Returns:
        float -- simple score
        OR dict -- {"score": float, "feedback": str}
    """
The key advantage of GEPA over other optimizers is that metrics can explain why an output failed. The reflection LM reads this feedback to propose better instructions.
def metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    if pred.answer == gold.answer:
        return {"score": 1.0, "feedback": ""}

    feedback_parts = []
    if len(pred.answer) > 200:
        feedback_parts.append("Answer is too verbose. Keep it under 200 characters.")
    if gold.answer.lower() not in pred.answer.lower():
        feedback_parts.append(f"Answer should contain '{gold.answer}'.")

    return {
        "score": 0.0,
        "feedback": " ".join(feedback_parts),
    }
Write feedback that is actionable -- describe what the instruction should encourage or discourage. Vague feedback like "wrong answer" does not help the reflection LM.
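One way to make feedback actionable is to draw on metadata stored on the example. The sketch below assumes each example may carry an optional `edge_case` attribute; the `EDGE_CASE_HINTS` table and the use of `SimpleNamespace` in place of `dspy.Example` and `dspy.Prediction` are illustrative assumptions so the snippet runs on its own.

```python
from types import SimpleNamespace

# Hypothetical mapping from an edge-case tag to instruction-level advice.
EDGE_CASE_HINTS = {
    "sarcasm": (
        "This review uses sarcasm -- the instruction should warn about "
        "positive words with negative intent."
    ),
}


def metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    if pred.label == gold.label:
        return {"score": 1.0, "feedback": ""}

    parts = [f"Expected '{gold.label}', got '{pred.label}'."]
    # Add structural feedback when the example is tagged with a known edge case.
    hint = EDGE_CASE_HINTS.get(getattr(gold, "edge_case", None))
    if hint:
        parts.append(hint)
    return {"score": 0.0, "feedback": " ".join(parts)}


gold = SimpleNamespace(
    text="Oh great, another broken charger.", label="negative", edge_case="sarcasm"
)
pred = SimpleNamespace(label="positive")
result = metric(gold, pred)
print(result["feedback"])
```

Untagged examples fall back to the plain expected-versus-got message, so the metric still works when metadata is missing.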
Good feedback tells the reflection LM which quality dimension failed and what the correct behavior looks like. Compare:
- Weak: "Wrong answer" -- the reflection LM has no direction
- Weak: "Score: 0" -- no feedback at all
- Strong: "Faithfulness: summary claims 'revenue doubled' but the article says 'revenue grew 15%'. The instruction should emphasize only stating facts from the source text."
- Strong: "Format: output used bullet points instead of prose. The instruction should specify narrative paragraph format."

When your examples contain metadata beyond the core input (e.g., expected categories, known edge cases, trap fields), use that metadata in the metric to give structural feedback. For example, if an example has example.edge_case = "sarcasm", the metric can say "This review uses sarcasm -- the instruction should warn about positive words with negative intent." This gives the reflection LM a pattern to fix, not just a score to chase.
The GEPA paper (arxiv 2507.19457) shows that task-specific feedback yields 20-30% better evolved prompts than generic feedback. Maximize your feedback quality:
- "Faithfulness failed: the summary invented a statistic. The instruction should require citing source sentences."
- "This is a sarcasm edge case -- the instruction should warn about positive words with negative intent."

The GEPA paper provides guidance on scaling parameters to model size and budget:
| Configuration | Population | Generations | Validation examples |
|---------------|-----------|-------------|---------------------|
| Small models (7B-13B) | 8-12 | 15-25 | 10-15 |
| Large models (70B+) | 5-8 | 12-18 | 5-10 |
| Budget-constrained (<$10) | 3-5 | 8-10 | Use aggressive early stopping |
Smaller models benefit from larger populations (more diversity to explore) and more generations (more refinement steps). Larger models converge faster and need fewer candidates.
Understanding the algorithm helps you write better metrics and choose parameters:
- Sample reflection_minibatch_size examples from trainset
- reflection_lm reads traces + feedback and proposes a revised instruction

The Pareto frontier is the key innovation: rather than keeping only the single best candidate, GEPA maintains candidates that excel on different subsets. This prevents the optimizer from overfitting to one failure pattern while ignoring others.
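The Pareto idea can be illustrated with a few lines of plain Python. This is a simplified sketch, not GEPA's actual implementation: it assumes each candidate instruction has one score per validation example and keeps every candidate that is best (or tied for best) on at least one example. The function name and the score table are made up for the illustration.

```python
def pareto_candidates(scores):
    """Return names of candidates that are best on at least one example.

    scores: dict mapping candidate name -> list of per-example scores,
            all lists having the same length.
    """
    num_examples = len(next(iter(scores.values())))
    frontier = set()
    for i in range(num_examples):
        best = max(per_example[i] for per_example in scores.values())
        frontier.update(
            name for name, per_example in scores.items() if per_example[i] == best
        )
    return sorted(frontier)


scores = {
    "A": [1.0, 1.0, 1.0, 0.0],  # strong overall, fails example 3
    "B": [0.0, 0.0, 0.0, 1.0],  # weak overall, but the only one to solve example 3
    "C": [0.5, 0.5, 0.5, 0.5],  # decent average, never the best anywhere
}

frontier = pareto_candidates(scores)
best_by_average = max(scores, key=lambda name: sum(scores[name]) / len(scores[name]))
print("pareto frontier:", frontier)
print("best by average:", best_by_average)
```

A "current_best" style of selection would only ever mutate A here, while the frontier also keeps B, whose instruction contains whatever made example 3 solvable.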
| Aspect | dspy.GEPA | dspy.MIPROv2 |
|--------|-------------|----------------|
| What it tunes | Instructions only | Instructions + few-shot demos |
| Data needed | 20-100 examples | ~200 examples |
| Prompt size | Compact (no demos) | Larger (includes demos) |
| Feedback | Uses textual feedback from metrics | Uses scalar scores only |
| Multi-step | Per-predictor feedback and optimization | Optimizes all predictors jointly |
| Typical improvement | 10-25% (paper reports 10+ points over MIPROv2 on six tasks) | 15-35% |
| Best for | Instruction tuning, compact prompts, feedback-driven optimization | Demo-heavy tasks, larger budgets |
| Cost | Lower (fewer metric calls) | Higher (explores more candidates) |
Paper context: The GEPA paper (arxiv 2507.19457) reports GEPA outperforming MIPROv2 by 10+ percentage points across six benchmark tasks. However, MIPROv2 also tunes few-shot demonstrations, which GEPA does not -- for tasks where in-context examples are critical, MIPROv2 may still be the better choice.
Rule of thumb: Start with GEPA when you have fewer than 200 examples, want compact prompts, or can provide rich textual feedback in your metric. Move to MIPROv2 if you need few-shot demos in the prompt or have 200+ examples.
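The rule of thumb can be written down as a small decision function. This is only one reading of it: `suggest_optimizer` and its parameter names are hypothetical, and the choice to let a need for demos override everything else is an assumption, since the text does not say which condition wins when they conflict.

```python
def suggest_optimizer(
    num_examples,
    needs_demos=False,
    wants_compact_prompt=False,
    has_text_feedback=False,
):
    """Suggest a starting optimizer based on the rule of thumb above."""
    if needs_demos:
        # GEPA does not add few-shot demos, so this requirement decides it.
        return "dspy.MIPROv2"
    if num_examples < 200 or wants_compact_prompt or has_text_feedback:
        return "dspy.GEPA"
    return "dspy.MIPROv2"


print(suggest_optimizer(50))
print(suggest_optimizer(500))
print(suggest_optimizer(500, has_text_feedback=True))
print(suggest_optimizer(50, needs_demos=True))
```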
If you have a separate validation set, pass it to compile:
optimized = gepa.compile(
    classify,
    trainset=trainset,
    valset=valset,
)
Without a valset, GEPA uses the trainset for both training and validation. This can lead to overfitting but is useful for test-time search (optimizing for a specific batch of inputs).
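If only one pool of labeled examples is available, a seeded shuffle-and-slice is a simple way to carve out a validation set. The helper below is a generic sketch: the name `split_train_val` and the 30% default are illustrative choices, not values recommended by DSPy or the GEPA paper.

```python
import random


def split_train_val(examples, val_fraction=0.3, seed=0):
    """Shuffle a copy of `examples` and split it into (trainset, valset)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


examples = list(range(10))  # stand-in for a list of dspy.Example objects
trainset, valset = split_train_val(examples)
print(len(trainset), len(valset))
```

The two lists can then be passed as `trainset=` and `valset=` to `gepa.compile`.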
GEPA can be used at inference time to find the best instructions for a specific batch of tasks:
gepa = dspy.GEPA(
    metric=metric,
    reflection_lm=dspy.LM("openai/gpt-4o", temperature=1.0, max_tokens=4096),  # use a strong model for reflection
    auto="light",
    track_stats=True,
    track_best_outputs=True,
)
# Pass the same data as both trainset and valset
result = gepa.compile(program, trainset=tasks, valset=tasks)
# Access the best output for each task
best_per_task = result.detailed_results.best_outputs_valset
GEPA only tunes the instruction string (the Signature docstring). Everything else in your prompt is fixed during optimization:
| Prompt element | Optimized by GEPA? | Where it lives |
|----------------|-------------------|----------------|
| Signature docstring | Yes | """Classify the text.""" |
| InputField(desc=...) | No | dspy.InputField(desc="...") |
| OutputField(desc=...) | No | dspy.OutputField(desc="...") |
| Pydantic Field(description=...) | No | pydantic.Field(description="...") |
| Field names | No | label: str = dspy.OutputField() |
| Type constraints | No | Literal["a", "b"], Pydantic models |
| Few-shot demos | No (by design) | Added by other optimizers |
This matters most for structured output tasks where Pydantic field descriptions carry significant guidance for the LM. If your output schema has Field(description="Invoice date in YYYY-MM-DD format"), GEPA will never touch that description -- even if it's the source of failures.
To bring field descriptions into GEPA's optimization surface, serialize them into the instruction before optimizing, then extract back out:
import dspy
from pydantic import BaseModel, Field

# 1. Your original Pydantic model
class Invoice(BaseModel):
    vendor: str = Field(description="Company name of the vendor")
    date: str = Field(description="Invoice date in YYYY-MM-DD format")
    total: float = Field(description="Total amount due")

# 2. Serialize field descriptions into the instruction
field_guidance = "\n".join(
    f"- {name}: {info.description}"
    for name, info in Invoice.model_fields.items()
    if info.description
)

class ParseInvoice(dspy.Signature):
    """Extract invoice data from raw text."""
    text: str = dspy.InputField()
    invoice: Invoice = dspy.OutputField()

# An f-string cannot act as a docstring (Python leaves __doc__ unset),
# so attach the field guidance with with_instructions instead.
# GEPA will optimize this entire instruction, including the field guidance.
ParseInvoice = ParseInvoice.with_instructions(
    f"Extract invoice data from raw text.\n\nOutput field guidelines:\n{field_guidance}"
)

# 3. Optimize -- GEPA now sees and can rewrite the field guidance.
# metric, reflection_lm, and trainset are assumed to be defined already.
program = dspy.Predict(ParseInvoice)  # compile expects a module, not a bare Signature
gepa = dspy.GEPA(metric=metric, reflection_lm=reflection_lm, auto="medium")
optimized = gepa.compile(program, trainset=trainset)

# 4. After optimization, inspect the optimized instruction
# to see how GEPA refined the field guidance
for name, predictor in optimized.named_predictors():
    print(name, predictor.signature.instructions)
Limitations of this workaround:
- The Pydantic description fields themselves remain unchanged.

When this is worth doing:
- desc strings are doing heavy lifting (e.g., date formats, enum explanations, nested object guidance)

Common pitfalls:
- Returning a bare score. When the metric returns {"score": 0.0, "feedback": "Expected positive but got negative; the review is sarcastic"}, the reflection LM uses that feedback to propose better instructions. Without feedback, GEPA degrades to blind search. Always return a dict with both score and feedback.
- Using a weak reflection model. gpt-4o-mini or a small local model for reflection produces generic, unhelpful instruction changes. Use a strong model (GPT-4o, Claude Sonnet) for reflection_lm -- the task LM can be cheaper.
- Running auto="heavy" before validating the metric. A broken or noisy metric wastes the entire optimization budget. Start with auto="light" to verify the metric produces meaningful scores and feedback, then scale up to "medium" or "heavy" for production runs.
- Skipping dspy.Evaluate before and after GEPA. Without a baseline measurement, there is no way to know if GEPA actually improved anything. Always evaluate the unoptimized program first, then compare against the optimized version.
- Expecting GEPA to rewrite field descriptions. InputField(desc=...), OutputField(desc=...), and Pydantic Field(description=...) are never modified. If field descriptions are causing failures, flatten them into the instruction before optimizing (see the workaround in this skill).

If your optimized program scores the same as the baseline, that does not necessarily mean GEPA is broken -- it may simply be finding nothing to fix.
GEPA improves instructions by reflecting on failures. If every minibatch is all-correct, the reflection LM never fires and the instructions stay unchanged. This means the task is saturated for the current task LM -- the model already solves it without better instructions.
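A quick way to spot saturation before spending an optimization budget is to count how many baseline examples fall short of a perfect score. The helper below is a hypothetical sketch that works on a plain list of per-example scores; `saturation_report` is not a DSPy function.

```python
def saturation_report(scores, perfect_score=1.0):
    """Summarize how many examples leave GEPA something to reflect on."""
    failing = [i for i, score in enumerate(scores) if score < perfect_score]
    return {
        "num_examples": len(scores),
        "num_failing": len(failing),
        "failing_indices": failing,
        "saturated": not failing,
    }


baseline_scores = [1.0, 1.0, 0.0, 1.0, 0.5]  # made-up per-example scores
report = saturation_report(baseline_scores)
print(report)
```

If `saturated` is true for the baseline, every minibatch will be all-correct and there are no failures for the reflection LM to learn from.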
Signs of saturation:
- track_stats=True shows zero reflection calls

A weaker task LM leaves more failures for GEPA to reflect on, and smaller models paired with GEPA can sometimes match larger models. Free-tier models on OpenRouter (e.g., small Qwen or Llama variants) work as task LMs while a strong model handles reflection. This lets you run optimization loops with no API spend on the task LM side. Set seed=0 for reproducibility.
# Weaker task LM + strong reflection LM = maximum GEPA signal
task_lm = dspy.LM("openrouter/qwen/qwen3-1.7b:free", seed=0)
reflection_lm = dspy.LM("openai/gpt-4o", temperature=1.0, max_tokens=4096)
dspy.configure(lm=task_lm)
gepa = dspy.GEPA(metric=metric, reflection_lm=reflection_lm, auto="medium")
optimized = gepa.compile(program, trainset=trainset)
Install any skill:
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>
- /ai-watching-optimization
- /ai-improving-accuracy
- /dspy-miprov2
- /dspy-chain-of-thought
- /dspy-evaluate
- /dspy-refine
- /dspy-signatures for field descriptions, typed outputs, and gotchas about what optimizers can/cannot tune
- /dspy-vizpy
- /ai-do if you do not have it -- it routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do