skills/dspy-refine/SKILL.md
Iterative self-improvement with dspy.Refine -- wraps any module, scores each attempt with a reward function, generates feedback on failures, and retries until a quality threshold is met. Use when you want outputs to improve through self-critique, need iterative revision of drafts, or want the LM to learn from its own mistakes within a single request. Also used for self-critique and revise, iterative improvement loop, generate then evaluate then fix, AI self-editing, multi-draft generation, revise until good enough, critique-driven refinement, when first draft is not good enough.
Before writing code, clarify:

- ... dspy.BestOfN instead.

Use dspy.Refine when:

Do not use Refine when:

- ... dspy.ChainOfThought instead
- ... dspy.Predict call
- ... dspy.BestOfN

Three things are needed: a module to wrap, a reward function, and a threshold.
```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # or "anthropic/claude-sonnet-4-5-20250929", etc.

# 1. Define the module to refine
qa = dspy.ChainOfThought("question -> answer")

# 2. Define a reward function
# Takes (args_dict, prediction) -> float
def concise_answer(args, pred):
    """Reward one-word answers."""
    return 1.0 if len(pred.answer.split()) == 1 else 0.0

# 3. Wrap with Refine
refined_qa = dspy.Refine(
    module=qa,
    N=3,
    reward_fn=concise_answer,
    threshold=1.0,
)

# Use it -- same interface as the wrapped module
result = refined_qa(question="What is the capital of Belgium?")
print(result.answer)  # e.g. "Brussels" (LM output may vary)
```
```python
dspy.Refine(
    module,      # The DSPy module to refine (required)
    N,           # Max number of attempts (required, int)
    reward_fn,   # Callable(args_dict, prediction) -> float (required)
    threshold,   # Target reward score to accept an output (required, float)
    fail_count,  # Max failures before raising an error (optional, defaults to N)
)
```
| Parameter | Type | Description |
|-----------|------|-------------|
| module | dspy.Module | The module whose outputs you want to refine |
| N | int | Maximum number of attempts. Each attempt uses temperature=1.0 with a different rollout ID |
| reward_fn | Callable | Scores a prediction. Receives (args, pred) where args is the input kwargs dict and pred is the module's output. Must return a float |
| threshold | float | Target score. Refine returns immediately when an attempt meets or exceeds this value |
| fail_count | int | Optional. Maximum allowed failures before raising an error. Defaults to N |
The reward function is the core of Refine. It receives two arguments:
- args -- a dict of the inputs passed to the module (e.g., {"question": "What is..."})
- pred -- the module's prediction object (access fields like pred.answer, pred.reasoning)

It must return a float. Higher is better.
```python
import json

def valid_json(args, pred):
    """Accept only valid JSON outputs."""
    try:
        json.loads(pred.output)
        return 1.0
    except (json.JSONDecodeError, TypeError):
        return 0.0
```
Return partial scores to help Refine pick the best attempt even when none fully succeed:
```python
def quality_score(args, pred):
    """Score answer quality on multiple criteria."""
    score = 0.0
    answer = pred.answer

    # Criterion 1: not empty
    if answer.strip():
        score += 0.3

    # Criterion 2: reasonable length (20-200 words)
    word_count = len(answer.split())
    if 20 <= word_count <= 200:
        score += 0.4

    # Criterion 3: addresses the question
    # (crude heuristic: the first word of the question appears in the answer)
    if args["question"].split()[0].lower() in answer.lower():
        score += 0.3

    return score
```
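Multi-criteria rewards like the one above can also be assembled from small named checks. The following is a minimal sketch, not part of DSPy: `weighted_reward` and the three checks are hypothetical helpers, and the `topic` input field is an assumption made for illustration. `SimpleNamespace` only stands in for a prediction object so the scoring can be run without an LM.

```python
from types import SimpleNamespace


def weighted_reward(checks):
    """Build a reward function from (weight, check) pairs.

    Hypothetical helper: each check takes (args, pred) and returns a bool.
    Weights are expected to sum to 1.0 so the score stays in [0, 1].
    """
    def reward_fn(args, pred):
        return float(sum(weight for weight, check in checks if check(args, pred)))
    return reward_fn


def not_empty(args, pred):
    return bool(pred.answer.strip())


def short_enough(args, pred):
    return len(pred.answer.split()) <= 50


def mentions_topic(args, pred):
    return args["topic"].lower() in pred.answer.lower()


topic_score = weighted_reward([
    (0.25, not_empty),
    (0.25, short_enough),
    (0.5, mentions_topic),
])

# Stand-in prediction; a real run would pass the module's output instead.
pred = SimpleNamespace(answer="Brussels is the capital of Belgium.")
print(topic_score({"topic": "Belgium"}, pred))  # 1.0
print(topic_score({"topic": "France"}, pred))   # 0.5
```

The resulting function has the same `(args, pred) -> float` shape as the hand-written rewards, so it could be passed as `reward_fn` in the same way.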
```python
import re

def valid_email_extraction(args, pred):
    """Reward valid email addresses extracted from text."""
    emails = pred.emails if isinstance(pred.emails, list) else []
    if not emails:
        return 0.0
    email_pattern = r'^[\w.-]+@[\w.-]+\.\w+$'
    valid_count = sum(1 for e in emails if re.match(email_pattern, e))
    return valid_count / len(emails)
```
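Because a reward function is plain Python, it can be checked offline before any LM call is spent on it. The sketch below assumes that attribute access is all the reward needs from a prediction, so `types.SimpleNamespace` is used as a stand-in; it is not a real `dspy.Prediction`. The function name `email_validity` and the sample inputs are made up for illustration.

```python
import re
from types import SimpleNamespace

EMAIL_PATTERN = re.compile(r"^[\w.-]+@[\w.-]+\.\w+$")


def email_validity(args, pred):
    """Fraction of extracted emails that look syntactically valid."""
    emails = pred.emails if isinstance(pred.emails, list) else []
    if not emails:
        return 0.0
    valid = sum(1 for e in emails if isinstance(e, str) and EMAIL_PATTERN.match(e))
    return valid / len(emails)


# Stand-in predictions covering the cases the reward should distinguish.
args = {"text": "Contact a@example.com or bob at example dot com"}
all_valid = SimpleNamespace(emails=["a@example.com", "b@example.org"])
half_valid = SimpleNamespace(emails=["a@example.com", "not-an-email"])
malformed = SimpleNamespace(emails="a@example.com")  # a string, not a list

print(email_validity(args, all_valid))   # 1.0
print(email_validity(args, half_valid))  # 0.5
print(email_validity(args, malformed))   # 0.0
```

Checking the edge cases this way (empty list, wrong type, partial success) helps confirm the scores are meaningful before they drive retries.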
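Each attempt runs the wrapped module at temperature=1.0 with a different rollout ID, producing diverse outputs. Refine's selection logic:

1. Run the wrapped module to produce an attempt
2. Score the attempt with reward_fn
3. If the score meets or exceeds threshold, return immediately
4. Otherwise, generate feedback and try again, up to N attempts
5. If no attempt meets threshold, return the highest-scoring attempt

Choosing N: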
| N value | Use case | Cost |
|---------|----------|------|
| 2-3 | Format validation, simple constraints | Low overhead |
| 3-5 | Quality criteria, multi-factor scoring | Moderate |
| 5-10 | High-stakes outputs, strict requirements | Higher cost, better results |
The sweet spot for most use cases is N=3 to N=5. Beyond 5, diminishing returns are common unless the reward function is very specific.
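The selection behavior described in this section (score each attempt, stop early at the threshold, otherwise keep the best one seen) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not DSPy's actual implementation: `select_attempt` is a hypothetical function, feedback generation is left out, and the canned `drafts` stand in for LM outputs.

```python
from typing import Any, Callable


def select_attempt(
    generate: Callable[[int], Any],
    reward_fn: Callable[[dict, Any], float],
    args: dict,
    n: int,
    threshold: float,
) -> tuple:
    """Return (best_prediction, best_score, attempts_used)."""
    best_pred, best_score = None, float("-inf")
    for attempt in range(1, n + 1):
        pred = generate(attempt)        # one run of the wrapped module
        score = reward_fn(args, pred)
        if score > best_score:
            best_pred, best_score = pred, score
        if score >= threshold:          # good enough: stop early
            return best_pred, best_score, attempt
    return best_pred, best_score, n     # threshold never met: best so far


def one_word(args, pred):
    return 1.0 if len(pred.split()) == 1 else 0.0


# Hypothetical canned outputs standing in for three LM attempts.
drafts = {1: "The capital city is Brussels", 2: "Brussels", 3: "Bruxelles"}

print(select_attempt(drafts.get, one_word, {}, n=3, threshold=1.0))
# ('Brussels', 1.0, 2)
```

The third draft is never generated because the second one already meets the threshold, which is why a higher N raises only the worst-case cost.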
What makes Refine different from random retries is feedback generation. When an attempt fails to meet the threshold, Refine generates feedback on that failure and feeds it into the next attempt.
This means later attempts are informed by earlier failures. Attempt 3 knows what went wrong in attempts 1 and 2.
You do not write the feedback logic -- Refine handles it automatically based on your reward function's scores.
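To make the idea of "later attempts are informed by earlier failures" concrete, here is a toy sketch of a retry loop that threads feedback forward. It is an illustration only and makes several assumptions: `refine_with_feedback`, `toy_module`, and `toy_critique` are hypothetical names, the hand-written critique stands in for the feedback that Refine produces automatically, and the toy module is a deterministic function rather than an LM.

```python
def refine_with_feedback(module, reward_fn, critique, args, n, threshold):
    """Retry up to n times, passing accumulated feedback to each new attempt."""
    hints = []
    best = None
    for _ in range(n):
        pred = module(hints=list(hints), **args)
        score = reward_fn(args, pred)
        if best is None or score > best[1]:
            best = (pred, score)
        if score >= threshold:
            break
        # Failed attempt: record why, so the next attempt can adjust.
        hints.append(critique(args, pred, score))
    return best[0], best[1], hints


def toy_module(question, hints):
    # Stand-in for an LM: answers verbosely until a hint asks for brevity.
    if any("one word" in h for h in hints):
        return "Brussels"
    return "The capital of Belgium is Brussels"


def one_word(args, pred):
    return 1.0 if len(pred.split()) == 1 else 0.0


def toy_critique(args, pred, score):
    return f"Answer had {len(pred.split())} words; reply with one word."


pred, score, hints = refine_with_feedback(
    toy_module, one_word, toy_critique,
    {"question": "What is the capital of Belgium?"}, n=3, threshold=1.0,
)
print(pred, score, hints)
# Brussels 1.0 ['Answer had 6 words; reply with one word.']
```

Without the hint, the toy module would return the same verbose answer on every attempt, which is the difference between blind retries and feedback-driven ones.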
Both modules run a wrapped module multiple times and select the best output, but they work differently:
| Aspect | dspy.Refine | dspy.BestOfN |
|--------|--------------|----------------|
| Feedback | Generates feedback from failures, improving subsequent attempts | No feedback -- each attempt is independent |
| Attempts | Sequential (each informed by previous) | Independent (can be parallel) |
| Early stopping | Returns on first success meeting threshold | Also returns on first success meeting threshold |
| Best for | Iterative improvement, complex quality criteria | Sampling diversity, simple pass/fail |
| Cost pattern | Often fewer calls (feedback improves later attempts) | All attempts independent, no learning between them |
Use Refine when the LM can improve with feedback -- writing tasks, format compliance, multi-criteria quality.
Use BestOfN when attempts are independent and feedback would not help -- creative generation, sampling diverse options, simple binary checks.
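The cost trade-off can be estimated with a little arithmetic. The sketch below assumes each attempt meets the threshold independently with the same probability `p`, which matches the independent-attempts model and only roughly approximates Refine, where feedback is meant to change the odds on later attempts. It counts module runs only; the function names are made up for this example.

```python
def success_probability(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts meets the threshold."""
    return 1.0 - (1.0 - p) ** n


def expected_attempts(p: float, n: int) -> float:
    """Expected module runs with early stopping, capped at n attempts."""
    if p == 0.0:
        return float(n)  # never succeeds, so all n attempts are used
    return (1.0 - (1.0 - p) ** n) / p


# With a 50% per-attempt success rate and N=3:
print(success_probability(0.5, 3))  # 0.875
print(expected_attempts(0.5, 3))    # 1.75

# Larger N helps less and less once success is already likely.
for n in (1, 3, 5, 10):
    print(n, round(success_probability(0.5, n), 3))
```

Under this assumption, raising N mostly buys insurance against unlucky runs rather than a proportional increase in average cost.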
Refine works with any dspy.Module, not just built-in ones:
```python
import dspy

class Summarizer(dspy.Module):
    def __init__(self):
        super().__init__()  # initialize dspy.Module before adding sub-modules
        self.summarize = dspy.ChainOfThought("article -> summary")

    def forward(self, article):
        return self.summarize(article=article)

def good_summary(args, pred):
    """Score summary quality."""
    summary = pred.summary
    article = args["article"]
    score = 0.0

    # Shorter than 30% of the original
    if len(summary) < len(article) * 0.3:
        score += 0.5

    # At least 2 sentences (approximated by counting periods)
    if summary.count('.') >= 2:
        score += 0.5

    return score

refined_summarizer = dspy.Refine(
    module=Summarizer(),
    N=3,
    reward_fn=good_summary,
    threshold=0.8,  # scores are 0.0, 0.5, or 1.0, so both criteria must pass
)

result = refined_summarizer(article="Long article text here...")
print(result.summary)
```
dspy.Predict does not expose a reasoning field. Use dspy.ChainOfThought as the inner module so the feedback loop has reasoning to critique and improve.

Install any skill:

`npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>`

- /dspy-best-of-n
- /dspy-chain-of-thought
- /ai-checking-outputs
- /ai-improving-accuracy
- /ai-writing-content
- Install /ai-do if you do not have it. It routes any AI problem to the right skill and is the fastest way to work: `npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do`