Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

lebsral/dspy-refine

Name: dspy-refine
Author: lebsral

skills/dspy-refine/SKILL.md

npx skillsauth add lebsral/dspy-programming-not-prompting-lms-skills dspy-refine

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Iterative Self-Improvement with dspy.Refine

Step 1: Understand the use case

Before writing code, clarify:

What module are you wrapping? (ChainOfThought, Predict, ReAct, custom module?)
What makes a good output? Can you express quality as a numeric score?
How many retries are acceptable? (cost/latency budget)
Is feedback useful? Would knowing why attempt 1 failed help attempt 2? If not, consider dspy.BestOfN instead.

When to use Refine (and when not to)

Use dspy.Refine when:

Outputs must meet measurable quality criteria (format, length, accuracy)
The LM can improve with feedback -- writing, format compliance, multi-criteria quality
Quality is worth 2-5x cost for N attempts

Do not use Refine when:

You have no clear way to score outputs -- use dspy.ChainOfThought instead
You need human-in-the-loop feedback -- build a custom module with a human review step
Speed matters more than quality -- use a single dspy.Predict call
Attempts are independent and feedback would not help -- use dspy.BestOfN

Basic usage

Three things are needed: a module to wrap, a reward function, and a threshold.

import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # or "anthropic/claude-sonnet-4-5-20250929", etc.

# 1. Define the module to refine
qa = dspy.ChainOfThought("question -> answer")

# 2. Define a reward function
# Takes (args_dict, prediction) -> float
def concise_answer(args, pred):
    """Reward one-word answers."""
    return 1.0 if len(pred.answer.split()) == 1 else 0.0

# 3. Wrap with Refine
refined_qa = dspy.Refine(
    module=qa,
    N=3,
    reward_fn=concise_answer,
    threshold=1.0,
)

# Use it -- same interface as the wrapped module
result = refined_qa(question="What is the capital of Belgium?")
print(result.answer)  # "Brussels"

Constructor parameters

dspy.Refine(
    module,       # The DSPy module to refine (required)
    N,            # Max number of attempts (required, int)
    reward_fn,    # Callable(args_dict, prediction) -> float (required)
    threshold,    # Target reward score to accept an output (required, float)
    fail_count,   # Max failures before raising an error (optional, defaults to N)
)

| Parameter | Type | Description | |-----------|------|-------------| | module | dspy.Module | The module whose outputs you want to refine | | N | int | Maximum number of attempts. Each attempt uses temperature=1.0 with a different rollout ID | | reward_fn | Callable | Scores a prediction. Receives (args, pred) where args is the input kwargs dict and pred is the module's output. Must return a float | | threshold | float | Target score. Refine returns immediately when an attempt meets or exceeds this value | | fail_count | int | Optional. Maximum allowed failures before raising an error. Defaults to N |

Writing reward functions

The reward function is the core of Refine. It receives two arguments:

args -- a dict of the inputs passed to the module (e.g., {"question": "What is..."})
pred -- the module's prediction object (access fields like pred.answer, pred.reasoning)

It must return a float. Higher is better.

Simple binary reward

def valid_json(args, pred):
    """Accept only valid JSON outputs."""
    import json
    try:
        json.loads(pred.output)
        return 1.0
    except (json.JSONDecodeError, TypeError):
        return 0.0

Graduated reward

Return partial scores to help Refine pick the best attempt even when none fully succeed:

def quality_score(args, pred):
    """Score answer quality on multiple criteria."""
    score = 0.0
    answer = pred.answer

    # Criterion 1: not empty
    if answer.strip():
        score += 0.3

    # Criterion 2: reasonable length (20-200 words)
    word_count = len(answer.split())
    if 20 <= word_count <= 200:
        score += 0.4

    # Criterion 3: addresses the question
    if args["question"].split()[0].lower() in answer.lower():
        score += 0.3

    return score

Using external validation

import re

def valid_email_extraction(args, pred):
    """Reward valid email addresses extracted from text."""
    emails = pred.emails if isinstance(pred.emails, list) else []
    if not emails:
        return 0.0
    email_pattern = r'^[\w.-]+@[\w.-]+\.\w+$'
    valid_count = sum(1 for e in emails if re.match(email_pattern, e))
    return valid_count / len(emails)

How iteration count (N) works

Each attempt runs the wrapped module at temperature=1.0 with a different rollout ID, producing diverse outputs. Refine's selection logic:

Run the module and score the output with reward_fn
If the score meets or exceeds threshold, return immediately
If not, generate feedback from the failure and try again
After N attempts, return the attempt with the highest reward score

Choosing N:

| N value | Use case | Cost | |---------|----------|------| | 2-3 | Format validation, simple constraints | Low overhead | | 3-5 | Quality criteria, multi-factor scoring | Moderate | | 5-10 | High-stakes outputs, strict requirements | Higher cost, better results |

The sweet spot for most use cases is N=3 to N=5. Beyond 5, diminishing returns are common unless the reward function is very specific.

The feedback mechanism

What makes Refine different from random retries is feedback generation. When an attempt fails to meet the threshold:

Refine examines why the attempt scored below the threshold
It generates natural-language feedback describing the shortcoming
This feedback is included in the prompt for the next attempt
The LM uses this feedback to produce a better output

This means later attempts are informed by earlier failures. Attempt 3 knows what went wrong in attempts 1 and 2.

You do not write the feedback logic -- Refine handles it automatically based on your reward function's scores.

Refine vs BestOfN -- when to use which

Both modules run a wrapped module multiple times and select the best output, but they work differently:

| Aspect | dspy.Refine | dspy.BestOfN | |--------|--------------|----------------| | Feedback | Generates feedback from failures, improving subsequent attempts | No feedback -- each attempt is independent | | Attempts | Sequential (each informed by previous) | Independent (can be parallel) | | Early stopping | Returns on first success meeting threshold | Also returns on first success meeting threshold | | Best for | Iterative improvement, complex quality criteria | Sampling diversity, simple pass/fail | | Cost pattern | Often fewer calls (feedback improves later attempts) | All attempts independent, no learning between them |

Use Refine when the LM can improve with feedback -- writing tasks, format compliance, multi-criteria quality.

Use BestOfN when attempts are independent and feedback would not help -- creative generation, sampling diverse options, simple binary checks.

Wrapping custom modules

Refine works with any dspy.Module, not just built-in ones:

class Summarizer(dspy.Module):
    def __init__(self):
        self.summarize = dspy.ChainOfThought("article -> summary")

    def forward(self, article):
        return self.summarize(article=article)


def good_summary(args, pred):
    """Score summary quality."""
    summary = pred.summary
    article = args["article"]
    score = 0.0

    # Shorter than original
    if len(summary) < len(article) * 0.3:
        score += 0.5

    # At least 2 sentences
    if summary.count('.') >= 2:
        score += 0.5

    return score


refined_summarizer = dspy.Refine(
    module=Summarizer(),
    N=3,
    reward_fn=good_summary,
    threshold=0.8,
)

result = refined_summarizer(article="Long article text here...")
print(result.summary)

Gotchas

Claude uses binary reward functions (0 or 1) instead of graduated scores. Binary rewards mean Refine cannot distinguish between a near-miss and a total failure when no attempt hits the threshold. Use graduated floats (0.0 to 1.0) with partial credit for each criterion so Refine returns the best near-miss.
Claude puts LM calls inside reward functions for "smarter" scoring. The reward function runs on every attempt (up to N times), so an LM call inside it doubles or triples your cost. Use deterministic checks (regex, AST parsing, length checks) in reward functions. Reserve LM-as-judge for the final evaluation after Refine returns.
Claude sets threshold=1.0 with multi-criteria reward functions. If your reward function scores 4 weighted criteria, achieving a perfect 1.0 is unlikely. Refine then burns through all N attempts and returns a near-miss anyway. Set the threshold to 0.8 or the realistic "good enough" score for your criteria.
Claude confuses Refine with BestOfN and uses them interchangeably. The key difference is feedback: Refine generates feedback from failures and feeds it into subsequent attempts (sequential, improving). BestOfN runs independent attempts with no cross-attempt learning. Use Refine when later attempts can learn from earlier failures; use BestOfN when attempts are inherently independent.
Claude wraps a Predict module with Refine for tasks that need reasoning. Refine improves outputs through feedback, but dspy.Predict does not expose a reasoning field. Use dspy.ChainOfThought as the inner module so the feedback loop has reasoning to critique and improve.

Cross-references

Install any skill: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>

BestOfN for independent sampling without feedback -- see /dspy-best-of-n
Chain of thought reasoning as the inner module -- see /dspy-chain-of-thought
Checking and validating outputs with assertions -- see /ai-checking-outputs
Improving accuracy with optimization -- see /ai-improving-accuracy
Writing content that benefits from iterative refinement -- see /ai-writing-content
Install /ai-do if you do not have it — it routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do

Additional resources

dspy.Refine API docs
For API details, see reference.md
For worked examples, see examples.md

lebsral/dspy-refine

skills/dspy-refine/SKILL.md

Iterative self-improvement with dspy.Refine -- wraps any module, scores each attempt with a reward function, generates feedback on failures, and retries until a quality threshold is met. Use when you want outputs to improve through self-critique, need iterative revision of drafts, or want the LM to learn from its own mistakes within a single request. Also used for self-critique and revise, iterative improvement loop, generate then evaluate then fix, AI self-editing, multi-draft generation, revise until good enough, critique-driven refinement, when first draft is not good enough.

5 stars

development

Updated May 7, 2026

$ install --global

skillsauth

npx skillsauth add lebsral/dspy-programming-not-prompting-lms-skills dspy-refine

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: May 7, 2026, 7:02 AM157.9s4 files scanned

SKILL.md

name:: dspy-refine
description:: Iterative self-improvement with dspy.Refine -- wraps any module, scores each attempt with a reward function, generates feedback on failures, and retries until a quality threshold is met. Use when you want outputs to improve through self-critique, need iterative revision of drafts, or want the LM to learn from its own mistakes within a single request. Also used for self-critique and revise, iterative improvement loop, generate then evaluate then fix, AI self-editing, multi-draft generation, revise until good enough, critique-driven refinement, when first draft is not good enough.

Iterative Self-Improvement with dspy.Refine

Step 1: Understand the use case

Before writing code, clarify:

What module are you wrapping? (ChainOfThought, Predict, ReAct, custom module?)
What makes a good output? Can you express quality as a numeric score?
How many retries are acceptable? (cost/latency budget)
Is feedback useful? Would knowing why attempt 1 failed help attempt 2? If not, consider dspy.BestOfN instead.

When to use Refine (and when not to)

Use dspy.Refine when:

Outputs must meet measurable quality criteria (format, length, accuracy)
The LM can improve with feedback -- writing, format compliance, multi-criteria quality
Quality is worth 2-5x cost for N attempts

Do not use Refine when:

You have no clear way to score outputs -- use dspy.ChainOfThought instead
You need human-in-the-loop feedback -- build a custom module with a human review step
Speed matters more than quality -- use a single dspy.Predict call
Attempts are independent and feedback would not help -- use dspy.BestOfN

Basic usage

Three things are needed: a module to wrap, a reward function, and a threshold.

import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # or "anthropic/claude-sonnet-4-5-20250929", etc.

# 1. Define the module to refine
qa = dspy.ChainOfThought("question -> answer")

# 2. Define a reward function
# Takes (args_dict, prediction) -> float
def concise_answer(args, pred):
    """Reward one-word answers."""
    return 1.0 if len(pred.answer.split()) == 1 else 0.0

# 3. Wrap with Refine
refined_qa = dspy.Refine(
    module=qa,
    N=3,
    reward_fn=concise_answer,
    threshold=1.0,
)

# Use it -- same interface as the wrapped module
result = refined_qa(question="What is the capital of Belgium?")
print(result.answer)  # "Brussels"

Constructor parameters

dspy.Refine(
    module,       # The DSPy module to refine (required)
    N,            # Max number of attempts (required, int)
    reward_fn,    # Callable(args_dict, prediction) -> float (required)
    threshold,    # Target reward score to accept an output (required, float)
    fail_count,   # Max failures before raising an error (optional, defaults to N)
)

Writing reward functions

The reward function is the core of Refine. It receives two arguments:

args -- a dict of the inputs passed to the module (e.g., {"question": "What is..."})
pred -- the module's prediction object (access fields like pred.answer, pred.reasoning)

It must return a float. Higher is better.

Simple binary reward

def valid_json(args, pred):
    """Accept only valid JSON outputs."""
    import json
    try:
        json.loads(pred.output)
        return 1.0
    except (json.JSONDecodeError, TypeError):
        return 0.0

Graduated reward

Return partial scores to help Refine pick the best attempt even when none fully succeed:

def quality_score(args, pred):
    """Score answer quality on multiple criteria."""
    score = 0.0
    answer = pred.answer

    # Criterion 1: not empty
    if answer.strip():
        score += 0.3

    # Criterion 2: reasonable length (20-200 words)
    word_count = len(answer.split())
    if 20 <= word_count <= 200:
        score += 0.4

    # Criterion 3: addresses the question
    if args["question"].split()[0].lower() in answer.lower():
        score += 0.3

    return score

Using external validation

import re

def valid_email_extraction(args, pred):
    """Reward valid email addresses extracted from text."""
    emails = pred.emails if isinstance(pred.emails, list) else []
    if not emails:
        return 0.0
    email_pattern = r'^[\w.-]+@[\w.-]+\.\w+$'
    valid_count = sum(1 for e in emails if re.match(email_pattern, e))
    return valid_count / len(emails)

How iteration count (N) works

Each attempt runs the wrapped module at temperature=1.0 with a different rollout ID, producing diverse outputs. Refine's selection logic:

Run the module and score the output with reward_fn
If the score meets or exceeds threshold, return immediately
If not, generate feedback from the failure and try again
After N attempts, return the attempt with the highest reward score

Choosing N:

The sweet spot for most use cases is N=3 to N=5. Beyond 5, diminishing returns are common unless the reward function is very specific.

The feedback mechanism

What makes Refine different from random retries is feedback generation. When an attempt fails to meet the threshold:

Refine examines why the attempt scored below the threshold
It generates natural-language feedback describing the shortcoming
This feedback is included in the prompt for the next attempt
The LM uses this feedback to produce a better output

This means later attempts are informed by earlier failures. Attempt 3 knows what went wrong in attempts 1 and 2.

You do not write the feedback logic -- Refine handles it automatically based on your reward function's scores.

Refine vs BestOfN -- when to use which

Both modules run a wrapped module multiple times and select the best output, but they work differently:

Use Refine when the LM can improve with feedback -- writing tasks, format compliance, multi-criteria quality.

Use BestOfN when attempts are independent and feedback would not help -- creative generation, sampling diverse options, simple binary checks.

Wrapping custom modules

Refine works with any dspy.Module, not just built-in ones:

class Summarizer(dspy.Module):
    def __init__(self):
        self.summarize = dspy.ChainOfThought("article -> summary")

    def forward(self, article):
        return self.summarize(article=article)


def good_summary(args, pred):
    """Score summary quality."""
    summary = pred.summary
    article = args["article"]
    score = 0.0

    # Shorter than original
    if len(summary) < len(article) * 0.3:
        score += 0.5

    # At least 2 sentences
    if summary.count('.') >= 2:
        score += 0.5

    return score


refined_summarizer = dspy.Refine(
    module=Summarizer(),
    N=3,
    reward_fn=good_summary,
    threshold=0.8,
)

result = refined_summarizer(article="Long article text here...")
print(result.summary)

Gotchas

Claude uses binary reward functions (0 or 1) instead of graduated scores. Binary rewards mean Refine cannot distinguish between a near-miss and a total failure when no attempt hits the threshold. Use graduated floats (0.0 to 1.0) with partial credit for each criterion so Refine returns the best near-miss.
Claude puts LM calls inside reward functions for "smarter" scoring. The reward function runs on every attempt (up to N times), so an LM call inside it doubles or triples your cost. Use deterministic checks (regex, AST parsing, length checks) in reward functions. Reserve LM-as-judge for the final evaluation after Refine returns.
Claude sets threshold=1.0 with multi-criteria reward functions. If your reward function scores 4 weighted criteria, achieving a perfect 1.0 is unlikely. Refine then burns through all N attempts and returns a near-miss anyway. Set the threshold to 0.8 or the realistic "good enough" score for your criteria.
Claude confuses Refine with BestOfN and uses them interchangeably. The key difference is feedback: Refine generates feedback from failures and feeds it into subsequent attempts (sequential, improving). BestOfN runs independent attempts with no cross-attempt learning. Use Refine when later attempts can learn from earlier failures; use BestOfN when attempts are inherently independent.
Claude wraps a Predict module with Refine for tasks that need reasoning. Refine improves outputs through feedback, but dspy.Predict does not expose a reasoning field. Use dspy.ChainOfThought as the inner module so the feedback loop has reasoning to critique and improve.

Cross-references

Install any skill: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>

BestOfN for independent sampling without feedback -- see /dspy-best-of-n
Chain of thought reasoning as the inner module -- see /dspy-chain-of-thought
Checking and validating outputs with assertions -- see /ai-checking-outputs
Improving accuracy with optimization -- see /ai-improving-accuracy
Writing content that benefits from iterative refinement -- see /ai-writing-content
Install /ai-do if you do not have it — it routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do

Additional resources

dspy.Refine API docs
For API details, see reference.md
For worked examples, see examples.md

Related Skills

lebsral/ai-watching-optimization

tools

VerifiedTrustedCommunity

See what is happening during optimizer.compile() instead of waiting blind. Use when you want to watch optimization progress, see scores as they come in, know if your optimizer is working, check if optimization is stuck, understand why optimization is taking too long, get live progress during compile, monitor convergence, detect overfitting during optimization, interpret optimization results, or pick the right tool for watching optimization. Also used for optimizer progress bar, is my optimizer doing anything, optimization seems stuck, how long will optimization take, watch GEPA run, watch MIPROv2 run, live optimization dashboard, optimizer not improving, scores not going up, optimization taking forever, see what optimizer is doing, debug slow optimization, optimization visibility, optimizer metrics, track compile progress, optimization observability.

6SKILL.mdUpdated May 31, 2026

lebsral/ai-watching-optimization

lebsral/dspy-miprov2

testing

VerifiedTrustedCommunity

Use when you want the highest-quality prompt optimization DSPy offers — jointly optimizes instructions and few-shot demos, with auto=light/medium/heavy presets. Common scenarios - you want the best possible accuracy from prompt optimization, jointly tuning instructions and few-shot demonstrations, using auto presets for different compute budgets, or when COPRO or BootstrapFewShot alone are not reaching your accuracy target. Related - ai-improving-accuracy, dspy-copro, dspy-bootstrap-few-shot. Also used for dspy.MIPROv2, best DSPy optimizer, highest quality optimization, auto=light medium heavy, joint instruction and demo optimization, most powerful prompt optimizer, MIPROv2 vs COPRO vs BootstrapFewShot, which optimizer should I use, state of the art prompt optimization, when to use MIPROv2, optimize both instructions and examples, heavy optimization for production, best optimizer for accuracy.

6SKILL.mdUpdated Apr 27, 2026

lebsral/dspy-langwatch

testing

VerifiedTrustedCommunity

Use LangWatch for DSPy auto-tracing and real-time optimizer progress. Use when you want to set up LangWatch, langwatch.dspy.init, auto-tracing DSPy, real-time optimization dashboard, optimizer progress tracking, app.langwatch.ai, or DSPy optimizer dashboard. Also used for langwatch setup, pip install langwatch, langwatch trace, optimizer progress, real-time optimization, watch optimizer run, LangWatch self-hosted, langwatch docker, langwatch vs langtrace, langwatch autotrack_dspy.

6SKILL.mdUpdated Apr 27, 2026

lebsral/dspy-langwatch

lebsral/dspy-gepa

data-ai

VerifiedTrustedCommunity

Use when you want to optimize instructions without few-shot examples — a lightweight alternative to COPRO when you do not have or do not want to use demonstrations. Common scenarios - optimizing instructions when you do not have or do not want to use few-shot demonstrations, lightweight instruction search as a first step, tasks where examples in the prompt confuse the model, or when you want fast instruction optimization without the cost of COPRO. Related - ai-improving-accuracy, dspy-copro, dspy-miprov2. Also used for dspy.GEPA, instruction optimization without demos, lightweight prompt optimization, optimize instructions only, no few-shot examples needed, GEPA vs COPRO, quick instruction search, when demonstrations hurt performance, zero-shot optimization, instruction-only optimizer, simplest instruction tuner, fast prompt optimization, skip few-shot and just tune instructions, optimize Pydantic field descriptions, GEPA structured output, GEPA does not optimize field desc.

6SKILL.mdUpdated Apr 27, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/lebsral/dspy-programming-not-prompting-lms-skills.git

# Copy into Claude Code skills folder (global)
cp -r dspy-programming-not-prompting-lms-skills/skills/dspy-refine ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

lebsral/dspy-programming-not-prompting-lms-skills

5 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT