skills/dspy-ensemble/SKILL.md
Use when you have multiple optimized versions of a program and want to combine them — voting, averaging, or routing across program variants for more robust outputs. Common scenarios - you have optimized several versions of a program and want to combine the best ones, using majority voting across multiple programs for higher accuracy, building a robust system by routing to different specialized programs, or reducing variance by averaging outputs. Related - ai-improving-accuracy, ai-making-consistent, dspy-bootstrap-rs. Also used for dspy.Ensemble, combine multiple optimized programs, majority voting across models, ensemble of DSPy programs, voting for reliability, reduce variance with multiple programs, aggregate predictions, combine outputs from different optimizers, when one program is not reliable enough, model committee, ensemble for production robustness, multiple programs one answer.
Guide the user through using DSPy's Ensemble optimizer to combine multiple optimized programs into a single ensemble that aggregates their outputs. This is useful when you have run several optimization passes (different optimizers, different hyperparameters, different random seeds) and want to combine them for more robust predictions.
dspy.Ensemble is an optimizer (teleprompter) that takes a list of DSPy programs and returns a single EnsembledProgram. When you call the ensembled program, it runs each constituent program on the same inputs and aggregates the results using a reduce function you provide.
```
Program A ──┐
Program B ──┼──> Run all ──> reduce_fn ──> Single output
Program C ──┘
```
Unlike other optimizers that tune prompts or weights, Ensemble does not change the programs themselves. It combines their outputs at inference time.
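The data flow described above can be sketched in plain Python. This is only an illustration of the run-all-then-reduce idea, not DSPy's implementation: `make_ensemble`, the toy `program_*` callables, and `first_answer` are hypothetical names, and `SimpleNamespace` stands in for a prediction object.

```python
from types import SimpleNamespace


def make_ensemble(programs, reduce_fn=None):
    """Hypothetical stand-in for an ensembled program: run all, then reduce."""

    def ensembled(**inputs):
        # Every constituent program sees the same inputs.
        outputs = [program(**inputs) for program in programs]
        # With no reduce function, hand back the raw list of outputs.
        return reduce_fn(outputs) if reduce_fn else outputs

    return ensembled


# Toy "programs": plain callables returning objects with an `answer` field.
def program_a(question):
    return SimpleNamespace(answer="Paris")


def program_b(question):
    return SimpleNamespace(answer="Paris")


def program_c(question):
    return SimpleNamespace(answer="Lyon")


def first_answer(outputs):
    # Deliberately trivial reduce function, used only to show the data flow.
    return outputs[0]


programs = [program_a, program_b, program_c]
result = make_ensemble(programs, reduce_fn=first_answer)(question="Capital of France?")
raw = make_ensemble(programs)(question="Capital of France?")
print(result.answer, [output.answer for output in raw])
```

The constituent programs are never modified here; only their outputs are combined.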
Do not use Ensemble when:
```python
import random

import dspy

lm = dspy.LM("openai/gpt-4o-mini")  # or "anthropic/claude-sonnet-4-5-20250929", etc.
dspy.configure(lm=lm)

# 1. Define your base program
qa = dspy.ChainOfThought("question -> answer")

# 2. Create a training set and metric
trainset = [
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
    # ... more examples
]


def exact_match(example, pred, trace=None):
    return pred.answer.strip().lower() == example.answer.strip().lower()


# 3. Run multiple optimization passes to get different programs
programs = []
for i in range(3):
    # Shuffle per pass so each run can bootstrap different demos; identical
    # settings on identical data tend to produce near-identical programs.
    shuffled = random.Random(i).sample(trainset, len(trainset))
    optimizer = dspy.BootstrapFewShot(
        metric=exact_match,
        max_bootstrapped_demos=4,
        max_labeled_demos=4,
    )
    optimized = optimizer.compile(qa, trainset=shuffled)
    programs.append(optimized)

# 4. Combine with Ensemble using majority voting
ensemble_optimizer = dspy.Ensemble(reduce_fn=dspy.majority, size=None)
ensemble_program = ensemble_optimizer.compile(programs)

# 5. Use the ensemble like any module
result = ensemble_program(question="What is the capital of Germany?")
print(result.answer)
```
```python
dspy.Ensemble(
    reduce_fn=None,       # Function to aggregate outputs from all programs
    size=None,            # How many programs to sample (None = use all)
    deterministic=False,  # Must be False (deterministic mode not yet implemented)
)
```
| Parameter | Type | Description |
|-----------|------|-------------|
| reduce_fn | Callable \| None | Aggregation function applied to the list of outputs. If None, returns the raw list of predictions. |
| size | int \| None | Number of programs to randomly sample from the ensemble. None means use all programs. |
| deterministic | bool | Reserved for future use. Must be False. |
```python
ensemble_optimizer.compile(programs)
```
| Parameter | Type | Description |
|-----------|------|-------------|
| programs | list[dspy.Module] | List of DSPy programs to ensemble |
Returns an EnsembledProgram that runs the selected programs and applies reduce_fn.
The reduce function determines how outputs from multiple programs are combined into a single result.
The most common reduce function. It picks the most frequent output value across all programs -- majority voting.
```python
ensemble = dspy.Ensemble(reduce_fn=dspy.majority)
```
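As a rough illustration of what majority voting does, and not a copy of DSPy's own `dspy.majority`, the sketch below counts a normalized output field and returns the first prediction that carries the most common value. The `majority_vote` helper, the `field` parameter, and the lowercase/strip normalization are assumptions made for this example.

```python
from collections import Counter
from types import SimpleNamespace


def majority_vote(predictions, field="answer"):
    """Return the first prediction whose normalized field value is most common."""
    normalized = [str(getattr(p, field)).strip().lower() for p in predictions]
    winner, _count = Counter(normalized).most_common(1)[0]
    # Return an original prediction so its other fields are kept.
    return predictions[normalized.index(winner)]


# SimpleNamespace objects stand in for predictions from three programs.
predictions = [
    SimpleNamespace(answer="Berlin"),
    SimpleNamespace(answer="berlin "),
    SimpleNamespace(answer="Munich"),
]
voted = majority_vote(predictions)
print(voted.answer)
```

Normalizing before counting matters: without it, "Berlin" and "berlin " would count as different answers and the vote would be a three-way tie.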
Use dspy.majority when:
```python
def average_scores(predictions):
    """Average a numeric output field across all predictions."""
    scores = [float(p.score) for p in predictions]
    avg = sum(scores) / len(scores)
    # Return a Prediction-like object with the averaged score
    return predictions[0].__class__(score=str(avg))


ensemble = dspy.Ensemble(reduce_fn=average_scores)
```
```python
from collections import Counter


def weighted_vote(predictions):
    """Pick the answer returned by the most programs.

    Every program counts equally here; add per-program weights to the
    counts if some programs should be trusted more than others.
    """
    votes = Counter(p.answer for p in predictions)
    winner = votes.most_common(1)[0][0]
    # Return a prediction with the winning answer
    return predictions[0].__class__(answer=winner)


ensemble = dspy.Ensemble(reduce_fn=weighted_vote)
```
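If some programs deserve more trust than others, a reduce function could weight their votes. The sketch below is one possible shape, under two loud assumptions: predictions arrive in the same order as the programs passed to compile() (which would not hold when size subsamples programs), and the weights are numbers you pick yourself, for example dev-set accuracies. `make_weighted_vote` is a hypothetical helper, and `SimpleNamespace` stands in for a prediction.

```python
from collections import defaultdict
from types import SimpleNamespace


def make_weighted_vote(weights, field="answer"):
    """Build a reduce function that sums a per-program weight for each answer."""

    def reduce_fn(predictions):
        if len(predictions) != len(weights):
            raise ValueError("expected one weight per prediction")
        totals = defaultdict(float)
        for prediction, weight in zip(predictions, weights):
            totals[getattr(prediction, field)] += weight
        winner = max(totals, key=totals.get)
        # Return the first prediction that carries the winning answer.
        return next(p for p in predictions if getattr(p, field) == winner)

    return reduce_fn


# Hypothetical weights: the first program is trusted more than the other two.
reduce_fn = make_weighted_vote([0.9, 0.4, 0.4])
predictions = [
    SimpleNamespace(answer="4"),
    SimpleNamespace(answer="5"),
    SimpleNamespace(answer="5"),
]
print(reduce_fn(predictions).answer)
```

Here the single high-weight program (0.9) outvotes the two low-weight programs (0.4 + 0.4), which a plain count would not do.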
If you pass reduce_fn=None, the ensembled program returns the raw list of predictions from all programs. This is useful when you want to implement custom post-processing logic outside the ensemble.
```python
ensemble = dspy.Ensemble(reduce_fn=None)
ensemble_program = ensemble.compile(programs)

# Returns a list of predictions
all_predictions = ensemble_program(question="What is DSPy?")

# Process them yourself
for pred in all_predictions:
    print(pred.answer)
```
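One kind of post-processing the raw list makes possible is measuring how strongly the programs agree before trusting the vote. The sketch below assumes each prediction exposes an `answer` attribute. The `agreement` helper and the 60% threshold are hypothetical choices for illustration, and `SimpleNamespace` objects stand in for the returned predictions.

```python
from collections import Counter
from types import SimpleNamespace


def agreement(predictions, field="answer"):
    """Return (most common value, fraction of predictions that gave it)."""
    values = [getattr(p, field) for p in predictions]
    top_value, top_count = Counter(values).most_common(1)[0]
    return top_value, top_count / len(values)


# Stand-ins for the list returned when reduce_fn=None.
all_predictions = [
    SimpleNamespace(answer="A framework"),
    SimpleNamespace(answer="A framework"),
    SimpleNamespace(answer="A database"),
    SimpleNamespace(answer="A framework"),
]
answer, share = agreement(all_predictions)
# Hypothetical policy: flag the result when fewer than 60% of programs agree.
needs_review = share < 0.6
print(answer, share, needs_review)
```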
One of the most powerful uses of Ensemble is combining programs from different optimization strategies. Each optimizer may find different strengths.
```python
import dspy

lm = dspy.LM("openai/gpt-4o-mini")  # or "anthropic/claude-sonnet-4-5-20250929", etc.
dspy.configure(lm=lm)

qa = dspy.ChainOfThought("question -> answer")

# `trainset` and `exact_match` are the training set and metric from the first example
metric = exact_match

# Program 1: Optimized with BootstrapFewShot
opt1 = dspy.BootstrapFewShot(metric=metric, max_bootstrapped_demos=4)
prog1 = opt1.compile(qa, trainset=trainset)

# Program 2: Optimized with MIPROv2
opt2 = dspy.MIPROv2(metric=metric, auto="light")
prog2 = opt2.compile(qa, trainset=trainset)

# Program 3: Optimized with BootstrapFewShotWithRandomSearch
opt3 = dspy.BootstrapFewShotWithRandomSearch(
    metric=metric,
    max_bootstrapped_demos=4,
    num_candidate_programs=5,
)
prog3 = opt3.compile(qa, trainset=trainset)

# Ensemble all three
ensemble = dspy.Ensemble(reduce_fn=dspy.majority)
combined = ensemble.compile([prog1, prog2, prog3])

result = combined(question="What is the tallest mountain?")
print(result.answer)
```
This approach works because different optimizers explore different parts of the prompt space: BootstrapFewShot selects demonstrations, while MIPROv2 searches over instructions as well as demonstrations. Combining them via voting smooths out individual weaknesses, provided the programs do not all make the same mistakes.
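A back-of-the-envelope calculation shows why voting can help. The sketch below assumes an idealized setting that real ensembles only approximate: each program is independently correct with the same probability, and an answer is simply right or wrong. Programs that share an LM and training data make correlated errors, so actual gains are usually smaller. `majority_accuracy` is a hypothetical helper.

```python
from math import comb


def majority_accuracy(p, n):
    """P(more than half of n independent programs are correct), for odd n."""
    if n % 2 == 0:
        raise ValueError("use an odd n so a strict majority always exists")
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


single = 0.8  # hypothetical accuracy of each individual program
three = majority_accuracy(single, 3)
five = majority_accuracy(single, 5)
print(round(three, 3), round(five, 3))  # 0.896 0.942
```

Under these assumptions, three programs at 80% accuracy reach about 89.6% with majority voting, and five reach about 94.2%.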
When you have many optimized programs (e.g., from a large random search), you can use size to randomly sample a subset at inference time. This reduces cost while still benefiting from diversity.
```python
# You have 10 programs from BootstrapFewShotWithRandomSearch
programs = [...]  # 10 optimized programs

# Only run 3 of them per inference call (randomly sampled)
ensemble = dspy.Ensemble(reduce_fn=dspy.majority, size=3)
ensemble_program = ensemble.compile(programs)
```
Each call to ensemble_program randomly picks 3 of the 10 programs, runs them, and applies majority voting. This balances diversity against cost.
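The per-call sampling can be pictured with the standard library alone. This is a sketch of the behavior described above rather than DSPy's code: `sample_programs` is a hypothetical helper, the strings are placeholders for optimized programs, and the seed is there only to make the sketch repeatable.

```python
import random


def sample_programs(programs, size=None, rng=random):
    """Pick the programs to run for one call: all of them, or `size` at random."""
    if size is None:
        return list(programs)
    return rng.sample(programs, size)


program_names = [f"program_{i}" for i in range(10)]  # placeholders for 10 programs
rng = random.Random(0)  # seeded only to make this sketch repeatable
first_call = sample_programs(program_names, size=3, rng=rng)
second_call = sample_programs(program_names, size=3, rng=rng)
print(first_call, second_call)
```

Because the subset is drawn again on every call, the same input can be answered by different programs from one call to the next.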
Ensemble multiplies your inference cost and latency by the number of programs (or size if set):
| Programs | Cost multiplier | Latency (sequential) |
|----------|-----------------|----------------------|
| 3 | 3x | 3x |
| 5 | 5x | 5x |
| 10 | 10x | 10x |
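The multipliers in the table translate into a simple estimate. In the sketch below, `ensemble_cost` is a hypothetical helper and the per-call cost and latency figures are made-up inputs. It assumes programs run one after another and cost roughly the same per call.

```python
def ensemble_cost(num_programs, cost_per_call, latency_per_call, size=None):
    """Estimate per-request cost and sequential latency for an ensemble."""
    runs = num_programs if size is None else min(size, num_programs)
    return {
        "runs": runs,
        "cost": runs * cost_per_call,
        "latency_sequential": runs * latency_per_call,
    }


# Hypothetical numbers: $0.002 and 1.5 s per single-program call.
full = ensemble_cost(10, cost_per_call=0.002, latency_per_call=1.5)
capped = ensemble_cost(10, cost_per_call=0.002, latency_per_call=1.5, size=3)
print(full, capped)
```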
Ways to manage this:
- Set size to cap the number of programs run per inference call

Ensemble and BestOfN both combine multiple outputs, but they work differently:
| | Ensemble | BestOfN |
|---|---------|---------|
| What it combines | Different optimized programs | Multiple runs of the same program |
| Selection method | Voting / averaging across programs | Reward function picks the best single run |
| Diversity source | Different prompts/demos from optimization | Temperature sampling of the same prompt |
| When to use | You have multiple optimized programs | You have one program and a scoring metric |
| Optimizer type | Combines at the program level | Combines at the inference level |
You can even stack them: ensemble multiple optimized programs, then wrap the ensemble with BestOfN for additional quality.
- … compile(). Ensemble.compile() expects a list[dspy.Module], not a single module. Always wrap even two programs in a list: ensemble.compile([prog1, prog2]).
- … dspy.configure(lm=...) calls retain their LM binding. You do not need to re-configure the LM before calling the ensemble -- each program already knows which LM to use.
- … deterministic=True expecting reproducible sampling. The deterministic parameter is reserved but not yet implemented -- setting it to True raises an error. Leave it at the default False.
- … dspy.BestOfN instead.
- reduce_fn receives a list of dspy.Prediction objects and must return a dspy.Prediction (or compatible object). Returning a plain string breaks downstream field access.