skills/dspy-multi-chain-comparison/SKILL.md
Use when you want higher accuracy by generating multiple reasoning chains and selecting the best answer — trading speed for quality on critical outputs. Common scenarios - high-stakes decisions where you want multiple reasoning paths compared, classification tasks where one chain of thought is not reliable enough, improving accuracy by generating several answers and selecting the best-reasoned one, or tasks where different reasoning approaches yield different answers. Related - ai-reasoning, ai-improving-accuracy, dspy-chain-of-thought. Also used for dspy.MultiChainComparison, compare multiple reasoning chains, select best reasoning path, multi-path reasoning, vote across chain-of-thought outputs, more reliable than single CoT, deliberation for hard problems, when one reasoning chain is not enough, robust reasoning through comparison, ensemble reasoning, trade speed for accuracy on critical tasks.
Guide the user through using dspy.MultiChainComparison to improve answer quality. Instead of relying on a single chain of thought, this module generates several independent reasoning chains and then selects the best final answer by comparing them.
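dspy.MultiChainComparison is a DSPy module that:

- generates M independent chain-of-thought reasoning chains for the same input
- compares the reasoning and answers across those chains
- returns the best-supported final answer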
Think of it as getting multiple opinions from different experts, then having a judge pick the most convincing one. The diversity of reasoning paths surfaces better answers than any single chain alone.
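Use it when:

- the decision is high-stakes and you want multiple reasoning paths compared
- a single chain of thought is not reliable enough, for example on hard classification tasks
- different reasoning approaches tend to yield different answers

Do NOT use it when:

- the task is simple classification or extraction (dspy.Predict is enough)
- latency matters more than the extra accuracy (use dspy.ChainOfThought)
- there is a single clearly correct approach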
MultiChainComparison works with any signature, just like ChainOfThought:
```python
import dspy

lm = dspy.LM("openai/gpt-4o-mini")  # or "anthropic/claude-sonnet-4-5-20250929", etc.
dspy.configure(lm=lm)

# Inline signature -- same as you'd use with ChainOfThought
recommend = dspy.MultiChainComparison("problem -> recommendation")
result = recommend(problem="We need to migrate from MongoDB to PostgreSQL for our 50GB dataset with complex joins")
print(result.recommendation)
```
With a class-based signature:
```python
import dspy


class TechRecommendation(dspy.Signature):
    """Recommend the best technical approach for this problem."""

    problem: str = dspy.InputField(desc="Technical problem or decision to make")
    constraints: str = dspy.InputField(desc="Budget, timeline, or technical constraints")
    recommendation: str = dspy.OutputField(desc="The recommended approach with justification")


recommend = dspy.MultiChainComparison(TechRecommendation)
result = recommend(
    problem="Our API response times are over 2 seconds under load",
    constraints="Small team, no budget for new infrastructure",
)
print(result.recommendation)
```
When you call a MultiChainComparison module, DSPy does the following:
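1. Runs M independent ChainOfThought calls -- each produces its own reasoning and output fields
2. Passes every chain's reasoning and answer to a single comparison step
3. Returns the answer the comparison step selects

The comparison step is the key differentiator. Rather than picking randomly or voting, the model actively evaluates the quality of each reasoning chain before choosing.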
```
Input --> CoT Chain 1 --> reasoning_1 + answer_1 --|
      --> CoT Chain 2 --> reasoning_2 + answer_2 --|--> Comparison --> best answer
      --> CoT Chain 3 --> reasoning_3 + answer_3 --|
```
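The flow in the diagram can be sketched in plain Python without any LM calls. This is an illustration only: `generate_chain` and `compare_chains` are hypothetical stand-ins for the sampled ChainOfThought calls and the LM-driven comparison step, and the toy judge simply prefers the longest reasoning. It does not reproduce DSPy's internals. At least some DSPy releases expect the candidate chains to be passed to the module as a `completions` argument rather than sampled internally, so it is worth checking the version you have installed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chain:
    reasoning: str
    answer: str


def generate_chain(question: str, seed: int) -> Chain:
    """Hypothetical stand-in for one sampled ChainOfThought call."""
    canned = [
        Chain("Joins are the bottleneck.", "PostgreSQL"),
        Chain("Joins are the bottleneck, and 50GB fits on one node.", "PostgreSQL"),
        Chain("Keep the current stack.", "MongoDB"),
    ]
    return canned[seed % len(canned)]


def compare_chains(chains: List[Chain]) -> Chain:
    """Toy judge: prefer the chain with the longest reasoning.

    In the real module this step is another LM call that reads every chain.
    """
    return max(chains, key=lambda c: len(c.reasoning))


def multi_chain_compare(question: str, m: int = 3) -> Chain:
    # Fan out: m independent chains for the same input.
    chains = [generate_chain(question, seed) for seed in range(m)]
    # Fan in: a single comparison step picks the final answer.
    return compare_chains(chains)


best = multi_chain_compare("MongoDB or PostgreSQL for join-heavy data?")
print(best.answer)
```

The point of the sketch is the shape: the final answer comes from one judgment over all chains, not from a vote count.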
By default, MultiChainComparison generates 3 chains. You can adjust M and temperature:
```python
# Constructor signature
dspy.MultiChainComparison(signature, M=3, temperature=0.7, **config)
```
- M -- number of independent reasoning chains (default 3)
- temperature -- sampling temperature for chain generation (default 0.7). Higher values produce more diverse chains, which gives the comparison step more to work with.

Guidelines for choosing M:

| M value | LM calls | Best for |
|---------|----------|----------|
| 2 | 3 (2 chains + 1 comparison) | Slight quality boost over single CoT |
| 3 | 4 (3 chains + 1 comparison) | Good default, balances quality and cost |
| 5 | 6 (5 chains + 1 comparison) | High-stakes tasks where accuracy is critical |
| 7+ | 8+ | Diminishing returns for most tasks |
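The call counts in the table follow from one call per chain plus one comparison call. A minimal sketch of that arithmetic, where the helper name `lm_calls` is hypothetical and not part of DSPy:

```python
def lm_calls(m: int) -> int:
    """LM calls for one MultiChainComparison invocation: m chains + 1 comparison."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return m + 1


for m in (2, 3, 5, 7):
    print(f"M={m}: {lm_calls(m)} LM calls")
```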
Wrap it in a dspy.Module to combine with other steps:
```python
import dspy
from typing import Literal


class RiskAssessment(dspy.Signature):
    """Assess the risk level of this proposed change."""

    change_description: str = dspy.InputField(desc="What is being changed")
    system_context: str = dspy.InputField(desc="The system being modified")
    risk_level: Literal["low", "medium", "high", "critical"] = dspy.OutputField()
    risk_factors: str = dspy.OutputField(desc="Key risks identified")
    mitigation: str = dspy.OutputField(desc="Recommended mitigation steps")


class ChangeReviewer(dspy.Module):
    def __init__(self):
        super().__init__()
        # Cheap single-call step that labels the kind of change
        self.classify = dspy.Predict("change_description -> change_type: str")
        # Multi-chain step for the risk judgment itself
        self.assess = dspy.MultiChainComparison(RiskAssessment, M=3)

    def forward(self, change_description, system_context):
        change_type = self.classify(change_description=change_description).change_type
        # Prefix the description with the change type so every chain sees it
        result = self.assess(
            change_description=f"[{change_type}] {change_description}",
            system_context=system_context,
        )
        return dspy.Prediction(
            change_type=change_type,
            risk_level=result.risk_level,
            risk_factors=result.risk_factors,
            mitigation=result.mitigation,
        )


# Usage
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

reviewer = ChangeReviewer()
result = reviewer(
    change_description="Drop the users_v1 table and migrate all queries to users_v2",
    system_context="Production e-commerce platform with 10k daily active users",
)
print(f"Risk: {result.risk_level}")
print(f"Factors: {result.risk_factors}")
print(f"Mitigation: {result.mitigation}")
```
MultiChainComparison trades speed and cost for quality. Here is a rough comparison with M=3:
| Aspect | ChainOfThought | MultiChainComparison (M=3) |
|--------|---------------|---------------------------|
| LM calls | 1 | 4 (3 chains + 1 comparison) |
| Latency | 1x | ~3-4x when the chains run one after another (less if they run in parallel) |
| Cost | 1x | ~4x |
| Quality | Good | Better on ambiguous/complex tasks |
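A rough way to reason about the latency and cost rows, assuming every LM call takes about the same time and costs about the same (a simplification, since the comparison prompt is longer than a single chain's). The function names and the `parallel` flag are hypothetical, introduced only for this sketch:

```python
def relative_latency(m: int, parallel: bool) -> int:
    """Latency in units of one LM call, relative to a single ChainOfThought."""
    # Concurrent chains overlap in time; sequential chains add up.
    chain_time = 1 if parallel else m
    # The comparison call can only start once every chain has finished.
    return chain_time + 1


def relative_cost(m: int) -> int:
    """Cost in units of one LM call; running chains in parallel does not reduce it."""
    return m + 1


print(relative_latency(3, parallel=False))
print(relative_latency(3, parallel=True))
print(relative_cost(3))
```

Under these assumptions, parallelism helps latency but never cost, which is why the cost strategies focus on model choice and routing.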
Strategies to manage cost:
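```python
cheap_lm = dspy.LM("openai/gpt-4o-mini")  # or any smaller model
expensive_lm = dspy.LM("openai/gpt-4o")   # or "anthropic/claude-sonnet-4-5-20250929", etc.

dspy.configure(lm=cheap_lm)  # default for chains
pipeline = ChangeReviewer()

# The comparison predict inside MultiChainComparison can be set separately
# by accessing the internal predict module.
# Assumption: the comparison predictor is exposed as `.predict` and honors a
# per-module `lm` attribute; verify this against your installed DSPy version.
pipeline.assess.predict.lm = expensive_lm
```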
```python
class AdaptiveReasoner(dspy.Module):
    def __init__(self):
        super().__init__()
        self.classify_difficulty = dspy.Predict("question -> difficulty: str")
        self.fast = dspy.ChainOfThought("question -> answer")
        self.thorough = dspy.MultiChainComparison("question -> answer", M=3)

    def forward(self, question):
        difficulty = self.classify_difficulty(question=question).difficulty.lower()
        # Only pay for multiple chains when the question looks hard
        if "hard" in difficulty or "complex" in difficulty:
            return self.thorough(question=question)
        return self.fast(question=question)
```
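The routing decision in `forward` is just a string check on the classifier's free-text output, so it can be tested without an LM. A minimal sketch, where `needs_thorough` and `HARD_KEYWORDS` are hypothetical names introduced for illustration:

```python
HARD_KEYWORDS = ("hard", "complex")  # illustrative; mirrors the check in AdaptiveReasoner


def needs_thorough(difficulty: str) -> bool:
    """Return True when the difficulty label should go to MultiChainComparison."""
    label = difficulty.strip().lower()
    return any(keyword in label for keyword in HARD_KEYWORDS)


for label in ("Easy", "HARD", "Very complex", "medium"):
    print(label, "->", "thorough" if needs_thorough(label) else "fast")
```

Because the `difficulty` field is typed as `str`, the model can return labels outside the expected vocabulary. Constraining it with a `Literal` type, as `RiskAssessment` does for `risk_level`, would make the check exact.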
MultiChainComparison modules are optimizable like any other DSPy module. Optimizers tune the prompts for both the chain generation and comparison steps:
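```python
import dspy


def quality_metric(example, prediction, trace=None):
    # Exact match after trimming whitespace and ignoring case
    return prediction.answer.strip().lower() == example.answer.strip().lower()


# trainset: a list of dspy.Example objects with `question` and `answer` fields,
# each marked with .with_inputs("question")
program = dspy.MultiChainComparison("question -> answer", M=3)
optimizer = dspy.BootstrapFewShot(metric=quality_metric, max_bootstrapped_demos=4)
optimized = optimizer.compile(program, trainset=trainset)

# Save for production
optimized.save("optimized_mcc.json")
```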
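The metric only needs objects with an `answer` attribute, so it can be sanity-checked before spending LM calls on optimization. A small sketch using `types.SimpleNamespace` as a stand-in for `dspy.Example` and `dspy.Prediction`; the sample values are made up for illustration:

```python
from types import SimpleNamespace


def quality_metric(example, prediction, trace=None):
    # Exact match after trimming whitespace and ignoring case.
    return prediction.answer.strip().lower() == example.answer.strip().lower()


gold = SimpleNamespace(answer="PostgreSQL")
print(quality_metric(gold, SimpleNamespace(answer="  postgresql ")))  # True
print(quality_metric(gold, SimpleNamespace(answer="MongoDB")))        # False
```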
For best results with MIPROv2:
```python
optimizer = dspy.MIPROv2(metric=quality_metric, auto="medium")
optimized = optimizer.compile(program, trainset=trainset)
```
Pick a simpler alternative when:
| Situation | Use instead |
|-----------|-------------|
| Simple classification or extraction | dspy.Predict |
| Needs reasoning but latency matters | dspy.ChainOfThought |
| Math or computation tasks | dspy.ProgramOfThought |
| Need tool use or API calls | dspy.ReAct |
| Want retries with self-correction | dspy.Refine + ChainOfThought |
MultiChainComparison is most valuable when the problem genuinely benefits from diverse perspectives -- not when there is a single clearly correct approach.
- Mind the temperature parameter. MultiChainComparison relies on diverse chains -- with temperature=0 (or very low), chains produce near-identical outputs and the comparison step adds cost with no quality gain. Keep the default temperature=0.7 or set it higher for more diversity.
- Use dspy.Predict or dspy.ChainOfThought for simple tasks and reserve MultiChainComparison for genuinely ambiguous or high-stakes decisions.
- When optimizing, use auto="light" for MIPROv2 or keep trial counts low.

Install any skill:
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>
- /dspy-chain-of-thought
- /ai-reasoning
- /ai-improving-accuracy
- Install /ai-do if you do not have it -- it routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do