skills/ai-reasoning/SKILL.md
Make AI solve hard problems that need planning and multi-step thinking. Use when your AI fails on complex questions, needs to break down problems, requires multi-step logic, needs to plan before acting, gives wrong answers on math or analysis tasks, or when a simple prompt is not enough for the reasoning required. Covers ChainOfThought, ProgramOfThought, MultiChainComparison, and Self-Discovery reasoning patterns in DSPy. Also used for AI gives shallow answers, LLM does not think before answering, chain of thought prompting, make AI show its work, AI fails at math, complex analysis with LLM, multi-step problem solving, AI reasoning errors, LLM logic mistakes, think step by step DSPy, AI cannot do basic arithmetic, deep reasoning with language models, self-consistency for better answers, tree of thought.
Guide the user through making AI solve problems that need more than a simple answer. When a task requires planning, multi-step logic, or choosing the right approach, basic prompting fails. DSPy gives you composable reasoning strategies.
Use this decision tree:
| Task type | Example | Best approach |
|-----------|---------|---------------|
| Simple lookup / classification | "Is this email spam?" | dspy.Predict |
| Needs explanation or logic | "Why did the build fail?" | dspy.ChainOfThought |
| Math, counting, computation | "What's the total after discounts?" | dspy.ProgramOfThought |
| Needs to compare approaches | "Which database is best for this?" | dspy.MultiChainComparison |
| Complex multi-step, novel problems | "Plan a migration strategy" | Self-Discovery pattern |
If the user isn't sure, start with ChainOfThought — it's the right default for most tasks.
dspy.ChainOfThought is the workhorse. It adds an intermediate reasoning step before the final answer:
```python
import dspy

class AnalyzeBug(dspy.Signature):
    """Analyze the bug report and determine root cause."""
    bug_report: str = dspy.InputField(desc="The bug report with error details")
    root_cause: str = dspy.OutputField(desc="The most likely root cause")
    fix_suggestion: str = dspy.OutputField(desc="Suggested fix")

analyzer = dspy.ChainOfThought(AnalyzeBug)
result = analyzer(bug_report="Users see 500 errors after deploying v2.3...")
print(result.reasoning)  # shows step-by-step thinking
print(result.root_cause)
```
When the answer requires calculation, let the AI write and execute code:
```python
class CalculateMetrics(dspy.Signature):
    """Calculate business metrics from the provided data."""
    data_description: str = dspy.InputField(desc="Description of the data and what to calculate")
    result: str = dspy.OutputField(desc="The calculated result")

calculator = dspy.ProgramOfThought(CalculateMetrics)
result = calculator(data_description="Revenue was $50k in Jan, $63k in Feb, $58k in March. What's the average monthly growth rate?")
```
ProgramOfThought generates Python code, runs it in a sandbox, and returns the output. Use this for anything involving math, dates, data manipulation, or counting.
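
For the revenue question above, the program that ProgramOfThought writes might resemble the following plain-Python sketch. It is illustrative only: the variable names are hypothetical, and "average monthly growth rate" is assumed to mean the arithmetic mean of the month-over-month changes. The generated code could reasonably interpret the question differently.

```python
# Monthly revenue from the example question, in dollars.
revenue = [50_000, 63_000, 58_000]

# Month-over-month growth: (current - previous) / previous.
growth_rates = [
    (curr - prev) / prev
    for prev, curr in zip(revenue, revenue[1:])
]

# Assumption: "average" means the arithmetic mean of the monthly rates.
average_growth = sum(growth_rates) / len(growth_rates)

formatted_rates = [f"{rate:.1%}" for rate in growth_rates]
print("Monthly growth rates:", formatted_rates)
print(f"Average monthly growth: {average_growth:.2%}")
```

Under that assumption the two monthly changes are +26% and roughly -7.9%, which average to about 9%. Doing this in executed code rather than in free-form reasoning avoids arithmetic slips.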
When quality matters more than speed, reason multiple ways and compare:
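```python
class RecommendApproach(dspy.Signature):
    """Recommend the best technical approach for this problem."""
    problem: str = dspy.InputField()
    recommendation: str = dspy.OutputField()

problem = "We need to add real-time notifications to our app"
# MultiChainComparison compares attempts that you pass in, so draft them first.
draft = dspy.ChainOfThought(RecommendApproach)
# Varying the temperature keeps cached calls from returning identical drafts.
attempts = [draft(problem=problem, config={"temperature": 0.7 + 0.1 * i}) for i in range(3)]
recommender = dspy.MultiChainComparison(RecommendApproach, M=3)
result = recommender(completions=attempts, problem=problem)
```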
```python
class SmartReasoner(dspy.Module):
    """Route to the best reasoning strategy based on the task."""

    def __init__(self):
        super().__init__()
        self.classify = dspy.Predict("question -> task_type: str")
        self.cot = dspy.ChainOfThought("question -> answer")
        self.pot = dspy.ProgramOfThought("question -> answer")
        self.mcc = dspy.MultiChainComparison("question -> answer", M=3)

    def forward(self, question):
        task_type = self.classify(question=question).task_type.lower()
        if "math" in task_type or "calcul" in task_type or "count" in task_type:
            return self.pot(question=question)
        elif "compare" in task_type or "recommend" in task_type or "best" in task_type:
            # MultiChainComparison needs the candidate attempts passed in
            attempts = [
                self.cot(question=question, config={"temperature": 0.7 + 0.1 * i})
                for i in range(3)
            ]
            return self.mcc(completions=attempts, question=question)
        else:
            return self.cot(question=question)
```
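
Matching substrings of a free-text label is fragile, because the classifier can return labels the if-chain never anticipated. A minimal sketch of the dispatch side with a closed label set follows. The names TaskType, ROUTES, and pick_strategy are hypothetical, and the code is plain Python with no LM call; in DSPy, the same Literal type could annotate the classifier's output field so the label is constrained at the source.

```python
from typing import Literal, get_args

# Closed set of routes (hypothetical labels chosen for this sketch).
TaskType = Literal["math", "compare", "general"]
VALID_TASK_TYPES = get_args(TaskType)

# Hypothetical mapping from route label to the name of a reasoning module.
ROUTES = {
    "math": "ProgramOfThought",
    "compare": "MultiChainComparison",
    "general": "ChainOfThought",
}

def pick_strategy(label: str) -> str:
    """Return the module name for a label, falling back to ChainOfThought."""
    normalized = label.strip().lower()
    if normalized not in VALID_TASK_TYPES:
        normalized = "general"  # unknown labels take the default path
    return ROUTES[normalized]

print(pick_strategy("math"))
print(pick_strategy("a label nobody planned for"))
```

The explicit fallback keeps an unexpected label from silently matching the wrong branch: anything outside the closed set goes to the default strategy.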
For genuinely hard problems where the AI needs to figure out how to think, not just think harder. Inspired by Self-Discover prompting research.
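The 4-stage pipeline selects relevant strategies, adapts them to the task, creates a reasoning plan, and executes that plan: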
```python
import dspy
from pydantic import BaseModel, Field

# Reasoning strategy library
REASONING_STRATEGIES = [
    "Break the problem into smaller sub-problems",
    "Think about edge cases and exceptions",
    "Work backwards from the desired outcome",
    "Consider analogies to simpler problems",
    "Identify constraints and requirements first",
    "Generate multiple hypotheses and evaluate each",
    "Think about what information is missing",
    "Check if the problem has been solved before in a different context",
    "Separate facts from assumptions",
    "Consider the problem from different stakeholder perspectives",
]

class SelectStrategies(dspy.Signature):
    """Select the most relevant reasoning strategies for this task."""
    task: str = dspy.InputField(desc="The problem to solve")
    available_strategies: list[str] = dspy.InputField()
    selected_strategies: list[str] = dspy.OutputField(
        desc="2-4 most relevant strategies for this task"
    )

class AdaptStrategies(dspy.Signature):
    """Adapt the selected strategies to this specific task."""
    task: str = dspy.InputField()
    strategies: list[str] = dspy.InputField(desc="Selected reasoning strategies")
    adapted_strategies: list[str] = dspy.OutputField(
        desc="Strategies rewritten for this specific problem"
    )

class ReasoningStep(BaseModel):
    step_number: int
    strategy: str = Field(description="Which reasoning strategy this step uses")
    description: str = Field(description="What to do in this step")

class CreatePlan(dspy.Signature):
    """Create a structured step-by-step reasoning plan."""
    task: str = dspy.InputField()
    adapted_strategies: list[str] = dspy.InputField()
    plan: list[ReasoningStep] = dspy.OutputField(desc="Ordered reasoning steps")

class ExecutePlan(dspy.Signature):
    """Execute the reasoning plan to solve the task."""
    task: str = dspy.InputField()
    plan: list[ReasoningStep] = dspy.InputField()
    step_results: list[str] = dspy.OutputField(desc="Result of each reasoning step")
    final_answer: str = dspy.OutputField(desc="The final answer based on all reasoning")

class SelfDiscoveryReasoner(dspy.Module):
    def __init__(self):
        super().__init__()
        self.select = dspy.ChainOfThought(SelectStrategies)
        self.adapt = dspy.ChainOfThought(AdaptStrategies)
        self.plan = dspy.ChainOfThought(CreatePlan)
        self.execute = dspy.ChainOfThought(ExecutePlan)

    def forward(self, question):
        # The input is named `question` so the module accepts the same examples
        # as the metric and evaluation code, which read example.question.
        task = question

        # Stage 1: Select relevant strategies
        selected = self.select(
            task=task,
            available_strategies=REASONING_STRATEGIES,
        ).selected_strategies

        # Stage 2: Adapt to this task
        adapted = self.adapt(
            task=task,
            strategies=selected,
        ).adapted_strategies

        # Stage 3: Create reasoning plan
        plan = self.plan(
            task=task,
            adapted_strategies=adapted,
        ).plan

        # Stage 4: Execute the plan
        result = self.execute(task=task, plan=plan)

        return dspy.Prediction(
            strategies=selected,
            plan=plan,
            step_results=result.step_results,
            answer=result.final_answer,
        )
```
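
The SelectStrategies signature asks for 2-4 strategies drawn from the library, but nothing enforces that. A plain-Python check along the following lines could be run on the stage-1 output during development. The function name, the sample library, and the sample selections are hypothetical.

```python
def check_selected_strategies(selected, available, min_count=2, max_count=4):
    """Return a list of problems with a stage-1 selection (empty if none)."""
    problems = []
    if not (min_count <= len(selected) <= max_count):
        problems.append(
            f"expected {min_count}-{max_count} strategies, got {len(selected)}"
        )
    unknown = [s for s in selected if s not in available]
    if unknown:
        problems.append(f"strategies not in the library: {unknown}")
    if len(set(selected)) != len(selected):
        problems.append("duplicate strategies selected")
    return problems

# Hypothetical library and selections, for illustration only.
library = [
    "Break the problem into smaller sub-problems",
    "Work backwards from the desired outcome",
    "Separate facts from assumptions",
]
good = check_selected_strategies(library[:2], library)
bad = check_selected_strategies(["Guess and hope"], library)
print(good)
print(bad)
```

Exact string matching is a deliberate simplification here. An LM may paraphrase a strategy, in which case a looser comparison would be needed.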
For complex tasks, force the AI to show its work in a structured format:
```python
import dspy
from pydantic import BaseModel, Field

class ReasoningTrace(BaseModel):
    step: str = Field(description="What this reasoning step does")
    observation: str = Field(description="What was observed or concluded")
    confidence: float = Field(description="0.0-1.0 confidence in this step")

# Define the signature before the module that uses it
class ReasonWithTrace(dspy.Signature):
    """Solve the problem step by step, showing reasoning at each stage."""
    question: str = dspy.InputField()
    trace: list[ReasoningTrace] = dspy.OutputField(desc="Step-by-step reasoning trace")
    answer: str = dspy.OutputField(desc="Final answer based on the reasoning trace")

class StructuredReasoner(dspy.Module):
    def __init__(self):
        super().__init__()
        self.reason = dspy.ChainOfThought(ReasonWithTrace)

    def forward(self, question):
        return self.reason(question=question)
```
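
One possible use of the per-step confidence field is to flag the weakest step for review. The sketch below uses a dataclass as a plain-Python stand-in for the ReasoningTrace model; the TraceStep and weakest_step names, the 0.5 threshold, and the sample trace are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TraceStep:
    """Plain-Python stand-in for the ReasoningTrace model."""
    step: str
    observation: str
    confidence: float

def weakest_step(trace, threshold=0.5):
    """Return the lowest-confidence step if it is below threshold, else None."""
    if not trace:
        return None
    lowest = min(trace, key=lambda s: s.confidence)
    return lowest if lowest.confidence < threshold else None

# Made-up trace for illustration.
trace = [
    TraceStep("Read the error log", "Timeouts start right after the deploy", 0.9),
    TraceStep("Guess the cause", "Probably a connection pool limit", 0.4),
]
flagged = weakest_step(trace)
if flagged is not None:
    print(f"Review this step: {flagged.step} (confidence {flagged.confidence})")
```

Self-reported confidence is not calibrated, so a flag like this is best treated as a hint about where to look and not as a measurement.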
Don't just check the final answer — evaluate the reasoning process:
```python
class JudgeReasoning(dspy.Signature):
    """Judge whether the reasoning process is sound."""
    question: str = dspy.InputField()
    reasoning_steps: list[str] = dspy.InputField(desc="The steps taken to reach the answer")
    answer: str = dspy.InputField()
    steps_are_logical: bool = dspy.OutputField(desc="Each step follows from the previous")
    no_logical_leaps: bool = dspy.OutputField(desc="No unjustified jumps in reasoning")
    answer_follows: bool = dspy.OutputField(desc="The answer follows from the reasoning")

def reasoning_quality_metric(example, prediction, trace=None):
    # Check final answer correctness
    correct = prediction.answer.strip().lower() == example.answer.strip().lower()

    # Also check reasoning quality
    judge = dspy.Predict(JudgeReasoning)
    quality = judge(
        question=example.question,
        reasoning_steps=prediction.step_results if hasattr(prediction, 'step_results') else [prediction.reasoning],
        answer=prediction.answer,
    )
    reasoning_score = (
        quality.steps_are_logical + quality.no_logical_leaps + quality.answer_follows
    ) / 3

    # Weight: 60% correct answer, 40% good reasoning
    return (0.6 * correct) + (0.4 * reasoning_score)
```
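
To see how the 60/40 weighting behaves, the arithmetic can be isolated from the LM judge. This is a sketch only: combined_score is a hypothetical helper that takes the judge's three booleans directly.

```python
def combined_score(correct, steps_are_logical, no_logical_leaps, answer_follows):
    """Blend answer correctness (60%) with judged reasoning quality (40%)."""
    # Booleans count as 0 or 1, so the sum is the number of checks passed.
    reasoning_score = (steps_are_logical + no_logical_leaps + answer_follows) / 3
    return 0.6 * correct + 0.4 * reasoning_score

# Correct answer, but only one of the three reasoning checks passes.
right_answer_shaky_reasoning = combined_score(True, True, False, False)

# Wrong answer, but all three reasoning checks pass.
wrong_answer_sound_reasoning = combined_score(False, True, True, True)

print(round(right_answer_shaky_reasoning, 3))
print(round(wrong_answer_sound_reasoning, 3))
```

A wrong answer with flawless reasoning tops out at 0.4, while a correct answer never scores below 0.6, so any pass threshold between those two values effectively requires a correct answer.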
Test which reasoning strategy works best for your task:
```python
from dspy.evaluate import Evaluate

evaluator = Evaluate(devset=devset, metric=reasoning_quality_metric, num_threads=4)

# Test different approaches
cot = dspy.ChainOfThought("question -> answer")
pot = dspy.ProgramOfThought("question -> answer")
self_disc = SelfDiscoveryReasoner()

print("ChainOfThought:", evaluator(cot))
print("ProgramOfThought:", evaluator(pot))
print("SelfDiscovery:", evaluator(self_disc))
```
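
Conceptually, the comparison above runs every program over the same dev set and averages the metric. The following hand-rolled loop sketches that idea in plain Python. It is not DSPy's Evaluate implementation, and the dev set, metric, and toy programs are hypothetical stand-ins that make no LM calls.

```python
def average_score(program, devset, metric):
    """Return the mean metric value over the devset as a percentage."""
    scores = [metric(example, program(example["question"])) for example in devset]
    return 100 * sum(scores) / len(scores)

# Hypothetical two-item dev set.
devset = [
    {"question": "2 + 2", "answer": "4"},
    {"question": "3 * 5", "answer": "15"},
]

def exact_match(example, prediction):
    return prediction.strip() == example["answer"]

# Two toy "programs" standing in for reasoning modules.
def always_four(question):
    return "4"

KNOWN_ANSWERS = {"2 + 2": "4", "3 * 5": "15"}

def lookup(question):
    return KNOWN_ANSWERS[question]

results = {
    program.__name__: average_score(program, devset, exact_match)
    for program in (always_four, lookup)
}
print(results)
```

Keeping the dev set and metric fixed across programs is what makes the scores comparable. Changing either between runs would invalidate the comparison.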
For multi-stage reasoning (like Self-Discovery), optimize each stage. Typical improvement: 15-30% on reasoning quality metrics (e.g., a ChainOfThought module going from 62% to 81% on a multi-step QA task after 4 bootstrapped demos):
```python
optimizer = dspy.BootstrapFewShot(
    metric=reasoning_quality_metric,
    max_bootstrapped_demos=4,
)
optimized = optimizer.compile(SelfDiscoveryReasoner(), trainset=trainset)
```
Automatically discover better instructions for the reasoning prompts:
```python
optimizer = dspy.MIPROv2(metric=reasoning_quality_metric, auto="medium")
optimized = optimizer.compile(SelfDiscoveryReasoner(), trainset=trainset)
```
GEPA analyzes traces of successful and failed attempts to generate better instructions:
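```python
# Recent DSPy releases expect a budget (auto=...), a reflection LM, and a five-argument metric
def gepa_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    return reasoning_quality_metric(gold, pred, trace)
optimizer = dspy.GEPA(metric=gepa_metric, auto="light", reflection_lm=dspy.LM("openai/gpt-4o"))  # placeholder model name
optimized = optimizer.compile(SelfDiscoveryReasoner(), trainset=trainset)
```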
| Module | When to consider |
|--------|-----------------|
| dspy.BestOfN | Generate N completions, return the one scoring highest on a metric — simpler than MultiChainComparison when you have a good metric |
| dspy.Refine | Iteratively improve an answer using feedback — good for tasks where a first draft is easy but polish is hard |
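| dspy.RLM | Recursive Language Model: lets the LM explore a large context programmatically through a sandboxed Python REPL with recursive sub-LM calls. Consider it when the input is too large to reason over in a single prompt |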
| dspy.Parallel | Run multiple modules concurrently — combine with reasoning modules to parallelize sub-problems |
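
The best-of-N idea from the table can be sketched without any LM: generate several candidates, score each, and keep the top one. This is a plain-Python illustration of the pattern, not the dspy.BestOfN implementation. The generate and score functions and the canned answers are hypothetical stand-ins for an LM call and a metric.

```python
def best_of_n(generate, score, n=3):
    """Call generate(attempt) n times and return the highest-scoring candidate."""
    candidates = [generate(attempt) for attempt in range(n)]
    return max(candidates, key=score)

# Canned outputs standing in for n separate LM completions.
CANNED_ANSWERS = [
    "Use polling",
    "Use WebSockets with a fallback to polling",
    "Use WebSockets",
]

def generate(attempt):
    return CANNED_ANSWERS[attempt % len(CANNED_ANSWERS)]

def score(answer):
    # Toy reward: prefer answers that mention a fallback.
    return 1.0 if "fallback" in answer else 0.0

best = best_of_n(generate, score, n=3)
print(best)
```

The pattern is only as good as the scoring function, which is why the table recommends it when a good metric is available.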
- Do not add a reasoning field to your signature when using ChainOfThought. DSPy injects the reasoning field automatically. Adding your own creates a duplicate that confuses the LM and produces garbled output. Just define your task-specific input/output fields and let dspy.ChainOfThought handle the rest.
- … dspy.Refine instead of MultiChainComparison with 3-5 chains.
- When you run BootstrapFewShot on a multi-stage module, it optimizes end-to-end, but the intermediate stages (select, adapt, plan) often get weak demos. Evaluate intermediate outputs during development to catch silent degradation in early stages.
- The SmartReasoner pattern with if "math" in task_type is brittle: LMs produce unpredictable classification labels. Use dspy.Predict with Literal types for routing, or better yet, let the optimizer discover which strategy works best via dspy.Evaluate comparisons.

Install any skill:
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>
- /dspy-chain-of-thought
- /dspy-program-of-thought
- /dspy-multi-chain-comparison
- /dspy-signatures
- /dspy-refine
- /dspy-predict
- /ai-taking-actions
- /ai-building-pipelines
- /ai-improving-accuracy
- /ai-do

Install /ai-do if you do not have it. It routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do