skills/dspy-simba/SKILL.md
Use when you want conservative, incremental optimization — making small targeted improvements rather than large changes, useful for already-working programs that need fine-tuning. Common scenarios - your program already works well and you want to improve it without breaking what works, conservative optimization that preserves existing quality, fine-tuning a production program incrementally, or when aggressive optimization causes regressions. Related - ai-improving-accuracy, dspy-miprov2, dspy-refine. Also used for dspy.SIMBA, conservative optimization, incremental improvement, do not break what works, small targeted optimization, safe optimization for production, avoid regressions during optimization, production-safe optimizer, gentle optimization, when MIPROv2 changes too much, preserve existing quality, stable optimization, risk-averse prompt tuning, optimize without regressions.
npx skillsauth add lebsral/dspy-programming-not-prompting-lms-skills dspy-simbaInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Guide the user through using dspy.SIMBA (Stochastic Introspective Mini-Batch Ascent) to optimize DSPy programs through incremental, targeted improvements rather than large sweeping changes.
dspy.SIMBA is a DSPy optimizer that improves programs by analyzing mini-batches of examples, identifying where the program struggles most, and making small targeted fixes -- either adding demonstrations or generating self-reflective rules. Instead of rewriting the entire prompt at once, SIMBA takes conservative steps, focusing on the examples with the highest output variability.
Key properties:
Use dspy.SIMBA when:
Do not use SIMBA when:
dspy.BootstrapFewShot firstdspy.MIPROv2 insteaddspy.BootstrapFinetuneThree things are needed: a DSPy program, a metric function, and a training set.
import dspy
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini")) # or "anthropic/claude-sonnet-4-5-20250929", etc.
# 1. Define your program
classify = dspy.ChainOfThought("text -> label")
# 2. Define a metric
def metric(example, prediction, trace=None):
return prediction.label.lower() == example.label.lower()
# 3. Build training data
trainset = [
dspy.Example(text="Great product!", label="positive").with_inputs("text"),
dspy.Example(text="Terrible service.", label="negative").with_inputs("text"),
# ... more examples
]
# 4. Optimize with SIMBA
optimizer = dspy.SIMBA(metric=metric)
optimized = optimizer.compile(classify, trainset=trainset)
# 5. Use the optimized program
result = optimized(text="This exceeded my expectations!")
print(result.label)
# 6. Save for later
optimized.save("optimized_classifier.json")
SIMBA's optimization loop proceeds through repeated small steps:
SIMBA runs the current program pool on a mini-batch of examples. Each program runs with distinct LM configurations to produce diverse outputs, scored by your metric.
Examples are grouped and sorted by output variability -- the gap between the best and worst scores across runs. High-variability examples are where the program is inconsistent and has the most room for improvement.
For each high-variability example, SIMBA applies one of two strategies:
New candidate programs (with the added demos or rules) are evaluated on a fresh mini-batch. This prevents overfitting to the examples used for rule generation.
The best-performing candidates are added to the program pool. Future iterations select source programs using softmax sampling weighted by average scores -- favoring better programs while still exploring alternatives.
This cycle repeats for max_steps iterations, with each step making a small, targeted improvement.
dspy.SIMBA(
metric, # Scoring function (required)
bsize=32, # Mini-batch size
num_candidates=6, # New candidates per iteration
max_steps=8, # Number of optimization iterations
max_demos=4, # Max demonstrations per predictor
prompt_model=None, # LM for generating rules (defaults to global LM)
teacher_settings=None, # Teacher model configuration dict
demo_input_field_maxlen=100000, # Char limit for demo input fields
num_threads=None, # Parallel execution threads
temperature_for_sampling=0.2, # Temperature for trajectory sampling
temperature_for_candidates=0.2, # Temperature for source program selection
)
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| metric | Callable | (required) | Function (example, prediction, trace=None) -> float that scores outputs |
| bsize | int | 32 | Number of examples per mini-batch. Larger batches give more stable estimates but cost more LM calls |
| num_candidates | int | 6 | Candidate programs generated per iteration. More candidates explore more strategies but cost more |
| max_steps | int | 8 | Total optimization iterations. Each step samples a fresh mini-batch and produces new candidates |
| max_demos | int | 4 | Maximum few-shot demonstrations added to any predictor. Keeps prompts from growing too large |
| prompt_model | dspy.LM | None | LM used for generating introspective rules. Falls back to the globally configured LM if not set |
| teacher_settings | dict | None | Configuration dict for the teacher model |
| demo_input_field_maxlen | int | 100000 | Max characters for demo input fields. Reduce for tasks with very long inputs to keep prompts manageable |
| num_threads | int | None | Number of parallel threads for evaluation. Defaults to dspy.settings.num_threads |
| temperature_for_sampling | float | 0.2 | Temperature when running programs on mini-batches. Lower values produce more deterministic outputs |
| temperature_for_candidates | float | 0.2 | Temperature for softmax selection of source programs from the pool. Lower values favor the top performers |
bsize (mini-batch size):
| Value | Use case | |-------|----------| | 16 | Small datasets (50-100 examples), faster iterations | | 32 | Default, good balance for most tasks | | 64 | Larger datasets, more stable gradient estimates |
max_steps:
| Value | Use case | |-------|----------| | 4-6 | Quick optimization pass, limited budget | | 8 | Default, enough steps for meaningful improvement | | 12-16 | Longer optimization for complex programs or larger datasets |
num_candidates:
| Value | Use case | |-------|----------| | 3-4 | Budget-conscious, smaller search space | | 6 | Default, reasonable exploration | | 8-10 | Wider search when you have LM budget to spare |
Runs the optimization loop and returns the best program found.
optimized = optimizer.compile(program, trainset=trainset, seed=0)
The seed parameter (default 0) controls random sampling for reproducible results.
The returned program includes two additional attributes:
candidate_programs -- list of scored alternative programs discovered during optimization. Useful for ensemble strategies or analyzing what SIMBA tried.trial_logs -- per-batch metrics from each optimization step. Useful for understanding how performance evolved.Returns the optimizer's configuration as a dictionary. Useful for logging and experiment tracking.
params = optimizer.get_params()
print(params)
# {'bsize': 32, 'num_candidates': 6, 'max_steps': 8, ...}
After optimization, examine what SIMBA found:
optimizer = dspy.SIMBA(metric=metric)
optimized = optimizer.compile(program, trainset=trainset)
# Check the candidate pool
for i, (prog, score) in enumerate(optimized.candidate_programs):
print(f"Candidate {i}: score={score:.3f}")
# Review trial logs to see improvement over time
for step, log in enumerate(optimized.trial_logs):
print(f"Step {step}: {log}")
| Aspect | dspy.SIMBA | dspy.MIPROv2 | dspy.BootstrapFewShot |
|--------|-------------|----------------|------------------------|
| Strategy | Small incremental steps on mini-batches | Full instruction + demo optimization | Bootstrap few-shot examples |
| Change size | Small, targeted per iteration | Can rewrite entire instructions | Adds demonstrations only |
| Risk of regression | Low -- changes are conservative | Higher -- rewrites can miss edge cases | Low -- additive only |
| Data needed | 50-500 examples | 200+ examples | 50+ examples |
| Cost | Moderate (mini-batch sampling) | Higher (full search) | Lower (single pass) |
| Best for | Incremental improvement, production stability | Maximum prompt quality | Quick first optimization |
| Introspection | Yes -- analyzes failures | Yes -- generates instructions | No |
A common approach is to layer optimizers:
BootstrapFewShot to get a working baseline with good demonstrationsMIPROv2 if you need maximum quality and can tolerate larger changes# Step 1: Bootstrap baseline
bootstrap = dspy.BootstrapFewShot(metric=metric, max_bootstrapped_demos=4)
baseline = bootstrap.compile(program, trainset=trainset)
# Step 2: Incrementally improve with SIMBA
simba = dspy.SIMBA(metric=metric, max_steps=8)
improved = simba.compile(baseline, trainset=trainset)
Expect incremental gains per step rather than a single large jump:
| Stage | Example score | Notes | |-------|-------------|-------| | Unoptimized baseline | ~60-70% | Raw program with no demos or instructions | | After BootstrapFewShot | ~75-85% | Good demos added, biggest single jump | | After SIMBA (4-8 steps) | ~80-90% | Incremental +3-8% from targeting hard examples |
The exact numbers depend on your task, data, and LM. SIMBA shines on the incremental step — it finds the examples your program is inconsistent on and fixes those specifically.
BootstrapFewShot first to establish a baseline with good demonstrations, then run SIMBA on the bootstrapped program to push accuracy higher.max_demos too high. Each demo added by SIMBA increases prompt length. With max_demos=10 and multi-paragraph examples, prompts can exceed context limits or degrade quality from demo overload. Keep max_demos at 4-6 (the default is 4).prompt_model and the main program. SIMBA's introspective rules are generated by analyzing failures and writing natural-language instructions. If your main LM is small (e.g., gpt-4o-mini), the rule quality suffers. Set prompt_model to a stronger model for rule generation while keeping the cheaper model for the main program.candidate_programs and trial_logs on the result. After compile(), the returned program has candidate_programs (list of scored alternatives) and trial_logs (per-step metrics). Inspecting these reveals whether optimization plateaued, which strategies worked, and whether alternative candidates might be better for specific inputs.Install any skill:
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>
/ai-improving-accuracy/dspy-evaluate/dspy-chain-of-thought or /dspy-modules/dspy-data/ai-do if you do not have it — it routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-dotools
See what is happening during optimizer.compile() instead of waiting blind. Use when you want to watch optimization progress, see scores as they come in, know if your optimizer is working, check if optimization is stuck, understand why optimization is taking too long, get live progress during compile, monitor convergence, detect overfitting during optimization, interpret optimization results, or pick the right tool for watching optimization. Also used for optimizer progress bar, is my optimizer doing anything, optimization seems stuck, how long will optimization take, watch GEPA run, watch MIPROv2 run, live optimization dashboard, optimizer not improving, scores not going up, optimization taking forever, see what optimizer is doing, debug slow optimization, optimization visibility, optimizer metrics, track compile progress, optimization observability.
testing
Use when you want the highest-quality prompt optimization DSPy offers — jointly optimizes instructions and few-shot demos, with auto=light/medium/heavy presets. Common scenarios - you want the best possible accuracy from prompt optimization, jointly tuning instructions and few-shot demonstrations, using auto presets for different compute budgets, or when COPRO or BootstrapFewShot alone are not reaching your accuracy target. Related - ai-improving-accuracy, dspy-copro, dspy-bootstrap-few-shot. Also used for dspy.MIPROv2, best DSPy optimizer, highest quality optimization, auto=light medium heavy, joint instruction and demo optimization, most powerful prompt optimizer, MIPROv2 vs COPRO vs BootstrapFewShot, which optimizer should I use, state of the art prompt optimization, when to use MIPROv2, optimize both instructions and examples, heavy optimization for production, best optimizer for accuracy.
testing
Use LangWatch for DSPy auto-tracing and real-time optimizer progress. Use when you want to set up LangWatch, langwatch.dspy.init, auto-tracing DSPy, real-time optimization dashboard, optimizer progress tracking, app.langwatch.ai, or DSPy optimizer dashboard. Also used for langwatch setup, pip install langwatch, langwatch trace, optimizer progress, real-time optimization, watch optimizer run, LangWatch self-hosted, langwatch docker, langwatch vs langtrace, langwatch autotrack_dspy.
data-ai
Use when you want to optimize instructions without few-shot examples — a lightweight alternative to COPRO when you do not have or do not want to use demonstrations. Common scenarios - optimizing instructions when you do not have or do not want to use few-shot demonstrations, lightweight instruction search as a first step, tasks where examples in the prompt confuse the model, or when you want fast instruction optimization without the cost of COPRO. Related - ai-improving-accuracy, dspy-copro, dspy-miprov2. Also used for dspy.GEPA, instruction optimization without demos, lightweight prompt optimization, optimize instructions only, no few-shot examples needed, GEPA vs COPRO, quick instruction search, when demonstrations hurt performance, zero-shot optimization, instruction-only optimizer, simplest instruction tuner, fast prompt optimization, skip few-shot and just tune instructions, optimize Pydantic field descriptions, GEPA structured output, GEPA does not optimize field desc.