skills/dspy-bootstrap-few-shot/SKILL.md
Use when you have 50+ labeled examples and want a quick accuracy boost as your first optimization step — the simplest and fastest DSPy optimizer. Common scenarios - your first optimization attempt on a new DSPy program, adding few-shot examples automatically from labeled data, quick accuracy boost before trying heavier optimizers, bootstrapping demonstrations from a teacher model, or getting started with DSPy optimization. Related - ai-improving-accuracy, dspy-labeled-few-shot. Also used for dspy.BootstrapFewShot, simplest DSPy optimizer, first optimizer to try, automatic few-shot example selection, bootstrap demonstrations from labels, quick optimization baseline, add examples to prompt automatically, teacher bootstrapping, labeled data to few-shot demos, starting point for DSPy optimization, easy accuracy improvement, how to optimize DSPy program for the first time.
npx skillsauth add lebsral/dspy-programming-not-prompting-lms-skills dspy-bootstrap-few-shot
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Guide the user through using dspy.BootstrapFewShot to automatically generate and select high-quality few-shot demonstrations for their DSPy program. This is the simplest optimizer and the recommended first step before trying heavier optimizers.
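dspy.BootstrapFewShot takes your program, a training set, and a metric, then runs the program on the training examples, keeps the traces that pass the metric, and attaches them to each predictor as few-shot demonstrations.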
The result is a copy of your program with working examples baked into the prompt — so the LM sees "here's how I solved similar problems" every time it runs.
```python
optimizer = dspy.BootstrapFewShot(metric=metric, max_bootstrapped_demos=4)
optimized = optimizer.compile(my_program, trainset=trainset)
```
```python
import dspy
from dspy.evaluate import Evaluate

lm = dspy.LM("openai/gpt-4o-mini")  # or "anthropic/claude-sonnet-4-5-20250929", etc.
dspy.configure(lm=lm)

# 1. Define your program
qa = dspy.ChainOfThought("question -> answer")

# 2. Prepare your data (mark inputs with .with_inputs())
trainset = [
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
    # ... ~50+ examples
]
devset = [
    dspy.Example(question="Who wrote Hamlet?", answer="Shakespeare").with_inputs("question"),
    # ... held-out examples for evaluation
]

# 3. Define a metric
def metric(example, prediction, trace=None):
    return prediction.answer.strip().lower() == example.answer.strip().lower()

# 4. Evaluate baseline
evaluator = Evaluate(devset=devset, metric=metric, num_threads=4)
baseline = evaluator(qa)
# Depending on the DSPy version, Evaluate returns a float or a result object with .score
baseline_score = getattr(baseline, "score", baseline)
print(f"Baseline: {baseline_score:.1f}%")

# 5. Optimize
optimizer = dspy.BootstrapFewShot(metric=metric, max_bootstrapped_demos=4)
optimized_qa = optimizer.compile(qa, trainset=trainset)

# 6. Evaluate optimized program
improved = evaluator(optimized_qa)
improved_score = getattr(improved, "score", improved)
print(f"Optimized: {improved_score:.1f}%")
```
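The quick start keeps `trainset` and `devset` separate so the optimized score is measured on examples the optimizer never saw. A minimal sketch of one way to make that split, assuming your labeled data starts as a plain list of `(question, answer)` pairs. The helper name `split_examples` and the 80/20 ratio are illustrative choices, not part of DSPy.

```python
import random

def split_examples(pairs, dev_fraction=0.2, seed=0):
    """Shuffle labeled pairs deterministically and split into (train, dev)."""
    if not 0 < dev_fraction < 1:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(pairs)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    # First n_dev items become the held-out dev set, the rest the train set
    return shuffled[n_dev:], shuffled[:n_dev]

# Hypothetical labeled data, standing in for your own examples
pairs = [(f"What is {i}+{i}?", str(2 * i)) for i in range(50)]
train_pairs, dev_pairs = split_examples(pairs)
print(len(train_pairs), len(dev_pairs))
```

Each pair can then be wrapped as `dspy.Example(question=q, answer=a).with_inputs("question")`, as in the quick start.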
```python
optimizer = dspy.BootstrapFewShot(
    metric=metric,             # Scoring function(example, prediction, trace) -> bool/float
    max_bootstrapped_demos=4,  # Max bootstrapped (generated) demos per predictor. Default: 4
    max_labeled_demos=16,      # Max labeled (from trainset) demos per predictor. Default: 16
    max_rounds=1,              # Number of bootstrap rounds. Default: 1
    max_errors=None,           # Error tolerance. Default: None (uses dspy.settings.max_errors)
    metric_threshold=None,     # Numerical threshold for accepting bootstrap examples
    teacher_settings=None,     # Config dict for the teacher model (e.g., {"lm": teacher_lm})
)
```
max_bootstrapped_demos: How many auto-generated demonstrations to include in the prompt. These come from running the program on training examples and keeping traces that pass the metric. Start with 4 and increase to 8 for a complex task.
max_labeled_demos: How many examples from your trainset to include directly as demonstrations, without running them through the program first. These are plain input/output pairs. Set to 0 if you only want bootstrapped demos.
max_rounds: Number of bootstrapping iterations. In each round, the optimizer runs the program (with any demos from previous rounds) and collects new passing traces. More rounds can find better demos but take longer. Usually 1 is sufficient.
max_errors: How many failed examples to tolerate before the optimizer stops. Defaults to None, which falls back to dspy.settings.max_errors. Increase it if your task is noisy or the metric is strict.
metric_threshold: Numerical threshold for accepting bootstrap examples. When set, only traces scoring at or above this threshold become demos. Useful when your metric returns floats rather than booleans.
teacher_settings: Configuration dict for a teacher model. Pass {"lm": teacher_lm} to use a stronger model for generating traces while the student uses a cheaper model.
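To make `metric_threshold` concrete: a metric may return a float instead of a boolean. The sketch below is a hypothetical token-overlap F1 metric (the name `token_f1` and the 0.8 threshold are assumptions chosen for illustration), written against plain objects with an `answer` attribute so it runs without an LM. The comparison at the end only illustrates thresholding; it is not DSPy's internal code.

```python
from collections import Counter
from types import SimpleNamespace

def token_f1(example, prediction, trace=None):
    """Token-overlap F1 between predicted and gold answers, in [0.0, 1.0]."""
    gold = example.answer.lower().split()
    pred = prediction.answer.lower().split()
    if not gold or not pred:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(gold) & Counter(pred)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Stand-in objects (illustration only); DSPy would pass its own example/prediction
example = SimpleNamespace(answer="William Shakespeare")
exact = SimpleNamespace(answer="william shakespeare")
partial = SimpleNamespace(answer="Shakespeare")

threshold = 0.8  # hypothetical value; tune for your task
for name, pred in [("exact", exact), ("partial", partial)]:
    score = token_f1(example, pred)
    print(name, round(score, 2), "accepted" if score >= threshold else "rejected")
```

A metric like this would be passed as `dspy.BootstrapFewShot(metric=token_f1, metric_threshold=0.8)`, so that partially correct traces are not kept as demos.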
Understanding the process helps you debug when results are unexpected.
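Round 1:
1. The program runs on each example in trainset.
2. Each output is scored with metric(example, prediction, trace).
3. Traces that pass the metric are kept as candidate demonstrations.
4. Up to max_bootstrapped_demos traces are attached to each predictor.

Round 2+ (if max_rounds > 1):
1. The program runs again, now with the demos from previous rounds, and new passing traces are collected.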
The result: Your program's predictors now have few-shot demonstrations in their prompts. When the program runs, the LM sees these worked examples before processing the new input.
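The loop described above can be sketched in plain Python. This is a deliberately simplified toy, not DSPy's implementation: `toy_program`, `exact_match`, `bootstrap_demos`, and the dict-based examples are hypothetical stand-ins, and no LM is involved. It shows only the core idea of running the program on training examples, scoring each output with the metric, and keeping up to `max_bootstrapped_demos` passing results.

```python
# Hypothetical stand-in for an LM-backed program: it "knows" only two answers
KNOWN_ANSWERS = {"What is 2+2?": "4", "Capital of France?": "Paris"}

def toy_program(question):
    return KNOWN_ANSWERS.get(question, "unknown")

def exact_match(example, prediction, trace=None):
    return prediction.strip().lower() == example["answer"].strip().lower()

def bootstrap_demos(program, trainset, metric, max_bootstrapped_demos=4):
    """Collect up to max_bootstrapped_demos question/answer pairs that pass the metric."""
    demos = []
    for example in trainset:
        if len(demos) >= max_bootstrapped_demos:
            break  # enough demos collected
        prediction = program(example["question"])
        # A non-None trace tells the metric it is being called during bootstrapping
        trace = [(example["question"], prediction)]
        if metric(example, prediction, trace=trace):
            demos.append({"question": example["question"], "answer": prediction})
    return demos

trainset = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "Who wrote Hamlet?", "answer": "Shakespeare"},  # toy_program fails this one
    {"question": "Capital of France?", "answer": "paris"},
]
demos = bootstrap_demos(toy_program, trainset, exact_match, max_bootstrapped_demos=2)
print(demos)
```

The failed example is simply skipped, which is why a strict metric or a weak program can leave you with fewer demos than `max_bootstrapped_demos`.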
The trace parameter in your metric is None during evaluation but set during optimization. Use this to apply stricter filtering during bootstrapping:
```python
def metric(example, prediction, trace=None):
    correct = prediction.answer.strip().lower() == example.answer.strip().lower()
    if trace is not None:
        # During optimization: require good reasoning too
        has_reasoning = len(getattr(prediction, "reasoning", "")) > 50
        return correct and has_reasoning
    # During evaluation: only check correctness
    return correct
```
This ensures bootstrapped demos have both correct answers and clear reasoning, producing higher-quality demonstrations.
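A quick way to see the two modes is to call the metric by hand. The sketch below repeats the metric from above and feeds it stand-in objects built with `types.SimpleNamespace`. That is an assumption made so the snippet runs without an LM; in real use DSPy passes its own example, prediction, and trace objects.

```python
from types import SimpleNamespace

def metric(example, prediction, trace=None):
    correct = prediction.answer.strip().lower() == example.answer.strip().lower()
    if trace is not None:
        # Bootstrapping mode: also require more than 50 characters of reasoning
        has_reasoning = len(getattr(prediction, "reasoning", "")) > 50
        return correct and has_reasoning
    # Evaluation mode: correctness only
    return correct

example = SimpleNamespace(answer="Paris")
terse = SimpleNamespace(answer="paris", reasoning="It is Paris.")
detailed = SimpleNamespace(
    answer="Paris",
    reasoning="France's seat of government and largest city is Paris, so the capital is Paris.",
)
fake_trace = [("predictor", {}, {})]  # any non-None value stands in for a real trace

eval_terse = metric(example, terse)                         # evaluation mode
boot_terse = metric(example, terse, trace=fake_trace)       # bootstrapping mode
boot_detailed = metric(example, detailed, trace=fake_trace)
print(eval_terse, boot_terse, boot_detailed)
```

The terse prediction counts as correct during evaluation but is rejected as a demo, while the detailed one passes in both modes.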
After optimization, save the program so you don't have to re-optimize every time:
```python
# Save
optimized_qa.save("optimized_qa.json")

# Load later
loaded_qa = dspy.ChainOfThought("question -> answer")
loaded_qa.load("optimized_qa.json")

# Use it
result = loaded_qa(question="What is the capital of Japan?")
```
For custom modules:
```python
class MyPipeline(dspy.Module):
    def __init__(self):
        super().__init__()  # initialize dspy.Module before registering predictors
        self.step1 = dspy.ChainOfThought("question -> search_query")
        self.step2 = dspy.ChainOfThought("question, search_query -> answer")

    def forward(self, question):
        query = self.step1(question=question)
        return self.step2(question=question, search_query=query.search_query)

# Save after optimization (optimized_pipeline is the result of optimizer.compile)
optimized_pipeline.save("pipeline.json")

# Load into a fresh instance of the same class
loaded = MyPipeline()
loaded.load("pipeline.json")
```
The saved file contains the few-shot demonstrations for each predictor. The program structure itself is defined in code — save and load only handle the learned demos and parameters.
BootstrapFewShot works on every predictor in your program. For a multi-step pipeline, each step gets its own demonstrations:
```python
class RAG(dspy.Module):
    def __init__(self):
        super().__init__()  # initialize dspy.Module before registering predictors
        self.generate_query = dspy.ChainOfThought("question -> search_query")
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        query = self.generate_query(question=question)
        # retrieve() is a placeholder for your own retrieval step
        context = retrieve(query.search_query)
        return self.generate_answer(context=context, question=question)

optimizer = dspy.BootstrapFewShot(metric=metric, max_bootstrapped_demos=4)
optimized_rag = optimizer.compile(RAG(), trainset=trainset)
# Both generate_query and generate_answer now have bootstrapped demos
```
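To confirm that each step actually received demonstrations, you can count the demos attached to every predictor. The helper below is a sketch: `demo_counts` is a hypothetical name, and it assumes the compiled program exposes `named_predictors()` and that each predictor carries a `demos` list. The stub classes exist only so the snippet runs without an LM.

```python
def demo_counts(program):
    """Map each predictor name to the number of demos attached to it."""
    return {name: len(predictor.demos) for name, predictor in program.named_predictors()}

# --- Stubs standing in for a compiled program (illustration only) ---
class StubPredictor:
    def __init__(self, demos):
        self.demos = demos

class StubProgram:
    def __init__(self):
        self.generate_query = StubPredictor(demos=[{"question": "q"}] * 4)
        self.generate_answer = StubPredictor(demos=[])

    def named_predictors(self):
        return [
            ("generate_query", self.generate_query),
            ("generate_answer", self.generate_answer),
        ]

counts = demo_counts(StubProgram())
print(counts)
# Predictors with zero demos are the ones to investigate
empty = [name for name, n in counts.items() if n == 0]
print("predictors without demos:", empty)
```

With a real program you would call `demo_counts(optimized_rag)`. A count of zero for a step means no passing trace was collected for that predictor.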
BootstrapFewShot is a great starting point, but you may want to upgrade if:
| Signal | Next step |
|--------|-----------|
| Accuracy plateaus after bootstrapping | Try dspy.BootstrapFewShotWithRandomSearch — it runs multiple bootstrap trials and picks the best set of demos |
| You have 200+ examples and want the best prompts | Try dspy.MIPROv2 — it optimizes both instructions and few-shot demos |
| You want maximum quality and can fine-tune | Try dspy.BootstrapFinetune — it uses bootstrapped traces to fine-tune the LM weights |
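A typical progression: start with dspy.BootstrapFewShot, move to dspy.BootstrapFewShotWithRandomSearch if accuracy plateaus, then to dspy.MIPROv2 once you have 200+ examples, and finally to dspy.BootstrapFinetune if you can fine-tune.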
No demos were bootstrapped:

Accuracy didn't improve (or got worse):
- Increase max_bootstrapped_demos (e.g., 8)
- Set max_labeled_demos=0 to only use bootstrapped demos

Optimization is slow:
- Reduce max_rounds to 1

Setting max_labeled_demos too high, bloating the prompt. The default is 16, which adds up to 16 raw input/output pairs from the trainset to the prompt. For tasks with long inputs, this can consume most of the context window. Start with max_labeled_demos=4 and increase only if accuracy improves.

Forgetting .with_inputs() on training examples. Every dspy.Example in the trainset must call .with_inputs("field1", "field2") to mark which fields are inputs vs labels. Without it, the optimizer cannot distinguish inputs from expected outputs and bootstrapping silently produces garbage demos.

Install any skill:
```
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>
```
- /dspy-data
- /dspy-evaluate
- /dspy-bootstrap-rs
- /dspy-miprov2
- /ai-improving-accuracy
- Install /ai-do if you do not have it. It routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do