skills/dspy-bootstrap-rs/SKILL.md
Use when basic BootstrapFewShot is not enough and you want to search over multiple candidate demo sets — better results at the cost of more LM calls. Common scenarios - BootstrapFewShot alone is not reaching target accuracy, you want to search over multiple candidate demo sets and pick the best, optimizing for tasks where example selection matters a lot, or when you have compute budget for a more thorough search. Related - ai-improving-accuracy, dspy-bootstrap-few-shot. Also used for dspy.BootstrapFewShotWithRandomSearch, random search over demonstrations, better than basic BootstrapFewShot, search for optimal few-shot examples, brute force demo selection, try many demo combinations, more compute for better demos, upgrade from BootstrapFewShot, intermediate optimizer between simple and MIPROv2, when basic few-shot optimization is not enough, explore demonstration space.
Guide the user through using DSPy's BootstrapFewShotWithRandomSearch optimizer to find the best set of few-shot demonstrations for their program. This optimizer runs BootstrapFewShot multiple times with different random seeds and keeps the candidate program that scores highest on a metric.
BootstrapFewShotWithRandomSearch (also known as BootstrapRS) is a prompt optimizer that searches over multiple candidate sets of few-shot demonstrations to find the best one. It wraps BootstrapFewShot and runs it repeatedly with different random subsets of training examples, then evaluates each candidate program on a validation set: the valset you pass to compile, or the trainset itself if you pass none.
trainset ──> [ BootstrapFewShot run 1 ] ──> candidate program 1 ──┐
         ──> [ BootstrapFewShot run 2 ] ──> candidate program 2 ──┤
         ──> [ BootstrapFewShot run 3 ] ──> candidate program 3 ──┼──> evaluate all ──> best program
         ──> ...                                                  │
         ──> [ BootstrapFewShot run N ] ──> candidate program N ──┘
BootstrapFewShot runs once: it bootstraps demonstrations from your training data, picks a fixed set, and returns a single optimized program. The result depends heavily on which examples happened to be selected and which traces succeeded. You might get lucky or unlucky.
BootstrapFewShotWithRandomSearch reduces that luck factor. It runs the bootstrap process multiple times (controlled by num_candidate_programs), each time with a different random sample of training examples. Each candidate program gets scored on a validation set, and the optimizer returns the highest-scoring one.
The trade-off is straightforward: more compute for more reliable results.
| | BootstrapFewShot | BootstrapFewShotWithRandomSearch |
|---|-----------------|----------------------------------|
| Bootstrap runs | 1 | num_candidate_programs (default 16) |
| Selection | Returns the single result | Evaluates all candidates, returns the best |
| Reliability | Results vary between runs | More consistent, higher-quality results |
| Cost | 1x | ~Nx (N = num_candidate_programs) |
| When to use | Quick iteration, <50 examples | You want the best few-shot demos, 50-200+ examples |
import dspy
from dspy.evaluate import Evaluate

lm = dspy.LM("openai/gpt-4o-mini")  # or any LiteLLM-supported provider
dspy.configure(lm=lm)

# 1. Define your program
qa = dspy.ChainOfThought("question -> answer")

# 2. Prepare training data (50-200+ examples recommended)
trainset = [
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
    # ... more examples
]

# 3. Define a metric
def metric(example, prediction, trace=None):
    return prediction.answer.strip().lower() == example.answer.strip().lower()

# 4. Optimize with random search
optimizer = dspy.BootstrapFewShotWithRandomSearch(
    metric=metric,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
    num_candidate_programs=16,
)
optimized_qa = optimizer.compile(qa, trainset=trainset)

# 5. Use the optimized program
result = optimized_qa(question="What is the capital of Germany?")
print(result.answer)

# 6. Save for later
optimized_qa.save("optimized_qa.json")
dspy.BootstrapFewShotWithRandomSearch(
    metric,                     # Scoring function: (example, prediction, trace) -> float|bool
    max_bootstrapped_demos=4,   # Max demos generated by running the program on training examples
    max_labeled_demos=16,       # Max demos taken directly from labeled training data
    num_candidate_programs=16,  # How many random bootstrap runs to try
    num_threads=None,           # Threads for parallel evaluation of candidates
    stop_at_score=None,         # Early-stop if a candidate reaches this score
    metric_threshold=None,      # Min metric score for a bootstrapped demo to be kept
)
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| metric | Callable | required | Scoring function (example, prediction, trace=None) -> float\|bool |
| max_bootstrapped_demos | int | 4 | Maximum bootstrapped (program-generated) demos per predictor |
| max_labeled_demos | int | 16 | Maximum labeled (from trainset) demos per predictor |
| num_candidate_programs | int | 16 | Number of random bootstrap attempts to evaluate |
| num_threads | int \| None | None | Threads for evaluating candidates. Falls back to dspy.settings.num_threads. |
| stop_at_score | float \| None | None | Early-stop search if a candidate reaches this score |
| metric_threshold | float \| None | None | Minimum metric score for a bootstrapped demo to be included |
| teacher_settings | dict \| None | None | LM config for the teacher model (e.g., {"lm": big_model}) |
| max_rounds | int | 1 | Bootstrap rounds per candidate (>1 generates diverse traces at temperature=1.0) |
| max_errors | int \| None | None | Error tolerance before aborting |
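As a rough illustration of the `(example, prediction, trace=None) -> float|bool` contract, the sketch below returns a graded score during evaluation and a strict pass/fail while demonstrations are being bootstrapped. It assumes the usual DSPy convention that `trace` is non-`None` only during bootstrapping. The name `token_overlap_metric` and the `SimpleNamespace` stand-ins for `dspy.Example` and `dspy.Prediction` are hypothetical, used only so the snippet runs without an LM.

```python
from types import SimpleNamespace


def token_overlap_metric(example, prediction, trace=None):
    """Fraction of gold-answer tokens found in the prediction (hypothetical helper)."""
    gold = set(example.answer.lower().split())
    pred = set(prediction.answer.lower().split())
    score = len(gold & pred) / len(gold) if gold else 0.0
    if trace is not None:
        # Bootstrapping (assumed convention): keep only traces that cover every gold token.
        return score == 1.0
    # Evaluation: return the graded score.
    return score


# Stand-ins for dspy.Example / dspy.Prediction objects.
example = SimpleNamespace(answer="Paris France")
good = SimpleNamespace(answer="paris france")
partial = SimpleNamespace(answer="Paris")

print(token_overlap_metric(example, good))               # 1.0
print(token_overlap_metric(example, partial))            # 0.5
print(token_overlap_metric(example, partial, trace=[]))  # False
```

With a float-valued metric like this one, metric_threshold is the knob that decides how high a bootstrapped trace must score to be kept as a demo.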
max_bootstrapped_demos and max_labeled_demos control where demonstrations come from:
Bootstrapped demos are generated by running your program on training examples and keeping the traces where the metric passes. These are powerful because they show the LM its own successful reasoning patterns, including intermediate steps like chain-of-thought reasoning.
Labeled demos are taken directly from your training data as input-output pairs. They don't include intermediate reasoning steps, but they're reliable because they use your gold-standard answers.
The optimizer includes up to max_bootstrapped_demos bootstrapped demos plus up to max_labeled_demos labeled demos in each candidate program's prompt.
Guidance:
- Start with max_bootstrapped_demos=4, max_labeled_demos=4 for most tasks.
- Increase max_labeled_demos (up to 8-16) if you have high-quality labeled data and your model benefits from more examples.
- Increase max_bootstrapped_demos (up to 4-8) if your task involves chain-of-thought or multi-step reasoning where seeing worked examples helps.

Each candidate program is built by a separate BootstrapFewShot run. The randomness comes from each run drawing a different random sample of the training examples.
The optimizer evaluates each candidate program on a validation set: the valset you pass to compile, or, if you pass none, the trainset itself. The candidate with the highest validation score wins.
This is conceptually similar to hyperparameter random search: instead of searching over learning rates or layer sizes, you're searching over which few-shot demos to include in the prompt.
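The search loop described above can be sketched in plain Python. This is a toy model under stated assumptions, not DSPy's implementation: `build_candidate`, `predict`, `score`, and `random_search` are hypothetical names, the "program" is a nearest-demo lookup standing in for a prompted LM, and the data is made up. It only shows the shape of the idea: one seeded random demo set per candidate, one validation score per candidate, best score wins.

```python
import random


def build_candidate(trainset, max_demos, seed):
    """Shuffle with a per-candidate seed and keep a random number of demos."""
    rng = random.Random(seed)
    pool = list(trainset)
    rng.shuffle(pool)
    size = rng.randint(1, max_demos)
    return pool[:size]


def predict(demos, text):
    """Toy stand-in for a prompted LM: copy the label of the most similar demo."""
    words = set(text.split())
    closest = max(demos, key=lambda demo: len(words & set(demo[0].split())))
    return closest[1]


def score(demos, valset):
    """Accuracy (in percent) of the toy program on the validation set."""
    hits = sum(predict(demos, text) == label for text, label in valset)
    return 100.0 * hits / len(valset)


def random_search(trainset, valset, num_candidates=16, max_demos=4):
    """Build one candidate per seed, score each, and return them best first."""
    candidates = []
    for seed in range(num_candidates):
        demos = build_candidate(trainset, max_demos, seed)
        candidates.append({"seed": seed, "score": score(demos, valset), "demos": demos})
    return sorted(candidates, key=lambda c: c["score"], reverse=True)


# Hypothetical toy data: (text, label) pairs, split into train and validation.
data = [
    ("great movie loved it", "pos"), ("terrible plot hated it", "neg"),
    ("loved the acting", "pos"), ("hated the ending", "neg"),
    ("great fun", "pos"), ("terrible waste of time", "neg"),
    ("really loved this great film", "pos"), ("really hated this terrible film", "neg"),
]
trainset, valset = data[:6], data[6:]

ranked = random_search(trainset, valset)
best, worst = ranked[0], ranked[-1]
print(f"best seed={best['seed']} score={best['score']:.0f}")
print(f"worst seed={worst['seed']} score={worst['score']:.0f}")
```

The gap between the best and worst candidate gives a feel for how much the choice of demos matters on a given task, which is the variance that random search is meant to exploit.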
The cost scales linearly with num_candidate_programs:
| num_candidate_programs | Approximate cost multiplier | When to use |
|------------------------|----------------------------|-------------|
| 4-8 | 4-8x base BootstrapFewShot | Quick search, limited budget |
| 16 (default) | 16x | Good balance for most tasks |
| 25-50 | 25-50x | Maximum quality, budget allows |
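Each candidate program requires a full BootstrapFewShot run over the training examples plus one evaluation pass over the validation set.
Cost estimate: If a single BootstrapFewShot run costs ~$0.50, then 16 candidate programs cost ~$8. With a larger trainset or more expensive model, plan for $5-$20.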
Tip: Start with num_candidate_programs=8 to get a quick sense of how much random search helps, then increase to 16 or 25 if the improvement justifies the cost.
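The linear scaling above is easy to turn into a back-of-the-envelope budget check. The helper below is a hypothetical sketch (`random_search_cost` is not a DSPy function) and assumes every candidate costs about as much as one BootstrapFewShot run, ignoring caching and any difference in evaluation cost.

```python
def random_search_cost(base_cost_usd, num_candidate_programs=16):
    """Rough cost estimate: one BootstrapFewShot-sized run per candidate."""
    return base_cost_usd * num_candidate_programs


# Using the ~$0.50 per-run figure from the estimate above.
for n in (8, 16, 25):
    print(f"{n} candidates: ~${random_search_cost(0.50, n):.2f}")
# 8 candidates: ~$4.00
# 16 candidates: ~$8.00
# 25 candidates: ~$12.50
```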
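Use BootstrapFewShotWithRandomSearch when BootstrapFewShot alone is not reaching your target accuracy, example selection matters a lot, and you have the compute budget for a more thorough search (typically with 50-200+ examples).
Use BootstrapFewShot instead when you want quick iteration or have fewer than 50 examples.
Use MIPROv2 instead when you want to optimize instructions as well as few-shot demos.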
Quick & cheap            Solid middle ground          Best quality
BootstrapFewShot    -->  BootstrapFewShotWithRS  -->  MIPROv2
~$0.50                   ~$5-20                       ~$5-50
Few-shot demos only      Few-shot demos (searched)    Instructions + few-shot demos
1 candidate              N candidates                 Bayesian optimization
Use a larger model to generate high-quality bootstrapped demos, then deploy with a cheaper student model:
teacher_lm = dspy.LM("openai/gpt-4o")       # or any LiteLLM-supported provider
student_lm = dspy.LM("openai/gpt-4o-mini")  # or any LiteLLM-supported provider
dspy.configure(lm=student_lm)

optimizer = dspy.BootstrapFewShotWithRandomSearch(
    metric=metric,
    max_bootstrapped_demos=4,
    num_candidate_programs=16,
    teacher_settings={"lm": teacher_lm},
)
optimized = optimizer.compile(my_program, trainset=trainset)
# optimized runs on student_lm but uses demos generated by teacher_lm
Skip evaluating remaining candidates once a "good enough" program is found:
optimizer = dspy.BootstrapFewShotWithRandomSearch(
    metric=metric,
    num_candidate_programs=25,
    stop_at_score=95.0,  # stop as soon as a candidate scores >= 95%
)
This is useful when you set num_candidate_programs high but want to save cost if an early candidate is already excellent.
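To make the early-stop behavior concrete, here is a minimal sketch of the control flow, assuming candidates are scored one at a time in order. `search_with_early_stop` and the score list are hypothetical; the real optimizer builds and evaluates actual programs rather than reading scores from a list.

```python
def search_with_early_stop(candidate_scores, stop_at_score=None):
    """Evaluate candidates in order; stop once one reaches stop_at_score."""
    evaluated = []
    for index, score in enumerate(candidate_scores):
        evaluated.append((index, score))
        if stop_at_score is not None and score >= stop_at_score:
            break  # good enough: skip the remaining candidates
    best_index, best_score = max(evaluated, key=lambda pair: pair[1])
    return best_index, best_score, len(evaluated)


# Hypothetical validation scores (percent) for five candidates.
scores = [88.0, 91.5, 96.0, 93.0, 97.5]
print(search_with_early_stop(scores, stop_at_score=95.0))  # (2, 96.0, 3)
print(search_with_early_stop(scores))                      # (4, 97.5, 5)
```

The two calls show the trade-off: stopping at 95 saves two evaluations but never sees the slightly better fifth candidate.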
You can stack optimizers. Run BootstrapRS first to find great demos, then pass the result to MIPROv2 to refine instructions on top:
# Step 1: Find best demos with random search
bootstrap_optimizer = dspy.BootstrapFewShotWithRandomSearch(
    metric=metric,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
    num_candidate_programs=16,
)
bootstrapped = bootstrap_optimizer.compile(my_program, trainset=trainset)

# Step 2: Refine instructions with MIPROv2
mipro_optimizer = dspy.MIPROv2(metric=metric, auto="medium")
final = mipro_optimizer.compile(bootstrapped, trainset=trainset)
- Passing only trainset without a separate valset. The optimizer then falls back to scoring candidates on the trainset itself, so the winner is picked on the same data its demos were drawn from. Pass valset explicitly for a cleaner, reproducible comparison: optimizer.compile(program, trainset=trainset, valset=devset).
- Setting num_candidate_programs too low. With num_candidate_programs=3 the random search barely explores the space. The default of 16 is a good starting point. Fewer than 8 rarely finds materially better demos than plain BootstrapFewShot.
- Using max_labeled_demos=16 with multi-step pipelines. Each predictor in the pipeline gets up to max_labeled_demos + max_bootstrapped_demos demos. A 3-step pipeline with 16+4 demos per step = 60 demos total, which can blow past context limits. Use 2-4 demos per type for multi-step pipelines.
- Overlooking the candidate_programs attribute on the result. The optimized program has a candidate_programs attribute containing all scored candidates. This is useful for inspecting how much variance exists and whether more search would help.
- [...] BootstrapFewShot instead, or collect more data.

Install any skill:
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>
- /ai-improving-accuracy
- /dspy-evaluate
- /dspy-data
- Install /ai-do if you do not have it. It routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do