skills/dspy-bootstrap-finetune/SKILL.md
Use when you need maximum quality from a smaller/cheaper model — generates training data from a teacher model and fine-tunes a student model weights. Common scenarios - distilling GPT-4 quality into a cheaper model, generating training data from a strong teacher to fine-tune a weak student, reducing inference costs by replacing an expensive model with a fine-tuned small one, or building a production model that is fast and cheap. Related - ai-fine-tuning, ai-cutting-costs, dspy-better-together. Also used for dspy.BootstrapFinetune, model distillation with DSPy, teacher-student training, fine-tune small model from GPT-4 outputs, reduce API costs with fine-tuning, generate training data then fine-tune, cheap model same quality, distill large model into small model, fine-tune Llama from GPT-4, production model training, move from API to self-hosted model.
npx skillsauth add lebsral/dspy-programming-not-prompting-lms-skills dspy-bootstrap-finetuneInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Guide the user through using DSPy's BootstrapFinetune optimizer to automatically generate training data from successful reasoning traces and fine-tune a language model's weights. This is the heaviest optimization DSPy offers -- it changes the model itself, not just the prompt.
dspy.BootstrapFinetune is an optimizer that tunes LM weights rather than prompts. It works in two phases:
The result is a version of your program backed by a fine-tuned model that has internalized the reasoning patterns from the bootstrapped traces.
Training examples ──> Run program ──> Keep passing traces ──> Fine-tune model weights
Use it when:
Do not use it when:
/ai-improving-accuracy with MIPROv2 or BootstrapFewShot insteadimport dspy
lm = dspy.LM("openai/gpt-4o-mini") # or "anthropic/claude-sonnet-4-5-20250929", etc.
dspy.configure(lm=lm)
# 1. Define your program
class Classify(dspy.Signature):
"""Classify the support ticket category."""
text: str = dspy.InputField()
category: str = dspy.OutputField()
program = dspy.ChainOfThought(Classify)
# 2. Prepare labeled data (500+ examples)
trainset = [
dspy.Example(text="Can't log in", category="auth").with_inputs("text"),
dspy.Example(text="Charge me twice", category="billing").with_inputs("text"),
# ... 500+ examples
]
# 3. Define a metric
def metric(example, prediction, trace=None):
return prediction.category.lower() == example.category.lower()
# 4. Fine-tune
optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
finetuned = optimizer.compile(program, trainset=trainset)
# 5. Use the fine-tuned program
result = finetuned(text="My payment failed")
print(result.category)
After compile finishes, finetuned is a copy of your program that uses the newly fine-tuned model. Every module in the program that was backed by a fine-tunable LM gets updated.
The most powerful pattern: use an expensive, high-quality model (the teacher) to generate traces, then fine-tune a cheap model (the student) on those traces. This is model distillation.
# --- Teacher: expensive model, high quality ---
teacher_lm = dspy.LM("openai/gpt-4o") # or any strong model
dspy.configure(lm=teacher_lm)
teacher = dspy.ChainOfThought(Classify)
# Optionally optimize the teacher's prompts first for even better traces
prompt_optimizer = dspy.MIPROv2(metric=metric, auto="medium")
teacher_optimized = prompt_optimizer.compile(teacher, trainset=trainset)
# --- Student: cheap model, fine-tuned on teacher's traces ---
student_lm = dspy.LM("openai/gpt-4o-mini") # or any fine-tunable model
dspy.configure(lm=student_lm)
student = dspy.ChainOfThought(Classify)
ft_optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
student_finetuned = ft_optimizer.compile(
student,
trainset=trainset,
teacher=teacher_optimized, # Teacher generates the traces
)
How it works with a teacher:
The student learns to mimic the teacher's reasoning at a fraction of the inference cost.
BootstrapFinetune fine-tunes whatever LM is configured when you call compile. To control which model gets fine-tuned:
# Fine-tune GPT-4o-mini
student_lm = dspy.LM("openai/gpt-4o-mini") # or any fine-tunable model
dspy.configure(lm=student_lm)
finetuned = optimizer.compile(student, trainset=trainset)
# Fine-tune an open-source model via Together AI
student_lm = dspy.LM("together_ai/meta-llama/Llama-3-70b-chat-hf") # or any fine-tunable model
dspy.configure(lm=student_lm)
finetuned = optimizer.compile(student, trainset=trainset)
The model must support fine-tuning through its provider's API. Common options:
| Provider | Fine-tunable models | Notes |
|----------|-------------------|-------|
| OpenAI | gpt-4o-mini, gpt-4o | Easiest setup, DSPy handles the API calls |
| Together AI | Llama, Mistral, etc. | Open-source models, competitive pricing |
| Local | Any HuggingFace model | Full control, needs GPU(s) |
dspy.BootstrapFinetune(
metric=None, # Scoring function: (example, prediction, trace) -> bool/float
multitask=True, # Share training data across predictors
train_kwargs=None, # Fine-tuning hyperparams (e.g., {"n_epochs": 2})
exclude_demos=False, # Clear few-shot demos after fine-tuning
num_threads=None, # Parallel threads for bootstrapping
)
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| metric | Callable \| None | None | Scores each trace during bootstrapping. Only passing traces become training data. |
| multitask | bool | True | When True, shares training data across predictors. When False, each predictor gets its own fine-tuning data. |
| train_kwargs | dict \| None | None | Fine-tuning hyperparameters passed to the provider (e.g., {"n_epochs": 2}). Can be LM-specific: {lm: {"n_epochs": 3}}. |
| exclude_demos | bool | False | If True, clears few-shot demos after fine-tuning (the model has internalized them). |
| num_threads | int \| None | None | Threads for bootstrapping. Must be >= the number of fine-tuning jobs. 24 is a good starting point. |
The compile method accepts:
optimizer.compile(
student, # Your dspy.Module to fine-tune
trainset, # List of dspy.Example with labeled data
teacher=None, # Optional: a teacher program for distillation
)
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| student | dspy.Module | required | The program whose backing LM will be fine-tuned |
| trainset | list[dspy.Example] | required | Labeled training data (500+ recommended) |
| teacher | dspy.Module \| None | None | If provided, the teacher generates traces instead of the student. Use for distillation. |
BootstrapFinetune is the most expensive optimizer in DSPy. Budget for three cost stages:
Every training example gets run through your program (or the teacher). With 1000 examples and a ChainOfThought module, that's 1000+ LM calls just for bootstrapping.
The model provider charges for training. Costs depend on the number of successful traces and their length.
Fine-tuned models may cost slightly more per token than base models (OpenAI charges ~1.5x for fine-tuned inference). But if you distilled from GPT-4o to GPT-4o-mini, the net savings are still 10-30x.
| Factor | Prompt optimization (MIPROv2) | BootstrapFinetune | |--------|------------------------------|-------------------| | What it changes | Prompt instructions + few-shot examples | Model weights | | Data needed | ~200 examples | ~500+ examples | | Cost | Low (just LM calls for optimization) | High (LM calls + fine-tuning fees) | | Time | Minutes | Hours | | Quality ceiling | Good, but limited by what prompts can do | Higher -- model learns domain patterns | | Portability | Optimized prompts work with any model | Weights are locked to one model | | Iteration speed | Fast -- re-optimize in minutes | Slow -- re-train takes hours | | Best for | Early development, quick iteration | Production, maximum quality, cost reduction via distillation |
Recommended progression:
dspy.BootstrapFewShot (quick, ~50 examples)dspy.MIPROv2 (better, ~200 examples)dspy.BootstrapFinetune when prompt optimization plateaus (500+ examples)dspy.BetterTogether for absolute maximum quality (combines prompt + weight optimization)# Save the fine-tuned program
finetuned.save("finetuned_classify.json")
# Load later for production
from my_module import MyProgram
production = MyProgram()
production.load("finetuned_classify.json")
result = production(text="New ticket text...")
The saved file stores the fine-tuned model identifier (e.g., ft:gpt-4o-mini-2024-07-18:org::abc123) so loading automatically points to the right model.
If only a small fraction of training examples produce passing traces, the fine-tuning data will be thin.
Fixes:
Fixes:
Fixes:
dspy.BetterTogether to combine prompt and weight optimizationBootstrapFewShot and MIPROv2 first — they are 10-100x cheaper and often close the gap enough. Fine-tune only when prompt optimization plateaus.dspy.configure(lm=student_lm) before calling compile. BootstrapFinetune fine-tunes whatever LM is configured at compile time. If the teacher LM is still configured, the optimizer fine-tunes the expensive model instead of the cheap student. Always switch to the student LM before calling compile.num_threads too low for multi-predictor programs. num_threads must be >= the number of fine-tuning jobs (one per unique LM across all predictors). If a program has 3 predictors all using the same LM, that is 1 job. If each uses a different LM, that is 3 jobs. BootstrapFinetune raises a ValueError if threads are insufficient.exclude_demos=True after fine-tuning. Once the model weights have internalized the reasoning patterns, few-shot demos in the prompt are redundant and waste tokens. Set exclude_demos=True to remove them automatically, reducing prompt length and inference cost.Install any skill:
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>
/ai-improving-accuracy/ai-fine-tuning/ai-cutting-costs/ai-do if you do not have it — it routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-dotools
See what is happening during optimizer.compile() instead of waiting blind. Use when you want to watch optimization progress, see scores as they come in, know if your optimizer is working, check if optimization is stuck, understand why optimization is taking too long, get live progress during compile, monitor convergence, detect overfitting during optimization, interpret optimization results, or pick the right tool for watching optimization. Also used for optimizer progress bar, is my optimizer doing anything, optimization seems stuck, how long will optimization take, watch GEPA run, watch MIPROv2 run, live optimization dashboard, optimizer not improving, scores not going up, optimization taking forever, see what optimizer is doing, debug slow optimization, optimization visibility, optimizer metrics, track compile progress, optimization observability.
testing
Use when you want the highest-quality prompt optimization DSPy offers — jointly optimizes instructions and few-shot demos, with auto=light/medium/heavy presets. Common scenarios - you want the best possible accuracy from prompt optimization, jointly tuning instructions and few-shot demonstrations, using auto presets for different compute budgets, or when COPRO or BootstrapFewShot alone are not reaching your accuracy target. Related - ai-improving-accuracy, dspy-copro, dspy-bootstrap-few-shot. Also used for dspy.MIPROv2, best DSPy optimizer, highest quality optimization, auto=light medium heavy, joint instruction and demo optimization, most powerful prompt optimizer, MIPROv2 vs COPRO vs BootstrapFewShot, which optimizer should I use, state of the art prompt optimization, when to use MIPROv2, optimize both instructions and examples, heavy optimization for production, best optimizer for accuracy.
testing
Use LangWatch for DSPy auto-tracing and real-time optimizer progress. Use when you want to set up LangWatch, langwatch.dspy.init, auto-tracing DSPy, real-time optimization dashboard, optimizer progress tracking, app.langwatch.ai, or DSPy optimizer dashboard. Also used for langwatch setup, pip install langwatch, langwatch trace, optimizer progress, real-time optimization, watch optimizer run, LangWatch self-hosted, langwatch docker, langwatch vs langtrace, langwatch autotrack_dspy.
data-ai
Use when you want to optimize instructions without few-shot examples — a lightweight alternative to COPRO when you do not have or do not want to use demonstrations. Common scenarios - optimizing instructions when you do not have or do not want to use few-shot demonstrations, lightweight instruction search as a first step, tasks where examples in the prompt confuse the model, or when you want fast instruction optimization without the cost of COPRO. Related - ai-improving-accuracy, dspy-copro, dspy-miprov2. Also used for dspy.GEPA, instruction optimization without demos, lightweight prompt optimization, optimize instructions only, no few-shot examples needed, GEPA vs COPRO, quick instruction search, when demonstrations hurt performance, zero-shot optimization, instruction-only optimizer, simplest instruction tuner, fast prompt optimization, skip few-shot and just tune instructions, optimize Pydantic field descriptions, GEPA structured output, GEPA does not optimize field desc.