Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

lebsral/dspy-bootstrap-finetune

Name: dspy-bootstrap-finetune
Author: lebsral

skills/dspy-bootstrap-finetune/SKILL.md

npx skillsauth add lebsral/dspy-programming-not-prompting-lms-skills dspy-bootstrap-finetune

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Fine-Tune LM Weights with dspy.BootstrapFinetune

Guide the user through using DSPy's BootstrapFinetune optimizer to automatically generate training data from successful reasoning traces and fine-tune a language model's weights. This is the heaviest optimization DSPy offers -- it changes the model itself, not just the prompt.

What is BootstrapFinetune

dspy.BootstrapFinetune is an optimizer that tunes LM weights rather than prompts. It works in two phases:

Bootstrap: Run your program on every training example, keep the traces where your metric passes.
Fine-tune: Send those successful traces to the model provider's fine-tuning API (or a local training loop) and train the model weights on them.

The result is a version of your program backed by a fine-tuned model that has internalized the reasoning patterns from the bootstrapped traces.

Training examples ──> Run program ──> Keep passing traces ──> Fine-tune model weights

When to use BootstrapFinetune

Use it when:

You have 500+ labeled examples (1000+ is better -- more data means more successful traces to train on)
You have already tried prompt optimization (MIPROv2, BootstrapFewShot) and hit a quality ceiling
You want a smaller, cheaper model to match the quality of a larger one (model distillation)
You need maximum quality and are willing to pay the one-time cost of fine-tuning
Your domain has specialized patterns that the base model doesn't handle well out of the box

Do not use it when:

You have fewer than 500 examples -- use /ai-improving-accuracy with MIPROv2 or BootstrapFewShot instead
You haven't tried prompt optimization yet -- start there, it's 10x cheaper
Your baseline accuracy is below 50% -- fix your task definition or data first
You're still iterating on what the task is -- fine-tuning locks you into a specific behavior
You don't have a clear, automated metric -- you can't filter traces without one

Basic usage

import dspy

lm = dspy.LM("openai/gpt-4o-mini")  # or "anthropic/claude-sonnet-4-5-20250929", etc.
dspy.configure(lm=lm)

# 1. Define your program
class Classify(dspy.Signature):
    """Classify the support ticket category."""
    text: str = dspy.InputField()
    category: str = dspy.OutputField()

program = dspy.ChainOfThought(Classify)

# 2. Prepare labeled data (500+ examples)
trainset = [
    dspy.Example(text="Can't log in", category="auth").with_inputs("text"),
    dspy.Example(text="Charge me twice", category="billing").with_inputs("text"),
    # ... 500+ examples
]

# 3. Define a metric
def metric(example, prediction, trace=None):
    return prediction.category.lower() == example.category.lower()

# 4. Fine-tune
optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
finetuned = optimizer.compile(program, trainset=trainset)

# 5. Use the fine-tuned program
result = finetuned(text="My payment failed")
print(result.category)

After compile finishes, finetuned is a copy of your program that uses the newly fine-tuned model. Every module in the program that was backed by a fine-tunable LM gets updated.

Teacher-student paradigm

The most powerful pattern: use an expensive, high-quality model (the teacher) to generate traces, then fine-tune a cheap model (the student) on those traces. This is model distillation.

# --- Teacher: expensive model, high quality ---
teacher_lm = dspy.LM("openai/gpt-4o")  # or any strong model
dspy.configure(lm=teacher_lm)

teacher = dspy.ChainOfThought(Classify)

# Optionally optimize the teacher's prompts first for even better traces
prompt_optimizer = dspy.MIPROv2(metric=metric, auto="medium")
teacher_optimized = prompt_optimizer.compile(teacher, trainset=trainset)

# --- Student: cheap model, fine-tuned on teacher's traces ---
student_lm = dspy.LM("openai/gpt-4o-mini")  # or any fine-tunable model
dspy.configure(lm=student_lm)

student = dspy.ChainOfThought(Classify)

ft_optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
student_finetuned = ft_optimizer.compile(
    student,
    trainset=trainset,
    teacher=teacher_optimized,  # Teacher generates the traces
)

How it works with a teacher:

The teacher program runs on each training example using the expensive model
Only traces where the metric passes are kept
Those traces are reformatted as training data for the student model
The student model is fine-tuned on the teacher's successful reasoning patterns

The student learns to mimic the teacher's reasoning at a fraction of the inference cost.

Target model configuration

BootstrapFinetune fine-tunes whatever LM is configured when you call compile. To control which model gets fine-tuned:

# Fine-tune GPT-4o-mini
student_lm = dspy.LM("openai/gpt-4o-mini")  # or any fine-tunable model
dspy.configure(lm=student_lm)
finetuned = optimizer.compile(student, trainset=trainset)

# Fine-tune an open-source model via Together AI
student_lm = dspy.LM("together_ai/meta-llama/Llama-3-70b-chat-hf")  # or any fine-tunable model
dspy.configure(lm=student_lm)
finetuned = optimizer.compile(student, trainset=trainset)

The model must support fine-tuning through its provider's API. Common options:

| Provider | Fine-tunable models | Notes | |----------|-------------------|-------| | OpenAI | gpt-4o-mini, gpt-4o | Easiest setup, DSPy handles the API calls | | Together AI | Llama, Mistral, etc. | Open-source models, competitive pricing | | Local | Any HuggingFace model | Full control, needs GPU(s) |

Key parameters

dspy.BootstrapFinetune(
    metric=None,            # Scoring function: (example, prediction, trace) -> bool/float
    multitask=True,         # Share training data across predictors
    train_kwargs=None,      # Fine-tuning hyperparams (e.g., {"n_epochs": 2})
    exclude_demos=False,    # Clear few-shot demos after fine-tuning
    num_threads=None,       # Parallel threads for bootstrapping
)

| Parameter | Type | Default | Description | |-----------|------|---------|-------------| | metric | Callable \| None | None | Scores each trace during bootstrapping. Only passing traces become training data. | | multitask | bool | True | When True, shares training data across predictors. When False, each predictor gets its own fine-tuning data. | | train_kwargs | dict \| None | None | Fine-tuning hyperparameters passed to the provider (e.g., {"n_epochs": 2}). Can be LM-specific: {lm: {"n_epochs": 3}}. | | exclude_demos | bool | False | If True, clears few-shot demos after fine-tuning (the model has internalized them). | | num_threads | int \| None | None | Threads for bootstrapping. Must be >= the number of fine-tuning jobs. 24 is a good starting point. |

The compile method accepts:

optimizer.compile(
    student,         # Your dspy.Module to fine-tune
    trainset,        # List of dspy.Example with labeled data
    teacher=None,    # Optional: a teacher program for distillation
)

| Parameter | Type | Default | Description | |-----------|------|---------|-------------| | student | dspy.Module | required | The program whose backing LM will be fine-tuned | | trainset | list[dspy.Example] | required | Labeled training data (500+ recommended) | | teacher | dspy.Module \| None | None | If provided, the teacher generates traces instead of the student. Use for distillation. |

Computational cost

BootstrapFinetune is the most expensive optimizer in DSPy. Budget for three cost stages:

1. Bootstrapping (LM API calls)

Every training example gets run through your program (or the teacher). With 1000 examples and a ChainOfThought module, that's 1000+ LM calls just for bootstrapping.

With teacher (GPT-4o): ~$5-15 for 1000 examples (depends on input/output length)
Without teacher (GPT-4o-mini): ~$0.15-0.50 for 1000 examples

2. Fine-tuning (provider charges)

The model provider charges for training. Costs depend on the number of successful traces and their length.

OpenAI GPT-4o-mini: ~$0.008/1K training tokens
OpenAI GPT-4o: ~$0.025/1K training tokens
Together AI: Varies by model, generally cheaper for open-source

3. Inference (ongoing)

Fine-tuned models may cost slightly more per token than base models (OpenAI charges ~1.5x for fine-tuned inference). But if you distilled from GPT-4o to GPT-4o-mini, the net savings are still 10-30x.

Time

Bootstrapping: minutes to an hour (depends on dataset size and thread count)
Fine-tuning: 30 minutes to several hours (depends on provider and dataset size)
Total: plan for 1-4 hours end to end

When to use BootstrapFinetune vs prompt optimization

| Factor | Prompt optimization (MIPROv2) | BootstrapFinetune | |--------|------------------------------|-------------------| | What it changes | Prompt instructions + few-shot examples | Model weights | | Data needed | ~200 examples | ~500+ examples | | Cost | Low (just LM calls for optimization) | High (LM calls + fine-tuning fees) | | Time | Minutes | Hours | | Quality ceiling | Good, but limited by what prompts can do | Higher -- model learns domain patterns | | Portability | Optimized prompts work with any model | Weights are locked to one model | | Iteration speed | Fast -- re-optimize in minutes | Slow -- re-train takes hours | | Best for | Early development, quick iteration | Production, maximum quality, cost reduction via distillation |

Recommended progression:

Start with dspy.BootstrapFewShot (quick, ~50 examples)
Graduate to dspy.MIPROv2 (better, ~200 examples)
Use dspy.BootstrapFinetune when prompt optimization plateaus (500+ examples)
Try dspy.BetterTogether for absolute maximum quality (combines prompt + weight optimization)

Save and load

# Save the fine-tuned program
finetuned.save("finetuned_classify.json")

# Load later for production
from my_module import MyProgram
production = MyProgram()
production.load("finetuned_classify.json")
result = production(text="New ticket text...")

The saved file stores the fine-tuned model identifier (e.g., ft:gpt-4o-mini-2024-07-18:org::abc123) so loading automatically points to the right model.

Troubleshooting

Not enough successful traces

If only a small fraction of training examples produce passing traces, the fine-tuning data will be thin.

Fixes:

Use a stronger teacher model (GPT-4o instead of GPT-4o-mini)
Relax your metric temporarily (accept partial credit during bootstrapping)
Simplify your task or break multi-step programs into single steps
Add more training examples so even a low success rate yields enough traces

Overfitting (high train accuracy, low test accuracy)

Fixes:

Add more training data
Reduce fine-tuning epochs (if your provider exposes this setting)
Use a larger base model (less prone to memorization)
Simplify output format

Fine-tuning didn't beat prompt optimization

Fixes:

Verify bootstrapping produced 200+ successful traces (check logs)
Try dspy.BetterTogether to combine prompt and weight optimization
Confirm your metric correlates with actual quality
Try a different base model

Gotchas

Claude skips prompt optimization and jumps straight to fine-tuning. Fine-tuning is the heaviest, most expensive optimization in DSPy. Always try BootstrapFewShot and MIPROv2 first — they are 10-100x cheaper and often close the gap enough. Fine-tune only when prompt optimization plateaus.
Claude forgets to set dspy.configure(lm=student_lm) before calling compile. BootstrapFinetune fine-tunes whatever LM is configured at compile time. If the teacher LM is still configured, the optimizer fine-tunes the expensive model instead of the cheap student. Always switch to the student LM before calling compile.
Claude sets num_threads too low for multi-predictor programs. num_threads must be >= the number of fine-tuning jobs (one per unique LM across all predictors). If a program has 3 predictors all using the same LM, that is 1 job. If each uses a different LM, that is 3 jobs. BootstrapFinetune raises a ValueError if threads are insufficient.
Claude does not set exclude_demos=True after fine-tuning. Once the model weights have internalized the reasoning patterns, few-shot demos in the prompt are redundant and waste tokens. Set exclude_demos=True to remove them automatically, reducing prompt length and inference cost.
Claude uses BootstrapFinetune with fewer than 200 successful traces. The optimizer only keeps traces where the metric passes. If your dataset is 500 examples but only 20% pass, you get ~100 traces — too few for effective fine-tuning. Check your metric pass rate first and use a stronger teacher or relax the metric to get 200+ passing traces.

Additional resources

dspy.BootstrapFinetune API docs
reference.md — constructor parameters, compile() method, fine-tuning hyperparameters
examples.md — teacher-student distillation, production cost reduction workflow

Cross-references

Install any skill: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>

BootstrapFewShot for lighter optimization without fine-tuning -- see /ai-improving-accuracy
Fine-tuning workflow for the full decision framework, prerequisites, and BetterTogether -- see /ai-fine-tuning
Cost reduction for distillation and other strategies to cut API spend -- see /ai-cutting-costs
For worked examples (distillation, production cost reduction), see examples.md
Install /ai-do if you do not have it — it routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do

lebsral/dspy-bootstrap-finetune

skills/dspy-bootstrap-finetune/SKILL.md

Use when you need maximum quality from a smaller/cheaper model — generates training data from a teacher model and fine-tunes a student model weights. Common scenarios - distilling GPT-4 quality into a cheaper model, generating training data from a strong teacher to fine-tune a weak student, reducing inference costs by replacing an expensive model with a fine-tuned small one, or building a production model that is fast and cheap. Related - ai-fine-tuning, ai-cutting-costs, dspy-better-together. Also used for dspy.BootstrapFinetune, model distillation with DSPy, teacher-student training, fine-tune small model from GPT-4 outputs, reduce API costs with fine-tuning, generate training data then fine-tune, cheap model same quality, distill large model into small model, fine-tune Llama from GPT-4, production model training, move from API to self-hosted model.

5 stars

development

Updated May 5, 2026

$ install --global

skillsauth

npx skillsauth add lebsral/dspy-programming-not-prompting-lms-skills dspy-bootstrap-finetune

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: May 5, 2026, 8:01 AM134.1s4 files scanned

SKILL.md

name:: dspy-bootstrap-finetune
description:: Use when you need maximum quality from a smaller/cheaper model — generates training data from a teacher model and fine-tunes a student model weights. Common scenarios - distilling GPT-4 quality into a cheaper model, generating training data from a strong teacher to fine-tune a weak student, reducing inference costs by replacing an expensive model with a fine-tuned small one, or building a production model that is fast and cheap. Related - ai-fine-tuning, ai-cutting-costs, dspy-better-together. Also used for dspy.BootstrapFinetune, model distillation with DSPy, teacher-student training, fine-tune small model from GPT-4 outputs, reduce API costs with fine-tuning, generate training data then fine-tune, cheap model same quality, distill large model into small model, fine-tune Llama from GPT-4, production model training, move from API to self-hosted model.

Fine-Tune LM Weights with dspy.BootstrapFinetune

What is BootstrapFinetune

dspy.BootstrapFinetune is an optimizer that tunes LM weights rather than prompts. It works in two phases:

Bootstrap: Run your program on every training example, keep the traces where your metric passes.
Fine-tune: Send those successful traces to the model provider's fine-tuning API (or a local training loop) and train the model weights on them.

The result is a version of your program backed by a fine-tuned model that has internalized the reasoning patterns from the bootstrapped traces.

Training examples ──> Run program ──> Keep passing traces ──> Fine-tune model weights

When to use BootstrapFinetune

Use it when:

You have 500+ labeled examples (1000+ is better -- more data means more successful traces to train on)
You have already tried prompt optimization (MIPROv2, BootstrapFewShot) and hit a quality ceiling
You want a smaller, cheaper model to match the quality of a larger one (model distillation)
You need maximum quality and are willing to pay the one-time cost of fine-tuning
Your domain has specialized patterns that the base model doesn't handle well out of the box

Do not use it when:

You have fewer than 500 examples -- use /ai-improving-accuracy with MIPROv2 or BootstrapFewShot instead
You haven't tried prompt optimization yet -- start there, it's 10x cheaper
Your baseline accuracy is below 50% -- fix your task definition or data first
You're still iterating on what the task is -- fine-tuning locks you into a specific behavior
You don't have a clear, automated metric -- you can't filter traces without one

Basic usage

import dspy

lm = dspy.LM("openai/gpt-4o-mini")  # or "anthropic/claude-sonnet-4-5-20250929", etc.
dspy.configure(lm=lm)

# 1. Define your program
class Classify(dspy.Signature):
    """Classify the support ticket category."""
    text: str = dspy.InputField()
    category: str = dspy.OutputField()

program = dspy.ChainOfThought(Classify)

# 2. Prepare labeled data (500+ examples)
trainset = [
    dspy.Example(text="Can't log in", category="auth").with_inputs("text"),
    dspy.Example(text="Charge me twice", category="billing").with_inputs("text"),
    # ... 500+ examples
]

# 3. Define a metric
def metric(example, prediction, trace=None):
    return prediction.category.lower() == example.category.lower()

# 4. Fine-tune
optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
finetuned = optimizer.compile(program, trainset=trainset)

# 5. Use the fine-tuned program
result = finetuned(text="My payment failed")
print(result.category)

After compile finishes, finetuned is a copy of your program that uses the newly fine-tuned model. Every module in the program that was backed by a fine-tunable LM gets updated.

Teacher-student paradigm

The most powerful pattern: use an expensive, high-quality model (the teacher) to generate traces, then fine-tune a cheap model (the student) on those traces. This is model distillation.

# --- Teacher: expensive model, high quality ---
teacher_lm = dspy.LM("openai/gpt-4o")  # or any strong model
dspy.configure(lm=teacher_lm)

teacher = dspy.ChainOfThought(Classify)

# Optionally optimize the teacher's prompts first for even better traces
prompt_optimizer = dspy.MIPROv2(metric=metric, auto="medium")
teacher_optimized = prompt_optimizer.compile(teacher, trainset=trainset)

# --- Student: cheap model, fine-tuned on teacher's traces ---
student_lm = dspy.LM("openai/gpt-4o-mini")  # or any fine-tunable model
dspy.configure(lm=student_lm)

student = dspy.ChainOfThought(Classify)

ft_optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
student_finetuned = ft_optimizer.compile(
    student,
    trainset=trainset,
    teacher=teacher_optimized,  # Teacher generates the traces
)

How it works with a teacher:

The teacher program runs on each training example using the expensive model
Only traces where the metric passes are kept
Those traces are reformatted as training data for the student model
The student model is fine-tuned on the teacher's successful reasoning patterns

The student learns to mimic the teacher's reasoning at a fraction of the inference cost.

Target model configuration

BootstrapFinetune fine-tunes whatever LM is configured when you call compile. To control which model gets fine-tuned:

# Fine-tune GPT-4o-mini
student_lm = dspy.LM("openai/gpt-4o-mini")  # or any fine-tunable model
dspy.configure(lm=student_lm)
finetuned = optimizer.compile(student, trainset=trainset)

# Fine-tune an open-source model via Together AI
student_lm = dspy.LM("together_ai/meta-llama/Llama-3-70b-chat-hf")  # or any fine-tunable model
dspy.configure(lm=student_lm)
finetuned = optimizer.compile(student, trainset=trainset)

The model must support fine-tuning through its provider's API. Common options:

Key parameters

dspy.BootstrapFinetune(
    metric=None,            # Scoring function: (example, prediction, trace) -> bool/float
    multitask=True,         # Share training data across predictors
    train_kwargs=None,      # Fine-tuning hyperparams (e.g., {"n_epochs": 2})
    exclude_demos=False,    # Clear few-shot demos after fine-tuning
    num_threads=None,       # Parallel threads for bootstrapping
)

The compile method accepts:

optimizer.compile(
    student,         # Your dspy.Module to fine-tune
    trainset,        # List of dspy.Example with labeled data
    teacher=None,    # Optional: a teacher program for distillation
)

Computational cost

BootstrapFinetune is the most expensive optimizer in DSPy. Budget for three cost stages:

1. Bootstrapping (LM API calls)

Every training example gets run through your program (or the teacher). With 1000 examples and a ChainOfThought module, that's 1000+ LM calls just for bootstrapping.

With teacher (GPT-4o): ~$5-15 for 1000 examples (depends on input/output length)
Without teacher (GPT-4o-mini): ~$0.15-0.50 for 1000 examples

2. Fine-tuning (provider charges)

The model provider charges for training. Costs depend on the number of successful traces and their length.

OpenAI GPT-4o-mini: ~$0.008/1K training tokens
OpenAI GPT-4o: ~$0.025/1K training tokens
Together AI: Varies by model, generally cheaper for open-source

3. Inference (ongoing)

Fine-tuned models may cost slightly more per token than base models (OpenAI charges ~1.5x for fine-tuned inference). But if you distilled from GPT-4o to GPT-4o-mini, the net savings are still 10-30x.

Time

Bootstrapping: minutes to an hour (depends on dataset size and thread count)
Fine-tuning: 30 minutes to several hours (depends on provider and dataset size)
Total: plan for 1-4 hours end to end

When to use BootstrapFinetune vs prompt optimization

Recommended progression:

Start with dspy.BootstrapFewShot (quick, ~50 examples)
Graduate to dspy.MIPROv2 (better, ~200 examples)
Use dspy.BootstrapFinetune when prompt optimization plateaus (500+ examples)
Try dspy.BetterTogether for absolute maximum quality (combines prompt + weight optimization)

Save and load

# Save the fine-tuned program
finetuned.save("finetuned_classify.json")

# Load later for production
from my_module import MyProgram
production = MyProgram()
production.load("finetuned_classify.json")
result = production(text="New ticket text...")

The saved file stores the fine-tuned model identifier (e.g., ft:gpt-4o-mini-2024-07-18:org::abc123) so loading automatically points to the right model.

Troubleshooting

Not enough successful traces

If only a small fraction of training examples produce passing traces, the fine-tuning data will be thin.

Fixes:

Use a stronger teacher model (GPT-4o instead of GPT-4o-mini)
Relax your metric temporarily (accept partial credit during bootstrapping)
Simplify your task or break multi-step programs into single steps
Add more training examples so even a low success rate yields enough traces

Overfitting (high train accuracy, low test accuracy)

Fixes:

Add more training data
Reduce fine-tuning epochs (if your provider exposes this setting)
Use a larger base model (less prone to memorization)
Simplify output format

Fine-tuning didn't beat prompt optimization

Fixes:

Verify bootstrapping produced 200+ successful traces (check logs)
Try dspy.BetterTogether to combine prompt and weight optimization
Confirm your metric correlates with actual quality
Try a different base model

Gotchas

Claude skips prompt optimization and jumps straight to fine-tuning. Fine-tuning is the heaviest, most expensive optimization in DSPy. Always try BootstrapFewShot and MIPROv2 first — they are 10-100x cheaper and often close the gap enough. Fine-tune only when prompt optimization plateaus.
Claude forgets to set dspy.configure(lm=student_lm) before calling compile. BootstrapFinetune fine-tunes whatever LM is configured at compile time. If the teacher LM is still configured, the optimizer fine-tunes the expensive model instead of the cheap student. Always switch to the student LM before calling compile.
Claude sets num_threads too low for multi-predictor programs. num_threads must be >= the number of fine-tuning jobs (one per unique LM across all predictors). If a program has 3 predictors all using the same LM, that is 1 job. If each uses a different LM, that is 3 jobs. BootstrapFinetune raises a ValueError if threads are insufficient.
Claude does not set exclude_demos=True after fine-tuning. Once the model weights have internalized the reasoning patterns, few-shot demos in the prompt are redundant and waste tokens. Set exclude_demos=True to remove them automatically, reducing prompt length and inference cost.
Claude uses BootstrapFinetune with fewer than 200 successful traces. The optimizer only keeps traces where the metric passes. If your dataset is 500 examples but only 20% pass, you get ~100 traces — too few for effective fine-tuning. Check your metric pass rate first and use a stronger teacher or relax the metric to get 200+ passing traces.

Additional resources

dspy.BootstrapFinetune API docs
reference.md — constructor parameters, compile() method, fine-tuning hyperparameters
examples.md — teacher-student distillation, production cost reduction workflow

Cross-references

Install any skill: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>

BootstrapFewShot for lighter optimization without fine-tuning -- see /ai-improving-accuracy
Fine-tuning workflow for the full decision framework, prerequisites, and BetterTogether -- see /ai-fine-tuning
Cost reduction for distillation and other strategies to cut API spend -- see /ai-cutting-costs
For worked examples (distillation, production cost reduction), see examples.md
Install /ai-do if you do not have it — it routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do

Related Skills

lebsral/ai-watching-optimization

tools

VerifiedTrustedCommunity

See what is happening during optimizer.compile() instead of waiting blind. Use when you want to watch optimization progress, see scores as they come in, know if your optimizer is working, check if optimization is stuck, understand why optimization is taking too long, get live progress during compile, monitor convergence, detect overfitting during optimization, interpret optimization results, or pick the right tool for watching optimization. Also used for optimizer progress bar, is my optimizer doing anything, optimization seems stuck, how long will optimization take, watch GEPA run, watch MIPROv2 run, live optimization dashboard, optimizer not improving, scores not going up, optimization taking forever, see what optimizer is doing, debug slow optimization, optimization visibility, optimizer metrics, track compile progress, optimization observability.

6SKILL.mdUpdated May 31, 2026

lebsral/ai-watching-optimization

lebsral/dspy-miprov2

testing

VerifiedTrustedCommunity

Use when you want the highest-quality prompt optimization DSPy offers — jointly optimizes instructions and few-shot demos, with auto=light/medium/heavy presets. Common scenarios - you want the best possible accuracy from prompt optimization, jointly tuning instructions and few-shot demonstrations, using auto presets for different compute budgets, or when COPRO or BootstrapFewShot alone are not reaching your accuracy target. Related - ai-improving-accuracy, dspy-copro, dspy-bootstrap-few-shot. Also used for dspy.MIPROv2, best DSPy optimizer, highest quality optimization, auto=light medium heavy, joint instruction and demo optimization, most powerful prompt optimizer, MIPROv2 vs COPRO vs BootstrapFewShot, which optimizer should I use, state of the art prompt optimization, when to use MIPROv2, optimize both instructions and examples, heavy optimization for production, best optimizer for accuracy.

6SKILL.mdUpdated Apr 27, 2026

lebsral/dspy-langwatch

testing

VerifiedTrustedCommunity

Use LangWatch for DSPy auto-tracing and real-time optimizer progress. Use when you want to set up LangWatch, langwatch.dspy.init, auto-tracing DSPy, real-time optimization dashboard, optimizer progress tracking, app.langwatch.ai, or DSPy optimizer dashboard. Also used for langwatch setup, pip install langwatch, langwatch trace, optimizer progress, real-time optimization, watch optimizer run, LangWatch self-hosted, langwatch docker, langwatch vs langtrace, langwatch autotrack_dspy.

6SKILL.mdUpdated Apr 27, 2026

lebsral/dspy-langwatch

lebsral/dspy-gepa

data-ai

VerifiedTrustedCommunity

Use when you want to optimize instructions without few-shot examples — a lightweight alternative to COPRO when you do not have or do not want to use demonstrations. Common scenarios - optimizing instructions when you do not have or do not want to use few-shot demonstrations, lightweight instruction search as a first step, tasks where examples in the prompt confuse the model, or when you want fast instruction optimization without the cost of COPRO. Related - ai-improving-accuracy, dspy-copro, dspy-miprov2. Also used for dspy.GEPA, instruction optimization without demos, lightweight prompt optimization, optimize instructions only, no few-shot examples needed, GEPA vs COPRO, quick instruction search, when demonstrations hurt performance, zero-shot optimization, instruction-only optimizer, simplest instruction tuner, fast prompt optimization, skip few-shot and just tune instructions, optimize Pydantic field descriptions, GEPA structured output, GEPA does not optimize field desc.

6SKILL.mdUpdated Apr 27, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/lebsral/dspy-programming-not-prompting-lms-skills.git

# Copy into Claude Code skills folder (global)
cp -r dspy-programming-not-prompting-lms-skills/skills/dspy-bootstrap-finetune ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

lebsral/dspy-programming-not-prompting-lms-skills

5 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT