skills/ai-switching-models/SKILL.md
Switch AI providers or models without breaking things. Use when you want to switch from OpenAI to Anthropic, try a cheaper model, stop depending on one vendor, compare models side-by-side, a model update broke your outputs, you need vendor diversification, or you want to migrate to a local model. Also use when your prompt broke after a model update, prompts that work for GPT-4 do not work for Claude or Llama, or you need to do a model migration. Covers DSPy model portability with provider config, re-optimization, model comparison, and multi-model pipelines. Also used for migrate from OpenAI to Anthropic, GPT to Claude migration, try Llama instead of GPT, model comparison framework, multi-provider AI setup, avoid vendor lock-in for AI, prompts break when switching models, model-agnostic AI code.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add lebsral/dspy-programming-not-prompting-lms-skills ai-switching-models
Guide the user through switching AI models or providers safely. The key insight: optimized prompts don't transfer between models (arxiv 2402.10949v2 — "The Unreasonable Effectiveness of Eccentric Automatic Prompts"). DSPy solves this by separating your task definition (signatures + modules) from model-specific prompts (compiled by optimizers).
Hand-tuned prompts are model-specific. A prompt engineered for GPT-4o will perform differently on Claude, Llama, or even GPT-4o-mini. Research shows optimized prompts for one model can actually hurt performance on another.
DSPy makes switching safe because:
The workflow: keep your program the same, swap the model, re-optimize. Done.
Ask the user:
/ai-improving-accuracy)
Common scenarios:
DSPy uses LiteLLM under the hood, so you can use any supported provider with a simple string:
import dspy
# OpenAI
lm = dspy.LM("openai/gpt-4o")
lm = dspy.LM("openai/gpt-4o-mini")
# Anthropic
lm = dspy.LM("anthropic/claude-sonnet-4-5-20250929")
lm = dspy.LM("anthropic/claude-haiku-4-5-20251001")
# Azure OpenAI
lm = dspy.LM("azure/my-gpt4-deployment")
# Google
lm = dspy.LM("gemini/gemini-2.0-flash")
# Together AI (open-source models)
lm = dspy.LM("together_ai/meta-llama/Llama-3-70b-chat-hf")
# Local models (via Ollama)
lm = dspy.LM("ollama_chat/llama3.1", api_base="http://localhost:11434")
# Any OpenAI-compatible server (vLLM, TGI, etc.)
lm = dspy.LM("openai/my-model", api_base="http://localhost:8000/v1", api_key="none")
dspy.configure(lm=lm)
Set API keys as environment variables — don't hardcode them:
# .env file
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
TOGETHER_API_KEY=...
AZURE_API_KEY=...
AZURE_API_BASE=https://your-resource.openai.azure.com/
See LiteLLM provider docs for the full list of 100+ supported providers.
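A missing key usually shows up only when the first model call fails, so a small pre-flight check can save a debugging round. The sketch below is not part of DSPy or LiteLLM: `REQUIRED_ENV` and `missing_env_vars` are hypothetical names, and the mapping simply mirrors the `.env` entries listed above.

```python
import os

# Hypothetical mapping from provider prefix to the env vars shown in the
# .env example. Extend it for any other providers you use.
REQUIRED_ENV = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "together_ai": ["TOGETHER_API_KEY"],
    "azure": ["AZURE_API_KEY", "AZURE_API_BASE"],
}


def missing_env_vars(model_id, env=None):
    """Return the env vars the provider prefix of model_id needs but that are unset."""
    env = os.environ if env is None else env
    # "anthropic/claude-..." -> "anthropic"
    provider = model_id.split("/", 1)[0]
    # Unknown prefixes (for example a local Ollama model) require nothing here.
    return [name for name in REQUIRED_ENV.get(provider, []) if not env.get(name)]
```

Calling `missing_env_vars("anthropic/claude-sonnet-4-5-20250929")` before `dspy.LM(...)` returns an empty list when the key is set. An `openai/...` model pointed at a local server with `api_key="none"` would still be flagged, so treat the result as a hint rather than a hard failure.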
Before changing anything, measure your baseline. You need a metric and test data.
from dspy.evaluate import Evaluate
# Your existing program and metric
program = MyProgram()
program.load("current_optimized.json") # load your production prompts
evaluator = Evaluate(
    devset=devset,
    metric=metric,
    num_threads=4,
    display_progress=True,
    display_table=5,
)
# Benchmark with your current model
current_lm = dspy.LM("openai/gpt-4o")
dspy.configure(lm=current_lm)
baseline_score = evaluator(program)
print(f"Current model baseline: {baseline_score:.1f}%")
If you don't have a metric or test data yet, use /ai-improving-accuracy to set them up first.
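For orientation, a DSPy metric is a callable that takes an example, a prediction, and an optional `trace` argument. The sketch below shows that shape with a simple exact-match check. The `answer` field and the `exact_match_metric` name are assumptions for illustration, and `SimpleNamespace` stands in for real `dspy.Example` and prediction objects so the snippet runs without any model call.

```python
from types import SimpleNamespace


def exact_match_metric(example, pred, trace=None):
    """Return True when the predicted answer equals the expected one,
    ignoring case and surrounding whitespace."""
    expected = str(example.answer).strip().lower()
    predicted = str(pred.answer).strip().lower()
    return expected == predicted


# Stand-ins for a dspy.Example and a prediction, used only to show the call shape.
example = SimpleNamespace(question="What is 2 + 2?", answer="4")
pred = SimpleNamespace(answer=" 4 ")
print(exact_match_metric(example, pred))
```

Because the metric stays the same across models, the scores it produces before and after a switch are directly comparable.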
Swap the model and run your evaluation without re-optimizing. This demonstrates the problem — your old prompts don't transfer.
# Try the new model with your OLD optimized prompts
new_lm = dspy.LM("anthropic/claude-sonnet-4-5-20250929")
dspy.configure(lm=new_lm)
naive_score = evaluator(program)
print(f"Old model (optimized): {baseline_score:.1f}%")
print(f"New model (old prompts): {naive_score:.1f}%")
print(f"Drop: {baseline_score - naive_score:.1f}%")
You'll typically see a quality drop — this is expected. The optimized prompts were tuned for the old model.
Now re-optimize your program for the new model. Use the same signatures and modules — only the compiled prompts change.
# Configure the new model
new_lm = dspy.LM("anthropic/claude-sonnet-4-5-20250929")
dspy.configure(lm=new_lm)
# Start from a fresh (unoptimized) program
fresh_program = MyProgram()
# Re-optimize for the new model
optimizer = dspy.MIPROv2(metric=metric, auto="medium")
optimized_for_new = optimizer.compile(fresh_program, trainset=trainset)
# Evaluate
reoptimized_score = evaluator(optimized_for_new)
print(f"Old model (optimized): {baseline_score:.1f}%")
print(f"New model (old prompts): {naive_score:.1f}%")
print(f"New model (re-optimized): {reoptimized_score:.1f}%")
The re-optimized score should recover most or all of the quality. If it doesn't, either:
auto="heavy")
For a quick check before committing to a full MIPROv2 run:
optimizer = dspy.BootstrapFewShot(
    metric=metric,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
)
quick_optimized = optimizer.compile(fresh_program, trainset=trainset)
quick_score = evaluator(quick_optimized)
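To put a number on "recover most or all of the quality", one option is to express the re-optimized score as a fraction of the drop seen in the naive swap. This is a minimal sketch, assuming all three scores are plain floats on the same scale; `recovery_fraction` is a hypothetical helper, not a DSPy function.

```python
def recovery_fraction(baseline, naive, reoptimized):
    """Fraction of the naive-swap drop that re-optimization won back.

    1.0 means the new model matches the old baseline, values above 1.0 mean
    it beats the baseline, and None means there was no drop to recover.
    """
    drop = baseline - naive
    if drop <= 0:
        return None
    return (reoptimized - naive) / drop
```

With the variables from the steps above this would be called as `recovery_fraction(baseline_score, naive_score, reoptimized_score)`. A value well below 1.0 suggests trying a heavier optimization run or more training data before ruling the new model out.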
Loop over candidate models, optimize each, and build a comparison table:
candidates = [
    ("openai/gpt-4o", "GPT-4o"),
    ("openai/gpt-4o-mini", "GPT-4o-mini"),
    ("anthropic/claude-sonnet-4-5-20250929", "Claude Sonnet"),
    ("together_ai/meta-llama/Llama-3-70b-chat-hf", "Llama 3 70B"),
]
results = []
for model_id, label in candidates:
    lm = dspy.LM(model_id)
    dspy.configure(lm=lm)
    # Optimize for this model
    fresh = MyProgram()
    optimizer = dspy.BootstrapFewShot(metric=metric, max_bootstrapped_demos=4)
    optimized = optimizer.compile(fresh, trainset=trainset)
    # Evaluate
    score = evaluator(optimized)
    # Save the optimized program
    optimized.save(f"optimized_{label.lower().replace(' ', '_')}.json")
    results.append({"model": label, "score": score})
    print(f"{label}: {score:.1f}%")
# Print comparison table
print("\n--- Model Comparison ---")
print(f"{'Model':<25} {'Score':>8}")
print("-" * 35)
for r in sorted(results, key=lambda x: x["score"], reverse=True):
    print(f"{r['model']:<25} {r['score']:>7.1f}%")
For a more thorough comparison with MIPROv2 and cost/latency tracking, see examples.md.
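Once each candidate has a score, a common follow-up is to pick the cheapest model that stays close to the best one. The sketch below assumes each result dict carries an extra `cost` field alongside `model` and `score`. That field, the `tolerance` default, and the helper name are hypothetical, and the numbers in the usage lines are placeholders rather than measurements.

```python
def cheapest_within_tolerance(results, tolerance=2.0):
    """Pick the lowest-cost entry whose score is within `tolerance` points
    of the best score in `results`."""
    if not results:
        raise ValueError("results is empty")
    best = max(r["score"] for r in results)
    eligible = [r for r in results if best - r["score"] <= tolerance]
    return min(eligible, key=lambda r: r["cost"])


# Placeholder values for illustration only, not real benchmark data.
example_results = [
    {"model": "model-a", "score": 90.0, "cost": 5.0},
    {"model": "model-b", "score": 89.0, "cost": 1.0},
    {"model": "model-c", "score": 80.0, "cost": 0.2},
]
choice = cheapest_within_tolerance(example_results)
print(choice["model"])
```

Here `model-c` is cheapest overall but falls outside the two-point tolerance, so the helper selects `model-b`.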
You don't have to use one model for everything. Assign different models to different steps — cheap for simple tasks, expensive for hard ones.
dspy.context (temporary, per-call)
cheap_lm = dspy.LM("openai/gpt-4o-mini")
expensive_lm = dspy.LM("openai/gpt-4o")
dspy.configure(lm=expensive_lm)  # default

class MyPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        self.classify = dspy.Predict(ClassifySignature)
        self.generate = dspy.ChainOfThought(GenerateSignature)

    def forward(self, text):
        # Cheap model for simple classification
        with dspy.context(lm=cheap_lm):
            category = self.classify(text=text)
        # Expensive model for complex generation (uses the configured default)
        return self.generate(text=text, category=category.label)
set_lm (permanent, per-module)
pipeline = MyPipeline()
pipeline.classify.set_lm(cheap_lm)
pipeline.generate.set_lm(expensive_lm)
See /ai-cutting-costs for more cost optimization patterns with per-module LM assignment.
Save a separate optimized program for each model you might use in production:
# Save per-model optimized programs
optimized_gpt4o.save("optimized_gpt4o.json")
optimized_claude.save("optimized_claude.json")
optimized_llama.save("optimized_llama.json")
# In production — load the right one
import os
model_name = os.environ.get("AI_MODEL", "openai/gpt-4o")
lm = dspy.LM(model_name)
dspy.configure(lm=lm)
program = MyProgram()
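# Look up the file saved for this model. Deriving the name from the model ID
# would give "optimized_gpt-4o.json", which does not match the files saved above.
# The Claude and Llama IDs are assumed from the earlier examples; adjust them
# to the models you actually optimized.
OPTIMIZED_FILES = {
    "openai/gpt-4o": "optimized_gpt4o.json",
    "anthropic/claude-sonnet-4-5-20250929": "optimized_claude.json",
    "together_ai/meta-llama/Llama-3-70b-chat-hf": "optimized_llama.json",
}
program.load(OPTIMIZED_FILES[model_name])
"openai/gpt-4o" to "anthropic/claude-sonnet-4-5-20250929"
dspy.LM("ollama_chat/llama3.1", api_base="http://localhost:11434")
When a provider updates their model (e.g., GPT-4o version bump):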
/ai-improving-accuracy.
dspy.cache), shorter signatures, or dspy.Predict instead of dspy.ChainOfThought before switching to a weaker model.
dspy.Predict already scores 95%+, the model choice barely matters. Focus effort elsewhere.
dspy.configure(lm=candidate_lm) without isolating the judge. Use dspy.context(lm=judge_lm) inside your metric function so the judge stays constant across all candidates.
dspy.context when set_lm is needed (and vice versa). dspy.context(lm=...) is temporary and scoped to a with block -- good for per-call overrides. module.set_lm(lm) is permanent and persists through optimization -- use it when a module should always use a specific model. Mixing them up causes silent evaluation bugs.
auto="heavy") to approach cloud model quality. Start with BootstrapFewShot with max_bootstrapped_demos=8 and move to MIPROv2 if scores are still low.
Install any skill:
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>
/ai-improving-accuracy
/ai-cutting-costs
/ai-building-pipelines
/ai-fine-tuning
/dspy-optimizers
/ai-do if you do not have it. It routes any AI problem to the right skill and is the fastest way to work:
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do