skills/ai-cutting-costs/SKILL.md
Reduce your AI API bill. Use when AI costs are too high, API calls are too expensive, you want to use cheaper models, optimize token usage, reduce LLM spending, route easy questions to cheap models, or make your AI feature more cost-effective. Also used for GPT-4 costs too much for production, AI bill keeps growing, how to reduce OpenAI costs, optimize LLM token usage, smart model routing saves money, prompt is too long and expensive, cheaper than GPT-4 with same quality.
npx skillsauth add lebsral/dspy-programming-not-prompting-lms-skills ai-cutting-costs
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Guide the user through reducing AI API costs without sacrificing quality. Multiple strategies, from quick wins to advanced techniques.
Ask the user:
import dspy
# Run your program and check token usage
lm = dspy.LM("openai/gpt-4o-mini") # or "anthropic/claude-sonnet-4-5-20250929", etc.
dspy.configure(lm=lm)
result = my_program(question="test")
dspy.inspect_history(n=3) # Prints the last 3 LM calls so you can see prompt and response length
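To get token totals rather than eyeballing the printed history, one option is to sum the usage records kept on the LM object. The sketch below assumes each entry in `lm.history` is a dict that may carry a `usage` dict with OpenAI-style keys (`prompt_tokens`, `completion_tokens`); that layout is an assumption about your DSPy version and provider. `summarize_usage` is a hypothetical helper, not a DSPy function, and it runs here on hand-written sample records.

```python
def summarize_usage(history):
    """Sum prompt and completion tokens over a list of LM call records.

    Assumes each record is a dict that may carry a "usage" dict with
    OpenAI-style keys. Records without usage count as zero tokens.
    """
    totals = {"calls": 0, "prompt_tokens": 0, "completion_tokens": 0}
    for record in history:
        usage = record.get("usage") or {}
        totals["calls"] += 1
        totals["prompt_tokens"] += usage.get("prompt_tokens", 0)
        totals["completion_tokens"] += usage.get("completion_tokens", 0)
    return totals


# Hand-written sample records standing in for lm.history (assumed layout)
sample_history = [
    {"usage": {"prompt_tokens": 420, "completion_tokens": 80}},
    {"usage": {"prompt_tokens": 390, "completion_tokens": 65}},
    {"usage": {}},  # a call with no recorded usage
]
usage_totals = summarize_usage(sample_history)
print(usage_totals)
```

With a live program you would pass `lm.history` in place of `sample_history`, after checking that your entries actually have this shape.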
The simplest fix — switch to a cheaper model and see if quality holds:
# Instead of GPT-4o (~$5/M input tokens)
lm = dspy.LM("openai/gpt-4o-mini") # or "anthropic/claude-haiku-4-5-20251001", etc. — ~$0.15/M input tokens
# Or use an open-source model
lm = dspy.LM("together_ai/meta-llama/Llama-3-70b-chat-hf") # or any provider DSPy supports
Always measure quality before and after with /ai-improving-accuracy. When you switch models, re-optimize your prompts — they don't transfer. See /ai-switching-models for the full workflow.
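To see what a model switch is worth before testing quality, a quick back-of-envelope calculation helps. The sketch below uses the approximate input-token prices quoted above; the workload (100,000 calls of 1,000 input tokens each) is a made-up example, and output-token cost is ignored for simplicity. `input_cost_usd` is a hypothetical helper.

```python
def input_cost_usd(calls, input_tokens_per_call, price_per_million_usd):
    """Input-token cost in USD for a batch of calls at a flat per-million price."""
    total_tokens = calls * input_tokens_per_call
    return total_tokens * price_per_million_usd / 1_000_000


# Hypothetical workload: 100,000 calls, 1,000 input tokens each
calls, tokens_per_call = 100_000, 1_000

gpt4o_cost = input_cost_usd(calls, tokens_per_call, 5.00)  # ~$5/M input tokens
mini_cost = input_cost_usd(calls, tokens_per_call, 0.15)   # ~$0.15/M input tokens

print(f"gpt-4o: ${gpt4o_cost:.2f}  gpt-4o-mini: ${mini_cost:.2f}")
```

Check current provider pricing before relying on these figures; prices change often.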
DSPy caches LM calls by default. Make sure you're not disabling it:
# Caching is ON by default — same inputs won't re-call the API
lm = dspy.LM("openai/gpt-4o-mini") # or "anthropic/claude-haiku-4-5-20251001", etc. — cached automatically
# To verify caching is working, run the same input twice
# and check that the second call is instant
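The "run it twice" check can be pictured with a toy cache. This is a plain-Python illustration of request-keyed caching, not DSPy's actual cache implementation; `fake_lm_call`, `cached_call`, and `timed` are hypothetical stand-ins, and the 50 ms sleep simulates a network round trip.

```python
import time

call_count = 0  # how many times the "paid API" was actually hit


def fake_lm_call(prompt):
    """Stand-in for a paid API call: slow, and counted."""
    global call_count
    call_count += 1
    time.sleep(0.05)
    return f"answer to: {prompt}"


_cache = {}


def cached_call(model, prompt, temperature=0.0):
    """Return a stored response when the same request was seen before."""
    key = (model, prompt, temperature)
    if key not in _cache:
        _cache[key] = fake_lm_call(prompt)
    return _cache[key]


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


first, first_secs = timed(cached_call, "openai/gpt-4o-mini", "What is DSPy?")
second, second_secs = timed(cached_call, "openai/gpt-4o-mini", "What is DSPy?")
print(f"API calls: {call_count}, first: {first_secs:.3f}s, second: {second_secs:.6f}s")
```

The same pattern applies to a real program: identical inputs should return much faster the second time, and any change to the prompt or model settings produces a new cache key and a new paid call.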
Not every step in your pipeline needs the expensive model. Use dspy.context or set_lm to assign cheaper models to simpler steps:
expensive_lm = dspy.LM("openai/gpt-4o") # or "anthropic/claude-sonnet-4-5-20250929", etc.
cheap_lm = dspy.LM("openai/gpt-4o-mini") # or "anthropic/claude-haiku-4-5-20251001", etc.
dspy.configure(lm=expensive_lm) # default

class MyPipeline(dspy.Module):
    def __init__(self):
        self.classify = dspy.ChainOfThought(ClassifySignature)
        self.generate = dspy.ChainOfThought(GenerateSignature)

    def forward(self, text):
        # Use cheap model for simple classification
        with dspy.context(lm=cheap_lm):
            category = self.classify(text=text)
        # Use expensive model only for complex generation
        return self.generate(text=text, category=category.label)

# Set LM on specific modules permanently (set_lm applies it to the module's predictors)
my_program.classify.set_lm(cheap_lm)
my_program.generate.set_lm(expensive_lm)
Instead of sending everything to the expensive model, classify inputs by difficulty and route accordingly. This is the pattern behind FrugalGPT (up to 90% cost savings matching GPT-4 quality):
from typing import Literal

class ComplexityRouter(dspy.Module):
    def __init__(self):
        self.assess = dspy.Predict(AssessComplexity)
        self.simple_handler = dspy.Predict(AnswerQuestion)
        self.complex_handler = dspy.ChainOfThought(AnswerQuestion)

    def forward(self, question):
        # Use the cheap model to decide complexity
        with dspy.context(lm=cheap_lm):
            assessment = self.assess(question=question)
        # Route to the right model
        if assessment.complexity == "simple":
            with dspy.context(lm=cheap_lm):
                return self.simple_handler(question=question)
        else:
            with dspy.context(lm=expensive_lm):
                return self.complex_handler(question=question)

class AssessComplexity(dspy.Signature):
    """Assess if this question needs a powerful model or a simple one can handle it."""
    question: str = dspy.InputField()
    complexity: Literal["simple", "complex"] = dspy.OutputField(
        desc="simple = factual/straightforward, complex = reasoning/nuanced"
    )
class CascadingPipeline(dspy.Module):
    def __init__(self):
        self.answer = dspy.ChainOfThought(AnswerQuestion)
        self.verify = dspy.Predict(CheckConfidence)

    def forward(self, question):
        # Try cheap model first
        with dspy.context(lm=cheap_lm):
            result = self.answer(question=question)
            check = self.verify(question=question, answer=result.answer)
        # If cheap model isn't confident, escalate to expensive
        if not check.is_confident:
            with dspy.context(lm=expensive_lm):
                result = self.answer(question=question)
        return result

class CheckConfidence(dspy.Signature):
    """Is this answer confident and complete, or should we escalate to a better model?"""
    question: str = dspy.InputField()
    answer: str = dspy.InputField()
    is_confident: bool = dspy.OutputField()
Typical savings: 50-90% cost reduction. Most real-world traffic is simple questions that a cheap model handles fine.
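How much routing saves depends almost entirely on the share of easy traffic. The sketch below estimates it under simplifying assumptions: the router call always runs on the cheap model, router and answer prompts are the same length, and the prices are the approximate per-million input prices quoted earlier. `routed_cost` and `savings` are hypothetical helpers.

```python
def routed_cost(simple_fraction, cheap_price, expensive_price):
    """Expected price per million input tokens under complexity routing.

    Every request pays the cheap model once for the routing decision, then
    pays either the cheap or the expensive model for the answer.
    """
    router = cheap_price
    answer = simple_fraction * cheap_price + (1 - simple_fraction) * expensive_price
    return router + answer


def savings(simple_fraction, cheap_price=0.15, expensive_price=5.00):
    """Fraction of spend saved versus sending everything to the expensive model."""
    return 1 - routed_cost(simple_fraction, cheap_price, expensive_price) / expensive_price


mostly_simple = savings(0.8)   # 80% of traffic is easy
mostly_complex = savings(0.1)  # only 10% of traffic is easy

print(f"80% simple: {mostly_simple:.1%} saved, 10% simple: {mostly_complex:.1%} saved")
```

Under these assumed prices, 80% easy traffic works out to roughly 75% savings, while 10% easy traffic saves under 7%, which is why it is worth measuring your traffic mix before building a router.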
Long prompts = more tokens = more cost.
# Fewer demos = shorter prompts = lower cost
optimizer = dspy.BootstrapFewShot(
    metric=metric,
    max_bootstrapped_demos=2, # down from 4
    max_labeled_demos=2, # down from 4
)

# Fewer passages = shorter context
class DocSearch(dspy.Module):
    def __init__(self):
        self.retrieve = dspy.Retrieve(k=2) # down from 5
        self.answer = dspy.ChainOfThought(AnswerSignature)

# Verbose: costs more tokens
class Verbose(dspy.Signature):
    """Given the following text, carefully analyze the content and provide a detailed classification."""
    text: str = dspy.InputField(desc="The full text content to be analyzed and classified")
    label: str = dspy.OutputField(desc="The classification label for this text")

# Concise: same quality, fewer tokens
class Concise(dspy.Signature):
    """Classify the text."""
    text: str = dspy.InputField()
    label: str = dspy.OutputField()
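To get a feel for how these trims add up, the following sketch totals the pieces of a prompt before and after. All token sizes are made-up placeholders (measure your own); the demo and passage counts mirror the changes above (8 demos down to 4, 5 passages down to 2). `estimated_prompt_tokens` is a hypothetical helper.

```python
def estimated_prompt_tokens(instruction_tokens, demos, tokens_per_demo,
                            passages, tokens_per_passage, input_tokens):
    """Rough prompt size: instructions + few-shot demos + retrieved passages + input."""
    return (instruction_tokens
            + demos * tokens_per_demo
            + passages * tokens_per_passage
            + input_tokens)


# Hypothetical sizes: 150 tokens per demo, 200 per passage, 100 for the user input
before = estimated_prompt_tokens(60, demos=8, tokens_per_demo=150,
                                 passages=5, tokens_per_passage=200, input_tokens=100)
after = estimated_prompt_tokens(15, demos=4, tokens_per_demo=150,
                                passages=2, tokens_per_passage=200, input_tokens=100)
reduction = 1 - after / before

print(f"before: {before} tokens, after: {after} tokens, reduction: {reduction:.0%}")
```

In this made-up case demos and passages dominate the prompt, so trimming them matters far more than shortening the signature docstring.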
The biggest cost saver: train a small cheap model to do what the expensive model does. Distill from an expensive teacher to a cheap student:
# Build and optimize with the expensive model, then fine-tune a cheap one
optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
finetuned = optimizer.compile(my_program, trainset=trainset, teacher=teacher_optimized)
Requirements: 500+ training examples, a fine-tunable model. Typical savings: 10-50x cost reduction with 85-95% quality retention.
For the complete model distillation workflow (decision framework, prerequisites, BetterTogether, troubleshooting), see /ai-fine-tuning.
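Whether fine-tuning pays off depends on how many calls you serve before the model needs retraining. A minimal break-even sketch, with all dollar figures invented for illustration (a 20x per-call reduction, inside the range quoted above); `breakeven_calls` is a hypothetical helper:

```python
import math


def breakeven_calls(one_time_cost_usd, teacher_cost_per_call_usd, student_cost_per_call_usd):
    """Calls needed before per-call savings repay the one-time fine-tuning cost.

    Returns None when the student is not cheaper per call.
    """
    saving_per_call = teacher_cost_per_call_usd - student_cost_per_call_usd
    if saving_per_call <= 0:
        return None
    return math.ceil(one_time_cost_usd / saving_per_call)


# Hypothetical: $200 to build data and fine-tune,
# $0.010 per call on the teacher vs $0.0005 per call on the student
calls_needed = breakeven_calls(200.0, 0.010, 0.0005)
print(f"Break-even after {calls_needed} calls")
```

If you expect to retrain before reaching that call volume, the per-call savings may never cover the fine-tuning cost.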
Use Predict instead of ChainOfThought where possible
ChainOfThought adds a reasoning step, which uses extra tokens. For simple tasks, Predict may be sufficient:
# ChainOfThought — more tokens, better for complex tasks
classifier = dspy.ChainOfThought(ClassifySignature)
# Predict — fewer tokens, fine for simple tasks
classifier = dspy.Predict(ClassifySignature)
Test with /ai-improving-accuracy to make sure quality doesn't drop.
When running prompt optimization (especially with GEPA or MIPROv2), monitor for score plateaus. Stopping early when the optimizer saturates can save 30-40% of optimization compute. See /dspy-gepa for saturation diagnosis details.
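A plateau check can be as simple as comparing the best recent score with the best earlier one. The sketch below is a generic heuristic, not part of the GEPA or MIPROv2 API; `has_plateaued`, `patience`, and `min_delta` are hypothetical names, and how you collect per-round scores depends on your optimizer and logging setup.

```python
def has_plateaued(scores, patience=3, min_delta=0.005):
    """True if the best score in the last `patience` evaluations improved
    on the best earlier score by less than `min_delta`."""
    if len(scores) <= patience:
        return False  # not enough history to judge
    best_before = max(scores[:-patience])
    best_recent = max(scores[-patience:])
    return best_recent - best_before < min_delta


improving = [0.52, 0.58, 0.63, 0.67, 0.70]
saturated = [0.52, 0.61, 0.68, 0.70, 0.70, 0.701, 0.699]

print(has_plateaued(improving), has_plateaued(saturated))
```

Tune `patience` and `min_delta` to the noise level of your metric; a noisy metric on a small validation set needs a larger `patience` to avoid stopping too early.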
Use Predict instead of ChainOfThought for simple tasks.
See /ai-switching-models.
Do not use ChainOfThought for the complexity router itself. The router in Step 4 should use dspy.Predict, not dspy.ChainOfThought: adding reasoning to the routing step defeats the purpose of saving tokens on easy inputs.
Cutting max_bootstrapped_demos from 4 to 2 is fine; setting it to 0 removes all few-shot learning and quality collapses. Keep at least 1-2 demos.
Run dspy.evaluate before each change so you can attribute quality drops to the specific optimization that caused them.
With temperature > 0, cached results lock in one sample. Set temperature=0 for deterministic caching, or disable caching for calls where you want diversity.
Do not cut costs if you have not baselined quality first. Optimizing costs on a system that already underperforms just locks in bad results at a lower price. Fix accuracy first with /ai-improving-accuracy, then reduce costs.
Do not route to cheap models if your traffic is uniformly complex. The routing pattern (Step 4) saves money when most inputs are easy — if 90% of your inputs genuinely need the expensive model, routing adds latency and complexity for minimal savings.
Do not fine-tune to save money if your use case changes frequently. Fine-tuned models are frozen in time — if your categories, policies, or domain shift monthly, the retraining cost and lag outweigh the per-call savings. Use prompt optimization instead.
Install any skill:
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>
/ai-building-pipelines
/ai-improving-accuracy
/ai-fixing-errors
/ai-switching-models
/dspy-modules
/ai-fine-tuning
Install /ai-do if you do not have it. It routes any AI problem to the right skill and is the fastest way to work:
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do