skills/ai-watching-optimization/SKILL.md
See what is happening during optimizer.compile() instead of waiting blind. Use when you want to watch optimization progress, see scores as they come in, know if your optimizer is working, check if optimization is stuck, understand why optimization is taking too long, get live progress during compile, monitor convergence, detect overfitting during optimization, interpret optimization results, or pick the right tool for watching optimization. Also used for optimizer progress bar, is my optimizer doing anything, optimization seems stuck, how long will optimization take, watch GEPA run, watch MIPROv2 run, live optimization dashboard, optimizer not improving, scores not going up, optimization taking forever, see what optimizer is doing, debug slow optimization, optimization visibility, optimizer metrics, track compile progress, optimization observability.
npx skillsauth add lebsral/dspy-programming-not-prompting-lms-skills ai-watching-optimization
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Guide the user through monitoring their optimizer.compile() runs so they can see progress, catch problems early, and know when to stop.
You run optimizer.compile(program, trainset=trainset) and wait. Minutes pass. Sometimes hours. You have no idea if scores are improving, if the optimizer is stuck, or if you should stop and try something different.
This skill helps you pick the right monitoring approach and interpret what you see.
| Observable | Why it matters | Tools that show it |
|-----------|---------------|-------------------|
| Scores over time | Are they improving? Plateauing? Dropping? | All tools below |
| Instructions evolving | What is the optimizer changing in your prompts? | GEPA logger (prompt diffs), LangWatch (predictor states) |
| Cost accumulating | How much are you spending? Worth continuing? | LangWatch, MLflow |
| Convergence pattern | Should you stop early or keep going? | All tools below (scores over time) |
| LM calls | Is the optimizer calling the right model? | inspect_history, LangWatch, MLflow |
| Acceptance decisions | Why did the optimizer accept/reject a candidate? | GEPA logger (event log), BaseCallback |
track_stats=True (Option 1)
inspect_history() always works (Option 1)

| Tool | Optimizers | Setup | Dashboard | Local/Cloud |
|------|-----------|-------|-----------|-------------|
| Built-in (track_stats) | GEPA only | One flag | No (dict) | Local |
| Built-in (BaseCallback) | All | ~20 lines | No (console) | Local |
| inspect_history(n) | All (post-hoc) | Zero setup | No (console) | Local |
| dspy-gepa-logger | GEPA only | pip install | Yes (web) | Local |
| LangWatch | BFS/BRS/COPRO/MIPROv2 | pip install + API key | Yes (cloud) | Cloud |
| MLflow | All (via autolog) | pip install | Yes (local) | Local |
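As a rough illustration of how the comparison table can drive the choice, the sketch below encodes its "Optimizers" column as data and filters by optimizer name. The `TOOL_SUPPORT` list and the `tools_for` function are hypothetical helpers written for this example; they are not part of DSPy or of any tool listed above.

```python
# Hypothetical lookup built from the comparison table above.
# Each entry: (tool name, set of supported optimizers, or None meaning "all").
TOOL_SUPPORT = [
    ("track_stats", {"GEPA"}),
    ("BaseCallback", None),
    ("inspect_history", None),
    ("dspy-gepa-logger", {"GEPA"}),
    ("LangWatch", {"BootstrapFewShot", "BootstrapFewShotWithRandomSearch", "COPRO", "MIPROv2"}),
    ("MLflow", None),
]


def tools_for(optimizer_name):
    """Return the tools from the table that can watch the given optimizer."""
    return [
        tool
        for tool, supported in TOOL_SUPPORT
        if supported is None or optimizer_name in supported
    ]


print(tools_for("GEPA"))
print(tools_for("MIPROv2"))
```

The order of the result follows the table, so the simplest local options come first.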
Run dspy.Evaluate before AND after optimization. Without a baseline, you cannot tell if optimization helped.
from dspy.evaluate import Evaluate
evaluator = Evaluate(devset=devset, metric=your_metric, num_threads=8)
# Before optimization
baseline_score = evaluator(program)
print(f"Baseline: {baseline_score}")
# Run optimizer
optimized = optimizer.compile(program, trainset=trainset)
# After optimization
optimized_score = evaluator(optimized)
print(f"Optimized: {optimized_score}")
print(f"Improvement: {optimized_score - baseline_score:+.1f}")
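Depending on the installed DSPy release, calling an Evaluate instance may return a plain number or a result object that carries the number in a `score` attribute; check which one your version gives you. The sketch below is one way to make the before/after comparison work in both cases. The helper names `as_score` and `improvement` are hypothetical, and the values are stand-ins rather than real evaluation results.

```python
from types import SimpleNamespace


def as_score(result):
    """Return a float from either a bare number or an object exposing `.score`."""
    return float(getattr(result, "score", result))


def improvement(baseline, optimized):
    """Difference between the optimized and baseline scores, in the metric's own units."""
    return as_score(optimized) - as_score(baseline)


# Stand-ins for evaluator outputs so the sketch runs without DSPy:
baseline_result = 62.5                            # plain-number style result
optimized_result = SimpleNamespace(score=71.0)    # object-with-.score style result

print(f"Improvement: {improvement(baseline_result, optimized_result):+.1f}")  # Improvement: +8.5
```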
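GEPA supports a built-in stats flag that records detailed results for each candidate program it evaluates.
import dspy
optimizer = dspy.GEPA(
    metric=your_metric,
    auto="light",  # GEPA needs a budget: auto, max_full_evals, or max_metric_calls
    reflection_lm=reflection_lm,
    track_stats=True,  # Enable tracking
)
# The task LM is the one set with dspy.configure(lm=task_lm); dspy.GEPA takes no task_lm argument
optimized = optimizer.compile(program, trainset=trainset)
# Inspect results after compilation; they are attached to the returned program
stats = optimized.detailed_results
for idx, score in enumerate(stats.val_aggregate_scores):
    print(f"Candidate {idx}: score={score:.3f}")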
Write a callback that prints progress as the optimizer evaluates candidates.
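import dspy
from dspy.utils.callback import BaseCallback

class OptimizationProgressCallback(BaseCallback):
    def __init__(self):
        super().__init__()
        self.eval_count = 0

    def on_evaluate_end(self, call_id, outputs, exception):
        self.eval_count += 1
        # outputs is whatever Evaluate returned: a plain number in some DSPy
        # releases, an object with a .score attribute in others
        score = getattr(outputs, "score", outputs)
        if isinstance(score, (int, float)):
            print(f"[Eval {self.eval_count}] Score: {score:.3f}")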
# Register the callback
progress = OptimizationProgressCallback()
dspy.configure(callbacks=[progress])
# Now run your optimizer -- progress prints automatically
optimized = optimizer.compile(program, trainset=trainset)
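The callback above prints one score per evaluation. To also see the best score so far and a rough time estimate, the scores can be fed into a small tracker. The sketch below is plain Python with no DSPy dependency; the `ProgressTracker` class and its `expected_evals` argument are hypothetical, and the ETA is a simple linear extrapolation that assumes every evaluation takes about as long as the average so far.

```python
import time


class ProgressTracker:
    """Collects scores and formats a progress line with best-so-far and a rough ETA."""

    def __init__(self, expected_evals=None, clock=time.monotonic):
        # expected_evals is your own estimate of how many evaluations the run will do
        self.expected_evals = expected_evals
        self.clock = clock
        self.start = clock()
        self.scores = []

    def record(self, score):
        self.scores.append(float(score))
        done = len(self.scores)
        elapsed = self.clock() - self.start
        line = f"[Eval {done}] score={score:.3f} best={max(self.scores):.3f} elapsed={elapsed:.0f}s"
        if self.expected_evals:
            # Linear extrapolation: remaining evaluations times average time per evaluation
            remaining = max(self.expected_evals - done, 0) * (elapsed / done)
            line += f" eta={remaining:.0f}s"
        return line


# Demo with a fake clock so the output is deterministic
fake_times = iter([0.0, 10.0, 20.0])
tracker = ProgressTracker(expected_evals=4, clock=lambda: next(fake_times))
print(tracker.record(0.5))  # [Eval 1] score=0.500 best=0.500 elapsed=10s eta=30s
print(tracker.record(0.4))  # [Eval 2] score=0.400 best=0.500 elapsed=20s eta=20s
```

In a real run you would keep the default clock and call `record` from the callback's `on_evaluate_end`, printing the returned line.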
After optimization completes (or if you interrupt it), inspect recent LM calls.
# Show the last 5 LM calls
dspy.inspect_history(n=5)
This shows the full prompt and completion for each call, useful for verifying the optimizer is calling the correct model and seeing what instructions it tried.
A drop-in replacement for GEPA's internal logger that adds a web dashboard with real-time stats, eval tables, and prompt diffs.
pip install dspy-gepa-logger
import dspy
from dspy_gepa_logger import GEPALogger
# Create the logger (starts web dashboard on port 3000)
logger = GEPALogger()
optimizer = dspy.GEPA(
    metric=your_metric,
    auto="light",  # GEPA needs a budget: auto, max_full_evals, or max_metric_calls
    reflection_lm=reflection_lm,
)
# Register the logger as an observer
optimizer.register_observer(logger)
optimized = optimizer.compile(program, trainset=trainset)
The dashboard shows real-time stats, eval tables, and prompt diffs.
The web dashboard requires Node.js and uses SQLite for storage.
For detailed GEPA usage, see /dspy-gepa.
LangWatch patches DSPy optimizers to stream live progress to a cloud dashboard. This is the only tool that shows real-time optimizer progress for non-GEPA optimizers.
pip install langwatch
import langwatch
import dspy
optimizer = dspy.MIPROv2(metric=your_metric, auto="light")
# Pass the optimizer so LangWatch can patch it before compile() runs
langwatch.dspy.init(experiment="my-optimization-run", optimizer=optimizer)
optimized = optimizer.compile(program, trainset=trainset)
The LangWatch dashboard shows scores over time, predictor states, accumulated cost, and LM calls.
Supported optimizers: BootstrapFewShot, BootstrapFewShotWithRandomSearch, COPRO, MIPROv2.
Not supported: GEPA (use dspy-gepa-logger instead).
Requires a LangWatch API key. Free tier available at app.langwatch.ai.
For detailed setup, see /dspy-langwatch.
MLflow's DSPy autolog captures optimization as parent/child runs with traces.
pip install mlflow
import mlflow
import dspy
mlflow.dspy.autolog(log_compiles=True)
# Set up experiment
mlflow.set_experiment("my-optimization")
with mlflow.start_run():
    optimizer = dspy.MIPROv2(metric=your_metric, auto="light")
    optimized = optimizer.compile(program, trainset=trainset)
# View results
mlflow ui
# Open http://localhost:5000
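The MLflow dashboard shows scores over time, accumulated cost, and LM calls, with each optimization captured as parent/child runs with traces.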
Works with all optimizers via autolog.
For detailed setup, see /dspy-mlflow.
This is the most important section. Knowing which tool to use is step one. Knowing what the data means is what saves you hours.
If you have a separate validation set:
# Always evaluate on held-out data
eval_train = Evaluate(devset=trainset, metric=your_metric)
eval_val = Evaluate(devset=valset, metric=your_metric)
train_score = eval_train(optimized)
val_score = eval_val(optimized)
print(f"Train: {train_score}, Val: {val_score}")
if train_score - val_score > 10:
    print("Warning: possible overfitting")
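The same kind of threshold check can be applied to the convergence question: given the scores collected over time by any of the tools above, has the best score stopped moving? The sketch below is one simple way to decide. The function name `has_plateaued` and the default `patience` and `min_delta` values are hypothetical; scores are assumed to be on the same 0-100 scale as the overfitting check.

```python
def has_plateaued(scores, patience=3, min_delta=0.5):
    """True if the last `patience` scores did not beat the earlier best by at least `min_delta`."""
    if len(scores) <= patience:
        return False  # not enough history to judge
    best_before = max(scores[:-patience])
    best_recent = max(scores[-patience:])
    return best_recent - best_before < min_delta


# Illustrative score histories (made up for this example)
still_improving = [60.0, 62.0, 65.0, 68.0, 71.0, 74.0]
flat_tail = [60.0, 65.0, 70.0, 70.2, 70.1, 70.3]

print(has_plateaued(still_improving))  # False
print(has_plateaued(flat_tail))        # True
```

A larger `patience` makes the check more tolerant of noisy scores, at the cost of reacting later.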
auto="medium" or auto="heavy").
max_bootstrapped_demos or max_labeled_demos.
A common bug: the optimizer or your program is calling a different LM than you intended.
dspy.configure(lm=...) and that per-module LM assignments are correct
# Check what model was actually used
dspy.inspect_history(n=1)
# Look for the model identifier in the output
track_stats=True is GEPA-specific. It does not work with MIPROv2 or other optimizers. Use BaseCallback or an external tool for those.
inspect_history() is post-hoc, not live. It shows recent LM calls from memory. It does not stream progress during optimization.
Install any skill:
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>
/dspy-gepa
/dspy-miprov2
/dspy-langwatch
/dspy-mlflow
/dspy-phoenix
/ai-tracking-experiments
/ai-improving-accuracy
/ai-cutting-costs
/ai-do if you do not have it -- it routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do