skills/khattab-2023-dspy/SKILL.md
Declarative programming framework for optimizing LLM prompts through compilation and automatic tuning
npx skillsauth add curiositech/windags-skills khattab-2023-dspy
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Input: LLM pipeline with N modules, training data size T, target metric M
├── T < 50 examples AND N ≤ 2 modules
│   └── Use manual prompt engineering → DSPy overhead not justified
├── T ≥ 50 examples AND no labeled intermediate steps
│   ├── N ≤ 3 modules → Use BootstrapFewShot teleprompter
│   └── N > 3 modules → Use BootstrapFewShotWithRandomSearch
├── T ≥ 200 examples AND have some labeled intermediates
│   ├── Simple accuracy metric → Use MIPRO (optimizes instructions + demos)
│   └── Complex composite metric → Use BootstrapFinetune if budget allows
└── T ≥ 1000 examples AND production system
    └── Use ensemble of compiled programs → vote/rank outputs
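The optimizer tree above can be written as a small lookup function. This is only a sketch: `choose_optimizer` and its parameter names are hypothetical, the branches are checked from the largest data tier down (an assumption, since the tiers overlap), and combinations the tree does not cover return "undetermined".

```python
def choose_optimizer(n_modules: int, n_train: int,
                     labeled_intermediates: bool = False,
                     composite_metric: bool = False,
                     production: bool = False) -> str:
    """Return the optimizer suggested by the tree, checking larger data tiers first."""
    if n_train >= 1000 and production:
        return "ensemble of compiled programs"
    if n_train >= 200 and labeled_intermediates:
        return "BootstrapFinetune" if composite_metric else "MIPRO"
    if n_train >= 50 and not labeled_intermediates:
        if n_modules <= 3:
            return "BootstrapFewShot"
        return "BootstrapFewShotWithRandomSearch"
    if n_train < 50 and n_modules <= 2:
        return "manual prompt engineering"
    # The tree leaves some combinations open (e.g. few examples, many modules).
    return "undetermined"


print(choose_optimizer(n_modules=3, n_train=100))
print(choose_optimizer(n_modules=2, n_train=300, labeled_intermediates=True))
```

Returning a plain string keeps the sketch independent of any DSPy installation; mapping the string to an actual optimizer class is left to the caller.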
Need: Transform input X to output Y
├── Single reasoning step (X → Y)
│   └── Use Predict(signature="X -> Y")
├── Multi-step reasoning required
│   ├── Steps are sequential (A→B→C→Y) → Use ChainOfThought("X -> reasoning, Y")
│   ├── Steps need external tools → Use ReAct("X -> thought, action, observation, Y")
│   └── Need confidence estimation → Use majority vote ensemble
├── Retrieval + generation needed
│   ├── Simple QA → Use Retrieve(k=3) + Generate("context, question -> answer")
│   └── Complex reasoning over docs → Use ColBERTv2 + ChainOfThought
└── Multiple valid approaches exist
    └── Use ProgramOfThought for mathematical reasoning
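For the confidence-estimation branch, a majority vote over several sampled answers might look like the sketch below. It is plain Python and does not call DSPy; `majority_vote` is a hypothetical helper, and using the agreement ratio as a rough confidence proxy is an assumption rather than something the tree prescribes.

```python
from collections import Counter


def majority_vote(answers):
    """Return (winning answer, agreement ratio) after light normalization."""
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().lower() for a in answers]
    # On a tie, Counter.most_common keeps the answer that was seen first.
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)


answer, agreement = majority_vote(["Paris", "paris ", "Lyon"])
print(answer, round(agreement, 2))
```

A low agreement ratio can be used to flag an input for review or to route it to a stronger module.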
Field count in signature:
├── 1-2 fields → Start here, expand only if metrics plateau
├── 3-4 fields → Acceptable for complex tasks, monitor token usage
├── 5+ fields → Likely over-specified, consider decomposition
└── If adding fields doesn't improve validation metrics → Remove them
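The field-count guideline can be checked mechanically on a string signature. The sketch below assumes that "fields" means inputs plus outputs and that signatures use the `"a, b -> c"` string form; `signature_fields` and `field_count_advice` are hypothetical names.

```python
def signature_fields(signature: str):
    """Split an 'inputs -> outputs' signature string into two lists of field names."""
    inputs, arrow, outputs = signature.partition("->")
    if not arrow:
        raise ValueError(f"signature has no '->': {signature!r}")

    def split(side):
        return [name.strip() for name in side.split(",") if name.strip()]

    return split(inputs), split(outputs)


def field_count_advice(signature: str) -> str:
    """Map the total field count onto the tiers listed above."""
    inputs, outputs = signature_fields(signature)
    total = len(inputs) + len(outputs)
    if total <= 2:
        return "start here"
    if total <= 4:
        return "acceptable, monitor token usage"
    return "likely over-specified, consider decomposition"


print(field_count_advice("question -> answer"))
print(field_count_advice("document, style_guide, max_words -> key_findings, summary"))
```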
Symptom: High training accuracy, poor real-world performance
Detection: training_metric >> validation_metric OR users report "system gives perfect but useless answers"
Root cause: Metric optimizes for wrong objective (exact string match vs semantic correctness)
Fix: Redesign metric to capture actual user success criteria, recompile with composite metrics
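To illustrate the gap between exact string match and a more forgiving metric, the sketch below compares exact match with token-level F1. Token overlap is only a rough stand-in for semantic correctness, and the `threshold` value is an assumption. The `trace` argument follows a common DSPy convention: return a bool while bootstrapping, and a float during evaluation.

```python
import re
from collections import Counter
from types import SimpleNamespace


def _tokens(text):
    return re.findall(r"\w+", text.lower())


def exact_match(example, prediction, trace=None):
    return example.answer.strip() == prediction.answer.strip()


def token_f1(example, prediction, trace=None):
    """Harmonic mean of token precision and recall against the gold answer."""
    gold, pred = _tokens(example.answer), _tokens(prediction.answer)
    overlap = sum((Counter(gold) & Counter(pred)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def composite_metric(example, prediction, trace=None, threshold=0.5):
    score = token_f1(example, prediction)
    # Bootstrapping needs a pass/fail decision; evaluation can use the raw score.
    return score >= threshold if trace is not None else score


example = SimpleNamespace(answer="Marie Curie")
prediction = SimpleNamespace(answer="It was Marie Curie.")
print(exact_match(example, prediction))          # False
print(round(token_f1(example, prediction), 2))   # 0.67
```

Here the prediction is correct for a user but fails exact match, which is the kind of mismatch that pushes a compiler toward the wrong objective.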
Symptom: Compiler generates repetitive or nonsensical demonstrations
Detection: len(set(demonstrations)) < 0.3 * len(demonstrations) OR average_demo_length < 5 tokens
Root cause: Training set too narrow, causing trace filtering to select degenerate examples
Fix: Expand training diversity, lower trace filtering threshold, or add minimum quality constraints
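The detection rule above translates directly into a check over the demonstration strings. This sketch assumes demonstrations are available as plain strings and uses whitespace-separated words as a stand-in for tokens; `demos_look_degenerate` is a hypothetical name.

```python
def demos_look_degenerate(demos, min_unique_ratio=0.3, min_avg_tokens=5):
    """Flag demo sets that are mostly duplicates or made of very short strings."""
    if not demos:
        return True  # nothing was bootstrapped at all
    unique_ratio = len(set(demos)) / len(demos)
    avg_tokens = sum(len(d.split()) for d in demos) / len(demos)
    return unique_ratio < min_unique_ratio or avg_tokens < min_avg_tokens


print(demos_look_degenerate(["yes"] * 10))
print(demos_look_degenerate([
    "What is the capital of France? Paris",
    "Who wrote Hamlet? William Shakespeare wrote it",
]))
```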
Symptom: Modules produce malformed outputs, compilation time grows sharply
Detection: signature_fields > 5 OR parsing_errors > 20% OR compilation_time > 2x baseline
Root cause: Over-specified signatures make reliable generation harder
Fix: Decompose into simpler modules or merge related fields (reasoning_step_1, reasoning_step_2 → reasoning)
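The field-merging fix can be sketched as a string rewrite on the signature. `merge_numbered_fields` is a hypothetical helper. It assumes numbered variants follow a `name_1` or `name_step_1` pattern, so it would also shorten a field that legitimately ends in a number.

```python
import re


def merge_numbered_fields(signature: str) -> str:
    """Collapse fields like reasoning_step_1, reasoning_step_2 into one 'reasoning' field."""
    inputs, arrow, outputs = signature.partition("->")
    if not arrow:
        raise ValueError(f"signature has no '->': {signature!r}")

    def merge(side):
        merged = []
        for field in (name.strip() for name in side.split(",")):
            if not field:
                continue
            base = re.sub(r"(_step)?_\d+$", "", field)
            if base not in merged:  # keep first occurrence, preserve order
                merged.append(base)
        return ", ".join(merged)

    return f"{merge(inputs)} -> {merge(outputs)}"


print(merge_numbered_fields("question -> reasoning_step_1, reasoning_step_2, answer"))
# question -> reasoning, answer
```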
Symptom: Performance doesn't improve with compilation, prompts contain hard-coded instructions
Detection: literal prompt strings such as "Let's think step by step" appear in the source of module.forward() OR instructions are set manually in __init__
Root cause: Bypassing DSPy abstraction by injecting imperative prompts
Fix: Remove hard-coded strings, let compiler optimize instructions, trust the abstraction
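One way to automate this detection is to scan module source code for suspicious string literals. The sketch below uses Python's `ast` module on source text. The phrase list is an illustrative assumption and `find_hardcoded_prompts` is a hypothetical name.

```python
import ast

# Illustrative phrases only; extend with whatever your team tends to hard-code.
SUSPECT_PHRASES = (
    "let's think step by step",
    "you are a",
    "answer in the following format",
)


def find_hardcoded_prompts(source: str, phrases=SUSPECT_PHRASES):
    """Return sorted (line number, phrase) pairs for string literals containing a phrase."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        # String constants also cover the literal parts of f-strings.
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            lowered = node.value.lower()
            for phrase in phrases:
                if phrase in lowered:
                    hits.append((node.lineno, phrase))
    return sorted(hits)


sample = 'def forward(question):\n    return "Let\'s think step by step: " + question\n'
print(find_hardcoded_prompts(sample))
```

Scanning source text rather than live objects keeps the check usable in a pre-commit hook or CI step without importing the pipeline.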
Symptom: Perfect compiled performance that doesn't generalize to deployment
Detection: compiled_accuracy = 100% OR validation_set overlaps training_set
Root cause: Compiler overfit to training traces, no held-out validation
Fix: Create true validation split, use cross-validation during compilation, monitor test metrics
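A held-out split and an overlap check can be sketched in a few lines. This assumes examples are dicts with a `question` key; `split_train_val` and `leaked_keys` are hypothetical helpers, and the 20% validation fraction is an arbitrary default.

```python
import random


def split_train_val(examples, val_fraction=0.2, seed=0):
    """Shuffle deterministically and hold out a validation slice."""
    if not 0 < val_fraction < 1:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def leaked_keys(train, val, key=lambda ex: ex["question"].strip().lower()):
    """Return normalized keys that appear in both splits (should be empty)."""
    return {key(ex) for ex in train} & {key(ex) for ex in val}


examples = [{"question": f"q{i}", "answer": str(i)} for i in range(10)]
train, val = split_train_val(examples)
print(len(train), len(val), leaked_keys(train, val))
```

Running `leaked_keys` before compilation catches duplicated questions that a random split alone would not separate.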
Scenario: Build a system that answers questions requiring 2-3 reasoning hops over a knowledge base.
Initial Manual Approach (what novices do):
# Brittle manual prompting
prompt = f"Given context: {context}\nQuestion: {question}\nLet's think step by step:\n1. First, I need to..."
DSPy Expert Approach:
# 1. Define the module (its signature declares the interface, not the prompt)
import dspy
from dspy.teleprompt import BootstrapFewShot

class MultiHopQA(dspy.Module):
    def __init__(self):
        super().__init__()  # lets DSPy register the sub-modules
        self.retrieve = dspy.Retrieve(k=5)
        self.reason = dspy.ChainOfThought("context, question -> reasoning, answer")

    def forward(self, question):
        # dspy.Retrieve returns a Prediction; the retrieved text is in .passages
        contexts = self.retrieve(question).passages
        return self.reason(context=contexts, question=question)

# 2. Set up compilation
train_set = [...]  # 100 question/answer pairs

# Teleprompters call metric(example, prediction, trace), so accept the third argument
def metric(example, prediction, trace=None):
    return example.answer.lower() in prediction.answer.lower()

# 3. Compile: bootstrap demonstrations that pass the metric
teleprompter = BootstrapFewShot(metric=metric, max_bootstrapped_demos=8)
compiled_qa = teleprompter.compile(MultiHopQA(), trainset=train_set)
Key Decision Points Navigated:
What Expert Catches vs Novice Misses:
Scenario: Summarize technical papers with length/style constraints while preserving key findings.
DSPy Implementation with Trade-offs:
import dspy
# Assumption: a DSPy release that still exports MIPRO; newer releases ship MIPROv2.
from dspy.teleprompt import MIPRO

class ConstrainedSummarizer(dspy.Module):
    def __init__(self):
        super().__init__()
        # Trade-off: More fields = better control but harder optimization
        self.summarize = dspy.ChainOfThought(
            "document, style_guide, max_words -> key_findings, summary"
        )

    def forward(self, document, style_guide="academic", max_words=150):
        return self.summarize(
            document=document,
            style_guide=style_guide,
            max_words=max_words,
        )

# Composite metric balancing multiple objectives
# (trace is accepted because teleprompters pass it as a third argument)
def summary_quality(example, prediction, trace=None):
    word_count = len(prediction.summary.split())
    length_ok = word_count <= example.max_words * 1.1  # 10% tolerance
    # Trade-off: More sophisticated metrics = better quality but slower compilation
    key_present = any(finding in prediction.summary for finding in example.key_findings)
    return length_ok and key_present

# Compilation decision: MIPRO optimizes both demonstrations AND instructions
teleprompter = MIPRO(metric=summary_quality, num_candidates=10)
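As one illustration of the "more sophisticated metric" trade-off, a graded variant could score coverage of key findings and penalize overlong summaries instead of returning a single pass/fail. This is a sketch, not part of the original pipeline: `summary_score`, the length penalty, and the 0.5 bootstrapping threshold are all assumptions.

```python
from types import SimpleNamespace


def summary_score(example, prediction, trace=None, tolerance=0.1):
    """Score in [0, 1]: fraction of key findings covered, scaled by a length penalty."""
    words = prediction.summary.split()
    limit = example.max_words * (1 + tolerance)
    length_score = 1.0 if len(words) <= limit else limit / len(words)

    summary = prediction.summary.lower()
    findings = list(example.key_findings)
    covered = sum(finding.lower() in summary for finding in findings)
    coverage = covered / len(findings) if findings else 0.0

    score = length_score * coverage
    # Bool while bootstrapping (trace is set), float during evaluation.
    return score >= 0.5 if trace is not None else score


example = SimpleNamespace(max_words=10, key_findings=["latency drops", "accuracy holds"])
prediction = SimpleNamespace(summary="Latency drops by half while cost rises.")
print(summary_score(example, prediction))  # 0.5
```

A graded score gives the optimizer more signal than a boolean, at the cost of a metric that is harder to reason about and to keep stable across runs.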
Trade-offs Shown:
Validation checklist for DSPy pipeline deployment:
Do NOT use DSPy for:
Delegate to other skills:
model-fine-tuning skill instead
rag-architecture skill instead
inference-optimization skill instead
prompt-injection-defense skill instead
llm-evaluation-frameworks skill instead
DSPy is for: Multi-module pipelines, automatic optimization, model-agnostic systems, compositional reasoning chains, and when you need programs that improve from their own execution traces.