skills/dspy-parallel/SKILL.md
Use when you have independent LM calls that can run concurrently — batch processing, fan-out patterns, or speeding up pipelines with no data dependencies between steps. Common scenarios - processing a batch of inputs through a DSPy module concurrently, fan-out patterns where multiple independent LM calls run at once, speeding up evaluation by parallelizing predictions, or reducing wall-clock time for pipelines with no data dependencies. Related - ai-building-pipelines, ai-serving-apis. Also used for dspy.Parallel, concurrent LM calls, batch processing in DSPy, parallel DSPy execution, speed up DSPy pipeline, fan-out LM calls, concurrent predictions, parallelize evaluation, async DSPy calls, reduce latency with parallel execution, batch inference DSPy, process multiple inputs at once, throughput optimization, run DSPy modules concurrently, parallel map over inputs.
Guide the user through using DSPy's Parallel module to execute multiple LM calls concurrently. dspy.Parallel is the built-in way to speed up batch processing and fan-out patterns without writing threading code yourself.
dspy.Parallel takes a list of (module, inputs) pairs and executes them concurrently using a thread pool. It handles threading, progress bars, error limits, and timeouts so you don't have to.
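Use it when you have independent LM calls: a batch of inputs to run through one module, a fan-out of several independent calls on the same input, or evaluation and pipeline steps with no data dependencies between them.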
If call B depends on the result of call A, those two calls must be sequential. Everything else can be parallel.
Pass a list of (module, inputs) pairs. Each pair is one unit of work:
import dspy
lm = dspy.LM("openai/gpt-4o-mini") # or any LiteLLM-supported provider
dspy.configure(lm=lm)
# A module to run on every input
classify = dspy.Predict("text -> label: str")
# A batch of inputs
texts = [
    "I love this product!",
    "Terrible experience, want a refund.",
    "It's okay, nothing special.",
    "Best purchase I've made this year.",
]
# Build execution pairs: (module, inputs_dict)
exec_pairs = [(classify, {"text": t}) for t in texts]
# Run them all in parallel
parallel = dspy.Parallel(num_threads=4)
results = parallel(exec_pairs)
for text, result in zip(texts, results):
    print(f"{text[:30]:30s} -> {result.label}")
results is a list in the same order as exec_pairs, so results[i] corresponds to exec_pairs[i].
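The ordering guarantee can be illustrated with a plain thread pool. This is a standard-library sketch of the general idea, not DSPy's implementation: `fake_call` and `tracked` are hypothetical stand-ins for an LM call, arranged so that later inputs finish first.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_call(item):
    """Hypothetical stand-in for an LM call; earlier inputs take longer."""
    index, text = item
    time.sleep(0.05 * (3 - index))  # index 0 sleeps longest
    return text.upper()

inputs = list(enumerate(["a", "b", "c"]))
completion_order = []  # records the order in which tasks actually finish

def tracked(item):
    out = fake_call(item)
    completion_order.append(item[0])
    return out

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() yields results in input order, whatever the completion order was
    results = list(pool.map(tracked, inputs))

print(results)
print(completion_order)
```

Because results line up with inputs by position, `zip(inputs, results)` is safe even when tasks complete out of order.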
dspy.Parallel(
    num_threads=4,                 # number of concurrent threads (default: settings.num_threads)
    max_errors=5,                  # stop after this many failures (default: settings.max_errors)
    return_failed_examples=False,  # if True, return failures separately instead of raising
    provide_traceback=False,       # include tracebacks in error output
    disable_progress_bar=False,    # suppress the tqdm progress bar
    timeout=120,                   # max seconds per task before timeout
)
| Parameter | Type | Default | Purpose |
|-----------|------|---------|---------|
| num_threads | int \| None | None | Number of concurrent threads. Falls back to dspy.settings.num_threads. |
| max_errors | int \| None | None | Stop execution after this many errors. Falls back to dspy.settings.max_errors. |
| access_examples | bool | True | Unpack Example objects via .inputs(). Set False to pass raw Examples. |
| return_failed_examples | bool | False | When True, return failed examples separately instead of raising. |
| provide_traceback | bool \| None | None | Include Python tracebacks for failed examples. |
| disable_progress_bar | bool | False | Suppress the progress bar. |
| timeout | int | 120 | Max seconds per individual task. |
| straggler_limit | int | 3 | Threshold for flagging slow-running tasks. |
Start with a thread count that matches your rate limits, not your CPU cores. LM calls are I/O-bound (waiting on HTTP responses), so you can safely use many threads:
# Conservative -- good starting point
parallel = dspy.Parallel(num_threads=4)
# Aggressive -- if your provider allows high concurrency
parallel = dspy.Parallel(num_threads=16)
# Match your provider's rate limit
# e.g., 60 requests/min = ~1/sec; if each call takes ~4-8 s, 4-8 threads keep the pipeline full
parallel = dspy.Parallel(num_threads=8)
If you hit rate-limit errors (HTTP 429), reduce num_threads or add retry logic in your LM configuration.
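One rough way to pick a starting thread count is Little's law: the concurrency needed to sustain a request rate is the rate multiplied by the average latency. The helper below is a hypothetical sketch, not part of DSPy, and it assumes you know your provider's requests-per-minute limit and a typical per-call latency.

```python
import math

def suggested_threads(requests_per_minute: float, avg_latency_s: float) -> int:
    """Concurrency that roughly saturates a rate limit: rate (req/s) * latency (s)."""
    per_second = requests_per_minute / 60.0
    return max(1, math.floor(per_second * avg_latency_s))

print(suggested_threads(60, 1.0))   # 1
print(suggested_threads(60, 6.0))   # 6
print(suggested_threads(600, 1.5))  # 15
```

Treat the result as an upper bound to start from, then lower it if HTTP 429 responses appear.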
Parallel accepts inputs as dictionaries, dspy.Example objects, or tuples:
module = dspy.Predict("question -> answer")
# Dict inputs (most common)
pairs = [(module, {"question": "What is DSPy?"})]
# dspy.Example inputs
example = dspy.Example(question="What is DSPy?").with_inputs("question")
pairs = [(module, example)]
# Both work the same way
parallel = dspy.Parallel(num_threads=2)
results = parallel(pairs)
Results come back as a list. Aggregate however your application needs:
import dspy
classify = dspy.Predict("text -> label: str, confidence: float")
texts = ["Great!", "Terrible.", "Meh.", "Amazing!", "Awful."]
parallel = dspy.Parallel(num_threads=4)
results = parallel([(classify, {"text": t}) for t in texts])
# Count labels
from collections import Counter
label_counts = Counter(r.label for r in results)
print(label_counts)  # e.g. Counter({'positive': 2, 'negative': 2, 'neutral': 1}); actual labels depend on the model
# Filter by confidence
high_confidence = [
    (text, r.label)
    for text, r in zip(texts, results)
    if r.confidence > 0.8
]
# Build a summary dict
output = [
    {"text": t, "label": r.label, "confidence": r.confidence}
    for t, r in zip(texts, results)
]
By default, Parallel raises an exception after max_errors failures. To handle errors gracefully, use return_failed_examples=True:
parallel = dspy.Parallel(
    num_threads=4,
    max_errors=10,
    return_failed_examples=True,
    provide_traceback=True,
)
results, failed_examples, exceptions = parallel(exec_pairs)
When return_failed_examples=True, the return value is a 3-tuple:
- results -- list of successful predictions (same length as successes)
- failed_examples -- list of (module, inputs) pairs that failed
- exceptions -- list of exceptions corresponding to each failure

Handle failures after the batch completes:
results, failed, errors = parallel(exec_pairs)
print(f"Succeeded: {len(results)}, Failed: {len(failed)}")
# Retry failures with a fallback module
if failed:
    fallback = dspy.ChainOfThought("text -> label: str")
    retry_pairs = [(fallback, inputs) for _, inputs in failed]
    retry_results = parallel(retry_pairs)
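The collect-then-retry pattern can be tried without an LM using only the standard library. The sketch below mirrors the 3-tuple shape described above but is not DSPy's implementation: `run_batch`, `primary`, and `fallback` are hypothetical names introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(fn, inputs, num_threads=4):
    """Run fn over inputs concurrently; return (results, failed_inputs, exceptions)."""
    def safe(x):
        try:
            return ("ok", fn(x))
        except Exception as exc:  # record the failure and keep going
            return ("err", exc)

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        outcomes = list(pool.map(safe, inputs))

    results, failed, errors = [], [], []
    for x, (status, value) in zip(inputs, outcomes):
        if status == "ok":
            results.append(value)
        else:
            failed.append(x)
            errors.append(value)
    return results, failed, errors

def primary(text):
    """Hypothetical primary step that rejects some inputs."""
    if "!" in text:
        raise ValueError(f"cannot handle {text!r}")
    return text.lower()

def fallback(text):
    """Hypothetical, more tolerant fallback step."""
    return text.lower().rstrip("!")

results, failed, errors = run_batch(primary, ["Good", "Bad!", "Meh"])
retried, still_failed, _ = run_batch(fallback, failed)
print(results, failed, retried)
```

Retrying only the failed inputs keeps the second pass small, and anything left in `still_failed` can be logged or escalated.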
Use max_errors to fail fast when too many calls are failing (e.g., provider outage):
# Stop the whole batch once 5 calls have failed
parallel = dspy.Parallel(num_threads=4, max_errors=5)
try:
    results = parallel(exec_pairs)
except Exception as e:
    print(f"Batch aborted: {e}")
The timeout parameter sets a per-task time limit in seconds. Tasks that exceed this are terminated:
# Give each task up to 60 seconds
parallel = dspy.Parallel(num_threads=4, timeout=60)
Each pair can use a different module. This is useful for fan-out patterns where you run multiple analyses on the same input:
import dspy
sentiment = dspy.Predict("text -> sentiment: str")
topics = dspy.Predict("text -> topics: list[str]")
summary = dspy.ChainOfThought("text -> summary: str")
text = "DSPy is a framework for programming language models..."
# Fan out: three different modules, same input
exec_pairs = [
    (sentiment, {"text": text}),
    (topics, {"text": text}),
    (summary, {"text": text}),
]
parallel = dspy.Parallel(num_threads=3)
results = parallel(exec_pairs)
combined = {
    "sentiment": results[0].sentiment,
    "topics": results[1].topics,
    "summary": results[2].summary,
}
| Scenario | Use | Why |
|----------|-----|-----|
| Process 100+ items through the same module | Parallel | Massive speedup from concurrent HTTP requests |
| Run 3 independent analyses on one input | Parallel | All three calls happen at once |
| Pipeline where step 2 needs step 1's output | Sequential loop | There's a data dependency |
| Single LM call | Neither | No benefit from parallelism |
| Processing 2-3 items | Either works | Overhead is negligible either way |
# Slow: each call waits for the previous one to finish
results = []
for text in texts:
    result = classify(text=text)
    results.append(result)
# Fast: all calls run concurrently
parallel = dspy.Parallel(num_threads=8)
results = parallel([(classify, {"text": t}) for t in texts])
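For a batch of 100 items with ~1 second per LM call, the sequential loop takes about 100 seconds, while 8 threads finish in roughly 13 seconds (100 / 8, ignoring overhead and rate limits).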
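The sequential-versus-concurrent gap can be reproduced at small scale without calling an LM. This is a standard-library sketch: `fake_lm_call` is a hypothetical stand-in that sleeps to simulate network latency, and the batch is shrunk to 16 items at 0.05 s each so it runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_lm_call(text, latency_s=0.05):
    """Hypothetical stand-in for an I/O-bound LM call."""
    time.sleep(latency_s)
    return len(text)

texts = [f"item {i}" for i in range(16)]

# Sequential: total time is roughly items * latency
start = time.perf_counter()
sequential = [fake_lm_call(t) for t in texts]
sequential_s = time.perf_counter() - start

# Concurrent: total time is roughly (items / threads) * latency
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    concurrent = list(pool.map(fake_lm_call, texts))
parallel_s = time.perf_counter() - start

print(f"sequential: {sequential_s:.2f}s, 8 threads: {parallel_s:.2f}s")
```

The speedup comes from overlapping the waiting time, which is why it also applies to real LM calls until the provider's rate limit becomes the bottleneck.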
Wrap Parallel usage inside a dspy.Module for clean composition:
class BatchClassifier(dspy.Module):
    def __init__(self, num_threads=4):
        super().__init__()  # initialize dspy.Module before registering sub-modules
        self.classify = dspy.Predict("text -> label: str, confidence: float")
        self.num_threads = num_threads

    def forward(self, texts: list[str]):
        parallel = dspy.Parallel(num_threads=self.num_threads)
        exec_pairs = [(self.classify, {"text": t}) for t in texts]
        results = parallel(exec_pairs)
        return dspy.Prediction(
            labels=[r.label for r in results],
            confidences=[r.confidence for r in results],
        )
# Usage
classifier = BatchClassifier(num_threads=8)
result = classifier(texts=["Great!", "Terrible.", "Meh."])
print(result.labels)  # e.g. ["positive", "negative", "neutral"]; actual labels depend on the model
This keeps the parallelism as an implementation detail. Callers don't need to know about threading -- they just pass a list and get a list back.
- Writing a sequential for loop instead of using dspy.Parallel. When asked to process a batch of inputs, Claude defaults to a sequential loop. For any batch of 5+ independent LM calls, use dspy.Parallel; it is dramatically faster because LM calls are I/O-bound.
- Setting num_threads to match CPU cores. LM calls are network-bound (waiting on HTTP responses), not CPU-bound. Thread count should match your provider rate limit, not your CPU count. 8-16 threads is typical even on a 4-core machine.
- return_failed_examples=True changes the return type. Without it, parallel(pairs) returns a flat list. With it, it returns a 3-tuple (results, failed_examples, exceptions). Destructure accordingly or the code will break.
- Nesting Parallel(num_threads=3) inside an outer Parallel(num_threads=4) creates up to 12 concurrent LM calls. This can exceed provider rate limits. Calculate the total: outer_threads * inner_threads.
- Using dspy.Parallel for 1-2 items. The threading overhead is not worth it for fewer than ~5 items. Just call the module directly.

Install any skill:
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>
- /dspy-modules
- /ai-building-pipelines
- num_threads -- see /dspy-evaluate
- /ai-do if you do not have it -- it routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do