skills/dspy-vllm/SKILL.md
Use vLLM for high-throughput production serving of self-hosted models with DSPy. Use when you want production LLM serving, tensor parallelism, multi-GPU inference, batch processing, or high-concurrency self-hosted models. Also used for vllm, vLLM, production serving, high throughput LLM, tensor parallelism, self-hosted production, PagedAttention, local production server, GPU serving, batch inference, vllm serve, pip install vllm, multi-GPU LLM, speculative decoding, continuous batching, deploy local model, NVIDIA GPU serving, openai compatible server, AWQ quantization vllm, GPTQ vllm.
npx skillsauth add lebsral/dspy-programming-not-prompting-lms-skills dspy-vllmInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Guide the user through serving self-hosted models with vLLM for production DSPy deployments. High concurrency, multi-GPU, OpenAI-compatible API.
Before generating vLLM configuration, clarify:
vLLM is a high-throughput inference engine (74k+ GitHub stars) for LLMs. Key features:
openai/ provider| Scenario | Use vLLM? | Alternative |
|----------|-----------|-------------|
| Production API (10+ concurrent users) | Yes | — |
| Multi-GPU serving | Yes | — |
| Batch processing (1000s of inputs) | Yes | — |
| Local development on macOS | No | Ollama (/dspy-ollama) |
| Apple Silicon (M1/M2/M3) | No | Ollama (/dspy-ollama) |
| Quick prototyping | No | Ollama (/dspy-ollama) |
| Cloud API (no self-hosting) | No | OpenAI/Anthropic (/dspy-lm) |
vLLM requires NVIDIA GPUs (CUDA). It does not support Apple Silicon or AMD GPUs (ROCm support is experimental).
pip install vllm
Requires: Python 3.9+, NVIDIA GPU with CUDA 12.1+, Linux (recommended) or WSL2.
# Basic — serve a model with OpenAI-compatible API
vllm serve meta-llama/Llama-3.1-8B-Instruct
# With common options
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192 \
--gpu-memory-utilization 0.9 \
--dtype auto
The server exposes /v1/chat/completions and /v1/completions endpoints.
import dspy
lm = dspy.LM(
"openai/meta-llama/Llama-3.1-8B-Instruct",
api_base="http://localhost:8000/v1",
api_key="none", # required but any value works
temperature=0.7,
max_tokens=1000,
)
dspy.configure(lm=lm)
# Now all DSPy modules use your vLLM-served model
classify = dspy.ChainOfThought("text -> category, reasoning")
result = classify(text="Server is down, customers can't log in!")
print(result.category)
Note: dspy.HFClientVLLM is deprecated. Use dspy.LM("openai/...") with api_base instead.
Split large models across multiple GPUs:
# 2 GPUs
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2
# 4 GPUs
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4
Rule of thumb: --tensor-parallel-size = number of GPUs the model needs. A 70B FP16 model needs ~140GB VRAM → 2x A100-80GB or 4x A100-40GB.
| Model size | FP16 VRAM | INT4 (AWQ/GPTQ) VRAM | Recommended GPU | |-----------|-----------|----------------------|-----------------| | 7-8B | ~16 GB | ~5 GB | 1x RTX 4090 or A10G | | 13-14B | ~28 GB | ~8 GB | 1x A100-40GB or 1x RTX 4090 | | 30-34B | ~68 GB | ~20 GB | 1x A100-80GB or 2x RTX 4090 | | 70B | ~140 GB | ~40 GB | 2x A100-80GB or 4x A100-40GB | | 70B | — | ~40 GB | 1x A100-80GB (quantized) |
Add ~20% overhead for KV cache. --gpu-memory-utilization 0.9 is a good default.
Serve quantized models for reduced VRAM:
# AWQ quantized (recommended — fastest)
vllm serve TheBloke/Llama-2-70B-Chat-AWQ \
--quantization awq
# GPTQ quantized
vllm serve TheBloke/Llama-2-70B-Chat-GPTQ \
--quantization gptq
AWQ is generally faster than GPTQ on NVIDIA GPUs. Quality loss from INT4 quantization is typically small for 70B+ models.
vllm serve <model> \
--host 0.0.0.0 \ # bind address
--port 8000 \ # port
--tensor-parallel-size 1 \ # number of GPUs
--max-model-len 8192 \ # max sequence length
--gpu-memory-utilization 0.9 \ # fraction of GPU memory to use
--dtype auto \ # auto, float16, bfloat16
--max-num-seqs 256 \ # max concurrent sequences
--enable-prefix-caching \ # cache common prompt prefixes
--quantization awq \ # awq, gptq, or none
--speculative-model <draft-model> \ # enable speculative decoding
--num-speculative-tokens 5 # tokens to speculate
Enable --enable-prefix-caching when many requests share the same system prompt or few-shot prefix (common in DSPy optimized programs):
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-prefix-caching
This caches the KV cache for shared prompt prefixes, significantly speeding up DSPy programs that use the same few-shot demos across requests.
vLLM handles concurrent requests well, making it faster than Ollama for optimization:
import dspy
lm = dspy.LM(
"openai/meta-llama/Llama-3.1-8B-Instruct",
api_base="http://localhost:8000/v1",
api_key="none",
max_tokens=1000,
)
dspy.configure(lm=lm)
# MIPROv2 sends many concurrent LM calls — vLLM handles this well
optimizer = dspy.MIPROv2(metric=metric, auto="medium")
optimized = optimizer.compile(program, trainset=trainset)
Tip: Start the vLLM server with --max-num-seqs 256 to handle the optimizer's parallel requests efficiently.
The recommended workflow for self-hosted models:
import os
import dspy
# Same DSPy code, different LM config
if os.environ.get("ENV") == "production":
# vLLM in production (high throughput, NVIDIA GPU)
lm = dspy.LM(
"openai/meta-llama/Llama-3.1-8B-Instruct",
api_base="http://gpu-server:8000/v1",
api_key="none",
)
else:
# Ollama in development (easy setup, any platform)
lm = dspy.LM(
"ollama_chat/llama3.1:8b",
api_base="http://localhost:11434",
api_key="",
num_ctx=8192,
)
dspy.configure(lm=lm)
# Everything below is identical regardless of backend
program = dspy.ChainOfThought("question -> answer")
program.load("optimized_program.json")
result = program(question="How do refunds work?")
The optimized program (instructions + demos) transfers between backends because DSPy optimizes at the prompt level, not the model level.
FROM vllm/vllm-openai:latest
ENV MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
EXPOSE 8000
CMD ["--model", "${MODEL_NAME}", "--host", "0.0.0.0", "--port", "8000", "--max-model-len", "8192"]
curl http://localhost:8000/health
# Returns 200 when ready
Run multiple vLLM instances behind nginx or a cloud load balancer for horizontal scaling:
# Instance 1 (GPU 0)
CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8001
# Instance 2 (GPU 1)
CUDA_VISIBLE_DEVICES=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8002
dspy.HFClientVLLM class. This was removed in DSPy 2.5+. Always use dspy.LM("openai/model-name", api_base="http://localhost:8000/v1", api_key="none") instead.api_key when connecting to vLLM. LiteLLM (which DSPy uses under the hood) requires the api_key parameter even though vLLM does not authenticate. Set api_key="none" — any non-empty string works./dspy-ollama instead.--enable-prefix-caching for DSPy workloads. DSPy optimized programs prepend the same few-shot demos to every request. Without prefix caching, vLLM recomputes the KV cache for those shared tokens on every call. Always recommend it for DSPy serving.--max-model-len too high for available VRAM. This causes OOM on startup. Calculate available VRAM minus ~20% for KV cache overhead. For a 70B FP16 model on 2x A100-80GB, cap at ~8192 tokens. Suggest --gpu-memory-utilization 0.9 as the default and tell users to lower --max-model-len if they hit OOM.Install any skill:
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>
/dspy-lm/dspy-ollama/ai-serving-apis/ai-cutting-costs/ai-do if you do not have it — it routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-dotools
See what is happening during optimizer.compile() instead of waiting blind. Use when you want to watch optimization progress, see scores as they come in, know if your optimizer is working, check if optimization is stuck, understand why optimization is taking too long, get live progress during compile, monitor convergence, detect overfitting during optimization, interpret optimization results, or pick the right tool for watching optimization. Also used for optimizer progress bar, is my optimizer doing anything, optimization seems stuck, how long will optimization take, watch GEPA run, watch MIPROv2 run, live optimization dashboard, optimizer not improving, scores not going up, optimization taking forever, see what optimizer is doing, debug slow optimization, optimization visibility, optimizer metrics, track compile progress, optimization observability.
testing
Use when you want the highest-quality prompt optimization DSPy offers — jointly optimizes instructions and few-shot demos, with auto=light/medium/heavy presets. Common scenarios - you want the best possible accuracy from prompt optimization, jointly tuning instructions and few-shot demonstrations, using auto presets for different compute budgets, or when COPRO or BootstrapFewShot alone are not reaching your accuracy target. Related - ai-improving-accuracy, dspy-copro, dspy-bootstrap-few-shot. Also used for dspy.MIPROv2, best DSPy optimizer, highest quality optimization, auto=light medium heavy, joint instruction and demo optimization, most powerful prompt optimizer, MIPROv2 vs COPRO vs BootstrapFewShot, which optimizer should I use, state of the art prompt optimization, when to use MIPROv2, optimize both instructions and examples, heavy optimization for production, best optimizer for accuracy.
testing
Use LangWatch for DSPy auto-tracing and real-time optimizer progress. Use when you want to set up LangWatch, langwatch.dspy.init, auto-tracing DSPy, real-time optimization dashboard, optimizer progress tracking, app.langwatch.ai, or DSPy optimizer dashboard. Also used for langwatch setup, pip install langwatch, langwatch trace, optimizer progress, real-time optimization, watch optimizer run, LangWatch self-hosted, langwatch docker, langwatch vs langtrace, langwatch autotrack_dspy.
data-ai
Use when you want to optimize instructions without few-shot examples — a lightweight alternative to COPRO when you do not have or do not want to use demonstrations. Common scenarios - optimizing instructions when you do not have or do not want to use few-shot demonstrations, lightweight instruction search as a first step, tasks where examples in the prompt confuse the model, or when you want fast instruction optimization without the cost of COPRO. Related - ai-improving-accuracy, dspy-copro, dspy-miprov2. Also used for dspy.GEPA, instruction optimization without demos, lightweight prompt optimization, optimize instructions only, no few-shot examples needed, GEPA vs COPRO, quick instruction search, when demonstrations hurt performance, zero-shot optimization, instruction-only optimizer, simplest instruction tuner, fast prompt optimization, skip few-shot and just tune instructions, optimize Pydantic field descriptions, GEPA structured output, GEPA does not optimize field desc.