infrastructure/local-ai/vllm-server/SKILL.md
Deploy and manage vLLM for high-throughput LLM inference. Configure continuous batching, tensor parallelism, quantization, and OpenAI-compatible API endpoints for production LLM serving.
npx skillsauth add bagelhole/devops-security-agent-skills vllm-server
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Deploy production-grade LLM inference servers with vLLM, an open-source LLM serving engine built for high throughput through PagedAttention and continuous batching.
Use this skill when:
nvidia-container-toolkit for Docker GPU passthrough
# Install vLLM
pip install vllm
# Serve a model (OpenAI-compatible API)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--api-key your-secret-key
# Test the endpoint
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-secret-key" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}]
}'
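The curl call above can also be made from Python. The following is a minimal sketch using only the standard library; it assumes the server was started with the host, port, and API key shown above, and the helper names `build_chat_request` and `chat` are illustrative, not part of vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumes the --host/--port used above
API_KEY = "your-secret-key"            # assumes the --api-key used above


def build_chat_request(
    prompt: str, model: str = "meta-llama/Llama-3.1-8B-Instruct"
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` should print the model's reply. Keeping request construction separate from sending makes the payload easy to inspect without a live server.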
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--api-key your-secret-key
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - model-cache:/root/.cache/huggingface
    ports:
      - "8000:8000"
    ipc: host
    command: >
      --model meta-llama/Llama-3.1-70B-Instruct
      --tensor-parallel-size 2
      --max-model-len 32768
      --gpu-memory-utilization 0.90
      --api-key ${VLLM_API_KEY}
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
volumes:
  model-cache:
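A deploy script may want to block until the same `/health` endpoint used by the Compose healthcheck responds. The sketch below mirrors the healthcheck's interval and retry values; the function names are illustrative, and the defaults are assumptions that a large model load can easily outlast.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_health(url: str = "http://localhost:8000/health") -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as unhealthy.
        return False


def wait_until_healthy(
    probe: Callable[[], bool] = probe_health,
    retries: int = 3,
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` up to `retries` times, sleeping between failed attempts."""
    for attempt in range(retries):
        if probe():
            return True
        if attempt < retries - 1:
            sleep(interval_s)
    return False
```

The `probe` and `sleep` parameters are injectable so the retry logic can be exercised without a running server. For a 70B model, a higher `retries` value is likely needed while weights load.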
# Split one model across 4 GPUs
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.90
# AWQ quantization (70B on 2x A100 40GB)
vllm serve casperhansen/llama-3-70b-instruct-awq \
--quantization awq \
--tensor-parallel-size 2
# GPTQ quantization
vllm serve TheBloke/Llama-2-70B-Chat-GPTQ \
--quantization gptq
# FP8 (H100 NVL native)
vllm serve meta-llama/Llama-3.1-405B-Instruct \
--quantization fp8 \
--tensor-parallel-size 8
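The tensor-parallel and quantization choices above follow from simple arithmetic on weight size. The sketch below is a rough estimate only: it counts weights, ignores KV cache, activations, and quantization metadata such as scales, and assumes weights are split evenly across GPUs. The function name is illustrative.

```python
def weight_gib_per_gpu(
    params_billion: float, bits_per_param: float, tensor_parallel: int
) -> float:
    """Approximate GiB of model weights held by each GPU."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / tensor_parallel / 2**30


# 70B in 16-bit split across 4 GPUs
print(round(weight_gib_per_gpu(70, 16, 4), 1))   # 32.6
# 70B in 4-bit AWQ split across 2 GPUs
print(round(weight_gib_per_gpu(70, 4, 2), 1))    # 16.3
# 405B in FP8 split across 8 GPUs
print(round(weight_gib_per_gpu(405, 8, 8), 1))   # 47.1
```

Whatever remains of each GPU's `--gpu-memory-utilization` budget after weights is what vLLM can use for KV cache, so a tight fit leaves little room for concurrent requests.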
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-auto-tool-choice \
--tool-call-parser llama3_json \
--guided-decoding-backend outlines
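With tool calling enabled as above, a client sends tool definitions and reads tool calls back in the OpenAI Chat Completions shape. The sketch below builds such a request body and parses a response; the `get_weather` tool and both helper names are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical tool definition, for illustration only.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(prompt: str) -> dict:
    """Request body for /v1/chat/completions that lets the model pick a tool."""
    return {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",
    }


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (function name, parsed arguments) for each tool call in the first choice."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    # Arguments arrive as a JSON-encoded string and must be decoded.
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]
```

An empty list from `extract_tool_calls` means the model answered in plain text instead of calling a tool.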
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-lora \
--lora-modules sql-lora=/path/to/sql-lora \
code-lora=/path/to/code-lora \
--max-lora-rank 64
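Once adapters are registered with `--lora-modules`, a request selects one by passing the adapter name in the `model` field; the base model name still routes to the unadapted weights. A minimal sketch, assuming the two adapter names from the command above (the helper name is illustrative):

```python
from typing import Optional

BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"
ADAPTERS = {"sql-lora", "code-lora"}  # names registered via --lora-modules above


def completion_payload(prompt: str, adapter: Optional[str] = None) -> dict:
    """Build a chat request body that targets the base model or a named adapter."""
    if adapter is not None and adapter not in ADAPTERS:
        raise ValueError(f"unknown LoRA adapter: {adapter!r}")
    return {
        "model": adapter or BASE_MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
```

For example, `completion_payload("List all users", adapter="sql-lora")` targets the SQL adapter, while omitting `adapter` targets the base model.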
# Maximize throughput for batch workloads
# --max-num-seqs: max concurrent sequences
# --max-num-batched-tokens: tokens per batch
# --gpu-memory-utilization: use 95% of VRAM
# --swap-space: CPU swap space (GiB)
vllm serve <model> \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --swap-space 4
# Minimize latency for interactive use
vllm serve <model> \
--max-num-seqs 32 \
--enforce-eager # skip CUDA graph capture: faster startup and lower memory use, usually at some cost in decode speed
# The batch runner and benchmark CLI ship with the vllm package
pip install vllm
# Run an offline batch job from a JSONL file of OpenAI batch-format requests
python -m vllm.entrypoints.openai.run_batch \
--model meta-llama/Llama-3.1-8B-Instruct \
--input-file prompts.jsonl \
--output-file results.jsonl
# Benchmark with vllm bench
vllm bench throughput \
--model meta-llama/Llama-3.1-8B-Instruct \
--num-prompts 1000 \
--input-len 512 \
--output-len 128
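The `prompts.jsonl` input used by the batch runner above holds one request per line in the OpenAI batch format. The sketch below writes such a file; the helper name, the `request-N` IDs, and the `max_tokens` value are assumptions for illustration.

```python
import json
from pathlib import Path


def write_batch_file(
    prompts: list[str],
    path: str,
    model: str = "meta-llama/Llama-3.1-8B-Instruct",
) -> int:
    """Write one batch-format chat request per line; return the number written."""
    lines = []
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"request-{i}",  # echoed back in the results file
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 128,
            },
        }
        lines.append(json.dumps(request))
    Path(path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)
```

Calling `write_batch_file(["Hello!", "Summarize PagedAttention."], "prompts.jsonl")` produces a two-line file; the `custom_id` values let results be matched to prompts afterward.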
# Check running server stats
curl http://localhost:8000/metrics # Prometheus metrics
# Key metrics to watch (names vary by vLLM version; confirm against the /metrics output):
# vllm:num_requests_running - active requests
# vllm:gpu_cache_usage_perc - KV cache utilization
# vllm:generation_tokens_total - generated-token counter; take its rate for throughput
# vllm:time_to_first_token_seconds - TTFT latency histogram
# vllm:e2e_request_latency_seconds - end-to-end latency
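The `/metrics` output is plain Prometheus text, so a quick check does not require a full Prometheus stack. The sketch below parses that text and flags a saturated KV cache; it assumes samples carry no trailing timestamp and that `vllm:gpu_cache_usage_perc` is reported as a fraction between 0 and 1. The function names are illustrative.

```python
def parse_vllm_metrics(text: str) -> dict[str, float]:
    """Map each `vllm:` sample (metric name plus labels) to its float value."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and non-vLLM metrics.
        if not line.startswith("vllm:"):
            continue
        key, _, value = line.rpartition(" ")
        samples[key] = float(value)
    return samples


def metric_total(samples: dict[str, float], name: str) -> float:
    """Sum a metric across all of its label sets."""
    return sum(
        value
        for key, value in samples.items()
        if key == name or key.startswith(name + "{")
    )


def cache_is_saturated(samples: dict[str, float], threshold: float = 0.95) -> bool:
    """True when KV cache usage exceeds the threshold."""
    return metric_total(samples, "vllm:gpu_cache_usage_perc") > threshold
```

Feed it the body returned by `curl http://localhost:8000/metrics`. Summing across label sets in `cache_is_saturated` is only meaningful when a single model is served, which is the setup assumed here.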
| Issue | Cause | Fix |
|-------|-------|-----|
| CUDA out of memory | Model too large for VRAM | Serve a quantized checkpoint (for example AWQ with --quantization awq), raise --tensor-parallel-size, or lower --max-model-len |
| Slow cold start | Model not cached | Pre-pull with huggingface-cli download <model> |
| Low throughput | Too few concurrent requests | Increase --max-num-seqs |
| KV cache full errors | Context length too long | Set --max-model-len lower |
| tokenizer error | Tokenizer mismatch | Use --tokenizer to specify correct tokenizer |
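The link between context length and KV cache pressure in the table above can be estimated directly. The sketch below computes the worst-case KV cache footprint of full-length sequences; the layer, head, and dimension values are assumptions taken from the published Llama-3.1-8B configuration and should be checked against the model's `config.json`. The function names are illustrative.

```python
def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2
) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(
    max_model_len: int,
    concurrent_seqs: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,
) -> float:
    """Worst-case GiB if every sequence grows to max_model_len."""
    per_token = kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes)
    return per_token * max_model_len * concurrent_seqs / 2**30


# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads, head dimension 128, 16-bit cache
print(kv_cache_bytes_per_token(32, 8, 128))   # 131072
print(kv_cache_gib(32768, 1, 32, 8, 128))     # 4.0
```

Under these assumptions a single 32768-token sequence needs about 4 GiB of cache, so halving `--max-model-len` halves the worst case per sequence. PagedAttention allocates cache blocks on demand, so real usage is usually well below this bound.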
- Set --gpu-memory-utilization 0.90 to leave headroom for CUDA kernels.
- Pin the model with --revision for reproducible deployments.
- Set HF_HUB_OFFLINE=1 in production to prevent unexpected downloads.
- Use --enable-chunked-prefill for long-context workloads.
- Monitor gpu_cache_usage_perc: above 95%, requests start to queue.