skills/model-serving-api-builder/SKILL.md
Deploy ML models as production APIs with vLLM, TGI, ONNX Runtime, batching, autoscaling, and GPU optimization. Activate on: model serving, deploy LLM, vLLM setup, inference API, GPU serving. NOT for: model training (ai-engineer), prompt engineering (prompt-engineer).
npx skillsauth add curiositech/windags-skills model-serving-api-builderInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Deploy machine learning models as production APIs using vLLM, TGI, ONNX Runtime, and custom FastAPI services with batching, autoscaling, and GPU optimization.
Activate on: "deploy model API", "serve LLM", "vLLM setup", "inference server", "GPU serving", "model endpoint", "batch inference API", "TGI deployment", "ONNX serving", "autoscale inference"
NOT for: Model training or fine-tuning (ai-engineer), prompt design (prompt-engineer), or LLM application logic (ai-engineer)
| Domain | Technologies | Notes | |--------|-------------|-------| | LLM Serving | vLLM, TGI (HuggingFace), SGLang | PagedAttention, continuous batching, speculative decoding | | General ML | ONNX Runtime, Triton, TorchServe | Non-LLM models: vision, audio, tabular | | API Layer | FastAPI, gRPC, OpenAI-compatible endpoints | vLLM exposes OpenAI-compatible API natively | | Orchestration | Kubernetes + GPU operator, Docker, Modal, RunPod | GPU scheduling and resource management | | Quantization | AWQ, GPTQ, GGUF, bitsandbytes | 4-bit reduces VRAM 4x with < 2% quality loss | | Autoscaling | KEDA, HPA on GPU metrics, serverless (Modal) | Scale-to-zero for cost; scale-up for throughput |
Client ──→ [Load Balancer] ──→ [vLLM Instance(s)] ──→ [GPU(s)]
│ │
health checks OpenAI-compatible API
rate limiting /v1/completions
API key auth /v1/chat/completions
│
continuous batching
PagedAttention
tensor parallelism
# vLLM serving with optimizations (2026 best practices)
pip install vllm
# Single GPU (e.g., A100 80GB, Llama 3.1 8B)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 --port 8000 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--enable-prefix-caching \
--dtype auto
# Multi-GPU tensor parallelism (e.g., 2x A100 for 70B)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--enable-prefix-caching
# Quantized serving (4-bit AWQ, fits 70B on single A100)
vllm serve TheBloke/Llama-3.1-70B-Instruct-AWQ \
--quantization awq \
--max-model-len 4096 \
--gpu-memory-utilization 0.95
Requests ──→ [API Gateway / Router]
│
┌────────┼────────┐
│ │ │
▼ ▼ ▼
[vLLM] [ONNX] [Triton]
LLM Vision Ensemble
Llama 3 CLIP Multi-step
ResNet Pipeline
│ │ │
▼ ▼ ▼
GPU 0 GPU 1 GPU 2-3
Router logic:
/v1/chat/* → vLLM (LLM inference)
/v1/embeddings/* → ONNX (embedding model)
/v1/classify/* → Triton (vision classifier)
# k8s deployment with GPU scheduling
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-llama3
spec:
replicas: 2
template:
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- --model=meta-llama/Llama-3.1-8B-Instruct
- --gpu-memory-utilization=0.90
- --enable-prefix-caching
resources:
limits:
nvidia.com/gpu: 1 # Request 1 GPU per pod
memory: "32Gi"
requests:
nvidia.com/gpu: 1
memory: "24Gi"
ports:
- containerPort: 8000
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 120 # Model loading takes time
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 120
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
initialDelaySeconds: 5 causes restart loops. Profile startup time and set accordingly.gpu-memory-utilization to 1.0 leaves no room for request spikes. Use 0.85-0.92 as the safe range.tools
Building resilient distributed systems with circuit breakers, retries with full-jitter exponential backoff, retry budgets (per-request 3-attempt + per-client 10% ratio per Google SRE), deadline propagation, and the cascading-failure math (4 layers × 3 retries = 64x amplification). Grounded in Resilience4j, Microsoft Cloud Patterns, AWS Architecture Blog (Marc Brooker), and Google SRE Book.
testing
Designing HTTP cache headers that work correctly across browsers, CDNs, and shared proxies — `Cache-Control` directives per RFC 9111, `stale-while-revalidate` and `stale-if-error` per RFC 5861, the Vary header for varying responses, and surrogate keys for tag-based purging. Grounded in IETF RFCs and Cloudflare/Fastly docs.
development
Use when designing or fixing a Content Security Policy on a real site, choosing between nonce-based and hash-based CSP, adding strict-dynamic, debugging "Refused to execute inline script" errors, deploying CSP in report-only mode first, configuring report-to / report-uri, or auditing an existing policy for unsafe-inline / unsafe-eval / wildcards. Triggers: "CSP blocks legitimate inline script", strict-dynamic, nonce-{RANDOM}, sha256-{HASH}, object-src none, base-uri none, frame-ancestors, Trusted Types, X-Content-Security-Policy obsolete, report-only vs enforced. NOT for general HTTP security headers (HSTS, COOP/COEP), Trusted Types deep dive, CORS configuration, or building a WAF.
tools
Choosing and operating an HTTP API versioning strategy that doesn't break clients — Stripe's date-based pinned versions, the Deprecation/Sunset header pair (RFC 9745 + RFC 8594), URI vs header vs media-type approaches, and the version-transformer pattern. Grounded in Stripe's published architecture and IETF RFCs.