infrastructure/local-ai/llm-inference-scaling/SKILL.md
Auto-scale LLM inference clusters on Kubernetes using KEDA, custom GPU metrics, and horizontal pod autoscaling. Handle traffic spikes, implement queue-based scaling, and optimize cost with spot instances for AI workloads.
Scale LLM inference horizontally on Kubernetes with GPU-aware autoscaling, request queuing, and cost-efficient spot instance strategies.
Use this skill when:
dcgm-exporter or gpu-operator)

# Install NVIDIA GPU Operator (handles drivers, container toolkit, DCGM)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set dcgm.enabled=true \
  --set devicePlugin.enabled=true

# Verify GPU nodes are recognized
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe node <gpu-node> | grep nvidia

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-8b
  labels:
    app: vllm
    model: llama-3.1-8b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
      model: llama-3.1-8b
  template:
    metadata:
      labels:
        app: vllm
        model: llama-3.1-8b
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--tensor-parallel-size"
            - "1"
            - "--gpu-memory-utilization"
            - "0.90"
            - "--max-num-seqs"
            - "128"
          resources:
            requests:
              nvidia.com/gpu: "1"
              memory: "20Gi"
              cpu: "4"
            limits:
              nvidia.com/gpu: "1"
              memory: "24Gi"
              cpu: "8"
          ports:
            - containerPort: 8000
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token

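Clients and Prometheus both need a stable address for the replicas that the autoscaler adds and removes. A minimal sketch of a ClusterIP Service for the Deployment above follows. The Service name and the port name `http` are assumptions; the selector reuses the Deployment's labels, and port 8000 is the container port on which vLLM's OpenAI-compatible server exposes both the API and `/metrics`.

```yaml
# Sketch only: the Service name and port name are illustrative assumptions.
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama-8b
  labels:
    app: vllm
    model: llama-3.1-8b
spec:
  type: ClusterIP
  selector:
    # Must match the pod labels set in the Deployment template.
    app: vllm
    model: llama-3.1-8b
  ports:
    - name: http
      port: 8000
      targetPort: 8000
```

Because the Service load-balances across all ready pods, new replicas start receiving traffic only after their readiness probe passes.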
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
spec:
  scaleTargetRef:
    name: vllm-llama-8b
  minReplicaCount: 1
  maxReplicaCount: 8
  # KEDA applies cooldownPeriod when scaling to zero; scale-down between
  # replicas follows the HPA stabilization window (300 s by default).
  cooldownPeriod: 300
  pollingInterval: 15   # seconds between metric checks
  triggers:
    # Both queries assume the scrape config attaches a deployment label.
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring:9090
        metricName: vllm_num_requests_waiting
        threshold: "10"   # target of about 10 waiting requests per replica
        query: |
          sum(vllm:num_requests_waiting{deployment="vllm-llama-8b"})
    - type: prometheus
      metricType: Value   # compare the averaged ratio directly to the threshold
      metadata:
        serverAddress: http://prometheus-server.monitoring:9090
        metricName: vllm_gpu_cache_usage
        threshold: "0.8"   # scale up if KV cache >80% full
        query: |
          avg(vllm:gpu_cache_usage_perc{deployment="vllm-llama-8b"})

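With KEDA's default `AverageValue` metric type, the generated HPA aims for roughly `ceil(metric / threshold)` replicas, so 45 waiting requests against a threshold of 10 asks for 5 replicas. Because each new vLLM replica has to load model weights before it is useful, it can help to scale up quickly and scale down slowly. The fragment below is a sketch of how that might be expressed through the ScaledObject's `advanced.horizontalPodAutoscalerConfig` field; every number in it is an illustrative assumption, not a tuned value.

```yaml
# Fragment to merge into the ScaledObject spec; all numbers are assumptions.
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          # React immediately to a backlog, adding at most 2 pods per minute.
          stabilizationWindowSeconds: 0
          policies:
            - type: Pods
              value: 2
              periodSeconds: 60
        scaleDown:
          # Wait 5 minutes of sustained low load, then remove 1 pod every 2 minutes.
          stabilizationWindowSeconds: 300
          policies:
            - type: Pods
              value: 1
              periodSeconds: 120
```

A slow scale-down avoids discarding a warm replica just before the next spike, at the cost of paying for idle GPU time a little longer.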
# ScaledJob for async batch inference
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: llm-batch-inference
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: inference-worker
            image: myapp/inference-worker:latest
            env:
              - name: REDIS_URL
                value: redis://redis:6379
              - name: QUEUE_NAME
                value: inference-jobs
        restartPolicy: OnFailure
  minReplicaCount: 0
  maxReplicaCount: 20
  pollingInterval: 5
  successfulJobsHistoryLimit: 3
  triggers:
    - type: redis
      metadata:
        address: redis:6379
        listName: inference-jobs
        listLength: "5"   # 1 worker per 5 queued jobs

# Mixed node pool: on-demand + spot GPUs
apiVersion: v1
kind: ConfigMap
metadata:
  # The priority expander reads a ConfigMap with exactly this name,
  # in the namespace where the cluster autoscaler runs.
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |
    50:   # highest value wins = prefer spot
      - .*spot.*
    10:
      - .*on-demand.*
---
# Node affinity for spot with on-demand fallback (pod template spec fragment)
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80
          preference:
            matchExpressions:
              - key: node.kubernetes.io/lifecycle
                operator: In
                values: [spot]
        - weight: 20
          preference:
            matchExpressions:
              - key: node.kubernetes.io/lifecycle
                operator: In
                values: [on-demand]

# AWS EKS: enable cluster autoscaler for GPU node group
# expander=priority makes the autoscaler honor the priority ConfigMap
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=my-cluster \
  --set awsRegion=us-east-1 \
  --set rbac.serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::ACCOUNT:role/ClusterAutoscalerRole \
  --set extraArgs.skip-nodes-with-local-storage=false \
  --set extraArgs.expander=priority

# Prevent the autoscaler from scaling down a GPU node
# (safe-to-evict is a pod annotation; nodes use scale-down-disabled)
kubectl annotate node <node> \
  cluster-autoscaler.kubernetes.io/scale-down-disabled=true

# Prometheus queries for scaling decisions

# Requests waiting in vLLM queue (vLLM labels its metrics with model_name)
sum(vllm:num_requests_waiting) by (model_name)

# GPU KV cache utilization (>80% = bottleneck)
avg(vllm:gpu_cache_usage_perc) by (pod)

# Tokens per second throughput
sum(rate(vllm:generation_tokens_total[5m])) by (model_name)

# P99 time-to-first-token, aggregated across pods
histogram_quantile(0.99, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))

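The same queries can back alerts, so that sustained saturation is visible even when autoscaling has already hit `maxReplicaCount`. Below is a sketch in the standard Prometheus rule-file format. The group name, alert names, thresholds, and `for` durations are all illustrative assumptions and would need tuning against real traffic.

```yaml
# Sketch of a Prometheus rule file; names and thresholds are assumptions.
groups:
  - name: vllm-scaling
    rules:
      - alert: VLLMQueueBacklog
        # Total queued requests stayed high despite autoscaling.
        expr: sum(vllm:num_requests_waiting) > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "vLLM request queue above 50 for 5 minutes"
      - alert: VLLMKVCacheSaturated
        # Average KV cache usage is a ratio between 0 and 1.
        expr: avg(vllm:gpu_cache_usage_perc) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "vLLM KV cache above 90% for 10 minutes"
      - alert: VLLMSlowFirstToken
        # P99 time-to-first-token in seconds, aggregated across pods.
        expr: histogram_quantile(0.99, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m]))) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "vLLM P99 time-to-first-token above 2 seconds"
```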
| Issue | Cause | Fix |
|-------|-------|-----|
| Pods stuck in Pending | No GPU nodes available | Check cluster autoscaler logs; verify node group limits |
| Scale-up too slow | Cluster autoscaler delay + model load time | Pre-warm replicas; increase minReplicaCount |
| GPU fragmentation | Multiple small models on large GPUs | Use MIG partitioning or consolidate model sizes |
| Spot eviction causes errors | Spot instance reclamation | Add PodDisruptionBudget; use graceful shutdown |
| KEDA not scaling | Prometheus query returns no data | Test query in Prometheus UI first |
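For the spot eviction row, graceful shutdown could look roughly like the pod template fragment below. It is a sketch under assumptions: a node termination handler drains the node when the reclamation notice arrives (about two minutes on AWS), the container image ships `sh`, and the 20 and 120 second values are placeholders rather than measured numbers.

```yaml
# Fragment to merge into the vLLM Deployment; durations are assumptions.
spec:
  template:
    spec:
      # Give in-flight generations time to finish after SIGTERM.
      terminationGracePeriodSeconds: 120
      containers:
        - name: vllm
          lifecycle:
            preStop:
              exec:
                # Delay SIGTERM so the pod is removed from Service
                # endpoints before the server stops accepting requests.
                command: ["sh", "-c", "sleep 20"]
```

The preStop delay counts against `terminationGracePeriodSeconds`, so the grace period should exceed the sleep plus the longest request you expect to finish.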
- Set `minReplicaCount: 1` to avoid cold starts; scale to 0 only for batch jobs.
- Use a `PodDisruptionBudget` with `minAvailable: 1` to survive spot evictions.
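A PodDisruptionBudget with `minAvailable: 1` might be written as in the sketch below. The name `vllm-pdb` is an assumption, and the selector assumes the pods carry the `app: vllm` and `model: llama-3.1-8b` labels from the Deployment.

```yaml
# Sketch only: the PDB name is an illustrative assumption.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb
spec:
  # Keep at least one serving pod during voluntary disruptions such as drains.
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm
      model: llama-3.1-8b
```

A PDB only governs voluntary evictions such as node drains. It cannot stop the cloud provider from reclaiming a spot node, so it helps most when a termination handler drains the node ahead of reclamation. With a single replica, `minAvailable: 1` also blocks drains entirely until another replica is ready.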