infrastructure/networking/llm-gateway/SKILL.md
Deploy an API gateway for LLM traffic with load balancing, rate limiting, key management, semantic caching, fallback routing, and cost tracking. Covers LiteLLM Proxy, OpenRouter-compatible setup, and custom Nginx/Traefik patterns.
npx skillsauth add bagelhole/devops-security-agent-skills llm-gatewayInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
A unified API gateway that routes LLM requests across providers and self-hosted models — with rate limiting, cost tracking, caching, and failover.
Use this skill when:
LiteLLM is the de facto open-source LLM gateway with OpenAI-compatible API.
# Run with Docker
docker run -d \
--name litellm-proxy \
-p 4000:4000 \
-e OPENAI_API_KEY=$OPENAI_API_KEY \
-e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
-v $(pwd)/litellm-config.yaml:/app/config.yaml \
ghcr.io/berriai/litellm:main-latest \
--config /app/config.yaml \
--detailed_debug
# litellm-config.yaml
model_list:
# OpenAI models
- model_name: gpt-4o
litellm_params:
model: openai/gpt-4o
api_key: os.environ/OPENAI_API_KEY
rpm: 10000
tpm: 2000000
- model_name: gpt-4o-mini
litellm_params:
model: openai/gpt-4o-mini
api_key: os.environ/OPENAI_API_KEY
# Anthropic
- model_name: claude-sonnet-4-6
litellm_params:
model: anthropic/claude-sonnet-4-6
api_key: os.environ/ANTHROPIC_API_KEY
# Self-hosted vLLM instances (load balanced)
- model_name: llama-3.1-8b
litellm_params:
model: openai/meta-llama/Llama-3.1-8B-Instruct
api_base: http://vllm-1:8000/v1
api_key: fake # vLLM key
- model_name: llama-3.1-8b
litellm_params:
model: openai/meta-llama/Llama-3.1-8B-Instruct
api_base: http://vllm-2:8000/v1 # second replica — auto load balanced
api_key: fake
# Fallback: cheap model if primary fails
- model_name: gpt-4o
litellm_params:
model: openai/gpt-4o-mini # fallback to cheaper model
api_key: os.environ/OPENAI_API_KEY
router_settings:
routing_strategy: least-busy # or: latency-based, simple-shuffle
num_retries: 3
retry_after: 5
allowed_fails: 2
cooldown_time: 60
# Fallback configuration
fallbacks:
- gpt-4o: [claude-sonnet-4-6]
- claude-sonnet-4-6: [gpt-4o]
litellm_settings:
# Semantic caching
cache: true
cache_params:
type: redis
host: redis
port: 6379
similarity_threshold: 0.90 # cache if >90% semantic similarity
# Logging
success_callback: ["langfuse"]
failure_callback: ["langfuse"]
langfuse_public_key: os.environ/LANGFUSE_PUBLIC_KEY
langfuse_secret_key: os.environ/LANGFUSE_SECRET_KEY
general_settings:
master_key: os.environ/LITELLM_MASTER_KEY
database_url: postgresql://litellm:password@postgres:5432/litellm
store_model_in_db: true
services:
litellm:
image: ghcr.io/berriai/litellm:main-latest
command: ["--config", "/app/config.yaml", "--port", "4000"]
volumes:
- ./litellm-config.yaml:/app/config.yaml
ports:
- "4000:4000"
environment:
- OPENAI_API_KEY=${OPENAI_API_KEY}
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
- LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
- DATABASE_URL=postgresql://litellm:password@postgres:5432/litellm
depends_on:
postgres:
condition: service_healthy
redis:
condition: service_started
restart: unless-stopped
postgres:
image: postgres:16-alpine
environment:
POSTGRES_DB: litellm
POSTGRES_USER: litellm
POSTGRES_PASSWORD: password
volumes:
- postgres-data:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U litellm"]
interval: 5s
retries: 5
restart: unless-stopped
redis:
image: redis:7-alpine
command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru
volumes:
- redis-data:/data
restart: unless-stopped
volumes:
postgres-data:
redis-data:
# Create a virtual API key for a team (via LiteLLM API)
curl -X POST http://localhost:4000/key/generate \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{
"team_id": "team-backend",
"key_alias": "backend-team-key",
"models": ["gpt-4o-mini", "llama-3.1-8b"],
"max_budget": 100, # USD limit
"budget_duration": "monthly",
"rpm_limit": 100, # requests per minute
"tpm_limit": 500000 # tokens per minute
}'
# View spend
curl http://localhost:4000/spend/keys \
-H "Authorization: Bearer $LITELLM_MASTER_KEY"
# nginx.conf — round-robin across vLLM replicas
upstream vllm_backends {
least_conn;
server vllm-1:8000 max_fails=3 fail_timeout=30s;
server vllm-2:8000 max_fails=3 fail_timeout=30s;
server vllm-3:8000 max_fails=3 fail_timeout=30s;
keepalive 32;
}
server {
listen 80;
server_name llm-api.internal;
# Rate limiting
limit_req_zone $http_authorization zone=per_key:10m rate=100r/m;
limit_req zone=per_key burst=20 nodelay;
location /v1/ {
proxy_pass http://vllm_backends;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_read_timeout 300s; # long timeout for streaming
proxy_buffering off; # required for SSE streaming
proxy_cache_bypass 1;
}
}
# Check LiteLLM health
curl http://localhost:4000/health
# Model-level health
curl http://localhost:4000/health/liveliness
# Spend by model
curl http://localhost:4000/spend/models \
-H "Authorization: Bearer $LITELLM_MASTER_KEY"
# Active virtual keys
curl http://localhost:4000/key/list \
-H "Authorization: Bearer $LITELLM_MASTER_KEY"
| Issue | Cause | Fix |
|-------|-------|-----|
| ConnectionRefusedError to backend | Backend not reachable | Check api_base URL; verify backend is healthy |
| Rate limit errors (429) | Budget/RPM exceeded | Increase limits or rotate to fallback model |
| Slow streaming responses | proxy_buffering enabled | Set proxy_buffering off in Nginx |
| Cache miss rate high | Threshold too strict | Lower similarity_threshold to 0.85 |
| Postgres connection errors | DB not ready | Add depends_on with condition: service_healthy |
cache: true with Redis for repeated or similar queries; can cut costs 30–50%.num_retries: 3 with fallbacks to handle provider outages gracefully.least-busy routing strategy for self-hosted models to avoid GPU saturation.development
Design and operationalize SRE dashboards that surface reliability, latency, error, saturation, and capacity signals across services. Use when building observability views for SLOs, incident response, and executive reliability reporting.
testing
Harden OpenClaw self-hosted environments with baseline host controls, auth tightening, secret handling, network segmentation, and safe update/rollback workflows. Use when deploying OpenClaw in home labs, startups, or production-like local AI infrastructure.
devops
Deploy, manage, and optimize vector databases for AI applications. Covers Qdrant, Weaviate, pgvector, and Pinecone — collection management, indexing strategies, backup, and performance tuning for production RAG and semantic search workloads.
testing
Deploy ML models on Kubernetes with KServe (formerly KFServing) and NVIDIA Triton Inference Server. Includes canary deployments, autoscaling, model versioning, A/B testing, and GPU resource management for production model serving.