devops/ai/llmops-platform-engineering/SKILL.md
Build production LLMOps platforms with CI/CD, model promotion workflows, evaluation gates, rollback, and governance across cloud and self-hosted inference.
npx skillsauth add bagelhole/devops-security-agent-skills llmops-platform-engineeringInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Design and operate an internal LLM platform that supports rapid experimentation without compromising reliability, cost, or compliance.
# .github/workflows/model-promotion.yaml
name: Model Promotion Pipeline
on:
workflow_dispatch:
inputs:
model_name:
description: "Model identifier"
required: true
model_version:
description: "Model version to promote"
required: true
target_env:
description: "Target environment"
required: true
type: choice
options: [staging, production]
jobs:
evaluate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run quality evaluation suite
run: |
python -m evals.run \
--model "${{ inputs.model_name }}:${{ inputs.model_version }}" \
--suite quality \
--output results/quality.json
- name: Run safety evaluation suite
run: |
python -m evals.run \
--model "${{ inputs.model_name }}:${{ inputs.model_version }}" \
--suite safety \
--output results/safety.json
- name: Run latency benchmark
run: |
python -m evals.benchmark \
--model "${{ inputs.model_name }}:${{ inputs.model_version }}" \
--concurrent-users 50 \
--duration 300 \
--output results/latency.json
- name: Gate check - quality
run: |
python -m evals.gate_check \
--results results/quality.json \
--threshold-file thresholds/quality.yaml
- name: Gate check - safety
run: |
python -m evals.gate_check \
--results results/safety.json \
--threshold-file thresholds/safety.yaml
- name: Gate check - latency
run: |
python -m evals.gate_check \
--results results/latency.json \
--threshold-file thresholds/latency.yaml
- name: Upload eval evidence
uses: actions/upload-artifact@v4
with:
name: eval-results-${{ inputs.model_version }}
path: results/
approve:
needs: evaluate
runs-on: ubuntu-latest
environment: ${{ inputs.target_env }}
steps:
- name: Record approval
run: |
echo "Approved by: ${{ github.actor }}"
echo "Model: ${{ inputs.model_name }}:${{ inputs.model_version }}"
echo "Target: ${{ inputs.target_env }}"
echo "Time: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
deploy:
needs: approve
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Deploy canary
run: |
kubectl set image deployment/${{ inputs.model_name }}-canary \
model=${{ inputs.model_name }}:${{ inputs.model_version }} \
-n ai-${{ inputs.target_env }}
- name: Wait for canary validation (15 min)
run: |
python -m canary.validate \
--deployment ${{ inputs.model_name }}-canary \
--namespace ai-${{ inputs.target_env }} \
--duration 900 \
--quality-threshold 0.85 \
--error-rate-threshold 0.02
- name: Promote to full rollout
run: |
kubectl set image deployment/${{ inputs.model_name }} \
model=${{ inputs.model_name }}:${{ inputs.model_version }} \
-n ai-${{ inputs.target_env }}
kubectl rollout status deployment/${{ inputs.model_name }} \
-n ai-${{ inputs.target_env }} --timeout=300s
# thresholds/quality.yaml
gates:
groundedness:
metric: groundedness_score
min: 0.85
comparison: gte
task_success:
metric: task_success_rate
min: 0.90
comparison: gte
hallucination:
metric: hallucination_rate
max: 0.08
comparison: lte
regression:
metric: quality_delta_vs_baseline
min: -0.02
comparison: gte
description: "Must not regress more than 2% vs current production"
# thresholds/latency.yaml
gates:
p50_latency:
metric: latency_p50_ms
max: 800
comparison: lte
p95_latency:
metric: latency_p95_ms
max: 2000
comparison: lte
p99_latency:
metric: latency_p99_ms
max: 5000
comparison: lte
throughput:
metric: requests_per_second
min: 50
comparison: gte
# ab-test-config.yaml
apiVersion: gateway.ai/v1
kind: ABTest
metadata:
name: model-comparison-q1
namespace: ai-production
spec:
duration: 7d
traffic_split:
control:
model: gpt-4o-2024-08-06
weight: 70
treatment:
model: gpt-4o-2025-01-15
weight: 30
metrics:
primary:
- task_success_rate
- user_satisfaction_score
secondary:
- latency_p95
- cost_per_request
- hallucination_rate
guardrails:
auto_rollback_if:
- metric: task_success_rate
threshold: 0.80
window: 1h
- metric: hallucination_rate
threshold: 0.15
window: 30m
assignment:
strategy: sticky_user
hash_key: user_id
# model-serving-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: llm-inference
namespace: ai-production
labels:
app: llm-inference
model: gpt-4o
version: "2025-01"
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
selector:
matchLabels:
app: llm-inference
template:
metadata:
labels:
app: llm-inference
model: gpt-4o
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: llm-inference
containers:
- name: model
image: registry.internal/vllm-server:0.4.1
args:
- "--model=/models/current"
- "--tensor-parallel-size=1"
- "--max-model-len=8192"
- "--gpu-memory-utilization=0.90"
ports:
- containerPort: 8000
name: inference
- containerPort: 8080
name: metrics
resources:
requests:
cpu: "4"
memory: "16Gi"
nvidia.com/gpu: "1"
limits:
cpu: "8"
memory: "32Gi"
nvidia.com/gpu: "1"
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 120
periodSeconds: 30
volumeMounts:
- name: model-weights
mountPath: /models
readOnly: true
- name: config
mountPath: /etc/vllm
volumes:
- name: model-weights
persistentVolumeClaim:
claimName: model-weights-pvc
- name: config
configMap:
name: vllm-config
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
nodeSelector:
gpu-type: a100
---
apiVersion: v1
kind: Service
metadata:
name: llm-inference
namespace: ai-production
spec:
selector:
app: llm-inference
ports:
- name: inference
port: 8000
targetPort: 8000
- name: metrics
port: 8080
targetPort: 8080
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: llm-inference-hpa
namespace: ai-production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: llm-inference
minReplicas: 2
maxReplicas: 10
metrics:
- type: Pods
pods:
metric:
name: llm_queue_depth
target:
type: AverageValue
averageValue: "5"
- type: Pods
pods:
metric:
name: gpu_utilization_percent
target:
type: AverageValue
averageValue: "75"
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Pods
value: 2
periodSeconds: 120
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Pods
value: 1
periodSeconds: 300
dev -> stage -> prod.| Signal | Target | Measurement Window | |--------|--------|--------------------| | Availability | 99.9% | 30-day rolling | | p95 Latency | < 1200ms | 5-min buckets | | Cost per request | < $0.05 | 1-hour average | | Task success rate | > 90% | 24-hour rolling | | Groundedness | > 85% | 24-hour rolling |
| Layer | Tools | |-------|-------| | Orchestration | Argo Workflows, GitHub Actions, Airflow | | Model Registry | MLflow, custom metadata DB | | Gateway | LiteLLM, Envoy-based API gateway | | Observability | OpenTelemetry + Prometheus + Grafana + Langfuse | | Policy | OPA/Rego for deployment and runtime checks | | Evaluation | RAGAS, custom eval harness, Promptfoo | | Serving | vLLM, TGI, Triton Inference Server |
| Issue | Diagnosis | Resolution | |-------|-----------|------------| | Canary fails quality gate | Compare eval results with baseline | Adjust model config or revert version | | Deployment stuck in rollout | Check pod events and resource quotas | Fix resource limits or node availability | | A/B test shows no significant difference | Verify traffic split and sample size | Extend test duration or increase treatment weight | | Model cold start too slow | Large model weight download | Use pre-cached PVCs or init containers | | Eval pipeline flaky | Non-deterministic model outputs | Set temperature=0 for evals, increase sample size |
development
Design and operationalize SRE dashboards that surface reliability, latency, error, saturation, and capacity signals across services. Use when building observability views for SLOs, incident response, and executive reliability reporting.
testing
Harden OpenClaw self-hosted environments with baseline host controls, auth tightening, secret handling, network segmentation, and safe update/rollback workflows. Use when deploying OpenClaw in home labs, startups, or production-like local AI infrastructure.
devops
Deploy, manage, and optimize vector databases for AI applications. Covers Qdrant, Weaviate, pgvector, and Pinecone — collection management, indexing strategies, backup, and performance tuning for production RAG and semantic search workloads.
testing
Deploy ML models on Kubernetes with KServe (formerly KFServing) and NVIDIA Triton Inference Server. Includes canary deployments, autoscaling, model versioning, A/B testing, and GPU resource management for production model serving.