devops/ai/ai-sre-incident-response/SKILL.md
Build AI-focused SRE incident response practices for LLM outages, degraded quality, runaway cost events, and safety regressions.
npx skillsauth add bagelhole/devops-security-agent-skills ai-sre-incident-responseInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Apply SRE rigor to AI systems where incidents include quality regressions, unsafe outputs, and budget explosions.
| Severity | Criteria | Response Time | Notification | |----------|----------|---------------|--------------| | SEV1 | User-facing outage, compliance risk, data leak | 5 min | Page on-call + incident commander | | SEV2 | Major degradation in key flows | 15 min | Page on-call | | SEV3 | Limited impact or internal-only issue | 1 hour | Slack alert | | SEV4 | Cosmetic or low-priority regression | Next business day | Ticket |
# prometheus-ai-alerts.yaml
groups:
- name: ai-service-alerts
rules:
- alert: ModelEndpointDown
expr: up{job="llm-inference"} == 0
for: 2m
labels:
severity: sev1
annotations:
summary: "LLM inference endpoint {{ $labels.instance }} is down"
runbook_url: "https://runbooks.internal/ai/model-outage"
- alert: HighHallucinationRate
expr: |
rate(llm_hallucination_detected_total[10m])
/ rate(llm_requests_total[10m]) > 0.15
for: 5m
labels:
severity: sev2
annotations:
summary: "Hallucination rate above 15% for {{ $labels.model }}"
runbook_url: "https://runbooks.internal/ai/quality-regression"
- alert: TokenCostExplosion
expr: |
sum(rate(llm_token_cost_dollars[5m])) by (tenant)
> 0.50
for: 3m
labels:
severity: sev2
annotations:
summary: "Token spend exceeds $0.50/min for tenant {{ $labels.tenant }}"
runbook_url: "https://runbooks.internal/ai/cost-spike"
- alert: LatencyP95Exceeded
expr: |
histogram_quantile(0.95,
rate(llm_request_duration_seconds_bucket[5m])
) > 5
for: 5m
labels:
severity: sev2
annotations:
summary: "LLM p95 latency exceeds 5s for {{ $labels.service }}"
- alert: GuardrailViolationSpike
expr: |
rate(llm_guardrail_violations_total[10m])
/ rate(llm_requests_total[10m]) > 0.05
for: 5m
labels:
severity: sev1
annotations:
summary: "Guardrail violations above 5% for {{ $labels.model }}"
runbook_url: "https://runbooks.internal/ai/safety-incident"
- alert: ModelQualityDrop
expr: |
llm_eval_score{metric="groundedness"} < 0.70
for: 10m
labels:
severity: sev2
annotations:
summary: "Groundedness score dropped below 0.70 for {{ $labels.model }}"
- alert: ProviderErrorRateHigh
expr: |
rate(llm_provider_errors_total[5m])
/ rate(llm_provider_requests_total[5m]) > 0.10
for: 3m
labels:
severity: sev2
annotations:
summary: "Provider {{ $labels.provider }} error rate above 10%"
TRIGGER: ModelEndpointDown fires for > 2 minutes
RESPONDER: On-call AI platform engineer
1. Acknowledge alert in PagerDuty.
2. Check provider status page (e.g., status.openai.com).
3. Verify network connectivity:
curl -s -o /dev/null -w "%{http_code}" https://api.provider.com/health
4. If provider is down:
a. Enable fallback model route in gateway config.
b. kubectl set env deployment/llm-gateway FALLBACK_ENABLED=true
c. Verify fallback traffic is flowing via Grafana dashboard.
5. If self-hosted model is down:
a. Check pod status: kubectl get pods -l app=llm-inference -n ai
b. Check GPU health: kubectl logs -l app=llm-inference --tail=50
c. Restart if OOM: kubectl rollout restart deployment/llm-inference -n ai
6. Freeze all deployments:
kubectl annotate deployment --all deploy-freeze=true -n ai
7. Communicate ETA in #incident-channel.
8. When resolved, unfreeze and run smoke tests.
TRIGGER: HighHallucinationRate or ModelQualityDrop fires
RESPONDER: On-call AI engineer + ML lead
1. Acknowledge alert. Open incident ticket.
2. Identify scope:
- Which model version? Check deployment metadata.
- Which routes/tenants affected? Filter by labels in Grafana.
3. Check recent changes:
- Model version promotion in last 24h?
- Prompt template changes in last 24h?
- Retrieval index rebuild in last 24h?
4. If recent model change:
kubectl rollout undo deployment/llm-inference -n ai
5. If recent prompt change:
git revert <commit> && git push # triggers GitOps redeploy
6. Increase trace sampling to 100% for affected route:
kubectl set env deployment/llm-gateway TRACE_SAMPLE_RATE=1.0
7. Run offline eval suite against current production:
python run_evals.py --target prod --suite quality --compare baseline
8. Confirm metrics return to baseline before closing.
TRIGGER: TokenCostExplosion fires
RESPONDER: On-call platform engineer
1. Identify top consumers:
Query: topk(10, sum(rate(llm_token_cost_dollars[15m])) by (tenant, model, route))
2. Check for runaway loops:
- Agent retry storms (exponential token growth per request)
- Missing max_tokens caps on new routes
- Cache bypass due to config change
3. Apply immediate caps:
kubectl patch configmap llm-quotas -n ai --patch '
data:
max_tokens_per_request: "4096"
rpm_limit: "60"
'
4. Enable semantic cache if disabled:
kubectl set env deployment/llm-gateway CACHE_ENABLED=true
5. Route traffic to cheaper model tier:
kubectl set env deployment/llm-gateway DEFAULT_MODEL=gpt-4o-mini
6. Notify affected tenants of temporary limits.
7. Open postmortem with cost attribution analysis.
Level 1 (0-15 min): On-call AI platform engineer
Level 2 (15-30 min): AI platform team lead + affected product owner
Level 3 (30-60 min): Engineering director + security (if safety incident)
Level 4 (60+ min): VP Engineering + legal (if compliance/data incident)
Safety incidents always start at Level 2 minimum.
Provider-side incidents: open support ticket immediately at Level 1.
# Request success rate by model
1 - (
sum(rate(llm_requests_total{status="error"}[5m])) by (model)
/ sum(rate(llm_requests_total[5m])) by (model)
)
# Cost per successful answer
sum(rate(llm_token_cost_dollars[5m])) by (route)
/ sum(rate(llm_requests_total{status="success"}[5m])) by (route)
# Hallucination rate trend (1h window, 5m steps)
rate(llm_hallucination_detected_total[1h])
/ rate(llm_requests_total[1h])
# Latency breakdown by stage
histogram_quantile(0.95, rate(llm_retrieval_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(llm_generation_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(llm_tool_execution_duration_seconds_bucket[5m]))
# Tenant cost leaderboard
topk(10, sum(rate(llm_token_cost_dollars[1h])) by (tenant))
## Incident Summary
- **Severity**: SEVx
- **Duration**: start_time - end_time (Xh Ym)
- **Detection**: How was it detected? (alert / customer report / manual)
- **Impact**: X tenants, Y requests, $Z cost
## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | Alert fired |
| HH:MM | Responder acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Incident resolved |
## Root Cause
[Description]
## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| Tune alert threshold | @engineer | YYYY-MM-DD | Open |
| Add fallback route | @platform | YYYY-MM-DD | Open |
Regularly test incident readiness:
| Symptom | Check | Fix | |---------|-------|-----| | All requests timing out | Provider status page, DNS resolution | Enable fallback provider | | Gradual quality decline | Recent model/prompt deployments | Roll back to last known good | | Sudden cost spike | Per-tenant token usage dashboard | Apply emergency token caps | | Guardrail violations spike | Model version, prompt injection logs | Enable stricter input filtering | | Intermittent 503 errors | Pod restarts, GPU OOM events | Increase memory limits or reduce batch size |
development
Design and operationalize SRE dashboards that surface reliability, latency, error, saturation, and capacity signals across services. Use when building observability views for SLOs, incident response, and executive reliability reporting.
testing
Harden OpenClaw self-hosted environments with baseline host controls, auth tightening, secret handling, network segmentation, and safe update/rollback workflows. Use when deploying OpenClaw in home labs, startups, or production-like local AI infrastructure.
devops
Deploy, manage, and optimize vector databases for AI applications. Covers Qdrant, Weaviate, pgvector, and Pinecone — collection management, indexing strategies, backup, and performance tuning for production RAG and semantic search workloads.
testing
Deploy ML models on Kubernetes with KServe (formerly KFServing) and NVIDIA Triton Inference Server. Includes canary deployments, autoscaling, model versioning, A/B testing, and GPU resource management for production model serving.