areas/devops/sre/skills/slo-sli-design/SKILL.md
Define SLIs, SLOs, and error budgets; implement burn rate alerts; integrate with Prometheus.
Expertise: SLI selection, SLO target setting, error budget calculation, burn-rate alerting, Sloth/Pyrra integration.
Use when defining SLOs for a new service, setting up error budget tracking, or reviewing existing SLOs after an incident.
Step 1: What does the user care about?
→ "The checkout completes successfully and quickly"
Step 2: What CAN we measure?
→ HTTP 2xx responses, p99 latency
Step 3: Define the SLI formula
→ Availability SLI: good_requests / total_requests
where good = status < 500 AND latency < 500ms
Step 4: Pick SLO target (start conservative, tighten later)
→ 99.5% (don't chase 99.99% without data: an overly strict target spends engineering effort on caution that users may not need)
Step 5: Calculate error budget
→ 100% - 99.5% = 0.5% of a 28-day window: 0.005 × 28 × 24 × 60 min = 201.6 minutes (about 3h 22m) of full unavailability
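The arithmetic in Steps 3 and 5 can be sketched in a few lines of Python. This is a minimal illustration, not part of any tooling named in this skill. The function names are hypothetical, and treating a window with zero traffic as fully available is an assumption.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI = good_requests / total_requests (Step 3)."""
    if total_requests == 0:
        # Assumption: a window with no traffic counts as meeting the SLO.
        return 1.0
    return good_requests / total_requests


def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of full unavailability the SLO allows over the window (Step 5)."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


# 99.5% over 28 days, as in the steps above.
print(round(error_budget_minutes(0.995, 28), 1))  # 201.6
print(availability_sli(good_requests=9_950, total_requests=10_000))  # 0.995
```

Tightening the target to 99.9% over the same window shrinks the budget to about 40 minutes, which is why Step 4 suggests starting conservative.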
# Recording rules for SLO tracking
groups:
  - name: slo.checkout-service
    interval: 30s
    rules:
      # Good requests: non-5xx responses served in under 500ms.
      # Assumes a latency histogram named http_request_duration_seconds that
      # carries service and status labels; adjust to your instrumentation.
      - record: slo:http_requests_good:rate5m
        expr: |
          sum by (service) (rate(http_request_duration_seconds_bucket{
            service="checkout-service",
            status!~"5..",
            le="0.5"
          }[5m]))
      # Total requests
      - record: slo:http_requests_total:rate5m
        expr: |
          sum by (service) (rate(http_requests_total{service="checkout-service"}[5m]))
      # SLI = good / total
      - record: slo:http_availability:ratio_rate5m
        expr: slo:http_requests_good:rate5m / slo:http_requests_total:rate5m
      # 28-day rolling availability: mean of the 5m ratio. avg_over_time does not
      # depend on the evaluation interval, but it is not weighted by traffic.
      - record: slo:http_availability:ratio_rate28d
        expr: avg_over_time(slo:http_availability:ratio_rate5m[28d])
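Once a 28-day availability ratio is recorded, the share of error budget still available follows directly from it. The sketch below shows that calculation in Python. The function name `budget_remaining` and the sample availability value are assumptions made for illustration.

```python
def budget_remaining(availability: float, slo_target: float) -> float:
    """Fraction of the error budget left (1.0 = untouched, < 0 = overspent)."""
    allowed_error = 1.0 - slo_target     # e.g. 0.005 for a 99.5% SLO
    observed_error = 1.0 - availability  # error ratio measured over the window
    return 1.0 - observed_error / allowed_error


# Hypothetical reading: 99.7% measured availability against a 99.5% target.
remaining = budget_remaining(availability=0.997, slo_target=0.995)
print(f"{remaining:.0%} of the error budget remains")
```

A negative result means the window has already spent more than its budget, which is a common trigger for the post-incident SLO review mentioned at the top.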
# Multi-window, multi-burn-rate alerting (Google SRE Workbook pattern)
# Assumes ratio_rate1h, ratio_rate30m, and ratio_rate6h recording rules exist,
# defined like ratio_rate5m with the matching window, and keep a service label.
groups:
  - name: slo.checkout-service.burn-rate
    rules:
      # Fast burn: 14.4× rate (consumes about 2% of a 30-day budget in 1h)
      - alert: SLOFastBurn
        expr: |
          (
            slo:http_availability:ratio_rate1h{service="checkout-service"} < (1 - 14.4 * 0.005)
          ) and (
            slo:http_availability:ratio_rate5m{service="checkout-service"} < (1 - 14.4 * 0.005)
          )
        labels:
          severity: critical
          slo: checkout-service-availability
        annotations:
          summary: "Fast error budget burn: checkout-service (> 14.4× rate)"
          runbook_url: "https://runbooks.internal/slo-fast-burn"
      # Slow burn: 3× rate (consumes about 2.5% of a 30-day budget in 6h;
      # the SRE Workbook pairs the 6h/30m windows with a 6× factor, i.e. 5%)
      - alert: SLOSlowBurn
        expr: |
          (
            slo:http_availability:ratio_rate6h{service="checkout-service"} < (1 - 3 * 0.005)
          ) and (
            slo:http_availability:ratio_rate30m{service="checkout-service"} < (1 - 3 * 0.005)
          )
        labels:
          severity: warning
          slo: checkout-service-availability
        annotations:
          summary: "Slow error budget burn: checkout-service (> 3× rate)"
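The numeric thresholds in the alert expressions come from two small formulas: the availability floor is `1 - burn_rate × (1 - SLO)`, and the budget consumed is `burn_rate × window / period`. The Python sketch below works through both and mirrors the two-window `and` condition. The function names are hypothetical, and the 30-day period is an assumption taken from the SRE Workbook convention.

```python
def burn_threshold(slo_target: float, burn_rate: float) -> float:
    """Availability below which the given burn rate is exceeded."""
    return 1.0 - burn_rate * (1.0 - slo_target)


def budget_consumed(burn_rate: float, window_hours: float, period_days: int = 30) -> float:
    """Fraction of the period's error budget spent at this burn rate over the window."""
    return burn_rate * window_hours / (period_days * 24)


def should_fire(long_ratio: float, short_ratio: float,
                slo_target: float, burn_rate: float) -> bool:
    """Multi-window check: both the long and the short window must breach."""
    threshold = burn_threshold(slo_target, burn_rate)
    return long_ratio < threshold and short_ratio < threshold


print(round(burn_threshold(0.995, 14.4), 3))   # 0.928
print(round(budget_consumed(14.4, 1), 3))      # 0.02
print(round(budget_consumed(3, 6), 3))         # 0.025
print(should_fire(0.90, 0.91, 0.995, 14.4))    # True
print(should_fire(0.90, 0.99, 0.995, 14.4))    # False: short window has recovered
```

Requiring the short window to breach as well means the alert stops firing soon after the error rate recovers, instead of staying active until the long window drains.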
# slo/checkout-service.yaml: Sloth generates all recording rules + alerts
version: "prometheus/v1"
service: checkout-service
labels:
  team: backend
slos:
  - name: requests-availability
    objective: 99.5
    description: "99.5% of checkout requests succeed with latency < 500ms"
    # Note: the queries below count only 5xx responses as errors; the latency
    # condition in the description is not measured by this SLI.
    sli:
      events:
        error_query: |
          sum(rate(http_requests_total{service="checkout-service", status=~"5.."}[{{.window}}]))
        total_query: |
          sum(rate(http_requests_total{service="checkout-service"}[{{.window}}]))
    alerting:
      name: CheckoutServiceSLO
      page_alert:
        labels: { severity: critical }
      ticket_alert:
        labels: { severity: warning }
# Generate Prometheus rules from Sloth definition
sloth generate -i slo/checkout-service.yaml -o prometheus-rules/slo-checkout.yaml
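The `{{.window}}` placeholder in the Sloth queries is filled in once per alerting window when rules are generated. The Python sketch below only imitates that substitution, to show what the resulting PromQL looks like. It is not Sloth's implementation (Sloth uses Go templates), and the window list is an assumption chosen to match the windows used in the burn-rate alerts.

```python
ERROR_QUERY = (
    'sum(rate(http_requests_total{service="checkout-service", '
    'status=~"5.."}[{{.window}}]))'
)
TOTAL_QUERY = 'sum(rate(http_requests_total{service="checkout-service"}[{{.window}}]))'


def render(query_template: str, window: str) -> str:
    """Substitute a concrete range (e.g. '1h') for the {{.window}} placeholder."""
    return query_template.replace("{{.window}}", window)


def error_ratio_expr(window: str) -> str:
    """Error ratio for one window: rendered error query divided by total query."""
    return f"({render(ERROR_QUERY, window)}) / ({render(TOTAL_QUERY, window)})"


# Assumed windows, matching the 5m/30m/1h/6h pairs in the hand-written alerts.
for window in ("5m", "30m", "1h", "6h"):
    print(error_ratio_expr(window))
```

Inspecting the generated `prometheus-rules/slo-checkout.yaml` after running the command above remains the authoritative way to see which windows and rule names Sloth actually emits.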
Key panels: