plugins/devops-sre/skills/reliability/slo-sli-error-budgets/SKILL.md
Implement Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets. Use this skill when defining reliability targets, measuring service health, or balancing reliability vs velocity. Activate when: SLO, SLI, SLA, error budget, reliability targets, service level, uptime target, availability target, latency target, nine nines, 99.9%.
Define and measure reliability in terms that matter to users.
| Term | Definition | Example |
|------|------------|---------|
| SLI | Service Level Indicator - What you measure | 99.2% of requests succeed |
| SLO | Service Level Objective - Your target | 99.9% availability |
| SLA | Service Level Agreement - Contract with customers | 99.5% with refund clause |
| Error Budget | Allowed unreliability (100% - SLO) | 0.1% = 43 min/month downtime |
availability = successful_requests / total_requests
# Prometheus query
sum(rate(http_requests_total{status!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))
latency_sli = requests_under_threshold / total_requests
# Example: 99% of requests under 200ms
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))
throughput_sli = successful_operations / attempted_operations
# Example: Batch jobs
sum(job_succeeded_total) / sum(job_attempted_total)
freshness_sli = fresh_data_requests / total_requests
# Example: Data updated within 1 minute
# The `bool` modifier turns the comparison into 0/1 so that sum() counts fresh series
sum(data_age_seconds < bool 60) / count(data_age_seconds)
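The SLIs above all share one shape: good events divided by total events. A minimal Python sketch of that calculation follows. The `ratio_sli` helper and the sample counts are hypothetical, chosen only for illustration, and the handling of an empty window is an assumption.

```python
def ratio_sli(good_events: int, total_events: int) -> float:
    """Return good/total as a fraction in [0, 1].

    With no traffic there is nothing to measure, so this sketch
    treats an empty window as fully compliant (an assumption; some
    teams prefer to report "no data" instead).
    """
    if total_events < 0 or good_events < 0:
        raise ValueError("event counts must be non-negative")
    if good_events > total_events:
        raise ValueError("good_events cannot exceed total_events")
    if total_events == 0:
        return 1.0
    return good_events / total_events


# Hypothetical counts for one measurement window.
availability = ratio_sli(good_events=99_200, total_events=100_000)
latency = ratio_sli(good_events=9_900, total_events=10_000)

print(f"availability SLI: {availability:.3%}")
print(f"latency SLI:      {latency:.3%}")
```

Keeping every SLI in this good/total form makes it directly comparable to an SLO target expressed as a fraction.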
| Availability | Downtime/Year | Downtime/Month | Downtime/Week |
|--------------|---------------|----------------|---------------|
| 99% | 3.65 days | 7.31 hours | 1.68 hours |
| 99.5% | 1.83 days | 3.65 hours | 50.4 min |
| 99.9% | 8.77 hours | 43.8 min | 10.1 min |
| 99.95% | 4.38 hours | 21.9 min | 5.04 min |
| 99.99% | 52.6 min | 4.38 min | 1.01 min |
| 99.999% | 5.26 min | 26.3 sec | 6.05 sec |
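The downtime figures in the table can be derived from the availability target and the period length. The sketch below assumes a 365.25-day year and a month of one twelfth of that, which appears to be the convention behind the table. The `allowed_downtime` helper and `PERIODS` mapping are hypothetical names.

```python
from datetime import timedelta

# Period lengths. Assumption: the year is 365.25 days and the month is
# one twelfth of that, which matches the 43.8 min/month figure for 99.9%.
PERIODS = {
    "year": timedelta(days=365.25),
    "month": timedelta(days=365.25 / 12),
    "week": timedelta(days=7),
}


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Downtime permitted in `period` at the given availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period * (1 - availability_pct / 100)


for target in (99.0, 99.9, 99.99):
    minutes = {
        name: round(allowed_downtime(target, period).total_seconds() / 60, 1)
        for name, period in PERIODS.items()
    }
    print(f"{target}%: {minutes} (minutes)")
```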
1. Start with user expectations
   - What do users actually need?
   - What are they getting today?
2. Consider dependencies
   - Your SLO can't exceed the availability of the dependencies you rely on
   - If the database is 99.9%, you can't be 99.99%
3. Start conservative, tighten later
   - It is easier to tighten an SLO than to loosen one
   - Build confidence before committing
4. Different SLOs for different tiers
   - Premium customers: 99.99%
   - Free tier: 99.5%
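The dependency point in step 2 can be made concrete by multiplying availabilities. This is a rough sketch that assumes every dependency is a hard requirement and that failures are independent, with no caching, retries, or fallbacks. The `serial_availability` helper and the example stack are hypothetical.

```python
import math


def serial_availability(dependency_availabilities: list[float]) -> float:
    """Best-case availability when every dependency must be up.

    Assumes hard, independent dependencies with no caching, retries,
    or fallbacks; real systems can land above or below this figure.
    """
    for availability in dependency_availabilities:
        if not 0.0 <= availability <= 1.0:
            raise ValueError("availabilities must be fractions in [0, 1]")
    return math.prod(dependency_availabilities)


# Hypothetical stack: database, cache, and an upstream API.
ceiling = serial_availability([0.999, 0.9995, 0.9999])
print(f"availability ceiling: {ceiling:.4%}")
```

Under these assumptions the ceiling is lower than the weakest dependency, which is why a service built on a 99.9% database cannot credibly promise 99.99%.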
Error Budget = (1 - SLO) × Time Period
Example for 99.9% SLO:
- Monthly budget = 0.1% × 43,200 minutes = 43.2 minutes
- Weekly budget = 0.1% × 10,080 minutes = 10.08 minutes
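The same formula works when the budget is counted in requests instead of minutes. The sketch below computes both. The `error_budget` helper is a hypothetical name, and the traffic figure of 10 million requests is an assumption for illustration only.

```python
def error_budget(slo: float, period_total: float) -> float:
    """Return (1 - slo) * period_total, in the units of period_total."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * period_total


MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes, as in the example above

budget_minutes = error_budget(0.999, MINUTES_PER_30_DAYS)
# Hypothetical traffic: 10 million requests in the same 30-day window.
budget_requests = error_budget(0.999, 10_000_000)

print(f"time-based budget:    {budget_minutes:.1f} minutes")
print(f"request-based budget: {budget_requests:,.0f} failed requests")
```

A request-based budget tends to fit the Prometheus ratio queries better, because those queries count requests, not minutes of downtime.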
# Example Error Budget Policy
error_budget_policy:
  healthy:  # >50% budget remaining
    - Continue feature development
    - Normal deployment velocity
  caution:  # 25-50% budget remaining
    - Reduce deployment frequency
    - Prioritize reliability work
    - Review recent incidents
  critical:  # <25% budget remaining
    - Freeze non-critical deployments
    - All hands on reliability
    - Daily error budget review
  exhausted:  # 0% budget remaining
    - Emergency only deployments
    - Postmortem all incidents
    - Leadership escalation
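The policy tiers above can be evaluated automatically from the fraction of budget remaining. The following is a minimal sketch, not a definitive implementation. The `budget_status` function is hypothetical, and the boundary handling is an assumption because the policy does not say which tier owns exactly 25% or 50%.

```python
def budget_status(remaining_fraction: float) -> str:
    """Map the fraction of error budget remaining to a policy tier.

    Boundary handling is an assumption: exactly 25% and exactly 50%
    both fall in "caution", since the policy only says ">50%",
    "25-50%", and "<25%".
    """
    if remaining_fraction <= 0.0:
        return "exhausted"
    if remaining_fraction < 0.25:
        return "critical"
    if remaining_fraction <= 0.50:
        return "caution"
    return "healthy"


for fraction in (0.80, 0.40, 0.10, 0.0):
    print(f"{fraction:.0%} remaining -> {budget_status(fraction)}")
```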
Error Budget: January 2026
SLO: 99.9% availability
Budget: 43.2 minutes
Consumption:
Week 1: ████░░░░░░░░░░░░░░░░ 8 min (INC-121)
Week 2: ██░░░░░░░░░░░░░░░░░░ 3 min
Week 3: ████████░░░░░░░░░░░░ 15 min (INC-125, INC-126)
Week 4: ████░░░░░░░░░░░░░░░░ 7 min (INC-128)
────────────────────
Total: ████████████████░░░░ 33 min consumed
Remaining: 10.2 min (24% of budget)
Status: ⚠️ CRITICAL (<25% of budget remaining)
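The totals in this report can be reproduced from the weekly consumption figures. The sketch below copies those figures from the report and assumes a 30-day window, which matches the 43.2-minute budget. The `bar` helper and its scaling are hypothetical, so its output will not necessarily match the bars shown above block for block.

```python
SLO = 0.999
WINDOW_MINUTES = 30 * 24 * 60  # 30-day window, matching the 43.2-minute budget

# Minutes of budget consumed per week, copied from the report above.
weekly_consumed = {"Week 1": 8, "Week 2": 3, "Week 3": 15, "Week 4": 7}

budget = (1 - SLO) * WINDOW_MINUTES
consumed = sum(weekly_consumed.values())
remaining = budget - consumed
remaining_pct = 100 * remaining / budget


def bar(minutes: float, total: float, width: int = 20) -> str:
    """Render a fixed-width text bar; the linear scaling is an assumption."""
    filled = max(0, min(width, round(width * minutes / total)))
    return "█" * filled + "░" * (width - filled)


for week, minutes in weekly_consumed.items():
    print(f"{week}: {bar(minutes, budget)} {minutes} min")
print(f"Total:     {bar(consumed, budget)} {consumed} min consumed")
print(f"Remaining: {remaining:.1f} min ({remaining_pct:.0f}% of budget)")
```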
# slo-config.yaml
slis:
  - name: availability
    description: Proportion of successful HTTP requests
    query: |
      sum(rate(http_requests_total{status!~"5.."}[{{window}}]))
      /
      sum(rate(http_requests_total[{{window}}]))

  # Expressed as a ratio of requests so the value is comparable to the 0.99 target
  - name: latency_under_200ms
    description: Proportion of requests served in under 200ms
    query: |
      sum(rate(http_request_duration_seconds_bucket{le="0.2"}[{{window}}]))
      /
      sum(rate(http_request_duration_seconds_count[{{window}}]))

slos:
  - name: api-availability
    sli: availability
    target: 0.999  # 99.9%
    window: 30d
  - name: api-latency
    sli: latency_under_200ms
    target: 0.99  # 99% of requests under 200ms
    window: 30d
# Alert at different burn rates
alerts:
  - name: SLOBurnRateCritical
    slo: api-availability
    burn_rate: 14.4  # Exhausts monthly budget in 2 days
    window: 1h
    severity: critical
  - name: SLOBurnRateWarning
    slo: api-availability
    burn_rate: 6  # Exhausts monthly budget in 5 days
    window: 6h
    severity: warning
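The burn-rate thresholds above follow from simple arithmetic: a burn rate of 1 spends the whole budget in exactly one SLO window, so a rate of N spends it N times faster. The sketch below works through both thresholds, assuming a 30-day window and a constant error rate. The function names and the 1.44% observation are hypothetical.

```python
def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget_ratio = 1.0 - slo
    if budget_ratio <= 0.0:
        raise ValueError("slo must be below 1.0 to leave any error budget")
    return observed_error_ratio / budget_ratio


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until the whole window's budget is gone at a constant burn rate."""
    if rate <= 0.0:
        return float("inf")
    return window_days / rate


for threshold in (14.4, 6.0):
    print(f"burn rate {threshold}: budget gone in "
          f"{days_to_exhaustion(threshold):.1f} days")

# Hypothetical observation: 1.44% of requests failing against a 99.9% SLO.
print(f"observed burn rate: {burn_rate(0.0144, 0.999):.1f}")
```

Pairing a high burn rate with a short window catches sudden outages quickly, while a lower rate over a longer window catches slow leaks without paging on brief blips.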
## SLO Weekly Review - Week 4, January 2026
### Summary
| SLO | Target | Actual | Status |
|-----|--------|--------|--------|
| Availability | 99.9% | 99.85% | 🟡 |
| Latency P99 | <200ms | 187ms | 🟢 |
| Error Rate | <0.1% | 0.08% | 🟢 |
### Error Budget
- Consumed this week: 7 minutes
- Remaining this month: 10.2 minutes (24%)
- Projected end-of-month: 5 minutes (12%)
### Incidents
- INC-128: 7 min downtime (database failover)
### Actions
- [ ] Review INC-128 postmortem action items
- [ ] Consider pausing non-critical deploys
## SLO Quarterly Review - Q1 2026
### SLO Performance
| SLO | Target | Q1 Actual | Trend |
|-----|--------|-----------|-------|
| Availability | 99.9% | 99.92% | ↗️ |
| Latency | <200ms | 178ms | ↗️ |
### Error Budget Utilization
- January: 76% consumed
- February: 45% consumed
- March: 23% consumed
- Average: 48% consumed ✓
### Recommendations
1. Consider tightening availability SLO to 99.95%
2. Add latency SLO for P50 (currently unmeasured)
3. Review alerting thresholds based on budget consumption
| Level | Characteristics |
|-------|----------------|
| L1: Ad-hoc | No formal SLOs, react to incidents |
| L2: Defined | SLOs documented, basic monitoring |
| L3: Measured | SLIs tracked, dashboards exist |
| L4: Managed | Error budgets enforced, policies in place |
| L5: Optimized | SLOs drive prioritization, continuous improvement |