skills/slo-error-budget/SKILL.md
Define Service Level Objectives (SLOs) and an error budget policy for a service. Use when asked to write SLOs, define SLIs, calculate an error budget, set reliability targets, or create an error budget policy. Produces a complete SLO document with SLI definitions, target calculation, error budget policy, burn rate alerts, and review cadence.
npx skillsauth add mohitagw15856/pm-claude-skills slo-error-budget
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Produce a complete, implementable SLO document for a service — covering what to measure, what target to set, how to calculate the error budget, and what to do when it burns.
A good SLO is not a target to hit. It is an agreement about what reliability means for your users — and a framework for making principled trade-offs between reliability and velocity.
Ask for these if not already provided:
Always establish these before writing the SLO:
| Term | Definition |
|---|---|
| SLI (Service Level Indicator) | The metric being measured, e.g. "% of requests completing successfully in <500ms" |
| SLO (Service Level Objective) | The target for that metric, e.g. "99.5% of requests" |
| SLA (Service Level Agreement) | The contractual commitment to customers; must be looser than the SLO |
| Error budget | The allowed headroom below 100%: the budget for planned and unplanned downtime |
| Burn rate | How fast the error budget is being consumed |
Service: [Name] | Team: [Team name]
Owner: [Name / role] | Approved by: [Name]
Effective date: [Date] | Review date: [Date + 3 months]
Version: [1.0]
[2–3 sentences. What reliability problem are we solving? What was happening before this SLO that made us need it? What decision-making does this SLO enable?]
What this service does: [One sentence]
Who depends on it: [Internal teams / external customers / both; describe]
Critical user journeys protected by this SLO:
Define one SLI per user journey or reliability dimension. Keep it to 3–5 SLIs maximum.
| Field | Detail |
|---|---|
| What it measures | [e.g. "% of API requests that return a non-5xx response"] |
| Good event definition | [e.g. "HTTP response with status 2xx or 4xx, completed within 500ms"] |
| Bad event definition | [e.g. "HTTP response with status 5xx, or any response taking >500ms"] |
| Measurement source | [e.g. "Application load balancer access logs / Datadog APM / Prometheus"] |
| Measured over | Rolling 28-day window |
| Exclusions | [e.g. "Health check endpoints excluded / Requests during planned maintenance excluded"] |
| Field | Detail |
|---|---|
| What it measures | [e.g. "% of /checkout requests completing within 500ms (a 99% target is equivalent to P99 ≤ 500ms)"] |
| Good event definition | [e.g. "Request completes in ≤500ms"] |
| Bad event definition | [e.g. "Request takes >500ms"] |
| Measurement source | [Source] |
| Measured over | Rolling 28-day window |
| Exclusions | [Any exclusions] |
[Same structure]
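As a rough illustration of how the good/bad event definitions above turn into a number, the sketch below computes an availability-plus-latency SLI from a list of request records. The `Request` fields, the `/healthz` path, and the 500 ms threshold are assumptions taken from the template's examples, not a prescribed schema; a real service would read these from its measurement source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One observed request. Field names are hypothetical."""
    path: str
    status: int
    latency_ms: float


# Assumed values mirroring the template's example definitions.
LATENCY_THRESHOLD_MS = 500
EXCLUDED_PATHS = {"/healthz"}  # hypothetical health check endpoint


def is_good(req: Request) -> bool:
    """Good event: non-5xx response completed within the latency threshold."""
    return req.status < 500 and req.latency_ms <= LATENCY_THRESHOLD_MS


def sli_percent(requests: list[Request]) -> float:
    """Return good events as a percentage of valid (non-excluded) events."""
    valid = [r for r in requests if r.path not in EXCLUDED_PATHS]
    if not valid:
        return 100.0  # assumption: no traffic counts as meeting the objective
    good = sum(1 for r in valid if is_good(r))
    return 100.0 * good / len(valid)
```

Excluded requests are dropped from both the numerator and the denominator, so a failing health check neither helps nor hurts the SLI.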
| SLI | Target | Window | Error Budget |
|---|---|---|---|
| [SLI 1 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28 days] |
| [SLI 2 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28 days] |
| [SLI 3 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28 days] |
How targets were set:
Why 100% is NOT the target: [Brief explanation of why targeting 100% is counterproductive: it discourages feature development and doesn't reflect user reality]
For SLI 1 ([Name]), at [X]% target:
Error budget = (100% - SLO target) × measurement window
= (100% - [X]%) × 28 days × 24 hours × 60 minutes
= [Y]% × [Z total minutes]
= [N] minutes of allowed failure per 28-day window
In plain terms: We can afford [N] minutes of [bad events] in any rolling 28-day window before we breach the SLO.
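The calculation above can be checked with a few lines of Python. This is a minimal sketch that treats the budget as minutes of full unavailability in a 28-day window; the function name and the sample targets are illustrative, not part of the template.

```python
def error_budget_minutes(slo_target_percent: float, window_days: int = 28) -> float:
    """Minutes of allowed failure in the window for a given SLO target."""
    if not 0 < slo_target_percent < 100:
        raise ValueError("SLO target must be strictly between 0 and 100")
    total_minutes = window_days * 24 * 60  # 28 days = 40,320 minutes
    return (100.0 - slo_target_percent) / 100.0 * total_minutes


if __name__ == "__main__":
    for target in (99.0, 99.5, 99.9):
        print(f"{target}% -> {error_budget_minutes(target):.1f} minutes")
    # 99.0% -> 403.2 minutes
    # 99.5% -> 201.6 minutes
    # 99.9% -> 40.3 minutes
```

For request-based SLIs the same ratio applies to event counts rather than minutes: the budget is (100% - target) of the valid requests in the window.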
Burn rate = how fast the error budget is being consumed relative to the budget window. A burn rate of 1 = consuming the budget at exactly the rate that would exhaust it over 28 days.
| Alert | Burn rate | Window | Severity | Response |
|---|---|---|---|---|
| Page (critical) | >14× | 1 hour | P1 | Page on-call immediately; budget exhausted in <2 days |
| Page (high) | >6× | 6 hours | P2 | Page on-call; budget exhausted in <5 days |
| Ticket (warning) | >3× | 3 days | P3 | Create ticket; review at next team meeting |
| Info | >1× | 28 days | Info | Log only; budget on track to exhaust before the end of the window |
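To make the burn-rate arithmetic concrete, here is a small sketch built on the definition above (burn rate 1 exhausts the budget in exactly 28 days). The function names are hypothetical. At a sustained 14× a 28-day budget lasts 28 / 14 = 2 days, and at 6× it lasts about 4.7 days.

```python
def burn_rate(bad_fraction: float, slo_target_percent: float) -> float:
    """Observed bad-event fraction divided by the allowed bad-event fraction."""
    allowed = (100.0 - slo_target_percent) / 100.0
    return bad_fraction / allowed


def days_to_exhaustion(rate: float, window_days: int = 28) -> float:
    """Days until a full budget is spent at a sustained burn rate."""
    if rate <= 0:
        raise ValueError("burn rate must be positive")
    return window_days / rate


def budget_consumed_percent(rate: float, alert_window_hours: float,
                            window_days: int = 28) -> float:
    """Share of the whole budget spent if `rate` holds for the alert window."""
    return 100.0 * rate * alert_window_hours / (window_days * 24)


if __name__ == "__main__":
    # 7% of requests failing against a 99.5% target is roughly a 14x burn.
    print(round(burn_rate(0.07, 99.5), 1))              # 14.0
    print(days_to_exhaustion(14))                        # 2.0
    print(round(budget_consumed_percent(14, 1), 1))      # 2.1
```

Pairing each burn-rate threshold with the share of budget it consumes over its window is one way to sanity-check that severities are ordered sensibly.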
Alert implementation: [Link to alert config in monitoring tool — e.g. Datadog, Prometheus/Alertmanager, Grafana]
This policy defines what to do with the error budget — both when it's healthy and when it's burning.
SLO dashboard: [Link to Datadog / Grafana / etc. dashboard]
Metrics exposed:
Reporting cadence:
| Audience | Frequency | Format |
|---|---|---|
| Engineering team | Weekly | Slack summary in #[service]-slo |
| Engineering manager | Monthly | SLO review meeting |
| Stakeholders / customers | Quarterly | SLO compliance summary |
Planned maintenance: Error budget is not consumed during pre-announced maintenance windows. Maintenance must be communicated [X hours] in advance via [channel].
Dependency failures: If an SLO breach is caused by an upstream dependency outside our control, document it, but it still counts against our error budget (our users don't distinguish between our failures and our dependencies' failures).
Force majeure: [Policy for cloud provider outages, major infrastructure events]
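One possible way to apply the planned-maintenance exclusion is to drop bad minutes that fall inside pre-announced windows before counting them against the budget. The sketch below assumes bad events are recorded as per-minute timestamps and that maintenance windows are half-open intervals; both are illustrative choices. Dependency-caused failures are deliberately not filtered out, matching the policy above.

```python
from datetime import datetime

# A maintenance window as a half-open interval [start, end). Hypothetical shape.
Window = tuple[datetime, datetime]


def in_maintenance(ts: datetime, windows: list[Window]) -> bool:
    """True if the timestamp falls inside any pre-announced window."""
    return any(start <= ts < end for start, end in windows)


def budget_consumed_minutes(bad_minutes: list[datetime],
                            maintenance: list[Window]) -> int:
    """Count bad minutes that fall outside maintenance windows."""
    return sum(1 for ts in bad_minutes if not in_maintenance(ts, maintenance))
```

For example, with a window from 02:00 to 03:00, bad minutes at 02:10 and 02:30 are excluded while one at 14:00 still counts.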
| Review | When | Who | Output |
|---|---|---|---|
| Error budget review | Weekly | Team | Budget health check; adjust if burning fast |
| SLO target review | Quarterly | Team + EM | Adjust targets if baseline has shifted significantly |
| Annual SLO audit | Annually | Team + Stakeholders | Review SLIs: are we measuring the right things? |
When to change the SLO target: