plugins/pm-engineering/skills/slo-error-budget/SKILL.md
Define Service Level Objectives (SLOs) and an error budget policy for a service. Use when asked to write SLOs, define SLIs, calculate an error budget, set reliability targets, or create an error budget policy. Produces a complete SLO document with SLI definitions, target calculation, error budget policy, burn rate alerts, and review cadence.
Install this skill globally with one command: `npx skillsauth add mohitagw15856/pm-claude-skills slo-error-budget`. Works with Claude Code, Cursor, and Windsurf.
Produce a complete, implementable SLO document for a service — covering what to measure, what target to set, how to calculate the error budget, and what to do when it burns.
A good SLO is not a target to hit. It is an agreement about what reliability means for your users — and a framework for making principled trade-offs between reliability and velocity.
Ask for these if not already provided:
Always establish these before writing the SLO:
| Term | Definition |
|---|---|
| SLI (Service Level Indicator) | The metric being measured, e.g. "% of requests completing successfully in <500ms" |
| SLO (Service Level Objective) | The target for that metric, e.g. "99.5% of requests" |
| SLA (Service Level Agreement) | The contractual commitment to customers; must be looser than the SLO |
| Error budget | The allowed headroom below 100%: the budget for planned and unplanned downtime |
| Burn rate | How fast the error budget is being consumed |
Service: [Name] | Team: [Team name]
Owner: [Name / role] | Approved by: [Name]
Effective date: [Date] | Review date: [Date + 3 months]
Version: [1.0]
[2–3 sentences. What reliability problem are we solving? What was happening before this SLO that made us need it? What decision-making does this SLO enable?]
What this service does: [One sentence]
Who depends on it: [Internal teams / external customers / both; describe]
Critical user journeys protected by this SLO:
Define one SLI per user journey or reliability dimension. Keep it to 3–5 SLIs maximum.
| Field | Detail |
|---|---|
| What it measures | [e.g. "% of API requests that return a non-5xx response"] |
| Good event definition | [e.g. "HTTP response with status 2xx or 4xx, completed within 500ms"] |
| Bad event definition | [e.g. "HTTP response with status 5xx, or any response taking >500ms"] |
| Measurement source | [e.g. "Application load balancer access logs / Datadog APM / Prometheus"] |
| Measured over | Rolling 28-day window |
| Exclusions | [e.g. "Health check endpoints excluded / Requests during planned maintenance excluded"] |
| Field | Detail |
|---|---|
| What it measures | [e.g. "% of /checkout requests completing within 500ms (a 99% target is equivalent to P99 ≤ 500ms)"] |
| Good event definition | [e.g. "Request completes in ≤500ms"] |
| Bad event definition | [e.g. "Request takes >500ms"] |
| Measurement source | [Source] |
| Measured over | Rolling 28-day window |
| Exclusions | [Any exclusions] |
[Same structure]
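The good/bad event definitions above can be turned into a small classifier. The following is a minimal sketch, assuming request records with a path, status code, and latency; the `Request` type, the `/healthz` exclusion path, and the choice to count 3xx responses as good are illustrative assumptions, not part of the template.

```python
from dataclasses import dataclass

LATENCY_THRESHOLD_MS = 500       # threshold from the example definitions above
EXCLUDED_PATHS = {"/healthz"}    # hypothetical health check path (assumption)


@dataclass
class Request:
    """Hypothetical request record; real fields depend on your log source."""
    path: str
    status: int
    latency_ms: float


def is_good(req: Request) -> bool:
    """Good event: non-5xx response completed within the latency threshold.

    Assumption: 3xx responses count as good; the template only names 2xx/4xx.
    """
    return req.status < 500 and req.latency_ms <= LATENCY_THRESHOLD_MS


def sli_percent(requests: list[Request]) -> float:
    """Percentage of eligible (non-excluded) requests that are good events."""
    eligible = [r for r in requests if r.path not in EXCLUDED_PATHS]
    if not eligible:
        return 100.0  # no eligible traffic means no bad events
    good = sum(1 for r in eligible if is_good(r))
    return 100.0 * good / len(eligible)


sample = [
    Request("/checkout", 200, 120),   # good
    Request("/checkout", 404, 80),    # good: client error, fast
    Request("/checkout", 503, 50),    # bad: server error
    Request("/checkout", 200, 900),   # bad: too slow
    Request("/healthz", 500, 10),     # excluded from the SLI
]
print(sli_percent(sample))
```

Counting good events over eligible events keeps the SLI a simple ratio, which is what the target and error budget sections below operate on.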
| SLI | Target | Window | Error Budget |
|---|---|---|---|
| [SLI 1 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28 days] |
| [SLI 2 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28 days] |
| [SLI 3 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28 days] |
How targets were set:
Why 100% is NOT the target: [Brief explanation of why targeting 100% is counterproductive: it discourages feature development and doesn't reflect user reality]
For SLI 1 ([Name]), at [X]% target:
Error budget = (100% - SLO target) × measurement window
= (100% - [X]%) × 28 days × 24 hours × 60 minutes
= [Y]% × [Z total minutes]
= [N] minutes of allowed failure per 28-day window
In plain terms: We can afford [N] minutes of [bad events] in any rolling 28-day window before we breach the SLO.
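The calculation above can be checked with a few lines of code. This is a minimal sketch of the same time-based formula; the function name and the 99.5% example target are illustrative assumptions.

```python
def error_budget_minutes(slo_target_percent: float, window_days: int = 28) -> float:
    """Allowed minutes of failure in the window for a given SLO target.

    Implements: (100% - SLO target) x window_days x 24 hours x 60 minutes.
    """
    if not 0 < slo_target_percent < 100:
        raise ValueError("SLO target must be between 0 and 100, exclusive")
    total_minutes = window_days * 24 * 60  # 40,320 minutes in 28 days
    return (100.0 - slo_target_percent) / 100.0 * total_minutes


# Example: a 99.5% target over a rolling 28-day window.
print(f"{error_budget_minutes(99.5):.1f}")  # 201.6
```

For a 99.5% target the budget is about 201.6 minutes, or roughly 3 hours 22 minutes, per 28-day window.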
Burn rate = how fast the error budget is being consumed relative to the budget window. A burn rate of 1 = consuming the budget at exactly the rate that would exhaust it over 28 days.
| Alert | Burn rate | Window | Severity | Response |
|---|---|---|---|---|
| Page (critical) | >14× | 1 hour | P1 | Page on-call immediately; budget exhausted in <2 days |
| Page (high) | >6× | 6 hours | P2 | Page on-call; budget exhausted in <5 days |
| Ticket (warning) | >3× | 3 days | P3 | Create ticket; review at next team meeting |
| Info | >1× | 28 days | Info | Log only; budget on track to exhaust by end of window |
Alert implementation: [Link to alert config in monitoring tool — e.g. Datadog, Prometheus/Alertmanager, Grafana]
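Burn rate and time to exhaustion follow directly from the definition above: the observed error rate divided by the error rate the SLO allows. The sketch below assumes event counts taken from an alert window; the function names and sample numbers are illustrative, not tied to any monitoring tool.

```python
def burn_rate(bad_events: int, total_events: int, slo_target_percent: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total_events == 0:
        return 0.0  # no traffic in the window: nothing burned
    observed = bad_events / total_events
    allowed = (100.0 - slo_target_percent) / 100.0
    return observed / allowed


def days_to_exhaustion(rate: float, window_days: int = 28) -> float:
    """Days until a full budget is spent if this burn rate is sustained."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


def budget_consumed_percent(rate: float, alert_window_hours: float,
                            window_days: int = 28) -> float:
    """Share of the whole budget spent by burning at `rate` for the alert window."""
    return 100.0 * rate * alert_window_hours / (window_days * 24)


# Example: 99.5% SLO, 700 bad requests out of 10,000 in the last hour.
rate = burn_rate(700, 10_000, 99.5)
print(f"burn rate: {rate:.1f}x")
print(f"days to exhaustion: {days_to_exhaustion(rate):.1f}")
print(f"budget used in 1 hour: {budget_consumed_percent(rate, 1):.2f}%")
```

In this example the burn rate is 14×, so a full 28-day budget would last 2 days, and one hour at that rate spends about 2% of the budget.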
This policy defines what to do with the error budget — both when it's healthy and when it's burning.
SLO dashboard: [Link to Datadog / Grafana / etc. dashboard]
Metrics exposed:
Reporting cadence:
| Audience | Frequency | Format |
|---|---|---|
| Engineering team | Weekly | Slack summary in #[service]-slo |
| Engineering manager | Monthly | SLO review meeting |
| Stakeholders / customers | Quarterly | SLO compliance summary |
Planned maintenance: Error budget is not consumed during pre-announced maintenance windows. Maintenance must be communicated [X hours] in advance via [channel].
Dependency failures: If SLO breach is caused by an upstream dependency outside our control, document it — but it still counts against our error budget (our users don't distinguish between our failures and our dependencies' failures).
Force majeure: [Policy for cloud provider outages, major infrastructure events]
| Review | When | Who | Output |
|---|---|---|---|
| Error budget review | Weekly | Team | Budget health check; adjust if burning fast |
| SLO target review | Quarterly | Team + EM | Adjust targets if baseline has shifted significantly |
| Annual SLO audit | Annually | Team + Stakeholders | Review SLIs: are we measuring the right things? |
When to change the SLO target: