mohitagw15856/slo-error-budget

Name: slo-error-budget
Author: mohitagw15856

skills/slo-error-budget/SKILL.md

npx skillsauth add mohitagw15856/pm-claude-skills slo-error-budget

SLO and Error Budget Skill

Produce a complete, implementable SLO document for a service — covering what to measure, what target to set, how to calculate the error budget, and what to do when it burns.

A good SLO is not a target to hit. It is an agreement about what reliability means for your users — and a framework for making principled trade-offs between reliability and velocity.

Required Inputs

Ask for these if not already provided:

Service name and brief description of what it does
Primary users — who depends on this service and how
User-facing interactions to protect — e.g. API calls, page loads, transactions
Current reliability data — error rate, latency, uptime (last 30–90 days if available)
Existing on-call setup — who responds to alerts?
Deployment frequency — how often does the team ship?
Any existing SLAs with customers — these constrain SLO targets

Key Definitions

Always establish these before writing the SLO:

| Term | Definition |
|---|---|
| SLI (Service Level Indicator) | The metric being measured, e.g. "% of requests completing successfully in <500ms" |
| SLO (Service Level Objective) | The target for that metric, e.g. "99.5% of requests" |
| SLA (Service Level Agreement) | The contractual commitment to customers; must be looser than the SLO |
| Error budget | The allowed headroom below 100%: the budget for planned and unplanned downtime |
| Burn rate | How fast the error budget is being consumed |

Output Format

SLO Document: [Service Name]

Service: [Name] | Team: [Team name]
Owner: [Name / role] | Approved by: [Name]
Effective date: [Date] | Review date: [Date + 3 months]
Version: [1.0]

Why This SLO Exists

[2–3 sentences. What reliability problem are we solving? What was happening before this SLO that made us need it? What decision-making does this SLO enable?]

Service Overview

What this service does: [One sentence]
Who depends on it: [Internal teams / external customers / both; describe]
Critical user journeys protected by this SLO:

[Journey 1 — e.g. "User completes a payment"]
[Journey 2]
[Journey 3]

SLIs — What We Measure

Define one SLI per user journey or reliability dimension. Keep it to 3–5 SLIs maximum.

SLI 1: [Name — e.g. Request Success Rate]

| Field | Detail |
|---|---|
| What it measures | [e.g. "% of API requests that return a non-5xx response"] |
| Good event definition | [e.g. "HTTP response with status 2xx or 4xx, completed within 500ms"] |
| Bad event definition | [e.g. "HTTP response with status 5xx, or any response taking >500ms"] |
| Measurement source | [e.g. "Application load balancer access logs / Datadog APM / Prometheus"] |
| Measured over | Rolling 28-day window |
| Exclusions | [e.g. "Health check endpoints excluded / Requests during planned maintenance excluded"] |
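As a rough illustration of the good/bad event split in the example rows above, the following sketch computes a request success rate from a list of requests. The `Request` type, the `/healthz` path, and the decision to treat 3xx responses as good (the example rows do not mention them) are assumptions made for this sketch, not part of the skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time taken to complete the request
    path: str          # request path, used to apply exclusions


# Values taken from the example rows; the health check path is hypothetical.
LATENCY_THRESHOLD_MS = 500
EXCLUDED_PATHS = {"/healthz"}


def is_bad(req: Request) -> bool:
    """Bad event: a 5xx response, or any response slower than the threshold."""
    return req.status >= 500 or req.latency_ms > LATENCY_THRESHOLD_MS


def success_rate(requests: list[Request]) -> float:
    """Return the SLI as the percentage of good events among counted requests."""
    counted = [r for r in requests if r.path not in EXCLUDED_PATHS]
    if not counted:
        return 100.0  # no counted traffic means no bad events
    good = sum(1 for r in counted if not is_bad(r))
    return 100.0 * good / len(counted)


sample = [
    Request(200, 120, "/api/orders"),  # good
    Request(404, 80, "/api/orders"),   # good: client error, fast
    Request(503, 50, "/api/orders"),   # bad: server error
    Request(200, 900, "/api/orders"),  # bad: too slow
    Request(500, 10, "/healthz"),      # excluded from the SLI
]
print(success_rate(sample))  # 50.0
```

Writing the bad-event rule as a single predicate keeps the good and bad definitions complementary, so no request can fall into neither category.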

SLI 2: [Name — e.g. Latency]

| Field | Detail |
|---|---|
| What it measures | [e.g. "P99 response time for the /checkout endpoint"] |
| Good event definition | [e.g. "Request completes in ≤500ms at P99"] |
| Bad event definition | [e.g. "Request takes >500ms at P99"] |
| Measurement source | [Source] |
| Measured over | Rolling 28-day window |
| Exclusions | [Any exclusions] |
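One way to read the latency example is that "P99 ≤ 500ms" and "at least 99% of requests complete within 500ms" describe the same condition. The sketch below shows both views on a list of latencies. It uses the nearest-rank percentile method, which is an assumption here; monitoring tools may interpolate differently and report slightly different values.

```python
import math


def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Return the pct-th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def fraction_within(values: list[float], threshold_ms: float) -> float:
    """Return the fraction of requests that completed within the threshold."""
    if not values:
        raise ValueError("need at least one value")
    return sum(1 for v in values if v <= threshold_ms) / len(values)


# Hypothetical latencies: 1 ms, 2 ms, ..., 100 ms.
latencies = [float(ms) for ms in range(1, 101)]
p99 = percentile_nearest_rank(latencies, 99)
print(p99)                               # 99.0
print(fraction_within(latencies, p99))   # 0.99
```

Counting events under a threshold is usually easier to turn into an error budget than a raw percentile, because each request is simply good or bad.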

SLI 3: [Name — e.g. Data Freshness / Queue Depth / etc.]

[Same structure]

SLO Targets

| SLI | Target | Window | Error Budget |
|---|---|---|---|
| [SLI 1 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28-day window] |
| [SLI 2 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28-day window] |
| [SLI 3 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28-day window] |

How targets were set:

Historical baseline (last 90 days): [X]%
Target is set [above / at] historical baseline to [improve reliability / reflect current reality while formalising the commitment]
Rationale: [1–2 sentences]

Why 100% is NOT the target: [Brief explanation of why targeting 100% is counterproductive: it discourages feature development and doesn't reflect user reality]

Error Budget Calculation

For SLI 1 ([Name]), at [X]% target:

Error budget = (100% - SLO target) × measurement window
             = (100% - [X]%) × 28 days × 24 hours × 60 minutes
             = [Y]% × [Z total minutes]
             = [N] minutes of allowed failure per 28-day window

In plain terms: We can afford [N] minutes of [bad events] in any rolling 28-day window before we breach the SLO.
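The formula above can be checked with a few lines of code. This is a minimal sketch; the function name and the 99.5% and 99.9% targets are illustrative choices, not values prescribed by the skill.

```python
def error_budget_minutes(slo_target_pct: float, window_days: int = 28) -> float:
    """Return the minutes of allowed failure in the measurement window."""
    if not 0 < slo_target_pct < 100:
        raise ValueError("SLO target must be strictly between 0 and 100")
    window_minutes = window_days * 24 * 60  # 28 days = 40,320 minutes
    return (100 - slo_target_pct) / 100 * window_minutes


for target in (99.5, 99.9):
    print(f"{target}% -> {error_budget_minutes(target):.1f} minutes")
# 99.5% -> 201.6 minutes
# 99.9% -> 40.3 minutes
```

Moving from 99.5% to 99.9% cuts the budget to a fifth, which is why each extra "nine" in the target deserves a deliberate decision.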

Burn Rate Alerts

Burn rate = how fast the error budget is being consumed relative to the budget window. A burn rate of 1 = consuming the budget at exactly the rate that would exhaust it over 28 days.

| Alert | Burn rate | Window | Severity | Response |
|---|---|---|---|---|
| Page (critical) | >14× | 1 hour | P1 | Page on-call immediately; at this rate the budget is exhausted in about 2 days |
| Page (high) | >6× | 6 hours | P2 | Page on-call; budget exhausted in <5 days |
| Ticket (warning) | >3× | 3 days | P3 | Create ticket; review at next team meeting |
| Info | >1× | 28 days | Info | Log only; budget on track to exhaust by end of window |

Alert implementation: [Link to alert config in monitoring tool — e.g. Datadog, Prometheus/Alertmanager, Grafana]
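The relationship between burn rate and time to exhaustion can be sketched as follows. Burn rate is computed here as the observed bad-event ratio divided by the ratio the SLO allows, which is a common formulation but an assumption on top of the wording above. The rule list mirrors the thresholds in the table; the function names and the dictionary keyed by window label are hypothetical.

```python
WINDOW_DAYS = 28

# (alert name, burn-rate threshold, lookback window label) from the table.
ALERT_RULES = [
    ("Page (critical)", 14, "1 hour"),
    ("Page (high)", 6, "6 hours"),
    ("Ticket (warning)", 3, "3 days"),
    ("Info", 1, "28 days"),
]


def burn_rate(bad_events: int, total_events: int, slo_target_pct: float) -> float:
    """Observed bad-event ratio divided by the ratio the SLO allows."""
    if total_events == 0:
        return 0.0
    allowed_ratio = (100 - slo_target_pct) / 100
    return (bad_events / total_events) / allowed_ratio


def days_to_exhaustion(rate: float, window_days: int = WINDOW_DAYS) -> float:
    """Days until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


def fired_alerts(rates_by_window: dict[str, float]) -> list[str]:
    """Return the alerts whose lookback window exceeds its threshold."""
    return [
        name
        for name, threshold, window in ALERT_RULES
        if rates_by_window.get(window, 0.0) > threshold
    ]


print(days_to_exhaustion(14))  # 2.0
print(fired_alerts({"1 hour": 15.0, "6 hours": 7.0, "3 days": 2.0, "28 days": 0.5}))
# ['Page (critical)', 'Page (high)']
```

A sustained burn rate of 14 spends a 28-day budget in 2 days, and a rate of 6 in a little under 5 days, so the short windows exist to catch fast burns long before the budget is gone.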

Error Budget Policy

This policy defines what to do with the error budget — both when it's healthy and when it's burning.

When budget is healthy (>50% remaining)

Feature development and deployments proceed at normal pace
The team may take on riskier experiments
Reliability improvements are scheduled but not urgent

When budget is at risk (25–50% remaining)

Deployment frequency reduced — team ships only well-tested changes
One reliability improvement added to current sprint
Weekly error budget review added to team standup

When budget is nearly exhausted (<25% remaining)

Feature work paused in favour of reliability improvements
No new deployments without explicit on-call approval
Daily review of error budget burn rate
CSM / support notified to manage customer expectations

When budget is exhausted (0% remaining — SLO breached)

All feature work stops
On-call engineer and engineering manager notified immediately
Post-incident review (PIR) required within 5 business days
SLO target may be temporarily relaxed (with stakeholder approval) while root cause is addressed
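The four policy tiers above can be expressed as a small lookup so that dashboards and deploy tooling agree on the current state. This is a sketch under assumptions: the tier labels and function names are illustrative, and the boundary handling (exactly 25% and exactly 50% both count as "at risk") is one reading of the ranges.

```python
def budget_remaining_pct(consumed_minutes: float, budget_minutes: float) -> float:
    """Return the percentage of error budget still unspent (may be negative)."""
    if budget_minutes <= 0:
        raise ValueError("budget must be positive")
    return 100.0 * (1 - consumed_minutes / budget_minutes)


def budget_policy_tier(remaining_pct: float) -> str:
    """Map remaining budget to one of the four policy tiers."""
    if remaining_pct <= 0:
        return "exhausted"         # SLO breached: all feature work stops
    if remaining_pct < 25:
        return "nearly exhausted"  # feature work paused, deploys need approval
    if remaining_pct <= 50:
        return "at risk"           # reduced deploy frequency
    return "healthy"               # normal pace


remaining = budget_remaining_pct(consumed_minutes=50, budget_minutes=200)
print(remaining, budget_policy_tier(remaining))  # 75.0 healthy
```

Keeping the thresholds in one function avoids the "discuss as a team" ambiguity: the same input always yields the same tier.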

Dashboard and Reporting

SLO dashboard: [Link to Datadog / Grafana / etc. dashboard]

Metrics exposed:

Current SLO compliance (rolling 28-day)
Error budget remaining (% and minutes)
Burn rate (current and trend)
Incident count and MTTR this window

Reporting cadence:

| Audience | Frequency | Format |
|---|---|---|
| Engineering team | Weekly | Slack summary in #[service]-slo |
| Engineering manager | Monthly | SLO review meeting |
| Stakeholders / customers | Quarterly | SLO compliance summary |

Exclusions and Edge Cases

Planned maintenance: Error budget is not consumed during pre-announced maintenance windows. Maintenance must be communicated [X hours] in advance via [channel].

Dependency failures: If SLO breach is caused by an upstream dependency outside our control, document it — but it still counts against our error budget (our users don't distinguish between our failures and our dependencies' failures).
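The two rules above treat bad events differently: those inside a pre-announced maintenance window are dropped, while those caused by a dependency still count. A minimal sketch of that filter follows, assuming bad events are available as timestamps and maintenance windows as half-open `(start, end)` pairs; both representations and the function names are hypothetical.

```python
from datetime import datetime

Window = tuple[datetime, datetime]  # half-open interval: start <= t < end


def in_any_window(ts: datetime, windows: list[Window]) -> bool:
    """Return True if the timestamp falls inside any maintenance window."""
    return any(start <= ts < end for start, end in windows)


def budget_consuming_events(
    bad_event_times: list[datetime], maintenance_windows: list[Window]
) -> list[datetime]:
    """Keep bad events that count against the budget.

    Events inside a maintenance window are excluded. Nothing is filtered
    by cause, so dependency failures remain in the result.
    """
    return [ts for ts in bad_event_times if not in_any_window(ts, maintenance_windows)]


maintenance = [(datetime(2026, 1, 10, 2, 0), datetime(2026, 1, 10, 4, 0))]
bad_events = [
    datetime(2026, 1, 10, 1, 59),  # before the window: counts
    datetime(2026, 1, 10, 3, 0),   # inside the window: excluded
    datetime(2026, 1, 10, 4, 0),   # window has ended: counts
]
print(len(budget_consuming_events(bad_events, maintenance)))  # 2
```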

Force majeure: [Policy for cloud provider outages, major infrastructure events]

SLO Review Cadence

| Review | When | Who | Output |
|---|---|---|---|
| Error budget review | Weekly | Team | Budget health check; adjust if burning fast |
| SLO target review | Quarterly | Team + EM | Adjust targets if baseline has shifted significantly |
| Annual SLO audit | Annually | Team + Stakeholders | Review SLIs: are we measuring the right things? |

When to change the SLO target:

Historical baseline has improved significantly and target no longer reflects real reliability
User feedback indicates the target is misaligned with what users actually experience
The SLO is being gamed (metric is healthy but users are unhappy)

Quality Checks

[ ] SLIs are user-facing — they measure what users experience, not internal system metrics
[ ] Good and bad events are precisely defined — no ambiguity about what counts
[ ] Targets are based on historical data, not aspirational round numbers
[ ] Error budget policy has clear triggers and clear actions — not "discuss as a team"
[ ] Burn rate alerts have different windows to catch both fast burns and slow burns
[ ] Exclusions are documented so they don't silently inflate the SLO number

Anti-Patterns

[ ] Do not set SLO targets at 100% — this discourages feature development and does not reflect how users experience reliability
[ ] Do not measure internal system metrics as SLIs — SLIs must reflect what users directly experience, not internal CPU or memory
[ ] Do not write an error budget policy with vague triggers — "discuss as a team" is not an actionable policy; triggers must be specific percentages
[ ] Do not base targets on aspirational round numbers — always derive from historical baseline data
[ ] Do not configure only one burn-rate alert window: a short window alone misses slow burns that exhaust the budget quietly, and a long window alone reacts too late to fast burns

Define Service Level Objectives (SLOs) and an error budget policy for a service. Use when asked to write SLOs, define SLIs, calculate an error budget, set reliability targets, or create an error budget policy. Produces a complete SLO document with SLI definitions, target calculation, error budget policy, burn rate alerts, and review cadence.

979 stars

documentation

Updated Jun 18, 2026

npx skillsauth add mohitagw15856/pm-claude-skills slo-error-budget

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean (95%)
Semgrep (static code analysis for vulnerabilities): Clean (95%)
mcp-scan (Snyk) (Model Context Protocol security validation): Clean (95%)
Snyk (dep) (open source security scanning): Skipped (50%)
Socket.dev (supply chain security analysis): Skipped (50%)
VirusTotal (multi-engine malware detection): Skipped (50%)
CrowdStrike (advanced threat intelligence): Skipped (50%)
OSV-Scanner (Open Source Vulnerability database check): Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: Jun 9, 2026, 7:27 AM. Duration: 106.7s. 1 file scanned.

Install manually

# Clone the repo
git clone https://github.com/mohitagw15856/pm-claude-skills.git

# Copy into Claude Code skills folder (global)
cp -r pm-claude-skills/skills/slo-error-budget ~/.claude/skills/

Repository

mohitagw15856/pm-claude-skills

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT