alirezarezvani/slo-architect

Name: slo-architect
Author: alirezarezvani

engineering/slo-architect/skills/slo-architect/SKILL.md

npx skillsauth add alirezarezvani/claude-skills slo-architect


SLO Architect

Define SLOs that mean something. Most "SLOs" in the wild are arbitrary numbers no one believes — 99.9% on every endpoint, no SLI definition, no error budget, no policy for what happens when budget burns. This skill enforces the discipline from Google's SRE Workbook: pick the right SLI, set a target users actually care about, calculate the error budget, wire multi-window burn-rate alerts, and have a written policy for when budget runs out.

When to use

Defining a new SLO for a service or feature
Reviewing existing SLOs for common bugs
Picking the right SLI (event-based vs time-window based vs request-based)
Computing error budgets and burn-rate alert thresholds
Tying SLOs to existing controls — feature-flag abort criteria, chaos blast radius, operator capability levels

When NOT to use

General observability strategy (metrics + logs + traces) → use observability-designer
Customer-facing SLAs with legal teeth → that's contract drafting, not engineering
Performance load testing (capacity, not reliability) → use performance-profiler
Active incident response → use incident-response

Core principle: an SLO is a promise about user experience

SLI  ⟶  measurable signal of user-perceived health (e.g., HTTP 2xx rate)
SLO  ⟶  target for the SLI over a window (e.g., 99.9% over 30 days)
SLA  ⟶  customer-facing commitment with consequences (separate concern)
EB   ⟶  error budget: 100% − SLO target = how much "bad" you can spend
BR   ⟶  burn rate: how fast you're consuming the error budget
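The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's own code: the function names are hypothetical, and the burn rate is expressed as the observed error ratio divided by the error budget, so that 1.0 means the budget lasts exactly one window.

```python
def sli(good_events: int, total_events: int) -> float:
    """SLI as the fraction of good events (1.0 when there was no traffic)."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget(slo_target: float) -> float:
    """Error budget as a fraction: 1 - SLO target (0.999 -> about 0.001)."""
    return 1.0 - slo_target


def burn_rate(good_events: int, total_events: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget.

    A value of 1.0 spends the budget exactly as fast as the SLO allows;
    higher values exhaust it before the window ends.
    """
    observed_error_ratio = 1.0 - sli(good_events, total_events)
    return observed_error_ratio / error_budget(slo_target)


if __name__ == "__main__":
    # Hypothetical sample: 10,000 requests, 30 failures, 99.9% target.
    print(f"SLI:       {sli(9_970, 10_000):.4f}")
    print(f"Budget:    {error_budget(0.999):.4f}")
    print(f"Burn rate: {burn_rate(9_970, 10_000, 0.999):.1f}")
```

With 30 failures in 10,000 requests the error ratio is 0.3%, three times the 0.1% budget, so the burn rate is about 3.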

The four cardinal mistakes:

Target too high (99.99%+ on services that can't support it) — every minor blip violates SLO; alerts become noise.
Wrong SLI (CPU usage as proxy for user experience) — system can be "green" while users suffer.
No error budget policy — burning budget means nothing if there's no agreed action.
Single-window burn-rate alert — either too noisy (page on a 5-min spike) or too slow (notice budget exhausted after the fact).

The 3 tools below catch each of these.

Quick start

SKILL=engineering/slo-architect/skills/slo-architect

# 1. Design an SLO
python "$SKILL/scripts/slo_designer.py" \
  --service checkout-svc \
  --sli-type request-success-rate \
  --target 99.9 \
  --window-days 30

# 2. Compute error budget + multi-window burn-rate alerts
python "$SKILL/scripts/error_budget_calculator.py" \
  --target 99.9 --window-days 30

# 3. Review existing SLO definitions for common bugs
python "$SKILL/scripts/slo_review.py" --slo-doc docs/slos/

The 3 Python tools

All stdlib-only.

`slo_designer.py`

Generates a structured SLO definition with required fields. Refuses to render if any required field is missing (exit 1).

python scripts/slo_designer.py \
  --service checkout-svc \
  --sli-type request-success-rate \
  --target 99.9 \
  --window-days 30 \
  --owner team-checkout

SLI types supported:

request-success-rate — (total_requests - bad_requests) / total_requests
request-latency — count(requests with latency < threshold) / total_requests
availability-time — (window - downtime) / window
data-freshness — count(data_age < threshold) / total_data_points
correctness — count(correct_outputs) / total_outputs
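As a rough illustration of the five ratios listed above, each SLI type reduces to "good events over total events". The sketch below is an assumption about how one might compute them from raw samples; the function and parameter names are hypothetical and are not taken from slo_designer.py.

```python
from typing import Sequence


def _ratio(good: float, total: float) -> float:
    """Return good / total, rejecting an empty denominator."""
    if total <= 0:
        raise ValueError("total must be positive")
    return good / total


def request_success_rate(total_requests: int, bad_requests: int) -> float:
    return _ratio(total_requests - bad_requests, total_requests)


def request_latency(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    # Count individual requests faster than the threshold, not a percentile.
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return _ratio(fast, len(latencies_ms))


def availability_time(window_minutes: float, downtime_minutes: float) -> float:
    return _ratio(window_minutes - downtime_minutes, window_minutes)


def data_freshness(data_ages_s: Sequence[float], threshold_s: float) -> float:
    fresh = sum(1 for age in data_ages_s if age < threshold_s)
    return _ratio(fresh, len(data_ages_s))


def correctness(correct_outputs: int, total_outputs: int) -> float:
    return _ratio(correct_outputs, total_outputs)


if __name__ == "__main__":
    # Hypothetical samples: three of four requests beat a 200 ms threshold.
    print(request_latency([120, 80, 450, 95], threshold_ms=200))
```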

Output is markdown by default with all required fields filled or marked <must define>. JSON output (--format json) is consumed by slo_review.py.

`error_budget_calculator.py`

Given target availability + window, computes:

Allowed downtime in the window
Multi-window burn-rate thresholds per Google SRE Workbook (Chapter 5):
- Fast burn — page if 2% of monthly budget consumed in 1 hour
- Slow burn — page if 5% consumed in 6 hours, ticket if 10% in 3 days
Recommended alerting rules (PromQL-shaped output)
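To make the page-versus-ticket split concrete, here is a small sketch of how observed burn rates could be mapped to an action. It assumes the SRE Workbook's thresholds for a 30-day window (burn rates of 14.4 over 1 hour, 6 over 6 hours, and 1 over 3 days); the function name and return values are hypothetical and do not describe the calculator's actual output.

```python
from typing import Optional

# Assumed thresholds for a 30-day window, taken from the SRE Workbook's
# pairs of (budget consumed, window): 2% in 1 h, 5% in 6 h, 10% in 3 days.
PAGE_BURN_1H = 14.4
PAGE_BURN_6H = 6.0
TICKET_BURN_3D = 1.0


def classify_burn(burn_1h: float, burn_6h: float, burn_3d: float) -> Optional[str]:
    """Return "page", "ticket", or None for the given observed burn rates."""
    if burn_1h >= PAGE_BURN_1H or burn_6h >= PAGE_BURN_6H:
        return "page"
    if burn_3d >= TICKET_BURN_3D:
        return "ticket"
    return None


if __name__ == "__main__":
    print(classify_burn(burn_1h=20.0, burn_6h=4.0, burn_3d=0.8))  # fast burn
    print(classify_burn(burn_1h=2.0, burn_6h=1.5, burn_3d=1.2))   # slow burn
    print(classify_burn(burn_1h=0.5, burn_6h=0.5, burn_3d=0.5))   # healthy
```

Checking several windows at once is what keeps a short spike from being the only paging signal while still catching a slow leak.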

python scripts/error_budget_calculator.py --target 99.9 --window-days 30
python scripts/error_budget_calculator.py --target 99.95 --window-days 7 --format json

`slo_review.py`

Audits a directory of SLO definitions (markdown or JSON) for the common bugs.

python scripts/slo_review.py --slo-doc docs/slos/

Checks:

target_too_high: target ≥ 99.99% (sustainable only with massive engineering investment)
target_too_low: target ≤ 99.0% (probably wrong SLI; users will notice)
window_too_short: window < 7 days (statistical noise dominates)
window_too_long: window > 90 days (slow feedback)
no_sli_definition: SLI section missing or vague ("everything OK")
no_error_budget_policy: no documented action when budget burns
cpu_as_sli: CPU/memory used as user-experience proxy (wrong signal)
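The checks above could be approximated as follows. This is only a sketch under stated assumptions: the dictionary keys (`target`, `window_days`, `sli`, `error_budget_policy`) are hypothetical, the real slo_review.py may parse its input differently, and detecting a "vague" SLI is reduced here to detecting an empty one.

```python
from typing import Any, Dict, List


def review_slo(slo: Dict[str, Any]) -> List[str]:
    """Return the names of the checks that a single SLO definition fails."""
    findings: List[str] = []

    target = slo.get("target")  # percent, e.g. 99.9
    if target is not None:
        if target >= 99.99:
            findings.append("target_too_high")
        if target <= 99.0:
            findings.append("target_too_low")

    window_days = slo.get("window_days")
    if window_days is not None:
        if window_days < 7:
            findings.append("window_too_short")
        if window_days > 90:
            findings.append("window_too_long")

    sli_text = (slo.get("sli") or "").strip().lower()
    if not sli_text:
        findings.append("no_sli_definition")
    elif any(word in sli_text for word in ("cpu", "memory")):
        findings.append("cpu_as_sli")

    if not (slo.get("error_budget_policy") or "").strip():
        findings.append("no_error_budget_policy")

    return findings


if __name__ == "__main__":
    bad = {"target": 99.99, "window_days": 3, "sli": "CPU usage below 80%"}
    print(review_slo(bad))
```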

SLI selection cheatsheet

| User experience | SLI type | What you measure |
|---|---|---|
| "Did the request succeed?" | request-success-rate | 2xx / total |
| "Was the response fast?" | request-latency | count(requests with latency < threshold) / total |
| "Was the service up?" | availability-time | (window - downtime) / window |
| "Is the data current?" | data-freshness | count(data_age < threshold) / total |
| "Was the answer correct?" | correctness | count(correct) / total |

See references/sli_design.md for examples and anti-patterns.

Error budget math (the basics)

For 99.9% SLO over 30 days:

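Allowed unavailability: 0.1% × 30 × 24 × 60 = 43.2 minutes
1-hour fast-burn threshold (2% of the 30-day budget burned in 1 hour): 2% × 43.2 / 60 ≈ 0.0144, a 1.44% error ratio, which is a burn rate of 14.4 against the 0.1% budget
6-hour slow-burn threshold (5% in 6h, per the SRE Workbook): 5% × 43.2 / 360 ≈ 0.006, a 0.6% error ratio, which is a burn rate of 6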

error_budget_calculator.py does this math for you and emits ready-to-paste alert rules.
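For readers who want to check the arithmetic by hand, the sketch below reproduces it. It is not the calculator's implementation; the function names are hypothetical. The burn rate is the multiple of the budget-neutral error rate that consumes a given fraction of the budget within the alert window.

```python
def allowed_downtime_minutes(target_percent: float, window_days: int) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    return (100.0 - target_percent) / 100.0 * window_days * 24 * 60


def burn_rate_threshold(
    budget_fraction: float, window_days: int, alert_window_hours: float
) -> float:
    """Burn rate that spends `budget_fraction` of the budget in the alert window."""
    return budget_fraction * (window_days * 24) / alert_window_hours


if __name__ == "__main__":
    print(f"{allowed_downtime_minutes(99.9, 30):.1f} minutes")   # 43.2 minutes
    print(f"{burn_rate_threshold(0.02, 30, 1):.1f}")             # 14.4
    print(f"{burn_rate_threshold(0.05, 30, 6):.1f}")             # 6.0
    print(f"{burn_rate_threshold(0.10, 30, 72):.1f}")            # 1.0
```

Multiplying a burn rate by the error budget gives the error-ratio threshold to alert on, for example 14.4 × 0.1% = 1.44% over one hour.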

Composition with the rest of the portfolio

This skill explicitly composes with three others:

| Skill | Composition |
|---|---|
| feature-flags-architect | Rollout abort criteria reference SLO burn-rate thresholds |
| chaos-engineering | Blast-radius calculator already takes monthly error budget as input — define it here |
| kubernetes-operator | Operator capability L4 (Deep Insights) requires SLOs + Prometheus rules |

The error_budget_calculator.py output is in the same shape engineering/skills/chaos-engineering/scripts/blast_radius_calculator.py expects on stdin.

Workflows

Workflow 1: Define a new SLO

1. Pick the user journey to protect (e.g., "checkout completion").
2. Choose SLI type (request-success-rate, latency, availability, freshness, correctness).
3. Define the SLI precisely: numerator/denominator with concrete labels.
4. Pick a target by measuring 30 days of historical SLI value:
     target = floor(p50 of last 30 days × 100) / 100
   This avoids targets the system has never sustained.
5. Pick a window (28 days = 4 calendar weeks, recommended).
6. Run slo_designer.py to render the SLO definition.
7. Run error_budget_calculator.py to get burn-rate alerts.
8. Write the error budget policy (what happens when budget burns).
9. Run slo_review.py — must pass before the SLO is "live".
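Step 4 of this workflow can be sketched as below, assuming the historical data is one SLI value per day expressed in percent (an assumption; the workflow does not say how the 30 days are sampled). `Decimal` is used so that flooring to two decimal places is not disturbed by binary floating-point rounding. The function name is hypothetical.

```python
from decimal import ROUND_FLOOR, Decimal
from statistics import median
from typing import Sequence


def suggest_target(daily_sli_percent: Sequence[float]) -> Decimal:
    """Median of the daily SLI values, rounded down to two decimal places."""
    if not daily_sli_percent:
        raise ValueError("need at least one daily SLI value")
    p50 = Decimal(str(median(daily_sli_percent)))
    return p50.quantize(Decimal("0.01"), rounding=ROUND_FLOOR)


if __name__ == "__main__":
    # Hypothetical history; a real run would use about 30 daily values.
    history = [99.95, 99.91, 99.937, 99.99, 99.90]
    print(suggest_target(history))  # 99.93
```

Rounding down rather than to the nearest value keeps the suggested target at or below what the median day actually delivered.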

Workflow 2: Quarterly SLO review

1. For every active SLO, run slo_review.py — fix any FAIL findings.
2. Look at last quarter's data:
   - Was the SLO too easy (never burned budget)? Tighten target.
   - Was it too hard (frequently burned)? Loosen target OR fix the system.
   - Did burn-rate alerts fire usefully (not too noisy, not too late)? Adjust thresholds.
3. Audit error budget policies — were they actually followed when budget burned?
4. Commit revised SLOs; archive old versions with date stamps.

Workflow 3: SLO-driven rollback

1. New deploy starts burning error budget faster than baseline.
2. Burn-rate alert fires (from error_budget_calculator.py thresholds).
3. Auto-rollback via feature flag (kill switch from feature-flags-architect).
4. Postmortem feeds into next SLO revision.
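The rollback loop above might be wired up roughly like this. Everything here is illustrative: the `disable_flag` callback stands in for whatever kill switch the feature-flag system exposes, and the default threshold of 14.4 assumes the 1-hour fast-burn rate for a 30-day window.

```python
from typing import Callable


def rollback_if_burning(
    burn_rate_1h: float,
    disable_flag: Callable[[], None],
    page_threshold: float = 14.4,
) -> bool:
    """Trigger the kill switch when the 1-hour burn rate crosses the threshold.

    Returns True if a rollback was triggered, so the caller can open a
    postmortem that feeds the next SLO revision.
    """
    if burn_rate_1h >= page_threshold:
        disable_flag()
        return True
    return False


if __name__ == "__main__":
    actions = []
    # Hypothetical kill switch that just records the call.
    triggered = rollback_if_burning(20.0, lambda: actions.append("flag off"))
    print(triggered, actions)
```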

References

references/slo_principles.md — SLI vs SLO vs SLA, Google SRE Workbook canon
references/sli_design.md — picking the right SLI; 5 types with examples
references/error_budget.md — error budget math, burn-rate alerts, budget policy
references/composition.md — how SLOs feed feature flags, chaos, operators

Slash command

/slo-design — interactive SLO design wizard that runs all 3 tools.

Asset templates

assets/slo_template.yaml — fillable SLO YAML
assets/error_budget_policy.md — fillable policy template

Anti-patterns

99.99% on every endpoint — copy-paste SLOs that nobody verified the system can sustain
CPU usage as SLI — system metrics aren't user experience
Single-window burn-rate alert — too noisy if 5-min, too slow if 30-day
No error budget policy — burning budget means nothing without an action
SLOs without owners — no one is responsible; they bit-rot
SLOs reviewed once a year — system characteristics change faster than that
SLAs in the SLO doc — different audience, different stakes; keep them separate
SLO target = SLA target — the SLO must be tighter than the SLA, so you catch and fix a problem before the contract is breached

Verifiable success

A team using this skill should achieve:

100% of SLOs pass slo_review.py with 0 FAIL findings
Every SLO has a documented owner, error budget, burn-rate alerts, and policy
Burn-rate alerts fire ≤2 times/month per SLO that's hit (signal, not noise)
Mean time to detect SLO violation: <30 min (multi-window burn-rate alerts working)
Quarterly SLO review happens every quarter (not annually)

Use when defining, reviewing, or operating SLOs/SLIs/error budgets. Triggers on "define an SLO", "what should our SLO be", "error budget", "burn rate", "SLI", "service level objective", "Google SRE workbook", "multi-window burn-rate alert", or any reliability-target question. Ships SLO designer, error-budget calculator with multi-window burn-rate thresholds, and SLO reviewer that catches the common bugs (target too aggressive, window too short, conflicting SLOs, no SLI definition). 4 references on SLO principles + SLI design + error budget math + composition with feature-flags-architect/chaos-engineering/kubernetes-operator. NOT a generic observability skill — specifically the SLO discipline.

17,936 stars

development

Updated Jun 13, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add alirezarezvani/claude-skills slo-architect

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---|---|---|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Jun 13, 2026, 4:23 AM (33.7s, 10 files scanned)

SKILL.md

name: slo-architect
description: Use when defining, reviewing, or operating SLOs/SLIs/error budgets. Triggers on "define an SLO", "what should our SLO be", "error budget", "burn rate", "SLI", "service level objective", "Google SRE workbook", "multi-window burn-rate alert", or any reliability-target question. Ships SLO designer, error-budget calculator with multi-window burn-rate thresholds, and SLO reviewer that catches the common bugs (target too aggressive, window too short, conflicting SLOs, no SLI definition). 4 references on SLO principles + SLI design + error budget math + composition with feature-flags-architect/chaos-engineering/kubernetes-operator. NOT a generic observability skill — specifically the SLO discipline.
context: fork
version: 2.9.0
author: claude-code-skills
license: MIT
tags: [slo, sli, sla, error-budget, burn-rate, sre, reliability, google-sre-workbook, observability]
compatible_tools: [claude-code, codex-cli, cursor, antigravity, opencode, gemini-cli]


Install manually

# Clone the repo
git clone https://github.com/alirezarezvani/claude-skills.git

# Copy into Claude Code skills folder (global)
cp -r claude-skills/engineering/slo-architect/skills/slo-architect ~/.claude/skills/

Repository: alirezarezvani/claude-skills (17,936 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT