engineering/slo-architect/skills/slo-architect/SKILL.md
Use when defining, reviewing, or operating SLOs/SLIs/error budgets. Triggers on "define an SLO", "what should our SLO be", "error budget", "burn rate", "SLI", "service level objective", "Google SRE workbook", "multi-window burn-rate alert", or any reliability-target question. Ships SLO designer, error-budget calculator with multi-window burn-rate thresholds, and SLO reviewer that catches the common bugs (target too aggressive, window too short, conflicting SLOs, no SLI definition). 4 references on SLO principles + SLI design + error budget math + composition with feature-flags-architect/chaos-engineering/kubernetes-operator. NOT a generic observability skill — specifically the SLO discipline.
npx skillsauth add alirezarezvani/claude-skills slo-architect
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Define SLOs that mean something. Most "SLOs" in the wild are arbitrary numbers no one believes — 99.9% on every endpoint, no SLI definition, no error budget, no policy for what happens when budget burns. This skill enforces the discipline from Google's SRE Workbook: pick the right SLI, set a target users actually care about, calculate the error budget, wire multi-window burn-rate alerts, and have a written policy for when budget runs out.
observability-designer, performance-profiler, incident-response
SLI ⟶ measurable signal of user-perceived health (e.g., HTTP 2xx rate)
SLO ⟶ target for the SLI over a window (e.g., 99.9% over 30 days)
SLA ⟶ customer-facing commitment with consequences (separate concern)
EB ⟶ error budget: 100% − SLO target = how much "bad" you can spend
BR ⟶ burn rate: how fast you're consuming the error budget
The four cardinal mistakes:
The 3 tools below catch each of these.
SKILL=engineering/slo-architect/skills/slo-architect
# 1. Design an SLO
python "$SKILL/scripts/slo_designer.py" \
--service checkout-svc \
--sli-type request-success-rate \
--target 99.9 \
--window-days 30
# 2. Compute error budget + multi-window burn-rate alerts
python "$SKILL/scripts/error_budget_calculator.py" \
--target 99.9 --window-days 30
# 3. Review existing SLO definitions for common bugs
python "$SKILL/scripts/slo_review.py" --slo-doc docs/slos/
All three scripts use only the Python standard library.
slo_designer.py
Generates a structured SLO definition with required fields. Refuses to render if any required field is missing (exit 1).
python scripts/slo_designer.py \
--service checkout-svc \
--sli-type request-success-rate \
--target 99.9 \
--window-days 30 \
--owner team-checkout
SLI types supported:
- request-success-rate: (total_requests - bad_requests) / total_requests
- request-latency: count(requests < threshold) / total_requests
- availability-time: (window - downtime) / window
- data-freshness: count(data_age < threshold) / total_data_points
- correctness: count(correct_outputs) / total_outputs

Output is markdown by default with all required fields filled or marked <must define>. JSON output (--format json) is consumed by slo_review.py.
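To make the ratios above concrete, here is a minimal sketch of three of the SLI types as plain Python functions. The function names, the millisecond units, and the choice to raise on an empty denominator are assumptions made for illustration; this is not the code inside slo_designer.py.

```python
from typing import Sequence


def request_success_rate(total_requests: int, bad_requests: int) -> float:
    """(total_requests - bad_requests) / total_requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - bad_requests) / total_requests


def request_latency(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """count(requests < threshold) / total_requests."""
    if not latencies_ms:
        raise ValueError("need at least one request")
    # Count individual requests under the threshold, not an aggregate percentile.
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms)


def availability_time(window_minutes: float, downtime_minutes: float) -> float:
    """(window - downtime) / window."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    return (window_minutes - downtime_minutes) / window_minutes


if __name__ == "__main__":
    print(request_success_rate(total_requests=1000, bad_requests=1))
    print(request_latency([120, 80, 450, 95], threshold_ms=300))
    print(availability_time(window_minutes=30 * 24 * 60, downtime_minutes=43.2))
```

Every SLI here is a good-events-over-total ratio between 0 and 1, which is what lets one error-budget calculation apply to all of them.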
error_budget_calculator.py
Given target availability + window, computes:
python scripts/error_budget_calculator.py --target 99.9 --window-days 30
python scripts/error_budget_calculator.py --target 99.95 --window-days 7 --format json
slo_review.py
Audits a directory of SLO definitions (markdown or JSON) for the common bugs.
python scripts/slo_review.py --slo-doc docs/slos/
Checks:
- target_too_high: target ≥ 99.99% (sustainable only with massive engineering investment)
- target_too_low: target ≤ 99.0% (probably wrong SLI; users will notice)
- window_too_short: window < 7 days (statistical noise dominates)
- window_too_long: window > 90 days (slow feedback)
- no_sli_definition: SLI section missing or vague ("everything OK")
- no_error_budget_policy: no documented action when budget burns
- cpu_as_sli: CPU/memory used as user-experience proxy (wrong signal)

| User experience | SLI type | What you measure |
|---|---|---|
| "Did the request succeed?" | request-success-rate | 2xx / total |
| "Was the response fast?" | request-latency | count(latency < threshold) / total |
| "Was the service up?" | availability-time | (window - downtime) / window |
| "Is the data current?" | data-freshness | count(data_age < threshold) / total |
| "Was the answer correct?" | correctness | count(correct) / total |
See references/sli_design.md for examples and anti-patterns.
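The numeric checks that slo_review.py is described as running (target and window bounds) are simple enough to sketch. The snippet below is a hypothetical illustration using the thresholds listed under Checks; the function name `review_slo` and its return shape are assumptions, not the script's actual interface.

```python
from typing import List


def review_slo(target_pct: float, window_days: int) -> List[str]:
    """Return the names of the numeric checks that fail for one SLO.

    Thresholds follow the Checks list in this document; the text-based
    checks (SLI definition, budget policy, CPU-as-SLI) are not covered here.
    """
    findings = []
    if target_pct >= 99.99:
        findings.append("target_too_high")
    if target_pct <= 99.0:
        findings.append("target_too_low")
    if window_days < 7:
        findings.append("window_too_short")
    if window_days > 90:
        findings.append("window_too_long")
    return findings


if __name__ == "__main__":
    print(review_slo(target_pct=99.9, window_days=30))    # no findings expected
    print(review_slo(target_pct=99.999, window_days=3))   # two findings expected
```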
For 99.9% SLO over 30 days:
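- Error budget: 0.1% × 30 × 24 × 60 = 43.2 minutes
- 2% of budget spent in 1 hour: 2% × 43.2 / 60 ≈ 1.44% error rate (burn rate 14.4)
- 5% of budget spent in 6 hours: 5% × 43.2 / 360 ≈ 0.6% error rate (burn rate 6)

error_budget_calculator.py does this math for you and emits ready-to-paste alert rules.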
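The arithmetic can be checked by hand with a few lines of Python. This is a sketch under the usual SRE Workbook definition, where burn rate is the budget fraction spent multiplied by the SLO window and divided by the alert window. The function names are hypothetical and are not taken from error_budget_calculator.py.

```python
def error_budget_minutes(target_pct: float, window_days: int) -> float:
    """Minutes of 'bad' time allowed over the SLO window."""
    return (100.0 - target_pct) / 100.0 * window_days * 24 * 60


def burn_rate_threshold(budget_fraction: float, window_days: int,
                        alert_window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget in the alert window."""
    return budget_fraction * (window_days * 24) / alert_window_hours


def error_rate_threshold_pct(burn_rate: float, target_pct: float) -> float:
    """Error rate (in percent) that corresponds to a given burn rate."""
    return burn_rate * (100.0 - target_pct)


if __name__ == "__main__":
    target, window = 99.9, 30
    print(round(error_budget_minutes(target, window), 2))        # about 43.2
    for fraction, hours in [(0.02, 1), (0.05, 6)]:
        rate = burn_rate_threshold(fraction, window, hours)
        print(hours, round(rate, 2),
              round(error_rate_threshold_pct(rate, target), 2))
```

For a 99.9% target over 30 days, a burn rate of 14.4 corresponds to a 1.44% error rate over one hour, and a burn rate of 6 to a 0.6% error rate over six hours.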
This skill explicitly composes with three others:
| Skill | Composition |
|---|---|
| feature-flags-architect | Rollout abort criteria reference SLO burn-rate thresholds |
| chaos-engineering | Blast-radius calculator already takes monthly error budget as input — define it here |
| kubernetes-operator | Operator capability L4 (Deep Insights) requires SLOs + Prometheus rules |
The error_budget_calculator.py output is in the same shape chaos-engineering/scripts/blast_radius_calculator.py expects on stdin.
1. Pick the user journey to protect (e.g., "checkout completion").
2. Choose SLI type (request-success-rate, latency, availability, freshness, correctness).
3. Define the SLI precisely: numerator/denominator with concrete labels.
4. Pick a target by measuring 30 days of historical SLI value:
   target = floor(p50 of last 30 days × 100) / 100
   This avoids targets the system has never sustained.
5. Pick a window (28 days = 4 calendar weeks, recommended).
6. Run slo_designer.py to render the SLO definition.
7. Run error_budget_calculator.py to get burn-rate alerts.
8. Write the error budget policy (what happens when budget burns).
9. Run slo_review.py — must pass before the SLO is "live".
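Step 4's target formula can be sketched as follows. The sketch assumes "p50 of last 30 days" means the median of daily SLI values expressed in percent; that reading, the function name `suggest_target`, and the sample data are assumptions for illustration.

```python
import math
import statistics
from typing import Sequence


def suggest_target(daily_sli_pct: Sequence[float]) -> float:
    """floor(p50 × 100) / 100: the median daily SLI, rounded down to 2 decimals."""
    if not daily_sli_pct:
        raise ValueError("need at least one day of SLI history")
    p50 = statistics.median(daily_sli_pct)
    # Rounding down (never up) keeps the target at or below what was observed.
    return math.floor(p50 * 100) / 100


if __name__ == "__main__":
    # Hypothetical daily success rates in percent (normally 30 values).
    history = [99.97, 99.91, 99.956, 99.93, 99.99]
    print(suggest_target(history))
```

With the five sample values the median is 99.956, so the suggested target rounds down to 99.95.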
1. For every active SLO, run slo_review.py — fix any FAIL findings.
2. Look at last quarter's data:
   - Was the SLO too easy (never burned budget)? Tighten target.
   - Was it too hard (frequently burned)? Loosen target OR fix the system.
   - Did burn-rate alerts fire usefully (not too noisy, not too late)? Adjust thresholds.
3. Audit error budget policies — were they actually followed when budget burned?
4. Commit revised SLOs; archive old versions with date stamps.
1. New deploy starts burning error budget faster than baseline.
2. Burn-rate alert fires (from error_budget_calculator.py thresholds).
3. Auto-rollback via feature flag (kill switch from feature-flags-architect).
4. Postmortem feeds into next SLO revision.
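The alert in step 2 is typically a multi-window check: it fires only when both a long window and a short window exceed the same burn-rate threshold, so it triggers quickly and also clears quickly once the bad deploy is rolled back. A minimal sketch, assuming observed error rates are supplied as fractions (for example from a 1-hour and a 5-minute window); the function names are hypothetical.

```python
def burn_rate(error_rate: float, target_pct: float) -> float:
    """Observed error rate divided by the error budget (both as fractions)."""
    budget = (100.0 - target_pct) / 100.0
    if budget <= 0:
        raise ValueError("target must be below 100%")
    return error_rate / budget


def should_alert(long_window_error_rate: float, short_window_error_rate: float,
                 target_pct: float, threshold: float) -> bool:
    """Fire only if BOTH windows burn faster than the threshold."""
    return (burn_rate(long_window_error_rate, target_pct) > threshold
            and burn_rate(short_window_error_rate, target_pct) > threshold)


if __name__ == "__main__":
    # Deploy is still failing: both windows are well above a 14.4 burn rate.
    print(should_alert(0.02, 0.03, target_pct=99.9, threshold=14.4))
    # Already rolled back: the short window has recovered, so no alert.
    print(should_alert(0.02, 0.0005, target_pct=99.9, threshold=14.4))
```

The short window is what lets the alert reset soon after a rollback instead of staying red until the long window drains.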
- references/slo_principles.md: SLI vs SLO vs SLA, Google SRE Workbook canon
- references/sli_design.md: picking the right SLI; 5 types with examples
- references/error_budget.md: error budget math, burn-rate alerts, budget policy
- references/composition.md: how SLOs feed feature flags, chaos, operators

/slo-design: interactive SLO design wizard that runs all 3 tools.
- assets/slo_template.yaml: fillable SLO YAML
- assets/error_budget_policy.md: fillable policy template

A team using this skill should achieve:
- slo_review.py with 0 FAIL findings