skills/reliability/SKILL.md
Site reliability engineering -- SLO/SLI/SLA, error budgets, toil, on-call, runbooks, incidents.
npx skillsauth add arbazkhan971/godmode reliabilityInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
/godmode:reliability, "SRE", "SLO", "error budget"grep -r "healthcheck\|health-check\|/health" \
--include="*.ts" --include="*.py" -l 2>/dev/null
grep -r "pagerduty\|opsgenie\|alertmanager" \
--include="*.yaml" --include="*.yml" -l 2>/dev/null
Service: <name> | Criticality: Tier 1/2/3
Current state: monitoring, alerting, on-call, runbooks
Dependencies: <upstream and downstream>
Hierarchy: SLA (external) -> SLO (internal, stricter) -> SLI (measured metric).
SLI categories: availability (success/total), latency (requests < threshold / total), throughput, correctness, freshness, durability.
Error budget = 1 - SLO.
Errors: HTTP 5xx, timeouts, circuit breaker rejections. NOT 4xx or 429.
IF error rate >0.1%: investigate top 3 error classes. WHEN SLO budget <10% remaining: freeze deploys.
Policy:
50% remaining: normal operations
Multi-window burn rate alerts:
Both windows must trigger (reduces false positives).
Toil = manual, repetitive, automatable, tactical, no enduring value. Inventory with frequency + hours. Target: <50% of team capacity. Automate top 3.
IF toil > 50%: stop feature work, automate.
Minimum 5 engineers. Primary + secondary. Escalation: L1(0m) -> L2(15m) -> L3(30m) -> L4(60m). Health: <5 pages/shift, <1 during sleep, MTTA <5min, MTTR <30min, false positive <20%. Max 1 week in 5. Day off after off-hours SEV1.
Every pageable alert needs: what is happening, user impact, diagnostic steps (commands), mitigation options (commands), escalation, post-incident actions. Levels: L0 Manual -> L1 Assisted -> L2 Semi-auto -> L3 Full auto.
Lifecycle: Detection -> Triage -> Mitigation -> Resolution -> Post-mortem -> Prevention. Severity: SEV1 (<15min), SEV2 (<30min), SEV3 (<2h), SEV4 (next day). Roles: IC, Tech Lead, Comms, Scribe.
SLOs, error budget alerts, dashboards, logging, tracing, alerts, runbooks, on-call, circuit breakers, timeouts, auto-scaling, canary deploy, rollback.
Append .godmode/reliability-results.tsv:
timestamp service slo_count budget_remaining_pct alerts runbooks status
KEEP if: SLO measurement works AND alerts fire
correctly AND runbook is actionable.
DISCARD if: false positives OR measurement broken
OR runbook is vague.
STOP when ALL of:
- SLOs defined and measurable (tier-1 services)
- Burn rate alerts configured and tested
- On-call rotation active with escalation
- Runbooks exist for all critical alerts
On failure: git reset --hard HEAD~1. Never pause.
<!-- tier-3 -->| Failure | Action | |--|--| | SLO always breached | Set at current p95, improve later | | Alert fatigue | Increase threshold, multi-signal alerts | | Runbook outdated | Add to deploy checklist, test quarterly | | Budget depleted fast | Freeze deploys, fix top errors |
development
Web performance optimization. Lighthouse, bundle analysis, code splitting, image optimization, critical CSS, fonts, service workers, CDN.
development
Webhook design, delivery, retry, HMAC verification, event subscriptions, dead letter queues.
development
Vue.js mastery. Composition API, Pinia, Vue Router, Nuxt SSR/SSG, Vite optimization, testing.
development
Evidence gate. Run command, read full output, confirm or deny claim. No trust, only proof.