arbazkhan971/reliability

Name: reliability
Author: arbazkhan971

skills/reliability/SKILL.md

npx skillsauth add arbazkhan971/godmode reliability

Activate When

/godmode:reliability, "SRE", "SLO", "error budget"
"toil", "on-call", "runbook", "incident management"
Establishing production operations for a new service

Workflow

1. Service Context

grep -r "healthcheck\|health-check\|/health" \
  --include="*.ts" --include="*.py" -l 2>/dev/null
grep -r "pagerduty\|opsgenie\|alertmanager" \
  --include="*.yaml" --include="*.yml" -l 2>/dev/null

Service: <name> | Criticality: Tier 1/2/3
Current state: monitoring, alerting, on-call, runbooks
Dependencies: <upstream and downstream>

2. SLO/SLI/SLA

Hierarchy: SLA (external) -> SLO (internal, stricter) -> SLI (measured metric).

SLI categories: availability (success/total), latency (requests < threshold / total), throughput, correctness, freshness, durability.

Error budget = 1 - SLO.

99.9% = 43.2 min/month error budget
99.99% = 4.3 min/month error budget
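
The two figures above follow from applying the error budget formula to a 30-day month (43,200 minutes). A minimal sketch of that arithmetic, where the function name `error_budget_minutes` and the 30-day default are illustrative assumptions rather than part of the skill:

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes for an SLO over a window (assumes a 30-day month by default)."""
    if not 0 < slo_percent < 100:
        # A 100% SLO leaves no budget at all, which the hard rules forbid.
        raise ValueError("SLO must be strictly between 0 and 100 percent")
    window_minutes = window_days * 24 * 60
    # Error budget = 1 - SLO, scaled to the length of the window.
    return (1 - slo_percent / 100) * window_minutes


if __name__ == "__main__":
    for slo in (99.9, 99.99):
        print(f"{slo}% -> {error_budget_minutes(slo):.1f} min per 30 days")
```

Passing a different `window_days` (for example 28 or 90) shows how much the budget depends on the chosen window.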

Errors: HTTP 5xx, timeouts, circuit breaker rejections. NOT 4xx (including 429).
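
One way to apply this error definition to an availability SLI (success/total) is sketched below. The `Request` record and its field names are hypothetical; a real service would derive them from its own access logs or metrics.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request record; field names are illustrative only.
    status: int                        # HTTP status code returned to the caller
    timed_out: bool = False            # request exceeded its deadline
    rejected_by_breaker: bool = False  # circuit breaker refused the call


def counts_against_budget(req: Request) -> bool:
    """True for 5xx, timeouts, and circuit breaker rejections; 4xx (incl. 429) is not an error."""
    if req.timed_out or req.rejected_by_breaker:
        return True
    return 500 <= req.status <= 599


def availability_sli(requests: list[Request]) -> float:
    """Availability SLI = successful requests / total requests."""
    if not requests:
        raise ValueError("no traffic in the measurement window")
    bad = sum(counts_against_budget(r) for r in requests)
    return (len(requests) - bad) / len(requests)
```

Client errors stay out of the numerator here because they usually reflect caller behaviour rather than service health.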

IF error rate >0.1%: investigate top 3 error classes. WHEN SLO budget <10% remaining: freeze deploys.

3. Error Budgets & Burn Rate Alerts

Policy:

>50% remaining: normal operations
25-50%: slow risky deploys
10-25%: freeze non-critical
<10%: all hands on reliability
0%: only reliability work
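
The policy tiers can be expressed as a simple lookup. This is a sketch only: the function name is hypothetical, and the handling of exact boundary values (50, 25, 10) is an assumption, since the tiers above do not say which side each boundary belongs to.

```python
def budget_policy(remaining_pct: float) -> str:
    """Map remaining error budget (percent) to the policy tier.

    Boundary handling is an assumption: 25 and 50 fall in 'slow risky deploys',
    10 falls in 'freeze non-critical'.
    """
    if remaining_pct <= 0:
        return "only reliability work"
    if remaining_pct < 10:
        return "all hands on reliability"
    if remaining_pct < 25:
        return "freeze non-critical"
    if remaining_pct <= 50:
        return "slow risky deploys"
    return "normal operations"
```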

Multi-window burn rate alerts:

Critical: 14.4x burn, 1h+5m -> Page
High: 6x burn, 6h+30m -> Page
Medium: 3x burn, 1d+2h -> Ticket
Low: 1x burn, 3d+6h -> Log

Both windows must trigger (reduces false positives).
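
A minimal sketch of the multi-window check, assuming burn rate is defined as the observed error rate divided by the error budget (1 - SLO) and that a threshold is met when the burn rate is greater than or equal to it. The names `BurnRule`, `RULES`, and `evaluate`, and the shape of the `error_rates` input, are illustrative assumptions; in practice this logic usually lives in the alerting system's rule language.

```python
from typing import NamedTuple, Optional


class BurnRule(NamedTuple):
    severity: str
    threshold: float   # burn-rate multiple
    long_window: str
    short_window: str
    action: str


# Thresholds and window pairs taken from the list above, most severe first.
RULES = [
    BurnRule("Critical", 14.4, "1h", "5m", "Page"),
    BurnRule("High", 6.0, "6h", "30m", "Page"),
    BurnRule("Medium", 3.0, "1d", "2h", "Ticket"),
    BurnRule("Low", 1.0, "3d", "6h", "Log"),
]


def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    return error_rate / (1.0 - slo)


def evaluate(error_rates: dict[str, float], slo: float) -> Optional[BurnRule]:
    """Return the most severe rule whose long AND short windows both meet its threshold.

    error_rates maps a window label (e.g. "1h") to the error ratio seen in that window.
    """
    for rule in RULES:
        long_burn = burn_rate(error_rates[rule.long_window], slo)
        short_burn = burn_rate(error_rates[rule.short_window], slo)
        if long_burn >= rule.threshold and short_burn >= rule.threshold:
            return rule
    return None
```

Requiring the short window as well means an alert stops firing soon after the error rate recovers, even if the long window still looks bad.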

4. Toil Reduction

Toil = manual, repetitive, automatable, tactical, no enduring value. Inventory with frequency + hours. Target: <50% of team capacity. Automate top 3.

IF toil > 50%: stop feature work, automate.
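
The inventory and the 50% rule could be tracked with something like the following sketch. `ToilItem`, `toil_report`, and the idea of measuring against monthly team hours are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToilItem:
    # Hypothetical inventory row: a task, how often it runs, and how long it takes.
    name: str
    runs_per_month: int
    hours_per_run: float

    @property
    def hours_per_month(self) -> float:
        return self.runs_per_month * self.hours_per_run


def toil_report(items: list[ToilItem], team_hours_per_month: float, top_n: int = 3):
    """Return (toil fraction, names of the top_n costliest items, stop_feature_work flag)."""
    total = sum(item.hours_per_month for item in items)
    fraction = total / team_hours_per_month
    costliest = sorted(items, key=lambda item: item.hours_per_month, reverse=True)[:top_n]
    # More than half of capacity spent on toil triggers the "stop feature work" rule.
    return fraction, [item.name for item in costliest], fraction > 0.5
```

The second return value is the list of automation candidates ("automate top 3").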

5. On-Call

Minimum 5 engineers. Primary + secondary.
Escalation: L1 (0m) -> L2 (15m) -> L3 (30m) -> L4 (60m).
Health: <5 pages/shift, <1 page during sleep, MTTA <5min, MTTR <30min, false positive rate <20%.
Max 1 week in 5. Day off after off-hours SEV1.
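
A rough sketch of the escalation ladder and health thresholds as code. It assumes the escalation times are minutes since the page was sent without acknowledgement, which the text does not state explicitly; all function and parameter names are hypothetical.

```python
# (minutes without acknowledgement, level) pairs from the escalation chain above.
ESCALATION = [(0, "L1"), (15, "L2"), (30, "L3"), (60, "L4")]


def escalation_level(minutes_unacknowledged: float) -> str:
    """Highest escalation level reached after the given unacknowledged time."""
    level = ESCALATION[0][1]
    for threshold, name in ESCALATION:
        if minutes_unacknowledged >= threshold:
            level = name
    return level


def rotation_health(pages_per_shift: float, sleep_pages_per_shift: float,
                    mtta_min: float, mttr_min: float,
                    false_positive_rate: float) -> list[str]:
    """Return the health targets the rotation is currently missing (empty list = healthy)."""
    problems = []
    if pages_per_shift >= 5:
        problems.append("pages per shift")
    if sleep_pages_per_shift >= 1:
        problems.append("pages during sleep")
    if mtta_min >= 5:
        problems.append("MTTA")
    if mttr_min >= 30:
        problems.append("MTTR")
    if false_positive_rate >= 0.20:
        problems.append("false positive rate")
    return problems
```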

6. Runbooks

Every pageable alert needs: what is happening, user impact, diagnostic steps (commands), mitigation options (commands), escalation, post-incident actions. Levels: L0 Manual -> L1 Assisted -> L2 Semi-auto -> L3 Full auto.
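
A small completeness check for the required runbook sections might look like this sketch. The section keys and the dictionary representation are assumptions; the skill does not prescribe a runbook file format.

```python
# Hypothetical keys for the six sections every pageable alert's runbook needs.
REQUIRED_SECTIONS = (
    "what_is_happening",
    "user_impact",
    "diagnostic_steps",
    "mitigation_options",
    "escalation",
    "post_incident_actions",
)


def missing_sections(runbook: dict[str, str]) -> list[str]:
    """Return required sections that are absent or blank, in the canonical order."""
    return [name for name in REQUIRED_SECTIONS if not runbook.get(name, "").strip()]
```

An empty result means the runbook covers every required section; it says nothing about whether the content is actionable.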

7. Incident Management

Lifecycle: Detection -> Triage -> Mitigation -> Resolution -> Post-mortem -> Prevention. Severity: SEV1 (<15min), SEV2 (<30min), SEV3 (<2h), SEV4 (next day). Roles: IC, Tech Lead, Comms, Scribe.

8. Production Readiness

SLOs, error budget alerts, dashboards, logging, tracing, alerts, runbooks, on-call, circuit breakers, timeouts, auto-scaling, canary deploy, rollback.

Hard Rules

NEVER set SLO at 100%.
EVERY alert must have a runbook.
NEVER alert on raw metrics -- use burn rate.
SET SLO stricter than SLA.
SLIs from real user traffic only.
NEVER skip SEV1/SEV2 post-mortems.
Budget policy MUST define exhaustion response.
On-call minimum 5 people.
Toil measured monthly; >50% = stop features.

TSV Logging

Append .godmode/reliability-results.tsv:

timestamp	service	slo_count	budget_remaining_pct	alerts	runbooks	status
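
Appending a row in this column order could be done as in the sketch below. Several details are assumptions not stated in the skill: the header row is written once when the file is new or empty, the timestamp is ISO 8601 in UTC, and the `status` value is free text.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

COLUMNS = ["timestamp", "service", "slo_count", "budget_remaining_pct",
           "alerts", "runbooks", "status"]


def append_result(path, service, slo_count, budget_remaining_pct,
                  alerts, runbooks, status):
    """Append one tab-separated result row, creating the file (and header) if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: write the header only when the file is new or empty.
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        if write_header:
            writer.writerow(COLUMNS)
        timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        writer.writerow([timestamp, service, slo_count, budget_remaining_pct,
                         alerts, runbooks, status])
```

For example, `append_result(".godmode/reliability-results.tsv", "checkout", 3, 72.5, 4, 4, "keep")` would add one row for a hypothetical `checkout` service.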

Keep/Discard

KEEP if: SLO measurement works AND alerts fire
  correctly AND runbook is actionable.
DISCARD if: false positives OR measurement broken
  OR runbook is vague.

Stop Conditions

STOP when ALL of:
  - SLOs defined and measurable (tier-1 services)
  - Burn rate alerts configured and tested
  - On-call rotation active with escalation
  - Runbooks exist for all critical alerts

Autonomous Operation

On failure: git reset --hard HEAD~1. Never pause.

Error Recovery

Description: Site reliability engineering -- SLO/SLI/SLA, error budgets, toil, on-call, runbooks, incidents.
Stars: 9
Category: testing
Updated: Apr 17, 2026

Install globally with one command (works with Claude Code, Cursor, and Windsurf):

npx skillsauth add arbazkhan971/godmode reliability

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 17, 2026, 3:16 AM (5.8s, 1 file scanned)

Install manually

# Clone the repo
git clone https://github.com/arbazkhan971/godmode.git

# Copy into Claude Code skills folder (global)
cp -r godmode/skills/reliability ~/.claude/skills/

Repository: arbazkhan971/godmode

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT