SRE & Incident Management Platform

Complete Site Reliability Engineering system — from SLO definition through incident response, chaos engineering, and operational excellence. Zero dependencies.

Phase 1: Reliability Assessment

Before building anything, assess where you are.

Service Catalog Entry

service:
  name: ""
  tier: ""  # critical | important | standard | experimental
  owner_team: ""
  oncall_rotation: ""
  dependencies:
    upstream: []    # services we call
    downstream: []  # services that call us
  data_classification: ""  # public | internal | confidential | restricted
  deployment_frequency: ""  # daily | weekly | biweekly | monthly
  architecture: ""  # monolith | microservice | serverless | hybrid
  language: ""
  infra: ""  # k8s | ECS | Lambda | VM | bare-metal
  traffic_pattern: ""  # steady | diurnal | spiky | seasonal
  peak_rps: 0
  storage_gb: 0
  monthly_cost_usd: 0

Maturity Assessment (Score 1-5 per dimension)

| Dimension | 1 (Ad-hoc) | 3 (Defined) | 5 (Optimized) | Score |
|-----------|-----------|-------------|---------------|-------|
| SLOs | No SLOs defined | SLOs exist, reviewed quarterly | Data-driven SLOs, auto error budgets | |
| Monitoring | Basic health checks | Golden signals + dashboards | Full observability, anomaly detection | |
| Incident Response | No runbooks, hero culture | Documented process, postmortems | Automated detection, structured ICS | |
| Automation | Manual deployments | CI/CD pipeline, some automation | Self-healing, auto-scaling, GitOps | |
| Chaos Engineering | No testing | Basic failure injection | Continuous chaos in production | |
| Capacity Planning | Reactive scaling | Quarterly forecasting | Predictive auto-scaling | |
| Toil Management | >50% toil | Toil tracked, reduction plans | <25% toil, systematic elimination | |
| On-Call Health | Burnout, 24/7 individuals | Rotation exists, escalation paths | Balanced load, <2 pages/shift | |

Score interpretation:

8-16: Firefighting mode — start with SLOs + incident process
17-24: Foundation built — add chaos engineering + toil reduction
25-32: Maturing — optimize error budgets + capacity planning
33-40: Advanced — focus on predictive reliability + culture
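
The scoring above can be sketched in a few lines of Python. This is a minimal illustration, not part of the platform itself: the function name `interpret_maturity` and the dictionary keys are hypothetical, and the band labels are copied from the interpretation list.

```python
from typing import Dict

# Bands copied from the score interpretation above: (low, high, label).
BANDS = [
    (8, 16, "Firefighting mode"),
    (17, 24, "Foundation built"),
    (25, 32, "Maturing"),
    (33, 40, "Advanced"),
]


def interpret_maturity(scores: Dict[str, int]) -> str:
    """Sum eight 1-5 dimension scores and return the matching band label."""
    if len(scores) != 8:
        raise ValueError("expected a score for each of the 8 dimensions")
    for name, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{name}: score must be between 1 and 5")
    total = sum(scores.values())
    for low, high, label in BANDS:
        if low <= total <= high:
            return f"{total}: {label}"
    raise AssertionError("unreachable: total is always within 8-40")
```

Validating the per-dimension range up front keeps a single mistyped score from silently shifting a service into the wrong band.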

Phase 2: SLI/SLO Framework

SLI Selection by Service Type

| Service Type | Primary SLI | Secondary SLIs |
|-------------|-------------|----------------|
| API/Backend | Request success rate | Latency p50/p95/p99, throughput |
| Frontend/Web | Page load (LCP) | FID/INP, CLS, error rate |
| Data Pipeline | Freshness | Correctness, completeness, throughput |
| Storage | Durability | Availability, latency |
| Streaming | Processing latency | Throughput, ordering, data loss rate |
| Batch Job | Success rate | Duration, SLA compliance |
| ML Model | Prediction latency | Accuracy drift, feature freshness |

SLI Specification Template

sli:
  name: "request_success_rate"
  description: "Proportion of valid requests served successfully"
  type: "availability"  # availability | latency | quality | freshness
  measurement:
    good_events: "HTTP responses with status < 500"
    total_events: "All HTTP requests excluding health checks"
    source: "load balancer access logs"
    aggregation: "sum(good) / sum(total) over rolling 28-day window"
  exclusions:
    - "Health check endpoints (/healthz, /readyz)"
    - "Synthetic monitoring traffic"
    - "Requests from blocked IPs"
    - "4xx responses (client errors)"
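
As a rough illustration of how this specification could be evaluated, the sketch below computes the success rate over a batch of requests. The `Request` type and its `synthetic` and `blocked_ip` flags are hypothetical stand-ins for whatever the load balancer logs actually expose, and treating an empty window as 100% is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Request:
    path: str
    status: int
    synthetic: bool = False   # hypothetical flag: synthetic monitoring traffic
    blocked_ip: bool = False  # hypothetical flag: request came from a blocked IP


HEALTH_PATHS = {"/healthz", "/readyz"}


def is_excluded(req: Request) -> bool:
    """Apply the exclusions listed in the SLI specification."""
    return (
        req.path in HEALTH_PATHS
        or req.synthetic
        or req.blocked_ip
        or 400 <= req.status < 500  # client errors are excluded
    )


def request_success_rate(requests: Iterable[Request]) -> float:
    """Return good / total over the valid (non-excluded) requests."""
    valid = [r for r in requests if not is_excluded(r)]
    if not valid:
        return 1.0  # assumption: no valid traffic counts as meeting the SLI
    good = sum(1 for r in valid if r.status < 500)
    return good / len(valid)
```

Because 4xx responses are excluded before counting, the good events in this sketch are effectively the 2xx and 3xx responses.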

SLO Target Selection Guide

| Nines | Uptime % | Downtime/month | Appropriate for |
|-------|----------|----------------|-----------------|
| 2 nines | 99% | 7h 18m | Internal tools, dev environments |
| 2.5 | 99.5% | 3h 39m | Non-critical services, backoffice |
| 3 nines | 99.9% | 43m 50s | Standard production services |
| 3.5 | 99.95% | 21m 55s | Important customer-facing services |
| 4 nines | 99.99% | 4m 23s | Critical services, payments, auth |
| 5 nines | 99.999% | 26s | Life-safety, financial clearing |
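
The downtime column can be reproduced from the uptime percentage. The sketch below assumes an average month of 30.4375 days (365.25 / 12), which appears to be the convention the table uses; the function names are illustrative.

```python
def allowed_downtime_seconds(target_pct: float, days: float = 30.4375) -> float:
    """Downtime permitted by an availability target over a period of `days`."""
    if not 0 < target_pct <= 100:
        raise ValueError("target_pct must be in (0, 100]")
    return (1 - target_pct / 100) * days * 24 * 60 * 60


def format_duration(seconds: float) -> str:
    """Render a duration as hours, minutes and seconds, rounded to the second."""
    total = round(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    parts = []
    if hours:
        parts.append(f"{hours}h")
    if minutes:
        parts.append(f"{minutes}m")
    if secs or not parts:
        parts.append(f"{secs}s")
    return " ".join(parts)
```

Passing `days=28` gives the budget for the 28-day rolling window used elsewhere in this document instead of a calendar month.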

Rules for setting targets:

Start lower than you think — you can always tighten
SLO stricter than SLA (always keep a buffer, typically a 0.1-0.5% margin above the SLA target)
Internal SLO stricter than external SLO (catch problems before customers do)
Each additional nine costs roughly 10x more to achieve
If you can't measure it, you can't SLO it

SLO Document Template

slo:
  service: ""
  sli: ""
  target: 99.9  # percentage
  window: "28d"  # rolling window
  error_budget: 0.1  # 100% - target
  error_budget_minutes: 40  # per 28-day window
  
  burn_rate_alerts:
    - name: "fast_burn"
      burn_rate: 14.4  # exhausts budget in ~47 hours (28d / 14.4)
      short_window: "5m"
      long_window: "1h"
      severity: "page"
    - name: "medium_burn"
      burn_rate: 6.0   # exhausts budget in ~4.7 days (28d / 6)
      short_window: "30m"
      long_window: "6h"
      severity: "page"
    - name: "slow_burn"
      burn_rate: 1.0   # exhausts budget in 28 days
      short_window: "6h"
      long_window: "3d"
      severity: "ticket"
  
  review_cadence: "monthly"
  owner: ""
  stakeholders: []
  
  escalation_when_budget_exhausted:
    - "Halt non-critical deployments"
    - "Redirect engineering to reliability work"
    - "Escalate to VP Engineering if no improvement in 48h"
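
One way to read the `burn_rate_alerts` section is that an alert fires only when both its short and long windows burn at or above the threshold, which filters out brief spikes and stale incidents. The sketch below illustrates that rule; the `BurnRateAlert` type, the `firing_alerts` helper, and the shape of the `observed` mapping are assumptions, and the thresholds are copied from the template.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BurnRateAlert:
    name: str
    burn_rate: float
    short_window: str
    long_window: str
    severity: str


# Values copied from the SLO document template above.
ALERTS = [
    BurnRateAlert("fast_burn", 14.4, "5m", "1h", "page"),
    BurnRateAlert("medium_burn", 6.0, "30m", "6h", "page"),
    BurnRateAlert("slow_burn", 1.0, "6h", "3d", "ticket"),
]


def firing_alerts(observed: Dict[str, float]) -> List[BurnRateAlert]:
    """Return alerts whose short AND long windows both meet the threshold.

    `observed` maps a window label (for example "5m") to the burn rate
    measured over that window.
    """
    return [
        alert
        for alert in ALERTS
        if observed[alert.short_window] >= alert.burn_rate
        and observed[alert.long_window] >= alert.burn_rate
    ]
```

In practice the burn rates per window would come from the metrics backend; here they are plain numbers so the decision logic stays visible.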

Phase 3: Error Budget Management

Error Budget Policy

error_budget_policy:
  service: ""
  
  budget_states:
    healthy:
      condition: "remaining_budget > 50%"
      actions:
        - "Normal development velocity"
        - "Feature work prioritized"
        - "Chaos experiments allowed"
    
    warning:
      condition: "remaining_budget 25-50%"
      actions:
        - "Increase monitoring scrutiny"
        - "Review recent changes for risk"
        - "Limit risky deployments to business hours"
        - "No chaos experiments"
    
    critical:
      condition: "remaining_budget 0-25%"
      actions:
        - "Feature freeze — reliability work only"
        - "All deployments require SRE approval"
        - "Mandatory rollback plan for every change"
        - "Daily error budget review"
    
    exhausted:
      condition: "remaining_budget <= 0"
      actions:
        - "Complete deployment freeze"
        - "All engineering redirected to reliability"
        - "VP Engineering notified"
        - "Postmortem required for budget exhaustion"
        - "Freeze maintained until budget recovers to 10%"
  
  exceptions:
    - "Security patches always allowed"
    - "Regulatory compliance changes always allowed"
    - "Data loss prevention always allowed"
  
  reset: "Rolling 28-day window (no manual resets)"
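
The budget states can be derived mechanically from event counts. The following is a sketch under stated assumptions: the function names are hypothetical, and the boundary handling (exactly 50% is `warning`, exactly 25% is `critical`) is one reasonable reading of the conditions above.

```python
def remaining_budget_pct(total_events: int, bad_events: int, slo_target_pct: float) -> float:
    """Percentage of the error budget still unspent in the current window."""
    allowed_bad = total_events * (100 - slo_target_pct) / 100
    if allowed_bad <= 0:
        raise ValueError("no error budget: check total_events and slo_target_pct")
    return 100 * (1 - bad_events / allowed_bad)


def budget_state(remaining_pct: float) -> str:
    """Map remaining budget to the policy states defined above."""
    if remaining_pct <= 0:
        return "exhausted"
    if remaining_pct <= 25:
        return "critical"
    if remaining_pct <= 50:
        return "warning"
    return "healthy"
```

For example, with 1,000,000 requests and a 99.9% target, about 1,000 failures are allowed; 600 failures leaves roughly 40% of the budget, which falls in the `warning` state.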

Burn Rate Calculation

Burn rate = (error rate observed) / (error rate allowed by SLO)

Example:
- SLO: 99.9% (error budget = 0.1%)
- Current error rate: 0.5%
- Burn rate = 0.5% / 0.1% = 5x

At 5x burn rate → budget exhausted in 28d / 5 = 5.6 days
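
The same arithmetic as a small Python sketch. The function names are illustrative, and the 28-day default matches the rolling window used in this document.

```python
def burn_rate(observed_error_pct: float, slo_target_pct: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed_error_pct = 100 - slo_target_pct
    if allowed_error_pct <= 0:
        raise ValueError("a 100% SLO has no error budget to burn")
    return observed_error_pct / allowed_error_pct


def days_to_exhaustion(rate: float, window_days: float = 28) -> float:
    """Days until the whole budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days / rate
```

Calling `days_to_exhaustion(burn_rate(0.5, 99.9))` reproduces the example: a burn rate of about 5 and roughly 5.6 days of budget left.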

Error Budget Dashboard

Track weekly:

| Metric | Current | Trend | Status |
|--------|---------|-------|--------|
| Budget remaining (%) | | ↑↓→ | 🟢🟡🔴 |
| Budget consumed this week | | | |
| Burn rate (1h / 6h / 24h) | | | |
| Incidents consuming budget | | | |
| Top error contributor | | | |
| Projected exhaustion date | | | |

Phase 4: Monitoring & Alerting Architecture

Four Golden Signals

| Signal | What to Measure | Alert When |
|--------|----------------|------------|
| Latency | p50, p95, p99 response time | p99 > 2x baseline for 5 min |
| Traffic | Requests/sec, concurrent users | >30% drop (indicates upstream issue) OR >50% spike |
| Errors | 5xx rate, timeout rate, exception rate | Error rate > SLO burn rate threshold |
| Saturation | CPU, memory, disk, connections, queue depth | >80% sustained for 10 min |

USE Method (Infrastructure)

For every resource, track:

Utilization: % of capacity used (0-100%)
Saturation: queue depth / wait time (0 = no waiting)
Errors: error count / error rate

RED Method (Services)

For every service, track:

Rate: requests per second
Errors: failed requests per second
Duration: latency distribution

Alert Design Rules

Every alert must have a runbook link — no exceptions
Every alert must be actionable — if you can't act on it, delete it
Symptoms over causes — alert on "users can't check out" not "database CPU high"
Multi-window, multi-burn-rate — avoid single-threshold alerts
Page only for customer impact — everything else is a ticket
Alert fatigue = death — review alert volume monthly; target <5 pages/week per service

Alert Severity Guide

| Severity | Response Time | Notification | Examples |
|----------|--------------|-------------|----------|
| P0/Page | <5 min | PagerDuty + phone | SLO burn rate critical, data loss, security breach |
| P1/Urgent | <30 min | Slack + PagerDuty | Degraded service, elevated errors, capacity warning |
| P2/Ticket | Next business day | Ticket auto-created | Slow burn, non-critical component down |
| P3/Log | Weekly review | Dashboard only | Informational, trend detection |

Structured Log Standard

{
  "timestamp": "2026-02-17T11:24:00.000Z",
  "level": "error",
  "service": "payment-api",
  "trace_id": "abc123",
  "span_id": "def456",
  "message": "Payment processing failed",
  "error_type": "TimeoutException",
  "error_message": "Gateway timeout after 30s",
  "http_method": "POST",
  "http_path": "/api/v1/payments",
  "http_status": 504,
  "duration_ms": 30012,
  "customer_id": "cust_xxx",
  "payment_id": "pay_yyy",
  "amount_cents": 4999,
  "retry_count": 2,
  "environment": "production",
  "host": "payment-api-7b4d9-xk2p1",
  "region": "us-east-1"
}
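
A minimal sketch of emitting logs in this shape with Python's standard `logging` module. The `JsonFormatter` class and the convention of passing request-specific keys through `extra={"fields": {...}}` are assumptions made for illustration, not an existing library API.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str, environment: str) -> None:
        super().__init__()
        self.service = service
        self.environment = environment

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
            "environment": self.environment,
        }
        # Hypothetical convention: structured keys arrive via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service: str, environment: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service, environment))
    logger = logging.getLogger(service)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `build_logger("payment-api", "production").error("Payment processing failed", extra={"fields": {"http_status": 504, "trace_id": "abc123"}})` would then produce a line with the common keys plus the request-specific ones.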

Phase 5: Incident Response Framework

Severity Classification Matrix

| | Impact: 1 User | Impact: <25% Users | Impact: >25% Users | Impact: All Users |
|-|----------------|--------------------|--------------------|-------------------|
| Core function down | SEV3 | SEV2 | SEV1 | SEV1 |
| Degraded performance | SEV4 | SEV3 | SEV2 | SEV1 |
| Non-core feature down | SEV4 | SEV3 | SEV3 | SEV2 |
| Cosmetic/minor | SEV4 | SEV4 | SEV3 | SEV3 |

Auto-escalation triggers:

Any data loss → SEV1 minimum
Security breach with PII → SEV1
Revenue-impacting → SEV1 or SEV2
SLA breach imminent → auto-escalate one level
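
The matrix and the escalation triggers can be combined into a single lookup. The sketch below is illustrative only: the key names are hypothetical, and reading "Revenue-impacting → SEV1 or SEV2" as "at least SEV2" is an assumption.

```python
# Severity numbers copied from the classification matrix (1 is most severe).
MATRIX = {
    "core_down":     {"one_user": 3, "under_25": 2, "over_25": 1, "all": 1},
    "degraded":      {"one_user": 4, "under_25": 3, "over_25": 2, "all": 1},
    "non_core_down": {"one_user": 4, "under_25": 3, "over_25": 3, "all": 2},
    "cosmetic":      {"one_user": 4, "under_25": 4, "over_25": 3, "all": 3},
}


def classify(
    failure: str,
    impact: str,
    *,
    data_loss: bool = False,
    pii_breach: bool = False,
    revenue_impacting: bool = False,
    sla_breach_imminent: bool = False,
) -> str:
    """Look up the base severity, then apply the auto-escalation triggers."""
    sev = MATRIX[failure][impact]
    if data_loss or pii_breach:
        sev = 1
    if revenue_impacting:
        sev = min(sev, 2)  # assumption: revenue impact means at least SEV2
    if sla_breach_imminent:
        sev = max(sev - 1, 1)  # escalate one level, never past SEV1
    return f"SEV{sev}"
```

Keeping the matrix as data rather than nested conditionals makes it easy to review against the table when the policy changes.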

Incident Command System (ICS)

| Role | Responsibility | Assigned |
|------|---------------|----------|
| Incident Commander (IC) | Owns resolution, makes decisions, manages timeline | |
| Communications Lead | Status updates, stakeholder comms, customer-facing | |
| Operations Lead | Hands-on-keyboard, executing fixes | |
| Subject Matter Expert | Deep knowledge of affected system | |
| Scribe | Documenting timeline, actions, decisions | |

IC Rules:

IC does NOT debug — IC coordinates
IC makes final decisions when team disagrees
IC can escalate severity at any time
IC owns handoff if rotation changes
IC calls end-of-incident

Incident Response Workflow

DETECT → TRIAGE → RESPOND → MITIGATE → RESOLVE → REVIEW

Step 1: DETECT (0-5 min)
├── Alert fires OR user report received
├── On-call acknowledges within SLA
└── Quick assessment: is this real? What severity?

Step 2: TRIAGE (5-15 min)
├── Classify severity using matrix above
├── Assign IC and roles
├── Open incident channel (#inc-YYYY-MM-DD-title)
├── Post initial status update
└── Start timeline document

Step 3: RESPOND (15 min - ongoing)
├── IC briefs team: "Here's what we know, here's what we don't"
├── Operations Lead begins investigation
├── Check: recent deployments? Config changes? Dependency issues?
├── Parallel investigation tracks if needed
└── 15-minute check-ins for SEV1, 30-min for SEV2

Step 4: MITIGATE (ASAP)
├── Priority: STOP THE BLEEDING
├── Options (fastest first):
│   ├── Rollback last deployment
│   ├── Feature flag disable
│   ├── Traffic shift / failover
│   ├── Scale up / circuit breaker
│   └── Manual data fix
├── Mitigated ≠ Resolved — temporary fix is OK
└── Update status: "Impact mitigated, root cause investigation ongoing"

Step 5: RESOLVE
├── Root cause identified and fixed
├── Verification: SLIs back to normal for 30+ minutes
├── All-clear communicated
└── IC declares incident resolved

Step 6: REVIEW (within 5 business days)
├── Blameless postmortem written
├── Action items assigned with owners and deadlines
├── Postmortem review meeting
└── Action items tracked to completion

Communication Templates

Initial notification (internal):

🔴 INCIDENT: [Title]
Severity: SEV[X]
Impact: [Who/what is affected]
Status: Investigating
IC: [Name]
Channel: #inc-[date]-[slug]
Next update: [time]

Customer-facing status:

[Service] - Investigating increased error rates

We are currently investigating reports of [symptom]. 
Some users may experience [user-visible impact].
Our team is actively working on a resolution.
We will provide an update within [time].

Resolution notification:

✅ RESOLVED: [Title]
Duration: [X hours Y minutes]
Impact: [Summary]
Root cause: [One sentence]
Postmortem: [Link] (within 5 business days)

Phase 6: Postmortem Framework

Blameless Postmortem Template

postmortem:
  title: ""
  date: ""
  severity: ""  # SEV1-4
  duration: ""  # total incident duration
  authors: []
  reviewers: []
  status: "draft"  # draft | in-review | final
  
  summary: |
    One paragraph: what happened, what was the impact, how was it resolved.
  
  impact:
    users_affected: 0
    duration_minutes: 0
    revenue_impact_usd: 0
    slo_budget_consumed_pct: 0
    data_loss: false
    customer_tickets: 0
  
  timeline:
    - time: ""
      event: ""
      # Chronological, every significant event
      # Include detection time, escalation, mitigation attempts
  
  root_cause: |
    Technical explanation of WHY it happened.
    Go deep — surface causes are not root causes.
  
  contributing_factors:
    - ""  # What made it worse or delayed resolution?
  
  detection:
    how_detected: ""  # alert | user report | manual check
    time_to_detect_minutes: 0
    could_have_detected_sooner: ""
  
  resolution:
    how_resolved: ""
    time_to_mitigate_minutes: 0
    time_to_resolve_minutes: 0
  
  what_went_well:
    - ""  # Explicitly call out what worked
  
  what_went_wrong:
    - ""
  
  where_we_got_lucky:
    - ""  # Things that could have made it worse
  
  action_items:
    - id: "AI-001"
      type: ""  # prevent | detect | mitigate | process
      description: ""
      owner: ""
      priority: ""  # P0 | P1 | P2
      deadline: ""
      status: "open"  # open | in-progress | done
      ticket: ""

Root Cause Analysis Methods

Five Whys (simple incidents):

Why did users see errors? → API returned 500s
Why did API return 500s? → Database connection pool exhausted
Why was pool exhausted? → Long-running query held connections
Why was query long-running? → Missing index on new column
Why was index missing? → Migration didn't include index; no query performance review in CI

→ Root cause: No automated query performance check in deployment pipeline
→ Action: Add query plan analysis to CI for migration PRs

Fishbone / Ishikawa (complex incidents):

Categories to investigate:
├── People: Training? Fatigue? Communication?
├── Process: Runbook? Escalation? Change management?
├── Technology: Bug? Config? Capacity? Dependency?
├── Environment: Network? Cloud provider? Third party?
├── Monitoring: Detection gap? Alert fatigue? Dashboard gap?
└── Testing: Test coverage? Load testing? Chaos testing?

Contributing Factor Categories:

| Category | Questions |
|----------|-----------|
| Trigger | What change or event started it? |
| Propagation | Why did it spread? Why wasn't it contained? |
| Detection | Why wasn't it caught earlier? |
| Resolution | What slowed the fix? |
| Process | What process gaps contributed? |

Postmortem Review Meeting (60 min)

1. Timeline walk-through (15 min)
   - Author presents chronology
   - Attendees add context ("I remember seeing X at this point")

2. Root cause deep-dive (15 min)  
   - Do we agree on root cause?
   - Are there additional contributing factors?

3. Action item review (20 min)
   - Are these the RIGHT actions?
   - Are they prioritized correctly?
   - Do owners agree on deadlines?

4. Process improvements (10 min)
   - Could we have detected this sooner?
   - Could we have resolved this faster?
   - What would have prevented this entirely?

Phase 7: Chaos Engineering

Chaos Maturity Model

| Level | Name | Activities |
|-------|------|-----------|
| 0 | None | No chaos testing |
| 1 | Exploratory | Manual fault injection in staging |
| 2 | Systematic | Scheduled chaos experiments in staging |
| 3 | Production | Controlled chaos in production (Game Days) |
| 4 | Continuous | Automated chaos in production with safety controls |

Chaos Experiment Template

experiment:
  name: ""
  hypothesis: "When [fault], the system will [expected behavior]"
  
  steady_state:
    metrics:
      - name: ""
        baseline: ""
        acceptable_range: ""
  
  method:
    fault_type: ""  # network | compute | storage | dependency | data
    target: ""      # which service/component
    blast_radius: ""  # single pod | single AZ | percentage of traffic
    duration: ""
    
  safety:
    abort_conditions:
      - "SLO burn rate exceeds 10x"
      - "Customer-visible errors detected"
      - "Alert fires that we didn't expect"
    rollback_plan: ""
    required_approvals: []
    
  results:
    outcome: ""  # confirmed | disproved | inconclusive
    observations: []
    action_items: []
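
The abort conditions in the template lend themselves to an automated check that runs while the fault is active. The following is a sketch only: the `Observation` type and `abort_reasons` helper are hypothetical, and how the observation is collected depends on the monitoring stack in use.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Observation:
    slo_burn_rate: float          # current burn rate, as a multiple of budget
    customer_visible_errors: int  # errors seen by real users during the run
    fired_alerts: Set[str]        # names of alerts that fired during the run


def abort_reasons(
    obs: Observation,
    expected_alerts: Set[str],
    max_burn_rate: float = 10.0,
) -> List[str]:
    """Return every abort condition that currently holds (empty means continue)."""
    reasons = []
    if obs.slo_burn_rate > max_burn_rate:
        reasons.append(f"SLO burn rate {obs.slo_burn_rate:g}x exceeds {max_burn_rate:g}x")
    if obs.customer_visible_errors > 0:
        reasons.append("Customer-visible errors detected")
    unexpected = sorted(obs.fired_alerts - expected_alerts)
    if unexpected:
        reasons.append("Unexpected alerts: " + ", ".join(unexpected))
    return reasons
```

Returning all matching reasons, rather than stopping at the first, gives the results section a complete record of why an experiment was aborted.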

Chaos Experiment Library

| Category | Experiment | Validates |
|----------|-----------|-----------|
| Network | Add 200ms latency to DB calls | Timeout handling, circuit breakers |
| Network | Drop 5% of packets to downstream | Retry logic, error handling |
| Network | DNS resolution failure | Caching, fallback, error messages |
| Compute | Kill random pod every 10 min | Auto-restart, load balancing |
| Compute | CPU stress to 95% on 1 node | Auto-scaling, graceful degradation |
| Compute | Fill disk to 95% | Disk monitoring, log rotation, alerts |
| Storage | Increase DB latency 5x | Connection pool handling, timeouts |
| Storage | Simulate cache failure (Redis down) | Cache-aside pattern, DB fallback |
| Dependency | Block external API (payment provider) | Circuit breaker, queuing, retry |
| Dependency | Return 429s from auth service | Rate limit handling, backoff |
| Data | Clock skew on subset of nodes | Timestamp handling, ordering |
| Scale | 10x traffic spike over 5 minutes | Auto-scaling speed, queue depth |

Game Day Runbook

PRE-GAME (1 week before):
□ Experiment designed and reviewed
□ Steady-state metrics identified
□ Abort conditions defined
□ All participants briefed
□ Rollbacks tested in staging
□ Stakeholders notified

GAME DAY:
□ Verify steady state (15 min baseline)
□ Announce in #engineering: "Chaos Game Day starting"
□ Inject fault
□ Observe and document
□ If abort condition hit → rollback immediately
□ Run for planned duration
□ Remove fault
□ Verify recovery to steady state

POST-GAME (same day):
□ Results documented
□ Surprises noted
□ Action items created
□ Share findings in team meeting

Phase 8: Toil Management

Toil Identification

Definition: Work that is manual, repetitive, automatable, tactical, without enduring value, and scales linearly with service growth.

Toil Inventory Template

toil_item:
  name: ""
  category: ""  # deployment | scaling | config | data | access | monitoring | recovery
  frequency: ""  # daily | weekly | monthly | per-incident
  time_per_occurrence_min: 0
  occurrences_per_month: 0
  total_hours_per_month: 0
  teams_affected: []
  automation_difficulty: ""  # low | medium | high
  automation_value: 0  # hours saved per month
  priority_score: 0  # value / difficulty

Toil Reduction Priority Matrix

| | Low Effort | Medium Effort | High Effort |
|-|-----------|--------------|-------------|
| High Value (>10 hrs/mo) | DO FIRST | DO SECOND | PLAN |
| Med Value (2-10 hrs/mo) | DO SECOND | PLAN | EVALUATE |
| Low Value (<2 hrs/mo) | QUICK WIN | SKIP | SKIP |
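
A small sketch of applying the matrix to a toil inventory item. It assumes `automation_difficulty` from the inventory template maps onto the effort columns, and that exactly 10 hrs/mo counts as medium value; the function names are illustrative.

```python
def monthly_hours(time_per_occurrence_min: float, occurrences_per_month: float) -> float:
    """Total hours per month spent on a toil item."""
    return time_per_occurrence_min * occurrences_per_month / 60


def value_band(hours_per_month: float) -> str:
    """Bucket monthly hours into the value rows of the matrix."""
    if hours_per_month > 10:
        return "high"
    if hours_per_month >= 2:
        return "medium"
    return "low"


# Cells copied from the priority matrix: (value, effort) -> recommendation.
PRIORITY = {
    ("high", "low"): "DO FIRST",
    ("high", "medium"): "DO SECOND",
    ("high", "high"): "PLAN",
    ("medium", "low"): "DO SECOND",
    ("medium", "medium"): "PLAN",
    ("medium", "high"): "EVALUATE",
    ("low", "low"): "QUICK WIN",
    ("low", "medium"): "SKIP",
    ("low", "high"): "SKIP",
}


def toil_priority(hours_per_month: float, effort: str) -> str:
    """Return the matrix recommendation for a toil item."""
    return PRIORITY[(value_band(hours_per_month), effort)]
```

For instance, a 30-minute task done 40 times a month costs 20 hours, so with low automation effort it lands in DO FIRST.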

Common Toil Targets (Ranked by Impact)

Manual deployments → CI/CD pipeline + GitOps
Access provisioning → Self-service + auto-approval for low-risk
Certificate renewals → Auto-renewal (cert-manager, Let's Encrypt)
Scaling decisions → HPA + predictive auto-scaling
Log investigation → Structured logging + correlation + dashboards
Data fixes → Self-service admin tools + validation at ingestion
Config changes → Config-as-code + automated rollout
Incident response → Automated runbooks for known issues
Capacity reporting → Automated dashboards + forecasting
On-call triage → Noise reduction + auto-remediation for known patterns

Toil Budget Rule

Target: <25% of SRE time spent on toil. Track monthly. If above 25%, prioritize automation over all feature work.

Phase 9: Capacity Planning

Capacity Model Template

capacity_model:
  service: ""
  bottleneck_resource: ""  # CPU | memory | storage | connections | bandwidth
  
  current_state:
    peak_utilization_pct: 0
    headroom_pct: 0
    cost_per_month_usd: 0
    
  growth_forecast:
    metric: ""  # MAU | requests/sec | storage_gb
    current: 0
    monthly_growth_pct: 0
    projected_6mo: 0
    projected_12mo: 0
    
  scaling_strategy:
    type: ""  # horizontal | vertical | hybrid
    auto_scaling: true
    min_instances: 0
    max_instances: 0
    scale_up_threshold: 80  # % utilization
    scale_down_threshold: 30
    cooldown_seconds: 300
    
  cost_projection:
    current_monthly: 0
    projected_6mo_monthly: 0
    projected_12mo_monthly: 0
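
The `projected_6mo` and `projected_12mo` fields can be filled in with compound growth. The sketch below assumes a constant monthly growth rate and that utilization of the bottleneck resource scales in proportion to the growth metric; both are simplifications, and the function names are hypothetical.

```python
import math
from typing import Optional


def project(current: float, monthly_growth_pct: float, months: int) -> float:
    """Compound `current` forward by `months` at a fixed monthly growth rate."""
    return current * (1 + monthly_growth_pct / 100) ** months


def months_until_threshold(
    peak_utilization_pct: float,
    monthly_growth_pct: float,
    threshold_pct: float = 80.0,
) -> Optional[int]:
    """Whole months until peak utilization reaches the scale-up threshold.

    Returns None when there is no growth, and 0 when already at the threshold.
    """
    if peak_utilization_pct >= threshold_pct:
        return 0
    if monthly_growth_pct <= 0:
        return None
    growth_factor = 1 + monthly_growth_pct / 100
    return math.ceil(math.log(threshold_pct / peak_utilization_pct) / math.log(growth_factor))
```

As an example, a service at 40% peak utilization growing 10% per month reaches the 80% threshold in 8 months under these assumptions, which leaves time for at least two quarterly capacity reviews.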

Capacity Planning Cadence

| Frequency | Action |
|-----------|--------|
| Daily | Review auto-scaling events, check for anomalies |
| Weekly | Review utilization trends, spot-check headroom |
| Monthly | Update growth model, review cost projections |
| Quarterly | Full capacity review, budget planning, architecture check |
| Pre-launch | Load test to 2x expected peak, verify scaling |

Load Testing Benchmarks

| Scenario | Method | Duration | Target |
|----------|--------|----------|--------|
| Baseline | Steady load at current peak | 30 min | Establish metrics |
| Growth | 2x current peak | 15 min | Verify scaling works |
| Spike | 10x normal in 60 seconds | 5 min | Circuit breakers hold |
| Soak | 1.5x normal load | 4 hours | No memory leaks, degradation |
| Stress | Ramp until failure | Until break | Find actual limits |

Phase 10: On-Call Excellence

On-Call Health Metrics

| Metric | Healthy | Warning | Critical |
|--------|---------|---------|----------|
| Pages per shift | <2 | 2-5 | >5 |
| Off-hours pages | <1/week | 1-3/week | >3/week |
| Time to acknowledge | <5 min | 5-15 min | >15 min |
| Time to mitigate | <30 min | 30-60 min | >60 min |
| False positive rate | <10% | 10-30% | >30% |
| Escalation rate | <20% | 20-40% | >40% |
| On-call satisfaction | >4/5 | 3-4/5 | <3/5 |

On-Call Rotation Best Practices

Minimum rotation size: 5 people (one week on, four weeks off)
No back-to-back weeks unless team is too small (fix the team size)
Follow-the-sun for global teams (no one is paged at 3 AM if avoidable)
Primary + secondary on-call always
Handoff document at rotation change — open issues, recent deploys, known risks
Compensation — on-call pay, time off in lieu, or equivalent

On-Call Handoff Template

## On-Call Handoff: [Date]

### Open Issues
- [Issue]: [Status, next steps]

### Recent Changes (last 7 days)
- [Deployment/config change]: [Risk level, rollback plan]

### Known Risks
- [Event/condition]: [What to watch for]

### Scheduled Maintenance
- [When]: [What, duration, rollback plan]

### Runbook Updates
- [Any new/updated runbooks since last rotation]

Runbook Template

runbook:
  title: ""
  alert_name: ""  # exact alert that triggers this
  last_updated: ""
  owner: ""
  
  overview: |
    What this alert means in plain English.
    
  impact: |
    What users/systems are affected and how.
    
  diagnosis:
    - step: "Check service health"
      command: ""
      expected: ""
      if_unexpected: ""
    - step: "Check recent deployments"
      command: ""
      expected: ""
      if_unexpected: "Rollback: [command]"
    - step: "Check dependencies"
      command: ""
      expected: ""
      if_unexpected: ""
      
  mitigation:
    - option: "Rollback"
      when: "Recent deployment suspected"
      steps: []
    - option: "Scale up"
      when: "Traffic spike"
      steps: []
    - option: "Failover"
      when: "Single component failure"
      steps: []
      
  escalation:
    after_minutes: 30
    contact: ""
    context_to_provide: ""

Phase 11: Reliability Review & Governance

Weekly SRE Review (30 min)

1. SLO Status (5 min)
   - Budget remaining per service
   - Any burn rate alerts this week?

2. Incident Review (10 min)
   - Incidents this week: count, severity, duration
   - Open postmortem action items: status check

3. On-Call Health (5 min)
   - Pages this week (total, off-hours, false positives)
   - Any on-call feedback?

4. Reliability Work (10 min)
   - Automation shipped this week
   - Toil reduced (hours saved)
   - Chaos experiments run
   - Capacity concerns

Monthly Reliability Report

monthly_report:
  period: ""
  
  slo_summary:
    services_meeting_slo: 0
    services_breaching_slo: 0
    worst_performing: ""
    
  incidents:
    total: 0
    by_severity: { SEV1: 0, SEV2: 0, SEV3: 0, SEV4: 0 }
    mttr_minutes: 0
    mttd_minutes: 0
    repeat_incidents: 0
    
  error_budget:
    services_in_healthy: 0
    services_in_warning: 0
    services_in_critical: 0
    services_exhausted: 0
    
  toil:
    hours_spent: 0
    hours_automated_away: 0
    pct_of_sre_time: 0
    
  on_call:
    total_pages: 0
    off_hours_pages: 0
    false_positive_pct: 0
    avg_ack_time_min: 0
    
  action_items:
    open: 0
    completed_this_month: 0
    overdue: 0
    
  highlights: []
  concerns: []
  next_month_priorities: []

Production Readiness Review Checklist

Before any new service goes to production:

| Category | Check | Status |
|----------|-------|--------|
| SLOs | SLIs defined and measured | |
| SLOs | SLO targets set with stakeholder agreement | |
| SLOs | Error budget policy documented | |
| Monitoring | Golden signals dashboarded | |
| Monitoring | Alerting configured with runbooks | |
| Monitoring | Structured logging implemented | |
| Monitoring | Distributed tracing enabled | |
| Incidents | On-call rotation established | |
| Incidents | Escalation paths documented | |
| Incidents | Runbooks for top 5 failure modes | |
| Capacity | Load tested to 2x expected peak | |
| Capacity | Auto-scaling configured and tested | |
| Capacity | Resource limits set (CPU, memory) | |
| Resilience | Graceful degradation implemented | |
| Resilience | Circuit breakers for dependencies | |
| Resilience | Retry with exponential backoff | |
| Resilience | Timeout configured for all external calls | |
| Deploy | Rollback tested and documented | |
| Deploy | Canary/blue-green deployment ready | |
| Deploy | Feature flags for risky features | |
| Security | Authentication and authorization | |
| Security | Secrets in vault (not env vars) | |
| Security | Dependencies scanned | |
| Data | Backup and restore tested | |
| Data | Data retention policy defined | |
| Docs | Architecture diagram current | |
| Docs | API documentation published | |
| Docs | Operational runbook complete | |

Phase 12: Advanced Patterns

Self-Healing Automation

auto_remediation:
  - trigger: "pod_crash_loop"
    condition: "restart_count > 3 in 10 min"
    action: "Delete pod, let scheduler reschedule"
    escalate_if: "Still crashing after 3 auto-remediations"
    
  - trigger: "disk_usage_high"
    condition: "disk_usage > 85%"
    action: "Run log cleanup script, archive old data"
    escalate_if: "Still above 85% after cleanup"
    
  - trigger: "connection_pool_exhausted"
    condition: "available_connections = 0"
    action: "Kill idle connections, increase pool temporarily"
    escalate_if: "Pool exhausted again within 1 hour"
    
  - trigger: "certificate_expiring"
    condition: "days_until_expiry < 14"
    action: "Trigger cert renewal"
    escalate_if: "Renewal fails"

Multi-Region Reliability

| Strategy | Complexity | RTO | Cost |
|----------|-----------|-----|------|
| Active-passive | Low | Minutes | 1.5x |
| Active-active read | Medium | Seconds | 1.8x |
| Active-active full | High | Near-zero | 2-3x |
| Cell-based | Very high | Per-cell | 2-4x |

Decision guide:

SLO < 99.9% → Single region with good backups
SLO 99.9-99.95% → Active-passive with automated failover
SLO 99.95-99.99% → Active-active (read or full)
SLO > 99.99% → Cell-based architecture

Reliability Culture Indicators

Healthy signals:

Postmortems are blameless and well-attended
Error budgets are respected (feature freeze actually happens)
On-call is shared fairly and compensated
Toil is tracked and reducing quarter-over-quarter
Chaos experiments happen regularly
Teams own their reliability (not just SRE)

Warning signs:

"Hero culture" — same person always saves the day
Postmortems are blame-focused or skipped
Error budget exhaustion doesn't change behavior
On-call is dreaded, same 2 people always paged
"We'll fix reliability after this feature ships" (always)
SRE team is just an ops team with a new name

Quality Scoring Rubric (0-100)

| Dimension | Weight | 0-2 | 3-4 | 5 |
|-----------|--------|-----|-----|---|
| SLO Coverage | 20% | No SLOs | SLOs for critical services | All services with SLOs, error budgets, reviews |
| Monitoring | 15% | Basic health checks | Golden signals + dashboards | Full observability stack + anomaly detection |
| Incident Response | 15% | Ad-hoc, no process | ICS roles, runbooks, postmortems | Structured ICS, blameless culture, action tracking |
| Automation | 15% | Manual everything | CI/CD + some automation | Self-healing, GitOps, <25% toil |
| Chaos Engineering | 10% | None | Staging experiments | Continuous production chaos with safety |
| Capacity Planning | 10% | Reactive | Quarterly forecasting | Predictive, auto-scaling, cost-optimized |
| On-Call Health | 10% | Burnout, hero culture | Fair rotation, <5 pages/shift | Balanced, compensated, <2 pages/shift |
| Documentation | 5% | Nothing written | Runbooks exist | Complete, current, tested runbooks |
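
The rubric does not spell out how the per-dimension scores combine into the 0-100 total. One plausible reading, sketched below, scales each 0-5 score by its weight; the formula, the dictionary keys and the function name are assumptions, while the weights are copied from the table.

```python
from typing import Dict

# Weights (in percent) copied from the rubric; they sum to 100.
WEIGHTS = {
    "slo_coverage": 20,
    "monitoring": 15,
    "incident_response": 15,
    "automation": 15,
    "chaos_engineering": 10,
    "capacity_planning": 10,
    "on_call_health": 10,
    "documentation": 5,
}


def quality_score(scores: Dict[str, int]) -> float:
    """Combine 0-5 dimension scores into a weighted 0-100 total."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= scores[name] <= 5:
            raise ValueError(f"{name}: score must be between 0 and 5")
    return sum(WEIGHTS[name] * scores[name] / 5 for name in WEIGHTS)
```

Under this reading, a team scoring 5 on SLO Coverage and 0 everywhere else would total 20, which reflects how heavily the rubric weights SLOs.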

Natural Language Commands

"Assess reliability for [service]" → Run maturity assessment
"Define SLOs for [service]" → Walk through SLI selection + SLO setting
"Check error budget for [service]" → Calculate current budget status
"Start incident for [description]" → Create incident channel, assign IC, begin workflow
"Write postmortem for [incident]" → Generate structured postmortem
"Plan chaos experiment for [service]" → Design experiment with hypothesis
"Audit toil for [team]" → Inventory and prioritize toil
"Review on-call health" → Analyze page volume, satisfaction, fairness
"Production readiness review for [service]" → Run full checklist
"Monthly reliability report" → Generate comprehensive report
"Design runbook for [alert]" → Create structured runbook
"Plan capacity for [service] growing at [X%]" → Build capacity model

SRE & Incident Management Platform

Complete Site Reliability Engineering system — from SLO definition through incident response, chaos engineering, and operational excellence. Zero dependencies.

Phase 1: Reliability Assessment

Before building anything, assess where you are.

Service Catalog Entry

service:
  name: ""
  tier: ""  # critical | important | standard | experimental
  owner_team: ""
  oncall_rotation: ""
  dependencies:
    upstream: []    # services we call
    downstream: []  # services that call us
  data_classification: ""  # public | internal | confidential | restricted
  deployment_frequency: ""  # daily | weekly | biweekly | monthly
  architecture: ""  # monolith | microservice | serverless | hybrid
  language: ""
  infra: ""  # k8s | ECS | Lambda | VM | bare-metal
  traffic_pattern: ""  # steady | diurnal | spiky | seasonal
  peak_rps: 0
  storage_gb: 0
  monthly_cost_usd: 0

Maturity Assessment (Score 1-5 per dimension)

Score interpretation:

8-16: Firefighting mode — start with SLOs + incident process
17-24: Foundation built — add chaos engineering + toil reduction
25-32: Maturing — optimize error budgets + capacity planning
33-40: Advanced — focus on predictive reliability + culture
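
The score bands above can be applied mechanically. Below is a minimal sketch, assuming eight dimensions scored 1-5 each (which is what the 8-40 range implies). The names `STAGES` and `maturity_stage` are illustrative and not part of the platform.

```python
# Upper bound of each score band and its label, taken from the list above.
STAGES = [
    (16, "Firefighting mode"),
    (24, "Foundation built"),
    (32, "Maturing"),
    (40, "Advanced"),
]

def maturity_stage(scores: dict) -> tuple:
    """Sum per-dimension scores and return (total, stage label)."""
    if len(scores) != 8:
        raise ValueError("expected 8 dimension scores")
    if any(not 1 <= s <= 5 for s in scores.values()):
        raise ValueError("each score must be between 1 and 5")
    total = sum(scores.values())
    for upper, label in STAGES:
        if total <= upper:
            return total, label
    raise AssertionError("unreachable: total is at most 40")
```

A team scoring 3 on every dimension totals 24, which lands in the "Foundation built" band.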

Phase 2: SLI/SLO Framework

SLI Selection by Service Type

SLI Specification Template

sli:
  name: "request_success_rate"
  description: "Proportion of valid requests served successfully"
  type: "availability"  # availability | latency | quality | freshness
  measurement:
    good_events: "Valid requests (after exclusions) with HTTP status < 500"
    total_events: "All valid HTTP requests (after exclusions)"
    source: "load balancer access logs"
    aggregation: "sum(good) / sum(total) over rolling 28-day window"
  exclusions:
    - "Health check endpoints (/healthz, /readyz)"
    - "Synthetic monitoring traffic"
    - "Requests from blocked IPs"
    - "4xx responses (client errors)"
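
To make the ratio concrete, here is a hedged sketch of how the SLI could be computed from request records. The `Request` shape, its `synthetic` and `blocked_ip` flags, and the function names are assumptions for illustration; a real pipeline would read these from load balancer access logs. Exclusions are applied to both numerator and denominator so the ratio never exceeds 1.

```python
from dataclasses import dataclass
from typing import List, Optional

EXCLUDED_PATHS = {"/healthz", "/readyz"}  # health check endpoints

@dataclass
class Request:
    path: str
    status: int
    synthetic: bool = False   # synthetic monitoring traffic (assumed flag)
    blocked_ip: bool = False  # request from a blocked IP (assumed flag)

def is_valid(req: Request) -> bool:
    """Apply the exclusions listed in the SLI specification."""
    if req.path in EXCLUDED_PATHS or req.synthetic or req.blocked_ip:
        return False
    return not 400 <= req.status < 500  # 4xx are client errors

def request_success_rate(requests: List[Request]) -> Optional[float]:
    """good / total over valid requests; None when there is no valid traffic."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return None
    good = sum(1 for r in valid if r.status < 500)
    return good / len(valid)
```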

SLO Target Selection Guide

Rules for setting targets:

Start lower than you think — you can always tighten
Set the SLO stricter than the SLA (always keep a buffer, typically a 0.1-0.5% margin; for example SLA 99.9%, SLO 99.95%)
Set the internal SLO stricter than the external SLO (catch problems before customers do)
Each nine costs ~10x more to achieve
If you can't measure it, you can't SLO it

SLO Document Template

slo:
  service: ""
  sli: ""
  target: 99.9  # percentage
  window: "28d"  # rolling window
  error_budget: 0.1  # 100% - target
  error_budget_minutes: 40.32  # 0.1% of the 40,320 minutes in a 28-day window
  
  burn_rate_alerts:
    - name: "fast_burn"
      burn_rate: 14.4  # exhausts the 28-day budget in ~47 hours
      short_window: "5m"
      long_window: "1h"
      severity: "page"
    - name: "medium_burn"
      burn_rate: 6.0   # exhausts the 28-day budget in ~4.7 days
      short_window: "30m"
      long_window: "6h"
      severity: "page"
    - name: "slow_burn"
      burn_rate: 1.0   # exhausts budget in 28 days
      short_window: "6h"
      long_window: "3d"
      severity: "ticket"
  
  review_cadence: "monthly"
  owner: ""
  stakeholders: []
  
  escalation_when_budget_exhausted:
    - "Halt non-critical deployments"
    - "Redirect engineering to reliability work"
    - "Escalate to VP Engineering if no improvement in 48h"
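
The budget figure and the two-window alert condition in the template can be checked with a few lines of code. This is a sketch under stated assumptions: error rates are passed in as ratios between 0.0 and 1.0, and the function names are hypothetical rather than an existing API.

```python
WINDOW_MINUTES = 28 * 24 * 60  # 40,320 minutes in the 28-day window

def error_budget_minutes(target_pct: float) -> float:
    """Minutes of total outage the SLO tolerates per window."""
    return WINDOW_MINUTES * (100.0 - target_pct) / 100.0

def alert_fires(short_rate: float, long_rate: float,
                target_pct: float, threshold: float) -> bool:
    """Fire only when BOTH windows burn at or above the threshold.

    short_rate and long_rate are observed error ratios (0.0-1.0) over the
    short and long windows. Requiring both keeps a brief spike from paging
    and lets the alert clear quickly once the short window recovers.
    """
    allowed = (100.0 - target_pct) / 100.0
    return min(short_rate, long_rate) / allowed >= threshold
```

For a 99.9% target, `error_budget_minutes(99.9)` is about 40.32. With the `fast_burn` threshold of 14.4, a 2% error rate over 5 minutes combined with 1.6% over 1 hour would page, while 2% over 5 minutes with only 0.5% over 1 hour would not.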

Phase 3: Error Budget Management

Error Budget Policy

error_budget_policy:
  service: ""
  
  budget_states:
    healthy:
      condition: "remaining_budget > 50%"
      actions:
        - "Normal development velocity"
        - "Feature work prioritized"
        - "Chaos experiments allowed"
    
    warning:
      condition: "remaining_budget 25-50%"
      actions:
        - "Increase monitoring scrutiny"
        - "Review recent changes for risk"
        - "Limit risky deployments to business hours"
        - "No chaos experiments"
    
    critical:
      condition: "remaining_budget 0-25%"
      actions:
        - "Feature freeze — reliability work only"
        - "All deployments require SRE approval"
        - "Mandatory rollback plan for every change"
        - "Daily error budget review"
    
    exhausted:
      condition: "remaining_budget <= 0"
      actions:
        - "Complete deployment freeze"
        - "All engineering redirected to reliability"
        - "VP Engineering notified"
        - "Postmortem required for budget exhaustion"
        - "Freeze maintained until budget recovers to 10%"
  
  exceptions:
    - "Security patches always allowed"
    - "Regulatory compliance changes always allowed"
    - "Data loss prevention always allowed"
  
  reset: "Rolling 28-day window (no manual resets)"
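
The state thresholds translate directly into a small classifier. The sketch below assumes that the shared boundaries (exactly 25% and 50%) fall into the stricter state, which the policy does not specify, and models the "freeze until budget recovers to 10%" rule as hysteresis. Function names are illustrative.

```python
def budget_state(remaining_pct: float) -> str:
    """Map remaining error budget (percent) to a policy state."""
    if remaining_pct <= 0:
        return "exhausted"
    if remaining_pct <= 25:   # assumption: 25% counts as critical
        return "critical"
    if remaining_pct <= 50:   # assumption: 50% counts as warning
        return "warning"
    return "healthy"

def deployment_freeze(remaining_pct: float, currently_frozen: bool) -> bool:
    """Freeze starts at exhaustion and lifts only once budget is back to 10%."""
    if remaining_pct <= 0:
        return True
    if currently_frozen:
        return remaining_pct < 10
    return False
```

The hysteresis matters: a service at 5% remaining budget is frozen only if it got there by recovering from exhaustion, not if it is still on the way down.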

Burn Rate Calculation

Burn rate = (error rate observed) / (error rate allowed by SLO)

Example:
- SLO: 99.9% (error budget = 0.1%)
- Current error rate: 0.5%
- Burn rate = 0.5% / 0.1% = 5x

At 5x burn rate → budget exhausted in 28d / 5 = 5.6 days
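
The same arithmetic as a small helper, as a sketch. It assumes a constant error rate and the 28-day window used throughout this document; the function names are illustrative.

```python
def burn_rate(observed_error_pct: float, slo_target_pct: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed_error_pct = 100.0 - slo_target_pct
    if allowed_error_pct <= 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return observed_error_pct / allowed_error_pct

def days_to_exhaustion(rate: float, window_days: float = 28.0) -> float:
    """Days until a full budget is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")  # no errors, budget is never exhausted
    return window_days / rate

rate = burn_rate(0.5, 99.9)       # about 5
days = days_to_exhaustion(rate)   # about 5.6
```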

Error Budget Dashboard

Track weekly:

Phase 4: Monitoring & Alerting Architecture

Four Golden Signals

USE Method (Infrastructure)

For every resource, track:

Utilization: % of capacity used (0-100%)
Saturation: queue depth / wait time (0 = no waiting)
Errors: error count / error rate

RED Method (Services)

For every service, track:

Rate: requests per second
Errors: failed requests per second
Duration: latency distribution

Alert Design Rules

Every alert must have a runbook link — no exceptions
Every alert must be actionable — if you can't act on it, delete it
Symptoms over causes — alert on "users can't check out" not "database CPU high"
Multi-window, multi-burn-rate — avoid single-threshold alerts
Page only for customer impact — everything else is a ticket
Alert fatigue = death — review alert volume monthly; target <5 pages/week per service

Alert Severity Guide

Structured Log Standard

{
  "timestamp": "2026-02-17T11:24:00.000Z",
  "level": "error",
  "service": "payment-api",
  "trace_id": "abc123",
  "span_id": "def456",
  "message": "Payment processing failed",
  "error_type": "TimeoutException",
  "error_message": "Gateway timeout after 30s",
  "http_method": "POST",
  "http_path": "/api/v1/payments",
  "http_status": 504,
  "duration_ms": 30012,
  "customer_id": "cust_xxx",
  "payment_id": "pay_yyy",
  "amount_cents": 4999,
  "retry_count": 2,
  "environment": "production",
  "host": "payment-api-7b4d9-xk2p1",
  "region": "us-east-1"
}
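
One way to emit logs in this shape is a custom formatter on top of Python's standard `logging` module. This is a minimal sketch, not a prescribed implementation: the `JsonFormatter` class and the convention of passing per-event fields under `extra={"fields": {...}}` are assumptions made for this example.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str, environment: str):
        super().__init__()
        self.static = {"service": service, "environment": environment}

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            **self.static,
            # Per-event fields supplied via extra={"fields": {...}} (assumed convention).
            **getattr(record, "fields", {}),
        }
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("payment-api", "production"))
logger = logging.getLogger("payment-api")
logger.addHandler(handler)
logger.error("Payment processing failed",
             extra={"fields": {"http_status": 504, "duration_ms": 30012}})
```

Fields such as `trace_id` and `span_id` would normally come from the tracing library in use; they are left out here to avoid implying a specific one.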

Phase 5: Incident Response Framework

Severity Classification Matrix

Auto-escalation triggers:

Any data loss → SEV1 minimum
Security breach with PII → SEV1
Revenue-impacting → SEV1 or SEV2
SLA breach imminent → auto-escalate one level
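
These triggers can be expressed as a function that only ever raises severity. The sketch assumes severities run from SEV1 (most severe) to SEV4, as in the postmortem template, and reads "SEV1 or SEV2" as "at least SEV2". The function and parameter names are hypothetical.

```python
def escalate(base_sev: int, *, data_loss: bool = False, pii_breach: bool = False,
             revenue_impacting: bool = False,
             sla_breach_imminent: bool = False) -> int:
    """Return the severity (1 = most severe) after applying the triggers."""
    sev = base_sev
    if data_loss or pii_breach:
        sev = min(sev, 1)          # SEV1 minimum
    if revenue_impacting:
        sev = min(sev, 2)          # never less severe than SEV2
    if sla_breach_imminent:
        sev = max(sev - 1, 1)      # escalate one level, capped at SEV1
    return sev
```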

Incident Command System (ICS)

IC Rules:

IC does NOT debug — IC coordinates
IC makes final decisions when team disagrees
IC can escalate severity at any time
IC owns handoff if rotation changes
IC calls end-of-incident

Incident Response Workflow

DETECT → TRIAGE → RESPOND → MITIGATE → RESOLVE → REVIEW

Step 1: DETECT (0-5 min)
├── Alert fires OR user report received
├── On-call acknowledges within SLA
└── Quick assessment: is this real? What severity?

Step 2: TRIAGE (5-15 min)
├── Classify severity using matrix above
├── Assign IC and roles
├── Open incident channel (#inc-YYYY-MM-DD-title)
├── Post initial status update
└── Start timeline document

Step 3: RESPOND (15 min - ongoing)
├── IC briefs team: "Here's what we know, here's what we don't"
├── Operations Lead begins investigation
├── Check: recent deployments? Config changes? Dependency issues?
├── Parallel investigation tracks if needed
└── 15-minute check-ins for SEV1, 30-min for SEV2

Step 4: MITIGATE (ASAP)
├── Priority: STOP THE BLEEDING
├── Options (fastest first):
│   ├── Rollback last deployment
│   ├── Feature flag disable
│   ├── Traffic shift / failover
│   ├── Scale up / circuit breaker
│   └── Manual data fix
├── Mitigated ≠ Resolved — temporary fix is OK
└── Update status: "Impact mitigated, root cause investigation ongoing"

Step 5: RESOLVE
├── Root cause identified and fixed
├── Verification: SLIs back to normal for 30+ minutes
├── All-clear communicated
└── IC declares incident resolved

Step 6: REVIEW (within 5 business days)
├── Blameless postmortem written
├── Action items assigned with owners and deadlines
├── Postmortem review meeting
└── Action items tracked to completion
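
Two mechanical parts of the workflow, the channel name from Step 2 and the check-in cadence from Step 3, are easy to automate. A minimal sketch follows; the slug rules (lowercase, non-alphanumeric runs collapsed to hyphens) and the function names are assumptions, and no chat platform API is called.

```python
import re
from datetime import date, datetime, timedelta
from typing import Optional

CHECKIN_MINUTES = {1: 15, 2: 30}  # SEV1 and SEV2 cadence from Step 3

def incident_channel(title: str, day: date) -> str:
    """Build a channel name in the #inc-YYYY-MM-DD-title format."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"#inc-{day.isoformat()}-{slug}"

def next_checkin(now: datetime, sev: int) -> Optional[datetime]:
    """Time of the next check-in, or None if no cadence is defined."""
    minutes = CHECKIN_MINUTES.get(sev)
    return now + timedelta(minutes=minutes) if minutes else None
```

For example, `incident_channel("Checkout API 500s!", date(2026, 2, 17))` yields `#inc-2026-02-17-checkout-api-500s`.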

Communication Templates

Initial notification (internal):

🔴 INCIDENT: [Title]
Severity: SEV[X]
Impact: [Who/what is affected]
Status: Investigating
IC: [Name]
Channel: #inc-[date]-[slug]
Next update: [time]

Customer-facing status:

[Service] - Investigating increased error rates

We are currently investigating reports of [symptom]. 
Some users may experience [user-visible impact].
Our team is actively working on a resolution.
We will provide an update within [time].

Resolution notification:

✅ RESOLVED: [Title]
Duration: [X hours Y minutes]
Impact: [Summary]
Root cause: [One sentence]
Postmortem: [Link] (within 5 business days)

Phase 6: Postmortem Framework

Blameless Postmortem Template

postmortem:
  title: ""
  date: ""
  severity: ""  # SEV1-4
  duration: ""  # total incident duration
  authors: []
  reviewers: []
  status: "draft"  # draft | in-review | final
  
  summary: |
    One paragraph: what happened, what was the impact, how was it resolved.
  
  impact:
    users_affected: 0
    duration_minutes: 0
    revenue_impact_usd: 0
    slo_budget_consumed_pct: 0
    data_loss: false
    customer_tickets: 0
  
  timeline:
    - time: ""
      event: ""
      # Chronological, every significant event
      # Include detection time, escalation, mitigation attempts
  
  root_cause: |
    Technical explanation of WHY it happened.
    Go deep — surface causes are not root causes.
  
  contributing_factors:
    - ""  # What made it worse or delayed resolution?
  
  detection:
    how_detected: ""  # alert | user report | manual check
    time_to_detect_minutes: 0
    could_have_detected_sooner: ""
  
  resolution:
    how_resolved: ""
    time_to_mitigate_minutes: 0
    time_to_resolve_minutes: 0
  
  what_went_well:
    - ""  # Explicitly call out what worked
  
  what_went_wrong:
    - ""
  
  where_we_got_lucky:
    - ""  # Things that could have made it worse
  
  action_items:
    - id: "AI-001"
      type: ""  # prevent | detect | mitigate | process
      description: ""
      owner: ""
      priority: ""  # P0 | P1 | P2
      deadline: ""
      status: "open"  # open | in-progress | done
      ticket: ""
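
The detection and resolution fields can be derived from the timeline rather than typed by hand. The sketch below assumes all three durations are measured from the moment the incident started (the template does not say whether mitigation is measured from start or from detection) and that timestamps are ISO 8601 strings without a timezone suffix. The function names and sample times are illustrative.

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return int(delta.total_seconds() // 60)

def response_metrics(started: str, detected: str,
                     mitigated: str, resolved: str) -> dict:
    """Fill the template's time_to_* fields, all measured from incident start."""
    return {
        "time_to_detect_minutes": minutes_between(started, detected),
        "time_to_mitigate_minutes": minutes_between(started, mitigated),
        "time_to_resolve_minutes": minutes_between(started, resolved),
    }

metrics = response_metrics(
    "2026-02-17T11:00:00", "2026-02-17T11:06:00",
    "2026-02-17T11:40:00", "2026-02-17T13:15:00",
)
```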

Root Cause Analysis Methods

Five Whys (simple incidents):

Why did users see errors? → API returned 500s
Why did API return 500s? → Database connection pool exhausted
Why was pool exhausted? → Long-running query held connections
Why was query long-running? → Missing index on new column
Why was index missing? → Migration didn't include index; no query performance review in CI

→ Root cause: No automated query performance check in deployment pipeline
→ Action: Add query plan analysis to CI for migration PRs

Fishbone / Ishikawa (complex incidents):

Categories to investigate:
├── People: Training? Fatigue? Communication?
├── Process: Runbook? Escalation? Change management?
├── Technology: Bug? Config? Capacity? Dependency?
├── Environment: Network? Cloud provider? Third party?
├── Monitoring: Detection gap? Alert fatigue? Dashboard gap?
└── Testing: Test coverage? Load testing? Chaos testing?

Postmortem Review Meeting (60 min)

1. Timeline walk-through (15 min)
   - Author presents chronology
   - Attendees add context ("I remember seeing X at this point")

2. Root cause deep-dive (15 min)  
   - Do we agree on root cause?
   - Are there additional contributing factors?

3. Action item review (20 min)
   - Are these the RIGHT actions?
   - Are they prioritized correctly?
   - Do owners agree on deadlines?

4. Process improvements (10 min)
   - Could we have detected this sooner?
   - Could we have resolved this faster?
   - What would have prevented this entirely?

Phase 7: Chaos Engineering

Chaos Maturity Model

Chaos Experiment Template

experiment:
  name: ""
  hypothesis: "When [fault], the system will [expected behavior]"
  
  steady_state:
    metrics:
      - name: ""
        baseline: ""
        acceptable_range: ""
  
  method:
    fault_type: ""  # network | compute | storage | dependency | data
    target: ""      # which service/component
    blast_radius: ""  # single pod | single AZ | percentage of traffic
    duration: ""
    
  safety:
    abort_conditions:
      - "SLO burn rate exceeds 10x"
      - "Customer-visible errors detected"
      - "Alert fires that we didn't expect"
    rollback_plan: ""
    required_approvals: []
    
  results:
    outcome: ""  # confirmed | disproved | inconclusive
    observations: []
    action_items: []
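
The abort conditions in the template are simple enough to evaluate automatically during an experiment. A hedged sketch: the inputs (current burn rate, a count of customer-visible errors, and sets of alert names) are assumed to come from whatever monitoring is in place, and `abort_reasons` is a hypothetical helper.

```python
def abort_reasons(burn_rate: float, customer_errors: int,
                  fired_alerts: set, expected_alerts: set) -> list:
    """Return the abort conditions that are currently met (empty = continue)."""
    reasons = []
    if burn_rate > 10:
        reasons.append("SLO burn rate exceeds 10x")
    if customer_errors > 0:
        reasons.append("Customer-visible errors detected")
    unexpected = sorted(fired_alerts - expected_alerts)
    if unexpected:
        reasons.append("Unexpected alerts: " + ", ".join(unexpected))
    return reasons
```

Any non-empty result means the fault should be removed and the rollback plan executed immediately.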

Chaos Experiment Library

Game Day Runbook

PRE-GAME (1 week before):
□ Experiment designed and reviewed
□ Steady-state metrics identified
□ Abort conditions defined
□ All participants briefed
□ Rollbacks tested in staging
□ Stakeholders notified

GAME DAY:
□ Verify steady state (15 min baseline)
□ Announce in #engineering: "Chaos Game Day starting"
□ Inject fault
□ Observe and document
□ If abort condition hit → rollback immediately
□ Run for planned duration
□ Remove fault
□ Verify recovery to steady state

POST-GAME (same day):
□ Results documented
□ Surprises noted
□ Action items created
□ Share findings in team meeting

Phase 8: Toil Management

Toil Identification

Definition: Work that is manual, repetitive, automatable, tactical, without enduring value, and scales linearly with service growth.

Toil Inventory Template

toil_item:
  name: ""
  category: ""  # deployment | scaling | config | data | access | monitoring | recovery
  frequency: ""  # daily | weekly | monthly | per-incident
  time_per_occurrence_min: 0
  occurrences_per_month: 0
  total_hours_per_month: 0
  teams_affected: []
  automation_difficulty: ""  # low | medium | high
  automation_value: 0  # hours saved per month
  priority_score: 0  # value / difficulty
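
The derived fields in the inventory can be computed from the raw ones. Because `automation_difficulty` is a label rather than a number, the sketch below assumes weights of 1, 2, and 3 for low, medium, and high; those weights, the `automatable_fraction` parameter, and the function name are illustrative choices, not part of the template.

```python
DIFFICULTY_WEIGHT = {"low": 1, "medium": 2, "high": 3}  # assumed weights

def score_toil(time_per_occurrence_min: float, occurrences_per_month: float,
               automation_difficulty: str,
               automatable_fraction: float = 1.0) -> dict:
    """Compute monthly hours, hours saved by automation, and priority."""
    total_hours = time_per_occurrence_min * occurrences_per_month / 60
    automation_value = total_hours * automatable_fraction
    return {
        "total_hours_per_month": total_hours,
        "automation_value": automation_value,
        "priority_score": automation_value / DIFFICULTY_WEIGHT[automation_difficulty],
    }
```

A 30-minute task done 20 times a month costs 10 hours; at medium difficulty it scores 5.0, so it would rank below a low-difficulty task saving 6 hours.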

Toil Reduction Priority Matrix

Common Toil Targets (Ranked by Impact)

Manual deployments → CI/CD pipeline + GitOps
Access provisioning → Self-service + auto-approval for low-risk
Certificate renewals → Auto-renewal (cert-manager, Let's Encrypt)
Scaling decisions → HPA + predictive auto-scaling
Log investigation → Structured logging + correlation + dashboards
Data fixes → Self-service admin tools + validation at ingestion
Config changes → Config-as-code + automated rollout
Incident response → Automated runbooks for known issues
Capacity reporting → Automated dashboards + forecasting
On-call triage → Noise reduction + auto-remediation for known patterns

Toil Budget Rule

Target: <25% of SRE time spent on toil. Track monthly. If above 25%, prioritize automation over all feature work.

Phase 9: Capacity Planning

Capacity Model Template

capacity_model:
  service: ""
  bottleneck_resource: ""  # CPU | memory | storage | connections | bandwidth
  
  current_state:
    peak_utilization_pct: 0
    headroom_pct: 0
    cost_per_month_usd: 0
    
  growth_forecast:
    metric: ""  # MAU | requests/sec | storage_gb
    current: 0
    monthly_growth_pct: 0
    projected_6mo: 0
    projected_12mo: 0
    
  scaling_strategy:
    type: ""  # horizontal | vertical | hybrid
    auto_scaling: true
    min_instances: 0
    max_instances: 0
    scale_up_threshold: 80  # % utilization
    scale_down_threshold: 30
    cooldown_seconds: 300
    
  cost_projection:
    current_monthly: 0
    projected_6mo_monthly: 0
    projected_12mo_monthly: 0
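
The forecast fields follow from compound monthly growth. This is a sketch under a simplifying assumption: utilization of the bottleneck resource grows in proportion to the forecast metric, with no scaling events in between. Function names are illustrative.

```python
import math
from typing import Optional

def project(current: float, monthly_growth_pct: float, months: int) -> float:
    """Value after compounding monthly growth for the given number of months."""
    return current * (1 + monthly_growth_pct / 100) ** months

def months_until_threshold(peak_utilization_pct: float,
                           monthly_growth_pct: float,
                           threshold_pct: float = 80.0) -> Optional[int]:
    """Whole months until peak utilization reaches the threshold.

    Returns 0 if already at or above it, None if growth is not positive.
    """
    if peak_utilization_pct >= threshold_pct:
        return 0
    if monthly_growth_pct <= 0:
        return None
    growth = 1 + monthly_growth_pct / 100
    return math.ceil(math.log(threshold_pct / peak_utilization_pct) / math.log(growth))
```

With 10% monthly growth, a service peaking at 50% utilization crosses the 80% scale-up threshold in month 5, which gives a concrete lead time for procurement or quota requests.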

Capacity Planning Cadence

Load Testing Benchmarks

Phase 10: On-Call Excellence

On-Call Health Metrics

On-Call Rotation Best Practices

Minimum rotation size: 5 people (one week on, four weeks off)
No back-to-back weeks unless team is too small (fix the team size)
Follow-the-sun for global teams (no one gets paged at 3 AM if avoidable)
Primary + secondary on-call always
Handoff document at rotation change — open issues, recent deploys, known risks
Compensation — on-call pay, time off in lieu, or equivalent
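
A rotation that satisfies the size, primary-plus-secondary, and no-back-to-back rules can be generated with a simple round robin. This sketch offsets the secondary by half the team size so that nobody is on duty, in either role, two weeks in a row; that offset and the function name are design choices made for the example.

```python
def build_rotation(engineers: list, weeks: int) -> list:
    """Return a list of (primary, secondary) pairs, one per week."""
    n = len(engineers)
    if n < 5:
        raise ValueError("rotation needs at least 5 people")
    offset = n // 2  # keeps secondary duty away from adjacent primary weeks
    return [(engineers[w % n], engineers[(w + offset) % n]) for w in range(weeks)]

schedule = build_rotation(["A", "B", "C", "D", "E"], 10)
```

With five people, week 0 is (A, C), week 1 is (B, D), and so on; each person is primary once and secondary once per five-week cycle.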

On-Call Handoff Template

## On-Call Handoff: [Date]

### Open Issues
- [Issue]: [Status, next steps]

### Recent Changes (last 7 days)
- [Deployment/config change]: [Risk level, rollback plan]

### Known Risks
- [Event/condition]: [What to watch for]

### Scheduled Maintenance
- [When]: [What, duration, rollback plan]

### Runbook Updates
- [Any new/updated runbooks since last rotation]

Runbook Template

runbook:
  title: ""
  alert_name: ""  # exact alert that triggers this
  last_updated: ""
  owner: ""
  
  overview: |
    What this alert means in plain English.
    
  impact: |
    What users/systems are affected and how.
    
  diagnosis:
    - step: "Check service health"
      command: ""
      expected: ""
      if_unexpected: ""
    - step: "Check recent deployments"
      command: ""
      expected: ""
      if_unexpected: "Rollback: [command]"
    - step: "Check dependencies"
      command: ""
      expected: ""
      if_unexpected: ""
      
  mitigation:
    - option: "Rollback"
      when: "Recent deployment suspected"
      steps: []
    - option: "Scale up"
      when: "Traffic spike"
      steps: []
    - option: "Failover"
      when: "Single component failure"
      steps: []
      
  escalation:
    after_minutes: 30
    contact: ""
    context_to_provide: ""

Phase 11: Reliability Review & Governance

Weekly SRE Review (30 min)

1. SLO Status (5 min)
   - Budget remaining per service
   - Any burn rate alerts this week?

2. Incident Review (10 min)
   - Incidents this week: count, severity, duration
   - Open postmortem action items: status check

3. On-Call Health (5 min)
   - Pages this week (total, off-hours, false positives)
   - Any on-call feedback?

4. Reliability Work (10 min)
   - Automation shipped this week
   - Toil reduced (hours saved)
   - Chaos experiments run
   - Capacity concerns

Monthly Reliability Report

monthly_report:
  period: ""
  
  slo_summary:
    services_meeting_slo: 0
    services_breaching_slo: 0
    worst_performing: ""
    
  incidents:
    total: 0
    by_severity: { SEV1: 0, SEV2: 0, SEV3: 0, SEV4: 0 }
    mttr_minutes: 0
    mttd_minutes: 0
    repeat_incidents: 0
    
  error_budget:
    services_in_healthy: 0
    services_in_warning: 0
    services_in_critical: 0
    services_exhausted: 0
    
  toil:
    hours_spent: 0
    hours_automated_away: 0
    pct_of_sre_time: 0
    
  on_call:
    total_pages: 0
    off_hours_pages: 0
    false_positive_pct: 0
    avg_ack_time_min: 0
    
  action_items:
    open: 0
    completed_this_month: 0
    overdue: 0
    
  highlights: []
  concerns: []
  next_month_priorities: []

Production Readiness Review Checklist

Before any new service goes to production:

Phase 12: Advanced Patterns

Self-Healing Automation

auto_remediation:
  - trigger: "pod_crash_loop"
    condition: "restart_count > 3 in 10 min"
    action: "Delete pod, let scheduler reschedule"
    escalate_if: "Still crashing after 3 auto-remediations"
    
  - trigger: "disk_usage_high"
    condition: "disk_usage > 85%"
    action: "Run log cleanup script, archive old data"
    escalate_if: "Still above 85% after cleanup"
    
  - trigger: "connection_pool_exhausted"
    condition: "available_connections = 0"
    action: "Kill idle connections, increase pool temporarily"
    escalate_if: "Pool exhausted again within 1 hour"
    
  - trigger: "certificate_expiring"
    condition: "days_until_expiry < 14"
    action: "Trigger cert renewal"
    escalate_if: "Renewal fails"
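
The first rule (crash loop, with escalation after three automatic attempts) illustrates the general pattern: remediate, count attempts, then hand off to a human. Below is a minimal sketch. The `delete_pod` and `page_oncall` callables are placeholders supplied by the caller, not real Kubernetes or paging APIs, and attempts are tracked per workload name because replacement pods usually get new names.

```python
from collections import defaultdict
from typing import Callable

class CrashLoopRemediator:
    """Delete crash-looping pods, escalating after repeated failures."""

    def __init__(self, delete_pod: Callable[[str], None],
                 page_oncall: Callable[[str], None], max_attempts: int = 3):
        self.delete_pod = delete_pod
        self.page_oncall = page_oncall
        self.max_attempts = max_attempts
        self.attempts = defaultdict(int)  # workload name -> remediations so far

    def handle(self, workload: str, restarts_last_10_min: int) -> str:
        if restarts_last_10_min <= 3:
            return "healthy"
        if self.attempts[workload] >= self.max_attempts:
            self.page_oncall(workload)   # still crashing after 3 attempts
            return "escalated"
        self.attempts[workload] += 1
        self.delete_pod(workload)
        return "remediated"
```

A production version would also expire old attempt counts; that is omitted here to keep the escalation logic visible.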

Multi-Region Reliability

Decision guide:

SLO < 99.9% → Single region with good backups
SLO 99.9-99.95% → Active-passive with automated failover
SLO 99.95-99.99% → Active-active (read or full)
SLO > 99.99% → Cell-based architecture
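
The decision guide maps to a short lookup. In this sketch the shared boundaries (99.95% and 99.99%) are assigned to the less complex option, which is an assumption; the function name is illustrative.

```python
def recommended_topology(slo_target_pct: float) -> str:
    """Suggest a deployment topology for an availability SLO target."""
    if slo_target_pct < 99.9:
        return "single region with good backups"
    if slo_target_pct <= 99.95:
        return "active-passive with automated failover"
    if slo_target_pct <= 99.99:
        return "active-active (read or full)"
    return "cell-based architecture"
```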

Reliability Culture Indicators

Healthy signals:

Postmortems are blameless and well-attended
Error budgets are respected (feature freeze actually happens)
On-call is shared fairly and compensated
Toil is tracked and reducing quarter-over-quarter
Chaos experiments happen regularly
Teams own their reliability (not just SRE)

Warning signs:

"Hero culture" — same person always saves the day
Postmortems are blame-focused or skipped
Error budget exhaustion doesn't change behavior
On-call is dreaded, same 2 people always paged
"We'll fix reliability after this feature ships" (always)
SRE team is just an ops team with a new name

Quality Scoring Rubric (0-100)
