java/src/main/resources/targets/claude/skills/knowledge-packs/sre-practices/SKILL.md
SRE practices: error budgets, toil reduction, on-call practices, capacity planning, incident management process, and change management. Covers SLO-based release gates, burn rate calculation, automation prioritization, rotation patterns, load testing methodology, blameless postmortems, and canary analysis.
Install this skill globally with one command: `npx skillsauth add edercnj/ia-dev-environment sre-practices`. Works with Claude Code, Cursor, and Windsurf.
Provides comprehensive Site Reliability Engineering practices for {{LANGUAGE}} {{FRAMEWORK}} services, enabling teams to maintain reliability, reduce operational toil, and manage incidents effectively. Covers error budget management, toil reduction strategies, on-call operations, capacity planning, incident response, and change management processes.
See references/error-budget-calculator.md for SLO targets, burn rate formulas, and budget exhaustion thresholds. See references/on-call-handbook.md for on-call rotation, escalation procedures, and page response workflow. See references/capacity-planning-template.md for load testing methodology, growth projections, and resource sizing.
Read these files for comprehensive SRE guidance:
| Reference | Content |
|-----------|---------|
| references/error-budget-calculator.md | SLO targets, burn rate formulas, budget exhaustion thresholds, and allocation strategies |
| references/on-call-handbook.md | On-call rotation patterns, escalation procedures, page response workflow, and fatigue management |
| references/capacity-planning-template.md | Load testing methodology, growth projections, resource sizing, and headroom targets |
Error budgets quantify acceptable unreliability derived from Service Level Objectives. When the budget is exhausted, new deployments are frozen until the budget recovers.
| SLO Target | Error Budget (monthly) | Allowed Downtime (monthly) |
|-----------|----------------------|-----------------|
| 99.9% | 0.1% | 43.8 minutes |
| 99.95% | 0.05% | 21.9 minutes |
| 99.99% | 0.01% | 4.38 minutes |
Release gate rule: Deploy only when remaining error budget exceeds the estimated risk of the deployment.
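
The downtime figures and the release gate rule above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming an average month of 43,830 minutes and assuming that deployment risk is estimated in minutes of budget; the function names are illustrative and not part of the skill.

```python
# Average month length in minutes (365.25 days / 12); an assumption that
# matches the 43.8-minute figure for a 99.9% SLO.
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12


def error_budget_fraction(slo_percent: float) -> float:
    """Convert an SLO such as 99.9 into its error budget as a fraction (0.001)."""
    return (100.0 - slo_percent) / 100.0


def allowed_downtime_minutes(slo_percent: float,
                             period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Downtime the error budget permits over the budget period."""
    return error_budget_fraction(slo_percent) * period_minutes


def can_deploy(remaining_budget_minutes: float,
               estimated_risk_minutes: float) -> bool:
    """Release gate: deploy only if the remaining budget exceeds the estimated risk."""
    return remaining_budget_minutes > estimated_risk_minutes


if __name__ == "__main__":
    for slo in (99.9, 99.95, 99.99):
        print(f"{slo}% -> {allowed_downtime_minutes(slo):.2f} minutes per month")
```

How the risk of a deployment is estimated is left open here; a team might use the downtime caused by comparable past changes.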
Burn rate measures how fast the error budget is being consumed relative to the budget period.
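Formula: `burn_rate = error_rate / error_budget_rate`, where `error_budget_rate` is `1 - SLO`. A burn rate of 1 consumes exactly the full budget over the budget period.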
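
As an illustration of the formula, the sketch below computes a burn rate from an observed error rate and an SLO, then projects how long a full budget would last at that rate. The helper names and the 730-hour month are assumptions made for the example.

```python
def burn_rate(error_rate: float, slo_percent: float) -> float:
    """burn_rate = error_rate / error_budget_rate, with both rates as fractions."""
    error_budget_rate = (100.0 - slo_percent) / 100.0
    if error_budget_rate <= 0:
        raise ValueError("an SLO of 100% leaves no error budget to burn")
    return error_rate / error_budget_rate


def hours_to_exhaustion(rate: float, period_hours: float = 730.0) -> float:
    """Hours until a full, untouched budget is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return period_hours / rate


if __name__ == "__main__":
    # 0.5% of requests failing against a 99.9% SLO
    rate = burn_rate(error_rate=0.005, slo_percent=99.9)
    print(f"burn rate: {rate:.1f}")
    print(f"hours to exhaustion: {hours_to_exhaustion(rate):.0f}")
```

A constant burn rate is a simplification; in practice the rate is usually measured over one or more rolling windows.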
| Consumption | Action |
|------------|--------|
| 50% consumed | Warning alert to SRE team; review recent changes |
| 75% consumed | Escalation to engineering lead; freeze non-critical deploys |
| 100% consumed | Full deploy freeze; all engineering effort on reliability |
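
The consumption thresholds above translate directly into a policy lookup. This is a small sketch, assuming consumption is tracked as a fraction of the period's budget; the action labels are hypothetical identifiers rather than names defined by the skill.

```python
def budget_action(consumed_fraction: float) -> str:
    """Map error budget consumption (0.0 to 1.0 and beyond) to a policy action."""
    if consumed_fraction >= 1.0:
        return "full-deploy-freeze"            # all effort on reliability
    if consumed_fraction >= 0.75:
        return "escalate-and-freeze-noncritical"
    if consumed_fraction >= 0.50:
        return "warn-sre-team"                 # review recent changes
    return "none"


if __name__ == "__main__":
    for consumed in (0.2, 0.5, 0.8, 1.1):
        print(f"{consumed:.0%} consumed -> {budget_action(consumed)}")
```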
Toil is work that is manual, repetitive, automatable, tactical, and devoid of enduring value, and that scales linearly with service growth.
| Criterion | Description |
|----------|-------------|
| Manual | Requires human intervention to complete |
| Repetitive | Performed more than once with the same steps |
| Automatable | Could be handled by software with existing technology |
| Tactical | Interrupt-driven rather than strategy-driven |
| No enduring value | Does not permanently improve the system |
Prioritize automation based on frequency, time cost, and risk:
| Priority | Frequency | Time per Occurrence | Risk if Manual |
|---------|-----------|-------------------|---------------|
| P0 (Critical) | Daily+ | > 30 minutes | High (outage risk) |
| P1 (High) | Weekly | > 15 minutes | Medium |
| P2 (Medium) | Monthly | > 1 hour | Low |
| P3 (Low) | Quarterly | > 2 hours | Minimal |
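
One way to apply the three factors is to rank toil tasks by estimated annual hours, weighted by risk. The sketch below is only an illustration: the occurrence counts, the risk weights, and the example tasks are assumptions and do not come from the table.

```python
from dataclasses import dataclass

# Assumed occurrences per year for each frequency bucket.
OCCURRENCES_PER_YEAR = {"daily": 365, "weekly": 52, "monthly": 12, "quarterly": 4}
# Hypothetical multipliers; tune these to the team's own risk appetite.
RISK_WEIGHT = {"high": 3.0, "medium": 2.0, "low": 1.5, "minimal": 1.0}


@dataclass
class ToilTask:
    name: str
    frequency: str               # key of OCCURRENCES_PER_YEAR
    minutes_per_occurrence: float
    risk: str                    # key of RISK_WEIGHT

    def annual_hours(self) -> float:
        return OCCURRENCES_PER_YEAR[self.frequency] * self.minutes_per_occurrence / 60.0

    def score(self) -> float:
        return self.annual_hours() * RISK_WEIGHT[self.risk]


def rank_for_automation(tasks: list) -> list:
    """Return tasks ordered from highest to lowest automation payoff."""
    return sorted(tasks, key=lambda task: task.score(), reverse=True)


if __name__ == "__main__":
    backlog = [
        ToilTask("rotate certificates", "quarterly", 120, "minimal"),
        ToilTask("clear stuck queue", "daily", 30, "high"),
        ToilTask("rebuild search index", "weekly", 15, "medium"),
    ]
    for task in rank_for_automation(backlog):
        print(f"{task.name}: {task.annual_hours():.1f} h/year, score {task.score():.1f}")
```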

| Pattern | Use Case | Pros | Cons |
|--------|----------|------|------|
| Weekly | Small teams (3-5) | Simple scheduling | Long shifts |
| Follow-the-sun | Distributed teams | No night pages | Handoff overhead |
| Hybrid | Medium teams (5-10) | Balanced load | Complex scheduling |

| Severity | Initial Response | Escalation After | Commander |
|---------|-----------------|------------------|-----------|
| SEV-1 (Critical) | 5 minutes | 15 minutes | VP Engineering |
| SEV-2 (Major) | 15 minutes | 30 minutes | Engineering Manager |
| SEV-3 (Minor) | 30 minutes | 2 hours | Team Lead |
| SEV-4 (Low) | Next business day | N/A | On-call engineer |
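
The severity table can be encoded as data so that paging tooling can decide when to escalate. The sketch below assumes that "Escalation After" means time elapsed without acknowledgement, which the table does not state explicitly; the dictionary layout and function name are illustrative.

```python
# Escalation policy transcribed from the severity table; times are in minutes.
# None marks values the table gives as "Next business day" or "N/A".
ESCALATION_POLICY = {
    "SEV-1": {"response_min": 5, "escalate_after_min": 15, "commander": "VP Engineering"},
    "SEV-2": {"response_min": 15, "escalate_after_min": 30, "commander": "Engineering Manager"},
    "SEV-3": {"response_min": 30, "escalate_after_min": 120, "commander": "Team Lead"},
    "SEV-4": {"response_min": None, "escalate_after_min": None, "commander": "On-call engineer"},
}


def should_escalate(severity: str, minutes_since_page: float, acknowledged: bool) -> bool:
    """Escalate when a page is still unacknowledged past the severity's limit."""
    limit = ESCALATION_POLICY[severity]["escalate_after_min"]
    if limit is None:  # SEV-4 has no escalation path
        return False
    return not acknowledged and minutes_since_page >= limit


if __name__ == "__main__":
    print(should_escalate("SEV-1", minutes_since_page=16, acknowledged=False))
    print(ESCALATION_POLICY["SEV-2"]["commander"])
```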

| Test Type | Purpose | Duration |
|----------|---------|----------|
| Baseline | Establish normal performance | 1 hour at expected load |
| Stress | Find breaking point | Ramp until failure |
| Soak | Detect memory leaks and degradation | 8-24 hours at peak load |
| Spike | Validate auto-scaling | Sudden 10x load burst |

| Resource | Monitoring Period | Right-Size Trigger |
|---------|------------------|-------------------|
| CPU | 14-day p99 | < 30% utilization sustained |
| Memory | 14-day p99 | < 40% utilization sustained |
| Storage | 30-day trend | Growth rate < 5% monthly |
| Network | 7-day peak | < 50% bandwidth utilized |
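
The right-size triggers can be checked mechanically once the monitoring aggregates exist. This sketch assumes the caller has already computed each value over the monitoring period listed in the table (for example the 14-day p99 CPU utilization) and passes it in as a fraction; the metric keys are hypothetical.

```python
# Thresholds from the right-sizing table, expressed as fractions.
RIGHT_SIZE_TRIGGERS = {
    "cpu": 0.30,             # 14-day p99 utilization
    "memory": 0.40,          # 14-day p99 utilization
    "storage_growth": 0.05,  # monthly growth rate from the 30-day trend
    "network": 0.50,         # 7-day peak bandwidth utilization
}


def right_size_candidates(observed: dict) -> list:
    """Return the resources whose observed value falls below its trigger."""
    return sorted(
        name
        for name, value in observed.items()
        if name in RIGHT_SIZE_TRIGGERS and value < RIGHT_SIZE_TRIGGERS[name]
    )


if __name__ == "__main__":
    service = {"cpu": 0.22, "memory": 0.55, "storage_growth": 0.08, "network": 0.10}
    print(right_size_candidates(service))
```

A resource appearing in the result is a candidate for review, not an instruction to shrink it; headroom targets still apply.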

| Method | Response Time | Reliability |
|--------|-------------|-------------|
| Automated alerting | Seconds to minutes | High (if thresholds correct) |
| Synthetic monitoring | 1-5 minutes | High (proactive) |
| User reports | Minutes to hours | Variable |
| Log analysis | Minutes | Medium (requires correlation) |

| Strategy | When to Use | Risk |
|---------|------------|------|
| Rollback | Bad deployment identified | Low (known good state) |
| Feature flags | Feature-specific issue | Low (granular control) |
| Traffic management | Capacity-related issue | Medium (partial service) |
| Restart | Transient state corruption | Medium (data loss risk) |

| Period | Policy | Exception Process |
|--------|--------|------------------|
| Holiday season | Full freeze (2 weeks) | VP approval required |
| Major events | Feature freeze (1 week) | Rollback-safe changes only |
| Quarterly close | Reduced deployments | Business-critical fixes only |
Automatic rollback triggers:
| Parameter | Recommended Value |
|----------|------------------|
| Initial traffic | 1-5% |
| Evaluation window | 15-30 minutes per stage |
| Stage progression | 5% -> 10% -> 25% -> 50% -> 100% |
| Success criteria | Error rate delta < 0.1% vs control |
| Abort threshold | Error rate delta > 0.5% vs control |
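
The canary parameters above imply a simple decision at the end of each evaluation window. The following sketch assumes the error rate delta is measured in absolute percentage points (canary minus control) and that a delta between the two thresholds means staying at the current stage; neither assumption is stated in the table, and the function names are illustrative.

```python
STAGES = [5, 10, 25, 50, 100]   # traffic percentages from the stage progression
SUCCESS_DELTA_PCT = 0.1         # promote below this delta (percentage points)
ABORT_DELTA_PCT = 0.5           # abort above this delta (percentage points)


def evaluate_stage(canary_error_pct: float, control_error_pct: float) -> str:
    """Compare canary and control error rates after one evaluation window."""
    delta = canary_error_pct - control_error_pct
    if delta > ABORT_DELTA_PCT:
        return "abort"
    if delta < SUCCESS_DELTA_PCT:
        return "promote"
    return "hold"               # inconclusive: keep traffic where it is


def next_traffic_percent(current: int, decision: str) -> int:
    """Traffic share for the canary after applying the decision."""
    if decision == "abort":
        return 0                # roll back: all traffic returns to control
    if decision == "hold":
        return current
    for stage in STAGES:        # promote to the first stage above the current share
        if stage > current:
            return stage
    return 100


if __name__ == "__main__":
    decision = evaluate_stage(canary_error_pct=0.25, control_error_pct=0.20)
    print(decision, next_traffic_percent(5, decision))
```

A production analysis would normally also compare latency and saturation, and would require a minimum request count before trusting the delta.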

- skills/observability/: SLO/SLI framework, burn rate alerts, and alerting strategy
- skills/resilience/: circuit breaker, rate limiting, and chaos engineering patterns
- skills/disaster-recovery/: DR strategies, failover automation, and recovery procedures