devops-incident-responder-skill/SKILL.md
Expert in SRE practices, incident management, root cause analysis, and automated remediation.
`npx skillsauth add 404kidwiz/claude-supercode-skills devops-incident-responder`

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Provides incident management and reliability engineering expertise specializing in rapid outage response, root cause analysis, and automated remediation. Focuses on minimizing MTTR (Mean Time To Recovery) through effective triage, communication, and prevention strategies.
| Level | Criteria | Response | SLA (Response) |
|-------|----------|----------|----------------|
| SEV-1 | Critical user impact (Site Down, Data Loss). | Wake up everyone. CEO notified. | 15 mins |
| SEV-2 | Major feature broken (Checkout fails). | Wake up on-call. | 30 mins |
| SEV-3 | Minor issue (Internal tool slow). | Handle next business day. | 8 business hours |
| SEV-4 | Trivial bug / Cosmetic. | Backlog. | N/A |
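The severity table can be encoded so that tooling computes a response deadline when an incident opens. The following is a minimal sketch, not part of the skill itself: `RESPONSE_SLA` and `response_deadline` are hypothetical names, and only the wall-clock SLAs (SEV-1, SEV-2) are modeled. SEV-3 is measured in business hours and SEV-4 has no SLA, so both map to `None` here.

```python
from datetime import datetime, timedelta
from typing import Optional

# Response SLAs taken from the severity table. SEV-3 ("8 business hours")
# and SEV-4 ("N/A") have no wall-clock deadline in this sketch.
RESPONSE_SLA = {
    "SEV-1": timedelta(minutes=15),
    "SEV-2": timedelta(minutes=30),
    "SEV-3": None,
    "SEV-4": None,
}


def response_deadline(severity: str, opened_at: datetime) -> Optional[datetime]:
    """Return the wall-clock response deadline, or None if none applies."""
    sla = RESPONSE_SLA[severity]  # an unknown level raises KeyError on purpose
    return opened_at + sla if sla is not None else None


if __name__ == "__main__":
    opened = datetime(2024, 1, 1, 3, 0)
    for level in RESPONSE_SLA:
        print(level, response_deadline(level, opened))
```

Raising on an unknown level rather than defaulting is a deliberate choice in this sketch: a mistyped severity should fail loudly instead of silently receiving no deadline.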
For every resource (CPU, Memory, Disk), check:
Red Flags → Escalate to security-engineer:
netstat / WAF logs)

Goal: Fix "Disk Full" alerts without human intervention.

Steps:
- **Trigger:** `DiskSpaceLow` (> 90%).
- **Action:** `docker system prune -f` or `journalctl --vacuum-time=1d`.
- **Notification**
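The trigger and action steps above can be sketched as a small remediation script. This is an illustrative outline under stated assumptions: the function names (`plan_remediation`, `remediate`) are hypothetical, the script checks the local filesystem with `shutil.disk_usage` rather than consuming a real alert, and it defaults to a dry run so nothing is pruned by accident. The notification step is not modeled.

```python
import shutil
import subprocess

THRESHOLD = 0.90  # mirrors the DiskSpaceLow trigger (> 90%)

# Cleanup commands named in the steps above.
CLEANUP_COMMANDS = [
    ["docker", "system", "prune", "-f"],
    ["journalctl", "--vacuum-time=1d"],
]


def plan_remediation(total: int, used: int, threshold: float = THRESHOLD):
    """Return the cleanup commands to run, or [] if usage is within limits."""
    if used / total > threshold:
        return CLEANUP_COMMANDS
    return []


def remediate(path: str = "/", dry_run: bool = True):
    """Check disk usage at `path` and run (or just print) the cleanup."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    commands = plan_remediation(usage.total, usage.used)
    for cmd in commands:
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            # check=False: one failing cleanup should not block the next one
            subprocess.run(cmd, check=False)
    return commands


if __name__ == "__main__":
    remediate(dry_run=True)
```

Keeping the decision (`plan_remediation`) separate from the side effects (`remediate`) makes the threshold logic testable without touching the host.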
Use case: Preventing cascading failures when a dependency acts up.
```yaml
# Istio DestinationRule
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
    outlierDetection:
      consecutive5xxErrors: 1
      interval: 1s
      baseEjectionTime: 3m
      maxEjectionPercent: 100
```
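To make the `outlierDetection` settings concrete, here is a deliberately simplified model of consecutive-error ejection. It is a sketch for intuition only, not how Istio or Envoy implement it: the real proxy evaluates hosts on the `interval` sweep, lengthens ejection time on repeated ejections, and honors `maxEjectionPercent`, none of which is modeled here. The `OutlierDetector` class and its method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    consecutive_5xx: int = 0
    ejected_until: float = 0.0  # timestamp in seconds


class OutlierDetector:
    """Toy model: eject a host after N consecutive 5xx responses."""

    def __init__(self, consecutive_5xx_errors: int = 1,
                 base_ejection_time: float = 180.0):  # 3m, as in the rule
        self.limit = consecutive_5xx_errors
        self.base_ejection_time = base_ejection_time
        self.hosts = {}

    def record(self, host: str, status: int, now: float) -> None:
        """Record one response and eject the host if it hits the limit."""
        state = self.hosts.setdefault(host, HostState())
        if 500 <= status <= 599:
            state.consecutive_5xx += 1
            if state.consecutive_5xx >= self.limit:
                state.ejected_until = now + self.base_ejection_time
                state.consecutive_5xx = 0
        else:
            state.consecutive_5xx = 0  # any success resets the streak

    def is_available(self, host: str, now: float) -> bool:
        """A host is routable if it was never seen or its ejection expired."""
        state = self.hosts.get(host)
        return state is None or now >= state.ejected_until


if __name__ == "__main__":
    detector = OutlierDetector()
    detector.record("reviews-v1", 503, now=0.0)
    print(detector.is_available("reviews-v1", now=10.0))
    print(detector.is_available("reviews-v1", now=180.0))
```

With `consecutive5xxErrors: 1`, a single failed response is enough to remove a host from the pool, which is aggressive and mainly suited to demonstrating the behavior.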
# Runbook: High Database CPU
**Severity:** SEV-2
**Trigger:** RDS CPU > 90% for 5 mins
## 1. Triage
- Check [Database Dashboard](link).
- Is it a specific query? (See "Top SQL" panel).
## 2. Mitigation Actions
- **Option A (Bad Query):** Kill the session.
`SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE ...`
- **Option B (Traffic Spike):** Scale Read Replicas (Terraform apply).
- **Option C (Maintenance):** Stop non-essential cron jobs.
## 3. Escalation
- If CPU remains > 95% for 15 mins, page @database-team.
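Both the runbook trigger ("> 90% for 5 mins") and the escalation rule ("> 95% for 15 mins") are sustained-threshold checks. A minimal sketch of that check follows, assuming one CPU sample per minute with the newest sample last; `sustained_above` is a hypothetical helper, not an API of any monitoring system.

```python
from typing import Sequence


def sustained_above(samples: Sequence[float], threshold: float,
                    minutes: int) -> bool:
    """True if each of the last `minutes` per-minute samples exceeds threshold.

    Returns False when there is not yet enough history to decide.
    """
    if len(samples) < minutes:
        return False
    return all(cpu > threshold for cpu in samples[-minutes:])


if __name__ == "__main__":
    cpu_history = [97.0] * 15  # hypothetical per-minute RDS CPU readings
    if sustained_above(cpu_history, threshold=95, minutes=15):
        print("page @database-team")
    elif sustained_above(cpu_history, threshold=90, minutes=5):
        print("open SEV-2 and start triage")
```

Requiring every sample in the window to exceed the threshold avoids paging on a brief spike; a single dip below the threshold resets the condition.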
Use case: Clear communication to users.