plugins/devops-sre/skills/incident-response/incident-commander/SKILL.md
Guide incident response as an Incident Commander with structured communication and coordination. Use this skill when there's an active incident, outage, service degradation, or production issue. Activate when: incident, outage, service down, production issue, SEV1, SEV2, pages, alerts firing, something broke, users complaining, error spike, latency spike.
Install this skill globally with one command: `npx skillsauth add latestaiagents/agent-skills incident-commander`. Works with Claude Code, Cursor, and Windsurf.
Lead incident response with structured communication, clear ownership, and systematic resolution.
The IC (Incident Commander) is responsible for:
The IC does NOT need to be the person fixing the problem.
| Level | Description | Response Time | Examples |
|-------|-------------|---------------|----------|
| SEV1 | Critical - Complete outage | Immediate | Total service down, data loss, security breach |
| SEV2 | Major - Significant impact | 15 min | Core feature broken, major degradation |
| SEV3 | Minor - Limited impact | 1 hour | Non-critical feature down, workaround exists |
| SEV4 | Low - Minimal impact | Best effort | Cosmetic, single user affected |
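If the severity table needs to be consumed by tooling (for example, a paging bot), it could be encoded as a small lookup. The sketch below is only an illustration: the `Severity` class, the `SEVERITIES` mapping, and `response_deadline_minutes` are hypothetical names, and representing "Immediate" as zero minutes and "Best effort" as `None` is an assumption.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Severity:
    level: str
    description: str
    # None stands for "best effort" (assumption: no fixed deadline).
    response_time: Optional[timedelta]


# Values copied from the severity table; "Immediate" is modeled as 0 minutes.
SEVERITIES = {
    "SEV1": Severity("SEV1", "Critical - Complete outage", timedelta(0)),
    "SEV2": Severity("SEV2", "Major - Significant impact", timedelta(minutes=15)),
    "SEV3": Severity("SEV3", "Minor - Limited impact", timedelta(hours=1)),
    "SEV4": Severity("SEV4", "Low - Minimal impact", None),
}


def response_deadline_minutes(level: str) -> Optional[int]:
    """Return the target response time in minutes, or None for best effort."""
    sev = SEVERITIES[level]  # raises KeyError for an unknown level
    if sev.response_time is None:
        return None
    return int(sev.response_time.total_seconds() // 60)
```

An unknown level raises `KeyError` rather than silently defaulting, so a typo in a severity label is surfaced immediately.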
1. Acknowledge the alert
2. Quick assessment:
   - What's broken?
   - Who's affected?
   - What's the blast radius?
3. Assign severity level
4. Declare incident if SEV1/SEV2
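The triage steps above can be captured as a small record so the answers are written down before severity is assigned. This is a minimal sketch: the `Triage` class and its field names are hypothetical, and the only rule it encodes is the one stated in step 4 (declare for SEV1/SEV2).

```python
from dataclasses import dataclass

# Step 4: only SEV1 and SEV2 trigger a formal incident declaration.
DECLARE_LEVELS = frozenset({"SEV1", "SEV2"})


@dataclass
class Triage:
    whats_broken: str   # step 2: what's broken?
    whos_affected: str  # step 2: who's affected?
    blast_radius: str   # step 2: what's the blast radius?
    severity: str       # step 3: assigned severity, e.g. "SEV2"

    def must_declare(self) -> bool:
        """True when the assigned severity requires declaring an incident."""
        return self.severity in DECLARE_LEVELS
```

Choosing the severity itself remains a human judgment call; the sketch only records the result.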
1. Create incident channel: #inc-YYYYMMDD-[brief-description]
2. Post initial summary (template below)
3. Page relevant teams if needed
4. Assign roles:
   - IC (Incident Commander) - you or delegate
   - Tech Lead - driving investigation
   - Comms Lead - external communication
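The channel naming pattern `#inc-YYYYMMDD-[brief-description]` from step 1 could be generated rather than typed by hand. The helper below is a sketch; `incident_channel_name` is a hypothetical function, and the slug rules (lowercase, hyphens only) and the 80-character cap are assumptions that should be checked against the chat platform in use.

```python
import re
from datetime import date


def incident_channel_name(day: date, description: str, max_len: int = 80) -> str:
    """Build '#inc-YYYYMMDD-brief-description' from a date and free text."""
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    name = f"#inc-{day:%Y%m%d}-{slug}"
    # Truncate to the assumed length limit and avoid a trailing hyphen.
    return name[:max_len].rstrip("-")
```

For example, `incident_channel_name(date(2024, 3, 5), "Checkout API 500s")` yields `#inc-20240305-checkout-api-500s`.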
Initial Incident Post Template:
🚨 INCIDENT DECLARED
**Severity:** SEV-[X]
**Status:** Investigating
**Impact:** [Who/what is affected]
**Started:** [Time] UTC
**Current Understanding:**
[Brief description of symptoms]
**Roles:**
- IC: @[name]
- Tech Lead: @[name]
- Comms: @[name]
**Next Update:** [Time] (every 15-30 min for SEV1/2)
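The initial post template above can be filled programmatically so no field is forgotten under pressure. This is a sketch using the standard library's `string.Template`; the `render_initial_post` function and the placeholder names are hypothetical, chosen to mirror the bracketed fields in the template.

```python
from string import Template

# Mirrors the initial incident post template; placeholder names are assumptions.
INITIAL_POST = Template(
    "🚨 INCIDENT DECLARED\n"
    "**Severity:** SEV-$severity\n"
    "**Status:** Investigating\n"
    "**Impact:** $impact\n"
    "**Started:** $started UTC\n"
    "**Current Understanding:**\n"
    "$summary\n"
    "**Roles:**\n"
    "- IC: @$ic\n"
    "- Tech Lead: @$tech_lead\n"
    "- Comms: @$comms\n"
    "**Next Update:** $next_update"
)


def render_initial_post(**fields: str) -> str:
    """Fill the template; substitute() raises KeyError if a field is missing."""
    return INITIAL_POST.substitute(fields)
```

Using `substitute` rather than `safe_substitute` is deliberate: a post with an unfilled placeholder fails loudly instead of reaching the channel half-complete.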
1. Gather data:
   - Recent deployments?
   - Configuration changes?
   - External dependency issues?
   - Error patterns in logs?
   - Metrics anomalies?
2. Form hypothesis and test
3. Identify mitigation options:
   - Can we roll back?
   - Can we scale?
   - Can we fail over?
   - Do we need a hotfix?
1. Choose mitigation approach
2. Communicate plan before executing
3. Execute with verification at each step
4. Monitor for improvement
5. Confirm resolution
1. Verify service is healthy
2. Update status page
3. Send resolution communication
4. Create postmortem ticket
5. Schedule postmortem meeting (within 48h for SEV1/2)
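The 48-hour postmortem window in step 5 can be computed when the incident is resolved. The sketch below assumes the window is measured from the resolution time, which the checklist does not specify; `postmortem_deadline` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

POSTMORTEM_WINDOW = timedelta(hours=48)


def postmortem_deadline(resolved_at: datetime, severity: str) -> Optional[datetime]:
    """Latest time for the postmortem meeting, or None if no deadline applies.

    Assumption: the 48h window starts at resolution time.
    """
    if severity in {"SEV1", "SEV2"}:
        return resolved_at + POSTMORTEM_WINDOW
    return None  # the checklist sets no deadline for SEV3/SEV4


resolved = datetime(2024, 3, 5, 16, 40, tzinfo=timezone.utc)
deadline = postmortem_deadline(resolved, "SEV1")
```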
📊 INCIDENT UPDATE - [Time] UTC
**Status:** [Investigating/Identified/Mitigating/Resolved]
**Impact:** [Current impact]
**Update:**
[What we've learned, what we're doing]
**Next Steps:**
[What's happening next]
**Next Update:** [Time] UTC
✅ INCIDENT RESOLVED - [Time] UTC
**Duration:** [X hours Y minutes]
**Root Cause:** [Brief description]
**Resolution:** [What fixed it]
**Impact Summary:**
- Users affected: [number]
- Duration: [time]
- SLA impact: [yes/no]
**Next Steps:**
- Postmortem scheduled: [date/time]
- Postmortem doc: [link]
Thank you to everyone who helped respond.
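The `**Duration:** [X hours Y minutes]` field in the resolution message can be derived from the incident's start and resolution timestamps. A minimal sketch, with `format_duration` as a hypothetical helper that rounds down to whole minutes:

```python
from datetime import datetime, timezone


def format_duration(started: datetime, resolved: datetime) -> str:
    """Format the elapsed time as 'X hours Y minutes' (seconds are dropped)."""
    total_minutes = int((resolved - started).total_seconds() // 60)
    if total_minutes < 0:
        raise ValueError("resolved must not be earlier than started")
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


started = datetime(2024, 3, 5, 14, 5, tzinfo=timezone.utc)
resolved = datetime(2024, 3, 5, 16, 40, tzinfo=timezone.utc)
duration = format_duration(started, resolved)  # "2 hours 35 minutes"
```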
| Role | Responsibility | Who |
|------|----------------|-----|
| IC | Coordination, decisions, communication | Declared or on-call |
| Tech Lead | Investigation, fix implementation | SME for affected service |
| Comms Lead | Status page, customer comms | Support/Comms team |
| Scribe | Document timeline | Anyone available |
| Subject Matter Experts | Deep knowledge | Paged as needed |
Escalate to leadership when: