Run a structured incident response. Use when the user says "we have an incident", "production is down", "service is degraded", "on-call response", "p0 incident", "something is broken in prod", "help me manage this incident", "incident commander", or there is an active production issue requiring coordinated response - even if they don't explicitly say "incident response".
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add qa-aman/claude-skills incident-response
Based on the Google SRE Book by Beyer, Jones, Petoff & Murphy. Google's incident management framework establishes clear roles, communication protocols, and decision hierarchies that let teams respond to complex outages without chaos. The core principle: when systems fail, the response should be calmer and more structured than the incident itself.
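Three SRE roles during an incident:
- Incident Commander (IC): owns the incident and coordinates the response.
- Operations Lead: investigates and makes the production changes.
- Communications Lead: posts status updates to stakeholders.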
If in doubt, declare. It's always easier to stand down an incident than to escalate a non-incident that becomes one.
Set severity:
| Severity | Criteria | Response time |
|----------|----------|---------------|
| SEV1 / P0 | Total outage, revenue impact, data loss | Immediate - all hands |
| SEV2 / P1 | Significant degradation, major feature down | < 15 min response |
| SEV3 / P2 | Minor degradation, workaround available | < 1 hour response |
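The severity table can be expressed as a small lookup, which is handy when a bot or script has to pick a paging policy. This is a minimal sketch, not a definitive implementation: the boolean inputs to `classify` are hypothetical, and it assumes any single SEV1 criterion (outage, revenue impact, or data loss) is enough to declare SEV1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Severity:
    label: str
    response_minutes: int  # 0 means immediate, all hands


# Values taken from the severity table.
SEV1 = Severity("SEV1 / P0", 0)
SEV2 = Severity("SEV2 / P1", 15)
SEV3 = Severity("SEV3 / P2", 60)


def classify(total_outage: bool, revenue_impact: bool,
             data_loss: bool, major_feature_down: bool) -> Severity:
    """Map observed impact to a severity (assumed rule: any SEV1 criterion wins)."""
    if total_outage or revenue_impact or data_loss:
        return SEV1
    if major_feature_down:
        return SEV2
    # Anything declared but less severe is treated as minor degradation.
    return SEV3
```

Checking the worst criteria first follows the "if in doubt, declare" rule: an ambiguous incident lands in the higher severity and can be downgraded later.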
One person owns the incident: the Incident Commander (IC). The IC delegates all technical actions and stays at the strategic level.
Create a dedicated Slack channel or incident doc immediately. Post within 2 minutes:
INCIDENT: [service] [severity]
IC: [name]
Operations: [name]
Status: Investigating
Symptoms: [what users are experiencing]
Started: [approximate time]
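To hit the 2-minute target, the announcement can be generated rather than typed. The sketch below fills in the template above; the function name and parameters are illustrative assumptions, and posting the text to Slack or a doc is left out.

```python
def incident_announcement(service: str, severity: str, ic: str,
                          operations: str, symptoms: str, started: str) -> str:
    """Render the initial incident announcement from the template fields."""
    return "\n".join([
        f"INCIDENT: {service} {severity}",
        f"IC: {ic}",
        f"Operations: {operations}",
        "Status: Investigating",  # a new incident always starts here
        f"Symptoms: {symptoms}",
        f"Started: {started}",
    ])
```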
Operations Lead: form a hypothesis before running commands.
SRE principle: change is the enemy of stability. Look at what changed first.
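One way to "look at what changed first" is to list every deploy, config push, or flag flip in a window before the incident started. The sketch below assumes change events have already been collected into a list; the `Change` type and the 24-hour default lookback are illustrative assumptions, not part of the original guidance.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Change(NamedTuple):
    when: datetime
    description: str


def changes_before(changes: List[Change], incident_start: datetime,
                   lookback: timedelta = timedelta(hours=24)) -> List[Change]:
    """Return changes inside the lookback window, most recent first."""
    window_start = incident_start - lookback
    recent = [c for c in changes if window_start <= c.when <= incident_start]
    return sorted(recent, key=lambda c: c.when, reverse=True)
```

Sorting most recent first puts the likeliest suspect at the top, which gives the Operations Lead a starting hypothesis before any command is run.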
Communications Lead posts status updates every 15 minutes (SEV1) or 30 minutes (SEV2) even if nothing has changed. "We are still investigating" is a valid update. Silence breeds panic.
Internal update format:
[TIME] Status: [Investigating / Mitigating / Monitoring / Resolved]
Impact: [what users are experiencing]
Current action: [what the team is doing right now]
Next update: [time]
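The update cadence and format above can be combined so that every update commits to the time of the next one. This is a sketch under stated assumptions: the function name is hypothetical, times are rendered as HH:MM, and only SEV1 and SEV2 have intervals because those are the only ones given.

```python
from datetime import datetime, timedelta

# Update intervals in minutes, as stated for SEV1 and SEV2.
UPDATE_INTERVAL_MINUTES = {"SEV1": 15, "SEV2": 30}


def status_update(now: datetime, severity: str, status: str,
                  impact: str, current_action: str) -> str:
    """Render an internal status update, including the next update time."""
    next_update = now + timedelta(minutes=UPDATE_INTERVAL_MINUTES[severity])
    return "\n".join([
        f"[{now:%H:%M}] Status: {status}",
        f"Impact: {impact}",
        f"Current action: {current_action}",
        f"Next update: {next_update:%H:%M}",
    ])
```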
Mitigation = stop the bleeding. Fix = address the root cause.
Mitigation actions (in order of preference):
Apply the fastest, lowest-risk mitigation first.
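"Fastest, lowest-risk first" can be treated as a sort order over the options on the table. The sketch below is illustrative only: the `Mitigation` type and its risk and time estimates are hypothetical, and it assumes risk is compared before speed.

```python
from typing import List, NamedTuple


class Mitigation(NamedTuple):
    name: str
    risk: int      # hypothetical scale: 1 = low, 3 = high
    minutes: int   # estimated time to apply


def pick_mitigation(options: List[Mitigation]) -> Mitigation:
    """Choose the lowest-risk option, breaking ties by speed."""
    return min(options, key=lambda m: (m.risk, m.minutes))
```

With these assumptions, a 5-minute low-risk rollback beats a 60-minute high-risk hotfix, which matches the rollback-first advice in the anti-patterns below.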
When user impact is resolved (not when root cause is found):
RESOLVED: [service] is restored
Duration: [start] to [end]
Impact: [who was affected, what was degraded]
Resolution: [what action resolved it]
Root cause: [preliminary - full postmortem to follow]
Next step: Postmortem within [48h for SEV1, 1 week for SEV2]
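The postmortem deadline in the last line of the template can be computed from the resolution time. A minimal sketch, assuming the clock starts when user impact is resolved; the function name is hypothetical.

```python
from datetime import datetime, timedelta

# Postmortem windows from the resolution template.
POSTMORTEM_WINDOW = {"SEV1": timedelta(hours=48), "SEV2": timedelta(weeks=1)}


def postmortem_due(severity: str, resolved_at: datetime) -> datetime:
    """Return the latest time the postmortem should be held."""
    return resolved_at + POSTMORTEM_WINDOW[severity]
```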
1. IC also making production changes
   - Bad: The incident commander is also running kubectl commands and checking logs.
   - Good: IC delegates all technical actions. Two roles, two people. IC stays at the strategic level.
2. No regular status updates
   - Bad: Stakeholders hear nothing for 45 minutes.
   - Good: Communications Lead posts every 15 minutes (SEV1). Even "still investigating" counts.
3. Fixing root cause under pressure instead of mitigating
   - Bad: "We know the bug, let me push a fix to prod."
   - Good: Rollback first, restore service, then fix properly in a controlled environment.
4. Declaring the incident over when root cause is unknown
   - Bad: Service is restored, incident closed, no postmortem scheduled.
   - Good: Resolution = user impact resolved. Root cause investigation continues. Postmortem is mandatory for SEV1/SEV2.