
qa-aman/incident-response

Name: incident-response
Author: qa-aman

skills/by-role/devops/incident-response/SKILL.md

npx skillsauth add qa-aman/claude-skills incident-response


Overview

Based on Site Reliability Engineering (the Google SRE Book), edited by Beyer, Jones, Petoff & Murphy. Google's incident management framework establishes clear roles, communication protocols, and decision hierarchies that let teams respond to complex outages without chaos. The core principle: when systems fail, the response should be calmer and more structured than the incident itself.

Three SRE roles during an incident:

Incident Commander (IC) - owns the response, makes decisions, delegates
Operations Lead - makes changes to the system (the IC does not touch production)
Communications Lead - handles stakeholder updates, keeps IC focused on the incident

Workflow

Step 1: Declare the incident

If in doubt, declare. It's always easier to stand down an incident than to escalate a non-incident that becomes one.

Set severity:

| Severity | Criteria | Response time |
|----------|----------|---------------|
| SEV1 / P0 | Total outage, revenue impact, data loss | Immediate - all hands |
| SEV2 / P1 | Significant degradation, major feature down | < 15 min response |
| SEV3 / P2 | Minor degradation, workaround available | < 1 hour response |
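The severity table can be encoded as a small lookup so that triage is consistent across responders. The following is a minimal sketch, not part of the skill itself: the `classify` function, its boolean flags, and the choice to treat each SEV1 criterion as sufficient on its own are assumptions made for illustration.

```python
from datetime import timedelta
from enum import Enum
from typing import Dict, Optional


class Severity(Enum):
    SEV1 = "SEV1 / P0"
    SEV2 = "SEV2 / P1"
    SEV3 = "SEV3 / P2"


# Response targets copied from the table; None stands for "immediate, all hands".
RESPONSE_TARGET: Dict[Severity, Optional[timedelta]] = {
    Severity.SEV1: None,
    Severity.SEV2: timedelta(minutes=15),
    Severity.SEV3: timedelta(hours=1),
}


def classify(*, total_outage: bool = False, revenue_impact: bool = False,
             data_loss: bool = False, significant_degradation: bool = False,
             major_feature_down: bool = False) -> Severity:
    """Map observed symptoms to a severity (hypothetical helper).

    Assumption: any single criterion in a row is enough to reach that row.
    Anything that matches no higher row is declared as SEV3, in the spirit
    of "if in doubt, declare".
    """
    if total_outage or revenue_impact or data_loss:
        return Severity.SEV1
    if significant_degradation or major_feature_down:
        return Severity.SEV2
    return Severity.SEV3


severity = classify(major_feature_down=True)
print(severity.value, RESPONSE_TARGET[severity])
```

A team would replace the boolean flags with whatever signals it actually has (alerts, error budgets, customer reports); the point is only that the criteria live in one place.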

Step 2: Assign the IC

One person owns the incident. The IC:

Does NOT make changes to production themselves
Delegates all technical actions to the Operations Lead
Asks for status updates, not details ("what's the impact?" not "show me the logs")
Makes the call to escalate, roll back, or stand down

Step 3: Open the incident channel

Create a dedicated Slack channel or incident doc immediately. Post within 2 minutes:

INCIDENT: [service] [severity]
IC: [name]
Operations: [name]
Status: Investigating
Symptoms: [what users are experiencing]
Started: [approximate time]
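Filling in the template by hand under pressure is error-prone, so it can help to generate the post. This is an illustrative sketch: `initial_post` and the example names are hypothetical, and the check that the IC and Operations Lead differ is borrowed from the skill's rule that the two roles belong to two people.

```python
def initial_post(service: str, severity: str, ic: str, operations: str,
                 symptoms: str, started: str) -> str:
    """Render the initial incident-channel post from the template above."""
    # The IC must not also be the person changing production.
    if ic.strip().lower() == operations.strip().lower():
        raise ValueError("IC and Operations Lead must be different people")
    return "\n".join([
        f"INCIDENT: {service} {severity}",
        f"IC: {ic}",
        f"Operations: {operations}",
        "Status: Investigating",
        f"Symptoms: {symptoms}",
        f"Started: {started}",
    ])


# Example values are made up for illustration.
print(initial_post("checkout-api", "SEV1", "Alex", "Sam",
                   "Checkout requests return HTTP 500", "approx. 14:05 UTC"))
```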

Step 4: Investigate - form hypotheses

Operations Lead: form a hypothesis before running commands.

What changed recently? (deploys, config changes, dependency updates)
What do metrics show? (error rate, latency, saturation)
What do logs show?

SRE principle: change is the enemy of stability. Look at what changed first.

Step 5: Communicate on a schedule

Communications Lead posts status updates every 15 minutes (SEV1) or 30 minutes (SEV2) even if nothing has changed. "We are still investigating" is a valid update. Silence breeds panic.

Internal update format:

[TIME] Status: [Investigating / Mitigating / Monitoring / Resolved]
Impact: [what users are experiencing]
Current action: [what the team is doing right now]
Next update: [time]
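The update cadence and the internal format can be combined so that every post states when the next one is due. A minimal sketch, assuming severities are passed as the strings "SEV1" and "SEV2"; the text gives no cadence for SEV3, so the hypothetical `status_update` helper rejects it rather than guessing.

```python
from datetime import datetime, timedelta

# Cadence from the text above; no cadence is given for SEV3, so it is left out.
UPDATE_INTERVAL = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
}
STATUSES = ("Investigating", "Mitigating", "Monitoring", "Resolved")


def status_update(now: datetime, severity: str, status: str,
                  impact: str, current_action: str) -> str:
    """Render one internal update and compute when the next one is due."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status}")
    if severity not in UPDATE_INTERVAL:
        raise ValueError(f"no update cadence defined for {severity}")
    next_update = now + UPDATE_INTERVAL[severity]
    return "\n".join([
        f"[{now:%H:%M}] Status: {status}",
        f"Impact: {impact}",
        f"Current action: {current_action}",
        f"Next update: {next_update:%H:%M}",
    ])


# Example values are made up for illustration.
print(status_update(datetime(2026, 1, 1, 10, 0), "SEV1", "Investigating",
                    "Checkout is failing for all users",
                    "Reviewing the most recent deploy"))
```

Computing the next update time from the severity keeps the schedule honest: the Communications Lead commits to a concrete time in every post, including the "still investigating" ones.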

Step 6: Mitigate before you fix

Mitigation = stop the bleeding. Fix = address the root cause.

Mitigation actions (in order of preference):

Roll back the most recent deploy
Disable the affected feature flag
Redirect traffic to a healthy region
Scale up capacity
Apply a hotfix (last resort - introduces new risk)

Apply the fastest, lowest-risk mitigation first.
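The preference order above can be written down as data, so the Operations Lead picks the first option that actually applies to the incident. This sketch assumes the listed order already reflects "fastest, lowest-risk first"; the short option keys and the `pick_mitigation` helper are hypothetical names.

```python
from typing import Optional, Set

# Mitigations in the order of preference listed above (keys are illustrative).
MITIGATION_ORDER = [
    "rollback",          # Roll back the most recent deploy
    "feature_flag",      # Disable the affected feature flag
    "traffic_redirect",  # Redirect traffic to a healthy region
    "scale_up",          # Scale up capacity
    "hotfix",            # Last resort: introduces new risk
]


def pick_mitigation(available: Set[str]) -> Optional[str]:
    """Return the most preferred mitigation that is available, or None."""
    for option in MITIGATION_ORDER:
        if option in available:
            return option
    return None


# No recent deploy to roll back, but a flag and spare capacity exist.
print(pick_mitigation({"scale_up", "feature_flag"}))  # feature_flag
```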

Step 7: Declare resolution

When user impact is resolved (not when root cause is found):

RESOLVED: [service] is restored
Duration: [start] to [end]
Impact: [who was affected, what was degraded]
Resolution: [what action resolved it]
Root cause: [preliminary - full postmortem to follow]
Next step: Postmortem within [48h for SEV1, 1 week for SEV2]
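The resolution post carries two derived values, the incident duration and the postmortem deadline, and both can be computed rather than typed. A minimal sketch under stated assumptions: the deadline is measured from the time of resolution (the text does not say where the 48 hours or 1 week start), and `resolution_post` is a hypothetical helper.

```python
from datetime import datetime, timedelta

# Postmortem windows from the template above; SEV3 is not specified there.
POSTMORTEM_WINDOW = {
    "SEV1": timedelta(hours=48),
    "SEV2": timedelta(weeks=1),
}


def resolution_post(service: str, severity: str, start: datetime, end: datetime,
                    impact: str, resolution: str, preliminary_cause: str) -> str:
    """Render the resolution post with duration and postmortem deadline."""
    if end < start:
        raise ValueError("end must not be before start")
    if severity not in POSTMORTEM_WINDOW:
        raise ValueError(f"no postmortem window defined for {severity}")
    minutes = int((end - start).total_seconds() // 60)
    # Assumption: the window is counted from the resolution time.
    due = end + POSTMORTEM_WINDOW[severity]
    return "\n".join([
        f"RESOLVED: {service} is restored",
        f"Duration: {start:%H:%M} to {end:%H:%M} ({minutes} min)",
        f"Impact: {impact}",
        f"Resolution: {resolution}",
        f"Root cause: {preliminary_cause} (preliminary - full postmortem to follow)",
        f"Next step: Postmortem by {due:%Y-%m-%d %H:%M}",
    ])


# Example values are made up for illustration.
print(resolution_post("checkout-api", "SEV1",
                      datetime(2026, 1, 1, 10, 0), datetime(2026, 1, 1, 10, 42),
                      "All users could not check out",
                      "Rolled back the most recent deploy",
                      "Suspected regression in the latest release"))
```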

Anti-Patterns

1. IC also making production changes
Bad: The incident commander is also running kubectl commands and checking logs.
Good: IC delegates all technical actions. Two roles, two people. IC stays at the strategic level.

2. No regular status updates
Bad: Stakeholders hear nothing for 45 minutes.
Good: Communications Lead posts every 15 minutes (SEV1). Even "still investigating" counts.

3. Fixing root cause under pressure instead of mitigating
Bad: "We know the bug, let me push a fix to prod."
Good: Roll back first, restore service, then fix properly in a controlled environment.

4. Declaring the incident over when root cause is unknown
Bad: Service is restored, incident closed, no postmortem scheduled.
Good: Resolution = user impact resolved. Root cause investigation continues. Postmortem is mandatory for SEV1/SEV2.

Quality Checklist

[ ] Severity declared with criteria
[ ] Incident Commander assigned (not the same person as Operations Lead)
[ ] Incident channel opened with initial status post
[ ] Status updates posted on schedule (every 15-30 min)
[ ] Mitigation prioritized over fix under pressure
[ ] Resolution post includes: duration, impact, resolution action, postmortem date
[ ] Postmortem scheduled before incident is fully closed


Run a structured incident response. Use when the user says "we have an incident", "production is down", "service is degraded", "on-call response", "p0 incident", "something is broken in prod", "help me manage this incident", "incident commander", or there is an active production issue requiring coordinated response - even if they don't explicitly say "incident response".

13 stars

testing

Updated Apr 23, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add qa-aman/claude-skills incident-response

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Status | Scanner | Description | Percentage |
|--------|---------|-------------|------------|
| Clean | Trivy | Container and dependency vulnerability scanner | 95% |
| Clean | Semgrep | Static code analysis for vulnerabilities | 95% |
| Clean | mcp-scan (Snyk) | Model Context Protocol security validation | 95% |
| Skipped | Snyk (dep) | Open source security scanning | 50% |
| Skipped | Socket.dev | Supply chain security analysis | 50% |
| Skipped | VirusTotal | Multi-engine malware detection | 50% |
| Skipped | CrowdStrike | Advanced threat intelligence | 50% |
| Skipped | OSV-Scanner | Open Source Vulnerability database check | 50% |
| Skipped | OWASP Dep-Check | | 50% |

Last scanned: Apr 23, 2026, 1:59 PM (150.2s, 1 file scanned)




Install manually

# Clone the repo
git clone https://github.com/qa-aman/claude-skills.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into it
cp -r claude-skills/skills/by-role/devops/incident-response ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository: qa-aman/claude-skills (13 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT