qa-aman/runbook

Name: runbook
Author: qa-aman

skills/by-role/devops/runbook/SKILL.md

npx skillsauth add qa-aman/claude-skills runbook

Overview

Based on the Google SRE Book. A runbook is the operational contract for a service: how to start it, stop it, diagnose it, and recover it. Google SRE's standard: a runbook should be executable by an on-call engineer who has never seen the service before. If it requires tribal knowledge, it's not a runbook - it's an assumption.

The test: can an on-call engineer follow this runbook at 3am with a degraded service, no time to research, and adrenaline affecting their cognition? If not, it's not operational-grade.

Workflow

Step 1: Write the service overview

Service: [your service]
Owner team: [your team]
On-call rotation: [who gets paged]
SLO: [availability target, e.g. 99.9% uptime]
Dependencies: [upstream services this depends on]
Dependents: [downstream services that depend on this]
Dashboards: [links]
Logs: [how to access]
Alerts: [link to alerting config]

Step 2: Document steady-state operations

What does normal look like?

Health check:

# Command to verify the service is healthy
curl https://[your service]/health
# Expected response: {"status": "ok", "version": "x.y.z"}
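A runbook can pair the health check with a small helper that turns the response body into an unambiguous verdict. The following is a minimal sketch, not part of the original skill: the function name `interpret_health` is hypothetical, and it assumes the endpoint returns a JSON object with a `status` field of `ok` or `degraded`, as in the examples in this document.

```python
import json


def interpret_health(body: str) -> str:
    """Classify a /health response body as 'healthy', 'degraded', or 'unknown'."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # Not JSON at all (for example an HTML error page from a proxy).
        return "unknown"
    if not isinstance(payload, dict):
        return "unknown"
    status = payload.get("status")
    if status == "ok":
        return "healthy"
    if status == "degraded":
        return "degraded"
    # Any other value is treated as unknown so the engineer escalates
    # instead of guessing.
    return "unknown"


if __name__ == "__main__":
    print(interpret_health('{"status": "ok", "version": "x.y.z"}'))
```

Treating anything unexpected as `unknown` keeps the helper from reporting a service as healthy when the response was never understood.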

Key metrics (normal ranges):

| Metric | Normal | Warning | Critical |
|--------|--------|---------|----------|
| Error rate | < 0.1% | 0.1-1% | > 1% |
| p99 latency | < 200ms | 200-500ms | > 500ms |
| CPU usage | < 60% | 60-80% | > 80% |
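The ranges for error rate, p99 latency, and CPU usage can also be encoded so that a reading maps to exactly one level. This is an illustrative sketch only: the metric keys and the `classify` function are hypothetical names, and it assumes the boundary values (for example exactly 1% error rate) fall in the warning band, which is one reading of the ranges above.

```python
# (warning_floor, critical_floor) per metric, taken from the example ranges.
# Keys are hypothetical names chosen for this sketch.
THRESHOLDS = {
    "error_rate_pct": (0.1, 1.0),
    "p99_latency_ms": (200, 500),
    "cpu_usage_pct": (60, 80),
}


def classify(metric: str, value: float) -> str:
    """Return 'normal', 'warning', or 'critical' for a metric reading."""
    warning_floor, critical_floor = THRESHOLDS[metric]
    if value < warning_floor:
        return "normal"
    if value <= critical_floor:
        # Assumption: both ends of the warning range are inclusive.
        return "warning"
    return "critical"


if __name__ == "__main__":
    for name, reading in [("error_rate_pct", 0.5), ("cpu_usage_pct", 85)]:
        print(name, reading, classify(name, reading))
```

Writing the thresholds down as data makes the boundary decisions explicit, which is the kind of ambiguity an on-call engineer should not have to resolve mid-incident.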

Step 3: Document alert runbooks

For each alert, write a response procedure:

Alert: [Alert name]
Severity: [SEV1/SEV2/SEV3]
Fires when: [condition]
Impact: [what users experience]

Diagnosis:
  1. Check [dashboard link] for [specific signal]
  2. Run: [command] - Expected: [output]
  3. Check: [logs location] for [error pattern]

Common causes:
  A. [Cause] - Fix: [command or action]
  B. [Cause] - Fix: [command or action]

If none of the above: escalate to [team/person]

Step 4: Document common operations

List every routine operational task with exact commands:

Restart the service:

kubectl rollout restart deployment/[service-name] -n [namespace]
kubectl rollout status deployment/[service-name] -n [namespace]

Scale up:

kubectl scale deployment/[service-name] --replicas=[n] -n [namespace]

Rollback a deploy:

kubectl rollout undo deployment/[service-name] -n [namespace]
# Verify rollback succeeded:
kubectl rollout status deployment/[service-name] -n [namespace]
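Where a team prefers to script the rollback so that the verification step cannot be skipped, the two commands can be wrapped together. The sketch below is an assumption-laden example, not the skill's own tooling: `rollback_commands` and `rollback` are hypothetical helpers, and they only assemble and run the same `kubectl rollout undo` and `kubectl rollout status` commands shown above.

```python
import subprocess


def rollback_commands(service: str, namespace: str) -> list[list[str]]:
    """Build the undo command followed by the status check, in order."""
    target = f"deployment/{service}"
    return [
        ["kubectl", "rollout", "undo", target, "-n", namespace],
        ["kubectl", "rollout", "status", target, "-n", namespace],
    ]


def rollback(service: str, namespace: str, run=subprocess.run) -> None:
    """Run the rollback, then the verification; stop on the first failure."""
    for command in rollback_commands(service, namespace):
        # check=True raises CalledProcessError if kubectl exits non-zero.
        run(command, check=True)


if __name__ == "__main__":
    # Print the commands instead of executing them.
    for command in rollback_commands("example-service", "example-namespace"):
        print(" ".join(command))
```

Passing the commands as argument lists rather than a shell string avoids quoting problems if a service or namespace name contains unusual characters.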

Step 5: Document recovery procedures

For each failure mode, step-by-step recovery:

Failure: [service is unresponsive / returning 5xx / high latency]
Verify: [how to confirm this is the problem]
Recover:
  1. [first action]
  2. [second action]
  3. [verify recovery: command - expected output]
Escalate if: [condition that means this runbook is insufficient]

Step 6: Document escalation paths

Who to call when the runbook doesn't work:

| Situation | Contact | How |
|-----------|---------|-----|
| Database issue | [DB team] | [PagerDuty / Slack] |
| Network issue | [Infra team] | [contact] |
| Vendor outage | [Vendor support] | [ticket URL] |

Step 7: Add a testing section

How does an on-call engineer practice with this runbook in a non-emergency?

Chaos engineering scenarios
Runbook review frequency (recommend: quarterly)
Last tested: [date]

Anti-Patterns

1. Runbook that requires tribal knowledge
Bad: "Check if the DB is having issues." (how? where? what to look for?)
Good: "Run psql -h [host] -U [user] -c 'SELECT count(*) FROM pg_stat_activity' - if > 100 connections, connection pool exhaustion is likely."

2. Commands without expected output
Bad: "Run the health check command."
Good: "Run: curl https://[your service]/health - Expected: {"status":"ok"}. If you see {"status":"degraded"}, proceed to Step 3."

3. No escalation path
Bad: Runbook ends with "investigate further."
Good: Every runbook has an explicit "escalate if" condition and names who to escalate to.

4. Runbook never tested
Bad: Runbook written once, never validated in a real incident.
Good: Run quarterly fire drills. Note the last tested date in the runbook.

Quality Checklist

[ ] Service overview complete (owner, SLO, dependencies, links)
[ ] Normal operating metrics defined with ranges
[ ] Every alert has a response procedure with exact commands
[ ] Commands include expected outputs (not just what to run)
[ ] Common operations documented step-by-step
[ ] Recovery procedures cover the top 3 failure modes
[ ] Escalation path named and reachable
[ ] Last tested date recorded
[ ] Executable at 3am by someone unfamiliar with the service
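Some of the checklist items are mechanical enough to check automatically. The following is a rough sketch under stated assumptions: the marker strings are the field labels from the templates in this document, the 90-day limit stands in for the recommended quarterly review, and `missing_markers` and `is_stale` are hypothetical helper names.

```python
from datetime import date

# Field labels taken from the templates in this document.
REQUIRED_MARKERS = ("Owner team:", "SLO:", "Escalate if:", "Last tested:")


def missing_markers(runbook_text: str) -> list[str]:
    """Return the required field labels that do not appear in the runbook."""
    return [marker for marker in REQUIRED_MARKERS if marker not in runbook_text]


def is_stale(last_tested: date, today: date, max_age_days: int = 90) -> bool:
    """True if the runbook was last tested more than max_age_days ago."""
    return (today - last_tested).days > max_age_days


if __name__ == "__main__":
    draft = "Owner team: payments\nSLO: 99.9% uptime\n"
    print("Missing:", missing_markers(draft))
```

A check like this only covers presence and freshness. Whether the runbook is executable at 3am by someone unfamiliar with the service still needs a human drill.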

Write an operational runbook. Use when the user says "write a runbook", "on-call documentation", "how to operate this service", "alert runbook", "troubleshooting guide for ops", "what to do when this alert fires", "operational procedures", or needs to document how to run, troubleshoot, or respond to a service - even if they don't explicitly say "runbook".

13 stars

testing

Updated Apr 23, 2026

npx skillsauth add qa-aman/claude-skills runbook

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy - Container and dependency vulnerability scanner - Clean - 95%
Semgrep - Static code analysis for vulnerabilities - Clean - 95%
mcp-scan (Snyk) - Model Context Protocol security validation - Clean - 95%
Snyk (dep) - Open source security scanning - Skipped - 50%
Socket.dev - Supply chain security analysis - Skipped - 50%
VirusTotal - Multi-engine malware detection - Skipped - 50%
CrowdStrike - Advanced threat intelligence - Skipped - 50%
OSV-Scanner - Open Source Vulnerability database check - Skipped - 50%
OWASP Dep-Check - Skipped - 50%

Last scanned: Apr 23, 2026, 2:00 PM (171.5s, 1 file scanned)

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Install manually

# Clone the repo
git clone https://github.com/qa-aman/claude-skills.git

# Make sure the global Claude Code skills folder exists
mkdir -p ~/.claude/skills

# Copy the skill into the Claude Code skills folder (global)
cp -r claude-skills/skills/by-role/devops/runbook ~/.claude/skills/

Repository

qa-aman/claude-skills

13 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT