skills/by-role/devops/runbook/SKILL.md
Write an operational runbook. Use when the user says "write a runbook", "on-call documentation", "how to operate this service", "alert runbook", "troubleshooting guide for ops", "what to do when this alert fires", "operational procedures", or needs to document how to run, troubleshoot, or respond to a service - even if they don't explicitly say "runbook".
Based on the Google SRE Book. A runbook is the operational contract for a service: how to start it, stop it, diagnose it, and recover it. Google SRE's standard: a runbook should be executable by an on-call engineer who has never seen the service before. If it requires tribal knowledge, it's not a runbook - it's an assumption.
The test: can an on-call engineer follow this runbook at 3am with a degraded service, no time to research, and adrenaline affecting their cognition? If not, it's not operational-grade.
Service: [your service]
Owner team: [your team]
On-call rotation: [who gets paged]
SLO: [availability target, e.g. 99.9% uptime]
Dependencies: [upstream services this depends on]
Dependents: [downstream services that depend on this]
Dashboards: [links]
Logs: [how to access]
Alerts: [link to alerting config]
What does normal look like?
Health check:
# Command to verify the service is healthy
# -f makes curl exit non-zero on HTTP errors; -sS hides progress but still prints errors
curl -fsS https://[your service]/health
# Expected response: {"status": "ok", "version": "x.y.z"}
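The health check above can be scripted so that the pass/fail decision does not depend on reading JSON by eye. This is a minimal sketch, not a definitive implementation: the function names `health_status` and `is_healthy` are hypothetical, and it assumes the response carries a top-level string field named "status" as in the expected response shown above.

```shell
# Print the value of the "status" field from a JSON health response read on stdin.
# Assumption: the body looks like {"status": "ok", "version": "x.y.z"}.
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Succeed (exit 0) only when the reported status is exactly "ok".
is_healthy() {
  [ "$(health_status)" = "ok" ]
}

# Usage (HEALTH_URL is a placeholder you must set yourself):
#   curl -fsS --max-time 5 "$HEALTH_URL" | is_healthy && echo "healthy" || echo "NOT healthy"
```

A proper JSON parser is more robust than `sed` if one is available on the on-call host; the pattern match is used here only to keep the sketch dependency-free.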
Key metrics (normal ranges):
| Metric | Normal | Warning | Critical |
|--------|--------|---------|----------|
| Error rate | < 0.1% | 0.1-1% | > 1% |
| p99 latency | < 200ms | 200-500ms | > 500ms |
| CPU usage | < 60% | 60-80% | > 80% |
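The thresholds in the table above can be turned into a small check so that a reading is classified the same way every time. The sketch below assumes "higher is worse" for every metric and that boundary values (for example exactly 1% error rate) count as Warning, which matches the ranges in the table; `classify_metric` is a hypothetical helper name.

```shell
# Classify a metric reading against warning and critical thresholds.
# Args: value, warning threshold, critical threshold (higher is worse).
# awk does the comparison because shell arithmetic is integer-only.
classify_metric() {
  awk -v v="$1" -v warn="$2" -v crit="$3" 'BEGIN {
    if (v + 0 > crit + 0)       print "critical"
    else if (v + 0 >= warn + 0) print "warning"
    else                        print "normal"
  }'
}

# Usage with the error-rate row (values in percent):
#   classify_metric 0.4 0.1 1
```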
For each alert, write a response procedure:
Alert: [Alert name]
Severity: [SEV1/SEV2/SEV3]
Fires when: [condition]
Impact: [what users experience]
Diagnosis:
1. Check [dashboard link] for [specific signal]
2. Run: [command] - Expected: [output]
3. Check: [logs location] for [error pattern]
Common causes:
A. [Cause] - Fix: [command or action]
B. [Cause] - Fix: [command or action]
If none of the above: escalate to [team/person]
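Diagnosis step 3 ("check logs for an error pattern") is easier to follow when it yields a number that can be compared to an expected value. One possible sketch, assuming the logs are available as a plain text file on the host; the function name `recent_error_count` and the default window of 1000 lines are illustrative choices, not part of the template.

```shell
# Count lines matching an error pattern in the last N lines of a log file.
# Args: log file, extended-regex pattern, number of trailing lines (default 1000).
# grep -c exits non-zero when the count is 0, so "|| true" keeps the function
# from failing in scripts that run with "set -e".
recent_error_count() {
  tail -n "${3:-1000}" "$1" | grep -E -c -- "$2" || true
}

# Usage (path and pattern are placeholders):
#   recent_error_count /var/log/myservice/app.log 'connection refused'
```

In the runbook, pair the command with the count that separates "expected noise" from "this is the cause", so the engineer does not have to judge it.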
List every routine operational task with exact commands:
Restart the service:
kubectl rollout restart deployment/[service-name] -n [namespace]
kubectl rollout status deployment/[service-name] -n [namespace]
Scale up:
kubectl scale deployment/[service-name] --replicas=[n] -n [namespace]
Rollback a deploy:
kubectl rollout undo deployment/[service-name] -n [namespace]
# Verify rollback succeeded:
kubectl rollout status deployment/[service-name] -n [namespace]
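The rollback and its verification can be wrapped together so the second command is never skipped under pressure. This is a sketch under stated assumptions: `rollback_and_verify` and the `KUBECTL` override variable are hypothetical names, and the 120s default timeout is an arbitrary starting point. The `--timeout` flag on `kubectl rollout status` makes the wait bounded instead of open-ended.

```shell
# Roll back a deployment, then wait (with a bound) for the rollout to finish.
# Args: deployment name, namespace, optional timeout (default 120s).
# KUBECTL may be overridden, e.g. KUBECTL=echo for a dry run that only prints.
rollback_and_verify() {
  local deploy="$1" ns="$2" timeout="${3:-120s}"
  "${KUBECTL:-kubectl}" rollout undo "deployment/${deploy}" -n "$ns" || return 1
  "${KUBECTL:-kubectl}" rollout status "deployment/${deploy}" -n "$ns" --timeout="$timeout"
}

# Usage (names are placeholders):
#   rollback_and_verify my-service my-namespace 180s || echo "rollback did not complete: escalate"
```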
For each failure mode, step-by-step recovery:
Failure: [service is unresponsive / returning 5xx / high latency]
Verify: [how to confirm this is the problem]
Recover:
1. [first action]
2. [second action]
3. [verify recovery: command - expected output]
Escalate if: [condition that means this runbook is insufficient]
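The "verify recovery" and "escalate if" steps above can be expressed as a bounded retry loop: re-run the verification command a fixed number of times, and treat exhaustion of the attempts as the escalation condition. A minimal sketch; `wait_for_recovery` is a hypothetical helper, and the attempt count and delay are values each runbook would need to choose for its own service.

```shell
# Re-run a verification command until it succeeds or the attempts run out.
# Args: max attempts, seconds between attempts, then the command and its arguments.
# Returns 0 on recovery, 1 when the escalation condition is reached.
wait_for_recovery() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "recovered after ${i} attempt(s)"
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "still failing after ${attempts} attempts: escalate"
  return 1
}

# Usage (HEALTH_URL is a placeholder):
#   wait_for_recovery 10 6 curl -fsS --max-time 5 -o /dev/null "$HEALTH_URL"
```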
Who to call when the runbook doesn't work:
| Situation | Contact | How |
|-----------|---------|-----|
| Database issue | [DB team] | [PagerDuty / Slack] |
| Network issue | [Infra team] | [contact] |
| Vendor outage | [Vendor support] | [ticket URL] |
How does an on-call engineer practice with this runbook in a non-emergency?
1. Runbook that requires tribal knowledge
Bad: "Check if the DB is having issues." (how? where? what to look for?)
Good: "Run psql -h [host] -U [user] -c 'SELECT count(*) FROM pg_stat_activity' - if the count is at or near the server's max_connections (100 by default in PostgreSQL), connection pool exhaustion is likely."
2. Commands without expected output
Bad: "Run the health check command."
Good: "Run: curl https://[your service]/health - Expected: {"status":"ok"}. If you see {"status":"degraded"}, proceed to Step 3."
3. No escalation path
Bad: Runbook ends with "investigate further."
Good: Every runbook has an explicit "escalate if" condition and names who to escalate to.
4. Runbook never tested
Bad: Runbook written once, never validated in a real incident.
Good: Run quarterly fire drills. Note the last tested date in the runbook.
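If the last tested date is recorded in a fixed format, staleness can be checked mechanically instead of relying on memory. The sketch below assumes a YYYY-MM-DD date, a 90-day limit standing in for "quarterly", and GNU `date` (the `-d` flag; on macOS/BSD the equivalent is `gdate` from coreutils). `drill_overdue` and the `TODAY` override are hypothetical names introduced for illustration.

```shell
# Succeed (exit 0) when the last fire drill is older than the allowed age.
# Args: last-tested date (YYYY-MM-DD), optional max age in days (default 90).
# TODAY may be set to a date to make the check reproducible; it defaults to now.
# Assumes GNU date; returns 2 if a date cannot be parsed.
drill_overdue() {
  local last_s now_s
  last_s=$(date -u -d "$1" +%s) || return 2
  now_s=$(date -u -d "${TODAY:-now}" +%s) || return 2
  [ $(( (now_s - last_s) / 86400 )) -gt "${2:-90}" ]
}

# Usage:
#   drill_overdue 2024-01-15 && echo "fire drill overdue: schedule one"
```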