skills/oncall-runbook/SKILL.md
Write an on-call runbook for a service — covering alert definitions, escalation paths, common incident responses, and on-call handoff procedures. Use when asked to write an on-call guide, create alert runbooks, document escalation procedures, or prepare an on-call handoff document. Produces a structured on-call runbook with per-alert response procedures, escalation matrix, diagnostic commands, and handoff template.
npx skillsauth add mohitagw15856/pm-claude-skills oncall-runbook
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Produce a complete on-call runbook for a service — giving the on-call engineer everything they need to respond confidently to alerts at 3am, without having to ask anyone for help.
A good on-call runbook reduces mean time to resolution (MTTR) by eliminating the "what do I do first?" problem. It is written for the on-call engineer who has just been paged and needs to act, not for someone calmly reading documentation.
Ask for these if not already provided:
Team: [Team name] | Tech lead: [Name]
PagerDuty service: [Link] | Escalation policy: [Policy name]
Last updated: [Date] | Next review: [Date + 90 days]
First time on-call for this service? Read the [developer onboarding doc] first — it covers the architecture and how things work. This runbook assumes you understand the service.
Dashboard: [Link — the first thing to open when paged]
Logs: [Link — where to find logs]
Runbook index: Jump to the alert that paged you → [Alert list below]
Can't resolve in 30 min? Escalate to: [Name] via [Slack / PagerDuty]
Rollback command (memorise this):
[rollback command — e.g. kubectl rollout undo deployment/[service-name]]
| Situation | Escalate to | How | After how long |
|---|---|---|---|
| Can't diagnose the alert | [Tech lead name] | Slack DM / Phone | 30 minutes |
| Alert requires infra change | [Platform team] | #platform Slack | Immediately |
| Customer-facing impact | [CSM / Support lead] | #incidents Slack | Immediately (P1) |
| Database issue | [DBA or data team] | Slack / PagerDuty | Immediately |
| [Specific dependency] down | [[Dependency] on-call] | PagerDuty / Slack | Immediately |
| Extended outage (>1 hour) | [Engineering manager] | Phone | 1 hour |
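
The escalation matrix above can also be kept as data, so a script or chat bot can answer "is it time to escalate yet?". The following is a minimal sketch, not part of the original template: the situation keys and the `should_escalate` helper are hypothetical names, the targets are the template's own placeholders, and "Immediately" is modelled as a zero-minute wait.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationRule:
    escalate_to: str
    channel: str
    after_minutes: int  # 0 stands for "Immediately"


# Keys are hypothetical labels; values mirror the rows of the matrix.
ESCALATION_MATRIX = {
    "cannot_diagnose": EscalationRule("[Tech lead name]", "Slack DM / Phone", 30),
    "infra_change": EscalationRule("[Platform team]", "#platform Slack", 0),
    "customer_impact": EscalationRule("[CSM / Support lead]", "#incidents Slack", 0),
    "database_issue": EscalationRule("[DBA or data team]", "Slack / PagerDuty", 0),
    "dependency_down": EscalationRule("[[Dependency] on-call]", "PagerDuty / Slack", 0),
    "extended_outage": EscalationRule("[Engineering manager]", "Phone", 60),
}


def should_escalate(situation: str, minutes_elapsed: int) -> bool:
    """Return True once the wait time for this situation has passed."""
    rule = ESCALATION_MATRIX[situation]
    return minutes_elapsed >= rule.after_minutes
```

For example, `should_escalate("cannot_diagnose", 29)` is false and becomes true at minute 30, while any "Immediately" row is true from minute zero.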
Contacts:
| Name | Role | Slack | Phone |
|---|---|---|---|
| [Name] | Tech lead | @[handle] | [Number] |
| [Name] | Engineering manager | @[handle] | [Number] |
| [Name] | Platform / infra | @[handle] | [Number] |
| [Platform team] | Infra on-call | #platform | PagerDuty |
[Upstream callers]
│
▼
[This Service]
│
├──→ [Primary Database]
├──→ [Cache — e.g. Redis]
└──→ [Downstream Service / Queue]
If this service is down, these are affected: [List downstream consumers]
If these are down, this service is affected: [List upstream dependencies]
What it means: [Plain English — e.g. "More than 5% of API requests are returning 5xx errors in the last 5 minutes"]
Severity: P1 / P2 / P3
SLO impact: Yes / No — [If yes: this alert means the error budget is burning at [X]× rate]
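
To make the "[X]× rate" placeholder concrete, here is a small sketch of how an error rate and an error-budget burn rate are commonly computed. It is illustrative only: the 99% SLO target and the request counts are assumptions, and the function names are hypothetical. Burn rate here means the observed error rate divided by the error budget (1 minus the SLO target).

```python
def error_rate(total_requests: int, error_requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if total_requests == 0:
        return 0.0
    return error_requests / total_requests


def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is burning."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return observed_error_rate / budget


# Assumed numbers for a 5-minute window: 12,000 requests, 720 of them 5xx.
rate = error_rate(total_requests=12_000, error_requests=720)
alert_fires = rate > 0.05                  # the example 5% threshold
burn = burn_rate(rate, slo_target=0.99)    # roughly 6x with an assumed 99% SLO
```

With these assumed numbers the error rate is 6%, which crosses the example 5% threshold.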
Step 1 — Acknowledge and assess
# Check current error rate
[query or dashboard link]
# Check which endpoints are erroring
[query or command]
Step 2 — Check recent changes
# Any deploys in the last hour?
[command or link to deployment log]
# Recent config changes?
[where to check]
Step 3 — Check dependencies
# Is the database healthy?
[health check command or link]
# Is [downstream service] healthy?
[health check command or link]
Step 4 — Diagnose
| If you see | It means | Do this |
|---|---|---|
| [Error pattern 1] | [Cause] | [Action] |
| [Error pattern 2] | [Cause] | [Action] |
| [Error pattern 3] | [Cause] | [Action] |
| No clear pattern | Unknown cause | Escalate to [name] |
Step 5 — Fix or mitigate
# If caused by bad deploy — roll back:
[rollback command]
# If caused by [specific issue]:
[fix command]
# If caused by upstream dependency:
[mitigation — e.g. enable circuit breaker, reduce traffic, etc.]
After resolving:
- Post to #incidents with resolution summary

What it means: [e.g. "P99 response time has exceeded 1s for more than 3 consecutive minutes"]
Severity: P1 / P2 / P3
SLO impact: Yes — latency SLO breach
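
The example latency condition ("P99 above 1s for more than 3 consecutive minutes") can be read as a run-length check over per-minute samples. The sketch below assumes one P99 value per minute, oldest first; `sustained_breach` is a hypothetical helper, not a feature of any monitoring tool.

```python
from typing import Sequence


def sustained_breach(
    p99_per_minute: Sequence[float],
    threshold_s: float = 1.0,
    min_consecutive: int = 3,
) -> bool:
    """True if P99 stayed above the threshold for MORE than
    `min_consecutive` one-minute samples in a row."""
    run = 0
    for value in p99_per_minute:
        # Extend the current run on a breach, otherwise reset it.
        run = run + 1 if value > threshold_s else 0
        if run > min_consecutive:
            return True
    return False
```

A single healthy minute resets the run, so a brief spike followed by recovery does not page anyone under this reading of the rule.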
Step 1 — Assess scope
# Check which endpoints are slow
[query or dashboard — broken down by endpoint]
# Check if latency is across all regions or localised
[query or command]
Step 2 — Common causes and fixes
| Cause | Signal | Fix |
|---|---|---|
| Database slow queries | DB latency spike on dashboard | [Check slow query log: command] |
| Cache miss storm | Cache hit rate drops on dashboard | [command or action] |
| Memory pressure / GC | High memory on service dashboard | [command or action — e.g. restart, scale up] |
| Upstream service slow | Trace shows time in external call | Escalate to [service] on-call |
| Traffic spike | Request rate spike on dashboard | [Scale up: command] |
Step 3 — Escalate if unresolved in 20 minutes
Page [Tech lead] via PagerDuty / Slack.
What it means: [e.g. "The service has used all available database connections — new requests will fail"]
Severity: P1
SLO impact: Yes — will cause errors immediately
Immediate mitigation:
# Restart the service to flush stale connections
[restart command]
# Check current connection count
[DB connection query]
Diagnose root cause after stabilising:
# Check for long-running queries holding connections
[query]
# Check if a recent deploy changed connection pool config
[where to check]
Resolution: [e.g. "Increase pool size in config / kill long-running queries / scale the service"]
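
When diagnosing pool exhaustion, it helps to separate "the pool is simply full" from "a few long-running queries are holding connections". The sketch below shows that triage on in-memory data. Everything in it is assumed for illustration: `DbSession` is a hypothetical record type rather than a database driver class, and the 60-second threshold and sample rows are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DbSession:
    """One row from a hypothetical session listing (not a real driver type)."""
    pid: int
    state: str
    query_seconds: float


def pool_utilisation(in_use: int, pool_size: int) -> float:
    """Fraction of the connection pool currently in use."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return in_use / pool_size


def long_running(sessions, threshold_seconds=60.0):
    """Sessions at or above the threshold, longest first."""
    offenders = [s for s in sessions if s.query_seconds >= threshold_seconds]
    return sorted(offenders, key=lambda s: s.query_seconds, reverse=True)


# Made-up sample data standing in for the output of the real DB query.
sessions = [
    DbSession(pid=101, state="active", query_seconds=2.5),
    DbSession(pid=102, state="active", query_seconds=840.0),
    DbSession(pid=103, state="idle in transaction", query_seconds=95.0),
]
suspects = long_running(sessions)
```

In this sample, sessions 102 and 103 are the candidates to inspect before deciding between killing queries and increasing the pool size.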
What it means: [e.g. "The message queue backlog exceeds 10,000 messages — consumers are not keeping up"]
Severity: P2
SLO impact: Depends — if queue backs up, downstream systems will receive delayed data
Step 1 — Check consumer health
# Are consumers running?
[command]
# Consumer error rate?
[dashboard or query]
Step 2 — Check message contents
# Are there poison messages causing retries?
[command to inspect dead-letter queue or failed messages]
Step 3 — Options
| If | Then |
|---|---|
| Consumers are down | Restart consumers: [command] |
| Poison message in queue | Move to DLQ: [command] |
| Consumers healthy but slow | Scale consumers: [command] |
| Upstream producing too fast | Escalate to [upstream service] owner |
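
Before choosing between the options above, a rough drain-time estimate shows whether scaling consumers is worth it. This is a back-of-the-envelope sketch: the rates are assumed to be steady, the numbers are invented, and both function names are hypothetical.

```python
import math


def drain_minutes(backlog: int, consume_per_min: float, produce_per_min: float) -> float:
    """Minutes until the backlog is empty; infinite if consumers
    are not outpacing producers."""
    net = consume_per_min - produce_per_min
    if net <= 0:
        return math.inf
    return backlog / net


def consumers_needed(
    backlog: int,
    produce_per_min: float,
    per_consumer_per_min: float,
    target_minutes: float,
) -> int:
    """Consumers required to absorb new traffic AND clear the
    backlog within the target time."""
    required_rate = produce_per_min + backlog / target_minutes
    return math.ceil(required_rate / per_consumer_per_min)


# Assumed: 10,000 queued, consuming 1,500/min, producing 1,000/min.
eta = drain_minutes(10_000, consume_per_min=1_500, produce_per_min=1_000)
needed = consumers_needed(10_000, 1_000, per_consumer_per_min=250, target_minutes=10)
```

An infinite result means scaling or throttling the producer is required, because the backlog will never clear at the current rates.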
Common commands for quick diagnosis. Paste and run without modification.
# Service health
[health check command]
# Recent logs (last 100 lines)
[log command]
# Error logs only
[error log filter command]
# Current pod / instance status
[kubectl get pods / aws ecs describe-tasks / etc.]
# Restart the service
[restart command]
# Roll back to previous version
[rollback command]
# Database connection count
[DB query]
# Cache hit rate
[cache stats command]
# Current request rate
[metrics query]
| Dashboard | URL | Use it to |
|---|---|---|
| Service overview | [Link] | First stop — error rate, latency, request rate |
| Database | [Link] | Connection count, slow queries, replication lag |
| Infrastructure | [Link] | CPU, memory, disk |
| Queue / consumers | [Link] | Backlog depth, consumer throughput |
| Upstream dependencies | [Link] | Dependency health at a glance |
When you declare an incident:
Post to #incidents immediately:
🔴 INCIDENT — [Service Name]
Status: Investigating
Impact: [Who is affected and how]
Paged: [Your name]
Next update: [Time — max 30 min from now]
Update every 30 minutes while active:
🔴 UPDATE — [Service Name] — [Time]
Status: [Investigating / Identified / Mitigating / Resolved]
Latest: [One sentence on what you found or did]
Next update: [Time]
On resolution:
✅ RESOLVED — [Service Name] — [Time]
Duration: [X minutes]
Impact: [Summary of who was affected]
Cause: [One sentence]
Follow-up: [PIR required? Yes/No — link when created]
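
The declaration message can be generated so the "max 30 min" update promise is enforced rather than remembered. The sketch below is one possible helper, not part of the original skill: `declare_incident` and its parameters are hypothetical, and the service name in the usage line is invented.

```python
from datetime import datetime, timedelta

MAX_UPDATE_INTERVAL = timedelta(minutes=30)
DASH = "\u2014"  # the same dash the message templates use


def declare_incident(
    service: str,
    impact: str,
    paged: str,
    now: datetime,
    next_update_in: timedelta = MAX_UPDATE_INTERVAL,
) -> str:
    """Render the initial #incidents post, refusing update gaps over 30 min."""
    if next_update_in > MAX_UPDATE_INTERVAL:
        raise ValueError("next update must be at most 30 minutes away")
    next_update = now + next_update_in
    return "\n".join([
        f"🔴 INCIDENT {DASH} {service}",
        "Status: Investigating",
        f"Impact: {impact}",
        f"Paged: {paged}",
        f"Next update: {next_update:%H:%M}",
    ])


message = declare_incident(
    "checkout-api", "Checkout requests failing", "Sam",
    now=datetime(2024, 1, 1, 3, 0),
)
```

Passing `now` explicitly keeps the function deterministic, which makes it straightforward to test.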
Use this template at the end of every on-call shift:
--- ON-CALL HANDOFF: [Service Name] ---
Date: [Date]
Outgoing: [Your name]
Incoming: [Next on-call name]
INCIDENTS THIS SHIFT:
- [Incident summary — date, duration, cause, resolution, follow-up required]
OPEN ISSUES TO WATCH:
- [Anything not fully resolved / trending in the wrong direction]
CHANGES SINCE LAST HANDOFF:
- [Deploys, config changes, infra changes that affect on-call awareness]
RUNBOOK GAPS FOUND:
- [Anything you had to figure out that isn't documented — please add it]
ANYTHING ELSE:
- [Notes for incoming on-call]
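
If handoffs are posted from a script, the template above can be rendered from structured notes so that no section is silently skipped. This is a minimal sketch under that assumption: `render_handoff` is a hypothetical helper, and empty sections are filled with "None" as a design choice, not something the template prescribes.

```python
SECTION_ORDER = [
    "INCIDENTS THIS SHIFT",
    "OPEN ISSUES TO WATCH",
    "CHANGES SINCE LAST HANDOFF",
    "RUNBOOK GAPS FOUND",
    "ANYTHING ELSE",
]


def render_handoff(service, date, outgoing, incoming, sections):
    """Render the handoff note; every section appears, even when empty."""
    lines = [
        f"--- ON-CALL HANDOFF: {service} ---",
        f"Date: {date}",
        f"Outgoing: {outgoing}",
        f"Incoming: {incoming}",
    ]
    for heading in SECTION_ORDER:
        lines.append(f"{heading}:")
        items = sections.get(heading) or ["None"]
        lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


note = render_handoff(
    "checkout-api", "2024-01-01", "Sam", "Alex",
    {"OPEN ISSUES TO WATCH": ["Queue backlog trending up"]},
)
```

Writing "None" explicitly tells the incoming engineer that a section was considered and is empty.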