plugins/pm-engineering/skills/oncall-runbook/SKILL.md
Write an on-call runbook for a service — covering alert definitions, escalation paths, common incident responses, and on-call handoff procedures. Use when asked to write an on-call guide, create alert runbooks, document escalation procedures, or prepare an on-call handoff document. Produces a structured on-call runbook with per-alert response procedures, escalation matrix, diagnostic commands, and handoff template.
npx skillsauth add mohitagw15856/pm-claude-skills oncall-runbook
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Produce a complete on-call runbook for a service — giving the on-call engineer everything they need to respond confidently to alerts at 3am, without having to ask anyone for help.
A good on-call runbook reduces mean time to resolution (MTTR) by eliminating the "what do I do first?" problem. It is written for the on-call engineer who has just been paged and needs to act, not for someone calmly reading documentation.
Ask for these if not already provided:
Team: [Team name] | Tech lead: [Name]
PagerDuty service: [Link] | Escalation policy: [Policy name]
Last updated: [Date] | Next review: [Date + 90 days]
First time on-call for this service? Read the [developer onboarding doc] first — it covers the architecture and how things work. This runbook assumes you understand the service.
Dashboard: [Link — the first thing to open when paged]
Logs: [Link — where to find logs]
Runbook index: Jump to the alert that paged you → [Alert list below]
Can't resolve in 30 min? Escalate to: [Name] via [Slack / PagerDuty]
Rollback command (memorise this):
[rollback command — e.g. kubectl rollout undo deployment/[service-name]]
| Situation | Escalate to | How | After how long |
|---|---|---|---|
| Can't diagnose the alert | [Tech lead name] | Slack DM / Phone | 30 minutes |
| Alert requires infra change | [Platform team] | #platform Slack | Immediately |
| Customer-facing impact | [CSM / Support lead] | #incidents Slack | Immediately (P1) |
| Database issue | [DBA or data team] | Slack / PagerDuty | Immediately |
| [Specific dependency] down | [[Dependency] on-call] | PagerDuty / Slack | Immediately |
| Extended outage (>1 hour) | [Engineering manager] | Phone | 1 hour |
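The waiting times in the matrix above can be encoded as data, so that a checklist script or chat bot can answer "should I escalate yet?". The following is a minimal Python sketch under stated assumptions: the situation keys (`cant_diagnose`, `infra_change`, and so on) are hypothetical names invented for this example, and only the minute values come from the table.

```python
# Minutes to wait before escalating, copied from the escalation matrix.
# The situation keys are hypothetical names chosen for this sketch.
ESCALATE_AFTER_MINUTES = {
    "cant_diagnose": 30,
    "infra_change": 0,       # "Immediately"
    "customer_impact": 0,    # "Immediately (P1)"
    "database_issue": 0,
    "dependency_down": 0,
    "extended_outage": 60,   # "1 hour"
}


def should_escalate(situation: str, minutes_since_page: float) -> bool:
    """Return True once the waiting time for this situation has elapsed."""
    if situation not in ESCALATE_AFTER_MINUTES:
        raise ValueError(f"unknown situation: {situation}")
    return minutes_since_page >= ESCALATE_AFTER_MINUTES[situation]
```

Keeping the thresholds in one dictionary means the script and the table can be reviewed side by side at the 90-day runbook review.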
Contacts:
| Name | Role | Slack | Phone |
|---|---|---|---|
| [Name] | Tech lead | @[handle] | [Number] |
| [Name] | Engineering manager | @[handle] | [Number] |
| [Name] | Platform / infra | @[handle] | [Number] |
| [Platform team] | Infra on-call | #platform | PagerDuty |
[Upstream callers]
│
▼
[This Service]
│
├──→ [Primary Database]
├──→ [Cache — e.g. Redis]
└──→ [Downstream Service / Queue]
If this service is down, these are affected: [List downstream consumers]
If these are down, this service is affected: [List upstream dependencies]
What it means: [Plain English — e.g. "More than 5% of API requests are returning 5xx errors in the last 5 minutes"]
Severity: P1 / P2 / P3
SLO impact: Yes / No — [If yes: this alert means the error budget is burning at [X]× rate]
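To make the "[X]× rate" placeholder concrete, the burn rate is the observed error rate divided by the error budget (1 minus the SLO target). The sketch below, in Python, shows that arithmetic alongside the 5% example threshold. The function names and the 99.9% SLO target used in the usage note are assumptions for illustration and do not come from the template.

```python
def error_rate(error_count: int, total_count: int) -> float:
    """Fraction of requests that failed in the window (0.0 if there was no traffic)."""
    if total_count == 0:
        return 0.0
    return error_count / total_count


def alert_fires(error_count: int, total_count: int, threshold: float = 0.05) -> bool:
    """True when the windowed error rate is above the threshold (5% in the example)."""
    return error_rate(error_count, total_count) > threshold


def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent.

    The budget is the allowed error fraction, 1 - slo_target.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return observed_error_rate / budget
```

For example, 600 failed requests out of 10,000 in the window is a 6% error rate, so `alert_fires(600, 10_000)` is true, and against an assumed 99.9% SLO `burn_rate(0.06, 0.999)` comes out at roughly 60.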
Step 1 — Acknowledge and assess
# Check current error rate
[query or dashboard link]
# Check which endpoints are erroring
[query or command]
Step 2 — Check recent changes
# Any deploys in the last hour?
[command or link to deployment log]
# Recent config changes?
[where to check]
Step 3 — Check dependencies
# Is the database healthy?
[health check command or link]
# Is [downstream service] healthy?
[health check command or link]
Step 4 — Diagnose
| If you see | It means | Do this |
|---|---|---|
| [Error pattern 1] | [Cause] | [Action] |
| [Error pattern 2] | [Cause] | [Action] |
| [Error pattern 3] | [Cause] | [Action] |
| No clear pattern | Unknown cause | Escalate to [name] |
Step 5 — Fix or mitigate
# If caused by bad deploy — roll back:
[rollback command]
# If caused by [specific issue]:
[fix command]
# If caused by upstream dependency:
[mitigation — e.g. enable circuit breaker, reduce traffic, etc.]
After resolving:
- Post to #incidents with a resolution summary
What it means: [e.g. "P99 response time has exceeded 1s for more than 3 consecutive minutes"]
Severity: P1 / P2 / P3
SLO impact: Yes — latency SLO breach
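The example condition ("P99 response time has exceeded 1s for more than 3 consecutive minutes") combines a percentile with a streak check. A minimal Python sketch of that logic follows, assuming latency samples arrive already grouped per minute. The nearest-rank percentile method and the function names are choices made for this example, and a real monitoring system may compute P99 differently.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_alert(per_minute_samples: list[list[float]],
                  threshold_s: float = 1.0,
                  min_consecutive: int = 3) -> bool:
    """True if P99 exceeded threshold_s for more than min_consecutive minutes in a row."""
    streak = 0
    for samples in per_minute_samples:
        if samples and percentile(samples, 99) > threshold_s:
            streak += 1
            if streak > min_consecutive:
                return True
        else:
            # A healthy (or empty) minute resets the streak.
            streak = 0
    return False
```

Because the check looks at P99 rather than the mean, a minute in which only 2% of requests are slow still counts as a breach, which is why Step 1 breaks latency down by endpoint.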
Step 1 — Assess scope
# Check which endpoints are slow
[query or dashboard — broken down by endpoint]
# Check if latency is across all regions or localised
[query or command]
Step 2 — Common causes and fixes
| Cause | Signal | Fix |
|---|---|---|
| Database slow queries | DB latency spike on dashboard | [Check slow query log: command] |
| Cache miss storm | Cache hit rate drops on dashboard | [command or action] |
| Memory pressure / GC | High memory on service dashboard | [command or action — e.g. restart, scale up] |
| Upstream service slow | Trace shows time in external call | Escalate to [service] on-call |
| Traffic spike | Request rate spike on dashboard | [Scale up: command] |
Step 3 — Escalate if unresolved in 20 minutes
Page [Tech lead] via PagerDuty / Slack.
What it means: [e.g. "The service has used all available database connections — new requests will fail"]
Severity: P1
SLO impact: Yes — will cause errors immediately
Immediate mitigation:
# Restart the service to flush stale connections
[restart command]
# Check current connection count
[DB connection query]
Diagnose root cause after stabilising:
# Check for long-running queries holding connections
[query]
# Check if a recent deploy changed connection pool config
[where to check]
Resolution: [e.g. "Increase pool size in config / kill long-running queries / scale the service"]
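Once the connection count and the list of running queries are in hand, the root-cause check is mostly sorting and thresholding. The Python sketch below illustrates this under stated assumptions: the `Session` shape, the function names, and the 60-second cutoff are all hypothetical, and the real rows would come from whatever the DB connection query returns.

```python
from dataclasses import dataclass


@dataclass
class Session:
    pid: int
    query_seconds: float  # how long the current query has been running


def pool_utilisation(in_use: int, pool_size: int) -> float:
    """Fraction of the pool currently checked out (1.0 means exhausted)."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return in_use / pool_size


def long_running(sessions: list[Session], max_seconds: float = 60.0) -> list[Session]:
    """Sessions whose query has run longer than max_seconds, longest first."""
    offenders = [s for s in sessions if s.query_seconds > max_seconds]
    return sorted(offenders, key=lambda s: s.query_seconds, reverse=True)
```

Sorting longest first puts the most likely culprit at the top of the list before anyone decides whether to kill a query.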
What it means: [e.g. "The message queue backlog exceeds 10,000 messages — consumers are not keeping up"]
Severity: P2
SLO impact: Depends — if queue backs up, downstream systems will receive delayed data
Step 1 — Check consumer health
# Are consumers running?
[command]
# Consumer error rate?
[dashboard or query]
Step 2 — Check message contents
# Are there poison messages causing retries?
[command to inspect dead-letter queue or failed messages]
Step 3 — Options
| If | Then |
|---|---|
| Consumers are down | Restart consumers: [command] |
| Poison message in queue | Move to DLQ: [command] |
| Consumers healthy but slow | Scale consumers: [command] |
| Upstream producing too fast | Escalate to [upstream service] owner |
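When choosing between waiting and scaling consumers, a rough drain-time estimate helps: the backlog only shrinks when the consume rate exceeds the produce rate. The following Python sketch shows that arithmetic. The function names and the example rates are assumptions for illustration, and real throughput figures would come from the queue dashboard.

```python
import math


def minutes_to_drain(backlog: int, consume_per_min: float, produce_per_min: float) -> float:
    """Estimated minutes until the backlog is empty; math.inf if it is not shrinking."""
    net = consume_per_min - produce_per_min
    if net <= 0:
        return math.inf
    return backlog / net


def consumers_needed(produce_per_min: float, per_consumer_per_min: float,
                     backlog: int, target_minutes: float) -> int:
    """Smallest consumer count that keeps up with producers and clears the backlog in time."""
    required_rate = produce_per_min + backlog / target_minutes
    return math.ceil(required_rate / per_consumer_per_min)
```

With a 10,000-message backlog, producers at 1,000 messages per minute, and consumers at 1,500 per minute, the backlog clears in 20 minutes. If consumers only match the produce rate, the estimate is infinite and scaling or escalation is the only way out.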
Common commands for quick diagnosis. Paste and run without modification.
# Service health
[health check command]
# Recent logs (last 100 lines)
[log command]
# Error logs only
[error log filter command]
# Current pod / instance status
[kubectl get pods / aws ecs describe-tasks / etc.]
# Restart the service
[restart command]
# Roll back to previous version
[rollback command]
# Database connection count
[DB query]
# Cache hit rate
[cache stats command]
# Current request rate
[metrics query]
| Dashboard | URL | Use it to |
|---|---|---|
| Service overview | [Link] | First stop — error rate, latency, request rate |
| Database | [Link] | Connection count, slow queries, replication lag |
| Infrastructure | [Link] | CPU, memory, disk |
| Queue / consumers | [Link] | Backlog depth, consumer throughput |
| Upstream dependencies | [Link] | Dependency health at a glance |
When you declare an incident:
Post to #incidents immediately:
🔴 INCIDENT — [Service Name]
Status: Investigating
Impact: [Who is affected and how]
Paged: [Your name]
Next update: [Time — max 30 min from now]
Update every 30 minutes while active:
🔴 UPDATE — [Service Name] — [Time]
Status: [Investigating / Identified / Mitigating / Resolved]
Latest: [One sentence on what you found or did]
Next update: [Time]
On resolution:
✅ RESOLVED — [Service Name] — [Time]
Duration: [X minutes]
Impact: [Summary of who was affected]
Cause: [One sentence]
Follow-up: [PIR required? Yes/No — link when created]
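The "Next update" and "Duration" fields in the templates above are simple time arithmetic that is easy to get wrong at 3am. The Python sketch below computes them, assuming the 30-minute cadence stated in the templates. The function names are hypothetical and not part of any incident tooling.

```python
from datetime import datetime, timedelta

# From "Update every 30 minutes while active".
MAX_UPDATE_GAP = timedelta(minutes=30)


def next_update_due(last_update: datetime) -> datetime:
    """Latest acceptable time for the next status update."""
    return last_update + MAX_UPDATE_GAP


def is_update_overdue(last_update: datetime, now: datetime) -> bool:
    """True once more than 30 minutes have passed since the last update."""
    return now > next_update_due(last_update)


def duration_minutes(started: datetime, resolved: datetime) -> int:
    """Whole minutes between declaring and resolving, for the 'Duration' field."""
    return int((resolved - started).total_seconds() // 60)
```

An incident declared at 09:00 needs its next update by 09:30, and resolving it at 09:47 gives a duration of 47 minutes.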
Use this template at the end of every on-call shift:
--- ON-CALL HANDOFF: [Service Name] ---
Date: [Date]
Outgoing: [Your name]
Incoming: [Next on-call name]
INCIDENTS THIS SHIFT:
- [Incident summary — date, duration, cause, resolution, follow-up required]
OPEN ISSUES TO WATCH:
- [Anything not fully resolved / trending in the wrong direction]
CHANGES SINCE LAST HANDOFF:
- [Deploys, config changes, infra changes that affect on-call awareness]
RUNBOOK GAPS FOUND:
- [Anything you had to figure out that isn't documented — please add it]
ANYTHING ELSE:
- [Notes for incoming on-call]