.claude/skills/postmortem-writing/SKILL.md
Writes blameless postmortems with root cause analysis, incident timelines, contributing factors, and action items. Use when conducting incident reviews or when the user mentions postmortem, root cause analysis, or blameless review.
"Blame-focused: Who caused this? → Blameless: What conditions allowed this?"
Engineers don't fail — systems create conditions where failures become inevitable. The goal is improving systems, not punishing people.
Day 0: Incident occurs
Day 1-2: Draft postmortem (memory is freshest)
Day 3-5: Postmortem meeting
Day 5-7: Finalize + create tickets
Week 2+: Action item completion
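The schedule above can be turned into concrete deadlines once the incident date is known. The following is a minimal sketch, assuming the offsets are calendar days (the schedule does not say whether weekends count) and taking the upper bound of each range as the deadline. `MILESTONES`, `postmortem_schedule`, and the sample date are illustrative names and values, not part of the skill.

```python
from datetime import date, timedelta

# Offsets taken from the schedule above; the upper bound of each range is
# used as the deadline. Assumption: calendar days, not business days.
MILESTONES = [
    ("Draft postmortem", 2),
    ("Postmortem meeting", 5),
    ("Finalize + create tickets", 7),
]


def postmortem_schedule(incident_day: date) -> list[tuple[str, date]]:
    """Return (milestone, latest date) pairs for an incident on incident_day."""
    return [
        (name, incident_day + timedelta(days=offset))
        for name, offset in MILESTONES
    ]


# Hypothetical incident date, used only for illustration.
for name, due in postmortem_schedule(date(2024, 1, 15)):
    print(f"{due.isoformat()}  {name}")
```

Action item completion (Week 2+) is left out on purpose, since those dates come from the individual tickets rather than from a fixed offset.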
# Postmortem: [Title]
**Date**: YYYY-MM-DD | **Severity**: SEV2 | **Duration**: 47 min
**Authors**: @alice, @bob | **Status**: Draft
## Executive Summary
[2-3 sentences: what broke, impact, how resolved]
**Impact**: [N customers, N minutes, revenue loss, no data loss]
## Timeline (UTC)
| Time | Event |
|-------|-------|
| 14:23 | v2.3.4 deployed to production |
| 14:31 | Alert: payment_error_rate > 5% |
| 14:33 | On-call @alice acknowledges |
| 14:45 | Root cause identified: DB connections |
| 14:52 | Decision to rollback |
| 15:10 | Rollback complete, error rate normalizing |
| 15:18 | Service recovered |
## Root Cause Analysis
### What happened
[Technical description of failure]
### 5 Whys
- Why did service fail? → DB connections exhausted
- Why exhausted? → Each request opened new connection
- Why new connections? → Code bypassed connection pool
- Why bypassed? → Developer unfamiliar with DB patterns
- Why unfamiliar? → No documentation on connection management
### Contributing factors
- Code review missed the infrastructure change
- No integration tests for connection pool behavior
- Staging traffic too low to expose the issue
- DB alert threshold too high (90%, should be 70%)
## What worked / what didn't
| Worked | Didn't work |
|---|---|
| Alert fired within 8 min | Took 10 min to correlate with deployment |
| Clear Grafana dashboard | No deployment-correlated alerting |
| Fast rollback decision | No canary deployment |
## Action items
| Priority | Action | Owner | Due | Ticket |
|---|---|---|---|---|
| P0 | Integration test for connection pool | @alice | 2024-01-22 | ENG-1234 |
| P0 | Lower DB alert threshold to 70% | @bob | 2024-01-17 | OPS-567 |
| P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 |
| P2 | Evaluate canary deployment | @charlie | 2024-02-15 | ENG-1235 |
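The figures quoted in the template (47 min duration, alert within 8 min) can be derived from the timeline rather than estimated by hand. Below is a rough sketch using the example timeline from the template. It assumes all entries fall on the same UTC day and that duration is measured from the first alert to recovery; the helper `minutes_between` and the metric names are hypothetical.

```python
from datetime import datetime

# Entries copied from the example timeline in the template (UTC, same day).
TIMELINE = [
    ("14:23", "v2.3.4 deployed to production"),
    ("14:31", "Alert: payment_error_rate > 5%"),
    ("14:33", "On-call acknowledges"),
    ("14:45", "Root cause identified: DB connections"),
    ("14:52", "Decision to rollback"),
    ("15:10", "Rollback complete, error rate normalizing"),
    ("15:18", "Service recovered"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


times = [t for t, _ in TIMELINE]
deploy, alert, ack, recovered = times[0], times[1], times[2], times[-1]

metrics = {
    "time_to_detect_min": minutes_between(deploy, alert),     # deploy -> alert
    "time_to_acknowledge_min": minutes_between(alert, ack),   # alert -> ack
    "duration_min": minutes_between(alert, recovered),        # alert -> recovered
}
print(metrics)
```

With these assumptions the sketch gives 8 minutes to detect and a 47-minute duration, which agrees with the header and the "What worked" table in the example.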
# Quick Postmortem: [Title]
**Date**: YYYY-MM-DD | **Duration**: 12 min | **Severity**: SEV3
**What happened**: Cache flush caused thundering herd — all requests missed cache simultaneously.
**Timeline**: 10:00 flush → 10:02 alerts → 10:05 identified → 10:08 warming enabled → 10:12 normal
**Root cause**: Full flush used for minor config update.
**Fixes**: Immediate: enabled warming. Long-term: partial invalidation (ENG-999).
**Lesson**: Never full-flush production cache; use targeted invalidation.
| Anti-pattern | Why it fails |
|---|---|
| "Human error" as root cause | Always dig deeper: why did the system allow it? |
| Shallow analysis (1 why) | Doesn't prevent recurrence |
| No action items | Meeting was a waste of time |
| Unrealistic actions | Never completed |
| No follow-up tracking | Actions forgotten |
Save to docs/technical/postmortem-YYYY-MM-DD-[slug].md
Deliver: timeline + 1-sentence root cause + max 5 action items with owner and deadline
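The naming and delivery rules above can be checked mechanically. This is a sketch only, assuming the slug is a lowercase, hyphen-separated form of the title (the skill does not define the slug format) and that action items are held as dictionaries with `action`, `owner`, and `due` keys. All function names and the sample title and date are hypothetical.

```python
import re
from datetime import date

MAX_ACTION_ITEMS = 5  # limit stated in the delivery rule above


def slugify(title: str) -> str:
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def postmortem_path(incident_day: date, title: str) -> str:
    """Build the save path following the pattern given above."""
    return f"docs/technical/postmortem-{incident_day.isoformat()}-{slugify(title)}.md"


def check_action_items(items: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the rule is met."""
    problems = []
    if len(items) > MAX_ACTION_ITEMS:
        problems.append(f"{len(items)} action items, maximum is {MAX_ACTION_ITEMS}")
    for item in items:
        for field in ("owner", "due"):
            if not item.get(field):
                problems.append(f"'{item.get('action', '?')}' is missing {field}")
    return problems


# Hypothetical title and date, used only for illustration.
print(postmortem_path(date(2024, 1, 15), "Payment API: DB connections exhausted"))
print(check_action_items([
    {"action": "Integration test for connection pool", "owner": "@alice", "due": "2024-01-22"},
    {"action": "Evaluate canary deployment", "owner": "", "due": "2024-02-15"},
]))
```

Returning a list of problems instead of raising on the first one lets the author fix every gap in a single pass before the postmortem is finalized.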