skills/postmortem-writing/SKILL.md
Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.
npx skillsauth add ranbot-ai/awesome-skills postmortem-writingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.
resources/implementation-playbook.md.| Blame-Focused | Blameless | |---------------|-----------| | "Who caused this?" | "What conditions allowed this?" | | "Someone made a mistake" | "The system allowed this mistake" | | Punish individuals | Improve systems | | Hide information | Share learnings | | Fear of speaking up | Psychological safety |
Day 0: Incident occurs
Day 1-2: Draft postmortem document
Day 3-5: Postmortem meeting
Day 5-7: Finalize document, create tickets
Week 2+: Action item completion
Quarterly: Review patterns across incidents
# Postmortem: [Incident Title]
**Date**: 2024-01-15
**Authors**: @alice, @bob
**Status**: Draft | In Review | Final
**Incident Severity**: SEV2
**Incident Duration**: 47 minutes
## Executive Summary
On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was a database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.
**Impact**:
- 12,000 customers unable to complete purchases
- Estimated revenue loss: $45,000
- 847 support tickets created
- No data loss or security implications
## Timeline (All times UTC)
| Time | Event |
|------|-------|
| 14:23 | Deployment v2.3.4 completed to production |
| 14:31 | First alert: `payment_error_rate > 5%` |
| 14:33 | On-call engineer @alice acknowledges alert |
| 14:35 | Initial investigation begins, error rate at 23% |
| 14:41 | Incident declared SEV2, @bob joins |
| 14:45 | Database connection exhaustion identified |
| 14:52 | Decision to rollback deployment |
| 14:58 | Rollback to v2.3.3 initiated |
| 15:10 | Rollback complete, error rate dropping |
| 15:18 | Service fully recovered, incident resolved |
## Root Cause Analysis
### What Happened
The v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently-called endpoint. Each request opened a new database connection instead of reusing pooled connections.
### Why It Happened
1. **Proximate Cause**: Code change in `PaymentRepository.java` replaced pooled `DataSource` with direct `DriverManager.getConnection()` calls.
2. **Contributing Factors**:
- Code review did not catch the connection handling change
- No integration tests specifically for connection pool behavior
- Staging environment has lower traffic, masking the issue
- Database connection metrics alert threshold was too high (90%)
3. **5 Whys Analysis**:
- Why did the service fail? → Database connections exhausted
- Why were connections exhausted? → Each request opened new connection
- Why did each request open new connection? → Code bypassed connection pool
- Why did code bypass connection pool? → Developer unfamiliar with codebase patterns
- Why was developer unfamiliar? → No documentation on connection management patterns
### System Diagram
[Client] → [Load Balancer] → [Payment Service] → [Database] ↓ Connection Pool (broken) ↓ Direct connections (cause)
## Detection
### What Worked
- Error rate alert fired within 8 minutes of deployment
- Grafana dashboard clearly showed connection spike
- On-call response was swift (2 minute acknowledgment)
### What Didn't Work
- Database connection metric alert threshold too high
- No deployment-correlated alerting
- Canary deployment would have caught this earlier
### Detection Gap
The deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster.
## Response
### What Worked
- On-call engineer quick
testing
Fix SEO indexing issues, crawl budget problems, and Search Console coverage errors for Next.js apps. Covers canonical tags, noindex audits, sitemap health, static rendering, and internal linking.
data-ai
Analyze AI disruption pressure across a business, map competitive exposure, and produce a 90-day defensive action plan.
tools
--- name: longbridge description: 125+ agent skills for Longbridge Securities — real-time quotes, charts, fundamentals, portfolio analysis, options, and more for HK/US/A-share/SG markets. Trilingual: Simplified Chinese, Traditional category: AI & Agents source: antigravity tags: [api, mcp, claude, ai, agent, security, cro] url: https://github.com/sickn33/antigravity-awesome-skills/tree/main/skills/longbridge --- # Longbridge ## Overview Longbridge is the official skill collection for Longbr
tools
Design, debug, and harden GitHub Actions CI/CD workflows, including reusable workflows, matrix builds, self-hosted runners, OIDC authentication, caching, environments, secrets, and release automation.