tranhieutt/postmortem-writing

Name: postmortem-writing
Author: tranhieutt

.claude/skills/postmortem-writing/SKILL.md

npx skillsauth add tranhieutt/software_development_department postmortem-writing

Postmortem Writing

Blameless mindset: the non-obvious part

"Blame-focused: Who caused this? → Blameless: What conditions allowed this?"

Engineers don't fail — systems create conditions where failures become inevitable. The goal is improving systems, not punishing people.

Triggers (when to write one)

- SEV1/SEV2 incidents
- Customer-facing outage > 15 minutes
- Data loss or security incident
- Novel failure modes worth sharing
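
The triggers above can be expressed as a small check. This is a minimal sketch: the `Incident` record and its field names are hypothetical, chosen only for illustration, and whether a failure mode counts as novel remains a human judgment passed in as a flag.

```python
from dataclasses import dataclass


# Hypothetical incident record; field names are illustrative assumptions.
@dataclass
class Incident:
    severity: str                    # e.g. "SEV1", "SEV2", "SEV3"
    customer_facing_outage_min: int  # 0 if there was no customer-facing outage
    data_loss: bool = False
    security_incident: bool = False
    novel_failure_mode: bool = False


def postmortem_triggers(incident: Incident) -> list[str]:
    """Return the triggers from the list above that this incident meets."""
    reasons = []
    if incident.severity.upper() in {"SEV1", "SEV2"}:
        reasons.append("SEV1/SEV2 incident")
    if incident.customer_facing_outage_min > 15:
        reasons.append("customer-facing outage > 15 minutes")
    if incident.data_loss or incident.security_incident:
        reasons.append("data loss or security incident")
    if incident.novel_failure_mode:
        reasons.append("novel failure mode worth sharing")
    return reasons


def needs_postmortem(incident: Incident) -> bool:
    """A postmortem is warranted as soon as any single trigger applies."""
    return bool(postmortem_triggers(incident))
```

Returning the list of matched triggers, rather than only a boolean, lets the postmortem state why it was written.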

Timeline: Day 0 → Day 7

Day 0:   Incident occurs
Day 1-2: Draft postmortem (memory is freshest)
Day 3-5: Postmortem meeting
Day 5-7: Finalize + create tickets
Week 2+: Action item completion
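
To turn the day offsets above into calendar dates, something like the following could work. It is a sketch under two assumptions that the source does not state: the offsets are calendar days (no weekend or holiday handling), and the incident date used in the example is made up.

```python
from datetime import date, timedelta

# Day offsets taken from the timeline above; Day 0 is the incident itself.
PHASES = {
    "Draft postmortem": (1, 2),
    "Postmortem meeting": (3, 5),
    "Finalize + create tickets": (5, 7),
}


def postmortem_schedule(incident_day: date) -> dict[str, tuple[date, date]]:
    """Return the (first day, last day) window for each phase.

    Assumption: offsets are calendar days, not business days.
    """
    return {
        phase: (incident_day + timedelta(days=first),
                incident_day + timedelta(days=last))
        for phase, (first, last) in PHASES.items()
    }


# Hypothetical incident date, for illustration only.
for phase, (first, last) in postmortem_schedule(date(2024, 1, 15)).items():
    print(f"{phase}: {first} to {last}")
```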

Standard template

# Postmortem: [Title]

**Date**: YYYY-MM-DD | **Severity**: SEV2 | **Duration**: 47 min
**Authors**: @alice, @bob | **Status**: Draft

## Executive Summary
[2-3 sentences: what broke, impact, how resolved]

**Impact**: [N customers, N minutes, revenue loss, no data loss]

## Timeline (UTC)

| Time  | Event |
|-------|-------|
| 14:23 | v2.3.4 deployed to production |
| 14:31 | Alert: payment_error_rate > 5% |
| 14:33 | On-call @alice acknowledges |
| 14:45 | Root cause identified: DB connections |
| 14:52 | Decision to roll back |
| 15:10 | Rollback complete, error rate normalizing |
| 15:18 | Service recovered |

## Root Cause Analysis

### What happened
[Technical description of failure]

### 5 Whys
- Why did service fail? → DB connections exhausted
- Why exhausted? → Each request opened new connection
- Why new connections? → Code bypassed connection pool
- Why bypassed? → Developer unfamiliar with DB patterns
- Why unfamiliar? → No documentation on connection management

### Contributing factors
- Code review missed the infrastructure change
- No integration tests for connection pool behavior
- Staging traffic too low to expose the issue
- Alert threshold too high (90%, should be 70%)

## What worked / what didn't

| Worked | Didn't work |
|---|---|
| Alert fired within 8 min | Took 10 min to correlate with deployment |
| Clear Grafana dashboard | No deployment-correlated alerting |
| Fast rollback decision | No canary deployment |

## Action items

| Priority | Action | Owner | Due | Ticket |
|---|---|---|---|---|
| P0 | Integration test for connection pool | @alice | 2024-01-22 | ENG-1234 |
| P0 | Lower DB alert threshold to 70% | @bob | 2024-01-17 | OPS-567 |
| P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 |
| P2 | Evaluate canary deployment | @charlie | 2024-02-15 | ENG-1235 |
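
The intervals quoted in the standard template can be derived from its example timeline. The sketch below copies the events from that table and computes a few gaps; the metric names are illustrative, and the template does not say which interval its "Duration" field measures (alert to recovery happens to match the 47 min shown, and deploy to alert matches "Alert fired within 8 min").

```python
from datetime import datetime

# Events copied from the example timeline in the standard template (UTC, same day).
TIMELINE = [
    ("14:23", "v2.3.4 deployed to production"),
    ("14:31", "Alert: payment_error_rate > 5%"),
    ("14:33", "On-call @alice acknowledges"),
    ("14:45", "Root cause identified: DB connections"),
    ("14:52", "Decision to rollback"),
    ("15:10", "Rollback complete, error rate normalizing"),
    ("15:18", "Service recovered"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def time_of(keyword: str) -> str:
    """Return the timestamp of the first event whose text contains keyword."""
    for timestamp, event in TIMELINE:
        if keyword.lower() in event.lower():
            return timestamp
    raise KeyError(keyword)


deploy, alert = time_of("deployed"), time_of("Alert")
ack, recovered = time_of("acknowledges"), time_of("recovered")

# Metric names are illustrative assumptions, not fields of the template.
metrics = {
    "deploy_to_alert_min": minutes_between(deploy, alert),
    "alert_to_ack_min": minutes_between(alert, ack),
    "alert_to_recovery_min": minutes_between(alert, recovered),
    "deploy_to_recovery_min": minutes_between(deploy, recovered),
}
print(metrics)
```

Computing these from the timeline, instead of typing them by hand, keeps the header and the "what worked" table consistent with the event log.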

Quick template (SEV3, < 30 min)

# Quick Postmortem: [Title]
**Date**: YYYY-MM-DD | **Duration**: 12 min | **Severity**: SEV3

**What happened**: Cache flush caused thundering herd — all requests missed cache simultaneously.
**Timeline**: 10:00 flush → 10:02 alerts → 10:05 identified → 10:08 warming enabled → 10:12 normal
**Root cause**: Full flush used for minor config update.
**Fixes**: Immediate: enabled warming. Long-term: partial invalidation (ENG-999).
**Lesson**: Never full-flush production cache; use targeted invalidation.

Meeting structure (60 min)

Opening (5 min) — state blameless norms explicitly
Timeline review (15 min) — chronological walkthrough
Analysis (20 min) — what failed, why, what conditions allowed it
Action items (15 min) — brainstorm → prioritize → assign owners
Close (5 min) — confirm owners, schedule follow-up
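
As a quick check that the agenda fits the 60-minute slot, the segments above can be laid out as minute offsets. This is only a sketch; the function name is hypothetical.

```python
# Segment lengths in minutes, copied from the meeting structure above.
AGENDA = [
    ("Opening", 5),
    ("Timeline review", 15),
    ("Analysis", 20),
    ("Action items", 15),
    ("Close", 5),
]


def agenda_offsets(agenda):
    """Return (segment, start_minute, end_minute) for a meeting starting at minute 0."""
    offsets, start = [], 0
    for name, minutes in agenda:
        offsets.append((name, start, start + minutes))
        start += minutes
    return offsets


schedule = agenda_offsets(AGENDA)
total_minutes = schedule[-1][2]  # end of the last segment

for name, start, end in schedule:
    print(f"{start:02d}-{end:02d} min  {name}")
```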

Anti-patterns

| Anti-pattern | Why it fails |
|---|---|
| "Human error" as root cause | Always dig deeper: why did the system allow it? |
| Shallow analysis (1 why) | Doesn't prevent recurrence |
| No action items | Meeting was a waste of time |
| Unrealistic actions | Never completed |
| No follow-up tracking | Actions forgotten |
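
Some of these anti-patterns can be caught mechanically before review. The sketch below assumes a hypothetical draft structure (a root-cause string, a list of whys, and action items as dicts with a `ticket` key) and uses a missing ticket as a rough proxy for "no follow-up tracking". Whether an action is unrealistic cannot be checked this way and is left to reviewers.

```python
def lint_postmortem(root_cause: str, whys: list[str], action_items: list[dict]) -> list[str]:
    """Return the anti-patterns from the table that a draft appears to exhibit.

    The input shape is an illustrative assumption, not part of the skill.
    """
    findings = []
    if "human error" in root_cause.lower():
        findings.append('"Human error" as root cause')
    if len(whys) <= 1:
        findings.append("Shallow analysis (1 why)")
    if not action_items:
        findings.append("No action items")
    elif any(not item.get("ticket") for item in action_items):
        # Proxy: an action with no ticket has nowhere to be tracked.
        findings.append("No follow-up tracking")
    return findings


# A draft that trips three checks at once.
print(lint_postmortem("Human error during deploy", ["Why did service fail?"], []))
```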

Output

Save to docs/technical/postmortem-YYYY-MM-DD-[slug].md

Deliver: timeline + 1-sentence root cause + max 5 action items with owner and deadline
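
The output rules above (file path pattern, at most 5 action items, each with owner and deadline) can be sketched as two helpers. The slug rule (lowercase, non-alphanumeric runs collapsed to hyphens) and the `owner`/`due` dict keys are assumptions; the source only gives the path pattern and the limits.

```python
import re
from datetime import date


def postmortem_path(incident_date: date, title: str) -> str:
    """Build docs/technical/postmortem-YYYY-MM-DD-[slug].md from a title.

    Assumed slug rule: lowercase, runs of non-alphanumerics become one hyphen.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"docs/technical/postmortem-{incident_date.isoformat()}-{slug}.md"


def action_item_problems(action_items: list[dict]) -> list[str]:
    """Check the 'max 5 action items with owner and deadline' rule."""
    problems = []
    if len(action_items) > 5:
        problems.append(f"{len(action_items)} action items; the limit is 5")
    for i, item in enumerate(action_items, start=1):
        for field in ("owner", "due"):
            if not item.get(field):
                problems.append(f"action item {i} is missing '{field}'")
    return problems


# Hypothetical date and title, for illustration only.
print(postmortem_path(date(2024, 1, 15), "Payment API: DB connections exhausted"))
```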

Writes blameless postmortems with root cause analysis, incident timelines, contributing factors, and action items. Use when conducting incident reviews or when the user mentions postmortem, root cause analysis, or blameless review.

60 stars

testing

Updated Apr 15, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Reported % |
|---|---|---|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 15, 2026, 11:49 PM (6.9s, 1 file scanned)

SKILL.md

name: postmortem-writing
description: Writes blameless postmortems with root cause analysis, incident timelines, contributing factors, and action items. Use when conducting incident reviews or when the user mentions postmortem, root cause analysis, or blameless review.
when_to_use: When conducting incident reviews, writing blameless postmortems, or documenting root cause analysis after production incidents
allowed-tools: Read, Glob, Grep, Bash
argument-hint: [incident description or ticket ID]
user-invocable: true
effort: 3

Install manually

# Clone the repo
git clone https://github.com/tranhieutt/software_development_department.git

# Copy into the Claude Code skills folder (global), creating it if needed
mkdir -p ~/.claude/skills
cp -r software_development_department/.claude/skills/postmortem-writing ~/.claude/skills/

See the official Claude Code Skills docs for the skills path.

Repository

tranhieutt/software_development_department

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT