Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

hermeticormus/plugins/runbook-writing/skills/runbook-patterns

Name: plugins/runbook-writing/skills/runbook-patterns
Author: hermeticormus

plugins/runbook-writing/skills/runbook-patterns/SKILL.md

npx skillsauth add hermeticormus/librecopy-claude-code plugins/runbook-writing/skills/runbook-patterns

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Runbook Patterns

The 3AM Test

A runbook passes the 3AM test if a competent on-call engineer can:

Find the runbook from the alert within 30 seconds (alert must link directly to runbook)
Understand the scope within 60 seconds (overview section)
Start diagnosing within 2 minutes (first command is on screen without scrolling)

If any of these fail, the runbook will be abandoned during an incident.

Structure Pattern: Symptom → Cause → Remedy

The fundamental runbook pattern. Every diagnostic section follows this sequence:

### Step N: Diagnose [hypothesis]

**Why check this**: [One sentence on why this matters / what it rules out]

**Command:**
```bash
kubectl logs -n production -l app=api --since=5m | tail -50

If you see [specific output]: This indicates [cause]. Go to [Step X]. If you see [other output]: This indicates [other cause]. Go to [Step Y]. If the command fails: [What to do if the command itself doesn't work].


Never leave the engineer without a next step. Every diagnosis branch must resolve to either a remediation or an escalation.

## Command Documentation Pattern

Commands in runbooks must be self-contained. The engineer should not have to look anything up:

```markdown
# Bad: requires knowledge outside the runbook
kubectl get pods | grep Crash

# Bad: requires substitution
kubectl describe pod <pod-name> -n <namespace>

# Good: complete, copy-paste ready
# First, find the pod name:
POD=$(kubectl get pods -n production -l app=api-server -o jsonpath='{.items[0].metadata.name}')

# Then describe it:
kubectl describe pod $POD -n production

Command blocks must include:

Complete commands with no required substitution (or variable setup at top)
Expected output for the "healthy" case
Expected output for each "problem" case
What the output means (do not leave interpretation to the engineer)

Ordered Causes Pattern

List causes in frequency order. The engineer should try the most common diagnosis first:

## Common Causes

This alert fires for three reasons, listed by frequency:

1. **Database connection pool exhaustion** (~55% of occurrences)
   Diagnose: [link to Step 2]

2. **Upstream service timeout** (~30% of occurrences)
   Diagnose: [link to Step 3]

3. **Memory leak after v2.4 deployment** (~15% of occurrences)
   Diagnose: [link to Step 4]

Verification After Remediation

Every remediation step must end with a verification command and expected output:

### Remediation: Restart the API service

**Run:**
```bash
kubectl rollout restart deployment/api-server -n production
kubectl rollout status deployment/api-server -n production --timeout=5m

Expected output:

Waiting for deployment "api-server" rollout to finish: 0 of 3 updated replicas are available...
Waiting for deployment "api-server" rollout to finish: 1 of 3 updated replicas are available...
deployment "api-server" successfully rolled out

Verify resolution: Check Error Rate Dashboard - error rate should drop below 0.1% within 2 minutes. If it does not, proceed to Option B.


## Escalation Gate Pattern

Escalation conditions must be deterministic -- not judgment calls:

```markdown
# Bad: requires judgment
Escalate if the problem seems serious.

# Good: deterministic condition
Escalate to the platform-lead immediately if ANY of the following:
- Error rate has not decreased after 10 minutes of remediation
- kubectl commands return errors (cluster may be degraded)
- You observe any data loss symptoms (mismatched record counts, corrupted responses)
- This is a P1 and you are not 100% confident in your next action

When calling, have ready:
- Current error rate and trend (link to dashboard)
- Steps already attempted and their results
- Relevant log lines (paste them, do not describe them)

SLA Awareness Pattern

On-call engineers need SLA budget context before making downtime decisions:

> **Before taking the service offline**: Check current SLA status.
> SLA target: 99.9% uptime (43.8 minutes/month budget)
> Current status: [status.company.com](https://status.company.com)
>
> If budget is < 15 minutes remaining: escalate to platform-lead before
> any action that causes downtime, even for maintenance.

Postmortem Template Pattern

Postmortems require blameless framing and actionable follow-up:

## Incident: [Date] - [Brief description]

**Duration**: HH:MM (Detection: HH:MM, Resolution: HH:MM)
**Severity**: P[N]
**Impact**: [Users affected, requests failed, revenue impact if known]

## Timeline

| Time (UTC) | Event |
|-----------|-------|
| 14:23 | Alert fired: api_error_rate > 1% |
| 14:25 | On-call paged. Acknowledged. |
| 14:31 | Root cause identified: database connection pool exhausted |
| 14:38 | Remediation applied: pool size increased |
| 14:41 | Error rate dropped to 0.05%. Alert resolved. |

## Root Cause

[Technical description of what caused the incident. Blame the system, not people.]

## Contributing Factors

- [Factor 1: e.g., "Monitoring threshold was too high -- alert should have fired 10 minutes earlier"]
- [Factor 2: e.g., "Runbook did not cover this failure mode"]

## Action Items

| Action | Owner | Priority | Due |
|--------|-------|----------|-----|
| Lower alert threshold to fire earlier | Platform | P2 | 2024-04-01 |
| Update runbook to include connection pool diagnosis | Platform | P2 | 2024-03-25 |
| Add connection pool metric to dashboard | Platform | P3 | 2024-04-15 |

Anti-Patterns

The Tribal Knowledge Runbook

# Bad: requires context not in the runbook
Check the usual places. If it's the database thing, do the usual fix.

# Good: no assumed knowledge
Check the database connection pool:
[full diagnostic command with expected output]

The Stale Runbook

A runbook with an old "last-tested" date is dangerous -- it implies the procedure is current when it may not be. Add to your CI or quarterly review:

# Find runbooks not tested in 90 days
find runbooks/ -name "*.md" | xargs grep -l "last-tested" | \
  xargs grep "last-tested" | awk -F: '{print $1, $NF}' | \
  awk '$2 < date -d "90 days ago"'

The Vague Verification

# Bad: no verification
Restart the service if needed.

# Good: specific verification with expected outcome
kubectl rollout restart deployment/api -n production
# Wait until: kubectl rollout status deployment/api -n production
# shows: "successfully rolled out"
# Then verify on dashboard: error rate < 0.1% within 2 minutes

References

Google SRE Book - On-Call
PagerDuty Incident Response
Atlassian Incident Management
Blameless Postmortems (Etsy)

hermeticormus/plugins/runbook-writing/skills/runbook-patterns

plugins/runbook-writing/skills/runbook-patterns/SKILL.md

# Runbook Patterns ## The 3AM Test A runbook passes the 3AM test if a competent on-call engineer can: 1. Find the runbook from the alert within 30 seconds (alert must link directly to runbook) 2. Understand the scope within 60 seconds (overview section) 3. Start diagnosing within 2 minutes (first command is on screen without scrolling) If any of these fail, the runbook will be abandoned during an incident. ## Structure Pattern: Symptom → Cause → Remedy The fundamental runbook pattern. Every

tools

Updated Apr 21, 2026

$ install --global

skillsauth

npx skillsauth add hermeticormus/librecopy-claude-code plugins/runbook-writing/skills/runbook-patterns

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 21, 2026, 10:08 PM131.0s1 file scanned

SKILL.md

Runbook Patterns

The 3AM Test

A runbook passes the 3AM test if a competent on-call engineer can:

Find the runbook from the alert within 30 seconds (alert must link directly to runbook)
Understand the scope within 60 seconds (overview section)
Start diagnosing within 2 minutes (first command is on screen without scrolling)

If any of these fail, the runbook will be abandoned during an incident.

Structure Pattern: Symptom → Cause → Remedy

The fundamental runbook pattern. Every diagnostic section follows this sequence:

### Step N: Diagnose [hypothesis]

**Why check this**: [One sentence on why this matters / what it rules out]

**Command:**
```bash
kubectl logs -n production -l app=api --since=5m | tail -50


Never leave the engineer without a next step. Every diagnosis branch must resolve to either a remediation or an escalation.

## Command Documentation Pattern

Commands in runbooks must be self-contained. The engineer should not have to look anything up:

```markdown
# Bad: requires knowledge outside the runbook
kubectl get pods | grep Crash

# Bad: requires substitution
kubectl describe pod <pod-name> -n <namespace>

# Good: complete, copy-paste ready
# First, find the pod name:
POD=$(kubectl get pods -n production -l app=api-server -o jsonpath='{.items[0].metadata.name}')

# Then describe it:
kubectl describe pod $POD -n production

Command blocks must include:

Complete commands with no required substitution (or variable setup at top)
Expected output for the "healthy" case
Expected output for each "problem" case
What the output means (do not leave interpretation to the engineer)

Ordered Causes Pattern

List causes in frequency order. The engineer should try the most common diagnosis first:

## Common Causes

This alert fires for three reasons, listed by frequency:

1. **Database connection pool exhaustion** (~55% of occurrences)
   Diagnose: [link to Step 2]

2. **Upstream service timeout** (~30% of occurrences)
   Diagnose: [link to Step 3]

3. **Memory leak after v2.4 deployment** (~15% of occurrences)
   Diagnose: [link to Step 4]

Verification After Remediation

Every remediation step must end with a verification command and expected output:

### Remediation: Restart the API service

**Run:**
```bash
kubectl rollout restart deployment/api-server -n production
kubectl rollout status deployment/api-server -n production --timeout=5m

Expected output:

Waiting for deployment "api-server" rollout to finish: 0 of 3 updated replicas are available...
Waiting for deployment "api-server" rollout to finish: 1 of 3 updated replicas are available...
deployment "api-server" successfully rolled out

Verify resolution: Check Error Rate Dashboard - error rate should drop below 0.1% within 2 minutes. If it does not, proceed to Option B.


## Escalation Gate Pattern

Escalation conditions must be deterministic -- not judgment calls:

```markdown
# Bad: requires judgment
Escalate if the problem seems serious.

# Good: deterministic condition
Escalate to the platform-lead immediately if ANY of the following:
- Error rate has not decreased after 10 minutes of remediation
- kubectl commands return errors (cluster may be degraded)
- You observe any data loss symptoms (mismatched record counts, corrupted responses)
- This is a P1 and you are not 100% confident in your next action

When calling, have ready:
- Current error rate and trend (link to dashboard)
- Steps already attempted and their results
- Relevant log lines (paste them, do not describe them)

SLA Awareness Pattern

On-call engineers need SLA budget context before making downtime decisions:

> **Before taking the service offline**: Check current SLA status.
> SLA target: 99.9% uptime (43.8 minutes/month budget)
> Current status: [status.company.com](https://status.company.com)
>
> If budget is < 15 minutes remaining: escalate to platform-lead before
> any action that causes downtime, even for maintenance.

Postmortem Template Pattern

Postmortems require blameless framing and actionable follow-up:

## Incident: [Date] - [Brief description]

**Duration**: HH:MM (Detection: HH:MM, Resolution: HH:MM)
**Severity**: P[N]
**Impact**: [Users affected, requests failed, revenue impact if known]

## Timeline

| Time (UTC) | Event |
|-----------|-------|
| 14:23 | Alert fired: api_error_rate > 1% |
| 14:25 | On-call paged. Acknowledged. |
| 14:31 | Root cause identified: database connection pool exhausted |
| 14:38 | Remediation applied: pool size increased |
| 14:41 | Error rate dropped to 0.05%. Alert resolved. |

## Root Cause

[Technical description of what caused the incident. Blame the system, not people.]

## Contributing Factors

- [Factor 1: e.g., "Monitoring threshold was too high -- alert should have fired 10 minutes earlier"]
- [Factor 2: e.g., "Runbook did not cover this failure mode"]

## Action Items

| Action | Owner | Priority | Due |
|--------|-------|----------|-----|
| Lower alert threshold to fire earlier | Platform | P2 | 2024-04-01 |
| Update runbook to include connection pool diagnosis | Platform | P2 | 2024-03-25 |
| Add connection pool metric to dashboard | Platform | P3 | 2024-04-15 |

Anti-Patterns

The Tribal Knowledge Runbook

# Bad: requires context not in the runbook
Check the usual places. If it's the database thing, do the usual fix.

# Good: no assumed knowledge
Check the database connection pool:
[full diagnostic command with expected output]

The Stale Runbook

A runbook with an old "last-tested" date is dangerous -- it implies the procedure is current when it may not be. Add to your CI or quarterly review:

# Find runbooks not tested in 90 days
find runbooks/ -name "*.md" | xargs grep -l "last-tested" | \
  xargs grep "last-tested" | awk -F: '{print $1, $NF}' | \
  awk '$2 < date -d "90 days ago"'

The Vague Verification

# Bad: no verification
Restart the service if needed.

# Good: specific verification with expected outcome
kubectl rollout restart deployment/api -n production
# Wait until: kubectl rollout status deployment/api -n production
# shows: "successfully rolled out"
# Then verify on dashboard: error rate < 0.1% within 2 minutes

References

Google SRE Book - On-Call
PagerDuty Incident Response
Atlassian Incident Management
Blameless Postmortems (Etsy)

Related Skills

hermeticormus/plugins/user-documentation/skills/user-doc-patterns

tools

VerifiedTrustedCommunity

# User Doc Patterns > Patterns for writing clear, accessible end-user documentation. ## Knowledge Base ### User Documentation vs Developer Documentation | User Docs | Developer Docs | |-----------|---------------| | Task-oriented ("How do I...") | Concept-oriented ("How does it work...") | | Plain language | Technical language | | Screenshots and visual aids | Code examples | | Step-by-step procedures | API references | | Feature names and UI labels | Function signatures and parameters | | A

SKILL.mdUpdated Apr 21, 2026

hermeticormus/plugins/user-documentation/skills/user-doc-patterns

hermeticormus/plugins/tutorial-creation/skills/tutorial-structures

tools

VerifiedTrustedCommunity

# Tutorial Structures > Pedagogical patterns and frameworks for creating effective technical tutorials. ## Knowledge Base ### The Tutorial Spectrum Tutorials exist on a spectrum between two extremes: | Recipe | Concept Guide | |--------|--------------| | "Do exactly this" | "Understand this idea" | | Step-by-step | Explanation-heavy | | Fast to complete | Deep understanding | | Low retention | High retention | The best tutorials blend both: steps for doing, explanations for understanding.

SKILL.mdUpdated Apr 21, 2026

hermeticormus/plugins/tutorial-creation/skills/tutorial-structures

hermeticormus/plugins/tutorial-creation/skills/tutorial-patterns

tools

VerifiedTrustedCommunity

# Tutorial Patterns ## Tutorial vs. How-to Guide: The Critical Distinction Before writing, identify which document is actually needed: | Tutorial | How-to Guide | |----------|-------------| | "Build a REST API in Node.js" | "Add JWT authentication to your Express API" | | For someone new to this | For someone who knows the domain | | Explains why each step is done | Steps are efficient, minimal explanation | | Has checkpoints, explores | Numbered steps, no detours | | Learner reaches a comple

SKILL.mdUpdated Apr 21, 2026

hermeticormus/plugins/tutorial-creation/skills/tutorial-patterns

hermeticormus/plugins/technical-blogging/skills/tech-blogging-patterns

tools

VerifiedTrustedCommunity

# Tech Blogging Patterns ## The Developer Reading Pattern Developers do not read technical posts linearly. They scan in this order: 1. Headline (is this relevant to me?) 2. Code blocks (is this real code I can use?) 3. Headers (what does this cover?) 4. First paragraph (what's the point?) 5. Key takeaways / conclusion (is it worth reading fully?) Design for scanning first, reading second. Put real code within the first 25% of the post. ## The Before/After Pattern The contrast between a pain

SKILL.mdUpdated Apr 21, 2026

hermeticormus/plugins/technical-blogging/skills/tech-blogging-patterns

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/hermeticormus/librecopy-claude-code.git

# Copy into Claude Code skills folder (global)
cp -r librecopy-claude-code/plugins/runbook-writing/skills/runbook-patterns ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

hermeticormus/librecopy-claude-code

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT