dtsong/failure-mode-analysis

Name: failure-mode-analysis
Author: dtsong

skills/council/skeptic/failure-mode-analysis/SKILL.md

npx skillsauth add dtsong/my-claude-setup failure-mode-analysis

Failure Mode Analysis

Purpose

Systematically identify failure scenarios for proposed features and design mitigations that maintain system resilience.

Scope Constraints

Analyzes system architecture, dependency graphs, and infrastructure configurations for failure scenarios. Does not modify infrastructure, execute chaos tests, or access production systems. Limited to design-time failure identification and mitigation planning.

Inputs

Feature or infrastructure change being analyzed
System architecture (services, databases, APIs, third-party dependencies)
Current reliability requirements (SLA/SLO targets)
Existing monitoring and alerting setup

Input Sanitization

No user-provided values are used in commands or file paths. All inputs are treated as read-only analysis targets.

Procedure

Step 1: List System Components

Enumerate all components involved: services, databases, APIs, third-party dependencies, caches, queues, CDNs, DNS, and any shared infrastructure.
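A component inventory can be kept as plain structured data so later steps can iterate over it. The sketch below is one possible shape, not part of the skill itself; the component names, the `Kind` values, and the `owner` and `external` fields are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    SERVICE = "service"
    DATABASE = "database"
    CACHE = "cache"
    THIRD_PARTY = "third-party"


@dataclass(frozen=True)
class Component:
    name: str
    kind: Kind
    owner: str              # team or vendor responsible (assumed field)
    external: bool = False  # True when the component is outside your control


# Hypothetical inventory for a checkout feature.
inventory = [
    Component("checkout-api", Kind.SERVICE, "payments-team"),
    Component("orders-db", Kind.DATABASE, "payments-team"),
    Component("session-cache", Kind.CACHE, "platform-team"),
    Component("card-processor", Kind.THIRD_PARTY, "vendor", external=True),
]

# External components are the ones the quality checks ask about explicitly.
external_names = [c.name for c in inventory if c.external]
```

Marking external components up front makes it easier to confirm later that each one received failure analysis.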

Step 2: Enumerate Failure Modes Per Component

For each component, systematically consider:

Network failures: Timeout, DNS resolution failure, TLS handshake failure, connection reset, packet loss
Data failures: Corruption, inconsistency between stores, migration errors, schema drift, encoding issues
Resource exhaustion: Memory leak, CPU saturation, connection pool exhaustion, disk full, file descriptor limits
Dependency failures: Third-party API down, rate limited, schema change, authentication revoked, region outage
State failures: Race conditions, stale cache serving wrong data, split-brain in distributed systems, phantom reads
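One way to make this step systematic is to expand the five categories above into a review worksheet with one row per component and failure mode. This is a minimal sketch under that assumption; the `worksheet` helper and the component names are hypothetical, and a real review would discard rows that do not apply.

```python
from itertools import product

# Categories and modes copied from the list above.
FAILURE_CATEGORIES = {
    "network": ["timeout", "DNS resolution failure", "TLS handshake failure",
                "connection reset", "packet loss"],
    "data": ["corruption", "inconsistency between stores", "migration errors",
             "schema drift", "encoding issues"],
    "resource exhaustion": ["memory leak", "CPU saturation",
                            "connection pool exhaustion", "disk full",
                            "file descriptor limits"],
    "dependency": ["third-party API down", "rate limited", "schema change",
                   "authentication revoked", "region outage"],
    "state": ["race conditions", "stale cache serving wrong data",
              "split-brain in distributed systems", "phantom reads"],
}


def worksheet(components):
    """Return (component, category, mode) rows for every combination."""
    rows = []
    for component, (category, modes) in product(components, FAILURE_CATEGORIES.items()):
        for mode in modes:
            rows.append((component, category, mode))
    return rows


rows = worksheet(["checkout-api", "orders-db"])
```

Each component yields 24 candidate rows, so the worksheet is a prompt for discussion rather than a finished analysis.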

Step 3: Assess Cascade Potential

Map the failure tree: if component X fails, what else breaks? Identify single points of failure and shared dependencies. Determine blast radius for each failure mode.
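Blast radius can be approximated by walking the dependency graph in reverse from the failed component. The sketch below assumes every edge is a hard dependency, which overstates the impact wherever a caller can degrade gracefully; the service names and the `DEPENDS_ON` map are hypothetical.

```python
from collections import deque

# Hypothetical hard dependencies: component -> components it needs to function.
DEPENDS_ON = {
    "web": ["checkout-api", "session-cache"],
    "checkout-api": ["orders-db", "card-processor"],
    "reporting": ["orders-db"],
    "orders-db": [],
    "session-cache": [],
    "card-processor": [],
}


def blast_radius(failed, depends_on):
    """Return every component that transitively depends on `failed`."""
    # Invert the graph so we can walk from a dependency to its dependents.
    dependents = {name: set() for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(name)

    affected, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


radius = {name: blast_radius(name, DEPENDS_ON) for name in DEPENDS_ON}
# The component with the widest radius is a candidate single point of failure.
widest = max(radius, key=lambda name: len(radius[name]))
```

In this example `orders-db` has the widest radius because both `checkout-api` and `reporting` depend on it, and `web` depends on `checkout-api`.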

Step 4: Design Mitigations

For each failure mode, define:

Graceful degradation: What functionality survives? What's the user experience during failure?
Retry strategy: Exponential backoff with jitter, circuit breaker thresholds, max retry limits
Fallback behavior: Cached data, default values, error states, read-only mode
Recovery procedure: Automatic vs manual recovery, estimated time to recovery, data reconciliation steps
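The retry and fallback items can be combined in a single wrapper. The following is a sketch of exponential backoff with full jitter and a retry limit, falling back to cached data once the limit is reached; the function name, default values, and the simulated dependency are assumptions, and a circuit breaker is deliberately left out to keep the example short.

```python
import random
import time


def call_with_retry(operation, fallback, max_attempts=4, base_delay=0.1,
                    max_delay=2.0, sleep=time.sleep, rng=random.random):
    """Run `operation`; on failure retry with full-jitter backoff, then use `fallback`."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                break  # retry budget spent; stop hammering the dependency
            # The ceiling doubles on each attempt and is capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random fraction of the ceiling.
            sleep(rng() * ceiling)
    return fallback()


# Simulated dependency that times out twice and then recovers.
attempts = {"count": 0}


def flaky_lookup():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "fresh data"


# sleep and rng are injected so the example runs instantly and deterministically.
recorded_delays = []
result = call_with_retry(flaky_lookup, lambda: "cached data",
                         sleep=recorded_delays.append, rng=lambda: 1.0)
```

Injecting `sleep` and `rng` is a design choice for testability; in production code the defaults apply and the jitter spreads retries from many clients so they do not arrive in lockstep.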

Step 5: Define Monitoring Signals

For each failure mode, specify the metric, log pattern, or alert that detects it. Include detection latency — how quickly will you know?
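Detection latency can be estimated by adding up the delays between a failure starting and someone being notified. The model below is a simplification with assumed stages (sampling interval, evaluation window, notification delay); the metric expression and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    failure_mode: str
    metric: str
    sample_interval_s: int    # how often the metric is collected
    evaluation_window_s: int  # how long the condition must hold before alerting
    notify_delay_s: int       # routing and paging delay

    def worst_case_detection_s(self) -> int:
        """Upper bound on seconds between failure onset and notification."""
        return self.sample_interval_s + self.evaluation_window_s + self.notify_delay_s


pool_alert = Signal(
    failure_mode="connection pool exhaustion",
    metric="db_pool_in_use / db_pool_size > 0.9",  # hypothetical metric names
    sample_interval_s=15,
    evaluation_window_s=120,
    notify_delay_s=30,
)
latency = pool_alert.worst_case_detection_s()
```

Here the worst case is 165 seconds, which can be compared directly against the recovery time the SLO allows.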

Step 6: Plan Rollback Strategy

Define rollback approach: feature flags for instant disable, database rollback scripts, deployment rollback procedure, data cleanup if needed.
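The feature-flag part of a rollback plan can be illustrated with a small kill switch. This is a sketch only: `FlagStore` is an in-memory stand-in for whatever flag service is actually in use, and the `new_pricing` flag and pricing logic are hypothetical.

```python
class FlagStore:
    """In-memory stand-in for a real feature flag service."""

    def __init__(self, initial=None):
        self._flags = dict(initial or {})

    def is_enabled(self, name):
        # Unknown flags default to off so a missing flag never enables new code.
        return self._flags.get(name, False)

    def disable(self, name):
        self._flags[name] = False


def quote_cents(cart_total_cents, flags):
    if flags.is_enabled("new_pricing"):
        return cart_total_cents * 95 // 100  # new code path under rollout
    return cart_total_cents  # previous behaviour, kept as the rollback target


flags = FlagStore({"new_pricing": True})
before_rollback = quote_cents(10_000, flags)
flags.disable("new_pricing")  # the "instant disable" step of the rollback
after_rollback = quote_cents(10_000, flags)
```

Keeping the old code path intact behind the flag is what makes the disable step instant; database and data cleanup steps still need their own scripts.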

Progress Checklist

[ ] Step 1: System components listed
[ ] Step 2: Failure modes enumerated per component
[ ] Step 3: Cascade potential assessed
[ ] Step 4: Mitigations designed
[ ] Step 5: Monitoring signals defined
[ ] Step 6: Rollback strategy planned

Compaction resilience: If context was lost during a long session, re-read the Inputs section to reconstruct what system is being analyzed, check the Progress Checklist for completed steps, then resume from the earliest incomplete step.
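Resuming from the earliest incomplete step amounts to scanning the checklist for the first unchecked box. The sketch below assumes completed steps are marked `[x]`, as in markdown task lists; the skill itself does not specify the marker, and the `earliest_incomplete` helper is hypothetical.

```python
import re

# Matches "[ ] label" (open) and "[x] label" (done, assumed convention).
CHECKBOX = re.compile(r"^\[(?P<mark>[ xX])\]\s+(?P<label>.+)$")


def earliest_incomplete(checklist_text):
    """Return the label of the first unchecked item, or None if all are done."""
    for line in checklist_text.splitlines():
        match = CHECKBOX.match(line.strip())
        if match and match.group("mark") == " ":
            return match.group("label")
    return None


progress = """\
[x] Step 1: System components listed
[x] Step 2: Failure modes enumerated per component
[ ] Step 3: Cascade potential assessed
[ ] Step 4: Mitigations designed
"""
resume_from = earliest_incomplete(progress)
```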

Output Format

Failure Mode Table

| Component | Failure Mode | Severity | Cascade Risk | Mitigation | Monitoring Signal |
|---|---|---|---|---|---|
| [Component] | [What fails] | Critical/High/Medium/Low | [What else breaks] | [Specific mitigation] | [Metric/alert] |
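If the findings are collected as structured rows, the table can be generated rather than written by hand. The sketch below assumes rows are dictionaries keyed by the column names above; sorting by severity is an added convenience, not something the skill requires, and the example rows are hypothetical.

```python
COLUMNS = ["Component", "Failure Mode", "Severity", "Cascade Risk",
           "Mitigation", "Monitoring Signal"]
SEVERITIES = ("Critical", "High", "Medium", "Low")


def render_table(rows):
    """Render rows as a markdown table, most severe first."""
    lines = ["| " + " | ".join(COLUMNS) + " |", "|" + "---|" * len(COLUMNS)]
    # An unknown severity raises ValueError, which doubles as validation.
    ordered = sorted(rows, key=lambda row: SEVERITIES.index(row["Severity"]))
    for row in ordered:
        lines.append("| " + " | ".join(row[col] for col in COLUMNS) + " |")
    return "\n".join(lines)


example_rows = [
    {"Component": "session-cache", "Failure Mode": "stale data served",
     "Severity": "Medium", "Cascade Risk": "web shows old cart",
     "Mitigation": "short TTL", "Monitoring Signal": "cache age metric"},
    {"Component": "orders-db", "Failure Mode": "connection pool exhaustion",
     "Severity": "Critical", "Cascade Risk": "checkout-api, reporting",
     "Mitigation": "pool limits and timeouts", "Monitoring Signal": "pool usage alert"},
]
table = render_table(example_rows)
```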

Cascade Diagram

[Component A fails]
  ├── [Component B] — degraded (uses cached data)
  ├── [Component C] — down (hard dependency)
  │   └── [Component D] — down (depends on C)
  └── [Component E] — unaffected (independent)
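A diagram in this shape can be produced from a parent-to-children map with a short recursive function. This is a sketch with hypothetical inputs (`children`, `status`); it separates the component from its status with a colon instead of the dash used in the template above.

```python
def render_cascade(root, children, status, prefix=""):
    """Return tree lines for everything affected beneath `root`."""
    lines = []
    kids = children.get(root, [])
    for index, child in enumerate(kids):
        last = index == len(kids) - 1
        branch = "└── " if last else "├── "
        lines.append(f"{prefix}{branch}[{child}]: {status[child]}")
        # Continue the vertical bar only while siblings remain below.
        extension = "    " if last else "│   "
        lines.extend(render_cascade(child, children, status, prefix + extension))
    return lines


children = {"A": ["B", "C", "E"], "C": ["D"]}
status = {
    "B": "degraded (uses cached data)",
    "C": "down (hard dependency)",
    "D": "down (depends on C)",
    "E": "unaffected (independent)",
}
diagram = ["[A fails]"] + render_cascade("A", children, status, prefix="  ")
```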

Rollback Checklist

[ ] Feature flag to disable: [flag name]
[ ] Database rollback: [migration name / script]
[ ] Deployment rollback: [procedure]
[ ] Data cleanup: [steps if needed]
[ ] Communication: [who to notify]

Handoff

Hand off to threat-model if security vulnerabilities are discovered during failure analysis.
Hand off to operator/observability-design if monitoring gaps require comprehensive observability planning.

Quality Checks

[ ] Every external dependency has failure analysis
[ ] Cascade paths are fully mapped with blast radius
[ ] Mitigations don't introduce new failure modes
[ ] Monitoring covers detection of each identified failure
[ ] Rollback strategy is defined and tested
[ ] Recovery time estimates are documented
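Two of these checks, dependency coverage and monitoring coverage, lend themselves to a mechanical pass over the analysis output. The sketch below assumes failure rows are dictionaries with `component`, `failure_mode`, and `monitoring_signal` keys; the helper name and sample data are hypothetical, and the remaining checks still need human judgment.

```python
def quality_gaps(external_dependencies, failure_rows):
    """Report external dependencies with no analysis and rows with no signal."""
    analysed = {row["component"] for row in failure_rows}
    missing_analysis = sorted(set(external_dependencies) - analysed)
    missing_signal = sorted(
        row["failure_mode"] for row in failure_rows
        if not row.get("monitoring_signal")
    )
    return {
        "no_failure_analysis": missing_analysis,
        "no_monitoring_signal": missing_signal,
    }


gaps = quality_gaps(
    external_dependencies=["card-processor", "email-provider"],
    failure_rows=[
        {"component": "card-processor", "failure_mode": "rate limited",
         "monitoring_signal": "HTTP 429 rate alert"},
        {"component": "orders-db", "failure_mode": "disk full",
         "monitoring_signal": ""},
    ],
)
```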

Evolution Notes

dtsong/failure-mode-analysis

Use when systematically identifying failure scenarios for proposed features and infrastructure changes. Covers component enumeration, failure mode discovery, cascade analysis, mitigation design, monitoring signals, and rollback planning. Do not use for security threat modeling (use threat-model) or input boundary testing (use edge-case-enumeration).

5 stars

testing

Updated Jul 15, 2026

npx skillsauth add dtsong/my-claude-setup failure-mode-analysis

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | What it checks | Status | Percentage |
|---|---|---|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Jul 15, 2026, 4:15 AM. Scan time: 152.6s. 1 file scanned.

SKILL.md

name: failure-mode-analysis
department: skeptic
description: Use when systematically identifying failure scenarios for proposed features and infrastructure changes. Covers component enumeration, failure mode discovery, cascade analysis, mitigation design, monitoring signals, and rollback planning. Do not use for security threat modeling (use threat-model) or input boundary testing (use edge-case-enumeration).
version: 1

Install manually

# Clone the repo
git clone https://github.com/dtsong/my-claude-setup.git

# Make sure the global Claude Code skills folder exists
mkdir -p ~/.claude/skills

# Copy the skill into the Claude Code skills folder (global)
cp -r my-claude-setup/skills/council/skeptic/failure-mode-analysis ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository: dtsong/my-claude-setup (5 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT