engineering/chaos-engineering/skills/chaos-engineering/SKILL.md
Use when planning, running, or learning from chaos engineering experiments. Triggers on "chaos experiment", "fault injection", "gameday", "resilience test", "blast radius", "steady state", "abort criteria", "Chaos Toolkit", "Chaos Mesh", "Litmus", "Gremlin", "AWS FIS", or any deliberate failure-injection question. Ships experiment designer, blast-radius calculator, and postmortem generator (all stdlib Python), 4 references on chaos principles + experiment design + attack taxonomy + tooling landscape, and a /chaos-experiment slash command. Composes with feature-flags-architect (kill switches as abort triggers) and kubernetes-operator (common chaos targets).
npx skillsauth add alirezarezvani/claude-skills chaos-engineeringInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Design experiments that surface real weaknesses in production systems — without becoming outages. Most "chaos engineering" attempts skip steady-state measurement, define no abort criteria, and have no blast-radius bound. This skill enforces the discipline that makes chaos experiments safe and useful.
incident-response)red-team, threat-detection)The 4 Principles of Chaos Engineering (Netflix, 2016):
Add a fifth: Define abort criteria up front. A chaos experiment with no abort criteria is an outage by another name.
SKILL=engineering/chaos-engineering/skills/chaos-engineering
# 1. Design an experiment
python "$SKILL/scripts/experiment_designer.py" --target "checkout-svc" --hypothesis "p99 latency stays <500ms" --attack latency --duration-min 15
# 2. Calculate blast radius
python "$SKILL/scripts/blast_radius_calculator.py" --traffic-share 0.05 --user-pop 1000000 --duration-min 15
# 3. Generate postmortem after the experiment
python "$SKILL/scripts/experiment_postmortem.py" --plan experiment.json --result-log results.txt
All stdlib-only. Run with --help.
experiment_designer.pyGenerates a structured experiment plan from inputs. Enforces the required sections (hypothesis, steady-state metric, blast radius, abort criteria, rollback).
python scripts/experiment_designer.py \
--target "checkout-svc" \
--hypothesis "p99 latency stays <500ms when payment-svc is slow" \
--attack latency \
--magnitude "+200ms" \
--duration-min 15 \
--blast-radius "5% of US traffic" \
--abort-if "p99 > 1000ms OR error_rate > baseline + 1pp"
Outputs a markdown plan with: hypothesis, steady-state, attack, magnitude, duration, blast radius, abort criteria, rollback procedure, monitoring dashboards, and learning question.
blast_radius_calculator.pyComputes the blast radius of a planned experiment. Given traffic share + user population + duration, calculates expected affected users, expected error budget burn, and a risk score.
python scripts/blast_radius_calculator.py \
--traffic-share 0.05 \
--user-pop 1000000 \
--duration-min 15 \
--baseline-availability 0.999 \
--expected-impact-availability 0.95
Outputs:
GREEN = <1% error budget; YELLOW = 1-10%; RED = >10%.
experiment_postmortem.pyProduces a structured postmortem from an experiment plan + results. Catches the common postmortem failure modes: no learning recorded, no follow-up actions, blame-laden language.
python scripts/experiment_postmortem.py --plan experiment.json --result-log results.txt
Outputs markdown with: summary, hypothesis (was it confirmed/refuted?), what we learned, what surprised us, follow-up actions with owners, and link to next experiment.
Different attacks reveal different weaknesses. See references/attack_taxonomy.md for full detail.
| Attack | What it tests | Tooling |
|---|---|---|
| Latency | Timeouts, retries, circuit breakers | tc, Chaos Mesh NetworkChaos |
| Error | Error handling, fallback paths | Chaos Mesh HTTPChaos, Toxiproxy |
| Resource (CPU, memory, disk) | Saturation handling, autoscaling | Chaos Mesh StressChaos, stress-ng |
| Network partition | Split-brain, consensus, failover | Chaos Mesh NetworkChaos partition |
| Dependency failure | Graceful degradation, fallback | Service mesh fault injection |
| Time | Clock skew, NTP issues | libfaketime, Chaos Mesh TimeChaos |
| Infrastructure (kill instance) | Auto-recovery, failover | AWS FIS, Chaos Monkey |
Pick the attack that matches the hypothesis. "What happens if X is slow?" → latency. "What happens if X loses network?" → partition.
| Tool | Best for | Pricing | Stack | |---|---|---|---| | Chaos Toolkit | Lightweight, language-agnostic, JSON experiments | OSS | Any | | Chaos Mesh | Kubernetes-native, rich CRDs, in-cluster | OSS | Kubernetes | | Litmus | Kubernetes, Argo-integrated, large library | OSS + Enterprise | Kubernetes | | Gremlin | Enterprise SaaS, multi-cloud, audit | Paid | Any | | AWS FIS | AWS-native, IAM-integrated, EC2/ECS/EKS | Paid (AWS) | AWS | | Custom | Niche needs, single-cloud, low budget | None | Any |
Decision rules:
See references/tooling_landscape.md for trade-offs.
1. State a hypothesis: "When [fault], steady-state metric X stays within Y."
2. Identify the steady-state metric — must be measurable BEFORE the experiment.
3. Run blast_radius_calculator.py — confirm GREEN before proceeding.
4. Run experiment_designer.py to produce the plan.
5. Get a peer review of the plan; confirm abort criteria are concrete.
6. Notify the on-call team in #incidents (or whatever channel).
7. Run the experiment with monitoring open.
8. If abort criteria are hit, abort immediately; record what happened.
9. Run experiment_postmortem.py to capture learnings.
10. File follow-up actions; link to next experiment.
1. Pick a scenario (e.g., "primary database fails over").
2. Identify all dependent services that should keep working.
3. Build a multi-experiment plan covering each layer.
4. Schedule with stakeholders; on-call coverage required.
5. Run with a facilitator who manages the scenario.
6. Capture observations in a shared doc as they happen.
7. Single combined postmortem covering all observations.
8. Track follow-up actions in a board with owners.
1. Start: weekly Game Day in staging.
2. Move to: weekly Game Day in production with limited blast radius.
3. Mature to: continuous chaos via scheduled experiments (Litmus chaos schedule, Gremlin scenarios).
4. Wire to deployment: every prod deploy triggers a baseline chaos sweep.
5. Track: experiments per week, weaknesses discovered, MTTR trend.
This skill explicitly composes with two others in this library:
| Skill | Composition |
|---|---|
| feature-flags-architect | Kill switches defined there are the abort triggers here |
| kubernetes-operator | Operators are common chaos targets (test reconcile under fault) |
| incident-response | Chaos experiments that escalate become incidents |
references/chaos_principles.md — the 4 principles, history, when to startreferences/experiment_design.md — hypothesis structure, steady-state metrics, abort criteriareferences/attack_taxonomy.md — 7 attack types with examples and toolingreferences/tooling_landscape.md — Chaos Toolkit / Mesh / Litmus / Gremlin / FIS / DIY/chaos-experiment — interactive experiment design wizard that runs all 3 tools.
assets/experiment_template.md — fill-in plan templateassets/postmortem_template.md — structured postmortem templateA team using this skill should achieve:
tools
Code review automation for TypeScript, JavaScript, Python, Go, Swift, Kotlin, C#, .NET, Java, C, C++, Rust, Ruby, PHP, and Dart/Flutter. Analyzes PRs for complexity and risk, checks code quality for SOLID violations and code smells, generates review reports. Use when reviewing pull requests, analyzing code quality, identifying issues, generating review checklists.
tools
Use when planning, funding, scoping, or synthesizing enterprise research across workstreams — clinical study design, R&D program finance, market sizing/surveys, or product/user research. Triggers on "design this clinical study", "what sample size", "R&D budget", "burn rate", "capitalize or expense", "TAM SAM SOM", "market sizing", "survey design", "segment the market", "plan user interviews", "usability test", "synthesize research insights". Forks context to route to one of four Research-Operations sub-skills (clinical-research, research-finance, market-research, product-research) and returns a digest. Distinct from ra-qm-team (regulatory submission), finance (corporate close/valuation), research/grants (funding discovery), product-team (persona/journey/live experiments), and marketing-skill (campaign analytics).
development
Use when managing the money for an internal R&D program or portfolio — building a multi-period program budget with the F&A (indirect) split, tracking burn rate and runway against value-inflection milestones, or routing R&D cost items to a capitalize-vs-expense determination. Every budget output surfaces its assumptions block; capitalize-vs-expense is decision-support only and routes to a named finance owner — it never books an entry or decides accounting treatment. Distinct from finance/financial-analysis (corporate DCF, close, valuation) and research/grants (funding discovery — this manages money already won).
development
Use when planning and synthesizing product/user research as a method-and-repository discipline — selecting the right method for the goal (generative interviews vs usability test vs concept test vs validation), computing method-based saturation/sample size with an explicit confidence level, or synthesizing coded observations into insights while flagging single-source anecdotes. Never fabricates user insight; an insight requires recurrence across independent participants. Distinct from product-team/ux-researcher-designer (persona/journey artifacts), product-discovery (discovery-sprint planning), and experiment-designer (live A/B) — this is the research-ops method + insight-repository layer.