public/SKILLS/Security & Web Testing/chaos-engineer/SKILL.md
Designs chaos experiments, creates failure injection frameworks, and facilitates game day exercises for distributed systems — producing runbooks, experiment manifests, rollback procedures, and post-mortem templates. Use when designing chaos experiments, implementing failure injection frameworks, or conducting game day exercises. Invoke for chaos experiments, resilience testing, blast radius control, game days, antifragile systems, fault injection, Chaos Monkey, Litmus Chaos.
Install this skill globally with one command: `npx skillsauth add eric861129/skills_all-in-one chaos-engineer`. Works with Claude Code, Cursor, and Windsurf.
Load detailed guidance based on context:
| Topic | Reference | Load When |
|-------|-----------|-----------|
| Experiments | references/experiment-design.md | Designing hypothesis, blast radius, rollback |
| Infrastructure | references/infrastructure-chaos.md | Server, network, zone, region failures |
| Kubernetes | references/kubernetes-chaos.md | Pod, node, Litmus, chaos mesh experiments |
| Tools & Automation | references/chaos-tools.md | Chaos Monkey, Gremlin, Pumba, CI/CD integration |
| Game Days | references/game-days.md | Planning, executing, learning from game days |
Non-obvious constraints that must be enforced on every experiment:
When implementing chaos engineering, provide:
The following shows a complete experiment — from hypothesis to rollback — using Litmus Chaos on Kubernetes.
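```shell
# Baseline target: p99 latency < 200ms, error rate < 0.1% (read both from your metrics dashboard)
# Confirm the deployment is fully available and pod resource usage looks normal
kubectl get deploy my-service -n production
kubectl top pods -n production -l app=my-service
```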
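The baseline thresholds can also be checked in a script before any failure is injected. The following is a minimal sketch, not part of the skill itself: `steady_state_ok` is a hypothetical helper, and the two sample values stand in for numbers you would pull from your own metrics backend.

```shell
# Hypothetical helper: succeeds only when both metrics are inside the baseline.
# Thresholds mirror the baseline target (p99 < 200 ms, error rate < 0.1 %).
steady_state_ok() {
  local p99_ms="$1" error_pct="$2"
  awk -v p="$p99_ms" -v e="$error_pct" 'BEGIN { exit !(p < 200 && e < 0.1) }'
}

# Sample values (assumed); replace with readings from your metrics backend.
if steady_state_ok 142 0.03; then
  echo "baseline OK: safe to start the experiment"
else
  echo "baseline violated: do not inject failure"
fi
```

Gating the experiment on this check keeps a degraded system from being mistaken for a chaos-induced failure later.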
```yaml
# chaos-pod-delete.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: my-service-pod-delete
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: "app=my-service"
    appkind: deployment
  # Limit blast radius: only 1 replica at a time
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"    # seconds
            - name: CHAOS_INTERVAL
              value: "20"    # delete one pod every 20s
            - name: FORCE
              value: "false"
            - name: PODS_AFFECTED_PERC
              value: "33"    # max 33% of replicas affected
```
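To sanity-check the blast radius before applying the manifest, it can help to work out how many pods a given `PODS_AFFECTED_PERC` would target. The sketch below assumes the percentage is applied with integer (round-down) division and a minimum of one pod; that rounding rule is an assumption, so verify it against the Litmus version you run. `pods_affected` is a hypothetical helper name.

```shell
# Hypothetical helper: estimate how many pods a percentage would target.
# Assumption: integer division (rounds down), never fewer than one pod.
pods_affected() {
  local replicas="$1" perc="$2"
  local n=$(( replicas * perc / 100 ))
  if [ "$n" -lt 1 ]; then
    n=1
  fi
  echo "$n"
}

pods_affected 3 33    # prints 1
pods_affected 10 33   # prints 3
```

Under these assumptions, a three-replica deployment loses one pod per interval, which matches the "one replica at a time" intent.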
```shell
# Apply the experiment
kubectl apply -f chaos-pod-delete.yaml

# Watch experiment status
kubectl describe chaosengine my-service-pod-delete -n production
kubectl get chaosresult my-service-pod-delete-pod-delete -n production -w

# Tail application logs for errors
kubectl logs -l app=my-service -n production --since=2m -f

# Check ChaosResult verdict when complete
kubectl get chaosresult my-service-pod-delete-pod-delete \
  -n production -o jsonpath='{.status.experimentStatus.verdict}'
```
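Once the verdict is available, a small dispatcher can turn it into a next step for the runbook. This is a sketch under assumptions: `verdict_action` and its action labels are hypothetical, and the verdict strings are assumed to be `Pass`, `Fail`, `Awaited`, or `Stopped`; anything else falls through to manual review.

```shell
# Hypothetical helper: map a ChaosResult verdict to a next step.
# Verdict strings are assumed to be Pass, Fail, Awaited, or Stopped.
verdict_action() {
  case "$1" in
    Pass)       echo "record-result" ;;
    Fail)       echo "abort-and-investigate" ;;
    Stopped)    echo "confirm-rollback" ;;
    Awaited|"") echo "keep-waiting" ;;
    *)          echo "manual-review" ;;
  esac
}

# With a cluster, feed it the output of the jsonpath verdict query:
#   verdict_action "$(kubectl get chaosresult my-service-pod-delete-pod-delete \
#     -n production -o jsonpath='{.status.experimentStatus.verdict}')"
verdict_action Fail   # prints "abort-and-investigate"
```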
```shell
# Immediately stop the experiment
kubectl patch chaosengine my-service-pod-delete \
  -n production --type merge -p '{"spec":{"engineState":"stop"}}'

# Confirm all pods are healthy
kubectl rollout status deployment/my-service -n production
```
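The stop command can be wired to an abort condition so that rollback does not depend on someone watching a dashboard. The sketch below is illustrative only: `abort_if_unhealthy` is a hypothetical helper, and the 1.0% abort threshold in the usage comment is an assumed value, not one taken from this skill.

```shell
# Hypothetical guard: stop the ChaosEngine when the observed error rate
# (percent) exceeds an abort threshold (percent).
abort_if_unhealthy() {
  local error_pct="$1" threshold_pct="$2"
  if awk -v e="$error_pct" -v t="$threshold_pct" 'BEGIN { exit !(e > t) }'; then
    if kubectl patch chaosengine my-service-pod-delete \
         -n production --type merge -p '{"spec":{"engineState":"stop"}}'; then
      echo "aborted"
    else
      echo "abort-failed" >&2
      return 1
    fi
  else
    echo "continuing"
  fi
}

# Usage (the error rate would come from your metrics backend; 1.0 is assumed):
#   abort_if_unhealthy 2.4 1.0
```

Calling this on a short interval during the experiment gives a simple automated kill switch; follow an abort with the rollout status check to confirm recovery.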
```shell
# Install toxiproxy CLI
brew install toxiproxy   # macOS; use the binary release on Linux

# Start toxiproxy server (runs alongside your service)
toxiproxy-server &

# Create a proxy for your downstream dependency
toxiproxy-cli create -l 0.0.0.0:22222 -u downstream-db:5432 db-proxy

# Inject 300ms latency with ±30ms (10%) jitter; blast radius: this proxy only
toxiproxy-cli toxic add db-proxy -t latency -a latency=300 -a jitter=30

# Run your load test / observe metrics here ...

# Remove the toxic to restore normal behaviour
# (an unnamed toxic gets the default name <type>_<stream>, here latency_downstream)
toxiproxy-cli toxic remove db-proxy -n latency_downstream
```
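If the load test crashes between adding and removing the toxic, the latency stays in place. One way to guard against that is an exit trap, sketched below. `run_with_latency` is a hypothetical wrapper, and it assumes the `db-proxy` proxy already exists and that the command you pass in is your own load test.

```shell
# Hypothetical wrapper: add the latency toxic, run a command, and always
# remove the toxic afterwards, even when the command fails.
# The function body is a subshell, so the EXIT trap fires when it finishes.
run_with_latency() (
  trap 'toxiproxy-cli toxic remove db-proxy -n latency_downstream' EXIT
  toxiproxy-cli toxic add db-proxy -t latency -a latency=300 -a jitter=30 || exit 1
  "$@"
)

# Usage (./load-test.sh is a placeholder for your own load test):
#   run_with_latency ./load-test.sh
```

The wrapper returns the exit status of the wrapped command, so a failing load test still fails the surrounding pipeline after cleanup has run.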
```yaml
# chaos-monkey-config.yml — restrict to a single ASG
deployment:
  enabled: true
  regionIndependence: false
chaos:
  enabled: true
  meanTimeBetweenKillsInWorkDays: 2
  minTimeBetweenKillsInWorkDays: 1
  grouping: APP            # kill one instance per app, not per cluster
  exceptions:
    - account: production
      region: us-east-1
      detail: "*-canary"   # never kill canary instances
```
```shell
# Apply and trigger a manual kill for testing
chaos-monkey --app my-service --account staging --dry-run false
```