edercnj/resilience

Name: resilience
Author: edercnj

java/src/main/resources/targets/claude/skills/knowledge-packs/resilience/SKILL.md

npx skillsauth add edercnj/ia-dev-environment resilience

Knowledge Pack: Resilience

Purpose

Provides comprehensive resilience patterns for {{LANGUAGE}} {{FRAMEWORK}}, enabling services to gracefully handle failures, prevent cascading outages, and recover quickly. Includes circuit breaker state machines, rate limiting strategies, bulkhead partitioning, timeout coordination, intelligent retry logic, and degradation strategies.

Quick Reference (always in context)

See references/resilience-principles.md for the essential resilience summary (6 core patterns: rate limiting, circuit breaker, bulkhead, timeout, retry, fallback).

Detailed References

Read these files for detailed pattern implementations:

| Reference | Content |
|-----------|---------|
| patterns/resilience/circuit-breaker.md | State machine (CLOSED/OPEN/HALF_OPEN), configuration (failure threshold, wait duration, success threshold), monitoring metrics, fallback strategies, per-dependency circuits |
| patterns/resilience/rate-limiting.md | Token bucket, fixed window, sliding window algorithms; per-client, per-endpoint, global scopes; token bucket properties (capacity, refill rate, burst); response to limit (429 with Retry-After) |
| patterns/resilience/bulkhead.md | Thread pool isolation, semaphore isolation, partitioning strategies (by downstream service, operation type, tenant, protocol); sizing guidelines; rejection handling; metrics and monitoring |
| patterns/resilience/timeout-patterns.md | Timeout types (connection, read, write, overall); per-operation configurations; deadline propagation across services; timeout hierarchy (inner < outer); cancellation on timeout |
| patterns/resilience/retry-with-backoff.md | Exponential backoff, linear backoff; mandatory jitter (full, equal, decorrelated); retryable vs non-retryable error classification; retry budgets; interaction with deadlines |
| patterns/resilience/fallback-degradation.md | Graceful degradation levels (NORMAL, WARNING, CRITICAL, EMERGENCY); fallback strategies (cached data, default, error); fail-secure principle; degradation triggers and transitions |
| patterns/resilience/backpressure.md | Flow control mechanisms, pause/resume protocols, connection-level backpressure, message queue depth limits, timeout-based resumption |
| patterns/resilience/resilience-metrics.md | Metric types per pattern, naming conventions, alert thresholds, dashboards, SLA tracking |
| references/chaos-engineering-experiments.md | Catalog of chaos experiments by type (network, latency, resource, dependency) with setup instructions |
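
The circuit-breaker row names three states and three configuration values. The referenced file is not reproduced here, so the following is only a minimal sketch of how such a state machine might look. The class names `CircuitBreaker` and `CircuitOpenError`, the injectable `clock`, and the choice to count consecutive failures (rather than a failure rate over a window) are assumptions made for illustration.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open (hypothetical name)."""


class CircuitBreaker:
    # Assumption: failures are counted consecutively; a success resets the count.
    def __init__(self, failure_threshold=3, wait_duration=5.0,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.wait_duration = wait_duration
        self.success_threshold = success_threshold
        self._clock = clock  # injectable so tests need not sleep
        self.state = State.CLOSED
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.wait_duration:
                raise CircuitOpenError("circuit is open")
            # Wait duration elapsed: let trial calls through.
            self.state = State.HALF_OPEN
            self._successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state is State.HALF_OPEN:
            self._trip()  # any failure during a trial reopens the circuit
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state = State.CLOSED
                self._failures = 0
        else:
            self._failures = 0

    def _trip(self):
        self.state = State.OPEN
        self._opened_at = self._clock()
        self._failures = 0
```

Keeping one breaker instance per downstream dependency matches the "per-dependency circuits" item in the same row, since one failing dependency then cannot open the circuit for another.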

Chaos Engineering

Proactive resilience validation through controlled fault injection for {{LANGUAGE}} {{FRAMEWORK}}.

Principles

Steady-State Hypothesis: Define measurable baseline behavior (latency p99, error rate, throughput) before experiments
Vary Real-World Events: Simulate failures that actually occur in production (network partitions, disk full, OOM)
Run in Production: Validate resilience where it matters most; staging cannot replicate production complexity
Automate Experiments: Repeatable, scheduled experiments integrated into CI/CD pipelines
Minimize Blast Radius: Start with smallest scope (single instance), expand gradually with automated kill switches
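
The steady-state hypothesis depends on three baseline numbers: latency p99, error rate, and throughput. As a rough sketch, they could be derived from a window of request samples as shown below. The `SteadyState` type, the `(latency_ms, ok)` sample shape, and the nearest-rank percentile method are assumptions for illustration; a real setup would more likely read these values from its metrics backend.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    latency_p99_ms: float
    error_rate: float
    throughput_rps: float


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence (pct in 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_steady_state(samples, window_seconds):
    """Summarize (latency_ms, ok) samples collected over window_seconds."""
    if not samples or window_seconds <= 0:
        raise ValueError("need at least one sample and a positive window")
    latencies = [latency for latency, _ in samples]
    errors = sum(1 for _, ok in samples if not ok)
    return SteadyState(
        latency_p99_ms=percentile(latencies, 99),
        error_rate=errors / len(samples),
        throughput_rps=len(samples) / window_seconds,
    )
```

Capturing this snapshot before an experiment gives the later abort thresholds something concrete to compare against.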

Experiment Types

Network Failure

Packet loss injection (5%, 25%, 50%, 100%)
Latency injection (50ms, 200ms, 1s, 5s added delay)
DNS resolution failure (NXDOMAIN, timeout)
Network partition simulation (split-brain between services)

Latency Injection

Response delay on specific endpoints
Slow connection establishment (TCP handshake delay)
Timeout simulation (response arrives after client timeout)
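
The timeout-simulation case (the response arrives after the client has given up) can be reproduced in-process without any chaos tooling. The sketch below uses only the Python standard library. The helpers `with_injected_delay` and `call_with_timeout` are hypothetical names, and the thread-based timeout is one possible approach rather than the method this knowledge pack prescribes.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def with_injected_delay(fn, delay_s):
    """Wrap fn so every call is delayed by delay_s seconds (fault injection)."""
    def delayed(*args, **kwargs):
        time.sleep(delay_s)
        return fn(*args, **kwargs)
    return delayed


def call_with_timeout(fn, timeout_s):
    """Run fn in a worker thread; raise FutureTimeout if it takes too long.

    Note: the worker is not cancelled on timeout. It keeps running until fn
    returns, which is exactly the "late response" situation being simulated.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn).result(timeout=timeout_s)
    finally:
        pool.shutdown(wait=False)
```

A call such as `call_with_timeout(with_injected_delay(handler, 0.2), timeout_s=0.05)` should raise `FutureTimeout`, which lets a test verify that the caller's fallback path runs instead of hanging.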

Resource Exhaustion

CPU stress (saturate cores to validate degradation behavior)
Memory pressure (allocate until near-OOM to test GC behavior)
Disk full simulation (verify graceful handling of write failures)
Thread pool exhaustion (saturate worker threads)

Dependency Failure

Downstream service unavailability (connection refused)
Degraded responses (slow responses, partial data)
Malformed responses (invalid JSON, unexpected status codes)
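
The dependency-failure cases above can be exercised against a client in unit tests before running them against real infrastructure. The following is a small sketch under stated assumptions: `FaultInjector`, its `mode` values, and the `get_quote` client are hypothetical names, and the downstream call is modelled as a plain callable returning a JSON string.

```python
import json


class FaultInjector:
    """Wraps a dependency callable and injects one failure mode (illustrative)."""

    def __init__(self, target, mode=None):
        self.target = target
        self.mode = mode  # None (healthy), "refused", or "malformed"

    def __call__(self, *args, **kwargs):
        if self.mode == "refused":
            # Downstream service unavailability
            raise ConnectionRefusedError("injected: connection refused")
        if self.mode == "malformed":
            # Malformed response: truncated, invalid JSON
            return "{not valid json"
        return self.target(*args, **kwargs)


def get_quote(dependency, fallback):
    """Client under test: returns parsed data, or the fallback on failure."""
    try:
        return json.loads(dependency())
    except (ConnectionError, json.JSONDecodeError):
        return fallback
```

With a healthy dependency, `get_quote` returns the parsed payload; with `mode="refused"` or `mode="malformed"` it returns the fallback, which is the behavior a chaos experiment of this type sets out to confirm.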

Tools

| Tool | Scope | Use Case |
|------|-------|----------|
| Chaos Monkey (Netflix) | Instance | Random instance termination in production |
| Litmus | Kubernetes | K8s-native chaos experiments with CRDs |
| Gremlin | SaaS | Enterprise chaos platform with safety controls |
| Toxiproxy | Network | TCP-level proxy for latency/partition injection |
| Chaos Mesh | Kubernetes | K8s chaos with dashboard and scheduling |

Game Day Planning

Objectives: Define what resilience property to validate (e.g., "circuit breaker opens within 5s of downstream failure")
Scope: Identify target services, blast radius, and rollback criteria
Participants: Engineers, SRE, on-call team, stakeholders
Communication Plan: Notify affected teams, establish war-room channel
Rollback Procedures: Automated kill switch, manual rollback steps, escalation path

Blast Radius Control

Start small: single instance or single dependency
Expand gradually: increase scope only after successful smaller experiments
Automated kill switch: monitoring thresholds that auto-terminate experiments
Monitoring thresholds: abort if error rate exceeds 2x baseline or latency exceeds 3x p99
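
The abort rule above (error rate above 2x baseline, or latency above 3x p99) can be written as a small check that an automated kill switch evaluates on each monitoring tick. This is a sketch, not a prescribed implementation. `Snapshot` and `should_abort` are hypothetical names, and "3x p99" is read here as three times the baseline p99.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    latency_p99_ms: float


def should_abort(baseline, current, error_factor=2.0, latency_factor=3.0):
    """Return the list of breached thresholds; an empty list means keep going."""
    reasons = []
    if current.error_rate > error_factor * baseline.error_rate:
        reasons.append(
            f"error rate {current.error_rate:.4f} exceeds "
            f"{error_factor}x baseline {baseline.error_rate:.4f}"
        )
    if current.latency_p99_ms > latency_factor * baseline.latency_p99_ms:
        reasons.append(
            f"latency p99 {current.latency_p99_ms:.0f}ms exceeds "
            f"{latency_factor}x baseline {baseline.latency_p99_ms:.0f}ms"
        )
    return reasons
```

One consequence of a purely multiplicative rule is that a baseline error rate of zero makes any single error trigger an abort; teams that find this too strict may want an absolute floor as well.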

Experiment Runbook Template

## Experiment: [Name]
### Hypothesis
[If X failure occurs, then Y behavior is expected because Z resilience pattern is in place]
### Steady-State Metrics
- Latency p99: [baseline]
- Error rate: [baseline]
- Throughput: [baseline]
### Experiment Steps
1. [Step with specific tool/command]
2. [Observation window]
3. [Rollback trigger conditions]
### Expected vs Actual Results
| Metric | Expected | Actual | Pass/Fail |
|--------|----------|--------|-----------|
### Findings
[Document unexpected behaviors]
### Action Items
- [ ] [Fix/improvement with owner and deadline]

Related Knowledge Packs

skills/observability/ — resilience metrics, alerting, and SLO/SLI framework
skills/infrastructure/ — health probes and graceful shutdown patterns
skills/sre-practices/ — error budgets, incident management, and change management

Resilience patterns: circuit breaker, rate limiting, bulkhead isolation, timeout control, retry with exponential backoff + jitter, fallback/graceful degradation, backpressure, and resilience metrics.

testing

Updated Apr 23, 2026

npx skillsauth add edercnj/ia-dev-environment resilience

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 23, 2026, 4:36 AM (13.3s, 3 files scanned)

SKILL.md

name: resilience
description: Resilience patterns: circuit breaker, rate limiting, bulkhead isolation, timeout control, retry with exponential backoff + jitter, fallback/graceful degradation, backpressure, and resilience metrics.
user-invocable: false

Related Skills

| Skill | Category | Badges | Description | File | Updated |
|-------|----------|--------|-------------|------|---------|
| edercnj/x-validate-docs | development | Verified, Trusted, Community | Documentation freshness gate: validates 6 dimensions (readme, api, adr, etc.) per PR. | SKILL.md | Jun 10, 2026 |
| edercnj/x-validate-dependency-policy | testing | Verified, Trusted, Community | Conditional dep-policy gate: CVEs, licenses, versions, freshness; SARIF + report. | SKILL.md | Jun 10, 2026 |
| edercnj/x-update-architecture | documentation | Verified, Trusted, Community | Incrementally updates the service or system architecture document; never regenerative. | SKILL.md | Jun 10, 2026 |
| edercnj/x-scan-secrets | development | Verified, Trusted, Community | Scans code and git history for leaked credentials, API keys, and tokens; SARIF output. | SKILL.md | Jun 10, 2026 |

Install manually

# Clone the repo
git clone https://github.com/edercnj/ia-dev-environment.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy into Claude Code skills folder (global)
cp -r ia-dev-environment/java/src/main/resources/targets/claude/skills/knowledge-packs/resilience ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

edercnj/ia-dev-environment

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT