microservices-resilience/SKILL.md
Resilience patterns for microservices — circuit breaker (full implementation), health check endpoint design, load balancing algorithms, retry with backoff, bulkhead, timeout, graceful degradation, slowstart recovery, and stale cache fallback...
… microservices-resilience or would be better handled by a more specific companion skill.
… SKILL.md first, then load only the referenced deep-dive files that are necessary for the task.

"Failure will occur. Design for it." (Stetson, NGINX MRA)
In a microservices architecture, the service graph is deep and every network hop is a failure opportunity. Resilience is not added after the fact — it is designed in from the start.
The circuit breaker pattern was popularised by Martin Fowler and is implemented by Netflix Hystrix and NGINX Plus.
Without a circuit breaker, a failing service causes a cascade: all callers wait, exhaust connection pools, and fail too. A circuit breaker detects the failing service and stops calling it, allowing it to recover.
┌─────────────────────────────┐
│ CLOSED                      │
│ Normal operation            │
│ All requests pass through   │
└──────────┬──────────────────┘
           │ failures > threshold
           ▼
┌─────────────────────────────┐
│ OPEN                        │
│ Service is failing          │
│ All requests blocked        │
│ Fallback served             │
└──────────┬──────────────────┘
           │ after timeout, try again
           ▼
┌─────────────────────────────┐
│ HALF-OPEN                   │
│ One probe request allowed   │
│ Success → CLOSED            │
│ Failure → OPEN again        │
└─────────────────────────────┘
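The three states above can be sketched in application code. This is a minimal illustration, not the implementation used by Hystrix or NGINX Plus: the names `CircuitBreaker`, `CircuitOpenException`, `call()` and the injectable `$clock` are all hypothetical, and the state lives in one PHP process only. Under PHP-FPM the counters would have to be kept in a shared store to be useful across requests.

```php
<?php
// Hypothetical sketch of the CLOSED / OPEN / HALF-OPEN state machine.
final class CircuitOpenException extends RuntimeException {}

final class CircuitBreaker
{
    private string $state = 'closed';
    private int $failures = 0;      // consecutive failures seen so far
    private float $openedAt = 0.0;  // clock value when the circuit last opened
    /** @var callable(): float */
    private $clock;

    public function __construct(
        private int $failureThreshold = 3,
        private float $openSeconds = 30.0,
        ?callable $clock = null,
    ) {
        // The clock is injectable so the timeout can be tested without sleeping.
        $this->clock = $clock ?? static fn (): float => microtime(true);
    }

    public function state(): string
    {
        return $this->state;
    }

    public function call(callable $fn, ?callable $fallback = null): mixed
    {
        $now = ($this->clock)();

        if ($this->state === 'open') {
            if ($now - $this->openedAt < $this->openSeconds) {
                // OPEN: block the request and serve the fallback if there is one.
                if ($fallback !== null) {
                    return $fallback();
                }
                throw new CircuitOpenException('circuit is open');
            }
            // Timeout elapsed: let this one call through as the probe.
            $this->state = 'half_open';
        }

        try {
            $result = $fn();
        } catch (Throwable $e) {
            $this->failures++;
            // A failed probe reopens immediately; otherwise open when failures > threshold.
            if ($this->state === 'half_open' || $this->failures > $this->failureThreshold) {
                $this->state = 'open';
                $this->openedAt = $now;
            }
            if ($fallback !== null) {
                return $fallback();
            }
            throw $e;
        }

        // Success closes the circuit and resets the failure count.
        $this->state = 'closed';
        $this->failures = 0;
        return $result;
    }
}
```

A caller would wrap each downstream request, for example `$breaker->call(fn () => $client->fetch(), fn () => $cachedValue)`, where `$client` and `$cachedValue` stand in for whatever the service actually uses.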
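Every service MUST expose a /health endpoint. Passive health checks detect a failure only after a real request has already failed, which is too late. Active health checks probe the endpoint on a schedule, so an unhealthy instance can be taken out of rotation before it receives traffic.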
Health endpoint contract:
GET /health

→ 200 OK (healthy)
{
  "status": "healthy",
  "checks": {
    "database": "ok",
    "memory_used_pct": 62,
    "queue_lag_ms": 45
  }
}

→ 503 Service Unavailable (unhealthy)
{
  "status": "unhealthy",
  "checks": {
    "database": "connection_failed",
    "memory_used_pct": 87
  }
}
What to check in /health:
- … SELECT 1)

PHP/Laravel /health implementation:
Route::get('/health', function () {
    $checks = [];
    $healthy = true;

    // DB check: a trivial query proves the connection is usable
    try {
        DB::selectOne('SELECT 1');
        $checks['database'] = 'ok';
    } catch (\Throwable $e) {
        $checks['database'] = 'connection_failed';
        $healthy = false;
    }

    // Memory check (ini_parse_quantity requires PHP 8.2+).
    // memory_limit = -1 means "unlimited", so skip the percentage in that case.
    $limitBytes = ini_parse_quantity(ini_get('memory_limit'));
    if ($limitBytes > 0) {
        $memUsedPct = (int) round(memory_get_usage(true) / $limitBytes * 100);
        $checks['memory_used_pct'] = $memUsedPct;
        if ($memUsedPct > 80) {
            $healthy = false;
        }
    }

    return response()->json([
        'status' => $healthy ? 'healthy' : 'unhealthy',
        'checks' => $checks,
    ], $healthy ? 200 : 503);
});
upstream resizer_service {
    zone resizer 64k;
    server resizer-1.internal:8080;
    server resizer-2.internal:8080;
    server resizer-3.internal:8080;
    # Circuit breaker: the active health check below marks an instance unhealthy after 1 failed check
    # Slowstart: add slow_start=30s to each server line so a recovering instance ramps to full traffic over 30s
}

# Active health check location (internal, not publicly accessible)
# health_check and match require NGINX Plus
location @health_check_resizer {
    internal;
    health_check uri=/health match=service_healthy fails=1 interval=3s;
    proxy_pass http://resizer_service;
    proxy_ssl_session_reuse on;
}

# Health check response matcher
match service_healthy {
    status 200;
    header Content-Type ~ application/json;
    body ~ '"status":"healthy"';
}
| Option | Use When | How |
|--------|---------|-----|
| Redirect to healthy instances | Other instances available | NGINX upstream failover |
| Serve stale cached response | Read-only service, data freshness not critical | proxy_cache_use_stale in NGINX |
| Alternative data source | Degraded but usable alternative exists | Secondary upstream or static response |
| Rate limit + queue | Overload (not failure) | limit_req + async queue |
| Graceful error | No alternative | Return structured error with retry guidance |
Stale cache fallback in NGINX:
proxy_cache_use_stale error timeout updating http_503;
proxy_cache_background_update on;
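The same stale-fallback idea can be applied inside a service when no caching proxy sits in front of the dependency. The sketch below is illustrative only: `fetchWithStaleFallback()`, its parameters, and the array used as a cache are hypothetical stand-ins (a real service would more likely use Redis or the framework cache). It serves a fresh entry when it has one, refreshes otherwise, and falls back to an expired entry only when the refresh fails and the entry is within the allowed staleness window.

```php
<?php
// Hypothetical helper; the $cache array is an in-memory stand-in for a real cache.
// Each entry has the shape ['value' => mixed, 'storedAt' => int (unix seconds)].
function fetchWithStaleFallback(
    array &$cache,
    string $key,
    callable $fetch,
    int $ttl,        // seconds an entry counts as fresh
    int $maxStale,   // extra seconds an expired entry may still be served on failure
    ?int $now = null // injectable for testing; defaults to time()
): array {
    $now ??= time();
    $entry = $cache[$key] ?? null;
    $age = $entry === null ? null : $now - $entry['storedAt'];

    // Fresh hit: no downstream call needed.
    if ($entry !== null && $age <= $ttl) {
        return ['value' => $entry['value'], 'stale' => false];
    }

    try {
        $value = $fetch();
        $cache[$key] = ['value' => $value, 'storedAt' => $now];
        return ['value' => $value, 'stale' => false];
    } catch (Throwable $e) {
        // Downstream failed: serve the expired entry if it is not too old.
        if ($entry !== null && $age <= $ttl + $maxStale) {
            return ['value' => $entry['value'], 'stale' => true];
        }
        throw $e;
    }
}
```

Returning the `stale` flag lets the caller decide whether to tell the user that the data may be out of date.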
Slowstart for recovering instances:
# Add to server line in upstream — ramp to full traffic over 30 seconds
server resizer-1.internal:8080 slow_start=30s;
| Algorithm | How | Best For |
|-----------|-----|---------|
| Round Robin (default) | Rotate through instances equally | Uniform request weight |
| Least Connections | Send to instance with fewest active connections | Variable request duration |
| Least Time | Send to fastest-responding instance | Latency-sensitive services |
| IP Hash | Same client always goes to same instance | Session persistence (avoid if possible — prefer Redis) |
| Random | Random selection; with the `two` option, picks two instances at random and sends to the better of the pair | High-scale, uniform load |
NGINX upstream with Least Time:
upstream ai_service {
    # NGINX Plus only: prefer the instance with the lowest average time to
    # receive the full response, combined with the fewest active connections
    least_time last_byte;
    server ai-1.internal:5000;
    server ai-2.internal:5000;
}
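To make two of the selection rules in the table concrete, here is a small sketch of Least Connections and of Random with two choices. It only illustrates the selection logic and is not how NGINX implements it; the function names and the `$active` map (instance name to active connection count) are hypothetical.

```php
<?php
// Hypothetical helpers. $active maps instance name => current active connections.

// Least Connections: pick the instance with the fewest active connections.
function pickLeastConnections(array $active): string
{
    if ($active === []) {
        throw new InvalidArgumentException('no instances available');
    }
    $best = null;
    foreach ($active as $name => $count) {
        if ($best === null || $count < $active[$best]) {
            $best = $name;
        }
    }
    return $best;
}

// Random with two choices: sample two distinct instances, keep the less loaded one.
function pickRandomTwo(array $active): string
{
    $names = array_keys($active);
    if ($names === []) {
        throw new InvalidArgumentException('no instances available');
    }
    if (count($names) === 1) {
        return $names[0];
    }
    $picked = array_rand($names, 2); // two distinct indexes into $names
    $a = $names[$picked[0]];
    $b = $names[$picked[1]];
    return $active[$a] <= $active[$b] ? $a : $b;
}
```

With two choices the most loaded instance is never selected while any other instance is available, which is why the approach spreads load well without tracking global state.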
Retry transient failures (network blip, service restart). Never retry non-idempotent calls; the helper below is intended for idempotent calls only.
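// Retry helper for idempotent service calls.
// ServiceUnavailableException is assumed to be defined by the application.
function callWithRetry(callable $fn, int $maxAttempts = 3, int $baseDelayMs = 100): mixed
{
    $attempt = 0;
    while (true) {
        try {
            return $fn();
        } catch (ServiceUnavailableException $e) {
            $attempt++;
            if ($attempt >= $maxAttempts) {
                throw $e; // out of attempts: surface the last failure
            }
            // Exponential backoff: 100ms, 200ms, 400ms, ...
            // With the default of 3 attempts only the first two delays occur.
            $delayMs = $baseDelayMs * (2 ** ($attempt - 1));
            usleep($delayMs * 1000);
        }
    }
}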
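One common refinement of the helper above is to randomise each delay so that many callers that failed at the same moment do not all retry at the same moment. The sketch below shows one way to compute such a delay ("full jitter": a uniform random value between zero and the capped exponential delay). The function name `backoffDelayMs()` and the `$maxDelayMs` cap are assumptions added for illustration and are not part of the original helper.

```php
<?php
// Hypothetical helper: returns a randomised delay in milliseconds for a given attempt.
function backoffDelayMs(int $attempt, int $baseDelayMs = 100, int $maxDelayMs = 2000): int
{
    if ($attempt < 1) {
        throw new InvalidArgumentException('attempt must be >= 1');
    }
    // Exponential ceiling: base, 2x base, 4x base, ... capped at $maxDelayMs.
    $ceiling = (int) min($maxDelayMs, $baseDelayMs * (2 ** ($attempt - 1)));
    // Full jitter: any value from 0 up to the ceiling is equally likely.
    return random_int(0, $ceiling);
}
```

Inside the retry loop, `usleep(backoffDelayMs($attempt, $baseDelayMs) * 1000)` could replace the fixed delay; the cap keeps a long retry sequence from sleeping for an unbounded time.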
Isolate failure by partitioning resources. Prevent one service's failures from consuming all connection pool slots.
// Separate connection pools per downstream service
// In Laravel config/database.php: one connection entry per service
'connections' => [
    'enrollment_db' => [...], // pool for enrollment-service
    'finance_db' => [...],    // pool for finance-service, isolated
]
In a Kubernetes context: assign resource limits (requests/limits) per service container so one runaway service cannot starve others.
Every inter-service call MUST have a timeout. Never make an open-ended call.
| Call Type | Recommended Timeout |
|-----------|-------------------|
| Synchronous API call (read) | 500ms – 2s |
| Synchronous API call (write/transaction) | 2s – 5s |
| AI API call (non-streaming) | 8s – 15s |
| Async job dispatch | 200ms (just enqueue, don't wait) |
| Health check probe | 1s |
// Laravel HTTP client with timeout
$response = Http::timeout(2)->get('http://enrollment-service/api/v1/students/42');
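One way to keep the table above from drifting out of sync with the code is to define the timeouts once and look them up by call type. The sketch below is only an illustration: the constant `CALL_TIMEOUTS_MS`, its keys, and `timeoutSeconds()` are hypothetical names, and the values are the upper bounds from the table.

```php
<?php
// Hypothetical central timeout policy; values are the table's upper bounds, in milliseconds.
const CALL_TIMEOUTS_MS = [
    'sync_read'        => 2000,
    'sync_write'       => 5000,
    'ai_non_streaming' => 15000,
    'async_dispatch'   => 200,
    'health_probe'     => 1000,
];

// Returns the timeout for a call type in seconds; unknown types fail loudly
// so that no call is ever made without an explicit timeout.
function timeoutSeconds(string $callType): float
{
    if (!array_key_exists($callType, CALL_TIMEOUTS_MS)) {
        throw new InvalidArgumentException("No timeout defined for call type: {$callType}");
    }
    return CALL_TIMEOUTS_MS[$callType] / 1000;
}
```

The result can then be handed to whatever HTTP client the service uses; whether that client accepts fractional seconds depends on the client and its version, so check before passing a value such as 0.2.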
For every inter-service dependency, verify:
- … /health with active status

See also:
- microservices-architecture-models: NGINX circuit breaker implementation in each model
- microservices-communication: Health check propagation and service discovery
- microservices-ai-integration: AI service resilience and async job fallback
- ai-ux-patterns: User-facing error states when downstream service is unavailable