peterbamuhigire/reliability-engineering

Name: reliability-engineering
Author: peterbamuhigire

reliability-engineering/SKILL.md

npx skillsauth add peterbamuhigire/skills-web-dev reliability-engineering

Reliability Engineering

Use When

Use when designing or reviewing production reliability for APIs, SaaS platforms, background jobs, distributed workflows, mobile backends, or AI-enabled systems. Covers timeout and retry policy, degradation, queue safety, incident readiness, and recovery-aware design.
The task needs reusable judgment, domain constraints, or a proven workflow rather than ad hoc advice.

Do Not Use When

The task is unrelated to reliability-engineering or would be better handled by a more specific companion skill.
The request only needs a trivial answer and none of this skill's constraints or references materially help.

Required Inputs

Gather relevant project context, constraints, and the concrete problem to solve; load references only as needed.
Confirm the desired deliverable: design, code, review, migration plan, audit, or documentation.

Workflow

Read this SKILL.md first, then load only the referenced deep-dive files that are necessary for the task.
Apply the ordered guidance, checklists, and decision rules in this skill instead of cherry-picking isolated snippets.
Produce the deliverable with assumptions, risks, and follow-up work made explicit when they matter.

Quality Standards

Keep outputs execution-oriented, concise, and aligned with the repository's baseline engineering standards.
Preserve compatibility with existing project conventions unless the skill explicitly requires a stronger standard.
Prefer deterministic, reviewable steps over vague advice or tool-specific magic.

Anti-Patterns

Treating examples as copy-paste truth without checking fit, constraints, or failure modes.
Loading every reference file by default instead of using progressive disclosure.

Outputs

A concrete result that fits the task: implementation guidance, review findings, architecture decisions, templates, or generated artifacts.
Clear assumptions, tradeoffs, or unresolved gaps when the task cannot be completed from available context alone.
References used, companion skills, or follow-up actions when they materially improve execution.

References

Use the references/ directory for deep detail after reading the core workflow below.

Use this skill when correctness under ideal conditions is not enough. The goal is to keep important workflows safe, available enough, diagnosable, and recoverable under load, dependency failure, stale state, and operator error.

Load Order

Load world-class-engineering first.
Load this skill when the system has external dependencies, background processing, scale risk, or meaningful uptime expectations.
Pair it with observability-monitoring, deployment-release-engineering, and distributed-systems-patterns when services or queues are involved.

Reliability Workflow

1. Classify Criticality

For each important workflow, define:

user and business impact if it fails
maximum acceptable downtime or degradation
data-loss tolerance
financial, compliance, or trust consequences
recovery time expectation
acceptable operator effort or toil

Not every path needs the same reliability level.
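
As a rough illustration, the fields above can be captured as one record per workflow so that the criticality table is sortable and reviewable. This is a minimal sketch, not part of the skill itself: the `Tier` levels, the `WorkflowCriticality` field names, and both sample rows are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    # Hypothetical tier names; use whatever levels the project already has.
    CRITICAL = 1
    IMPORTANT = 2
    BEST_EFFORT = 3


@dataclass(frozen=True)
class WorkflowCriticality:
    name: str
    failure_impact: str         # user and business impact if it fails
    max_downtime_minutes: int   # maximum acceptable downtime or degradation
    data_loss_tolerance: str    # e.g. "none" or "regenerable"
    consequences: str           # financial, compliance, or trust consequences
    recovery_time_minutes: int  # recovery time expectation
    operator_toil: str          # acceptable operator effort or toil
    tier: Tier


def by_tier(rows):
    """Order workflows so the most critical, least tolerant ones come first."""
    return sorted(rows, key=lambda r: (r.tier.value, r.max_downtime_minutes))


# Hypothetical entries, for illustration only.
TABLE = [
    WorkflowCriticality("weekly-report-email", "report arrives late", 1440,
                        "regenerable", "minor annoyance", 2880,
                        "next business day", Tier.BEST_EFFORT),
    WorkflowCriticality("checkout-payment", "lost revenue", 5,
                        "none", "financial and trust", 15,
                        "on-call responds immediately", Tier.CRITICAL),
]

for row in by_tier(TABLE):
    print(f"{row.tier.name:<12} {row.name:<22} downtime <= {row.max_downtime_minutes} min")
```

Keeping the table as data makes it easy to see which paths justify the stricter protections described in the later steps.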

2. Map Failure Modes

Explicitly list:

dependency timeout or outage
partial write or partial side effect
duplicate delivery or replay
stale reads or cache inconsistency
concurrency conflict
operator or configuration error
overload, backpressure, or queue growth
release-induced regression

If a failure mode is plausible and unhandled, the design is incomplete.
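
One way to make the "plausible and unhandled" rule checkable is to record a handling note per failure mode and flag the gaps. The sketch below assumes a simple dictionary format; the `unhandled` helper and the sample plan are hypothetical and only illustrate the idea.

```python
from typing import Dict, List

# The failure modes listed above, used as a checklist.
FAILURE_MODES = [
    "dependency timeout or outage",
    "partial write or partial side effect",
    "duplicate delivery or replay",
    "stale reads or cache inconsistency",
    "concurrency conflict",
    "operator or configuration error",
    "overload, backpressure, or queue growth",
    "release-induced regression",
]


def unhandled(plan: Dict[str, str]) -> List[str]:
    """Return every failure mode that has no handling note in the plan."""
    return [mode for mode in FAILURE_MODES if not plan.get(mode, "").strip()]


# Hypothetical plan for one workflow; only two modes are addressed so far.
plan = {
    "dependency timeout or outage": "2s timeout, bounded retries, cached fallback",
    "duplicate delivery or replay": "deduplication key on order id",
}

for mode in unhandled(plan):
    print("UNHANDLED:", mode)
```

A mode that is judged implausible for a given workflow can carry a note saying so, which keeps the decision visible instead of implicit.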

3. Design Protection Mechanisms

Choose deliberate policies for:

timeout budgets
retries and backoff
idempotency and deduplication
circuit breaking or load shedding
queues, dead-letter handling, and replay
graceful degradation or fallback behavior
concurrency limits and admission control
reconciliation jobs for eventually consistent workflows
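
To make one of these policies concrete, here is a minimal circuit-breaker sketch. It is an illustration under stated assumptions, not the skill's prescribed implementation: the `CircuitBreaker` class, its thresholds, and the injectable `clock` are hypothetical, and the code is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # cool-down in seconds
        self.clock = clock                          # injectable for tests
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: let this call through as a trial (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast instead of piling more load onto a struggling dependency, which is where a degradation or fallback policy would take over.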

4. Design Recovery

For every critical flow, define:

how to detect failure
who owns the first response
whether to retry, compensate, reconcile, or roll back
what can be replayed safely
what manual tooling or runbook is needed
how recent deployments or config changes will be ruled in or out quickly

5. Verify Reliability

Before production claims, produce evidence for:

timeout and retry behavior
degraded-state behavior
queue recovery or replay
duplicate-request safety
alert and runbook usefulness
overload or backpressure behavior
staged recovery drills or game-day exercises for the highest-cost failures

Reliability Standards

Retries and Timeouts

Retries without idempotency are usually a bug.
Timeouts must be shorter than both user patience and the point at which upstream callers time out or run out of capacity.
Use bounded retries with jitter for transient failures.
Do not retry validation failures, authorization failures, or deterministic business rejections.
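
The rules above can be sketched as a small retry helper. This is a minimal illustration, assuming the caller classifies errors into two hypothetical exception types (`TransientError` and `NonRetryableError`) and that the wrapped operation is idempotent; the default attempt count, delays, and deadline are placeholders, not recommendations.

```python
import random
import time


class NonRetryableError(Exception):
    """Validation failure, authorization failure, or deterministic business rejection."""


class TransientError(Exception):
    """Timeout, connection reset, or another failure that may succeed on retry."""


def call_with_retries(fn, *, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      deadline=5.0, sleep=time.sleep, clock=time.monotonic,
                      rng=random.random):
    """Call fn with bounded, jittered retries inside an overall time budget.

    fn must be idempotent: it may run more than once.
    """
    start = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except NonRetryableError:
            raise  # retrying cannot change the outcome
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted
            # Full jitter: a random wait up to the capped exponential delay.
            delay = rng() * min(max_delay, base_delay * 2 ** (attempt - 1))
            if clock() - start + delay > deadline:
                raise  # another attempt would exceed the overall budget
            sleep(delay)
```

The `sleep`, `clock`, and `rng` parameters exist only so the behavior can be verified deterministically in tests, which supports the evidence expected in the verification step.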

Queues and Jobs

Every job needs an idempotency strategy or deduplication key.
Poison messages need dead-letter or quarantine behavior.
Replay must be safe, observable, and permissioned.
Long-running jobs need progress or heartbeat signals.
Queues need saturation and age monitoring, not only failure counts.
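
A minimal in-memory sketch of the first two rules follows: duplicate deliveries are skipped by deduplication key, and a job that keeps failing is quarantined in a dead-letter list. The `JobWorker` class and its fields are hypothetical, and a real system would keep the completed keys and the dead-letter queue in durable storage.

```python
from collections import deque


class JobWorker:
    def __init__(self, handler, max_attempts=3):
        self.handler = handler
        self.max_attempts = max_attempts
        self.queue = deque()
        self.dead_letter = []    # quarantined poison messages
        self.done_keys = set()   # assumption: durable storage in production

    def enqueue(self, dedup_key, payload):
        self.queue.append({"key": dedup_key, "payload": payload, "attempts": 0})

    def run_once(self):
        """Process one job. Returns False when the queue is empty."""
        if not self.queue:
            return False
        job = self.queue.popleft()
        if job["key"] in self.done_keys:
            return True  # duplicate delivery: skip the side effect
        try:
            self.handler(job["payload"])
        except Exception as exc:
            job["attempts"] += 1
            if job["attempts"] >= self.max_attempts:
                job["error"] = repr(exc)
                self.dead_letter.append(job)  # stop retrying; keep for inspection
            else:
                self.queue.append(job)        # retry later, behind other work
            return True
        self.done_keys.add(job["key"])
        return True
```

Recording the key only after the handler succeeds means a crash between the two steps can still cause a repeat, so the handler itself should tolerate being run twice.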

Degradation

Define what the user sees when a dependency is slow or unavailable.
Prefer reduced capability over total failure where business risk allows.
Fail closed for privileged or security-sensitive paths.
Fail open only with deliberate justification and bounded blast radius.
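
The difference between degrading and failing closed can be shown with two small functions. This is a sketch under assumptions: `DependencyUnavailable`, `recommendations`, and `is_allowed` are hypothetical names, and the callables passed in stand for whatever client the project actually uses.

```python
class DependencyUnavailable(Exception):
    """Hypothetical error raised when a dependency is slow or down."""


def recommendations(fetch, cached):
    """Non-critical path: serve reduced capability instead of failing."""
    try:
        return {"items": fetch(), "degraded": False}
    except DependencyUnavailable:
        # Tell the caller the data may be stale so the UI can say so.
        return {"items": cached, "degraded": True}


def is_allowed(check_permission, user, action):
    """Security-sensitive path: fail closed when the check cannot run."""
    try:
        return bool(check_permission(user, action))
    except DependencyUnavailable:
        return False  # deny rather than guess
```

The `degraded` flag is what lets the product define what the user sees, for example a notice that results may be out of date.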

Incident Readiness

Alerts need an owner and a first action.
Correlate incidents to release version, tenant, actor, and dependency.
Keep recovery tools safe for operators under stress.
Write runbooks for high-cost incidents before the incident happens.
Rehearse at least the top failure scenarios often enough that the response is not theoretical.

Deliverables

For meaningful reliability work, produce:

criticality table
failure-mode table
timeout and retry policy
degradation and fallback notes
queue and replay strategy
incident ownership and recovery outline
reliability verification or exercise plan

Review Checklist

[ ] Critical workflows have explicit reliability targets or expectations.
[ ] Retries, timeouts, and idempotency rules are coherent.
[ ] Duplicate, replay, and partial-failure cases are handled safely.
[ ] Degradation behavior is defined for dependency failures.
[ ] Recovery paths and owners are explicit.
[ ] Reliability claims are backed by tests, simulations, or staged evidence.

References

references/reliability-patterns.md: Design rules for timeouts, retries, queues, and degradation.
references/incident-readiness.md: Incident preparation and recovery prompts.
references/reliability-verification.md: Reliability drills, overload checks, and evidence expectations.

8 stars

development

Updated Apr 15, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean (95%).
Semgrep: Static code analysis for vulnerabilities. Clean (95%).
mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%).
Snyk (dep): Open source security scanning. Skipped (50%).
Socket.dev: Supply chain security analysis. Skipped (50%).
VirusTotal: Multi-engine malware detection. Skipped (50%).
CrowdStrike: Advanced threat intelligence. Skipped (50%).
OSV-Scanner: Open Source Vulnerability database check. Skipped (50%).
OWASP Dep-Check: Skipped (50%).

Last scanned: Apr 15, 2026, 9:15 AM (5.5s, 4 files scanned)

SKILL.md

name: reliability-engineering
description: Use when designing or reviewing production reliability for APIs, SaaS
portable: true

Install manually

# Clone the repo
git clone https://github.com/peterbamuhigire/skills-web-dev.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into the Claude Code skills folder (global)
cp -r skills-web-dev/reliability-engineering ~/.claude/skills/

Claude Code Skills: official skills path docs.

Repository

peterbamuhigire/skills-web-dev

8 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT