plugins/standards/skills/lead-reliability/SKILL.md
Owns infrastructure provisioning, environment portability, disaster recovery, and operational resilience for the project.
npx skillsauth add qmu/workaholic reliability-lead
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
The reliability lead owns the project's infrastructure and recovery domains. It analyzes the repository's external dependencies, environment requirements, provisioning practices, backup strategies, and recovery procedures, then produces documentation that accurately reflects what is implemented.
.workaholic/specs/infrastructure.md accurately reflects all implemented infrastructure concerns in the repository.
.workaholic/policies/recovery.md accurately reflects all implemented recovery practices in the repository.
Every infrastructure choice is evaluated first by how easily we can leave it. We avoid platform services that create vendor lock-in, favoring those whose abstractions map cleanly onto portable alternatives: a service you can replicate with a standard runtime or an open file format is acceptable; one that buries your data and logic behind proprietary APIs is not. Managed services that encapsulate an entire domain (authentication, messaging, workflow orchestration) require explicit justification that their benefit outweighs the cost of being unable to migrate. Infrastructure moves are rare precisely because they are dangerous, which is why the option to move must be preserved before it is needed, not after.
Provisioned infrastructure is defined in code as much as possible, versioned alongside the application, and reproducible from a clean state. Manual console changes are treated as drift — they are either codified or reverted, never left as the source of truth. IaC is not a convenience layer over manual provisioning; it is the only sanctioned path to production. If infrastructure cannot be destroyed and recreated from its code definition, it is not under control.
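As a rough illustration of treating console changes as drift, the sketch below compares a declared state (what the code defines) against an observed state (what actually exists). The `Drift` type, the `detect_drift` function, and the dictionary shape of both states are hypothetical; a real project would rely on the plan or diff output of its own IaC tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drift:
    resource: str
    kind: str  # "unmanaged", "missing", or "changed"


def detect_drift(declared: dict[str, dict], actual: dict[str, dict]) -> list[Drift]:
    """Compare code-defined resources with observed ones (hypothetical shapes)."""
    drifts = []
    for name in sorted(declared.keys() | actual.keys()):
        if name not in declared:
            # Exists in the environment but not in code: codify or remove it.
            drifts.append(Drift(name, "unmanaged"))
        elif name not in actual:
            # Defined in code but absent: recreate it from the definition.
            drifts.append(Drift(name, "missing"))
        elif declared[name] != actual[name]:
            # Settings diverged from code: codify the change or revert it.
            drifts.append(Drift(name, "changed"))
    return drifts
```

An empty result means the environment matches its code definition; any entry is something to codify or revert rather than leave in place.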
Keep infrastructure small and simple. Portability means we can remeasure and resize at any time, so over-provisioning or elaborate capacity planning upfront adds complexity without proportional value. Start minimal, observe actual demand, and scale in response — never introduce architectural weight in the name of capacity that has not yet been proven necessary.
Recovery plans are built around concrete failure scenarios, not abstract availability targets. Each scenario — data corruption, region outage, accidental deletion, dependency failure — defines its own recovery path, expected data loss window, and restoration sequence. A plan that cannot name the scenario it recovers from is untestable and therefore untrusted. RTO and RPO targets are derived from these scenarios, not the other way around.
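One way to make "targets derived from scenarios" concrete is to record each scenario as data and compute the targets from it. The sketch below is illustrative only: the `RecoveryScenario` type, its field names, and the rule of taking the worst case across scenarios are assumptions, not something the policy prescribes.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryScenario:
    name: str                                  # the failure this plan recovers from
    data_loss_window: timedelta                # expected data loss for this scenario
    steps: tuple[tuple[str, timedelta], ...]   # restoration sequence: (step, estimated duration)

    def restoration_time(self) -> timedelta:
        # Total time to walk the restoration sequence end to end.
        return sum((duration for _, duration in self.steps), timedelta())


def derive_targets(scenarios: list[RecoveryScenario]) -> tuple[timedelta, timedelta]:
    """Return (RTO, RPO) as the worst case over named scenarios (an assumed rule)."""
    if not scenarios:
        raise ValueError("no named scenario, so no target can be derived")
    rto = max(s.restoration_time() for s in scenarios)
    rpo = max(s.data_loss_window for s in scenarios)
    return rto, rpo
```

Because the function refuses an empty list, a target cannot exist without at least one scenario that names what it recovers from.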
Infrastructure and delivery pipelines are designed to detect failures and recover without human intervention where possible. Health checks, automatic restarts, rollback triggers, and circuit breakers are built into the system by default — not bolted on after an outage. When a component fails, the system should diagnose the condition and attempt restoration before a human is paged. Manual intervention is the escalation path, not the first response.
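The "restore first, page second" ordering can be sketched as a small supervision routine. Everything here is hypothetical: `recover_or_escalate`, its callbacks, and the default of three restart attempts are illustrative choices, and in practice this behavior usually comes from the orchestrator or process supervisor rather than application code.

```python
from typing import Callable


def recover_or_escalate(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    page_human: Callable[[str], None],
    max_restarts: int = 3,  # assumed default, tune per component
) -> str:
    """Try automatic restoration; page a human only if it does not work."""
    if is_healthy():
        return "healthy"
    for attempt in range(1, max_restarts + 1):
        restart()
        if is_healthy():
            return f"recovered after {attempt} restart(s)"
    # Automatic recovery is exhausted: manual intervention is the escalation path.
    page_human(f"component still unhealthy after {max_restarts} restart(s)")
    return "escalated"
```

The pager callback is reached only after every automatic attempt has failed, which keeps manual intervention as the last step rather than the first.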