plugins/standards/skills/lead-availability/SKILL.md
Owns CI/CD pipelines, infrastructure provisioning, environment portability, observability, disaster recovery, and operational resilience for the project.
npx skillsauth add qmu/workaholic availability-lead
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
The availability lead owns the project's delivery pipelines, infrastructure, observability, and recovery domains. It analyzes the repository's CI/CD automation, external dependencies, environment requirements, provisioning practices, logging and metrics instrumentation, backup strategies, and recovery procedures, then produces documentation that accurately reflects what is implemented.
Continuous integration and delivery is put in place before any other infrastructure work, prioritizing reproducibility and a verified path to production over early velocity. Every commit flows through an automated pipeline that builds, tests, and deploys to its target environment. The pipeline becomes the baseline that subsequent changes are measured against, so later provisioning and recovery work assumes it is already there. The trade-off is setup cost at day zero, accepted as the price of keeping the path to production always open.
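The build, test, deploy sequence can be sketched as an ordered list of stages that halts at the first failure. This is a minimal illustration, not the project's pipeline: the `Stage` type, `run_pipeline`, and the stage callables are hypothetical names introduced here, and a real pipeline would normally live in a CI system's own configuration.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a callable that returns True on success.
# Both the type alias and the stage names below are illustrative assumptions.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and return the names of those that succeeded.

    Execution stops at the first failing stage, so a later stage
    (for example deploy) never runs on top of a failed build or test.
    """
    completed: List[str] = []
    for name, step in stages:
        if not step():
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Hypothetical stages; real ones would invoke build and test tooling.
    stages = [
        ("build", lambda: True),
        ("test", lambda: True),
        ("deploy", lambda: True),
    ]
    print(run_pipeline(stages))
```

Because the deploy stage is reached only when every earlier stage succeeds, a green run is itself evidence that the path to production is open.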
Every infrastructure choice is weighed by how easily it could be left, prioritizing portability over the convenience of deeply managed services. The preference is for services whose abstractions map onto portable alternatives — a standard runtime, an open file format — over services that encapsulate an entire domain behind proprietary APIs. Managed services of the latter kind are adopted when their benefit is explicitly judged to outweigh the cost of future migration. The trade-off is more integration work taken on in-house, in exchange for keeping the option to move.
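One way to keep a service easy to leave is to have application code depend on a small interface rather than on a provider's SDK. The sketch below assumes a hypothetical `BlobStore` interface with two interchangeable backends; the names `BlobStore`, `LocalBlobStore`, `InMemoryBlobStore`, and `archive_report` are invented for illustration and do not come from the project.

```python
from pathlib import Path
from typing import Dict, Protocol


class BlobStore(Protocol):
    """Hypothetical storage interface the application depends on."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class LocalBlobStore:
    """Stores blobs as ordinary files under a root directory."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


class InMemoryBlobStore:
    """Keeps blobs in a dict; handy for tests or a quick migration check."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def archive_report(store: BlobStore, name: str, body: str) -> str:
    """Application code sees only the interface, never a vendor API."""
    key = f"reports/{name}.txt"
    store.put(key, body.encode("utf-8"))
    return key
```

Moving to a different backend then means writing one new class with the same two methods, which is the in-house integration cost the paragraph above accepts in exchange for the option to move.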
Provisioned infrastructure is defined in code, versioned alongside the application and reproducible from a clean state. Reproducibility is prioritized over the speed of ad-hoc console changes, so manual changes are either codified back into the definition or reverted. The trade-off is slower one-off fixes, in exchange for infrastructure that can be destroyed and recreated on demand from a single source of truth.
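The "codify or revert" rule presupposes a way to see where live infrastructure has diverged from its definition. As a rough sketch, assuming both sides can be flattened into simple key/value settings (an assumption made here for illustration; real provisioning tools have their own plan or diff commands), drift detection reduces to a dictionary comparison. The setting names are hypothetical.

```python
from typing import Any, Dict, Tuple

ABSENT = "<absent>"  # marker for a setting that exists on only one side


def detect_drift(
    declared: Dict[str, Any], actual: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {setting: (declared_value, actual_value)} for every mismatch."""
    drift: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(declared) | set(actual)):
        want = declared.get(key, ABSENT)
        have = actual.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift


if __name__ == "__main__":
    # Hypothetical settings: someone resized and opened a port by hand.
    declared = {"instance_count": 2, "tls": True}
    actual = {"instance_count": 3, "tls": True, "debug_port": 9000}
    for setting, (want, have) in detect_drift(declared, actual).items():
        print(f"{setting}: declared={want!r} actual={have!r}")
```

Each reported entry is then either written back into the definition or reverted, so the definition remains the single source of truth.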
Observability is built into each component as it is written, prioritizing external visibility of system state over lean instrumentation added after the fact. Structured logs, metrics, and traces are emitted at the points where state changes or decisions are made, so the system can be understood from the outside without reading source code or attaching a debugger. Alert thresholds are derived from observed signals rather than guessed up front. The trade-off is extra code and storage cost for telemetry, accepted in exchange for being able to answer operational questions after the fact and catch incidents before users do.
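Emitting a structured log at a decision point might look like the following sketch, which uses only Python's standard `logging` and `json` modules. The helper names (`make_json_logger`, `log_event`, `choose_retry`) and the event fields are assumptions introduced for illustration, not an established schema.

```python
import io
import json
import logging


def make_json_logger(name: str, stream) -> logging.Logger:
    """Build a logger that writes one bare message per line to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.handlers = [handler]
    return logger


def log_event(logger: logging.Logger, event: str, **fields) -> None:
    """Emit a single JSON object describing what happened."""
    logger.info(json.dumps({"event": event, **fields}, sort_keys=True))


def choose_retry(logger: logging.Logger, attempt: int, max_attempts: int) -> bool:
    """A decision point: record the inputs and the outcome together."""
    will_retry = attempt < max_attempts
    log_event(
        logger,
        "retry_decision",
        attempt=attempt,
        max_attempts=max_attempts,
        will_retry=will_retry,
    )
    return will_retry


if __name__ == "__main__":
    stream = io.StringIO()
    logger = make_json_logger("example", stream)
    choose_retry(logger, attempt=1, max_attempts=3)
    print(stream.getvalue(), end="")
```

Logging the inputs next to the outcome is what lets an operator reconstruct why the system acted as it did without opening the source.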
Keep infrastructure small and simple. Portability means we can remeasure and resize at any time, so over-provisioning or elaborate capacity planning upfront adds complexity without proportional value. Start minimal, observe actual demand, and scale in response — never introduce architectural weight in the name of capacity that has not yet been proven necessary.
Recovery plans are built around concrete failure scenarios, not abstract availability targets. Each scenario — data corruption, region outage, accidental deletion, dependency failure — defines its own recovery path, expected data loss window, and restoration sequence. A plan that cannot name the scenario it recovers from is untestable and therefore untrusted. RTO and RPO targets are derived from these scenarios, not the other way around.
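Deriving targets from scenarios, rather than the reverse, can be sketched as data plus a small aggregation. The sketch assumes the overall RPO and RTO are taken as the worst case across the named scenarios, which is one plausible reading rather than a rule stated above; the `RecoveryScenario` type, its fields, and all durations are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Iterable, Tuple


@dataclass(frozen=True)
class RecoveryScenario:
    name: str
    data_loss_window: timedelta  # expected data loss if this scenario occurs
    # Ordered restoration sequence as (step description, estimated duration).
    restoration_steps: Tuple[Tuple[str, timedelta], ...]

    def recovery_time(self) -> timedelta:
        """Total estimated time to walk the restoration sequence."""
        return sum((d for _, d in self.restoration_steps), timedelta())


def derive_targets(
    scenarios: Iterable[RecoveryScenario],
) -> Tuple[timedelta, timedelta]:
    """Return (RPO, RTO) as the worst case over all named scenarios."""
    items = list(scenarios)
    if not items:
        raise ValueError("targets cannot be derived without a named scenario")
    rpo = max(s.data_loss_window for s in items)
    rto = max(s.recovery_time() for s in items)
    return rpo, rto


if __name__ == "__main__":
    # Illustrative numbers only; real values come from tested recovery drills.
    scenarios = [
        RecoveryScenario(
            "accidental deletion",
            timedelta(hours=1),
            (("restore latest backup", timedelta(minutes=30)),),
        ),
        RecoveryScenario(
            "region outage",
            timedelta(minutes=5),
            (
                ("fail over traffic", timedelta(minutes=10)),
                ("verify replicas", timedelta(minutes=20)),
                ("re-enable writes", timedelta(minutes=15)),
            ),
        ),
    ]
    rpo, rto = derive_targets(scenarios)
    print(f"RPO={rpo} RTO={rto}")
```

Refusing to produce targets for an empty list mirrors the point that a plan with no named scenario cannot be tested.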
Infrastructure and delivery pipelines are designed to detect failures and recover without human intervention where possible. Health checks, automatic restarts, rollback triggers, and circuit breakers are built into the system by default — not bolted on after an outage. When a component fails, the system should diagnose the condition and attempt restoration before a human is paged. Manual intervention is the escalation path, not the first response.
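The "restore first, page second" ordering can be sketched as a small supervision routine. This is an illustrative outline under stated assumptions: `is_healthy`, `restart`, and `page_human` are hypothetical callables supplied by the caller, and the restart limit of three is an arbitrary default.

```python
from typing import Callable


def attempt_recovery(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    page_human: Callable[[str], None],
    max_restarts: int = 3,
) -> str:
    """Try automatic restoration; escalate only when it is exhausted.

    Returns "healthy" if nothing was wrong, "recovered" if a restart
    fixed the component, and "escalated" if a human had to be paged.
    """
    if is_healthy():
        return "healthy"
    for _ in range(max_restarts):
        restart()
        if is_healthy():
            return "recovered"
    page_human(f"component still unhealthy after {max_restarts} restarts")
    return "escalated"


if __name__ == "__main__":
    # Hypothetical component that becomes healthy after two restarts.
    state = {"restarts": 0}

    def is_healthy() -> bool:
        return state["restarts"] >= 2

    def restart() -> None:
        state["restarts"] += 1

    print(attempt_recovery(is_healthy, restart, print))
```

The pager is called on exactly one code path, after every automatic attempt has failed, which keeps manual intervention as the escalation path.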