.claude/skills/production-monitoring/SKILL.md
Use when the Production Engineer is assessing production health, setting up monitoring, defining health checks, evaluating SLA compliance, or responding to incidents. Activates when discussing production readiness, system health, alerting, or incident response.
npx skillsauth add dsivov/ai_development_team production-monitoringInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Apply this guidance when:
Every service should expose a health endpoint:
GET /health
Response:
{
"status": "healthy" | "degraded" | "unhealthy",
"version": "1.2.3",
"timestamp": "2026-02-28T10:00:00Z",
"checks": {
"database": "healthy",
"cache": "healthy",
"external_api": "degraded"
}
}
| Level | What to Check | Frequency | |-------|--------------|-----------| | Liveness | Process is running | Every 10s | | Readiness | Can handle requests | Every 30s | | Deep health | All dependencies ok | Every 60s |
| SLI | Target | Measurement | |-----|--------|-------------| | Availability | 99.9% | Successful responses / total | | Latency | p95 < 200ms | Response time percentiles | | Error rate | < 0.1% | Error responses / total | | Throughput | > N req/s | Requests per second |
After merging to main and deploying:
healthy| Severity | Condition | Response | |----------|-----------|----------| | Critical | Service down, data loss risk | Page on-call, immediate action | | Warning | Degraded performance, elevated errors | Investigate within 30 min | | Info | Anomaly detected, approaching threshold | Review during business hours |
After any production incident, document in reports/INCIDENT_<YYYYMMDD>_<NNN>.md:
# Incident Report — <date>
## Summary
<One-line description of what happened>
## Timeline
- HH:MM — Issue detected
- HH:MM — Investigation started
- HH:MM — Root cause identified
- HH:MM — Fix applied
- HH:MM — Verified resolved
## Impact
- Duration: X minutes
- Users affected: N
- Data impact: None / Describe
## Root Cause
<What caused the issue>
## Resolution
<How it was fixed>
## Prevention
<What changes will prevent recurrence>
development
Use when the Integrator is writing unit tests, e2e tests, designing test strategies, improving test coverage, creating test fixtures, or mocking dependencies. Activates for any testing-related work including TDD, test refactoring, or test debugging.
development
Use when the Architect is breaking down change requests into implementable tasks, defining acceptance criteria, estimating task size, mapping dependencies, or creating technical sub-tasks for Developer and Integrator.
development
Use when the Architect is designing system architecture, choosing technology stacks, defining data models, designing APIs, making scalability decisions, or updating ARCHITECTURE.md. Activates for any architecture design, technology evaluation, or system structure discussion.
documentation
Use when the Manager is writing status updates, daily reports, queue messages to team members, escalation notices, or cross-role coordination messages. Activates when composing any team communication, reports, or documentation updates.