.claude/skills/prd-v08-monitoring-setup/SKILL.md
Define monitoring strategy, metrics collection, and alerting thresholds during PRD v0.8 Deployment & Ops. Triggers on requests to set up monitoring, define alerts, or when user asks "what should we monitor?", "alerting strategy", "observability", "metrics", "SLOs", "dashboards", "monitoring setup". Outputs MON- entries with monitoring rules and alert configurations.
Position in workflow: v0.8 Runbook Creation → v0.8 Monitoring Setup → v0.9 GTM Strategy
This skill requires prior work from v0.8 Runbook Creation and earlier stages:
This skill assumes DEP- and RUN- entries are complete with thresholds, rollback conditions, and incident procedures defined.
This skill creates/updates:
All MON- entries are operational monitoring specifications, not confidence-based. They are:
Example MON- entries:
MON-001: API Request Latency (p95)
Type: Metric
Layer: Application
Owner: Backend Team
Name: api.request.latency.p95
Description: 95th percentile response time for all API endpoints (from API-001–020)
Unit: ms
Source: Application APM (Datadog custom instrumentation)
Aggregation: p95 over 5-minute window
Retention: 90 days
Linked IDs: API-001 to API-020, DEP-004 (baseline from staging)
---
MON-002: High Latency Alert (Warning)
Type: Alert
Layer: Application
Owner: Backend Team
Metric: MON-001 (api.request.latency.p95)
Condition: >500ms (from DEP-002 baseline)
Window: 5 minutes
Severity: Warning
Runbook: RUN-001 (Performance Degradation Investigation)
Notification:
- Channel: Slack #backend-alerts
- Recipients: Backend on-call, team notified during business hours
Silencing: During scheduled maintenance windows (DEP-004 notifications)
Linked IDs: MON-001, RUN-001, DEP-002
---
MON-003: Critical Latency Alert
Type: Alert
Layer: Application
Owner: Backend Team
Metric: MON-001 (api.request.latency.p95)
Condition: >2000ms (SLA breach, from KPI-001 target)
Window: 2 minutes
Severity: Critical
Runbook: RUN-001 (Performance Degradation Investigation)
Notification:
- Channel: PagerDuty (wake on-call)
- Recipients: Backend on-call, Tech Lead, escalate if not acknowledged in 5 min
Silencing: None (critical alerts never silenced)
Linked IDs: MON-001, RUN-001, KPI-001
---
MON-004: API Availability SLO
Type: SLO
Layer: Application
Owner: Platform Team
Objective: API endpoints return non-5xx response
Target: 99.9% uptime (from DEP-002 / KPI-001)
Window: Rolling 30 days
Error Budget: 43.2 minutes/month
Alerting:
- 50% error budget consumed → Warning to engineering (slow-burn alert)
- 75% error budget consumed → Critical, freeze non-essential deploys
- 100% error budget consumed → Post-incident review required (RUN-008 procedure)
Linked IDs: API-001–020, DEP-003 (rollback triggers), RUN-008 (incident review)
---
MON-005: System Health Dashboard
Type: Dashboard
Layer: Infrastructure + Application
Owner: Platform Team
Purpose: Quick health check for on-call engineers (run from RUN-002, RUN-001)
Audience: On-call engineers, engineering leadership, ops team
Panels:
- API Request Rate (last 1h): Should be steady or increasing
- API Latency (p50, p95, p99): Watch for p95/p99 creeping up
- Error Rate by Endpoint: Any 5xx > 0 is concerning
- Active Critical Alerts: Should be none
- Database Connection Pool (from MON-006): Trending toward threshold
- CPU/Memory by Service: Identify resource exhaustion
- Deployment Status: Current version, time of last deploy
Refresh: 30 seconds
Linked IDs: MON-001, MON-002, MON-003, MON-006, DEP-001, RUN-001/002
---
MON-006: Database Connection Pool Utilization
Type: Metric
Layer: Infrastructure
Owner: Database Team
Name: db.connection_pool.utilized_percent
Description: Percentage of available connections in use (from DEP-001 pool size)
Unit: percentage
Source: Database monitoring (RDS Enhanced Monitoring or custom query)
Aggregation: avg over 1-minute window
Retention: 30 days
Linked IDs: DEP-001 (pool config), RUN-001 (incident when >90%)
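The 43.2-minute error budget in MON-004 follows from the 99.9% target over a 30-day window (30 × 24 × 60 minutes × 0.1%). The following is a minimal Python sketch of that arithmetic and of the 50/75/100% alerting steps. The function names and the returned labels are illustrative assumptions, not part of the skill.

```python
def error_budget_minutes(target_pct: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability target over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - target_pct) / 100.0


def budget_status(consumed_minutes: float, budget_minutes: float) -> str:
    """Map error-budget consumption to the MON-004 alerting steps."""
    consumed = consumed_minutes / budget_minutes
    if consumed >= 1.0:
        return "review"    # 100% consumed: post-incident review required
    if consumed >= 0.75:
        return "critical"  # 75% consumed: freeze non-essential deploys
    if consumed >= 0.50:
        return "warning"   # 50% consumed: warn engineering
    return "ok"


budget = error_budget_minutes(99.9)      # about 43.2 minutes
print(round(budget, 1), budget_status(22.0, budget))
```

Changing `target_pct` or `window_days` shows how quickly the budget shrinks as the target tightens.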
Monitoring is not about collecting data—it is about detecting problems before users do. Every metric should answer: "Is this working? If not, what's broken?"
| Layer | What to Measure | Why It Matters |
|-------|-----------------|----------------|
| Infrastructure | CPU, memory, disk, network | System health foundation |
| Application | Latency, errors, throughput | User-facing performance |
| Business | Signups, conversions, revenue | Product health |
| User Experience | Page load, interaction time | Real user impact |
1. Define SLOs (Service Level Objectives)
2. Identify key metrics per layer
3. Set alert thresholds
4. Map alerts to runbooks
5. Design dashboards
6. Create MON- entries with full traceability
MON-XXX: [Monitoring Rule Title]
Type: [Metric | Alert | Dashboard | SLO]
Layer: [Infrastructure | Application | Business | User Experience]
Owner: [Team responsible for this metric/alert]
For Metric Type:
Name: [metric.name.format]
Description: [What this measures]
Unit: [count | ms | percentage | bytes]
Source: [Where this comes from]
Aggregation: [avg | sum | p50 | p95 | p99]
Retention: [How long to keep data]
For Alert Type:
Metric: [MON-YYY or metric name]
Condition: [Threshold expression]
Window: [Time window for evaluation]
Severity: [Critical | Warning | Info]
Runbook: [RUN-XXX to follow when fired]
Notification:
- Channel: [Slack, PagerDuty, Email]
- Recipients: [Team or individuals]
Silencing: [When to suppress, e.g., maintenance windows]
For Dashboard Type:
Purpose: [What questions this answers]
Audience: [Who uses this dashboard]
Panels: [List of visualizations]
Refresh: [How often to update]
For SLO Type:
Objective: [What we promise]
Target: [Percentage, e.g., 99.9%]
Window: [Rolling 30 days]
Error Budget: [How much downtime allowed]
Alerting: [When error budget is at risk]
Linked IDs: [API-XXX, UJ-XXX, KPI-XXX, RUN-XXX related]
Example MON- entries:
MON-001: API Request Latency (p95)
Type: Metric
Layer: Application
Owner: Backend Team
Name: api.request.latency.p95
Description: 95th percentile response time for all API endpoints
Unit: ms
Source: Application APM (Datadog/New Relic)
Aggregation: p95
Retention: 90 days
Linked IDs: API-001 to API-020
MON-002: High Latency Alert
Type: Alert
Layer: Application
Owner: Backend Team
Metric: MON-001 (api.request.latency.p95)
Condition: > 500ms
Window: 5 minutes
Severity: Warning
Runbook: RUN-006 (Performance Degradation Investigation)
Notification:
- Channel: Slack #backend-alerts
- Recipients: Backend on-call
Silencing: During scheduled deployments (DEP-002 windows)
Linked IDs: MON-001, RUN-006, DEP-002
MON-003: Critical Latency Alert
Type: Alert
Layer: Application
Owner: Backend Team
Metric: MON-001 (api.request.latency.p95)
Condition: > 2000ms
Window: 2 minutes
Severity: Critical
Runbook: RUN-006 (Performance Degradation Investigation)
Notification:
- Channel: PagerDuty
- Recipients: Backend on-call, Tech Lead
Silencing: None (always alert on critical)
Linked IDs: MON-001, RUN-006
MON-004: API Availability SLO
Type: SLO
Layer: Application
Owner: Platform Team
Objective: API endpoints return non-5xx response
Target: 99.9%
Window: Rolling 30 days
Error Budget: 43.2 minutes/month
Alerting:
- 50% budget consumed → Warning to engineering
- 75% budget consumed → Critical, freeze non-essential deploys
- 100% budget consumed → Incident review required
Linked IDs: API-001 to API-020, DEP-003
MON-005: System Health Dashboard
Type: Dashboard
Layer: Infrastructure + Application
Owner: Platform Team
Purpose: Quick health check for on-call engineers
Audience: On-call, engineering leadership
Panels:
- API Request Rate (last 1h)
- API Latency (p50, p95, p99)
- Error Rate by Endpoint
- Active Alerts
- Database Connection Pool
- CPU/Memory by Service
Refresh: 30 seconds
Linked IDs: MON-001, MON-002, MON-003
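One way to read the Condition and Window fields of MON-002 and MON-003 is as a sustained breach: the alert fires only when every sample inside the window exceeds the threshold. The sketch below assumes that interpretation (monitoring tools typically offer several evaluation modes, so treat it as illustrative). `Sample` and `alert_fires` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # seconds since epoch
    value_ms: float   # observed p95 latency at that time


def alert_fires(samples: list[Sample], threshold_ms: float,
                window_s: float, now: float) -> bool:
    """True when all samples in the last `window_s` seconds exceed the threshold."""
    recent = [s for s in samples if now - window_s <= s.timestamp <= now]
    # An empty window is treated as "no data", not as a breach (assumption).
    return bool(recent) and all(s.value_ms > threshold_ms for s in recent)


# MON-002 style check: p95 > 500 ms sustained over 5 minutes.
history = [Sample(t, 620.0) for t in range(0, 301, 60)]
print(alert_fires(history, threshold_ms=500, window_s=300, now=300))
```

Requiring every sample to breach trades detection speed for fewer false alarms. A shorter window, as in MON-003, reacts faster at the cost of more noise.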
For each service, measure:
| Metric | What It Measures | Alert Threshold |
|--------|------------------|-----------------|
| Rate | Requests per second | Anomaly detection |
| Errors | Failed requests / total | >1% warning, >5% critical |
| Duration | Request latency (p95, p99) | >500ms warning, >2s critical |
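The Errors and Duration thresholds above can be applied mechanically. The sketch below computes a p95 with the nearest-rank method (one of several percentile definitions, so an APM tool may report slightly different values) and classifies a service using the table's thresholds. The function names are illustrative assumptions.

```python
from __future__ import annotations


def p95_nearest_rank(latencies_ms: list[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def red_severity(error_rate: float, p95_ms: float) -> str:
    """Classify a service from its error rate (0..1) and p95 latency."""
    if error_rate > 0.05 or p95_ms > 2000:
        return "critical"
    if error_rate > 0.01 or p95_ms > 500:
        return "warning"
    return "ok"


latencies = [float(ms) for ms in range(1, 101)]  # 1..100 ms
print(p95_nearest_rank(latencies), red_severity(0.02, 300.0))
```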
For each resource (CPU, memory, disk, network):
| Metric | What It Measures | Alert Threshold |
|--------|------------------|-----------------|
| Utilization | % of capacity used | >80% warning, >95% critical |
| Saturation | Queue depth, waiting | >0 for critical resources |
| Errors | Error count/rate | Any errors = investigate |
| Tier | Availability | Latency (p95) | Use For |
|------|--------------|---------------|---------|
| Tier 1 | 99.99% (52 min/yr) | <100ms | Payment, auth |
| Tier 2 | 99.9% (8.7 hr/yr) | <500ms | Core features |
| Tier 3 | 99% (3.6 days/yr) | <2s | Background jobs |
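As a rough illustration of how the tiers might be used, the sketch below returns the strictest tier whose availability and latency targets a service currently meets. The `TIERS` mapping mirrors the table, and `highest_tier_met` is a hypothetical helper, not something the skill defines.

```python
from __future__ import annotations

# (minimum availability %, maximum p95 latency in ms), strictest tier first.
TIERS = {
    "Tier 1": (99.99, 100.0),
    "Tier 2": (99.9, 500.0),
    "Tier 3": (99.0, 2000.0),
}


def highest_tier_met(availability_pct: float, p95_ms: float) -> str | None:
    """Return the strictest tier whose targets are both satisfied, if any."""
    for name, (min_availability, max_p95) in TIERS.items():
        if availability_pct >= min_availability and p95_ms < max_p95:
            return name
    return None


print(highest_tier_met(99.95, 320.0))
```

A service that is highly available but slow drops to a lower tier because both targets must hold at once.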
| Severity | User Impact | Response Time | Notification |
|----------|-------------|---------------|--------------|
| Critical | Service unusable | <5 min | PagerDuty (wake up) |
| Warning | Degraded experience | <30 min | Slack (business hours) |
| Info | No immediate impact | Next day | Dashboard/log |
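The severity table can be treated as a small routing policy. The sketch below encodes the channel and response deadline per severity and flags alerts that are past their deadline without acknowledgement. Reading "Next day" as 24 hours is an assumption, and `RESPONSE_POLICY` and `response_overdue` are hypothetical names.

```python
from datetime import datetime, timedelta

# Severity -> (notification channel, response deadline), from the table above.
RESPONSE_POLICY = {
    "Critical": ("PagerDuty", timedelta(minutes=5)),
    "Warning": ("Slack", timedelta(minutes=30)),
    "Info": ("Dashboard/log", timedelta(days=1)),  # "Next day" read as 24 h
}


def response_overdue(severity: str, fired_at: datetime, now: datetime,
                     acknowledged: bool) -> bool:
    """True when an unacknowledged alert has passed its response deadline."""
    if acknowledged:
        return False
    _, deadline = RESPONSE_POLICY[severity]
    return now - fired_at > deadline


fired = datetime(2024, 1, 1, 14, 32)
print(response_overdue("Critical", fired, fired + timedelta(minutes=6), False))
```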
| Principle | Implementation |
|-----------|----------------|
| Answer questions | Each panel answers "Is X working?" |
| Hierarchy | Overview → Service → Component |
| Context | Show thresholds, comparisons |
| Actionable | Link to runbooks from alerts |
| Fast | Quick load, auto-refresh |
| Pattern | Signal | Fix |
|---------|--------|-----|
| Alert fatigue | Too many alerts, team ignores | Tune thresholds, remove noise |
| No runbook link | Alert fires, no one knows what to do | Every alert → RUN- |
| Vanity metrics | "1 million requests!" without context | Focus on user-impacting metrics |
| Missing baselines | No historical comparison | Establish baselines before launch |
| Over-monitoring | 500 metrics, can't find signal | Focus on RED/USE fundamentals |
| Under-monitoring | "We'll add monitoring later" | Monitoring ships with code |
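The "No runbook link" anti-pattern is easy to check automatically. The sketch below assumes MON- entries have already been parsed into dictionaries keyed by the template's field names plus an `id` key. That parsed shape and the function name are assumptions made for illustration.

```python
from __future__ import annotations

import re

RUN_ID = re.compile(r"\bRUN-\d{3}\b")


def alerts_missing_runbook(entries: list[dict]) -> list[str]:
    """Return IDs of Alert entries whose Runbook field has no RUN-XXX reference."""
    missing = []
    for entry in entries:
        if entry.get("Type") != "Alert":
            continue  # only alerts are required to link a runbook
        if not RUN_ID.search(entry.get("Runbook", "")):
            missing.append(entry["id"])
    return missing


entries = [
    {"id": "MON-001", "Type": "Metric"},
    {"id": "MON-002", "Type": "Alert", "Runbook": "RUN-006 (Performance Degradation Investigation)"},
    {"id": "MON-007", "Type": "Alert", "Runbook": "TBD"},
]
print(alerts_missing_runbook(entries))
```

A check like this could run before the stage gate so that every alert reaches v0.9 with a runbook attached.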
Before proceeding to v0.9 GTM Strategy:
| Consumer | What It Uses | Example |
|----------|--------------|---------|
| On-Call Team | MON- alerts trigger response | MON-003 → page engineer |
| v0.9 Launch Metrics | MON- provides baseline data | MON-001 baseline → KPI-010 target |
| Post-Mortems | MON- data for incident analysis | "MON-005 showed spike at 14:32" |
| Capacity Planning | MON- trends inform scaling | USE metrics → infrastructure planning |
| DEP- Rollback | MON- thresholds trigger rollback | MON-002 breach → DEP-003 rollback |
- references/monitoring-stack.md
- assets/mon-template.md
- references/slo-guide.md
- references/dashboard-guide.md