observability-planner/SKILL.md
Defines metrics, events, dashboards, alerts, and SLOs to monitor production systems. Use after Gate 2 or with release-manager to ensure production observability.
Install this skill globally with one command: `npx skillsauth add agile-v/agile_v_skills observability-planner`. Works with Claude Code, Cursor, and Windsurf.
You operate after Gate 2 (or in parallel with release-manager). Goal: Production Intelligence.
Requirements are continuously validated in production. Every metric maps to REQ-XXXX. Incidents feed CR-XXXX for next cycle.
Position: Stage 5 (Acceptance) → RELEASE → OPERATE (You)
Checkpoint Type: Auto (monitoring) + Human-Verify (thresholds) + Human-Action (incidents)
Rule: Every metric must cite REQ-XXXX. No REQ = debugging metric (not a requirement) OR missing requirement (return to requirement-architect).
# Observability Plan
## MET-XXXX: [Metric Name]
**Type:** Counter/Gauge/Histogram · **REQ:** REQ-XXXX · **Description:** [What measured]
**Unit:** req/s, ms, bytes, % · **Labels:** [endpoint, status, user_id] · **Source:** [app middleware, DB driver, business logic]
**Baseline:** [Normal range: p50=150ms, p95=300ms] · **Threshold:** [p95 >500ms for 5 min → Alert]
**Collection:** [Prometheus, CloudWatch, Datadog] · **Retention:** [90 days]
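
The traceability rule (every metric cites a REQ-XXXX, otherwise it is a debugging metric or a missing requirement) lends itself to an automated check. Below is a minimal Python sketch of one. It assumes metric definitions are loaded into a hypothetical `MetricDef` record and that requirement IDs have four digits, as in REQ-0015. The class and function names are illustrative and not part of this skill.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumption: requirement IDs look like REQ-0015 (four digits).
REQ_PATTERN = re.compile(r"^REQ-\d{4}$")


@dataclass
class MetricDef:
    """Hypothetical in-memory form of one MET-XXXX entry."""
    met_id: str
    name: str
    metric_type: str            # "Counter", "Gauge", or "Histogram"
    req: Optional[str] = None   # cited requirement, if any


def is_traced(metric: MetricDef) -> bool:
    """True when the metric cites a well-formed requirement ID."""
    return bool(metric.req) and REQ_PATTERN.match(metric.req) is not None


def untraced_metrics(metrics: List[MetricDef]) -> List[str]:
    """IDs of metrics that need review: debugging metric or missing requirement."""
    return [m.met_id for m in metrics if not is_traced(m)]
```

Anything returned by `untraced_metrics` still needs a human decision: label it as a debugging metric, or send the gap back to requirement-architect.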
## Event Schema (Structured Logs)
{
"timestamp": "ISO8601", "level": "ERROR", "event": "checkout_failure",
"req_id": "REQ-XXXX", "user_id": "...", "trace_id": "...",
"error_code": "PAYMENT_TIMEOUT", "context": {...}
}
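
As a rough illustration of emitting an event in this shape, the Python sketch below builds one record and serializes it to a single JSON log line. The function names `build_event` and `to_log_line` are hypothetical, and the choice of a UTC ISO 8601 timestamp is an assumption. Only the field names come from the schema above.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_event(level: str, event: str, req_id: str, user_id: str,
                trace_id: str, error_code: str,
                context: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Assemble one structured log record with the schema's fields."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601, UTC
        "level": level,
        "event": event,
        "req_id": req_id,        # ties the event back to a requirement
        "user_id": user_id,
        "trace_id": trace_id,
        "error_code": error_code,
        "context": context or {},
    }


def to_log_line(record: Dict[str, Any]) -> str:
    """Serialize to one line of JSON so log shippers can parse it."""
    return json.dumps(record, sort_keys=True)
```

Keeping `req_id` as a top-level field means incident queries can filter log lines by requirement without parsing `context`.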
Common Metrics (examples):
Dashboard Categories:
Example Panel (Requirement Validation Dashboard):
### Panel: REQ-0015 (Dashboard Load ≤3s)
**Metric:** MET-0001 · **Query:** `histogram_quantile(0.95, rate(http_duration_bucket{endpoint="/dashboard"}[5m]))`
**Threshold:** ≤3s · **Viz:** Time series, 24h · **Status:** Green ≤3s, Red >3s
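
To make the panel query less opaque, here is a simplified Python sketch of how a p95 can be estimated from cumulative histogram buckets by linear interpolation, which is roughly what `histogram_quantile` does. The bucket boundaries and counts are invented for illustration, and the real function handles more edge cases than this.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least one finite bucket plus a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # Quantile falls beyond the last finite bucket: report its bound.
                return buckets[-2][0]
            # Interpolate linearly inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan  # unreachable for well-formed input


# Illustrative data: seconds -> cumulative request count.
example = [(1.0, 50), (2.0, 90), (3.0, 98), (5.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, example)
meets_req_0015 = p95 <= 3.0  # REQ-0015: dashboard load <= 3s
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries. A boundary at the requirement threshold itself (3s here) keeps the pass/fail decision from resting on interpolation.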
## ALR-XXXX: [Alert Name]
**Metric:** MET-XXXX · **REQ:** REQ-XXXX · **Condition:** [PromQL or equivalent]
**Threshold:** [When to fire] · **Duration:** [5 minutes sustained] · **Severity:** CRITICAL/HIGH/MEDIUM/LOW
**Notification:** [PagerDuty, Slack, Email] · **Runbook:** [/runbooks/alert-name.md]
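
The Threshold and Duration fields combine into a "sustained breach" condition: the alert fires only if the metric stays past the threshold for the whole duration. The Python sketch below shows that logic over timestamped samples. The `Sample` type and `should_fire` function are hypothetical, and real alerting backends evaluate this on their own schedule.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Sample:
    ts: float     # seconds since an arbitrary epoch
    value: float  # metric value at that time, e.g. p95 latency in ms


def should_fire(samples: Iterable[Sample], threshold: float, duration_s: float) -> bool:
    """True if the most recent run of samples above `threshold`
    has lasted at least `duration_s` seconds."""
    breach_start: Optional[float] = None
    last_ts: Optional[float] = None
    for s in sorted(samples, key=lambda x: x.ts):
        if s.value > threshold:
            if breach_start is None:
                breach_start = s.ts   # a new breach run begins
        else:
            breach_start = None       # any recovery resets the clock
        last_ts = s.ts
    return breach_start is not None and (last_ts - breach_start) >= duration_s
```

With one sample per minute, a p95 that stays above 500 ms from minute 1 through minute 6 fires a "p95 >500ms for 5 min" alert. A single dip below the threshold in between resets the timer.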
Examples:
Alert Severity:

| Severity | Impact | Response Time | Notification |
|---|---|---|---|
| CRITICAL | Service down, data loss, SLO violation | Immediate 24/7 | PagerDuty |
| HIGH | Degraded perf, REQ violation, user-facing | <1h business hours | Slack + Email |
| MEDIUM | Non-critical degradation, anomaly | <4h | Slack |
| LOW | Informational, capacity planning | Next day | Email digest |
## SLO-XXXX: [Service Level Objective]
**REQ:** REQ-XXXX · **Metric:** MET-XXXX · **Objective:** [99.9% requests succeed over 28 days]
**Measurement Window:** [Rolling 28 days] · **Error Budget:** [0.1% error rate = ~40 min downtime per 28-day window]
**Calculation:** `1 - (sum(increase(errors[28d])) / sum(increase(total[28d])))`
**Budget Policy:**
- 50% consumed: Alert engineering (informational)
- 75% consumed: Pause non-critical features, focus reliability
- 100% consumed: Stop feature work, incident declared, root cause required
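
The error budget arithmetic and the budget policy tiers can be worked through in a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, consumption is measured as observed error ratio divided by allowed error ratio, and the "No action" tier below 50% is implied rather than stated in the policy.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Downtime allowed by the SLO over the window, in minutes."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_consumed(errors: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    if total == 0:
        return 0.0
    allowed_ratio = 1.0 - slo_target
    return (errors / total) / allowed_ratio


def policy_action(consumed: float) -> str:
    """Map budget consumption to the tiers of the budget policy."""
    if consumed >= 1.0:
        return "Stop feature work, incident declared, root cause required"
    if consumed >= 0.75:
        return "Pause non-critical features, focus reliability"
    if consumed >= 0.5:
        return "Alert engineering (informational)"
    return "No action"  # assumption: not stated in the policy


# 99.9% over 28 days: 0.001 * 28 * 24 * 60 = 40.32 minutes of budget.
budget = error_budget_minutes(0.999, 28)
# 600 failed requests out of 1,000,000 uses about 60% of that budget.
used = budget_consumed(600, 1_000_000, 0.999)
action = policy_action(used)
```

Floating-point rounding can put a value that is exactly on a tier boundary on either side, so a production check would likely compare with a small tolerance or use integer error counts.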
Examples:
## INC-XXXX: [Title]
**Severity:** CRITICAL/HIGH · **Detected:** [Date/Time] (ALR-XXXX) · **Resolved:** [Date/Time] · **Duration:** [15 min]
**Impact:** [Checkout unavailable, 500 users affected]
**Root Cause:** [N+1 query caused DB timeout]
**REQ Violation:** REQ-0018 (Query <100ms) · **Why Missed:** [No query count test in TC-XXXX]
**Resolution:** [Rollback to prev version; fixed N+1 in hotfix]
**Follow-Up:**
- CAPA-XXXX: Add query count test (prevent recurrence)
- CR-XXXX: Update REQ-0018: specify max query count per request
- RISK-XXXX: Update RISK_REGISTER (DB scaling risk)
Feed into CR-XXXX: If incident reveals REQ gap or ambiguity → create CR → requirement-architect → Gate 1 approval → next cycle
For each alert, provide runbook (stored in project /runbooks/):
# Runbook: High Error Rate (ALR-0001)
## Symptom: 5xx rate >1% for >5 min
## Impact: REQ-0020 violation, service degraded
## Triage: 1) Check dashboard · 2) Identify endpoints (topk query) · 3) Recent deploy? · 4) Upstream services? · 5) Check logs
## Mitigation: Rollback (if recent deploy) · Failover (if dependency down) · Scale DB (if overload)
## Resolution: Execute mitigation · Verify error rate <1% · Monitor 15 min · Notify stakeholders
## Post-Incident: Log INC-XXXX, CAPA-XXXX, CR-XXXX · Post-mortem 48h
Before rollout:
Release Manager includes in pre-release checklist: "Monitoring & Alerting configured (observability-planner sign-off)"
At any time, produce:
`/runbooks/*.md` (per alert)

All stored in `.agile-v/` for traceability.