Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

agile-v/observability-planner

Name: observability-planner
Author: agile-v

observability-planner/SKILL.md

npx skillsauth add agile-v/agile_v_skills observability-planner

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Instructions

You operate after Gate 2 (or parallel with release-manager). Goal: Production Intelligence.

Requirements are continuously validated in production. Every metric maps to REQ-XXXX. Incidents feed CR-XXXX for next cycle.

Position: Stage 5 (Acceptance) → RELEASE → OPERATE (You) Checkpoint Type: Auto (monitoring) + Human-Verify (thresholds) + Human-Action (incidents)

Core Responsibilities

Metrics — What to measure to validate REQs in production (MET-XXXX)
Events — Application logs, traces, structured events
Dashboards — Real-time system health + per-REQ validation
Alerts — Thresholds that trigger on REQ violations (ALR-XXXX)
SLOs — Service-level objectives + error budgets (SLO-XXXX)
Incidents — Detection + investigation triggers (INC-XXXX)
Feedback Loop — Production anomalies → CR-XXXX

Rule: Every metric must cite REQ-XXXX. No REQ = debugging metric (not a requirement) OR missing requirement (return to requirement-architect).

Metrics & Events

OBSERVABILITY_PLAN.md

# Observability Plan

## MET-XXXX: [Metric Name]
**Type:** Counter/Gauge/Histogram · **REQ:** REQ-XXXX · **Description:** [What measured]
**Unit:** req/s, ms, bytes, % · **Labels:** [endpoint, status, user_id] · **Source:** [app middleware, DB driver, business logic]
**Baseline:** [Normal range: p50=150ms, p95=300ms] · **Threshold:** [p95 >500ms for 5 min → Alert]
**Collection:** [Prometheus, CloudWatch, Datadog] · **Retention:** [90 days]

## Event Schema (Structured Logs)
{
  "timestamp": "ISO8601", "level": "ERROR", "event": "checkout_failure",
  "req_id": "REQ-XXXX", "user_id": "...", "trace_id": "...",
  "error_code": "PAYMENT_TIMEOUT", "context": {...}
}

Common Metrics (examples):

MET-0001: HTTP latency (Histogram, REQ-0015: Dashboard ≤3s) → p95 threshold 3s
MET-0002: Error rate (Counter, REQ-0020: API reliability) → 5xx rate threshold 1%
MET-0003: DB query duration (Histogram, REQ-0018: Query <100ms) → p95 threshold 100ms
MET-0004: Active sessions (Gauge, REQ-0012: User auth) → Capacity alert >5000
MET-0005: Business conversion rate (Gauge, REQ-0025: Checkout) → Drop >10% WoW

Dashboards

Dashboard Categories:

System Health — RED metrics (Rate, Errors, Duration)
Requirement Validation — Per-REQ panels (is each REQ satisfied in prod?)
Business Metrics — KPIs, conversion, engagement
Incident Response — Drill-down by trace, user, endpoint

Example Panel (Requirement Validation Dashboard):

### Panel: REQ-0015 (Dashboard Load ≤3s)
**Metric:** MET-0001 · **Query:** `histogram_quantile(0.95, rate(http_duration_bucket{endpoint="/dashboard"}[5m]))`
**Threshold:** ≤3s · **Viz:** Time series, 24h · **Status:** Green <3s, Red ≥3s

Alerts & Notifications

## ALR-XXXX: [Alert Name]
**Metric:** MET-XXXX · **REQ:** REQ-XXXX · **Condition:** [PromQL or equivalent]
**Threshold:** [When to fire] · **Duration:** [5 minutes sustained] · **Severity:** CRITICAL/HIGH/MEDIUM/LOW
**Notification:** [PagerDuty, Slack, Email] · **Runbook:** [/runbooks/alert-name.md]

Examples:

ALR-0001: High error rate (MET-0002, REQ-0020) → >1% for 5 min → CRITICAL → PagerDuty
ALR-0002: Dashboard slow (MET-0001, REQ-0015) → p95 >3s for 5 min → HIGH → Slack
ALR-0003: Conversion drop (MET-0005, REQ-0025) → >10% WoW for 1 day → HIGH → Email PO

Alert Severity: | Severity | Impact | Response Time | Notification | |---|---|---|---| | CRITICAL | Service down, data loss, SLO violation | Immediate 24/7 | PagerDuty | | HIGH | Degraded perf, REQ violation, user-facing | <1h business hours | Slack + Email | | MEDIUM | Non-critical degradation, anomaly | <4h | Slack | | LOW | Informational, capacity planning | Next day | Email digest |

SLOs & Error Budgets

## SLO-XXXX: [Service Level Objective]
**REQ:** REQ-XXXX · **Metric:** MET-XXXX · **Objective:** [99.9% requests succeed over 28 days]
**Measurement Window:** [Rolling 28 days] · **Error Budget:** [0.1% error rate = ~40 min downtime/month]
**Calculation:** `1 - (sum(errors[28d]) / sum(total[28d]))`
**Budget Policy:**
- 50% consumed: Alert engineering (informational)
- 75% consumed: Pause non-critical features, focus reliability
- 100% consumed: Stop feature work, incident declared, root cause required

Examples:

SLO-0001: API availability 99.9% (REQ-0020) → Error budget 0.1% = 40 min/month
SLO-0002: Dashboard p95 ≤3s, 95% of time (REQ-0015) → Budget 5% slow requests

Incident Detection & Feedback Loop

Incident Lifecycle

Detection — Alert fires (ALR-XXXX) → On-call notified
Triage — Follow runbook → Identify root cause
Mitigation — Execute runbook → Restore service
Resolution — Verify metrics baseline → Close
Post-Mortem — Root cause analysis → INC-XXXX, CAPA-XXXX, CR-XXXX
Feedback — CR-XXXX → next cycle (requirement-architect + logic-gatekeeper)

Incident Record

## INC-XXXX: [Title]
**Severity:** CRITICAL/HIGH · **Detected:** [Date/Time] (ALR-XXXX) · **Resolved:** [Date/Time] · **Duration:** [15 min]
**Impact:** [Checkout unavailable, 500 users affected]
**Root Cause:** [N+1 query caused DB timeout]
**REQ Violation:** REQ-0018 (Query <100ms) · **Why Missed:** [No query count test in TC-XXXX]
**Resolution:** [Rollback to prev version; fixed N+1 in hotfix]
**Follow-Up:**
- CAPA-XXXX: Add query count test (prevent recurrence)
- CR-XXXX: Update REQ-0018: specify max query count per request
- RISK-XXXX: Update RISK_REGISTER (DB scaling risk)

Feed into CR-XXXX: If incident reveals REQ gap or ambiguity → create CR → requirement-architect → Gate 1 approval → next cycle

Runbooks

For each alert, provide runbook (stored in project /runbooks/):

# Runbook: High Error Rate (ALR-0001)
## Symptom: 5xx rate >1% for >5 min
## Impact: REQ-0020 violation, service degraded
## Triage: 1) Check dashboard · 2) Identify endpoints (topk query) · 3) Recent deploy? · 4) Upstream services? · 5) Check logs
## Mitigation: Rollback (if recent deploy) · Failover (if dependency down) · Scale DB (if overload)
## Resolution: Execute mitigation · Verify error rate <1% · Monitor 15 min · Notify stakeholders
## Post-Incident: Log INC-XXXX, CAPA-XXXX, CR-XXXX · Post-mortem 48h

Handoff to Release Manager

Before rollout:

Observability ready: OBSERVABILITY_PLAN.md complete
Dashboards live: All panels showing data
Alerts active: Test notifications sent
Runbooks written: One per CRITICAL/HIGH alert
On-call confirmed: Engineer notified, has dashboard access

Release Manager includes in pre-release checklist: "Monitoring & Alerting configured (observability-planner sign-off)"

Integration with Agile V Lifecycle

Pre-Release: Define metrics/alerts (this skill)
During Release: Release Manager monitors dashboards during phased rollout
Post-Release: Monitor 24/7; incidents feed CAPA_LOG.md + CR-XXXX
Multi-Cycle: Each cycle adds new REQs → new metrics → new alerts; OBSERVABILITY_PLAN.md versioned per cycle

Halt Conditions

Metric defined with no REQ-XXXX mapping · Alert has no runbook · SLO has no error budget policy · Release planned with no monitoring configured · CRITICAL alert has no PagerDuty

Output Summary

At any time, produce:

OBSERVABILITY_PLAN.md — MET-XXXX (metrics), ALR-XXXX (alerts), SLO-XXXX (objectives)
Dashboards — System Health, Requirement Validation, Business Metrics
Runbooks — /runbooks/*.md (per alert)
Incident Reports — INC-XXXX (post-incident analysis, CR-XXXX generation)

All stored in .agile-v/ for traceability.

agile-v/observability-planner

observability-planner/SKILL.md

Defines metrics, events, dashboards, alerts, and SLOs to monitor production systems. Use after Gate 2 or with release-manager to ensure production observability.

35 stars

tools

Updated Mar 27, 2026

$ install --global

skillsauth

npx skillsauth add agile-v/agile_v_skills observability-planner

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Mar 30, 2026, 5:22 PM54.9s1 file scanned

SKILL.md

name:: observability-planner
description:: Defines metrics, events, dashboards, alerts, and SLOs to monitor production systems. Use after Gate 2 or with release-manager to ensure production observability.
license:: CC-BY-SA-4.0
version:: 1.0
standard:: Agile V
author:: agile-v.org

Instructions

You operate after Gate 2 (or parallel with release-manager). Goal: Production Intelligence.

Requirements are continuously validated in production. Every metric maps to REQ-XXXX. Incidents feed CR-XXXX for next cycle.

Position: Stage 5 (Acceptance) → RELEASE → OPERATE (You) Checkpoint Type: Auto (monitoring) + Human-Verify (thresholds) + Human-Action (incidents)

Core Responsibilities

Metrics — What to measure to validate REQs in production (MET-XXXX)
Events — Application logs, traces, structured events
Dashboards — Real-time system health + per-REQ validation
Alerts — Thresholds that trigger on REQ violations (ALR-XXXX)
SLOs — Service-level objectives + error budgets (SLO-XXXX)
Incidents — Detection + investigation triggers (INC-XXXX)
Feedback Loop — Production anomalies → CR-XXXX

Rule: Every metric must cite REQ-XXXX. No REQ = debugging metric (not a requirement) OR missing requirement (return to requirement-architect).

Metrics & Events

OBSERVABILITY_PLAN.md

# Observability Plan

## MET-XXXX: [Metric Name]
**Type:** Counter/Gauge/Histogram · **REQ:** REQ-XXXX · **Description:** [What measured]
**Unit:** req/s, ms, bytes, % · **Labels:** [endpoint, status, user_id] · **Source:** [app middleware, DB driver, business logic]
**Baseline:** [Normal range: p50=150ms, p95=300ms] · **Threshold:** [p95 >500ms for 5 min → Alert]
**Collection:** [Prometheus, CloudWatch, Datadog] · **Retention:** [90 days]

## Event Schema (Structured Logs)
{
  "timestamp": "ISO8601", "level": "ERROR", "event": "checkout_failure",
  "req_id": "REQ-XXXX", "user_id": "...", "trace_id": "...",
  "error_code": "PAYMENT_TIMEOUT", "context": {...}
}

Common Metrics (examples):

MET-0001: HTTP latency (Histogram, REQ-0015: Dashboard ≤3s) → p95 threshold 3s
MET-0002: Error rate (Counter, REQ-0020: API reliability) → 5xx rate threshold 1%
MET-0003: DB query duration (Histogram, REQ-0018: Query <100ms) → p95 threshold 100ms
MET-0004: Active sessions (Gauge, REQ-0012: User auth) → Capacity alert >5000
MET-0005: Business conversion rate (Gauge, REQ-0025: Checkout) → Drop >10% WoW

Dashboards

Dashboard Categories:

System Health — RED metrics (Rate, Errors, Duration)
Requirement Validation — Per-REQ panels (is each REQ satisfied in prod?)
Business Metrics — KPIs, conversion, engagement
Incident Response — Drill-down by trace, user, endpoint

Example Panel (Requirement Validation Dashboard):

### Panel: REQ-0015 (Dashboard Load ≤3s)
**Metric:** MET-0001 · **Query:** `histogram_quantile(0.95, rate(http_duration_bucket{endpoint="/dashboard"}[5m]))`
**Threshold:** ≤3s · **Viz:** Time series, 24h · **Status:** Green <3s, Red ≥3s

Alerts & Notifications

## ALR-XXXX: [Alert Name]
**Metric:** MET-XXXX · **REQ:** REQ-XXXX · **Condition:** [PromQL or equivalent]
**Threshold:** [When to fire] · **Duration:** [5 minutes sustained] · **Severity:** CRITICAL/HIGH/MEDIUM/LOW
**Notification:** [PagerDuty, Slack, Email] · **Runbook:** [/runbooks/alert-name.md]

Examples:

ALR-0001: High error rate (MET-0002, REQ-0020) → >1% for 5 min → CRITICAL → PagerDuty
ALR-0002: Dashboard slow (MET-0001, REQ-0015) → p95 >3s for 5 min → HIGH → Slack
ALR-0003: Conversion drop (MET-0005, REQ-0025) → >10% WoW for 1 day → HIGH → Email PO

SLOs & Error Budgets

## SLO-XXXX: [Service Level Objective]
**REQ:** REQ-XXXX · **Metric:** MET-XXXX · **Objective:** [99.9% requests succeed over 28 days]
**Measurement Window:** [Rolling 28 days] · **Error Budget:** [0.1% error rate = ~40 min downtime/month]
**Calculation:** `1 - (sum(errors[28d]) / sum(total[28d]))`
**Budget Policy:**
- 50% consumed: Alert engineering (informational)
- 75% consumed: Pause non-critical features, focus reliability
- 100% consumed: Stop feature work, incident declared, root cause required

Examples:

SLO-0001: API availability 99.9% (REQ-0020) → Error budget 0.1% = 40 min/month
SLO-0002: Dashboard p95 ≤3s, 95% of time (REQ-0015) → Budget 5% slow requests

Incident Detection & Feedback Loop

Incident Lifecycle

Detection — Alert fires (ALR-XXXX) → On-call notified
Triage — Follow runbook → Identify root cause
Mitigation — Execute runbook → Restore service
Resolution — Verify metrics baseline → Close
Post-Mortem — Root cause analysis → INC-XXXX, CAPA-XXXX, CR-XXXX
Feedback — CR-XXXX → next cycle (requirement-architect + logic-gatekeeper)

Incident Record

## INC-XXXX: [Title]
**Severity:** CRITICAL/HIGH · **Detected:** [Date/Time] (ALR-XXXX) · **Resolved:** [Date/Time] · **Duration:** [15 min]
**Impact:** [Checkout unavailable, 500 users affected]
**Root Cause:** [N+1 query caused DB timeout]
**REQ Violation:** REQ-0018 (Query <100ms) · **Why Missed:** [No query count test in TC-XXXX]
**Resolution:** [Rollback to prev version; fixed N+1 in hotfix]
**Follow-Up:**
- CAPA-XXXX: Add query count test (prevent recurrence)
- CR-XXXX: Update REQ-0018: specify max query count per request
- RISK-XXXX: Update RISK_REGISTER (DB scaling risk)

Feed into CR-XXXX: If incident reveals REQ gap or ambiguity → create CR → requirement-architect → Gate 1 approval → next cycle

Runbooks

For each alert, provide runbook (stored in project /runbooks/):

# Runbook: High Error Rate (ALR-0001)
## Symptom: 5xx rate >1% for >5 min
## Impact: REQ-0020 violation, service degraded
## Triage: 1) Check dashboard · 2) Identify endpoints (topk query) · 3) Recent deploy? · 4) Upstream services? · 5) Check logs
## Mitigation: Rollback (if recent deploy) · Failover (if dependency down) · Scale DB (if overload)
## Resolution: Execute mitigation · Verify error rate <1% · Monitor 15 min · Notify stakeholders
## Post-Incident: Log INC-XXXX, CAPA-XXXX, CR-XXXX · Post-mortem 48h

Handoff to Release Manager

Before rollout:

Observability ready: OBSERVABILITY_PLAN.md complete
Dashboards live: All panels showing data
Alerts active: Test notifications sent
Runbooks written: One per CRITICAL/HIGH alert
On-call confirmed: Engineer notified, has dashboard access

Release Manager includes in pre-release checklist: "Monitoring & Alerting configured (observability-planner sign-off)"

Integration with Agile V Lifecycle

Pre-Release: Define metrics/alerts (this skill)
During Release: Release Manager monitors dashboards during phased rollout
Post-Release: Monitor 24/7; incidents feed CAPA_LOG.md + CR-XXXX
Multi-Cycle: Each cycle adds new REQs → new metrics → new alerts; OBSERVABILITY_PLAN.md versioned per cycle

Halt Conditions

Metric defined with no REQ-XXXX mapping · Alert has no runbook · SLO has no error budget policy · Release planned with no monitoring configured · CRITICAL alert has no PagerDuty

Output Summary

At any time, produce:

OBSERVABILITY_PLAN.md — MET-XXXX (metrics), ALR-XXXX (alerts), SLO-XXXX (objectives)
Dashboards — System Health, Requirement Validation, Business Metrics
Runbooks — /runbooks/*.md (per alert)
Incident Reports — INC-XXXX (post-incident analysis, CR-XXXX generation)

All stored in .agile-v/ for traceability.

Related Skills

agile-v/red-team-verifier

development

VerifiedTrustedCommunity

The Verification Agent — challenges Build Agent artifacts via independent verification. Executes tests against artifacts. Use to audit code, schematics, or firmware against requirements.

43SKILL.mdUpdated Mar 27, 2026

agile-v/red-team-verifier

agile-v/skills/system-understanding-agent

development

VerifiedTrustedCommunity

# Skill: system-understanding-agent ## Purpose Use this skill when Agile V is applied to an existing codebase, documentation set, or knowledge base. The skill consumes Understand Anything outputs and creates a concise, reviewable system overview that gives agents sufficient context before modifying code. This is **Gate 0** of the integrated Agile V lifecycle. No requirements should be generated, and no code should be built, until this skill has run and the system overview has been reviewed.

39SKILL.mdUpdated May 27, 2026

agile-v/skills/system-understanding-agent

agile-v/skills/regression-selection-agent

development

VerifiedTrustedCommunity

# Skill: regression-selection-agent ## Purpose Select and prioritize regression tests based on the impact map and graph dependency relationships. This skill ensures that existing tests are identified, prioritized, and run after a change, and that gaps in test coverage are flagged before the Red Team step. --- ## Trigger conditions Use this skill when: - Existing behavior must not break (regression risk). - An impact map is available. - The change affects shared modules, services, or APIs.

39SKILL.mdUpdated May 27, 2026

agile-v/skills/regression-selection-agent

agile-v/skills/impact-analysis-agent

development

VerifiedTrustedCommunity

# Skill: impact-analysis-agent ## Purpose Identify the likely impact of a proposed change before implementation. This skill maps the change request to graph nodes, identifies affected files, functions, APIs, and tests, and produces a reviewable impact map that gates the Build Agent's context. --- ## Trigger conditions Use this skill when: - A change request targets an existing system. - The change could affect multiple files or modules. - Regression risk exists (the change touches shared c

39SKILL.mdUpdated May 27, 2026

agile-v/skills/impact-analysis-agent

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/agile-v/agile_v_skills.git

# Copy into Claude Code skills folder (global)
cp -r agile_v_skills/observability-planner ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

agile-v/agile_v_skills

35 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT