skills/grafana-observability/SKILL.md
# Skill: Grafana Observability > Connects to Grafana to pull real traffic baselines, existing dashboards, and alert rules for services in the ADLC pipeline. Validates PRD traffic estimates against production reality. Provisions dashboards and alert rules from Build Brief throughput SLOs. Long-term, feeds live metrics into failure detection, capacity planning, and anomaly alerting across all services. --- ## Why This Exists Without real observability data, the ADLC pipeline operates on estima
npx skillsauth add bigeasyfreeman/adlc skills/grafana-observabilityInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Connects to Grafana to pull real traffic baselines, existing dashboards, and alert rules for services in the ADLC pipeline. Validates PRD traffic estimates against production reality. Provisions dashboards and alert rules from Build Brief throughput SLOs. Long-term, feeds live metrics into failure detection, capacity planning, and anomaly alerting across all services.
Without real observability data, the ADLC pipeline operates on estimates:
This skill bridges the gap between "what we think traffic looks like" and "what Grafana actually shows." It runs at two points in the pipeline:
Long-term, this skill enables:
| When | What | Mode |
|------|------|------|
| ADLC Planning Phase (after Codebase Research) | Pull traffic baselines and existing observability state for affected services | baseline |
| ADLC Phase 6 (SLOs) | Validate PRD throughput estimates against Grafana reality | validate |
| ADLC Deploy Phase (post-pipeline, parallel with Runbook) | Provision dashboards and alert rules from Build Brief throughput SLOs | provision |
| On-demand (periodic) | Refresh traffic baselines for a service to update anomaly detection thresholds | refresh |
| On-demand (investigation) | Pull current metrics for a service during incident response or capacity review | query |
{
"grafana_url": "string (Grafana instance URL — e.g., https://grafana.company.com)",
"grafana_api_token": "string (via MCP server auth — Service Account token with Viewer or Editor role)",
"mode": "baseline | validate | provision | refresh | query",
"services": [
{
"name": "string (service name as it appears in Grafana — e.g., 'api-server', 'widget-service')",
"namespace": "string (optional — Kubernetes namespace or environment label)",
"datasource": "string (optional — specific Grafana datasource to query, e.g., 'Prometheus', 'CloudWatch')"
}
],
"prd_traffic_estimates": {
"launch_rps": "number (optional — from PRD Traffic & Load Expectations)",
"steady_state_rps": "number (optional)",
"peak_rps": "number (optional)",
"polling_interval_seconds": "number (optional)",
"traffic_pattern": "string (optional — steady | bursty | time-of-day | event-driven)"
},
"build_brief_slos": {
"throughput_slos": "array (optional — from Build Brief Section 6, for provision mode)",
"failure_modes": "array (optional — traffic-based failure modes from Section 4)"
},
"time_range": {
"from": "string (ISO 8601 — default: 7 days ago)",
"to": "string (ISO 8601 — default: now)"
},
"output_config": {
"dashboard_folder": "string (optional — Grafana folder for provisioned dashboards)",
"alert_notification_channel": "string (optional — Slack channel or PagerDuty service for alerts)",
"alert_format": "grafana_native | prometheus_rules | both"
}
}
baseline{
"service_baselines": [
{
"service": "string",
"time_range": {"from": "ISO date", "to": "ISO date"},
"traffic_profile": {
"avg_rps": "number",
"p50_rps": "number",
"p95_rps": "number",
"p99_rps": "number",
"peak_rps": "number",
"peak_time": "ISO date",
"min_rps": "number",
"min_time": "ISO date",
"daily_pattern": {
"description": "string (e.g., 'traffic peaks 9am-5pm EST, drops 80% overnight')",
"hourly_averages": [
{"hour": 0, "avg_rps": "number"},
{"hour": 1, "avg_rps": "number"}
]
},
"weekly_pattern": {
"description": "string (e.g., 'weekday traffic 3x weekend')",
"daily_averages": [
{"day": "monday", "avg_rps": "number"},
{"day": "tuesday", "avg_rps": "number"}
]
},
"traffic_classification": "steady | bursty | time-of-day | event-driven"
},
"latency_profile": {
"p50_ms": "number",
"p95_ms": "number",
"p99_ms": "number",
"max_ms": "number"
},
"error_profile": {
"error_rate_percent": "number",
"dominant_error_codes": [
{"code": "number (e.g., 500, 429, 503)", "percentage": "number"}
]
},
"existing_dashboards": [
{
"uid": "string",
"title": "string",
"url": "string",
"panels_relevant_to_feature": ["string (panel titles)"]
}
],
"existing_alerts": [
{
"uid": "string",
"title": "string",
"condition": "string",
"state": "ok | alerting | pending | no_data",
"notification_channels": ["string"]
}
]
}
],
"generated_at": "ISO date",
"grafana_instance": "string"
}
validate{
"validation_results": [
{
"service": "string",
"prd_estimate": {
"field": "string (e.g., 'steady_state_rps')",
"value": "number",
"basis": "string (from PRD)"
},
"grafana_actual": {
"metric": "string",
"value": "number",
"time_range": "string"
},
"verdict": "aligned | optimistic | pessimistic | no_data",
"deviation_percent": "number",
"recommendation": "string (e.g., 'PRD estimates 200 RPS but Grafana shows current baseline at 450 RPS — adjust upward')"
}
],
"overall_assessment": "string",
"missing_baselines": [
{
"service": "string",
"reason": "no_dashboard | no_metrics | service_not_found",
"recommendation": "string"
}
]
}
provision{
"provisioned_dashboards": [
{
"uid": "string",
"title": "string",
"url": "string",
"folder": "string",
"panels": [
{
"title": "string",
"type": "graph | stat | gauge | alert-list | table",
"metric": "string (PromQL / CloudWatch / etc.)",
"description": "string"
}
]
}
],
"provisioned_alerts": [
{
"uid": "string",
"title": "string",
"condition": "string",
"severity": "P0 | P1 | P2 | P3",
"notification_channels": ["string"],
"runbook_link": "string (Confluence runbook page URL)",
"dashboard_link": "string (panel URL for context)"
}
],
"summary": "string"
}
refreshSame as baseline output but includes a changes_since_last section:
{
"changes_since_last": {
"last_baseline_date": "ISO date",
"traffic_trend": "increasing | stable | decreasing",
"percent_change": "number",
"new_anomalies": ["string (description of new patterns)"],
"alert_adjustments_recommended": [
{
"alert": "string (alert title)",
"current_threshold": "string",
"recommended_threshold": "string",
"reason": "string"
}
]
}
}
query{
"query_results": [
{
"service": "string",
"metrics": {
"current_rps": "number",
"current_latency_p99_ms": "number",
"current_error_rate_percent": "number",
"current_queue_depth": "number (if applicable)"
},
"status": "healthy | degraded | critical",
"active_alerts": [
{
"title": "string",
"state": "alerting | pending",
"since": "ISO date",
"dashboard_url": "string"
}
],
"recent_anomalies": [
{
"description": "string",
"detected_at": "ISO date",
"severity": "string"
}
]
}
]
}
baseline — Pull Traffic BaselinesFor each service specified:
GET /api/search?query=[service_name])GET /api/v1/provisioning/alert-rules)rate(http_requests_total{service="[name]"}[5m])AWS/ApplicationELB metrics with TargetGroup dimensionsum:trace.http.request.hits{service:[name]}.as_rate()histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="[name]"}[5m]))rate(http_requests_total{service="[name]",status=~"5.."}[5m]) / rate(http_requests_total{service="[name]"}[5m])validate and provision modesGrafana API calls used:
GET /api/search?query={service_name}&type=dash-db
GET /api/dashboards/uid/{uid}
GET /api/v1/provisioning/alert-rules
GET /api/ds/query (with datasource-specific queries)
GET /api/datasources
validate — Check PRD Estimates Against RealityFor each PRD traffic estimate:
baseline mode)| PRD Field | Grafana Comparison | Verdict Logic |
|-----------|-------------------|---------------|
| launch_rps | Current avg_rps for the service | If PRD < 50% of current → "pessimistic"; if PRD > 200% of current → "optimistic" |
| steady_state_rps | 7-day avg_rps | If within ±30% → "aligned"; otherwise flag deviation |
| peak_rps | 7-day p99_rps | If PRD < actual p99 → "pessimistic — service already exceeds this" |
| polling_interval | Request cadence patterns | Check if interval matches observed request patterns |
| traffic_pattern | Daily/weekly pattern classification | Compare classification |
provision — Create Dashboards & AlertsFrom Build Brief Section 6 (Throughput SLOs) and Section 4 (traffic-based failure modes):
Step 1: Generate Dashboard
Create a Grafana dashboard via the Dashboard API (POST /api/dashboards/db) with these panels:
| Panel | Type | Metric | Purpose |
|-------|------|--------|---------|
| Request Rate (RPS) | Time series | rate(http_requests_total{service="[name]"}[5m]) | Primary traffic signal |
| Request Rate by Status Code | Time series (stacked) | rate(http_requests_total{service="[name]"}[5m]) by status_code | Error breakdown |
| Latency Percentiles | Time series | histogram_quantile([0.5,0.95,0.99], ...) | Latency SLO tracking |
| Error Rate (%) | Stat | 5xx / total × 100 | Error SLO tracking |
| Throughput vs Baseline | Time series (with threshold bands) | Current RPS overlaid with ±threshold from baseline | Anomaly visualization |
| Queue Depth (if applicable) | Time series | Queue-specific metric | Backpressure signal |
| Traffic Anomaly Detection | Alert list | Alerts from traffic-based rules | Quick anomaly view |
| SLO Burn Rate | Gauge | Error budget consumption rate | SLO health |
Step 2: Generate Alert Rules
Create alert rules via Grafana Alerting API (POST /api/v1/provisioning/alert-rules) from the Build Brief:
| Build Brief Source | Alert Rule |
|-------------------|------------|
| Steady-state RPS SLO | avg(rate(http_requests_total[5m])) < [threshold] for > 5 min → P1 |
| Peak RPS capacity SLO | avg(rate(http_requests_total[1m])) > [threshold] for > 1 min → P2 |
| Traffic anomaly detection | abs(current_rps - baseline_rps) / baseline_rps > [threshold] → P2 |
| Zero-traffic detection | rate(http_requests_total[1m]) == 0 during business hours → P0 |
| RPS drop > 50% | rate(http_requests_total[5m]) < 0.5 * baseline_avg for > 2 min → P1 |
| RPS spike > 3x | rate(http_requests_total[1m]) > 3 * baseline_avg for > 1 min → P1 |
| Gradual RPS decline | Hourly comparison showing consistent 10%+ decline → P2 |
Each alert rule includes:
runbook_link annotation pointing to the Confluence runbookdashboard_link annotation pointing to the relevant dashboard paneloutput_configStep 3: Link to Incident Runbook
Output dashboard URLs and alert rule UIDs so the Incident Runbook Skill can embed direct links:
refresh — Update Baselinesbaseline modeTrigger for automated refresh: Can be scheduled (weekly) or triggered when a deploy changes the service's behavior. The Build Brief's CI/CD pipeline can include a post-deploy step that refreshes baselines.
query — Real-Time Service StatusUsed during incident response or on-demand investigation:
The skill auto-detects the Grafana datasource type and adapts queries:
| Datasource Type | Detection | Query Adaptation |
|----------------|-----------|-----------------|
| Prometheus | type: "prometheus" in datasource config | PromQL queries |
| CloudWatch | type: "cloudwatch" | CloudWatch Metrics Insights queries |
| InfluxDB | type: "influxdb" | Flux or InfluxQL queries |
| Elasticsearch | type: "elasticsearch" | Lucene queries with date histogram |
| Datadog | type: "datadog" (via plugin) | Datadog metric queries |
| Google Cloud Monitoring | type: "stackdriver" | MQL queries |
If the datasource is unknown or custom, the skill:
http_requests_total, request_duration_seconds)grafana_baselines{
"name": "grafana_baselines",
"description": "Pull traffic baselines, latency profiles, and existing observability state from Grafana for specified services",
"inputSchema": {
"type": "object",
"properties": {
"services": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"namespace": {"type": "string"}
},
"required": ["name"]
},
"description": "Services to pull baselines for"
},
"time_range_days": {
"type": "number",
"default": 7,
"description": "Number of days to look back for baselines"
},
"include_dashboards": {
"type": "boolean",
"default": true,
"description": "Include existing dashboard inventory"
},
"include_alerts": {
"type": "boolean",
"default": true,
"description": "Include existing alert rule inventory"
}
},
"required": ["services"]
}
}
grafana_validate_estimates{
"name": "grafana_validate_estimates",
"description": "Validate PRD traffic estimates against actual Grafana metrics",
"inputSchema": {
"type": "object",
"properties": {
"services": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"namespace": {"type": "string"}
},
"required": ["name"]
}
},
"prd_estimates": {
"type": "object",
"properties": {
"launch_rps": {"type": "number"},
"steady_state_rps": {"type": "number"},
"peak_rps": {"type": "number"},
"polling_interval_seconds": {"type": "number"},
"traffic_pattern": {"type": "string"}
},
"description": "Traffic estimates from the PRD"
}
},
"required": ["services", "prd_estimates"]
}
}
grafana_provision{
"name": "grafana_provision",
"description": "Provision Grafana dashboards and alert rules from Build Brief throughput SLOs and failure modes",
"inputSchema": {
"type": "object",
"properties": {
"build_brief": {
"type": "string",
"description": "Build Brief markdown — Section 6 (SLOs) and Section 4 (failure modes) are consumed"
},
"service_name": {
"type": "string",
"description": "Primary service name for the feature"
},
"dashboard_folder": {
"type": "string",
"default": "ADLC Features",
"description": "Grafana folder to create dashboards in"
},
"notification_channel": {
"type": "string",
"description": "Notification channel for alert rules (Slack channel, PagerDuty service, etc.)"
},
"runbook_url": {
"type": "string",
"description": "Confluence runbook URL to link in alert annotations"
},
"alert_format": {
"type": "string",
"enum": ["grafana_native", "prometheus_rules", "both"],
"default": "grafana_native"
}
},
"required": ["build_brief", "service_name"]
}
}
grafana_query{
"name": "grafana_query",
"description": "Query current metrics and health status for a service from Grafana. Used for incident response, on-demand investigation, or real-time status checks.",
"inputSchema": {
"type": "object",
"properties": {
"service_name": {
"type": "string",
"description": "Service to query"
},
"namespace": {
"type": "string",
"description": "Optional Kubernetes namespace or environment"
},
"time_range_minutes": {
"type": "number",
"default": 5,
"description": "How far back to look (default: 5 minutes for real-time)"
},
"include_alerts": {
"type": "boolean",
"default": true,
"description": "Include active alert states"
},
"compare_to_baseline": {
"type": "boolean",
"default": true,
"description": "Compare current metrics against cached baselines"
}
},
"required": ["service_name"]
}
}
grafana_refresh_baselines{
"name": "grafana_refresh_baselines",
"description": "Refresh cached traffic baselines and recommend alert threshold adjustments",
"inputSchema": {
"type": "object",
"properties": {
"services": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"namespace": {"type": "string"}
},
"required": ["name"]
}
},
"time_range_days": {
"type": "number",
"default": 7,
"description": "Number of days for new baseline window"
},
"auto_adjust_alerts": {
"type": "boolean",
"default": false,
"description": "If true, automatically update alert thresholds (requires Editor role). If false, only recommend adjustments."
}
},
"required": ["services"]
}
}
# Pull traffic baselines for a service
adlc-grafana baselines --service api-server --days 7
# Pull baselines for multiple services
adlc-grafana baselines --services api-server,worker,auth-service --days 14
# Validate PRD traffic estimates against Grafana
adlc-grafana validate --service api-server --prd ./prd.md
# Provision dashboards and alerts from Build Brief
adlc-grafana provision --brief ./build-brief.md --service widget-service --folder "ADLC Features" --notify "#eng-alerts"
# Query real-time service status
adlc-grafana query --service api-server --minutes 5
# Refresh baselines and check for alert drift
adlc-grafana refresh --service api-server --days 7
# Refresh and auto-adjust alert thresholds
adlc-grafana refresh --service api-server --days 7 --auto-adjust
The skill requires a Grafana Service Account token with appropriate permissions:
| Mode | Minimum Role | Why |
|------|-------------|-----|
| baseline | Viewer | Read dashboards, datasources, and query metrics |
| validate | Viewer | Read metrics for comparison |
| provision | Editor | Create dashboards and alert rules |
| refresh (read-only) | Viewer | Read metrics for comparison |
| refresh (auto-adjust) | Editor | Update alert rule thresholds |
| query | Viewer | Read current metrics and alert states |
Setup:
GRAFANA_API_TOKEN)| Consumer | What They Use | How |
|----------|-------------|-----|
| Build Brief Agent (Phase 6) | baseline output → pre-fills throughput SLOs with real data | Traffic baselines replace PRD estimates where available |
| Build Brief Agent (Phase 4) | baseline output → informs traffic-based failure detection thresholds | Real baselines make anomaly detection accurate |
| Eval Council | validate output → validates that throughput SLOs are realistic | Flags SLOs that conflict with Grafana reality |
| Incident Runbook Skill | provision output → dashboard URLs and alert rule links | Runbook includes direct links to Grafana panels |
| Slack Orchestration Skill | query output → real-time service status during incidents | Posts Grafana status to Slack channels |
| CI/CD Pipeline Skill | provision output → post-deploy baseline refresh trigger | Pipeline triggers refresh after deploy |
| Engineer (directly) | All modes → traffic context for architecture and capacity decisions | Engineer sees real data, not guesses |
These capabilities are not in v1 but the skill architecture supports them:
refresh mode on a recurring basis (e.g., weekly)runbook_link and dashboard_link annotationscontract_version and compatibility enforcement.development
Orchestration skill: chains the full ADLC Build Loop. PRD → Brief → Council → Scaffold → Codegen → LDD → TDD → Council → PR. Use when implementing a new feature end-to-end.
development
# Skill: Helm & ArgoCD Deployment > Validates Helm charts and generates ArgoCD Application manifests when the ADLC pipeline produces infrastructure or service code. Ensures every deployable artifact has correct chart structure, environment-specific values, and a GitOps-ready Application manifest before code review. --- ## Why This Exists Without deployment validation in the pipeline, common failures slip through to production: - **Helm charts fail `helm template`** because of missing values,
testing
Decide whether an intersecting verifier actually exercises the semantic change.
development
# Skill: UX Flow Builder > Generates user flow diagrams (Mermaid) from PRD personas and screen specifications. Surfaces dead ends, missing screens, and disconnected flows before design or engineering starts. Helps PMs think in screens, not features. --- ## Trigger - Automatically during PRD Phase 4 (Personas & Flows) to visualize the user journey - On-demand when the PM says "show me the flow" or "map the user journey" - During PRD evaluation to verify screen connectivity --- ## Input ```