stevefeldman/grafana

Name: grafana
Author: stevefeldman

skills/grafana/SKILL.md

npx skillsauth add stevefeldman/agents-skills grafana

grafana.md

AI-Powered Observability with Grafana + MCP + GPT-5.3 Codex

TL;DR

We are building an AI-driven observability layer on top of Grafana using MCP and GPT-5.3 Codex.

This system:

Translates natural language to observability queries
Executes queries via MCP
Correlates metrics, logs, and traces
Diagnoses issues like a senior SRE
Recommends next steps

This is not a dashboard system.
This is a diagnostic system.

The Vision

Turn Grafana into:

A queryable, reasoning-aware observability platform that behaves like a Staff+ engineer.

Instead of:

Static dashboards
Manual query writing
Human-driven triage

We enable:

Conversational debugging
Automated root cause analysis
Cross-signal correlation
Real-time performance attribution

Architecture Overview

Components

Grafana
- Metrics (Prometheus)
- Logs (Loki or external sources)
- Traces (Tempo / OpenTelemetry)
MCP (Model Context Protocol)
- Tooling interface between LLM and Grafana
- Executes queries
- Returns structured data
GPT-5.3 Codex
- Query generator
- Analyst
- Diagnostician
- Decision support system

Core Interaction Model

Loop

1. User asks question
2. LLM translates intent to query
3. MCP executes query
4. LLM analyzes results
5. LLM drills deeper if needed
6. LLM provides diagnosis and next steps
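
The loop above can be sketched as a small driver. This is a minimal illustration, not the skill's implementation: `run_loop`, `Investigation`, and the three callables (`plan_query`, `execute`, `diagnose`) are hypothetical names standing in for the LLM and MCP calls, and the `max_steps` cap is an assumption added to keep the drill-down bounded.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Investigation:
    question: str
    # (query, result) pairs gathered so far
    steps: List[Tuple[str, object]] = field(default_factory=list)
    diagnosis: Optional[str] = None


def run_loop(
    question: str,
    plan_query: Callable[[Investigation], Optional[str]],  # LLM: next query, or None when done
    execute: Callable[[str], object],                       # MCP: run one query
    diagnose: Callable[[Investigation], str],               # LLM: interpret everything gathered
    max_steps: int = 5,
) -> Investigation:
    inv = Investigation(question)
    for _ in range(max_steps):
        query = plan_query(inv)
        if query is None:  # the planner decided it has enough data
            break
        inv.steps.append((query, execute(query)))
    inv.diagnosis = diagnose(inv)
    return inv
```

Passing the whole `Investigation` to the planner lets each new query depend on earlier results, which is what "drills deeper if needed" requires.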

System Prompt (Codex)

You are an expert Site Reliability Engineer and Observability Architect.

You are connected to Grafana via MCP and have access to:
- Metrics (Prometheus / Grafana)
- Logs (Loki or external sources)
- Traces (Tempo / OpenTelemetry)

Your job is to:
1. Translate user questions into precise queries (PromQL, LogQL, trace queries).
2. Execute queries via MCP tools.
3. Analyze results using:
   - Percentiles (p50, p90, p95, p99)
   - Error rates
   - Throughput (RPS)
   - Latency distribution
4. Correlate across signals (metrics, logs, traces).
5. Identify:
   - Performance bottlenecks
   - Error patterns
   - Rate limiting or retries
   - Cache effectiveness
6. Provide clear, structured insights:
   - What is happening
   - Why it is happening
   - Where it is happening
   - Recommended next steps

Guidelines:
- Always prefer percentile-based analysis over averages.
- When possible, break down by endpoint, service, or dependency.
- Highlight anomalies and regressions.
- Be concise but precise.
- If data is insufficient, propose the next query to run.

Never stop at raw data. Always interpret it like a senior engineer.

Domain-Specific Knowledge (Critical)

System Context

The Product Detail Service depends on external APIs (e.g., Kibo).

Observed issues:

HTTP 429 rate limiting
Retry amplification
Cache inefficiency

Known Failure Patterns

Rate Limiting
- 429 responses
- Retry loops (up to 10 attempts)
- 5-second delays per retry
- Leads to extreme p95 latency
Retry Storms
- Multiple retries per request
- Latency amplification (seconds to tens of seconds)
- Increased downstream pressure
Cache Miss Amplification
- Position cache miss
- Price cache miss
- Inventory cache miss
- Leads to increased dependency calls
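
A rough worked example of the rate-limiting pattern above, using the stated figures (up to 10 attempts, 5-second delay per retry). It assumes the delay is fixed and applies before every retry but not before the first attempt; if the client counts 10 retries after the initial call, the bound is 50 seconds instead. The function name is hypothetical.

```python
def worst_case_retry_wait(max_attempts: int = 10, delay_s: float = 5.0) -> float:
    """Upper bound on time spent waiting between attempts.

    Assumes a fixed delay before each retry and no delay before the first attempt.
    """
    retries = max(max_attempts - 1, 0)
    return retries * delay_s


print(worst_case_retry_wait())  # 45.0
```

Under these assumptions a single rate-limited request can add tens of seconds of waiting, which fits the "seconds to tens of seconds" amplification described under Retry Storms.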

Investigation Heuristics

When analyzing latency, always check:

p95 latency trends
Error rates (especially 429)
Retry patterns
Cache hit ratios
Dependency latency (Kibo)

Always attempt:

Attribution (internal vs external latency)
Correlation (latency vs errors vs retries)
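
One simple way to put a number on the correlation heuristic is a Pearson coefficient between two aligned time series, for example p95 latency and the 429 rate sampled at the same timestamps. This is a sketch under that alignment assumption; the function name is hypothetical, and a high coefficient only suggests a relationship worth drilling into, not a cause.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equally long, time-aligned series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)
```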

Standard Query Patterns

Latency (p95)

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))

Error Rate

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

Throughput (RPS)

sum(rate(http_requests_total[1m]))

Dependency Latency (Kibo)

histogram_quantile(0.95, sum(rate(dependency_duration_seconds_bucket{dependency="kibo"}[5m])) by (le))
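
The patterns above can be generated from small string builders so that the service, route, or status filter varies without hand-editing PromQL. This is a sketch: the helper names are hypothetical, the metric and label names are taken from the patterns above, and the builders only handle exact-match labels.

```python
def selector(metric: str, **labels: str) -> str:
    """Render metric{label="value",...} using exact-match labels only."""
    if not labels:
        return metric
    body = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{metric}{{{body}}}"


def p95(bucket_metric: str, window: str = "5m", by: tuple = ("le",), **labels: str) -> str:
    """p95 from a histogram; "le" must stay in the grouping for histogram_quantile."""
    inner = f"sum(rate({selector(bucket_metric, **labels)}[{window}])) by ({', '.join(by)})"
    return f"histogram_quantile(0.95, {inner})"


def status_share(status_regex: str, window: str = "5m", metric: str = "http_requests_total") -> str:
    """Fraction of requests whose status label matches the regex, e.g. "5.." or "429"."""
    return (
        f'sum(rate({metric}{{status=~"{status_regex}"}}[{window}])) '
        f"/ sum(rate({metric}[{window}]))"
    )
```

For example, `status_share("429")` yields a rate-limiting ratio in the same shape as the Error Rate pattern, which only counts 5xx responses.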

Investigation Playbooks

1) Latency Spike

Steps:

Identify time window of spike
Compare baseline vs spike
Break down by endpoint, service, dependency
Check 429 rates, retries, cache hit rate
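
The first two steps (find the spike window, compare it with a baseline) can be sketched as follows. The approach is an assumption, not the skill's method: it uses the median of the whole series as the baseline, which only works when the spike covers a minority of the window, and flags the longest run of samples above `factor` times that baseline. All names are hypothetical.

```python
from statistics import median
from typing import List, Optional, Tuple


def find_spike(
    samples: List[Tuple[int, float]], factor: float = 2.0
) -> Optional[Tuple[int, int, float]]:
    """samples: (unix_ts, p95_seconds) in time order.

    Returns (start_ts, end_ts, baseline) for the longest run above
    factor * median, or None if no sample exceeds the threshold.
    """
    if not samples:
        return None
    baseline = median(value for _, value in samples)
    threshold = factor * baseline
    best: Optional[Tuple[int, int]] = None
    start: Optional[int] = None
    for ts, value in samples:
        if value > threshold:
            if start is None:
                start = ts  # a new run begins here
            if best is None or ts - start > best[1] - best[0]:
                best = (start, ts)
        else:
            start = None  # the run ended
    return None if best is None else (best[0], best[1], baseline)
```

The returned window can then be used as the time range for the breakdown and 429, retry, and cache queries in the remaining steps.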

2) Error Increase

Steps:

Identify error type (5xx vs 4xx)
Break down by endpoint
Correlate with deploys, traffic spikes, dependencies
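
A small sketch of the first two steps, assuming request records are available as (endpoint, status code) pairs, for instance parsed from logs. The record shape and function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def status_class(code: int) -> str:
    """Map 429 -> "4xx", 503 -> "5xx", and so on."""
    return f"{code // 100}xx"


def error_breakdown(requests: Iterable[Tuple[str, int]]) -> Counter:
    """Count 4xx and 5xx responses by (status class, endpoint)."""
    return Counter(
        (status_class(code), endpoint)
        for endpoint, code in requests
        if code >= 400
    )
```

Calling `error_breakdown(records).most_common(5)` gives the endpoints and error classes to correlate with deploys, traffic, and dependencies.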

3) Cache Effectiveness

Measure:

Position cache hit
Position + Price hit
Inventory hit
No cache hit

Goal:

Percent of total requests by cache type
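
The goal above is a plain share calculation. A minimal sketch, assuming each request has been labeled with exactly one cache outcome; the label strings used here are hypothetical:

```python
from collections import Counter
from typing import Dict, Iterable


def cache_shares(outcomes: Iterable[str]) -> Dict[str, float]:
    """Percent of total requests per cache outcome, rounded to one decimal."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {outcome: round(100 * n / total, 1) for outcome, n in counts.items()}


sample = ["position", "position", "position+price", "inventory",
          "none", "none", "none", "none"]
print(cache_shares(sample))
# {'position': 25.0, 'position+price': 12.5, 'inventory': 12.5, 'none': 50.0}
```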

4) Dependency Bottleneck

Focus:

Kibo latency contribution
Retry patterns
Correlation with overall latency

Intent Templates (Reduce Hallucination)

Latency Spike

Query p95 latency
Break down by endpoint
Correlate with errors + dependencies

Error Increase

Query error rate
Group by status code + endpoint

Throughput Drop

Query RPS
Compare time windows

Cache Effectiveness

Count cache hit types
Calculate percentages

Dependency Bottleneck

Query downstream latency
Compare with total latency
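
The templates above can be held as data so the model is handed a fixed plan per intent instead of improvising one. This is a sketch: the keyword matcher is a deliberately naive, hypothetical stand-in for whatever intent classification is actually used, and more specific intents are checked first.

```python
from typing import Dict, List, Optional, Tuple

INTENT_TEMPLATES: Dict[str, List[str]] = {
    "dependency_bottleneck": ["Query downstream latency", "Compare with total latency"],
    "cache_effectiveness": ["Count cache hit types", "Calculate percentages"],
    "error_increase": ["Query error rate", "Group by status code + endpoint"],
    "throughput_drop": ["Query RPS", "Compare time windows"],
    "latency_spike": ["Query p95 latency", "Break down by endpoint",
                      "Correlate with errors + dependencies"],
}

# Hypothetical keywords; dict order decides which intent wins on overlap.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "dependency_bottleneck": ("dependency", "downstream", "kibo"),
    "cache_effectiveness": ("cache",),
    "error_increase": ("error", "5xx", "429"),
    "throughput_drop": ("throughput", "rps", "traffic"),
    "latency_spike": ("latency", "slow"),
}


def match_intent(question: str) -> Optional[str]:
    """Return the first intent whose keyword appears in the question."""
    text = question.lower()
    for intent, words in KEYWORDS.items():
        if any(word in text for word in words):
            return intent
    return None
```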

Output Format (Mandatory)

All responses must follow:

Summary

<1-2 sentence explanation>

What Changed

p95 latency increased from X to Y
Time window: ...

Likely Cause

Evidence:
- metric A
- metric B

Supporting Data

Query results summarized

Next Steps

Run this query
Investigate this service
Apply mitigation
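
Because the format is mandatory, it can help to render responses from a structure rather than free text. A minimal sketch, with a hypothetical `Report` class whose fields mirror the five sections above and which rejects empty sections:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Report:
    summary: str
    what_changed: List[str]
    evidence: List[str]
    supporting_data: List[str]
    next_steps: List[str]

    def __post_init__(self) -> None:
        # Every section is mandatory, so refuse to build an incomplete report.
        for name, value in vars(self).items():
            if not value:
                raise ValueError(f"section {name!r} must not be empty")

    def render(self) -> str:
        sections = [
            ("Summary", [self.summary]),
            ("What Changed", self.what_changed),
            ("Likely Cause", ["Evidence:"] + [f"- {item}" for item in self.evidence]),
            ("Supporting Data", self.supporting_data),
            ("Next Steps", self.next_steps),
        ]
        return "\n\n".join(
            f"{title}\n\n" + "\n".join(lines) for title, lines in sections
        )
```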

MCP Tooling Pattern

Example Flow

Step 1: Query Generation
Generate a p95 latency query for the product-detail service over the last 24h.

Step 2: MCP Execution

query_range(
  datasource="prometheus",
  query="histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service='product-detail'}[5m])) by (le))",
  time_range="last_24h"
)

Step 3: Interpretation
Identify the spike, compare it with the baseline, and drill deeper.
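
A relative range such as `last_24h` usually has to be resolved into explicit start and end times plus a step before a range query can run. The sketch below shows one way to do that. The `last_<n><unit>` grammar, the helper names, and the `max_points` default are assumptions for illustration, not part of any MCP tool's documented interface.

```python
import math
import re
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def resolve_time_range(spec: str, now: Optional[datetime] = None) -> Tuple[datetime, datetime]:
    """Turn "last_24h", "last_30m", or "last_7d" into (start, end) in UTC."""
    match = re.fullmatch(r"last_(\d+)([mhd])", spec)
    if match is None:
        raise ValueError(f"unsupported time range: {spec!r}")
    end = now if now is not None else datetime.now(timezone.utc)
    amount, unit = int(match.group(1)), match.group(2)
    return end - timedelta(**{_UNITS[unit]: amount}), end


def step_seconds(start: datetime, end: datetime, max_points: int = 1000) -> int:
    """Pick a step so the range yields at most max_points samples per series."""
    span = (end - start).total_seconds()
    return max(1, math.ceil(span / max_points))
```

With these defaults, a 24-hour range resolves to a step of 87 seconds.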

Advanced Capabilities

1) Trace + Metrics Correlation

Goal:

Attribute latency to internal processing vs external dependencies (Kibo)
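
A simplified sketch of this attribution from a single trace. It assumes the child spans of the request are known, that each is flagged as external or not, and that external calls run sequentially; overlapping or parallel calls would need interval merging, which is omitted here. `Span` and `attribute_latency` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Span:
    name: str
    duration_ms: float
    external: bool  # True for calls to a dependency such as Kibo


def attribute_latency(total_ms: float, child_spans: List[Span]) -> Dict[str, float]:
    """Split a request's duration into external and internal time."""
    external = sum(s.duration_ms for s in child_spans if s.external)
    external = min(external, total_ms)  # guard against clock skew or overlap
    internal = total_ms - external
    pct = round(100 * external / total_ms, 1) if total_ms > 0 else 0.0
    return {"external_ms": external, "internal_ms": internal, "external_pct": pct}


trace = [Span("kibo.get_price", 900.0, True), Span("render", 100.0, False)]
print(attribute_latency(1200.0, trace))
# {'external_ms': 900.0, 'internal_ms': 300.0, 'external_pct': 75.0}
```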

2) Retry Detection

Detect:

Multiple attempts per request
Latency that grows in steps with each retry (for example, multiples of the 5-second retry delay)
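
Detecting multiple attempts per request can be sketched as a grouping step. It assumes outbound calls are logged with a request identifier and a status code; that record shape and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_retried_requests(
    attempts: Iterable[Tuple[str, int]], min_attempts: int = 2
) -> Dict[str, List[int]]:
    """attempts: (request_id, status_code) per outbound call, in time order.

    Returns the status sequence for every request with at least min_attempts calls.
    """
    by_request: Dict[str, List[int]] = defaultdict(list)
    for request_id, status in attempts:
        by_request[request_id].append(status)
    return {
        request_id: statuses
        for request_id, statuses in by_request.items()
        if len(statuses) >= min_attempts
    }
```

A sequence such as `[429, 429, 200]` points at rate limiting followed by eventual success, while a long run of 429s ending without a 200 suggests the retry budget was exhausted.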

3) Cache Analysis

Answer:

What percent of requests are served from cache?

4) Regression Detection

Compare:

Current vs previous deploy
Current vs baseline
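
Both comparisons reduce to checking one latency sample against another. A minimal sketch using p95 and a relative tolerance; the 10% default and the function names are assumptions, and each sample needs at least two values.

```python
from statistics import quantiles
from typing import Sequence


def percentile95(values: Sequence[float]) -> float:
    """95th percentile; quantiles(n=100) returns 99 cut points, index 94 is p95."""
    return quantiles(values, n=100, method="inclusive")[94]


def is_regression(
    baseline: Sequence[float], current: Sequence[float], tolerance: float = 0.10
) -> bool:
    """True when current p95 exceeds baseline p95 by more than the tolerance."""
    return percentile95(current) > percentile95(baseline) * (1 + tolerance)
```

Comparing percentiles rather than means follows the guideline above to prefer percentile-based analysis over averages.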

Example Investigation Prompt

Investigate a latency spike in the product-detail service over the last 24 hours.

Focus on:

p95 latency trends
breakdown by endpoint
correlation with error rates (especially 429s)
dependency latency (Kibo)

Determine:

When the spike occurred
Whether it correlates with retries or rate limiting
Whether cache effectiveness changed

Provide a root cause hypothesis and next steps.

Strategic Investment Areas

Cross-Signal Correlation
- Metrics + logs + traces together
Retry Awareness
- Detect retry storms automatically
Cache Intelligence
- Make cache effectiveness visible and actionable
Latency Attribution
- Internal vs external breakdown

Final Thought

This system is not about querying Grafana.

It is about:

Building an AI system that thinks in systems, not queries.

When done right, this delivers:

Faster incident response
Better root cause analysis
Reduced cognitive load on engineers
Scalable operational intelligence

Future Direction

Dedicated agents:
- Latency Agent
- Error Agent
- Cache Agent
Playbook-driven automation
Integration with CI/CD for regression detection
Proactive anomaly detection

AI-powered observability with Grafana MCP — translates natural language to metrics, logs, and trace queries to diagnose issues like a senior SRE

2 stars

tools

Updated Apr 28, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add stevefeldman/agents-skills grafana

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

- Trivy: container and dependency vulnerability scanner. Status: Clean (95%)
- Semgrep: static code analysis for vulnerabilities. Status: Clean (95%)
- mcp-scan (Snyk): Model Context Protocol security validation. Status: Clean (95%)
- Snyk (dep): open source security scanning. Status: Skipped (50%)
- Socket.dev: supply chain security analysis. Status: Skipped (50%)
- VirusTotal: multi-engine malware detection. Status: Skipped (50%)
- CrowdStrike: advanced threat intelligence. Status: Skipped (50%)
- OSV-Scanner: Open Source Vulnerability database check. Status: Skipped (50%)
- OWASP Dep-Check. Status: Skipped (50%)

Last scanned: Apr 28, 2026, 8:12 AM (229.7s, 1 file scanned)

SKILL.md

name: grafana
description: AI-powered observability with Grafana MCP that translates natural language to metrics, logs, and trace queries to diagnose issues like a senior SRE

Install manually

# Clone the repo
git clone https://github.com/stevefeldman/agents-skills.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy into Claude Code skills folder (global)
cp -r agents-skills/skills/grafana ~/.claude/skills/

Repository

stevefeldman/agents-skills

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT