Grafana Observability

MCP Server

| Property | Value |
|----------|-------|
| Source | grafana/mcp-grafana |
| Transport | stdio (default), SSE, or streamable-http |
| Language | Go (runs via uvx mcp-grafana) |
| Tools | 75+ (dashboards, Prometheus, Loki, alerting, incidents, OnCall, annotations, admin) |
| Auth | Service account token (preferred) or username/password |
| Requires | Grafana 9.0+, service account with Editor role or granular RBAC |

How to Run

# stdio mode (default — used by NetClaw)
uvx mcp-grafana

# Read-only mode (prevents dashboard/alert modifications)
uvx mcp-grafana --disable-write

Environment Variables

| Variable | Required | Example | Description |
|----------|----------|---------|-------------|
| GRAFANA_URL | Yes | http://grafana.example.com:3000 | Grafana instance URL |
| GRAFANA_SERVICE_ACCOUNT_TOKEN | Yes* | glsa_abc123... | Service account token (preferred auth) |
| GRAFANA_USERNAME | Alt | admin | Basic auth username (alternative to token) |
| GRAFANA_PASSWORD | Alt | changeme | Basic auth password |
| GRAFANA_ORG_ID | No | 1 | Organization ID for multi-org setups |

*Either a service account token or a username/password pair is required.
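The "token or username/password" rule can be checked before the server is launched. The following is a minimal sketch, not part of mcp-grafana: the function name `check_grafana_env` is hypothetical, and the numeric check on GRAFANA_ORG_ID is an assumption based on the example value `1` in the table.

```python
from typing import Mapping


def check_grafana_env(env: Mapping[str, str]) -> list[str]:
    """Return a list of configuration problems; an empty list means the settings look usable."""
    problems = []

    # GRAFANA_URL is always required.
    if not env.get("GRAFANA_URL"):
        problems.append("GRAFANA_URL is not set")

    # Either a service account token or a complete username/password pair is enough.
    has_token = bool(env.get("GRAFANA_SERVICE_ACCOUNT_TOKEN"))
    has_basic = bool(env.get("GRAFANA_USERNAME")) and bool(env.get("GRAFANA_PASSWORD"))
    if not (has_token or has_basic):
        problems.append(
            "set GRAFANA_SERVICE_ACCOUNT_TOKEN, or both GRAFANA_USERNAME and GRAFANA_PASSWORD"
        )

    # Assumption: the organization ID is a plain integer, as in the example value "1".
    org_id = env.get("GRAFANA_ORG_ID")
    if org_id and not org_id.isdigit():
        problems.append("GRAFANA_ORG_ID should be a numeric organization ID")

    return problems
```

Calling `check_grafana_env(os.environ)` (after `import os`) before starting the server gives a readable list of what is missing instead of a later 401 from Grafana.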

Key Tool Categories

Dashboard Operations

| Tool | What It Does |
|------|-------------|
| search_dashboards | Find dashboards by title or metadata |
| get_dashboard_summary | Lightweight overview (context-efficient; use this first) |
| get_dashboard_by_uid | Full dashboard JSON (large; use sparingly) |
| get_dashboard_property | Extract specific fields via JSONPath |
| get_dashboard_panel_queries | Extract panel query details |
| update_dashboard | Create or modify dashboards |
| patch_dashboard | Targeted modifications without full JSON replacement |

Prometheus (PromQL)

| Tool | What It Does |
|------|-------------|
| query_prometheus | Execute instant or range PromQL queries |
| list_prometheus_metric_names | Discover available metrics |
| list_prometheus_label_names | List labels matching selectors |
| list_prometheus_label_values | Retrieve values for a specific label |
| query_prometheus_histogram | Calculate percentiles (p50, p90, p95, p99) |
| list_prometheus_metric_metadata | Metric type, help text, unit |

Loki (LogQL)

| Tool | What It Does |
|------|-------------|
| query_loki_logs | Execute LogQL queries against log streams |
| list_loki_label_names | Discover available log labels |
| list_loki_label_values | List values for a specific log label |
| query_loki_stats | Stream statistics (volume, rate) |
| query_loki_patterns | Detect log structure patterns |

Alerting

| Tool | What It Does |
|------|-------------|
| list_alert_rules | View all Grafana and datasource-managed alert rules |
| get_alert_rule_by_uid | Retrieve specific alert rule details |
| create_alert_rule | Create new alert rule |
| update_alert_rule | Modify existing alert rule |
| delete_alert_rule | Remove alert rule |
| list_contact_points | View notification endpoints (email, Slack, PagerDuty, etc.) |

Incident Management

| Tool | What It Does |
|------|-------------|
| list_incidents | View Grafana Incidents with filtering |
| get_incident | Single incident details |
| create_incident | Create a new incident |
| add_activity_to_incident | Add timeline entry to incident |

OnCall

| Tool | What It Does |
|------|-------------|
| list_oncall_schedules | View on-call rotation schedules |
| get_oncall_shift | Shift details |
| get_current_oncall_users | Who is on call right now |
| list_alert_groups | OnCall alert groups with filtering |

Annotations & Rendering

| Tool | What It Does |
|------|-------------|
| get_annotations | Query annotations with time/tag filters |
| create_annotation | Add annotation to dashboard/panel |
| get_panel_image | Render a panel or dashboard as PNG image |
| generate_deeplink | Create accurate Grafana URLs for sharing |

Investigation (Sift)

| Tool | What It Does |
|------|-------------|
| list_sift_investigations | List automated investigations |
| get_sift_investigation | Investigation details |
| find_error_pattern_logs | Detect elevated error patterns in logs |
| find_slow_requests | Identify slow requests via Tempo traces |

Workflow: Network Infrastructure Monitoring

When checking network device metrics in Grafana:

1. Find dashboards: search_dashboards with a keyword (e.g., "network", "interface", "BGP")
2. Dashboard overview: get_dashboard_summary for the panel list without full JSON
3. Query metrics: query_prometheus with PromQL for specific metrics:
   - Interface traffic: rate(ifHCInOctets{instance="router1"}[5m]) * 8
   - BGP peer state: bgp_peer_state{peer="10.1.1.2"}
   - CPU utilization: device_cpu_utilization{device="core-rtr-01"}
   - Interface errors: increase(ifInErrors{device=~".*"}[1h])
4. Check alerts: list_alert_rules to see active alerting thresholds
5. Search logs: query_loki_logs for syslog or SNMP trap data
6. Report: Metrics summary with alert status and log correlation
7. GAIT: Record all queries in audit trail
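The interface-traffic expression in the query step follows a fixed pattern: a label selector, a `rate()` over a window, and a multiplication by 8 to turn octets per second into bits per second. A small sketch of building such expressions is shown here; the helper names `promql_selector` and `bits_per_second` are hypothetical, and the label names are only the ones used in the examples above, which may differ in your exporter.

```python
def promql_selector(metric: str, labels: dict[str, str]) -> str:
    """Build a selector such as ifHCInOctets{device="core-rtr-01"}."""

    def escape(value: str) -> str:
        # PromQL label values are double-quoted strings; escape backslashes and quotes.
        return value.replace("\\", "\\\\").replace('"', '\\"')

    pairs = ",".join(f'{name}="{escape(value)}"' for name, value in sorted(labels.items()))
    return f"{metric}{{{pairs}}}" if pairs else metric


def bits_per_second(octet_counter: str, labels: dict[str, str], window: str = "5m") -> str:
    """rate() yields octets per second; multiplying by 8 converts that to bits per second."""
    return f"rate({promql_selector(octet_counter, labels)}[{window}]) * 8"
```

For example, `bits_per_second("ifHCInOctets", {"device": "core-rtr-01"})` produces the string `rate(ifHCInOctets{device="core-rtr-01"}[5m]) * 8`, which can then be passed as the PromQL expression to query_prometheus.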

Example: Interface Utilization Check

search_dashboards(title="Network Interfaces")
get_dashboard_summary(uid="abc123")
query_prometheus(expr="rate(ifHCInOctets{device='core-rtr-01'}[5m]) * 8", time_range="1h")
query_prometheus(expr="rate(ifHCOutOctets{device='core-rtr-01'}[5m]) * 8", time_range="1h")
list_alert_rules(folder="Network")
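The two queries above return bits per second; turning that into a utilization percentage needs the link speed, which the example does not query. The arithmetic is sketched below under the assumption that the link speed is known from elsewhere (for instance an inventory record); both function names are hypothetical.

```python
def octets_to_bps(octets_delta: int, seconds: float) -> float:
    """Average bits per second for a counter that grew by octets_delta over the interval."""
    if seconds <= 0:
        raise ValueError("interval must be positive")
    return octets_delta * 8 / seconds


def utilization_percent(bits_per_second: float, link_speed_bps: float) -> float:
    """Share of the link capacity in use, as a percentage."""
    if link_speed_bps <= 0:
        raise ValueError("link speed must be positive")
    return 100.0 * bits_per_second / link_speed_bps


# Worked example: 37.5 GB received in 5 minutes on a 10 Gbps link.
bps = octets_to_bps(37_500_000_000, 300)      # 1_000_000_000.0 bits per second
percent = utilization_percent(bps, 10e9)      # 10.0 percent
```

Unlike this plain subtraction, PromQL's `rate()` also compensates for counter resets, so the server-side query remains the better source for the bits-per-second value.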

Workflow: Alert Investigation

When investigating Grafana alerts:

1. List alerts: list_alert_rules to find firing or pending rules
2. Alert details: get_alert_rule_by_uid for thresholds, conditions, and datasource
3. Query metrics: query_prometheus to check the metric that triggered the alert
4. Search logs: query_loki_logs to correlate with log events around the alert time
5. Check incidents: list_incidents to see whether this is already tracked
6. Contact points: list_contact_points to verify notification routes
7. Report: Alert analysis with root cause and metric evidence
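Correlating metrics and logs "around alert time" means querying both with the same time window. One way to derive that window is sketched below; the function name `correlation_window` and the default padding of 15 minutes before and 5 minutes after are assumptions to adjust, not values taken from Grafana.

```python
from datetime import datetime, timedelta, timezone


def correlation_window(
    alert_time: datetime, before_minutes: int = 15, after_minutes: int = 5
) -> tuple[str, str]:
    """Return (start, end) as RFC 3339 UTC timestamps surrounding the alert time."""
    if alert_time.tzinfo is None:
        # Refuse naive datetimes so the window is never silently shifted by a local offset.
        raise ValueError("alert_time must be timezone-aware")
    utc = alert_time.astimezone(timezone.utc)
    start = utc - timedelta(minutes=before_minutes)
    end = utc + timedelta(minutes=after_minutes)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start.strftime(fmt), end.strftime(fmt)
```

Using the same pair of timestamps for the metric query and the log query keeps the two result sets aligned when they are compared in the report.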

Workflow: Incident Response

When responding to a Grafana incident:

1. List incidents: list_incidents to find open incidents
2. Incident details: get_incident for timeline, severity, and labels
3. OnCall: get_current_oncall_users to see who should be notified
4. Correlate metrics: query_prometheus to check affected service metrics
5. Correlate logs: query_loki_logs to find error patterns around the incident time
6. Investigate: find_error_pattern_logs for automated error pattern detection
7. Update incident: add_activity_to_incident to add findings to the timeline
8. Annotate: create_annotation to mark the event on relevant dashboards

Workflow: Log Analysis

When investigating network logs stored in Loki:

1. Discover labels: list_loki_label_names to find available labels (host, severity, facility)
2. Label values: list_loki_label_values to enumerate hosts and severity levels
3. Query logs: query_loki_logs with LogQL:
   - By device: {host="core-rtr-01"}
   - By severity: {host="core-rtr-01"} |= "error"
   - Pattern match: {job="syslog"} |~ "BGP|OSPF"
4. Patterns: query_loki_patterns to detect recurring log structures
5. Stats: query_loki_stats for log volume and rate analysis
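The three LogQL examples share one shape: a stream selector in braces, optionally followed by a `|=` substring filter or a `|~` regex filter. A minimal sketch of composing them is given here; `logql_query` is a hypothetical helper, and the label names are just those used in the examples above.

```python
from typing import Optional


def _quote(value: str) -> str:
    # LogQL double-quoted strings need backslashes and quotes escaped.
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def logql_query(
    labels: dict[str, str],
    contains: Optional[str] = None,
    regex: Optional[str] = None,
) -> str:
    """Build a log query: stream selector plus optional line filters."""
    if not labels:
        raise ValueError("LogQL needs at least one label matcher in the stream selector")
    selector = ", ".join(f"{name}={_quote(value)}" for name, value in sorted(labels.items()))
    query = "{" + selector + "}"
    if contains is not None:
        query += f" |= {_quote(contains)}"   # keep lines containing this substring
    if regex is not None:
        query += f" |~ {_quote(regex)}"      # keep lines matching this regular expression
    return query
```

For instance, `logql_query({"job": "syslog"}, regex="BGP|OSPF")` yields `{job="syslog"} |~ "BGP|OSPF"`, matching the pattern-match example in the list.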

Integration with Other Skills

| Skill | Integration |
|-------|-------------|
| pyats-health-check | Cross-reference pyATS health data with Grafana metrics and dashboards |
| pyats-routing | Correlate OSPF/BGP state changes with Grafana metric timelines |
| gait-session-tracking | Record all Grafana queries and findings in GAIT audit trail |
| slack-network-alerts | Grafana alerts fed through Slack + NetClaw for automated investigation |
| servicenow-change-workflow | Annotate Grafana dashboards during change windows; correlate incidents with CRs |
| te-network-monitoring | Pair ThousandEyes path data with Grafana infrastructure metrics |
| aws-cloud-monitoring | Compare Grafana dashboards with CloudWatch data for hybrid visibility |
| markmap-viz | Visualize Grafana alert rule hierarchies as mind maps |

Context Window Management

Grafana dashboards can be large JSON documents. Use these strategies:

Always start with get_dashboard_summary — lightweight overview, not full JSON
Use get_dashboard_property with JSONPath for specific fields
Avoid get_dashboard_by_uid unless you need the complete dashboard definition
Use get_dashboard_panel_queries to extract just the query definitions
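To see why a summary is cheaper than the full definition, the sketch below projects a small dashboard document down to titles and panel types and compares serialized sizes. This is an illustration only and not how get_dashboard_summary is implemented; `summarize_dashboard` and `SAMPLE_DASHBOARD` are hypothetical, and real dashboards carry far more per-panel configuration than this sample.

```python
import json

# Hypothetical, heavily trimmed dashboard document in Grafana's JSON layout.
SAMPLE_DASHBOARD = {
    "uid": "abc123",
    "title": "Network Interfaces",
    "panels": [
        {
            "id": 1,
            "title": "Inbound traffic",
            "type": "timeseries",
            "targets": [{"refId": "A", "expr": 'rate(ifHCInOctets{device="core-rtr-01"}[5m]) * 8'}],
            "fieldConfig": {"defaults": {"unit": "bps"}},
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        },
        {
            "id": 2,
            "title": "Outbound traffic",
            "type": "timeseries",
            "targets": [{"refId": "A", "expr": 'rate(ifHCOutOctets{device="core-rtr-01"}[5m]) * 8'}],
            "fieldConfig": {"defaults": {"unit": "bps"}},
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        },
    ],
}


def summarize_dashboard(dashboard: dict) -> dict:
    """Keep only identifying fields; drop queries, layout, and field configuration."""
    panels = dashboard.get("panels", [])
    return {
        "uid": dashboard.get("uid"),
        "title": dashboard.get("title"),
        "panel_count": len(panels),
        "panels": [
            {"id": p.get("id"), "title": p.get("title"), "type": p.get("type")}
            for p in panels
        ],
    }


full_size = len(json.dumps(SAMPLE_DASHBOARD))
summary_size = len(json.dumps(summarize_dashboard(SAMPLE_DASHBOARD)))
print(f"full: {full_size} characters, summary: {summary_size} characters")
```

The gap widens with every panel, which is why the full JSON is worth fetching only when the complete definition is actually needed.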

Important Rules

Prefer read-only operations — use search_dashboards, get_dashboard_summary, query_prometheus, query_loki_logs, list_alert_rules before any write operations
Dashboard modifications require ServiceNow CR — unless in lab/dev Grafana instance
Alert rule changes require approval — creating/updating/deleting alert rules affects production monitoring
Token-efficient queries — use get_dashboard_summary over get_dashboard_by_uid, use time ranges to limit Prometheus/Loki result size
GAIT audit mandatory — record all Grafana queries, dashboard modifications, alert changes, and incident updates
No secrets in queries — never embed credentials or sensitive data in PromQL/LogQL expressions
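The "no secrets in queries" rule can be backed by a quick pre-flight check on each expression. The sketch below is a heuristic under stated assumptions, not a complete secret scanner: `find_possible_secrets` is a hypothetical helper, the `glsa_` pattern is based on the service account token example earlier in this document, and the keyword list is an arbitrary starting point.

```python
import re

_SUSPICIOUS = [
    # Strings shaped like the service account token example (glsa_...).
    re.compile(r"glsa_[A-Za-z0-9_]+"),
    # Label matchers whose name suggests a credential, e.g. password="...".
    re.compile(r"(?i)\b(password|passwd|secret|token|api[_-]?key)\s*=~?\s*\""),
]


def find_possible_secrets(query: str) -> list[str]:
    """Return the fragments of a PromQL/LogQL expression that look like credentials."""
    return [m.group(0) for pattern in _SUSPICIOUS for m in pattern.finditer(query)]
```

An empty result does not prove a query is clean; it only catches the obvious cases before the expression is sent or written to the audit trail.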

Error Handling

Auth fails (401/403): Check GRAFANA_URL and GRAFANA_SERVICE_ACCOUNT_TOKEN in ~/.openclaw/.env, then verify that the service account has the Editor role or the required RBAC permissions.
Datasource not found: Use list_datasources to discover available datasource UIDs and names.
PromQL/LogQL errors: Use list_prometheus_metric_names or list_loki_label_names to discover valid metric/label names before querying.
Dashboard not found: Use search_dashboards to find dashboards by title before using UID-based tools.
Rate limiting: Grafana may rate-limit API requests; space out large query batches.
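Spacing out a batch can be as simple as pausing between calls. A minimal sketch follows, assuming a fixed delay is acceptable; `run_spaced` is a hypothetical helper, the 0.5 second default is arbitrary rather than a documented Grafana limit, and `call` stands in for whatever function issues a single query.

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_spaced(
    items: Iterable[T],
    call: Callable[[T], R],
    delay_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> list[R]:
    """Apply call to each item in order, pausing between consecutive calls."""
    results = []
    for index, item in enumerate(items):
        if index > 0:
            # No pause before the first call; one pause between each later pair.
            sleep(delay_seconds)
        results.append(call(item))
    return results
```

The `sleep` parameter exists so the pause can be replaced in tests; in normal use the default `time.sleep` applies.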
