workspace/skills/grafana-observability/SKILL.md
Grafana observability platform — dashboards, Prometheus PromQL, Loki LogQL, alerting, incidents, OnCall schedules, annotations, datasource queries, panel rendering (75+ tools). Use when querying Grafana dashboards, running PromQL for interface metrics, searching Loki logs for syslog events, investigating firing alerts, or checking who is on call.
npx skillsauth add automateyournetwork/netclaw grafana-observability
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
| Property | Value |
|----------|-------|
| Source | grafana/mcp-grafana |
| Transport | stdio (default), SSE, or streamable-http |
| Language | Go (runs via uvx mcp-grafana) |
| Tools | 75+ (dashboards, Prometheus, Loki, alerting, incidents, OnCall, annotations, admin) |
| Auth | Service account token (preferred) or username/password |
| Requires | Grafana 9.0+, service account with Editor role or granular RBAC |
# stdio mode (default — used by NetClaw)
uvx mcp-grafana
# Read-only mode (prevents dashboard/alert modifications)
uvx mcp-grafana --disable-write
| Variable | Required | Example | Description |
|----------|----------|---------|-------------|
| GRAFANA_URL | Yes | http://grafana.example.com:3000 | Grafana instance URL |
| GRAFANA_SERVICE_ACCOUNT_TOKEN | Yes* | glsa_abc123... | Service account token (preferred auth) |
| GRAFANA_USERNAME | Alt | admin | Basic auth username (alternative to token) |
| GRAFANA_PASSWORD | Alt | changeme | Basic auth password |
| GRAFANA_ORG_ID | No | 1 | Organization ID for multi-org setups |
*Either service account token or username/password required.
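
The either/or rule in the footnote can be checked before launching the server. The following is a minimal sketch, assuming the variables are read from a plain mapping such as `os.environ`. The helper name `resolve_grafana_auth` is hypothetical and is not part of mcp-grafana, which performs its own validation at startup.

```python
from typing import Mapping


def resolve_grafana_auth(env: Mapping[str, str]) -> str:
    """Return "token" or "basic" depending on which credentials are set.

    Hypothetical pre-flight check for illustration only; it mirrors the
    environment variable table, not mcp-grafana's internal logic.
    """
    if not env.get("GRAFANA_URL"):
        raise ValueError("GRAFANA_URL is required")
    if env.get("GRAFANA_SERVICE_ACCOUNT_TOKEN"):
        return "token"  # preferred auth according to the table
    if env.get("GRAFANA_USERNAME") and env.get("GRAFANA_PASSWORD"):
        return "basic"  # alternative: username and password together
    raise ValueError(
        "set GRAFANA_SERVICE_ACCOUNT_TOKEN, or both "
        "GRAFANA_USERNAME and GRAFANA_PASSWORD"
    )
```

Calling `resolve_grafana_auth(os.environ)` before starting the server surfaces a missing variable as a clear error instead of a failed tool call later.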
| Tool | What It Does |
|------|-------------|
| search_dashboards | Find dashboards by title or metadata |
| get_dashboard_summary | Lightweight overview (context-efficient — use this first) |
| get_dashboard_by_uid | Full dashboard JSON (large — use sparingly) |
| get_dashboard_property | Extract specific fields via JSONPath |
| get_dashboard_panel_queries | Extract panel query details |
| update_dashboard | Create or modify dashboards |
| patch_dashboard | Targeted modifications without full JSON replacement |
| Tool | What It Does |
|------|-------------|
| query_prometheus | Execute instant or range PromQL queries |
| list_prometheus_metric_names | Discover available metrics |
| list_prometheus_label_names | List labels matching selectors |
| list_prometheus_label_values | Retrieve values for a specific label |
| query_prometheus_histogram | Calculate percentiles (p50, p90, p95, p99) |
| list_prometheus_metric_metadata | Metric type, help text, unit |
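
To make the percentile values from query_prometheus_histogram easier to interpret, here is a rough sketch of how a quantile is estimated from classic cumulative (`le`) buckets by linear interpolation, in the spirit of PromQL's `histogram_quantile`. It is an illustration under simplifying assumptions, not the tool's actual implementation; the function name `bucket_quantile` and the sample buckets are hypothetical.

```python
import math
from typing import List, Tuple


def bucket_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound,
    ending with an upper_bound of math.inf, like Prometheus `le` buckets.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, nothing to estimate
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # The +Inf bucket has no width; fall back to the highest
                # finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical latency buckets in seconds.
example = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = bucket_quantile(0.95, example)
```

Because the estimate interpolates inside one bucket, its precision is limited by the bucket boundaries; a p99 that sits exactly on a bucket edge usually reflects the bucket layout rather than a measured value.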
| Tool | What It Does |
|------|-------------|
| query_loki_logs | Execute LogQL queries against log streams |
| list_loki_label_names | Discover available log labels |
| list_loki_label_values | List values for a specific log label |
| query_loki_stats | Stream statistics (volume, rate) |
| query_loki_patterns | Detect log structure patterns |
| Tool | What It Does |
|------|-------------|
| list_alert_rules | View all Grafana and datasource-managed alert rules |
| get_alert_rule_by_uid | Retrieve specific alert rule details |
| create_alert_rule | Create new alert rule |
| update_alert_rule | Modify existing alert rule |
| delete_alert_rule | Remove alert rule |
| list_contact_points | View notification endpoints (email, Slack, PagerDuty, etc.) |
| Tool | What It Does |
|------|-------------|
| list_incidents | View Grafana Incidents with filtering |
| get_incident | Single incident details |
| create_incident | Create a new incident |
| add_activity_to_incident | Add timeline entry to incident |
| Tool | What It Does |
|------|-------------|
| list_oncall_schedules | View on-call rotation schedules |
| get_oncall_shift | Shift details |
| get_current_oncall_users | Who is on call right now |
| list_alert_groups | OnCall alert groups with filtering |
| Tool | What It Does |
|------|-------------|
| get_annotations | Query annotations with time/tag filters |
| create_annotation | Add annotation to dashboard/panel |
| get_panel_image | Render a panel or dashboard as PNG image |
| generate_deeplink | Create accurate Grafana URLs for sharing |
| Tool | What It Does |
|------|-------------|
| list_sift_investigations | List automated investigations |
| get_sift_investigation | Investigation details |
| find_error_pattern_logs | Detect elevated error patterns in logs |
| find_slow_requests | Identify slow requests via Tempo traces |
When checking network device metrics in Grafana:
1. search_dashboards with keyword (e.g., "network", "interface", "BGP")
2. get_dashboard_summary for panel list without full JSON
3. query_prometheus with PromQL for specific metrics:
   - `rate(ifHCInOctets{instance="router1"}[5m]) * 8`
   - `bgp_peer_state{peer="10.1.1.2"}`
   - `device_cpu_utilization{device="core-rtr-01"}`
   - `increase(ifInErrors{device=~".*"}[1h])`
4. list_alert_rules to see active alerting thresholds
5. query_loki_logs for syslog or SNMP trap data

search_dashboards(title="Network Interfaces")
get_dashboard_summary(uid="abc123")
query_prometheus(expr="rate(ifHCInOctets{device='core-rtr-01'}[5m]) * 8", time_range="1h")
query_prometheus(expr="rate(ifHCOutOctets{device='core-rtr-01'}[5m]) * 8", time_range="1h")
list_alert_rules(folder="Network")
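
The `rate(ifHCInOctets{...}[5m]) * 8` expressions above convert an octet counter into bits per second. A simplified sketch of that arithmetic follows, assuming raw `(timestamp, counter)` samples from a single window. The function name and sample values are hypothetical, and Prometheus itself additionally extrapolates to the window boundaries, so real results differ slightly.

```python
from typing import List, Tuple

Sample = Tuple[float, int]  # (unix_timestamp_seconds, counter_value)


def bits_per_second(samples: List[Sample]) -> float:
    """Approximate rate(counter[window]) * 8 from the samples in one window.

    Sums the increases between consecutive samples, treats a drop as a
    counter reset, and divides by the time between first and last sample.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts near zero, so the new value
        # itself is taken as the increase since the reset.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    return increase / elapsed * 8  # octets per second -> bits per second


# Hypothetical ifHCInOctets samples taken 60 seconds apart.
bps = bits_per_second([(0, 1_000_000), (60, 1_750_000), (120, 2_500_000)])
```

Here 1,500,000 octets arrive over 120 seconds, which is 12,500 octets per second or 100,000 bits per second. Dividing that figure by the interface speed gives utilization.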
When investigating Grafana alerts:
1. list_alert_rules: find firing or pending rules
2. get_alert_rule_by_uid: thresholds, conditions, datasource
3. query_prometheus: check the metric that triggered the alert
4. query_loki_logs: correlate with log events around alert time
5. list_incidents: is this already tracked?
6. list_contact_points: verify notification routes

When responding to a Grafana incident:
1. list_incidents: find open incidents
2. get_incident: timeline, severity, labels
3. get_current_oncall_users: who should be notified
4. query_prometheus: check affected service metrics
5. query_loki_logs: find error patterns around incident time
6. find_error_pattern_logs: automated error pattern detection
7. add_activity_to_incident: add findings to timeline
8. create_annotation: mark event on relevant dashboards

When investigating network logs stored in Loki:
1. list_loki_label_names: find available labels (host, severity, facility)
2. list_loki_label_values: enumerate hosts, severity levels
3. query_loki_logs with LogQL:
   - `{host="core-rtr-01"}`
   - `{host="core-rtr-01"} |= "error"`
   - `{job="syslog"} |~ "BGP|OSPF"`
4. query_loki_patterns: detect recurring log structures
5. query_loki_stats: log volume and rate analysis

| Skill | Integration |
|-------|-------------|
| pyats-health-check | Cross-reference pyATS health data with Grafana metrics and dashboards |
| pyats-routing | Correlate OSPF/BGP state changes with Grafana metric timelines |
| gait-session-tracking | Record all Grafana queries and findings in GAIT audit trail |
| slack-network-alerts | Grafana alerts fed through Slack + NetClaw for automated investigation |
| servicenow-change-workflow | Annotate Grafana dashboards during change windows; correlate incidents with CRs |
| te-network-monitoring | Pair ThousandEyes path data with Grafana infrastructure metrics |
| aws-cloud-monitoring | Compare Grafana dashboards with CloudWatch data for hybrid visibility |
| markmap-viz | Visualize Grafana alert rule hierarchies as mind maps |
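
The LogQL examples in the Loki workflow above share one shape: a label selector in braces, optionally followed by a line filter (`|=` for substring, `|~` for regex). A small sketch of assembling such strings safely is shown here. `build_logql` is a hypothetical helper and not part of any Grafana library, and it assumes that JSON-style quoting is acceptable for LogQL string literals with ordinary values.

```python
import json
from typing import Mapping, Optional


def build_logql(labels: Mapping[str, str],
                contains: Optional[str] = None,
                regex: Optional[str] = None) -> str:
    """Build a LogQL log query from equality matchers and optional filters."""
    if not labels:
        raise ValueError("a LogQL selector needs at least one label matcher")
    # json.dumps produces a double-quoted string with backslash escaping,
    # which keeps embedded quotes and backslashes from breaking the query.
    selector = ", ".join(
        f"{name}={json.dumps(value)}" for name, value in labels.items()
    )
    query = "{" + selector + "}"
    if contains is not None:
        query += f" |= {json.dumps(contains)}"  # substring line filter
    if regex is not None:
        query += f" |~ {json.dumps(regex)}"  # regex line filter
    return query


query = build_logql({"job": "syslog"}, regex="BGP|OSPF")
```

The resulting string can be passed as the query to query_loki_logs; the exact parameter names of that tool are not specified in this document.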
Grafana dashboards can be large JSON documents. Use these strategies:
- Start with get_dashboard_summary: lightweight overview, not full JSON
- Use get_dashboard_property with JSONPath for specific fields
- Avoid get_dashboard_by_uid unless you need the complete dashboard definition
- Use get_dashboard_panel_queries to extract just the query definitions
- Run read-only tools (search_dashboards, get_dashboard_summary, query_prometheus, query_loki_logs, list_alert_rules) before any write operations
- Prefer get_dashboard_summary over get_dashboard_by_uid, and use time ranges to limit Prometheus/Loki result size
- Set GRAFANA_URL and GRAFANA_SERVICE_ACCOUNT_TOKEN in ~/.openclaw/.env. Verify the service account has the Editor role or the required RBAC permissions.
- Use list_datasources to discover available datasource UIDs and names.
- Use list_prometheus_metric_names or list_loki_label_names to discover valid metric/label names before querying.
- Use search_dashboards to find dashboards by title before using UID-based tools.
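
To illustrate why a summary is cheaper than the full dashboard JSON, the sketch below reduces a dashboard document to panel titles, types, and query expressions. It assumes the common dashboard layout with a top-level `panels` array whose entries carry `targets`. `summarize_dashboard` is a hypothetical helper; it does not reproduce the actual output format of get_dashboard_summary and it ignores panels nested inside rows.

```python
from typing import Any, Dict, List


def summarize_dashboard(dashboard: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the fields needed to decide which panel to inspect next."""
    panels: List[Dict[str, Any]] = []
    for panel in dashboard.get("panels", []):
        panels.append({
            "id": panel.get("id"),
            "title": panel.get("title"),
            "type": panel.get("type"),
            # Prometheus and Loki targets store their query under "expr".
            "queries": [t["expr"] for t in panel.get("targets", []) if "expr" in t],
        })
    return {
        "uid": dashboard.get("uid"),
        "title": dashboard.get("title"),
        "panels": panels,
    }


# Hypothetical dashboard fragment; real documents carry far more fields.
sample = {
    "uid": "abc123",
    "title": "Network Interfaces",
    "panels": [{
        "id": 1,
        "title": "Inbound bps",
        "type": "timeseries",
        "fieldConfig": {"defaults": {"unit": "bps"}},
        "targets": [{"refId": "A", "expr": "rate(ifHCInOctets[5m]) * 8"}],
    }],
}
summary = summarize_dashboard(sample)
```

Styling, field configuration, and layout data are dropped, which is typically where most of a dashboard's size comes from.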