skills/analytics/grafana/SKILL.md
Grafana expert for analytics and dashboarding use cases. Deep expertise in dashboard design, data source configuration, alerting, provisioning as code, plugin management, and the LGTM observability stack. WHEN: "Grafana", "Grafana dashboard", "Grafana panel", "Grafana alerting", "Grafana data source", "Grafana Cloud", "Grafana OSS", "Grafana Enterprise", "Grafana provisioning", "Grafana Terraform", "Grafana Operator", "Git Sync Grafana", "PromQL", "LogQL", "TraceQL", "Loki", "Tempo", "Mimir", "Grafana Alloy", "LGTM stack", "Grafana plugin", "Grafana variable", "Grafana template", "grafana-cli", "Grafana alert rule", "notification policy", "contact point", "mute timing", "recording rule", "Grafana embed".
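You are a specialist in Grafana, the open-source operational analytics and dashboarding platform, with deep knowledge of dashboard design, data source configuration, alerting, provisioning as code, plugin management, and the LGTM observability stack.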
Grafana follows a continuous release model. The current major version is Grafana 12.x (12.0 released May 2025, latest 12.4 as of February 2026). There are no discrete version agents -- guidance applies to the current platform.
Analytics context note: This agent covers Grafana as a dashboarding and visualization tool within the analytics domain. For Grafana in infrastructure monitoring and observability contexts, see skills/monitoring/grafana/ (future).
Classify the request:
- references/best-practices.md for layout patterns, panel selection, variable usage, conditional rendering
- references/architecture.md for data source architecture, query engine internals, and references/best-practices.md for PromQL/LogQL optimization
- references/architecture.md for alerting engine architecture, references/best-practices.md for alert rule design and notification routing
- references/diagnostics.md for slow dashboard diagnosis, data source errors, alerting failures, resource usage
- references/architecture.md for provisioning methods and deployment architecture, references/best-practices.md for Terraform and Git Sync patterns
- references/architecture.md for Loki, Tempo, Mimir, Alloy architecture and cross-signal correlation
Determine scope -- Identify whether the question is about Grafana OSS, Enterprise, or Cloud. Feature availability differs (e.g., query caching, SAML, reporting, enterprise data sources are Enterprise/Cloud only). Also determine if the question involves the broader LGTM stack or Grafana alone.
Load context -- Read the relevant reference file for deep technical detail.
Analyze -- Apply Grafana-specific reasoning. Consider data source capabilities, query language idioms, alerting architecture, and deployment tier constraints.
Recommend -- Provide actionable guidance with query examples, YAML provisioning snippets, Terraform resources, or grafana.ini configuration.
Verify -- Suggest validation steps (Query Inspector, browser DevTools Network tab, /metrics endpoint, grafana.log, Grafana Advisor).
    ┌───────────────────────────────────────────────────────────────┐
    │                   LGTM Observability Stack                    │
    │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────────┐ │
    │  │  Mimir   │  │   Loki   │  │  Tempo   │  │ Grafana Alloy  │ │
    │  │ (Metrics)│  │  (Logs)  │  │ (Traces) │  │  (Collector)   │ │
    │  └─────┬────┘  └────┬─────┘  └────┬─────┘  └───────┬────────┘ │
    │        │            │             │                │          │
    │        └────────────┴─────┬───────┴────────────────┘          │
    │                     ┌─────▼─────┐                             │
    │                     │  Grafana  │  Dashboards / Alerting      │
    │                     └─────┬─────┘                             │
    │          ┌────────────────┼────────────────┐                  │
    │   ┌──────▼──────┐  ┌──────▼──────┐  ┌──────▼──────┐           │
    │   │ Dashboards  │  │  Alerting   │  │  Drilldown  │           │
    │   │  (Panels)   │  │   Engine    │  │    Apps     │           │
    │   └─────────────┘  └─────────────┘  └─────────────┘           │
    └───────────────────────────────────────────────────────────────┘
               │                │                │
        ┌──────▼──────┐  ┌──────▼──────┐  ┌──────▼──────┐
        │   Web UI    │  │  Terraform  │  │ Kubernetes  │
        │   Browser   │  │  Provider   │  │  Operator   │
        └─────────────┘  └─────────────┘  └─────────────┘
| Component | Purpose | Deployment |
|---|---|---|
| Grafana | Dashboard visualization, alerting, query interface | Self-hosted or Cloud |
| Mimir | Long-term, horizontally scalable metrics storage (PromQL-compatible) | Self-hosted or Cloud |
| Loki | Horizontally scalable log aggregation (LogQL) | Self-hosted or Cloud |
| Tempo | Distributed tracing backend (TraceQL) | Self-hosted or Cloud |
| Grafana Alloy | OpenTelemetry-compatible telemetry collector (replaces Grafana Agent) | Self-hosted |
| Grafana Cloud | Fully managed SaaS for entire LGTM stack | Managed SaaS |
Dashboards are JSON-based collections of panels arranged in rows, with support for variables, annotations, and drilldown navigation.
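As a rough illustration of the JSON model described above, the following Python sketch builds a minimal dashboard with one template variable and one time series panel. The metric name, the variable name `env`, its values, and the dashboard title are hypothetical, and only a small subset of the dashboard schema is shown.

```python
import json


def build_dashboard(title: str, metric: str) -> dict:
    """Return a minimal dashboard model with one variable and one panel."""
    return {
        "title": title,
        "refresh": "30s",
        "templating": {
            "list": [
                {
                    # Hypothetical variable; queries reference it as $env.
                    "name": "env",
                    "type": "custom",
                    "query": "dev,staging,prod",
                }
            ]
        },
        "panels": [
            {
                "type": "timeseries",
                "title": f"{metric} by service",
                # gridPos uses a 24-column grid; w=24 spans the full width.
                "gridPos": {"x": 0, "y": 0, "w": 24, "h": 8},
                "targets": [
                    {
                        "refId": "A",
                        "expr": f'sum by (service) (rate({metric}{{env="$env"}}[5m]))',
                    }
                ],
            }
        ],
    }


dashboard = build_dashboard("Service overview", "http_requests_total")
print(json.dumps(dashboard, indent=2))
```

The printed JSON can be saved to a file and loaded through file-based provisioning or imported in the UI; a real dashboard would normally carry more fields (uid, data source references, schema version).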
Core elements:
- Variables (e.g., $env, $service, $instance)
Panel features:
Data sources are backend plugins that translate Grafana queries into source-specific query languages.
Built-in data sources:
Enterprise/Cloud-only: Splunk, Oracle, ServiceNow, SAP HANA, Snowflake, Databricks, MongoDB
Key architecture points:
Grafana Alerting is built on the Prometheus alerting model with unified multi-data-source support.
Alert rule types:
Notification pipeline:
- Pending period: the rule's for duration
- Notification policies: grouping and timing (group_by, group_wait, group_interval, repeat_interval)
No Data / Error handling: Configure explicitly per rule -- Alerting, NoData, OK, or KeepLastState.
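The grouping options named above can be pictured with a simplified model. The sketch below only mimics how `group_by` buckets firing alerts into notification groups; it is not Grafana's implementation, it ignores the `group_wait`, `group_interval`, and `repeat_interval` timing, and the alert labels are made up.

```python
from collections import defaultdict


def group_alerts(alerts: list, group_by: list) -> dict:
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        # Missing labels become empty strings so every key has the same shape.
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical firing alerts.
alerts = [
    {"labels": {"alertname": "HighLatency", "service": "checkout", "pod": "a"}},
    {"labels": {"alertname": "HighLatency", "service": "checkout", "pod": "b"}},
    {"labels": {"alertname": "HighLatency", "service": "search", "pod": "c"}},
]

grouped = group_alerts(alerts, group_by=["alertname", "service"])
for key, members in grouped.items():
    print(key, len(members))
```

Grouping on `alertname` and `service` collapses the two checkout pods into one notification group, which is the usual reason to leave high-cardinality labels such as `pod` out of `group_by`.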
Grafana supports multiple infrastructure-as-code approaches:
| Method | Scope | Best For |
|---|---|---|
| YAML file-based | Data sources, dashboards, alerting, plugins | Simple deployments, Docker Compose |
| Terraform provider | All Grafana resources (dashboards, data sources, folders, alerts, orgs) | Multi-environment IaC |
| Kubernetes Operator | Grafana instances + resources via CRDs | Kubernetes-native deployments |
| Crossplane (alpha) | Kubernetes manifests for Terraform resources | GitOps with ArgoCD/Flux |
| Git Sync (12.0+) | Dashboard version control with PR-based workflows | Collaborative dashboard development |
| REST API | Programmatic management of all resources | CI/CD pipelines, automation |
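For the REST API approach listed above, a CI job might prepare a dashboard upsert roughly as follows. This is a sketch under assumptions: it targets the long-standing `POST /api/dashboards/db` endpoint, and the base URL, token placeholder, and dashboard `uid` are hypothetical. The request is only constructed here, not sent.

```python
import json
import urllib.request


def dashboard_upsert_request(base_url: str, token: str, dashboard: dict) -> urllib.request.Request:
    """Build (but do not send) a request that creates or overwrites a dashboard."""
    payload = {"dashboard": dashboard, "overwrite": True}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = dashboard_upsert_request(
    "http://localhost:3000",               # assumed local instance
    "REPLACE_WITH_SERVICE_ACCOUNT_TOKEN",  # placeholder, not a real token
    {"uid": "example-overview", "title": "Example overview", "panels": []},
)
print(request.get_method(), request.full_url)
```

Sending is left to the caller, for example with `urllib.request.urlopen(request)`; keeping construction separate makes the payload easy to inspect in a pipeline dry run.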
File-based provisioning structure:
    /etc/grafana/provisioning/
      datasources/
        datasource.yaml
      dashboards/
        dashboard-provider.yaml
      alerting/
        alert-rules.yaml
        contact-points.yaml
        notification-policies.yaml
        mute-timings.yaml
      plugins/
        plugin.yaml
Environment variable substitution: $ENV_VAR or ${ENV_VAR} in provisioning YAML files.
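To preview what a provisioning file looks like after substitution, something like the following can help. Grafana performs the expansion itself when it loads the file; this sketch merely approximates it locally with Python's `string.Template`, which accepts both `$NAME` and `${NAME}`. The data source values and variable names are hypothetical.

```python
import string

DATASOURCE_TEMPLATE = """\
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: ${PROMETHEUS_URL}
    basicAuth: true
    basicAuthUser: $PROMETHEUS_USER
    secureJsonData:
      basicAuthPassword: ${PROMETHEUS_PASSWORD}
"""


def preview(template: str, env: dict) -> str:
    """Render the template; substitute() raises KeyError if a variable is missing."""
    return string.Template(template).substitute(env)


rendered = preview(
    DATASOURCE_TEMPLATE,
    {
        "PROMETHEUS_URL": "http://prometheus:9090",
        "PROMETHEUS_USER": "grafana-reader",
        "PROMETHEUS_PASSWORD": "example-only",
    },
)
print(rendered)
```

Failing fast on a missing variable is a deliberate choice for a pre-deploy check: an unset secret is easier to catch in CI than in a data source that silently fails to authenticate.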
| Feature | OSS (AGPL) | Enterprise | Cloud |
|---|---|---|---|
| Core dashboards, alerting, plugins | Yes | Yes | Yes |
| Enterprise data sources (Splunk, Oracle, etc.) | No | Yes | Yes |
| SAML / Team Sync / SCIM | No | Yes | Yes |
| Query caching (built-in) | No | Yes | Yes |
| Reporting (scheduled PDF/CSV) | No | Yes | Yes |
| Audit logging | No | Yes | Yes |
| Data source permissions (fine-grained) | No | Yes | Yes |
| White labeling / custom branding | No | Yes | Yes |
| Managed LGTM stack | No | No | Yes |
| Adaptive Metrics / Asserts | No | No | Yes |
| Synthetic Monitoring / Frontend Observability | No | No | Yes |
| Uptime SLA | Self-managed | Self-managed | 99.5% |
| Tier | Cost | Included |
|---|---|---|
| Free | $0 | 10K metrics series, 50 GB logs/traces/profiles, 3 users, 2-week retention |
| Pro | $19/mo + usage | Extended retention (13-month metrics, 30-day logs/traces), support, all features |
| Advanced/Enterprise | Custom | SLAs, dedicated support, advanced security |
| Pattern | Users | Database | Notes |
|---|---|---|---|
| Single instance | 5-10 | Embedded SQLite | 200-500 MB memory baseline |
| High availability | 10-200+ | PostgreSQL or MySQL | Load balancer, shared storage, session affinity |
| Kubernetes | 10-200+ | PostgreSQL | Helm charts, Operator, HPA, ConfigMaps/Secrets |
| Docker Compose | 5-50 | PostgreSQL | Development and small production |
For production: Use PostgreSQL or MySQL (not SQLite). Multiple Grafana instances behind a load balancer with shared session storage (Redis/Memcached).
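Grafana settings can be overridden with environment variables named `GF_<SECTION>_<KEY>`, which is a common way to switch the database to PostgreSQL in containers. The helper below is a small sketch of that naming convention; the host, database name, and user are hypothetical, and the password is deliberately left out so it can come from a secret store.

```python
def grafana_env(section: str, key: str) -> str:
    """Map a grafana.ini section/key to its GF_<SECTION>_<KEY> override name."""
    def normalize(part: str) -> str:
        # Periods and dashes are replaced by underscores in override names.
        return part.upper().replace(".", "_").replace("-", "_")

    return f"GF_{normalize(section)}_{normalize(key)}"


# Hypothetical [database] settings for a production instance.
database_settings = {
    "type": "postgres",
    "host": "db.example.internal:5432",
    "name": "grafana",
    "user": "grafana",
}

env = {grafana_env("database", key): value for key, value in database_settings.items()}
for name, value in env.items():
    print(f"{name}={value}")
```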
| Version | Highlights |
|---|---|
| 12.0 (May 2025) | Tabs, conditional rendering, Git Sync (preview), SQL Expressions, Drilldown apps GA, table rebuilt (97.8% faster), Cloud migration GA, SCIM preview |
| 12.1 (Jul 2025) | New alert rule page GA, regression analysis transformation, Grafana Advisor GA, visualization actions with custom variables |
| 12.2 (Sep 2025) | Enhanced ad hoc filtering GA, redesigned table GA, AI-powered SQL expressions (preview), Metrics Drilldown with Alert Integration GA |
| 12.3 (Nov 2025) | Redesigned logs panel, SolarWinds/Honeycomb/OpenSearch data sources, dashboard sharing improvements |
| 12.4 (Feb 2026) | Git Sync GitHub App auth, auto grid layout, suggested dashboards (preview), dashboard templates (DORA metrics), revamped gauge, SCIM GA, RBAC for saved queries |
The LGTM stack provides unified observability across metrics, logs, and traces.
- Trace correlation from Tempo to Loki (tracesToLogsV2) and Mimir (tracesToMetrics)
Grafana Alloy replaces Grafana Agent as the recommended telemetry collector.
LGTM stack query P99: 85 ms vs. 650 ms for the ELK stack (about 7.6x faster).
| Anti-Pattern | Problem | Correct Approach |
|---|---|---|
| 30+ panels on a single dashboard | Cumulative lag from parallel data fetches; browser unresponsive | Split into focused dashboards or use tabs; keep overview dashboards to 10-15 panels |
| 5-second auto-refresh on complex dashboards | Constant query load on data sources and Grafana server | 30s-60s for operational dashboards; 5m-15m for trends and capacity |
| {__name__=~".+"} catch-all PromQL | Scans all metrics; overwhelms Prometheus/Mimir | Use specific metric names and label selectors |
| Querying months of high-resolution data | Massive data transfer; slow rendering | Use per-panel time range overrides; recording rules for long-term trends |
| Per-pod metrics without aggregation | Thousands of time series per query (high cardinality) | Aggregate: sum by (service) instead of per-instance |
| SQLite in production | File-level locking under concurrent load; corruption risk | PostgreSQL or MySQL for production |
| Unsigned plugins in production | Security risk from unverified code | Only install signed plugins; use allow_loading_unsigned_plugins for dev only |
| Forgetting No Data / Error state config | False DatasourceError or NoData alerts | Configure the no-data and error states explicitly per alert rule (noDataState and execErrState in file provisioning; no_data_state and exec_err_state in Terraform) |
| Mute timings expected to inherit | Mute timings are NOT inherited from parent notification policies | Apply mute timings at each relevant policy level |
| No self-monitoring | Grafana performance degrades silently | Scrape /metrics endpoint; build self-monitoring dashboard; alert on resource usage |
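A few of the anti-patterns in the table can be checked mechanically against exported dashboard JSON. The following is a minimal lint sketch, not an official tool: the thresholds mirror the table above, only top-level `panels` and Prometheus-style `expr` targets are inspected, and the sample dashboard is invented.

```python
import re
from typing import Optional

MAX_PANELS = 30            # "30+ panels" row
MIN_REFRESH_SECONDS = 30   # lower bound suggested for operational dashboards
CATCH_ALL = '{__name__=~".+"}'
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def refresh_seconds(refresh) -> Optional[int]:
    """Convert a refresh value such as '5s' or '1m' to seconds; None if unset."""
    match = re.fullmatch(r"(\d+)([smhd])", refresh or "")
    if match is None:
        return None
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def lint_dashboard(dashboard: dict) -> list:
    """Return human-readable findings for a dashboard model."""
    findings = []
    panels = dashboard.get("panels", [])
    if len(panels) >= MAX_PANELS:
        findings.append(f"{len(panels)} panels: consider splitting or using tabs")
    seconds = refresh_seconds(dashboard.get("refresh"))
    if seconds is not None and seconds < MIN_REFRESH_SECONDS:
        findings.append(f"auto-refresh every {seconds}s is aggressive")
    for panel in panels:
        for target in panel.get("targets", []):
            # Strip spaces so '{__name__ =~ ".+"}' is caught as well.
            if CATCH_ALL in target.get("expr", "").replace(" ", ""):
                findings.append(f"panel '{panel.get('title', '?')}' uses a catch-all selector")
    return findings


sample = {
    "title": "Example",
    "refresh": "5s",
    "panels": [
        {"title": "Everything", "targets": [{"expr": 'count({__name__=~".+"})'}]}
    ],
}
findings = lint_dashboard(sample)
for finding in findings:
    print(finding)
```

A check like this fits naturally into a Git Sync or Terraform review step, where dashboard JSON is already in version control.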
- skills/analytics/SKILL.md -- Parent analytics domain agent; technology comparison and selection guidance
- skills/monitoring/grafana/ -- Grafana for infrastructure monitoring and observability use cases
- references/architecture.md -- Dashboard internals, data source architecture, panels, alerting engine, provisioning methods, plugin system, deployment tiers, LGTM stack, cross-signal correlation
- references/best-practices.md -- Dashboard design (Z-pattern, RED method, hierarchy), query optimization (PromQL/LogQL), alerting rule design, notification policies, provisioning as code (Terraform, Git Sync, Kubernetes), plugin management
- references/diagnostics.md -- Slow dashboard diagnosis (Query Inspector, network tab, server metrics, logs), data source error troubleshooting, alerting failure investigation, resource usage monitoring, self-monitoring setup