building-dashboards
You design dashboards that help humans make decisions quickly. Dashboards are products: audience, questions, and actions matter more than chart count.
Choose your starting point:
| Starting from | Workflow |
|---------------|----------|
| Vague description | Intake → check dataset kind → design blueprint (APL or MPL) → queries per panel → deploy |
| Template | Pick template → customize dataset/service/env → deploy |
| Splunk dashboard | Extract SPL → translate via spl-to-apl → map to chart types → deploy |
| Exploration | Use axiom-sre to discover schema/signals → productize into panels |
Before designing, clarify:
Audience & decision
Scope
Dataset kind (mandatory first step)
- Run `scripts/metrics/datasets <deploy>` to identify each dataset's `kind`
- If `kind` is `otel:metrics:v1` → this is a metrics dataset. Follow the Metrics path below.

⚠️ NEVER run `getschema` on a metrics dataset. APL queries against `otel:metrics:v1` datasets return 0 rows without error, so you will waste calls widening time ranges before realizing it's the wrong discovery method.
APL path (events/logs datasets):
- Discover the schema with `getschema`:

['dataset'] | where _time between (ago(1h) .. now()) | getschema
Metrics path (otel:metrics:v1 datasets):
- `scripts/metrics/metrics-spec <deploy> <dataset>` (mandatory before composing any MPL query)
- `scripts/metrics/metrics-info <deploy> <dataset> metrics`
- `scripts/metrics/metrics-info <deploy> <dataset> tags`
- `scripts/metrics/metrics-info <deploy> <dataset> tags <tag> values`
- If discovery returns nothing, retry with `--start` set to 7 days ago: sparse metrics (sensors, batch jobs, crons) may not have data in the default 24h window
- `find-metrics <value>` searches tag values, not metric names. Use it only when you know a specific entity name (service, host, device) and want to find which metrics are associated with it

Golden signals (APL path)
Drilldown dimensions (APL path)
Choose the blueprint that matches your dataset kind (identified in Intake step 3).
Single numbers that answer "is it broken right now?"
Time-based patterns that answer "what changed?"
Top-N analysis that answers "where should I look?"
Raw events that answer "what exactly happened?"
Prerequisite: You MUST have run `scripts/metrics/metrics-spec` and `scripts/metrics/metrics-info` before designing panels. Never guess MPL syntax or metric/tag names.
Current values for key metrics: answer "what's the state right now?"

- `group using avg` or `group using last`, depending on metric type (gauge vs counter)

Metric trends over time: answer "what changed?"

- `align to <interval> using avg|sum|last` for proper time bucketing

Per-entity detail: answer "where should I look?"

Boolean/state metrics: answer "what is on/off/active?"

- `align to <interval> using last` for state metrics

The console uses react-grid-layout, which requires minH, minW, moved, and static on every layout entry. The dashboard-create and dashboard-update scripts auto-fill these if omitted, so layout entries only need i, x, y, w, h.
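The auto-fill behavior described above can be pictured with a minimal Python sketch. The default values (minW and minH of 1, moved and static false) and the function name `fill_layout_entry` are assumptions for illustration; the actual scripts may fill in different values.

```python
# Illustrative only: these defaults are assumptions, not the values that
# dashboard-create / dashboard-update actually write.
ASSUMED_DEFAULTS = {"minW": 1, "minH": 1, "moved": False, "static": False}
REQUIRED_KEYS = ("i", "x", "y", "w", "h")


def fill_layout_entry(entry: dict) -> dict:
    """Return a copy of a layout entry with the react-grid-layout fields present."""
    missing = [key for key in REQUIRED_KEYS if key not in entry]
    if missing:
        raise ValueError(f"layout entry is missing required keys: {missing}")
    # Values already present in the entry win over the assumed defaults.
    return {**ASSUMED_DEFAULTS, **entry}


filled = fill_layout_entry({"i": "error-rate", "x": 0, "y": 0, "w": 3, "h": 2})
```

Keeping hand-written entries down to the five required keys makes the JSON easier to review, and the helper shows that nothing else has to be typed by hand.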
Every chart MUST have a unique id field. Every layout entry's i field MUST reference a chart id. Missing or mismatched IDs will corrupt the dashboard in the UI (blank state, unable to save/revert).
{
"charts": [
{
"id": "error-rate",
"name": "Error Rate",
"type": "Statistic",
"query": { "apl": "..." }
}
],
"layout": [
{"i": "error-rate", "x": 0, "y": 0, "w": 3, "h": 2}
]
}
Use descriptive kebab-case IDs (e.g. error-rate, p95-latency, traffic-rps). The dashboard-validate and deploy scripts enforce this automatically.
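A quick local check of the ID rules can catch mismatches before deploying. The following is a minimal Python sketch, not the implementation of dashboard-validate; the function name `check_chart_ids` and the exact kebab-case pattern are assumptions.

```python
import re

# Assumed kebab-case pattern: lowercase letters/digits separated by single hyphens.
KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def check_chart_ids(dashboard: dict) -> list:
    """Return a list of human-readable problems with chart ids and layout refs."""
    problems = []
    seen = set()
    for chart in dashboard.get("charts", []):
        chart_id = chart.get("id")
        if not chart_id:
            problems.append(f"chart {chart.get('name', '?')!r} has no id")
            continue
        if chart_id in seen:
            problems.append(f"duplicate chart id {chart_id!r}")
        elif not KEBAB_CASE.match(chart_id):
            problems.append(f"chart id {chart_id!r} is not kebab-case")
        seen.add(chart_id)
    # Every layout entry's "i" must point at a known chart id.
    for entry in dashboard.get("layout", []):
        if entry.get("i") not in seen:
            problems.append(f"layout entry {entry.get('i')!r} references no chart")
    return problems
```

An empty list means the IDs and layout references are consistent; anything else should be fixed before running the deploy scripts.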
Metrics-backed charts place the MPL pipeline string in query.apl only.
CRITICAL: Run `scripts/metrics/metrics-spec <deployment> <dataset>` before composing your first MPL query in a session. NEVER guess MPL syntax.

API gotcha: The create API rejects `query.metricsDataset` and `query.mpl` even though GET responses for existing metrics dashboards may include them. For create payloads, put the full MPL string in `query.apl` and omit both fields.
{
"type": "TimeSeries",
"query": {
"apl": "`otel-metrics`:`http.server.duration`\n| where `service.name` == \"api\"\n| align to 1m using avg\n| group by `service.name` using avg"
}
}
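When a metrics chart is copied from a GET response into a create payload, the rejected fields have to be stripped first. The Python sketch below illustrates one way to do that; the function name `to_create_payload` is hypothetical, and the set of rejected fields is taken only from what this document states (query.metricsDataset, query.mpl, and chart-level decimals).

```python
import copy

# Fields this document says GET may return but create rejects.
QUERY_FIELDS_REJECTED_ON_CREATE = ("metricsDataset", "mpl")
CHART_FIELDS_REJECTED_ON_CREATE = ("decimals",)


def to_create_payload(chart: dict) -> dict:
    """Return a cleaned copy of a fetched chart; the input is left untouched."""
    cleaned = copy.deepcopy(chart)
    query = cleaned.get("query", {})
    # If the MPL string only lives in query.mpl, move it into query.apl.
    if query.get("mpl") and not query.get("apl"):
        query["apl"] = query["mpl"]
    for field in QUERY_FIELDS_REJECTED_ON_CREATE:
        query.pop(field, None)
    for field in CHART_FIELDS_REJECTED_ON_CREATE:
        cleaned.pop(field, None)
    return cleaned
```

The MPL string itself is passed through unchanged, so it should still be validated separately before the payload is deployed.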
Validate queries with scripts/metrics/metrics-query before embedding in dashboard JSON.
See reference/metrics-mpl.md for the full contract and discovery scripts.
Note: Dashboard queries inherit time from the UI picker—no explicit _time filter needed.
Validation: TimeSeries, Statistic, Table, Pie, LogStream, Note, MonitorList are fully validated by dashboard-validate. Heatmap, ScatterPlot, SmartFilter work but may trigger warnings.
When: Single KPI, current value, threshold comparison.
['logs']
| where service == "api"
| summarize
total = count(),
errors = countif(status >= 500)
| extend error_rate = round(100.0 * errors / total, 2)
| project error_rate
Pitfalls: Don't use for time series; ensure query returns single row.
When: Trends over time, before/after comparison, rate changes.
// Single metric - use bin_auto for automatic sizing
['logs']
| summarize ['req/min'] = count() by bin_auto(_time)
// Latency percentiles - use percentiles_array for proper overlay
['logs']
| summarize percentiles_array(duration_ms, 50, 95, 99) by bin_auto(_time)
Best practices:
- Use `bin_auto(_time)` instead of fixed `bin(_time, 1m)`: it auto-adjusts to the time window
- Use `percentiles_array()` instead of multiple `percentile()` calls: it renders as one chart
- Limit the number of series with `top N` or a filter

When: Top-N lists, detailed breakdowns, exportable data.
['logs']
| where status >= 500
| summarize errors = count() by route, error_message
| top 10 by errors
| project route, error_message, errors
Pitfalls:
- Use `top N` to prevent unbounded results
- Use `project` to control column order and names

When: Share-of-total for LOW cardinality dimensions (≤6 slices).
['logs']
| summarize count() by status_class = case(
status < 300, "2xx",
status < 400, "3xx",
status < 500, "4xx",
"5xx"
)
Pitfalls:
When: Raw event inspection, debugging, evidence gathering.
['logs']
| where service == "api" and status >= 500
| project-keep _time, trace_id, route, status, error_message, duration_ms
| take 100
Pitfalls:
- Include `take N` (100-500 max)
- Use `project-keep` to show relevant fields only

When: Distribution visualization, latency patterns, density analysis.
['logs']
| summarize histogram(duration_ms, 15) by bin_auto(_time)
Best for: Latency distributions, response time patterns, identifying outliers.
When: Correlation between two metrics, identifying patterns.
['logs']
| summarize avg(duration_ms), avg(resp_size_bytes) by route
Best for: Response size vs latency correlation, resource usage patterns.
When: Interactive filtering for the entire dashboard.
SmartFilter is a chart type that creates dropdown/search filters. Requires:
- A SmartFilter chart with filter definitions
- `declare query_parameters` in each panel query

Filter types:

- `selectType: "apl"` (dynamic dropdown from an APL query)
- `selectType: "list"` (static dropdown with predefined options)
- `type: "search"` (free-text input)

Panel query pattern:
declare query_parameters (country_filter:string = "");
['logs'] | where isempty(country_filter) or ['geo.country'] == country_filter
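The predicate in the panel query above means "an empty filter matches everything". The Python sketch below mirrors that logic on in-memory events for illustration only; the function name `passes_filter` and the sample events are hypothetical, and it treats an empty string as the only "empty" value.

```python
def passes_filter(event: dict, country_filter: str = "") -> bool:
    """Mirror of: isempty(country_filter) or ['geo.country'] == country_filter."""
    return country_filter == "" or event.get("geo.country") == country_filter


events = [{"geo.country": "DE"}, {"geo.country": "US"}, {"geo.country": "DE"}]
unfiltered = [e for e in events if passes_filter(e)]
german_only = [e for e in events if passes_filter(e, "DE")]
```

This is why the default value in `declare query_parameters` is the empty string: a panel still shows all data until the user picks a value in the filter.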
See reference/smartfilter.md for full JSON structure and cascading filter examples.
When: Display monitor status on operational dashboards.
No APL needed—select monitors from the UI. Shows:
When: Context, instructions, section headers.
Use GitHub Flavored Markdown for:
Charts support JSON configuration options beyond the query. See reference/chart-config.md for full details.
Quick reference:
| Chart Type | Key Options |
|------------|-------------|
| Statistic | colorScheme, customUnits, unit, showChart (sparkline), errorThreshold/warningThreshold |
| TimeSeries | aggChartOpts: variant (line/area/bars), scaleDistr (linear/log), displayNull |
| LogStream/Table | tableSettings: columns, fontSize, highlightSeverity, wrapLines |
| Pie | hideHeader |
| Note | text (markdown), variant |
Common options (all charts):
- `overrideDashboardTimeRange`: boolean
- `overrideDashboardCompareAgainst`: boolean
- `hideHeader`: boolean

Dashboard panel queries do NOT need explicit time filters. The dashboard UI time picker automatically scopes all queries to the selected time window.
// DASHBOARD QUERY — no time filter needed
['logs']
| where service == "api"
| summarize count() by bin_auto(_time)
Ad-hoc queries (Axiom Query tab, axiom-sre exploration) MUST have explicit time filters:
// AD-HOC QUERY — always include time filter
['logs']
| where _time between (ago(1h) .. now())
| where service == "api"
| summarize count() by bin_auto(_time)
Prefer bin_auto(_time) — it automatically adjusts to the dashboard time window.
Manual bin sizes (only when auto doesn't fit your needs):
| Time window | Bin size |
|-------------|----------|
| 15m | 10s–30s |
| 1h | 1m |
| 6h | 5m |
| 24h | 15m–1h |
| 7d | 1h–6h |
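If a manual bin size has to be chosen programmatically, the table above can be encoded as a lookup. This is a minimal Python sketch: where the table gives a range, the lower bound is picked, and the fallback beyond 7 days is an assumption, not something this document specifies.

```python
from datetime import timedelta

# (upper bound of the time window, suggested bin); each bin is the lower
# end of the range given in the table above.
BIN_TABLE = [
    (timedelta(minutes=15), "10s"),
    (timedelta(hours=1), "1m"),
    (timedelta(hours=6), "5m"),
    (timedelta(hours=24), "15m"),
    (timedelta(days=7), "1h"),
]


def suggest_bin(window: timedelta) -> str:
    """Return a bin size string for the smallest table row covering the window."""
    for upper, bin_size in BIN_TABLE:
        if window <= upper:
            return bin_size
    return "6h"  # assumption: the table does not cover windows longer than 7d


three_hours = suggest_bin(timedelta(hours=3))
```

In most panels `bin_auto(_time)` remains the simpler choice; a lookup like this only matters when a fixed bin is required.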
Prevent query explosion:
// GOOD: bounded
| summarize count() by route | top 10 by count_
// BAD: unbounded high-cardinality grouping
| summarize count() by user_id // millions of rows
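The effect of bounding a grouping can be seen on a small in-memory sample. The Python sketch below only illustrates the idea behind `top 10 by count_`; the sample routes and the function name `top_n` are made up for the example.

```python
from collections import Counter


def top_n(values, n=10):
    """Count occurrences and keep only the n most frequent groups."""
    return Counter(values).most_common(n)


routes = ["/login", "/cart", "/login", "/search", "/login", "/cart"]
top_two = top_n(routes, 2)
```

However many distinct values the input has, the result never holds more than `n` rows, which is what keeps a Table or TimeSeries panel readable and cheap.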
Fields with dots need bracket notation:
| where ['kubernetes.pod.name'] == "frontend"
Fields with dots IN the name (not hierarchy) need escaping:
| where ['kubernetes.labels.app\\.kubernetes\\.io/name'] == "frontend"
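When queries are generated from code, the bracket notation and escaping can be produced by a small helper. This Python sketch mirrors the escaping exactly as written in the example above (two backslashes before a dot that is part of a name); the function name `apl_field` is hypothetical, and whether one or two backslashes are needed may depend on where the query string ends up, so check against your own queries.

```python
def apl_field(segments: list) -> str:
    """Build a bracket-notation field reference from path segments.

    A dot between segments is a hierarchy separator; a dot inside a
    segment is part of the name and gets escaped.
    """
    # The replacement below is two backslash characters followed by a dot,
    # matching the form shown in the example above.
    escaped = [segment.replace(".", "\\\\.") for segment in segments]
    return "['" + ".".join(escaped) + "']"


pod_name = apl_field(["kubernetes", "pod", "name"])
label = apl_field(["kubernetes", "labels", "app.kubernetes.io/name"])
```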
Traffic:
| summarize requests = count() by bin_auto(_time)
Errors (as rate %):
| summarize total = count(), errors = countif(status >= 500) by bin_auto(_time)
| extend error_rate = iff(total > 0, round(100.0 * errors / total, 2), 0.0)
| project _time, error_rate
Latency (use percentiles_array for proper chart overlay):
| summarize percentiles_array(duration_ms, 50, 95, 99) by bin_auto(_time)
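The error-rate expression above guards against empty time bins with `iff(total > 0, ...)`. The same arithmetic in Python, as a worked example (the function name `error_rate` is chosen for illustration):

```python
def error_rate(total: int, errors: int) -> float:
    """Percentage of failed requests, rounded to 2 decimals; 0.0 for empty bins."""
    if total <= 0:
        # Same role as iff(total > 0, ..., 0.0): avoid dividing by zero.
        return 0.0
    return round(100.0 * errors / total, 2)


busy_bin = error_rate(200, 5)
empty_bin = error_rate(0, 0)
```

Without the guard, a bin with no traffic would divide by zero and leave gaps or spikes in the chart.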
Row 0-1: [Stat w=3] [Stat w=3] [Stat w=3] [Stat w=3]
Row 2-5: [TimeSeries w=6, h=4] [TimeSeries w=6, h=4]
Row 6-9: [Table w=6, h=4] [Pie w=6, h=4]
Row 10+: [LogStream w=12, h=6]
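The row sketch above translates directly into layout entries. The Python below is a minimal generator under two assumptions: the grid is 12 columns wide (implied by the widths above), and the chart IDs are placeholders that would have to match real chart `id` values.

```python
GRID_COLUMNS = 12  # assumption: implied by w=12 for a full-width panel


def build_layout(rows):
    """rows: list of (height, [(chart_id, width), ...]) from top to bottom."""
    layout = []
    y = 0
    for height, panels in rows:
        x = 0
        for chart_id, width in panels:
            if x + width > GRID_COLUMNS:
                raise ValueError(f"row at y={y} is wider than {GRID_COLUMNS} columns")
            layout.append({"i": chart_id, "x": x, "y": y, "w": width, "h": height})
            x += width
        y += height  # the next row starts below the current one
    return layout


layout = build_layout([
    (2, [("error-rate", 3), ("p95-latency", 3), ("traffic-rps", 3), ("saturation", 3)]),
    (4, [("traffic-trend", 6), ("latency-trend", 6)]),
    (4, [("top-errors", 6), ("status-split", 6)]),
    (6, [("recent-errors", 12)]),
])
```

Generating x and y from row definitions avoids overlapping panels, which are easy to introduce when coordinates are typed by hand.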
Dashboards auto-refresh at the configured interval. Options: 15s, 30s, 1m, 5m, etc.
⚠️ Query cost warning: Short refresh (15s) + long time range (90d) = expensive queries running constantly.
Recommendations:

| Use case | Refresh rate |
|----------|-------------|
| Oncall/real-time | 15s–30s |
| Team health | 1m–5m |
| Executive/weekly | 5m–15m |
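A rough sense of the query load behind a refresh setting comes from simple arithmetic. The Python sketch below assumes every panel reruns its query once per refresh for each open dashboard; that is a simplification for estimation, not a statement about how the console schedules queries.

```python
def queries_per_hour(panel_count: int, refresh_seconds: int) -> int:
    """Estimate queries per hour for one open dashboard (assumed: 1 query per panel per refresh)."""
    if refresh_seconds <= 0:
        raise ValueError("refresh_seconds must be positive")
    return panel_count * (3600 // refresh_seconds)


oncall = queries_per_hour(9, 15)      # 9 panels, 15s refresh
weekly = queries_per_hour(9, 5 * 60)  # 9 panels, 5m refresh
```

Under that assumption a 9-panel dashboard at 15s issues 20 times as many queries per hour as the same dashboard at 5m, which is why short refresh intervals pair badly with long time ranges.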
All dashboards created via API tokens are shared with everyone in the org (owner: "X-AXIOM-EVERYONE"). Private dashboards are not supported with API tokens.
Data visibility is still governed by dataset permissions—users only see data from datasets they can access.
?t_qr=24h (quick range), ?t_ts=...&t_te=... (custom), ?t_against=-1d (comparison)
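These parameters can be appended to a URL that was obtained from dashboard-link. The Python sketch below only adds query parameters to an existing URL and does not construct the base URL; the function name `with_time_params` and the example URL are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_time_params(url: str, quick_range: str = "", against: str = "") -> str:
    """Add t_qr / t_against to a dashboard URL, keeping existing parameters."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    if quick_range:
        params["t_qr"] = quick_range
    if against:
        params["t_against"] = against
    return urlunsplit(parts._replace(query=urlencode(params)))


# Placeholder URL; in practice use the output of scripts/dashboard-link.
link = with_time_params("https://example.invalid/dashboards/abc", "24h", "-1d")
```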
Run scripts/setup to check requirements (curl, jq, ~/.axiom.toml).
Config in ~/.axiom.toml (shared with axiom-sre):
[deployments.prod]
url = "https://api.axiom.co"
token = "xaat-your-token"
org_id = "your-org-id"
| Script | Usage |
|--------|-------|
| scripts/dashboard-list <deploy> | List all dashboards |
| scripts/dashboard-get <deploy> <id> | Fetch dashboard JSON |
| scripts/dashboard-validate <file> | Validate JSON structure |
| scripts/dashboard-create <deploy> <file> | Create dashboard |
| scripts/dashboard-update <deploy> <id> <file> | Update (needs version) |
| scripts/dashboard-copy <deploy> <id> | Clone dashboard |
| scripts/dashboard-link <deploy> <id> | Get shareable URL |
| scripts/dashboard-delete <deploy> <id> | Delete (with confirm) |
| scripts/axiom-api <deploy> <method> <path> | Dashboard/app API only (rewrites to app.*). For data/metrics endpoints use scripts/metrics/axiom-api |
| scripts/metrics/axiom-api <deploy> <method> <path> | Data/metrics API (supports AXIOM_URL_OVERRIDE for edge routing) |
| scripts/metrics/datasets <deploy> | List datasets with kind and edge deployment |
| scripts/metrics/metrics-spec <deploy> <dataset> | Fetch MPL query specification |
| scripts/metrics/metrics-info <deploy> <dataset> ... | Discover metrics, tags, and values |
| scripts/metrics/metrics-query <deploy> <mpl> <start> <end> | Execute a metrics query |
⚠️ Two `axiom-api` scripts exist with different behaviors. `scripts/axiom-api` rewrites URLs for the dashboard app API (app.*). `scripts/metrics/axiom-api` uses raw URLs and supports edge deployment routing. Using the wrong one will produce 404 errors.
⚠️ CRITICAL: Always validate queries BEFORE deploying.
APL workflow:
- Run `dashboard-validate` to check structure
- Run `dashboard-create` or `dashboard-update` to deploy
- Run `dashboard-link` to get the URL. NEVER construct Axiom URLs manually (org IDs and base URLs vary per deployment)

Metrics/MPL workflow:

- Run `scripts/metrics/metrics-spec` to learn MPL syntax
- Run `scripts/metrics/metrics-info` to discover metrics and tags
- Validate queries with `scripts/metrics/metrics-query` using an explicit time range
- Put the MPL string in `query.apl` only (do not set `query.metricsDataset` or `query.mpl`)
- Run `dashboard-validate` to check structure
- Run `dashboard-create` or `dashboard-update` to deploy
- Run `dashboard-link` to get the URL

spl-to-apl: Translate Splunk SPL → APL. Map timechart → TimeSeries, stats → Statistic/Table. See reference/splunk-migration.md.
axiom-sre: Discover schema with getschema, explore baselines, identify dimensions, then productize into panels.
query-metrics: Discover metrics datasets, metric names, tags, and tag values. Metrics discovery scripts are also vendored locally in scripts/metrics/.
Pre-built templates in reference/templates/:
| Template | Use case |
|----------|----------|
| service-overview.json | Single service oncall dashboard with Heatmap |
| service-overview-with-filters.json | Same with SmartFilter (route/status dropdowns) |
| api-health.json | HTTP API with traffic/errors/latency |
| blank.json | Minimal skeleton |
Placeholders: {{service}}, {{dataset}}
Usage:
scripts/dashboard-from-template service-overview "my-service" "my-dataset" ./dashboard.json
scripts/dashboard-validate ./dashboard.json
scripts/dashboard-create prod ./dashboard.json
⚠️ Templates assume field names (service, status, route, duration_ms). Discover your schema first and use sed to fix mismatches.
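Placeholder substitution of the kind the templates rely on can be sketched in a few lines of Python. This is not the implementation of dashboard-from-template; the function name `render_template` is hypothetical, and the sketch fails loudly on a placeholder with no value so that a half-rendered dashboard is never deployed.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_template(text: str, values: dict) -> str:
    """Replace {{name}} placeholders; raise if a placeholder has no value."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value for placeholder {name!r}")
        return values[name]

    return PLACEHOLDER.sub(substitute, text)


query = render_template(
    "['{{dataset}}'] | where service == '{{service}}'",
    {"dataset": "my-dataset", "service": "my-service"},
)
```

Field-name mismatches are a separate problem: substitution fills in the dataset and service, but fields such as `duration_ms` still have to exist in your schema.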
| Problem | Cause | Solution |
|---------|-------|----------|
| "unable to find dataset" errors | Dataset name doesn't exist in your org | Check available datasets in Axiom UI |
| "creating private dashboards" 403 | API tokens can only create shared dashboards | Use owner: "X-AXIOM-EVERYONE" (the default) |
| All panels show errors | Field names don't match your schema | Discover schema first, use sed to fix field names |
| Dashboard shows no data | Service filter too restrictive | Remove or adjust where service == 'x' filters |
| Queries time out | Missing time filter or too broad | Dashboard inherits time from picker; ad-hoc queries need explicit time filter |
| Wrong org in dashboard URL | Manually constructed URL | Always use dashboard-link <deploy> <id> — never guess org IDs or base URLs |
| getschema returns 0 rows | Dataset is otel:metrics:v1, not events | Run scripts/metrics/datasets <deploy> to check kind; use scripts/metrics/metrics-info for metrics discovery |
| Metrics discovery returns empty | Sparse metrics (sensors, batch, cron) outside default 24h window | Retry with --start set to 7 days ago; some metrics only report intermittently |
| 404 from metrics API calls | Used scripts/axiom-api (dashboard) instead of scripts/metrics/axiom-api (data) | Use scripts/metrics/axiom-api for all /v1/query/, /v1/datasets paths |
| find-metrics returns unexpected results | It searches tag values, not metric names | Use metrics-info <deploy> <dataset> metrics to list metric names; find-metrics finds metrics associated with a known tag value |
| metricsDataset rejected on create | Create API does not accept query.metricsDataset for metrics charts | Put the full MPL pipeline in query.apl only |
| query.mpl rejected on create | GET may return query.mpl for existing metrics charts, but create expects query.apl | Move/copy the MPL string into query.apl before deploy |
| decimals rejected on create | Create API does not accept chart-level decimals even though GET may return it | Omit decimals from create payloads |
- reference/chart-config.md: All chart configuration options (JSON)
- reference/metrics-mpl.md: Metrics/MPL chart contract and discovery scripts
- reference/smartfilter.md: SmartFilter/FilterBar full configuration
- reference/chart-cookbook.md: APL patterns per chart type
- reference/layout-recipes.md: Grid layouts and section blueprints
- reference/splunk-migration.md: Splunk panel → Axiom mapping
- reference/design-playbook.md: Decision-first design principles
- reference/templates/: Ready-to-use dashboard JSON files

For APL syntax: https://axiom.co/docs/apl/introduction