plugin/skills/telemetry-stacks/SKILL.md
Use this skill when querying a production telemetry stack, when authoring or reviewing an analysis workflow that needs vendor query examples, or when identifying which stack is deployed from `CLAUDE.md`, helm charts, `prometheus-operator` CRDs, or an OTel collector config — a knowledge skill providing a production telemetry stack reference covering Prometheus + Grafana, Datadog, Honeycomb, New Relic, Sentry, and OpenTelemetry + Tempo / Jaeger, with per-stack ingestion model, query patterns, UI surface, and when each stack is the right choice. Loaded by `/analyze-prod`, `/analyze-local`, `/env-analyze`, `/infra-change`, and `/bugfix` workflows when production-context diagnosis needs vendor-specific query syntax.
npx skillsauth add avav25/ai-assets telemetry-stacksInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference for the production telemetry stacks supported by /analyze-prod and related analysis workflows. Identify the stack by checking CLAUDE.md, helm charts, prometheus-operator CRDs, or OTel collector configuration, then query directly using the patterns below. Pair with @observability-methods to map the surfaced signals (Latency / Traffic / Errors / Saturation, or Rate / Errors / Duration, or Utilization / Saturation / Errors) onto vendor queries.
| Stack | Where to look | Common queries |
|---|---|---|
| Prometheus + Grafana | Dashboards, Alertmanager, prometheus-operator CRDs | rate(http_requests_total[5m]), histogram_quantile(0.99, ...bucket[5m]), up == 0 |
| Datadog | APM, Service Catalog, Logs | service:x error_rate, @http.status_code:5*, anomaly monitors |
| Honeycomb | Tracing + BubbleUp | WHERE service.name=x \| HEATMAP duration_ms, BubbleUp on slow traces |
| New Relic | APM + Distributed Tracing | SELECT percentile(duration,50,95,99) FROM Transaction WHERE appName='x' |
| Sentry | Issues filtered to release | error patterns + breadcrumbs (frontend / mobile) |
| OTel + Tempo / Jaeger | Tempo / Jaeger UI | trace search by service.name, trace.id |
Ingestion: pull-based scrape from /metrics endpoints; service discovery via Kubernetes API; long-term storage often via Thanos / Mimir / Cortex. PromQL is the query language.
Surface: Grafana dashboards (panels), Alertmanager (alerts), Prometheus UI (ad-hoc PromQL).
Common queries:
rate(http_requests_total[5m])histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))up == 0rate(container_cpu_cfs_throttled_seconds_total[5m])Right choice when: Kubernetes-native shop, open-source preference, cost-sensitive, custom application metrics dominate.
Ingestion: push-based via Datadog Agent (per node) and APM tracer libraries (in-process). Auto-instruments common frameworks; log collection via agent tail.
Surface: APM service map and traces, Service Catalog, Logs Explorer, Dashboards, Monitors (alerts), Watchdog (anomaly detection).
Common queries:
service:x error_rate@http.status_code:5*avg:trace.http.request.duration{service:x} by {resource_name}Right choice when: managed SaaS preferred, multi-language polyglot stack, want APM + logs + metrics + RUM in one product, budget allows.
Ingestion: push-based via OpenTelemetry or Honeycomb Beelines; event-based model — each event is a wide-column record with high-cardinality attributes.
Surface: Query Builder, BubbleUp (attribute correlation), Triggers (alerts), Service Map.
Common queries:
WHERE service.name=x | HEATMAP duration_msGROUP BY user.id, ORDER BY P99(duration_ms) DESCRight choice when: high-cardinality debugging (per-customer, per-tenant, per-endpoint), tracing-first culture, hard-to-reproduce intermittent issues.
Ingestion: push-based via New Relic agents (per language) and OTel; NRQL (New Relic Query Language) for queries.
Surface: APM, Distributed Tracing, Logs, Browser, Infrastructure, Alerts.
Common queries:
SELECT percentile(duration, 50, 95, 99) FROM Transaction WHERE appName='x'SELECT percentage(count(*), WHERE error IS true) FROM Transaction WHERE appName='x'SELECT rate(count(*), 1 minute) FROM Transaction WHERE appName='x' FACET nameRight choice when: enterprise with mixed legacy + modern stacks, want bundled APM + Infra + Logs + Browser, NRQL familiarity exists.
Ingestion: push-based via Sentry SDKs (per language / framework); event-based — each event is an error / exception with stack trace, breadcrumbs, and context.
Surface: Issues view (grouped errors), Releases, Performance (transaction tracing), Alerts.
Common queries / filters:
release:1.2.3 environment:productionusers affected, descendingRight choice when: frontend / mobile error tracking is the priority, want error-grouping + release correlation, paired with another stack for infra metrics.
Ingestion: push-based via OTel SDKs and OTel Collector; vendor-neutral semantic conventions. Tempo (Grafana Labs) and Jaeger (CNCF) are open-source trace storage backends.
Surface: Tempo UI or Jaeger UI; Grafana panels (TraceQL in Tempo); often paired with Prometheus for metrics and Loki for logs (the "LGTM stack" — Loki, Grafana, Tempo, Mimir).
Common queries:
{ service.name = "checkout" && duration > 1s } (TraceQL)trace.id into the UIhttp.status_code, db.statement, error=trueRight choice when: vendor-neutral instrumentation is a priority, open-source stack preferred, already running Grafana + Prometheus and want tracing in the same UI.
| Constraint | Likely stack | |---|---| | Kubernetes-native, OSS, custom metrics dominate | Prometheus + Grafana (+ Tempo for tracing) | | Managed SaaS, polyglot, all-in-one | Datadog or New Relic | | High-cardinality debugging | Honeycomb | | Frontend / mobile error tracking | Sentry (alongside another stack) | | Vendor-neutral, OSS tracing-first | OTel + Tempo / Jaeger |
| Workflow | Apply this knowledge |
|---|---|
| /analyze-prod (snapshot phase) | Identify the stack from CLAUDE.md / helm / CRDs, then run vendor-specific queries to surface the signals chosen via @observability-methods |
| /analyze-local (Docker logs) | If a local stack ships Prometheus / Grafana / OTel containers, use the same query patterns against the local endpoints |
| /env-analyze (multi-scope) | Cross-reference K8s pod state with telemetry-stack metrics (Datadog Service Map, Prometheus up, Honeycomb heatmap) |
| /infra-change (post-apply verify) | Query the stack post-apply to confirm no SLO regression — Prometheus histogram_quantile, Datadog monitors, NRQL percentile queries |
| /bugfix (production-context bugs) | Pull traces (Honeycomb / Tempo / Jaeger / New Relic) and Sentry issues filtered to the offending release to localize the bug |
/analyze-prod, /analyze-local, /env-analyze, /infra-change, /bugfix@observability-methods (which signals to query for — Golden Signals / RED / USE / Tracing), @cloud-platforms (managed-service native metrics via cloud CLI), @deployment-procedures (post-deploy verification queries)prometheus-operator CRDsdevelopment
Use this skill when running the recurring (daily) knowledge-base rescan for a repo that already has knowledge/.knowledge-sync.yml — the main-thread dispatcher that reads the config, computes the git delta since last_scanned_sha, maps changed paths to affected doc areas, early-exits cheaply when nothing changed, then fans out one Agent(content-writer) per affected area, applies the propose/direct update policy, advances the baseline only on success, and writes an L4 run log — all with the G1 untrusted-content choke-point, secret-scan, deny-list, and budget controls woven in. For first-time setup use /knowledge-sync-init.
development
Use this skill when bootstrapping scheduled knowledge-base sync for a repo that has no knowledge/.knowledge-sync.yml yet — to run one-time setup that detects the knowledge_root from CLAUDE.md/AGENTS.md, maps doc areas to source globs, records opt-in external sources (Linear/Notion/WebFetch, all disabled by default), captures a baseline last_scanned_sha, sets the per-area update policy, generates or seeds knowledge/CONVENTIONS.md, provisions the L4 memory dir, and offers to register the daily routine. Routes ongoing recurring sync operations to /knowledge-sync.
tools
Use this skill when bootstrapping a target repository to be ai-skills-aware — on the first run of any ai-skills workflow in a fresh repo, when adopting the ai-skills plugin in an existing repo, or after upgrading to a plugin version that adds new memory paths or templates, including when the user does not say "init" but asks to "set up" or "onboard" the repo — to detect codebase type, create CLAUDE.md + AGENTS.md scaffolding, initialize the .ai-skills-memory/ directory tree from L1 templates, and configure .gitignore. Idempotent — safe to re-run. Accepts `--codebase-type <type>` and `--overwrite`. Not for re-initializing only memory — use `/memory-init` instead.
tools
Use this skill when extending, repairing, or improving plugin assets, when ingesting a `/feedback` report as a fix-cycle backlog, or when you do not remember which lower-level command is right for the job — the umbrella workflow for ai-skills plugin-asset authoring and maintenance: creating, auditing, fixing, improving, refactoring, and migrating skills, agents, rules, hooks, prompts, schemas, and rubrics inside the plugin. Auto-classifies the request, loads the right knowledge skills (`@prompt-engineering`, `@context-engineering`, `@team-protocols`), and spawns the right subagents (`prompt-engineer`, `system-architect`, `python-engineer`, `software-engineer`, `qa-engineer`, `eval-judge`) via the `Agent` tool.