skills/nestjs-lgtm-metrics/SKILL.md
Add application-level Prometheus metrics to a NestJS app using the OpenTelemetry SDK and an OTel Collector. Covers HTTP RED metrics, Node.js runtime metrics, business metrics, and advanced observability patterns. Use this skill whenever the user wants to add metrics, dashboards, or monitoring to a NestJS application, mentions Prometheus, Grafana metrics, OpenTelemetry metrics, OTel Collector, or asks about SLIs/SLOs, or application-level monitoring in a NestJS context. Also trigger when the user wants custom counters, histograms, gauges, or summaries in NestJS. This skill focuses exclusively on metrics the APPLICATION must emit. It does not cover logging, tracing, infra-level metrics from cAdvisor, node-exporter, postgres-exporter, or redis-exporter.
Install this skill globally with one command: npx skillsauth add cmglezpdev/custom-skills nestjs-lgtm-metrics
Works with Claude Code, Cursor, and Windsurf.
This skill adds application-level metrics to a NestJS app using the OpenTelemetry SDK, pushing them through an OTel Collector into Prometheus. It is organized in progressive stages, from foundational to advanced.
Scope boundary: This skill covers metrics only. Not logging, not tracing. Infrastructure metrics (container resources, host stats, database server stats, Redis server stats) come from their respective exporters and community Grafana dashboards. This skill covers what those exporters cannot see: request-level behavior, business events, runtime internals, and domain-specific KPIs as observed from inside your application.
NestJS App (OTel SDK) ──OTLP/gRPC──▶ OTel Collector ──remote_write──▶ Prometheus ──▶ Grafana
The app never exposes a /metrics endpoint. It pushes metrics to the Collector via OTLP. The Collector handles export to Prometheus (via prometheusremotewrite exporter or a Prometheus scrape on the Collector itself).
Read references/otel-setup.md before writing any application metric code. It covers the SDK bootstrap, Collector pipeline config, and how to verify the pipeline is working end to end.
Each stage builds on the previous. Read the corresponding reference file for full implementation code.
Read:
references/stage-1-foundation.md
Non-negotiable. Every NestJS app in production needs these from the start.
What you get:
- http_request_duration_seconds histogram, labeled by method, route, status_code
- http_requests_total counter, labeled by method, route, status_code
- http_requests_in_flight up/down counter
- nodejs_eventloop_lag_seconds histogram (not just the mean; percentiles matter)
- nodejs_active_handles_total observable gauge
- nodejs_active_requests_total observable gauge
- @opentelemetry/host-metrics
- app_health_check_duration_seconds histogram by check_name
- app_health_check_status gauge (1=healthy, 0=unhealthy) by check_name

Why these first: HTTP RED gives you the ability to detect problems. Runtime metrics tell you whether the Node.js process itself is degraded. Health checks close the loop with your orchestrator. Together they answer: "Is the app working? How fast? For whom is it failing?"
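The three HTTP RED instruments above can be fed from a single wrapper around the request handler. The sketch below is a minimal illustration, not the code from references/stage-1-foundation.md: `HistogramLike` and `CounterLike` are local stand-ins shaped like the `record` and `add` methods of OpenTelemetry instruments, and `observeHttpRequest`, `RouteInfo`, and the "assume 500 on throw" rule are hypothetical choices made for the example.

```typescript
// Local stand-ins shaped like OpenTelemetry instruments (record/add), so the
// sketch runs without the SDK. Real instruments would come from meter.create*().
type MetricAttrs = Record<string, string | number>;
interface HistogramLike { record(value: number, attributes?: MetricAttrs): void; }
interface CounterLike { add(value: number, attributes?: MetricAttrs): void; }

interface HttpRedInstruments {
  duration: HistogramLike; // http_request_duration_seconds
  total: CounterLike;      // http_requests_total
  inFlight: CounterLike;   // http_requests_in_flight (up/down counter)
}

// Hypothetical request descriptor. `route` must be the route template
// (e.g. "/users/:id"), never the raw URL, to keep cardinality bounded.
interface RouteInfo { method: string; route: string; }

async function observeHttpRequest<T>(
  instruments: HttpRedInstruments,
  req: RouteInfo,
  handler: () => Promise<{ statusCode: number; body: T }>,
): Promise<T> {
  const startMs = performance.now();
  let statusCode = 500; // assumption: a thrown error counts as a 500
  const base = { method: req.method, route: req.route };
  instruments.inFlight.add(1, base);
  try {
    const result = await handler();
    statusCode = result.statusCode;
    return result.body;
  } finally {
    // Recorded in finally so failed requests are measured as well.
    const seconds = (performance.now() - startMs) / 1000;
    const attrs = { ...base, status_code: statusCode };
    instruments.duration.record(seconds, attrs);
    instruments.total.add(1, attrs);
    instruments.inFlight.add(-1, base);
  }
}
```

In a NestJS app this logic would typically live in an interceptor or middleware, where the status code can be read from the response or from the thrown exception instead of being assumed.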
Histogram bucket strategy: Use explicit bucket boundaries initially: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]. Tune after you have real traffic data. Do not over-bucket. More buckets = more time series = more Prometheus storage cost.
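To make the storage cost concrete: in Prometheus, each histogram stores one `_bucket` series per boundary, one for `le="+Inf"`, plus `_sum` and `_count`, for every attribute combination. The sketch below applies that arithmetic to the boundaries listed above. The function names and the route, method, and status-code counts are hypothetical.

```typescript
// Bucket boundaries from the text above, in seconds.
const LATENCY_BUCKETS_SECONDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10];

// Series stored for one histogram: boundaries + the +Inf bucket + _sum + _count,
// multiplied by the number of distinct attribute combinations.
function histogramSeriesCount(boundaries: readonly number[], attributeCombinations: number): number {
  return (boundaries.length + 1 + 2) * attributeCombinations;
}

// Index of the first bucket whose inclusive upper bound holds the value.
// An index equal to boundaries.length means the +Inf overflow bucket.
function bucketIndexFor(value: number, boundaries: readonly number[]): number {
  const index = boundaries.findIndex((upper) => value <= upper);
  return index === -1 ? boundaries.length : index;
}

// Hypothetical app: 40 routes x 4 methods x 6 observed status codes.
const attributeCombinations = 40 * 4 * 6;
console.log(histogramSeriesCount(LATENCY_BUCKETS_SECONDS, attributeCombinations)); // 13440
```

Doubling the boundary list to 22 entries under the same assumptions would raise the total to 24000 series for this one metric, which is why tuning should wait for real traffic data.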
Read:
references/stage-2-app-intelligence.md
These metrics expose application behavior that sits between raw HTTP and business logic.
What you get:
Database query metrics from the app's perspective
- app_db_query_duration_seconds histogram by operation (select, insert, update, delete), entity
- app_db_query_errors_total counter by operation, error_type
- app_db_connection_pool_wait_seconds histogram

Redis operation metrics from the app's perspective
- app_redis_operation_duration_seconds histogram by operation (get, set, del, etc.), key_prefix
- app_redis_operation_errors_total counter
- app_cache_hit_total / app_cache_miss_total counters by cache_name
- redis-exporter reports keyspace_hits/misses only globally. Your app knows WHICH cache strategy worked and which didn't.

External API call metrics
- app_external_api_duration_seconds histogram by service, endpoint, status_code
- app_external_api_errors_total counter by service, error_type
- app_circuit_breaker_state gauge (0=closed, 0.5=half-open, 1=open) by service

Authentication and authorization metrics
- app_auth_attempts_total counter by method (password, oauth, api_key), result (success, failure, locked)
- app_auth_token_operations_total counter by operation (issue, refresh, revoke, expire)
- app_authorization_denied_total counter by resource, action

Background job / queue metrics (Bull, BullMQ, or similar)
- app_job_duration_seconds histogram by queue, job_type
- app_job_completed_total counter by queue, job_type, status (completed, failed, stalled)
- app_job_queue_depth gauge by queue
- app_job_attempts_total counter by queue, job_type (tracks retry behavior)

Read:
references/stage-3-business-metrics.md
Once users arrive, technical metrics aren't enough. You need to observe the business through the same system.
What you get:
User lifecycle events
- app_user_signups_total counter by source, plan
- app_user_logins_total counter by method
- app_user_actions_total counter by action_type (generic, high-cardinality-safe pattern)

Transaction / conversion metrics
- app_transactions_total counter by type, status
- app_transaction_value_total counter (sum of monetary values) by currency
- app_conversion_funnel_step_total counter by funnel, step

Feature usage metrics
- app_feature_usage_total counter by feature_name, variant (useful for A/B tests)
- app_feature_errors_total counter by feature_name

SLI metrics for SLO tracking
- app_sli_request_availability counter (successful requests / total requests, partitioned)
- app_sli_request_latency histogram (tighter buckets around your SLO threshold)

Cardinality warning: Business metrics are where cardinality explosions happen. Never use user_id, session_id, or any unbounded value as an attribute. Use bucketed categories. Read the cardinality section in references/stage-3-business-metrics.md.
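One way to apply the "bucketed categories" advice is to pass every attribute value through a small allow-list or range function before it reaches an instrument. The sketch below is an illustration under assumptions: the helper names, the signup sources, and the amount ranges are all hypothetical and do not come from references/stage-3-business-metrics.md.

```typescript
// Hypothetical allow-list for the `source` attribute of app_user_signups_total.
const KNOWN_SIGNUP_SOURCES: ReadonlySet<string> = new Set(["organic", "referral", "ads", "partner"]);

// Collapse any value outside the allow-list to "other", so arbitrary input
// (campaign IDs, typos, user-supplied strings) cannot create new time series.
function boundedLabel(value: string | undefined, allowed: ReadonlySet<string>): string {
  const normalized = (value ?? "").trim().toLowerCase();
  return allowed.has(normalized) ? normalized : "other";
}

// Turn an unbounded number into one of four fixed range labels.
// Ranges are half-open: [10, 100) maps to "10_to_100".
function amountBucket(amount: number): string {
  if (amount < 10) return "lt_10";
  if (amount < 100) return "10_to_100";
  if (amount < 1000) return "100_to_1000";
  return "gte_1000";
}
```

With helpers like these, the worst-case series count for a metric is the product of the allow-list sizes, which stays fixed no matter how many users or sessions the app sees.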
Read:
references/stage-4-advanced.md
For apps serving real traffic at scale, handling complex flows, or needing tight operational control.
What you get:
WebSocket / real-time connection metrics
- app_ws_connections_active gauge by namespace
- app_ws_messages_total counter by namespace, direction (in/out), event_type
- app_ws_connection_duration_seconds histogram

Rate limiter observability
- app_rate_limit_hits_total counter by limiter, route
- app_rate_limit_remaining gauge by limiter (sampled, not per-request)

Multi-tenant metrics
- app_tenant_requests_total counter by tenant_tier (not tenant_id: cardinality)
- app_tenant_resource_usage gauge by tenant_tier, resource_type

Graceful shutdown and lifecycle
- app_shutdown_duration_seconds histogram
- app_startup_duration_seconds gauge

Deployment version tracking
- app_info gauge with version, commit_sha, environment attributes for canary analysis

Import-ready Grafana dashboard JSON files, one per stage. Each uses ${DS_PROMETHEUS} as the data source variable, so Grafana will prompt you to select your Prometheus data source on import.
How to import: Grafana → Dashboards → New → Import → Upload JSON file → Select your Prometheus data source.
| Template | Panels | Description |
|---|---|---|
| templates/stage-1-service-overview.json | 14 | Golden signals (stat), request rate by route, error rate, latency percentiles, P99 by route, event loop lag, heap memory, active handles, health check status and duration. |
| templates/stage-2-dependency-health.json | 15 | DB query P95 by operation, pool wait time, DB errors, cache hit ratio by strategy, Redis op latency, Redis errors, external API P95, circuit breaker state, API errors, auth failure rate, token ops, authz denials, queue depth, job duration, job failure rate. |
| templates/stage-3-business-metrics.json | 14 | Signups/revenue/success rate big numbers, signups by source, logins by method, revenue rate, transaction failure rate, funnel step totals, feature usage by variant, feature errors, 30-day SLO availability, SLI latency percentiles, error budget burn rate. |
| templates/stage-4-operational.json | 10 | Running versions table, startup/shutdown duration, active WS connections, WS message throughput, WS connection duration, rate limit rejections by route, remaining tokens, request volume by tenant tier, resource usage by tier. |
Each dashboard is standalone. Combine panels across dashboards as your needs evolve. All dashboards include $job and $instance template variables for filtering. Stage 3 additionally includes $funnel and $sli selectors.
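The Stage 3 dashboard lists an error budget burn rate panel. As a worked illustration of that quantity, the sketch below uses the common definition: observed error ratio divided by the error budget (1 minus the SLO target). The function names are hypothetical, and the queries in the JSON templates may compute this differently.

```typescript
// Burn rate = observed error ratio / allowed error ratio (the error budget).
// A burn rate of 1 spends the budget exactly over the SLO window.
function errorBudgetBurnRate(failedRequests: number, totalRequests: number, sloTarget: number): number {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new RangeError("sloTarget must be strictly between 0 and 1");
  }
  if (totalRequests === 0) return 0; // no traffic, nothing burned
  const errorRatio = failedRequests / totalRequests;
  const errorBudget = 1 - sloTarget;
  return errorRatio / errorBudget;
}

// Days until the budget is gone if the current burn rate holds.
function daysUntilBudgetExhausted(burnRate: number, windowDays: number = 30): number {
  return burnRate <= 0 ? Infinity : windowDays / burnRate;
}
```

For example, 5 failed requests out of 1000 against a 99.9% target gives a burn rate of about 5, which would exhaust a 30-day budget in about 6 days.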
Read references/grafana-dashboards.md for dashboard panel definitions and provisioning.
Dashboard hierarchy:
Alert hierarchy (pair with dashboards):
Read:
references/infra-exporters.md
Your app metrics (Stages 1-4) show how the application experiences its dependencies. Infrastructure exporters show how those services perform internally. You need both. The gap between them is where most production issues hide.
This reference covers exporter setup (docker-compose), Prometheus scrape config, and community Grafana dashboard IDs for: PostgreSQL, Redis, RabbitMQ, MongoDB, Qdrant, Elasticsearch, Nginx, MinIO, plus node-exporter and cAdvisor for host/container metrics.
Each entry includes the key metrics the exporter provides and, critically, what it does NOT tell you (which your app-level metrics fill).
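A cache wrapper shows what "filling the gap" looks like in practice: the exporter sees hits and misses for the whole Redis server, while the app can attribute them to a named cache. This is a minimal sketch, assuming an in-memory Map as a stand-in for Redis and a local `CacheCounterLike` interface shaped like an OpenTelemetry counter. `InstrumentedCache` is a hypothetical name.

```typescript
// Local stand-in shaped like an OpenTelemetry Counter (add), so the sketch
// runs without the SDK or a Redis connection.
interface CacheCounterLike { add(value: number, attributes?: Record<string, string>): void; }

class InstrumentedCache<V> {
  private readonly store = new Map<string, V>(); // stand-in for Redis
  private readonly cacheName: string;
  private readonly hits: CacheCounterLike;   // app_cache_hit_total
  private readonly misses: CacheCounterLike; // app_cache_miss_total

  constructor(cacheName: string, hits: CacheCounterLike, misses: CacheCounterLike) {
    this.cacheName = cacheName;
    this.hits = hits;
    this.misses = misses;
  }

  get(key: string): V | undefined {
    const found = this.store.has(key);
    // Only cache_name is attached. The key itself is unbounded and must
    // never become an attribute.
    (found ? this.hits : this.misses).add(1, { cache_name: this.cacheName });
    return this.store.get(key);
  }

  set(key: string, value: V): void {
    this.store.set(key, value);
  }
}
```

Two caches sharing one Redis instance, for example a user-profile cache and a search-results cache, would then produce separate hit ratios even though the exporter reports a single combined figure.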
- Custom app metrics use the prefix app_. Infra exporters use node_, pg_, redis_, container_.
- If the OTel SDK is not initialized before instruments are created, meter.create* calls silently return no-op instruments. Nothing fails, nothing records. Verify by checking the Collector's own metrics or the Prometheus targets page.
- Record durations in a finally block so errors are measured too.

| File | When to read |
|---|---|
| references/otel-setup.md | Before starting. OTel SDK bootstrap and Collector config. |
| references/stage-1-foundation.md | Implementing foundation metrics. Full code. |
| references/stage-2-app-intelligence.md | Adding dependency and internal behavior metrics. |
| references/stage-3-business-metrics.md | Adding business event tracking and SLIs. |
| references/stage-4-advanced.md | WebSockets, rate limiters, multi-tenancy, lifecycle. |
| references/infra-exporters.md | Exporter setup for Postgres, Redis, RabbitMQ, MongoDB, Qdrant, Elasticsearch, Nginx, MinIO, node-exporter, cAdvisor. Docker-compose, scrape configs, community dashboard IDs. |
| references/grafana-dashboards.md | App dashboard structure, alert rules, provisioning. |
| templates/stage-1-service-overview.json | Import into Grafana for HTTP RED, runtime, health check panels. |
| templates/stage-2-dependency-health.json | Import into Grafana for DB, Redis, external API, auth, job panels. |
| templates/stage-3-business-metrics.json | Import into Grafana for signups, revenue, funnels, SLO panels. |
| templates/stage-4-operational.json | Import into Grafana for WebSocket, rate limiter, multi-tenant panels. |