.agents/skills/observability-instrumentation/SKILL.md
Guide worker agents on implementing OpenTelemetry instrumentation, metrics, structured logging, and cross-signal correlation for Go services
Instrument Go services with the OpenTelemetry SDK for distributed tracing, metrics, and structured logging. Ensure all three signals (traces, logs, metrics) are correlated and exported to the observability stack (Tempo, Loki, Prometheus via OTel Collector).
Before instrumenting a service, ensure:
For new Go services or when migrating from telemetry.InitTelemetry(), use the servicedefaults package for unified observability setup:
go get github.com/staccato-toolkit/core/pkg/servicedefaults
This package provides a single Configure() call that initializes all observability signals (traces, metrics, logs) with best practices built in. See Go Service Defaults for full usage patterns.
For non-Go services (Python, TypeScript, etc.): Start by reading the Language Toolkit Pattern to understand the full capability contract (logging, tracing, metrics, HTTP client defaults, single entry-point, no-op path, graceful shutdown) before selecting libraries. The pattern rule defines what your language toolkit MUST provide, regardless of language runtime.
Quick start with servicedefaults.Configure():
import (
"context"
"log/slog"
"os"
"github.com/staccato-toolkit/core/pkg/servicedefaults"
)
func main() {
ctx := context.Background()
// Initialize all observability signals (traces, metrics, logs)
shutdown, err := servicedefaults.Configure(ctx, "my-service")
if err != nil {
slog.Error("failed to initialize service defaults", "error", err)
os.Exit(1)
}
defer shutdown(ctx)
// Use slog.Default() for logging (no global logger variable)
slog.Info("service started", "port", 8080)
// Your service logic...
}
Benefits of servicedefaults.Configure():
- Non-blocking dial
- Env-aware behavior (OTEL_SDK_DISABLED=true skips OTel init for dev/test)
- Unified logging setup (slog.Default())
- Instrumented HTTP client via servicedefaults.NewHTTPClient()
Note: The telemetry.InitTelemetry() approach (shown below) still works, but servicedefaults.Configure() is preferred for new services.
If you cannot use the servicedefaults package, follow these steps to manually initialize OTel:
Add the OTel Go SDK and required instrumentation libraries to go.mod:
go get go.opentelemetry.io/otel
go get go.opentelemetry.io/otel/sdk
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
go get go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc
go get go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
Create an initialization function that sets up the tracer and meter providers:
func initOTel(ctx context.Context, serviceName string) (func(context.Context) error, error) {
res, err := resource.New(ctx,
resource.WithAttributes(
semconv.ServiceNameKey.String(serviceName),
semconv.ServiceVersionKey.String(version),
semconv.DeploymentEnvironmentKey.String(os.Getenv("ENVIRONMENT")),
),
)
if err != nil {
return nil, fmt.Errorf("failed to create resource: %w", err)
}
// Trace exporter
traceExporter, err := otlptracegrpc.New(ctx,
otlptracegrpc.WithEndpoint(os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT")),
otlptracegrpc.WithInsecure(), // Use TLS in production
)
if err != nil {
return nil, fmt.Errorf("failed to create trace exporter: %w", err)
}
// Tracer provider with sampling
samplingRate := 0.1 // 10% default
if rate := os.Getenv("OTEL_TRACES_SAMPLER_ARG"); rate != "" {
if parsed, err := strconv.ParseFloat(rate, 64); err == nil {
samplingRate = parsed
}
}
tp := trace.NewTracerProvider(
trace.WithBatcher(traceExporter),
trace.WithResource(res),
trace.WithSampler(trace.ParentBased(trace.TraceIDRatioBased(samplingRate))),
)
otel.SetTracerProvider(tp)
// Metric exporter
metricExporter, err := otlpmetricgrpc.New(ctx,
otlpmetricgrpc.WithEndpoint(os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT")),
otlpmetricgrpc.WithInsecure(),
)
if err != nil {
return nil, fmt.Errorf("failed to create metric exporter: %w", err)
}
mp := metric.NewMeterProvider(
metric.WithReader(metric.NewPeriodicReader(metricExporter)),
metric.WithResource(res),
)
otel.SetMeterProvider(mp)
// Return a shutdown function that stops both providers even if the first one fails
return func(ctx context.Context) error {
return errors.Join(tp.Shutdown(ctx), mp.Shutdown(ctx))
}, nil
}
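The sampling block in initOTel accepts any value that parses as a float, so a typo such as OTEL_TRACES_SAMPLER_ARG=5 or a negative number flows straight into the sampler. A minimal sketch of a stricter parser follows, using only the standard library; parseSamplingRate is a hypothetical helper name and the 0.1 fallback mirrors the default used above.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// parseSamplingRate (hypothetical helper) converts the raw
// OTEL_TRACES_SAMPLER_ARG value into a ratio in [0, 1].
// Empty, non-numeric, NaN, or out-of-range input yields the fallback.
func parseSamplingRate(raw string, fallback float64) float64 {
	if raw == "" {
		return fallback
	}
	rate, err := strconv.ParseFloat(raw, 64)
	if err != nil || math.IsNaN(rate) || rate < 0 || rate > 1 {
		return fallback
	}
	return rate
}

func main() {
	for _, raw := range []string{"", "0.25", "abc", "5", "-1"} {
		fmt.Printf("%q -> %v\n", raw, parseSamplingRate(raw, 0.1))
	}
}
```

In initOTel this would replace the inline strconv.ParseFloat check, for example samplingRate := parseSamplingRate(os.Getenv("OTEL_TRACES_SAMPLER_ARG"), 0.1).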
Call this at service startup:
func main() {
ctx := context.Background()
shutdown, err := initOTel(ctx, "staccato-server")
if err != nil {
log.Fatal(err)
}
defer shutdown(context.Background())
// Start service...
}
Migration note: If you currently use telemetry.InitTelemetry(), replace it with servicedefaults.Configure() to gain non-blocking dial, env-aware behavior, and unified logging setup. See Go Service Defaults for details.
Wrap HTTP handlers with otelhttp to automatically create spans:
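import (
	"log"
	"net/http"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)
mux := http.NewServeMux()
mux.Handle("/api/users", otelhttp.NewHandler(usersHandler, "api.users"))
mux.Handle("/api/orders", otelhttp.NewHandler(ordersHandler, "api.orders"))
// Expose metrics endpoint (promhttp.Handler serves the default Prometheus registry)
mux.Handle("/metrics", promhttp.Handler())
// ListenAndServe only returns on failure, so surface the error
log.Fatal(http.ListenAndServe(":8080", mux))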
If using servicedefaults, create an instrumented HTTP client:
import "github.com/staccato-toolkit/core/pkg/servicedefaults"
client := servicedefaults.NewHTTPClient()
resp, err := client.Get("https://api.example.com/data")
If using manual OTel setup, wrap HTTP clients to propagate trace context:
client := &http.Client{
Transport: otelhttp.NewTransport(http.DefaultTransport),
}
req, _ := http.NewRequestWithContext(ctx, "GET", "https://api.example.com/data", nil)
resp, err := client.Do(req)
Create child spans for important operations:
func processOrder(ctx context.Context, order *Order) error {
tracer := otel.Tracer("staccato-server")
ctx, span := tracer.Start(ctx, "process_order")
defer span.End()
// Add useful attributes
span.SetAttributes(
attribute.String("order_id", order.ID),
attribute.String("user_id", order.UserID),
attribute.Int("item_count", len(order.Items)),
)
// Business logic...
if err := validateOrder(ctx, order); err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, "validation failed")
return err
}
span.SetStatus(codes.Ok, "") // the description is only recorded for codes.Error
return nil
}
Create metrics for business events:
var (
meter = otel.Meter("staccato-server")
requestCounter, _ = meter.Int64Counter(
"staccato_server_http_requests_total",
metric.WithDescription("Total HTTP requests"),
)
requestDuration, _ = meter.Float64Histogram(
"staccato_server_http_request_duration_seconds",
metric.WithDescription("HTTP request duration"),
)
orderCounter, _ = meter.Int64Counter(
"staccato_server_orders_total",
metric.WithDescription("Total orders processed"),
)
)
func handleRequest(w http.ResponseWriter, r *http.Request) {
start := time.Now()
// Handle request...
duration := time.Since(start)
requestCounter.Add(r.Context(), 1, metric.WithAttributes(
attribute.String("method", r.Method),
attribute.String("endpoint", r.URL.Path),
attribute.Int("status", 200),
))
requestDuration.Record(r.Context(), duration.Seconds())
}
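The handleRequest example above hardcodes status 200 in the counter attributes. One common way to record the real status is to wrap the http.ResponseWriter; the sketch below uses only the standard library, and statusRecorder, withStatus, and statusFor are hypothetical names. The observe callback marks where requestCounter.Add would be called.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// statusRecorder wraps http.ResponseWriter and remembers the status code.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

// WriteHeader stores the code before delegating to the wrapped writer.
func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// withStatus runs next and then reports the status it wrote to observe.
// The recorder starts at 200 because that is what net/http sends when a
// handler writes a body without calling WriteHeader.
func withStatus(next http.Handler, observe func(status int)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, req)
		observe(rec.status)
	})
}

// statusFor serves one test request with a handler that writes code
// (or no explicit header when code is 0) and returns the observed status.
func statusFor(code int) int {
	seen := 0
	h := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		if code != 0 {
			w.WriteHeader(code)
		}
		fmt.Fprint(w, "ok")
	})
	req := httptest.NewRequest(http.MethodGet, "/api/users", nil)
	withStatus(h, func(s int) { seen = s }).ServeHTTP(httptest.NewRecorder(), req)
	return seen
}

func main() {
	fmt.Println(statusFor(http.StatusNotFound), statusFor(0)) // prints: 404 200
}
```

A plain wrapper like this hides optional interfaces such as http.Flusher on the underlying writer, so handlers that stream responses may need a fuller wrapper.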
If using servicedefaults.Configure(), logging is automatically set up with the otelslog bridge and TraceHandler. Simply use slog.Default():
import "log/slog"
func HandleRequest(ctx context.Context, req *Request) error {
slog.InfoContext(ctx, "processing request", "request_id", req.ID)
if err := processData(ctx, req.Data); err != nil {
slog.ErrorContext(ctx, "failed to process data", "error", err)
return err
}
return nil
}
If using manual OTel setup, all services MUST use log/slog with the OpenTelemetry logs bridge (otelslog) to export logs via OTLP to the OTel Collector. This ensures unified observability (traces, metrics, logs) through a single pipeline.
go get go.opentelemetry.io/contrib/bridges/otelslog
go get go.opentelemetry.io/otel/sdk/log
go get go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploggrpc
Add LoggerProvider initialization to your initOTel function (or telemetry package):
import (
"go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploggrpc"
"go.opentelemetry.io/otel/log/global"
sdklog "go.opentelemetry.io/otel/sdk/log"
)
func initLoggerProvider(ctx context.Context, res *resource.Resource) (*sdklog.LoggerProvider, error) {
// Create OTLP/gRPC log exporter (same endpoint as traces/metrics)
logExporter, err := otlploggrpc.New(ctx,
otlploggrpc.WithEndpoint(os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT")),
otlploggrpc.WithInsecure(), // Use TLS in production
)
if err != nil {
return nil, fmt.Errorf("failed to create log exporter: %w", err)
}
// Create LoggerProvider with batch processor
lp := sdklog.NewLoggerProvider(
sdklog.WithProcessor(sdklog.NewBatchProcessor(logExporter)),
sdklog.WithResource(res),
)
// Set global logger provider (used by otelslog bridge)
global.SetLoggerProvider(lp)
return lp, nil
}
Add LoggerProvider to your main initialization:
func initOTel(ctx context.Context, serviceName string) (func(context.Context) error, error) {
// ... resource, tracer provider, meter provider setup ...
// Initialize LoggerProvider
lp, err := initLoggerProvider(ctx, res)
if err != nil {
return nil, fmt.Errorf("failed to initialize logger provider: %w", err)
}
// Return shutdown function that includes LoggerProvider
return func(ctx context.Context) error {
var errs []error
if err := lp.Shutdown(ctx); err != nil {
errs = append(errs, err)
}
if err := tp.Shutdown(ctx); err != nil {
errs = append(errs, err)
}
if err := mp.Shutdown(ctx); err != nil {
errs = append(errs, err)
}
if len(errs) > 0 {
return fmt.Errorf("shutdown errors: %v", errs)
}
return nil
}, nil
}
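The shutdown closure above collects errors from all three providers, but the callers shown pass a context with no deadline, so a collector that is unreachable at exit could delay process shutdown. A minimal sketch of a bounded, run-everything shutdown follows; shutdownAll and demo are hypothetical names, and the stand-in functions take the place of lp.Shutdown, tp.Shutdown, and mp.Shutdown.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// shutdownFunc matches the signature of the provider Shutdown methods.
type shutdownFunc func(context.Context) error

// shutdownAll calls every function under one shared deadline and keeps
// going after a failure, returning all errors joined together.
func shutdownAll(timeout time.Duration, fns ...shutdownFunc) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	var errs []error
	for _, fn := range fns {
		errs = append(errs, fn(ctx)) // nil results are dropped by errors.Join
	}
	return errors.Join(errs...)
}

// demo runs three stand-in providers, the first of which fails, and reports
// how many were called and whether an error came back.
func demo() (called int, failed bool) {
	ok := func(context.Context) error { called++; return nil }
	bad := func(context.Context) error { called++; return errors.New("flush failed") }
	err := shutdownAll(5*time.Second, bad, ok, ok)
	return called, err != nil
}

func main() {
	called, failed := demo()
	fmt.Println(called, failed) // prints: 3 true
}
```

The five-second timeout is an arbitrary illustration; pick a value that fits the termination grace period of the deployment.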
Configure the default logger with TraceHandler (for trace context injection) and initialize the otelslog bridge:
import (
"context"
"log/slog"
"os"
"github.com/staccato-toolkit/core/pkg/telemetry"
"go.opentelemetry.io/contrib/bridges/otelslog"
)
func main() {
ctx := context.Background()
// Initialize OTel (traces, metrics, logs)
shutdown, err := telemetry.InitTelemetry(ctx, "staccato-server")
if err != nil {
slog.Error("failed to initialize telemetry", "error", err)
os.Exit(1)
}
defer shutdown(ctx)
// Set up structured logging with trace context injection
// Layer: slog.Default() -> TraceHandler -> JSONHandler -> stdout
baseHandler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
Level: slog.LevelDebug,
})
traceHandler := telemetry.NewTraceHandler(baseHandler)
slog.SetDefault(slog.New(traceHandler))
// Initialize otelslog bridge (forwards logs to OTel LoggerProvider -> OTLP)
_ = otelslog.NewHandler("staccato-server")
// Use slog throughout the service
slog.Info("service started", "port", 8080)
}
Always use slog.InfoContext(ctx, ...) and slog.ErrorContext(ctx, ...) to include trace context:
func HandleRequest(ctx context.Context, req *Request) error {
slog.InfoContext(ctx, "processing request", "request_id", req.ID)
if err := processData(ctx, req.Data); err != nil {
slog.ErrorContext(ctx, "failed to process data", "error", err, "request_id", req.ID)
return err
}
slog.InfoContext(ctx, "request completed", "request_id", req.ID)
return nil
}
The TraceHandler automatically injects trace_id and span_id from the active span context into log records. This is used for stdout logs. The otelslog bridge handles OTLP export.
// domain/pkg/telemetry/slog.go
type TraceHandler struct {
handler slog.Handler
}
func NewTraceHandler(h slog.Handler) slog.Handler {
return &TraceHandler{handler: h}
}
func (h *TraceHandler) Handle(ctx context.Context, r slog.Record) error {
span := trace.SpanFromContext(ctx)
spanCtx := span.SpanContext()
if spanCtx.IsValid() {
r.AddAttrs(
slog.String("trace_id", spanCtx.TraceID().String()),
slog.String("span_id", spanCtx.SpanID().String()),
)
}
return h.handler.Handle(ctx, r)
}
// Implement Enabled, WithAttrs, WithGroup methods...
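The three methods left out above matter more than they look: if WithAttrs or WithGroup return the inner handler directly, any logger derived with logger.With(...) silently stops injecting trace IDs. The sketch below shows one way to complete the handler using only the standard library. To stay self-contained it replaces trace.SpanFromContext with a hypothetical spanIDs function; traceHandler, render, and injected are likewise illustrative names, and the IDs are sample values.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log/slog"
	"strings"
)

// spanIDs is a hypothetical stand-in for trace.SpanFromContext(ctx).SpanContext():
// it returns the IDs of the active span, or ok=false when there is none.
type spanIDs func(ctx context.Context) (traceID, spanID string, ok bool)

type traceHandler struct {
	next slog.Handler
	ids  spanIDs
}

// Enabled defers the level decision to the wrapped handler.
func (h *traceHandler) Enabled(ctx context.Context, l slog.Level) bool {
	return h.next.Enabled(ctx, l)
}

// Handle adds trace_id and span_id to a clone of the record, so other
// holders of the same record are not affected.
func (h *traceHandler) Handle(ctx context.Context, r slog.Record) error {
	if traceID, spanID, ok := h.ids(ctx); ok {
		r = r.Clone()
		r.AddAttrs(slog.String("trace_id", traceID), slog.String("span_id", spanID))
	}
	return h.next.Handle(ctx, r)
}

// WithAttrs and WithGroup re-wrap the derived handler so that loggers
// created via logger.With(...) or logger.WithGroup(...) keep injecting IDs.
func (h *traceHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return &traceHandler{next: h.next.WithAttrs(attrs), ids: h.ids}
}

func (h *traceHandler) WithGroup(name string) slog.Handler {
	return &traceHandler{next: h.next.WithGroup(name), ids: h.ids}
}

// render logs one record through a derived logger and returns the JSON line.
func render(active bool) string {
	var buf bytes.Buffer
	ids := func(context.Context) (string, string, bool) {
		return "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", active
	}
	logger := slog.New(&traceHandler{next: slog.NewJSONHandler(&buf, nil), ids: ids})
	logger.With("component", "demo").InfoContext(context.Background(), "hello")
	return buf.String()
}

// injected reports whether the rendered record carries the trace_id attribute.
func injected(active bool) bool {
	return strings.Contains(render(active), `"trace_id":"4bf92f3577b34da6a3ce929d0e0e4736"`)
}

func main() {
	fmt.Print(render(true))
	fmt.Println(injected(true), injected(false)) // second line prints: true false
}
```

One caveat of this approach: after logger.WithGroup("g"), attributes added in Handle are nested under that group, so trace_id would appear as g.trace_id in the JSON output.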
The logging setup creates dual output:
slog.Logger → TraceHandler → JSONHandler → stdout
↓
(injects trace_id, span_id)
otelslog.Handler → OTel LoggerProvider → OTLP Exporter → Collector → Loki
↓
(uses global LoggerProvider)
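For both paths in the diagram to receive records, the default logger has to hand each record to two handlers. Note that the main() snippet earlier discards the value returned by otelslog.NewHandler, so unless the telemetry package wires the bridge elsewhere, only the stdout path would be active. A minimal fan-out sketch follows, using only the standard library; fanoutHandler and deliver are hypothetical names, and the two buffer-backed handlers stand in for the TraceHandler chain and for the otelslog handler.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"log/slog"
)

// fanoutHandler (hypothetical name) forwards each record to every wrapped handler.
type fanoutHandler struct {
	handlers []slog.Handler
}

// Enabled reports true if at least one wrapped handler wants the level.
func (f *fanoutHandler) Enabled(ctx context.Context, level slog.Level) bool {
	for _, h := range f.handlers {
		if h.Enabled(ctx, level) {
			return true
		}
	}
	return false
}

// Handle gives each enabled handler its own clone of the record.
func (f *fanoutHandler) Handle(ctx context.Context, r slog.Record) error {
	var errs []error
	for _, h := range f.handlers {
		if h.Enabled(ctx, r.Level) {
			errs = append(errs, h.Handle(ctx, r.Clone()))
		}
	}
	return errors.Join(errs...)
}

// WithAttrs derives a new fan-out whose children all carry attrs.
func (f *fanoutHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	next := make([]slog.Handler, len(f.handlers))
	for i, h := range f.handlers {
		next[i] = h.WithAttrs(attrs)
	}
	return &fanoutHandler{handlers: next}
}

// WithGroup derives a new fan-out whose children all open the group.
func (f *fanoutHandler) WithGroup(name string) slog.Handler {
	next := make([]slog.Handler, len(f.handlers))
	for i, h := range f.handlers {
		next[i] = h.WithGroup(name)
	}
	return &fanoutHandler{handlers: next}
}

// deliver logs one message through the fan-out and returns what each sink received.
func deliver(msg string) (stdoutPath, otlpPath string) {
	var a, b bytes.Buffer
	logger := slog.New(&fanoutHandler{handlers: []slog.Handler{
		slog.NewJSONHandler(&a, nil), // stand-in for TraceHandler -> JSONHandler -> stdout
		slog.NewTextHandler(&b, nil), // stand-in for otelslog.NewHandler(...)
	}})
	logger.Info(msg, "port", 8080)
	return a.String(), b.String()
}

func main() {
	first, second := deliver("service started")
	fmt.Print(first, second)
}
```

In a service, the wiring would presumably pass the trace handler and otelslog.NewHandler("staccato-server") as the two children and install the result with slog.SetDefault.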
Use environment variables to control log verbosity:
# Kubernetes deployment
env:
- name: OTEL_LOG_LEVEL
value: "info" # debug, info, warn, error
In your handler setup:
level := slog.LevelInfo
if os.Getenv("OTEL_LOG_LEVEL") == "debug" {
level = slog.LevelDebug
}
handler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
Level: level,
})
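The check above only recognizes "debug", so the warn and error values listed in the deployment comment would fall through to info. A small sketch that covers all four names with the standard library's own parser follows; parseLogLevel is a hypothetical helper name, and falling back to info on unknown input is an assumption about the desired behavior.

```go
package main

import (
	"fmt"
	"log/slog"
	"strings"
)

// parseLogLevel (hypothetical helper) maps debug, info, warn, or error
// (case-insensitive) to a slog.Level. Unknown or empty input yields info.
func parseLogLevel(raw string) slog.Level {
	var level slog.Level
	if err := level.UnmarshalText([]byte(strings.TrimSpace(raw))); err != nil {
		return slog.LevelInfo
	}
	return level
}

func main() {
	for _, raw := range []string{"debug", "info", "warn", "ERROR", "", "verbose"} {
		fmt.Printf("%q -> %v\n", raw, parseLogLevel(raw))
	}
}
```

The handler setup would then use Level: parseLogLevel(os.Getenv("OTEL_LOG_LEVEL")). Note that slog accepts "warn" but not "warning".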
Configure via environment variables:
# Kubernetes deployment
env:
- name: OTEL_EXPORTER_OTLP_ENDPOINT
value: "otel-collector:4317"
- name: OTEL_TRACES_SAMPLER
value: "parentbased_traceidratio"
- name: OTEL_TRACES_SAMPLER_ARG
value: "0.1" # 10% sampling
- name: ENVIRONMENT
value: "production"
- name: OTEL_LOG_LEVEL
value: "info"
After instrumentation, verify:
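- Metrics are exposed at the /metrics endpoint (test: curl http://localhost:8080/metrics)
- Logs include trace_id when a trace context is active
- Metric names follow the pattern <service>_<subsystem>_<unit>_total
// loggerKey is an unexported context key type, which avoids collisions with plain string keys
type loggerKey struct{}
func TracingMiddleware(logger *slog.Logger) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// Derive a per-request logger; reassigning the captured logger
			// would race across requests and accumulate trace_id attributes.
			reqLogger := logger
			span := trace.SpanFromContext(r.Context())
			if span.SpanContext().IsValid() {
				reqLogger = logger.With(
					slog.String("trace_id", span.SpanContext().TraceID().String()),
				)
			}
			next.ServeHTTP(w, r.WithContext(
				context.WithValue(r.Context(), loggerKey{}, reqLogger),
			))
		})
	}
}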
func recordError(span trace.Span, err error, msg string) {
span.RecordError(err)
span.SetStatus(codes.Error, msg)
span.SetAttributes(attribute.String("error.message", err.Error()))
}
// Usage
if err := processOrder(ctx, order); err != nil {
recordError(span, err, "failed to process order")
return err
}
Spans not appearing in Tempo
→ Check OTEL_EXPORTER_OTLP_ENDPOINT is correct. Verify OTel Collector is running. Ensure shutdown() is called on service exit to flush pending spans.
Logs missing trace_id
→ Verify you're passing context.Context to log calls. Use trace.SpanFromContext(ctx) to extract trace ID. Ensure the span context is valid before logging.
Metrics not scraped by Prometheus
→ Verify /metrics endpoint is exposed. Check ServiceMonitor selector matches service labels. Confirm OTel Collector is forwarding metrics to Prometheus.
High latency overhead
→ Reduce sampling rate (e.g., 5% or 1%). Use trace.WithBatcher (not WithSyncer). Profile with pprof to identify bottlenecks.
Alloy replaces both Promtail and the standalone OTel Collector. All OTLP signals (traces, metrics, logs) from services should target the Alloy agent endpoint.
Grafana Alloy is a DaemonSet in the monitoring namespace that:
- Collects Kubernetes pod logs via loki.source.kubernetes (replaces Promtail)
- Receives OTLP over gRPC :4317 and HTTP :4318
Go services (via servicedefaults or manual OTel init): set the endpoint to Alloy:
# In-cluster
export OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy.monitoring.svc.cluster.local:4317
# Or via DaemonSet node IP (for DaemonSet-aware clients)
# OTEL_EXPORTER_OTLP_ENDPOINT=http://$(NODE_IP):4317
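One thing to watch with manual OTel init: the initOTel function shown earlier passes OTEL_EXPORTER_OTLP_ENDPOINT to otlptracegrpc.WithEndpoint, which, per the exporter documentation, takes a host and port rather than a URL, while the values here carry an http:// scheme. A minimal sketch of normalizing the value follows, using only the standard library; hostPort is a hypothetical helper name. Depending on the exporter version you pin, a URL-accepting option or simply omitting the endpoint option (so the exporter reads the variable itself) may be a simpler route, so treat this as one option to verify rather than the required fix.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// hostPort (hypothetical helper) reduces an OTLP endpoint value to host:port.
// Values without a scheme, such as "otel-collector:4317", pass through unchanged.
func hostPort(raw string) (string, error) {
	raw = strings.TrimSpace(raw)
	if raw == "" {
		return "", fmt.Errorf("endpoint is empty")
	}
	if !strings.Contains(raw, "://") {
		return raw, nil
	}
	u, err := url.Parse(raw)
	if err != nil {
		return "", fmt.Errorf("parse endpoint %q: %w", raw, err)
	}
	if u.Host == "" {
		return "", fmt.Errorf("endpoint %q has no host", raw)
	}
	return u.Host, nil
}

func main() {
	for _, raw := range []string{
		"otel-collector:4317",
		"http://alloy.monitoring.svc.cluster.local:4317",
		"http://",
	} {
		got, err := hostPort(raw)
		fmt.Printf("%q -> %q (err: %v)\n", raw, got, err)
	}
}
```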
Node.js services (Backstage): the instrumentation.js bootstrap reads OTEL_EXPORTER_OTLP_ENDPOINT:
// Default: http://alloy.monitoring.svc.cluster.local:4318 (OTLP/HTTP)
// Override for local dev: export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
const otlpEndpoint =
process.env.OTEL_EXPORTER_OTLP_ENDPOINT || "http://alloy.monitoring.svc.cluster.local:4318";
The Alloy pipeline is configured in src/ops/observability/alloy/config.alloy using River syntax. Key components:
// OTLP receiver — accepts all signals from services
otelcol.receiver.otlp "default" {
grpc { endpoint = "0.0.0.0:4317" }
http { endpoint = "0.0.0.0:4318" }
output {
traces = [otelcol.exporter.otlp.tempo.input]
metrics = [otelcol.exporter.prometheus.default.input]
logs = [otelcol.exporter.loki.default.input]
}
}
// Kubernetes pod log collection — replaces Promtail
loki.source.kubernetes "pod_logs" {
targets = discovery.relabel.pod_logs.output
forward_to = [loki.write.default.receiver]
}
See Grafana Alloy Usage Rules for the full pipeline reference and deployment checklist.
For local dev (no in-cluster Alloy), start Alloy locally or use a Docker Compose stack:
# Point to a local Alloy or OTel Collector for dev
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
Or set OTEL_SDK_DISABLED=true to skip OTel init entirely in local dev.
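The source states that servicedefaults skips OTel init when OTEL_SDK_DISABLED=true. For services on the manual init path, an equivalent guard might look like the sketch below; sdkDisabled and initUnlessDisabled are hypothetical names, and the setup callback stands in for initOTel. The environment lookup is injected as a function so the guard can be exercised without touching the process environment.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"strings"
)

// shutdownFunc matches the shutdown closure returned by initOTel.
type shutdownFunc func(context.Context) error

// sdkDisabled reports whether OTEL_SDK_DISABLED is set to "true" (any case).
func sdkDisabled(getenv func(string) string) bool {
	return strings.EqualFold(strings.TrimSpace(getenv("OTEL_SDK_DISABLED")), "true")
}

// initUnlessDisabled returns a no-op shutdown when the SDK is disabled,
// otherwise it runs setup and returns whatever setup returns.
func initUnlessDisabled(
	ctx context.Context,
	getenv func(string) string,
	setup func(context.Context) (shutdownFunc, error),
) (shutdownFunc, error) {
	if sdkDisabled(getenv) {
		return func(context.Context) error { return nil }, nil
	}
	return setup(ctx)
}

func main() {
	ctx := context.Background()
	shutdown, err := initUnlessDisabled(ctx, os.Getenv, func(context.Context) (shutdownFunc, error) {
		fmt.Println("initializing OTel providers") // stand-in for initOTel
		return func(context.Context) error { return nil }, nil
	})
	if err != nil {
		fmt.Println("init failed:", err)
		return
	}
	defer shutdown(ctx)
}
```

Returning a no-op shutdown keeps the caller's defer shutdown(ctx) unchanged in both modes.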