skills/golang-observability/SKILL.md
Golang everyday observability — the always-on signals in production. Covers structured logging with slog, Prometheus metrics, OpenTelemetry distributed tracing, continuous profiling with pprof/Pyroscope, server-side RUM event tracking, alerting, and Grafana dashboards. Apply when instrumenting Go services for production monitoring, setting up metrics or alerting, adding OpenTelemetry tracing, correlating logs with traces, migrating legacy loggers (zap/logrus/zerolog) to slog, adding observability to new features, or implementing GDPR/CCPA-compliant tracking with Customer Data Platforms (CDP). Not for temporary deep-dive performance investigation (→ See `golang-benchmark` and `golang-performance` skills).
npx skillsauth add rockcookies/skills golang-observabilityInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Persona: You are a Go observability engineer. You treat every unobserved production system as a liability — instrument proactively, correlate signals to diagnose, and never consider a feature done until it is observable.
Modes:
Community default. A company skill that explicitly supersedes
golang-observabilityskill takes precedence.
Observability is the ability to understand a system's internal state from its external outputs. In Go services, this means five complementary signals: logs, metrics, traces, profiles, and RUM. Each answers different questions, and together they give you full visibility into both system behavior and user experience.
When using observability libraries (Prometheus client, OpenTelemetry SDK, vendor integrations), refer to the library's official documentation and code examples for current API signatures.
log/slog — production services MUST emit structured logs (JSON), not freeform stringsslog.InfoContext(ctx, ...) to correlate logs with traceshistogram_quantile() in PromQLSee golang-error-handling skill for the single handling rule. See golang-troubleshooting skill for using observability signals to diagnose production issues. See golang-security skill for protecting pprof endpoints and avoiding PII in logs. See golang-context skill for propagating trace context across service boundaries. See samber/cc-skills@promql-cli skill for querying and exploring PromQL expressions against Prometheus from the CLI.
For simple fan-out to multiple slog handlers, prefer stdlib slog.NewMultiHandler before adding third-party handler-composition dependencies.
logger := slog.New(slog.NewMultiHandler(
slog.NewJSONHandler(os.Stdout, nil),
auditHandler,
))
Use third-party slog handler libraries only when the stdlib handler composition is insufficient.
| Signal | Question it answers | Tool | When to use |
| --- | --- | --- | --- |
| Logs | What happened? | log/slog | Discrete events, errors, audit trails |
| Metrics | How much / how fast? | Prometheus client | Aggregated measurements, alerting, SLOs |
| Traces | Where did time go? | OpenTelemetry | Request flow across services, latency breakdown |
| Profiles | Why is it slow / using memory? | pprof, Pyroscope | CPU hotspots, memory leaks, lock contention |
| RUM | How do users experience it? | PostHog, Segment | Product analytics, funnels, session replay |
Each signal has a dedicated guide with full code examples, configuration patterns, and cost analysis:
Structured Logging — Why structured logging matters for log aggregation at scale. Covers log/slog setup, log levels (Debug/Info/Warn/Error) and when to use each, request correlation with trace IDs, context propagation with slog.InfoContext, request-scoped attributes, the slog ecosystem (handlers, formatters, middleware), and migration strategies from zap/logrus/zerolog.
Metrics Collection — Prometheus client setup and the four metric types (Counter for rate-of-change, Gauge for snapshots, Histogram for latency aggregation). Deep dive: why Histograms beat Summaries (server-side aggregation, supports histogram_quantile PromQL), naming conventions, the PromQL-as-comments convention (write queries above metric declarations for discoverability), production-grade PromQL examples, multi-window SLO burn rate alerting, and the high-cardinality label problem (why unbounded values like user IDs destroy performance).
Distributed Tracing — When and how to use OpenTelemetry SDK to trace request flows across services. Covers spans (creating, attributes, status recording), otelhttp middleware for HTTP instrumentation, error recording with span.RecordError(), trace sampling (why you can't collect everything at scale), propagating trace context across service boundaries, and cost optimization.
Profiling — On-demand profiling with pprof (CPU, heap, goroutine, mutex, block profiles) — how to enable it in production, secure it with auth, and toggle via environment variables without redeploying. Continuous profiling with Pyroscope for always-on performance visibility. Cost implications of each profiling type and mitigation strategies.
Real User Monitoring — Understanding how users actually experience your service. Covers product analytics (event tracking, funnels), Customer Data Platform integration, and critical compliance: GDPR/CCPA consent checks, data subject rights (user deletion endpoints), and privacy checklist for tracking. Server-side event tracking (PostHog, Segment) and identity key best practices.
Alerting — Proactive problem detection. Covers the four golden signals (latency, traffic, errors, saturation), awesome-prometheus-alerts as a rule library with ~500 ready-to-use rules by technology, Go runtime alerts (goroutine leaks, GC pressure, OOM risk), severity levels, and common mistakes that break alerting (using irate instead of rate, missing for: duration to avoid flapping).
Grafana Dashboards — Prebuilt dashboards for Go runtime monitoring (heap allocation, GC pause frequency, goroutine count, CPU). Explains the standard dashboards to install, how to customize them for your service, and when each dashboard answers a different operational question.
Signals are most powerful when connected. A trace_id in your logs lets you jump from a log line to the full request trace. An exemplar on a metric links a latency spike to the exact trace that caused it.
otelslog bridgeimport "go.opentelemetry.io/contrib/bridges/otelslog"
// Create a logger that automatically injects trace_id and span_id
logger := otelslog.NewHandler("my-service")
slog.SetDefault(slog.New(logger))
// Now every slog call with context includes trace correlation
slog.InfoContext(ctx, "order created", "order_id", orderID)
// Output includes: {"trace_id":"abc123", "span_id":"def456", "msg":"order created", ...}
// When recording a histogram observation, attach the trace_id as an exemplar
// so you can jump from a P99 spike directly to the offending trace
obs := histogram.WithLabelValues("POST", "/orders")
if eo, ok := obs.(prometheus.ExemplarObserver); ok {
eo.ObserveWithExemplar(duration, prometheus.Labels{"trace_id": traceID})
} else {
obs.Observe(duration)
}
If the project currently uses zap, logrus, or zerolog, migrate to log/slog. It is the standard library logger since Go 1.21, has a stable API, and the ecosystem has consolidated around it. Continuing with third-party loggers means maintaining an extra dependency for no benefit.
Migration strategy:
slog as the new logger with slog.SetDefault()zap.L().Info(...) / logrus.Info(...) / log.Info().Msg(...) calls with slog.Info(...)A feature is not production-ready until it is observable. Before marking a feature as done, verify:
slog, context variants used (slog.InfoContext), no PII in logs, errors MUST be either logged OR returned (NEVER both).span.RecordError().user_id (not email), consent checked before tracking.// ✗ Bad — log AND return (error gets logged multiple times up the chain)
if err != nil {
slog.Error("query failed", "error", err)
return fmt.Errorf("query: %w", err)
}
// ✓ Good — return with context, log once at the top level
if err != nil {
return fmt.Errorf("querying users: %w", err)
}
// ✗ Bad — high-cardinality label (unbounded user IDs)
httpRequests.WithLabelValues(r.Method, r.URL.Path, userID).Inc()
// ✓ Good — bounded label values only
httpRequests.WithLabelValues(r.Method, routePattern).Inc()
// ✗ Bad — not passing context (breaks trace propagation)
result, err := db.Query("SELECT ...")
// ✓ Good — context flows through, trace continues
result, err := db.QueryContext(ctx, "SELECT ...")
// ✗ Bad — using Summary for latency (can't aggregate across instances)
prometheus.NewSummary(prometheus.SummaryOpts{
Name: "http_request_duration_seconds",
Objectives: map[float64]float64{0.99: 0.001},
})
// ✓ Good — use Histogram (aggregatable, supports histogram_quantile)
prometheus.NewHistogram(prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Buckets: prometheus.DefBuckets,
})
development
Vue 3 debugging and error handling for runtime errors, warnings, async failures, and SSR/hydration issues. Use when diagnosing or fixing Vue issues.
development
MUST be used for Vue.js tasks. Strongly recommends Composition API with `<script setup>` and TypeScript as the standard approach. Covers Vue 3, SSR, Volar, vue-tsc. Load for any Vue, .vue files, Vue Router, Pinia, or Vite with Vue work. ALWAYS use Composition API unless the project explicitly requires Options API.
development
GORM Gen 类型安全 DAO 代码生成,基于 github.com/rockcookies/go-gen(rockcookies fork)。涵盖代码生成配置、模型生成、查询构建、增删改查、关联关系、动态 SQL 注解、事务处理、datatypes 自定义字段类型(JSON/JSONMap/JSONSlice/JSONType/Date/UUID)、soft_delete 软删除插件(unix 时间戳/flag 模式),以及 fork 专有功能:Tmpl 运行时模板覆写(18 个模板)、Unsafe 底层方法(UnsafeSetDB/Alias/ModelType/TableName)、IGenericsDo[T,E] 泛型接口。使用时机:需要从数据库生成 DAO 代码(GenerateModel/GenerateModelAs)、编写 DAL 查询(DO 链式调用、DaoScope、事务、关联加载)、配置生成器(gen.Config、ModelOpt、FieldGORMTag、FieldModify、FieldType、Tmpl 自定义模板)、使用 datatypes(JSONMap、JSONSlice、JSONQuery、JSONSet)或 soft_delete(DeletedAt、softDelete:milli、deleteOpts)时使用本技能。当用户消息中包含以下任一关键词(go-gen、gorm-gen、GenerateModelAs、ModelOpt、FieldGORMTag、FieldModify、DaoScope、LoadOneToMany、LoadManyToMany、IGenericsDo、UnsafeSetDB、datatypes、JSONMap、JSONSlice、JSONQuery、soft_delete、softDelete、DeletedAt),或用户明确请求 GORM Gen 代码生成/DAO 编写时触发本技能。
development
轻量级 Go HTTP 客户端库,基于 github.com/rockcookies/go-fetch(零外部依赖)。涵盖 Dispatcher 初始化与中间件、Request 链式构建(RequestFunc 与 Middleware 分层)、Response 解码(JSON/XML/流)、请求体编码(JSON/XML/Form/Multipart/BodyGet)、URL 参数(PrepareURLMiddleware/URLOptions)、Header/Cookie 管理(ApplyHeader/ApplyCookie 与 Context)、中间件组合(Dispatcher/Request/Do 三层)、HTTP 交换日志(dump.New/dump.Transport/过滤器/WithRequestRedactor/WithResponseRedactor/SlogWriter)。使用时机:需要发起 HTTP 请求(GET/POST/PUT/PATCH/DELETE,均需 context.Context)、上传文件(Multipart/GetReader)、配置全局认证头(dispatcher.Use)、记录 HTTP 交换日志(dump.New、WithFilter、DefaultRedactor)、构建可复用的请求基础(Request.Clone)时使用本技能。当用户消息中包含以下任一关键词(go-fetch、NewDispatcher、NewDispatcherWithTransport、RequestFunc、PreFuncs、UseFuncs、BodyGet、MultipartField、dump.New、WithFilter、WithRequestRedactor、WithResponseRedactor、DefaultRedactor、DumpOptions、SlogWriter、URLOptions、PrepareURLMiddleware、PathParams、SetURLOptions、WithURLOptions、ApplyHeader、SetHeaderOptions、WithHeaderOptions、ApplyCookie、SetCookieOptions、WithCookieOptions、HandlerFunc、fetch.Handler、fetch.Middleware、dispatcher.Use、resp.Close、resp.JSON、resp.XML),或用户明确请求 go-fetch HTTP 客户端用法时触发本技能。