skills/otel-trace-analysis/SKILL.md
Analyze OpenTelemetry traces for latency issues, error diagnosis, retry patterns, dependency bottlenecks, and trace integrity problems. Use when the user provides OTEL trace JSON files or asks about distributed system performance issues.
npx skillsauth add stevefeldman/agents-skills otel-trace-analysisInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Interpret, diagnose, and explain OpenTelemetry (OTEL) traces from distributed systems. Focus on latency analysis, error diagnosis, retry behavior, dependency bottlenecks, and trace integrity issues.
CRITICAL: Deep Dive (Full Analysis) is the DEFAULT. Never take shortcuts unless explicitly requested.
This is the default mode. Always use Deep Dive unless the user explicitly requests Triage.
Deep Dive requires ALL of the following steps and output sections. Do NOT skip, summarize, or abbreviate any section:
Required Steps:
Required Output Sections (ALL 10 are MANDATORY):
Use ONLY when user explicitly says: "quick", "triage", "summary only", "brief", "60-second answer"
Triage is a SHORTCUT. Never default to it.
status.code=2Triage Output:
Duration: 3.2s | Spans: 47 | Errors: 2
Bottleneck: "db.query" (2.1s) in orders-service
Root Error: "connection refused" in payments-service
Pattern: Timeout cascade (parent ended before child)
Next: Check payments-service connectivity
| User Command / Phrase | Mode | Description |
|----------------------|------|-------------|
| /otel-trace-analysis <file> | Deep Dive | Full analysis with all 10 sections |
| "analyze this trace" | Deep Dive | Full analysis |
| "what happened?" | Deep Dive | Full analysis |
| "investigate" | Deep Dive | Full analysis |
| "deep dive" | Deep Dive | Full analysis |
| "full analysis" | Deep Dive | Full analysis |
| (no specific mode requested) | Deep Dive | Full analysis (DEFAULT) |
| "quick" | Triage | 5-line summary only |
| "triage" | Triage | 5-line summary only |
| "brief" | Triage | 5-line summary only |
| "summary only" | Triage | 5-line summary only |
| "60 seconds" | Triage | 5-line summary only |
NEVER take shortcuts unless explicitly asked. When in doubt, use Deep Dive.
OTEL traces use this hierarchical format:
{
"batches": [
{
"resource": {
"attributes": [
{"key": "service.name", "value": {"stringValue": "my-service"}},
{"key": "telemetry.sdk.language", "value": {"stringValue": "go"}}
]
},
"instrumentationLibrarySpans": [
{
"instrumentationLibrary": {
"name": "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",
"version": "0.60.0"
},
"spans": [/* array of span objects */]
}
]
}
]
}
OTEL JSON exports come in two schema versions:
| Version | Top-Level Key | Span Container | Status |
|---------|--------------|----------------|--------|
| OTLP v0.x | batches | instrumentationLibrarySpans | Common (your current traces) |
| OTLP v1.x | resourceSpans | scopeSpans | Newer standard |
Detection logic:
// Detect schema version
const isLegacy = trace.batches !== undefined;
const isModern = trace.resourceSpans !== undefined;
// Normalize to common accessor
const getSpans = (trace) => {
if (trace.batches) {
return trace.batches.flatMap(b =>
b.instrumentationLibrarySpans?.flatMap(ils => ils.spans) ?? []
);
}
if (trace.resourceSpans) {
return trace.resourceSpans.flatMap(rs =>
rs.scopeSpans?.flatMap(ss => ss.spans) ?? []
);
}
return [];
};
Analysis tip: Always check which schema you have before running jq commands. The skill's jq examples use the legacy format; adapt paths for modern format.
{
"traceId": "70eca60a2b3a542769facb8874e621fd",
"spanId": "03731dffbdf4129e",
"parentSpanId": "30f2d4a18f3512cd",
"name": "GET /api/users",
"kind": "SPAN_KIND_SERVER",
"startTimeUnixNano": 1771599406034000000,
"endTimeUnixNano": 1771599416167578000,
"attributes": [
{"key": "http.method", "value": {"stringValue": "GET"}},
{"key": "http.status_code", "value": {"intValue": 200}}
],
"status": {"code": 0, "message": ""},
"events": [
{
"name": "exception",
"timeUnixNano": 1771599416167000000,
"attributes": [
{"key": "exception.type", "value": {"stringValue": "*errors.Error"}},
{"key": "exception.message", "value": {"stringValue": "connection refused"}}
]
}
]
}
| Field | Description |
|-------|-------------|
| traceId | Unique ID linking all spans in a distributed trace |
| spanId | Unique ID for this span |
| parentSpanId | ID of parent span (empty/missing for root spans) |
| name | Operation name |
| kind | Span type (see Span Kinds) |
| startTimeUnixNano / endTimeUnixNano | Timing in nanoseconds since Unix epoch |
| attributes | Key-value pairs with operation details |
| status.code | 0=UNSET, 1=OK, 2=ERROR |
| status.message | Error description when code=2 |
| events | Timestamped events (exceptions, logs) |
| droppedAttributesCount | Indicates truncation if >0 |
Attributes use typed values:
{"key": "http.method", "value": {"stringValue": "GET"}}
{"key": "http.status_code", "value": {"intValue": 200}}
{"key": "cache.hit", "value": {"boolValue": true}}
{"key": "db.rows_affected", "value": {"intValue": 5}}
| Kind | Meaning | Examples |
|------|---------|----------|
| SPAN_KIND_SERVER | Handler for incoming request | HTTP server endpoint, gRPC server method |
| SPAN_KIND_CLIENT | Initiator of outgoing request | HTTP client call, database query, gRPC client |
| SPAN_KIND_INTERNAL | Internal operation | Business logic, local processing |
| SPAN_KIND_PRODUCER | Message queue producer | Kafka produce, RabbitMQ publish |
| SPAN_KIND_CONSUMER | Message queue consumer | Kafka consume, RabbitMQ subscribe |
| null | Unspecified (treat as INTERNAL) | Legacy instrumentation |
Analysis tip: CLIENT spans calling external services often dominate latency. SERVER spans show where time is spent handling requests.
Numeric Span Kinds (some exporters):
Some exporters use numeric enums instead of strings:
| Numeric | String Equivalent |
|---------|------------------|
| 0 | SPAN_KIND_UNSPECIFIED |
| 1 | SPAN_KIND_INTERNAL |
| 2 | SPAN_KIND_SERVER |
| 3 | SPAN_KIND_CLIENT |
| 4 | SPAN_KIND_PRODUCER |
| 5 | SPAN_KIND_CONSUMER |
Note: Your current traces use string constants. Handle both formats:
const normalizeKind = (kind) => {
if (typeof kind === 'number') {
return ['UNSPECIFIED','INTERNAL','SERVER','CLIENT','PRODUCER','CONSUMER'][kind] || 'UNKNOWN';
}
return kind?.replace('SPAN_KIND_', '') || 'INTERNAL';
};
Resource attributes identify the service instance:
| Attribute | Description |
|-----------|-------------|
| service.name | Service identifier (most important) |
| service.namespace | Service grouping |
| service.version | Deployment version |
| telemetry.sdk.language | go, nodejs, java, python, etc. |
| telemetry.sdk.name | Usually "opentelemetry" |
| telemetry.sdk.version | SDK version |
| k8s.cluster.name | Kubernetes cluster |
| k8s.namespace.name | Kubernetes namespace |
| k8s.pod.name | Pod name |
| cloud.provider | gcp, aws, azure |
| cloud.platform | gcp_kubernetes_engine, aws_eks, etc. |
| host.name | Host identifier |
| process.runtime.name | Runtime (nodejs, go, java) |
| Attribute | Description |
|-----------|-------------|
| http.method | GET, POST, PUT, DELETE |
| http.route | URL pattern (/users/:id) |
| http.target | Full path with query |
| http.url | Complete URL |
| http.status_code | Response code (200, 404, 500) |
| http.status_text | "OK", "Not Found" |
| http.host | Target host |
| http.user_agent | Client user agent |
| http.request_content_length | Request body size |
| http.response_content_length | Response body size |
| net.peer.name | Remote hostname |
| net.peer.port | Remote port |
| net.peer.ip | Remote IP |
| Attribute | Description |
|-----------|-------------|
| db.system | mysql, postgresql, redis, mongodb |
| db.name | Database name |
| db.operation | SELECT, INSERT, UPDATE, DELETE |
| db.statement | SQL query (may be sanitized) |
| db.sql.table | Table name |
| db.rows_affected | Rows modified |
| db.connection_string | Connection info (sanitized) |
| Attribute | Description |
|-----------|-------------|
| rpc.system | grpc, jsonrpc |
| rpc.service | Service name |
| rpc.method | Method name |
| rpc.grpc.status_code | gRPC status |
| Attribute | Description |
|-----------|-------------|
| messaging.system | kafka, rabbitmq, sqs |
| messaging.destination | Queue/topic name |
| messaging.operation | send, receive, process |
| messaging.message_id | Message identifier |
| Attribute | Description |
|-----------|-------------|
| net.transport | ip_tcp, ip_udp |
| net.host.name | Local hostname |
| net.host.port | Local port |
| tls.protocol | TLSv1.2, TLSv1.3 |
| tls.cipher.name | Cipher suite |
OTEL semantic conventions evolved. Traces often contain BOTH old and new attribute names. Use this resolution order:
| Canonical | Check First (New) | Fallback (Old) |
|-----------|------------------|----------------|
| Status Code | http.response.status_code | http.status_code |
| Method | http.request.method | http.method |
| Body Size | http.response.body.size | http.response_content_length |
| URL Path | url.path | http.target |
| URL Scheme | url.scheme | http.scheme |
| Canonical | Check First (New) | Fallback (Old) |
|-----------|------------------|----------------|
| Server Address | server.address | net.peer.name → http.host |
| Server Port | server.port | net.peer.port |
| Client Address | client.address | net.sock.peer.addr → net.peer.ip |
// Get HTTP status code from span attributes
const getHttpStatus = (attrs) => {
const find = (key) => attrs.find(a => a.key === key)?.value?.intValue;
return find('http.response.status_code') ?? find('http.status_code');
};
// Get peer/server address
const getPeerAddress = (attrs) => {
const find = (key) => attrs.find(a => a.key === key)?.value?.stringValue;
return find('server.address') ?? find('net.peer.name') ?? find('http.host');
};
Analysis tip: Your traces show mixed conventions (e.g., http.status_code alongside http.response.status_code). Always check both.
| Library | Language | Span Patterns |
|---------|----------|---------------|
| otelhttp | Go | HTTP client/server spans |
| @opentelemetry/instrumentation-express | Node.js | middleware - * spans |
| @opentelemetry/instrumentation-http | Node.js | HTTP client/server |
| io.opentelemetry.tomcat-* | Java | Servlet container spans |
| io.opentelemetry.jdbc | Java | Database spans |
| otelgorm | Go | GORM database spans |
| redisotel | Go | Redis spans |
Express instrumentation creates spans for each middleware:
middleware - query (URL parsing)
middleware - expressInit (Express setup)
middleware - cookieParser (Cookie handling)
middleware - cors (CORS)
middleware - jsonParser (Body parsing)
middleware - <anonymous> (Custom middleware)
request handler - /path (Route handler)
Analysis tip: Middleware spans are typically <1ms. The request handler or custom middleware spans contain the actual work.
ORMs like GORM create spans per query:
gorm.Query (Generic query)
gorm.Row (Single row)
SELECT tablename (Table-specific)
UPDATE tablename (Updates)
INSERT tablename (Inserts)
| Code | Name | Meaning | |------|------|---------| | 0 | UNSET | No explicit status (usually OK) | | 1 | OK | Explicitly successful | | 2 | ERROR | Operation failed |
Errors often include exception events:
{
"name": "exception",
"attributes": [
{"key": "exception.type", "value": {"stringValue": "TimeoutError"}},
{"key": "exception.message", "value": {"stringValue": "context deadline exceeded"}},
{"key": "exception.stacktrace", "value": {"stringValue": "..."}}
]
}
| Error Message Pattern | Likely Cause |
|----------------------|--------------|
| context deadline exceeded | Upstream timeout |
| context canceled | Caller canceled request |
| connection refused | Target service down |
| connection reset / ECONNRESET | Connection dropped |
| no such host | DNS resolution failed |
| certificate errors | TLS/SSL issues |
| max concurrency / circuit open | Circuit breaker tripped |
| timeout | Operation exceeded time limit |
| rate limit / 429 | Throttling |
| not found / 404 | Resource doesn't exist (may be expected) |
Errors propagate up the span tree:
status.code=2Analysis tip: Always find the deepest error first — that's usually the root cause.
const durationMs = (endTimeUnixNano - startTimeUnixNano) / 1_000_000;
const durationSeconds = durationMs / 1000;
Root spans have empty or missing parentSpanId:
jq '.batches[].instrumentationLibrarySpans[].spans[] |
select(.parentSpanId == "" or .parentSpanId == null)' trace.json
spanIdparentSpanIdThe critical path is the chain of spans determining end-to-end latency. The naive algorithm ("pick child with latest end time") can fail with:
Correct Algorithm:
[start, end, duration, selfTime]selfTime = duration - sum(overlapping child durations)endTime AND// Improved critical path detection
const findCriticalPath = (spans, rootId) => {
const byId = new Map(spans.map(s => [s.spanId, s]));
const children = new Map();
spans.forEach(s => {
if (!children.has(s.parentSpanId)) children.set(s.parentSpanId, []);
children.get(s.parentSpanId).push(s);
});
const path = [];
let current = byId.get(rootId);
while (current) {
path.push(current);
const kids = children.get(current.spanId) || [];
// Filter out async tails (child ends after parent)
const blocking = kids.filter(k => k.endTimeUnixNano <= current.endTimeUnixNano);
// Pick child with latest end time among blocking children
current = blocking.sort((a, b) =>
Number(b.endTimeUnixNano - a.endTimeUnixNano)
)[0];
}
return path;
};
Output critical path with timing breakdown:
Critical Path (3.2s total):
orders-api [self: 50ms, waiting: 3150ms]
└─> payments [self: 100ms, waiting: 2000ms] ← 66% of total
└─> db.query [self: 15ms] (leaf)
Self-time = span duration minus children's durations
selfTime = duration - sum(child.duration for all children)
High self-time indicates work done in that span itself. Low self-time with high duration indicates waiting for children.
If parentSpanId references a non-existent span:
Spans disconnected from the root:
parentSpanId chainsIf multiple spans share the same spanId:
# Detect duplicate spanIds
jq -r '[.batches[].instrumentationLibrarySpans[].spans[].spanId] |
group_by(.) | map(select(length > 1) | .[0])' trace.json
If spans have different traceId values:
# Check for multiple traceIds
jq -r '[.batches[].instrumentationLibrarySpans[].spans[].traceId] | unique' trace.json
Signs of propagation failure:
# Find CLIENT spans without corresponding SERVER children
jq -r '
[.batches[].instrumentationLibrarySpans[].spans[]] as $all |
($all | map(.spanId) | unique) as $ids |
$all[] |
select(.kind == "SPAN_KIND_CLIENT") |
select([.spanId] as $parent | $all | map(select(.parentSpanId == $parent[0])) | length == 0) |
{name, spanId, "missing_child": true}
' trace.json
| Indicator | Likely Sampling | Likely Instrumentation Gap | |-----------|----------------|---------------------------| | Random spans missing | Probabilistic sampling | No | | Entire service missing | Tail sampling excluded | Missing SDK setup | | First N spans only | Head sampling limit | No | | Only error spans present | Error-biased sampling | No | | Internal spans missing but CLIENT/SERVER present | No | Auto-instrumentation only |
Analysis tip: Check telemetry.sdk.* attributes to identify uninstrumented services.
These commands require jq to be installed. They're provided as convenience helpers, not requirements. The analysis logic above should be applied regardless of tooling.
Note: These examples use the legacy batches[].instrumentationLibrarySpans path. For modern OTLP (resourceSpans[].scopeSpans), adapt paths accordingly.
# Total spans
jq '[.batches[].instrumentationLibrarySpans[].spans[]] | length' trace.json
# Unique services
jq -r '.batches[].resource.attributes[] |
select(.key == "service.name") | .value.stringValue' trace.json | sort -u
# Span kinds distribution
jq -r '.batches[].instrumentationLibrarySpans[].spans[].kind' trace.json |
sort | uniq -c | sort -rn
# All error spans
jq '.batches[].instrumentationLibrarySpans[].spans[] |
select(.status.code == 2) | {name, message: .status.message}' trace.json
# Exception messages
jq -r '.batches[].instrumentationLibrarySpans[].spans[].events[]? |
select(.name == "exception") | .attributes[] |
select(.key == "exception.message") | .value.stringValue' trace.json
# Slowest spans
jq -r '.batches[].instrumentationLibrarySpans[].spans[] |
"\(.name)|\(((.endTimeUnixNano - .startTimeUnixNano) / 1000000 | floor))ms"' trace.json |
sort -t'|' -k2 -rn | head -10
# Spans over 1 second
jq '.batches[].instrumentationLibrarySpans[].spans[] |
select(((.endTimeUnixNano - .startTimeUnixNano) / 1000000) > 1000) |
{name, duration_ms: ((.endTimeUnixNano - .startTimeUnixNano) / 1000000)}' trace.json
# HTTP status codes
jq -r '.batches[].instrumentationLibrarySpans[].spans[].attributes[]? |
select(.key == "http.status_code") | .value.intValue' trace.json |
sort | uniq -c | sort -rn
# Failed HTTP requests (4xx, 5xx)
jq '.batches[].instrumentationLibrarySpans[].spans[] |
select(.attributes[]? | select(.key == "http.status_code" and .value.intValue >= 400))' trace.json
# Outbound calls (CLIENT spans with peer info)
jq -r '.batches[].instrumentationLibrarySpans[].spans[] |
select(.kind == "SPAN_KIND_CLIENT") | .attributes[]? |
select(.key == "net.peer.name" or .key == "http.host") |
.value.stringValue' trace.json | sort | uniq -c | sort -rn
CRITICAL: All 10 sections below are REQUIRED for Full Analysis. Do NOT skip, summarize, or combine sections.
Always include first. Must have ALL fields:
┌─────────────────────────────────────────────────┐
│ TRACE SCORECARD │
├─────────────────────────────────────────────────┤
│ Trace ID: 70eca60a2b3a5427... │
│ Duration: 3.2s │
│ Spans: 47 (3 errors) │
│ Services: 4 (orders, payments, inventory, db)│
│ Bottleneck: db.query (2.1s, 66% of total) │
│ Root Error: connection refused @ payments │
│ Completeness: Complete (no orphans) │
│ Confidence: HIGH │
└─────────────────────────────────────────────────┘
Confidence Levels:
Include for ALL traces with 2+ services:
┌─────────────────────────────────────────────────────────────────┐
│ SERVICE DEPENDENCIES │
├──────────────┬──────────────┬───────┬─────────┬────────┬────────┤
│ From │ To │ Calls │ p50 │ Max │ Errors │
├──────────────┼──────────────┼───────┼─────────┼────────┼────────┤
│ orders │ payments │ 3 │ 45ms │ 2100ms │ 1 │
│ orders │ inventory │ 1 │ 12ms │ 12ms │ 0 │
│ payments │ db │ 5 │ 8ms │ 15ms │ 0 │
└──────────────┴──────────────┴───────┴─────────┴────────┴────────┘
2-3 sentences covering:
Must include ALL of the following with SPECIFIC values (no generalities):
Span Statistics:
Span Kind Distribution:
HTTP Status Code Distribution: | Status | Count | |--------|-------| | 200 | X | | 4xx | X | | 5xx | X |
Service Span Counts: | Service | Spans | Errors | |---------|-------|--------| | service-a | X | X |
Slowest Spans (top 5-10):
span-name: Xms (service)span-name: Xms (service)Tree diagram with timing breakdown. Include self-time and waiting-time:
root-service (total: 3200ms) [self: 50ms, waiting: 3150ms]
└─> child-service (2100ms) [self: 100ms, waiting: 2000ms] ← ERROR
└─> db.query (15ms) [self: 15ms] (leaf)
└─> parallel-service (12ms) [parallel]
Root Cause Errors:
| Span Name | Span ID | Status | Message (verbatim) |
|-----------|---------|--------|-------------------|
| operation | abc123 | ERROR | "exact error message" |
Propagated Errors:
| Span Name | Span ID | Inherited From |
|-----------|---------|----------------|
| parent-op | def456 | operation (abc123) |
Masked Errors (investigate):
| Span Name | Span ID | Error | Parent Status |
|-----------|---------|-------|---------------|
| swallowed | ghi789 | "error msg" | OK |
Expected Errors (informational):
| Span Name | Span ID | Message | Why Expected |
|-----------|---------|---------|--------------|
| lookup | jkl012 | "404 Not Found" | User lookup, may not exist |
If a category has no entries, explicitly state: "None detected"
Async Tails (child outlives parent):
| Child Span | Parent Span | Outlive Duration |
|------------|-------------|------------------|
| async-op (id) | parent (id) | Xms |
Clock Skew:
| Span | Issue | Delta |
|------|-------|-------|
| span | Child starts before parent | Xms |
If none detected, explicitly state: "No timing anomalies detected"
Detected Retries:
| Operation | Attempts | Timing (gaps) | Final Outcome |
|-----------|----------|---------------|---------------|
| operation | 3 | 100ms, 200ms, 400ms | SUCCESS |
Retry Analysis:
If no retries detected, explicitly state: "No retry patterns detected"
Evidence-backed reasoning. Must include:
Example:
The root cause is span
abc123(getFiservToken) which took 34,961ms. The parent spandef456(payments) has a 10s timeout, which triggered at 10,004ms while the child was still executing. Evidence: childendTimeUnixNano(147290000000) > parentendTimeUnixNano(122329000000).
Prioritized with team ownership:
P0 - Immediate:
P1 - Short Term: 2. Action item
P2 - Medium Term: 3. Action item
If no actions needed, explain why the trace is healthy.
| Key Span | Span ID | Service | |----------|---------|---------| | Root | abc123 | service-a | | Error | def456 | service-b | | Bottleneck | ghi789 | service-c |
DO:
service.name resource attributeDO NOT:
status.code=2 spans are problems (404 may be expected)| Pattern | Signature |
|---------|-----------|
| Timeout cascade | Parent times out, children show context canceled |
| Retry storm | Multiple similar spans with increasing start times |
| Circuit breaker | max concurrency or circuit open in error |
| Downstream failure | Deepest span errors, parents propagate |
| Fan-out bottleneck | Many parallel children, one much slower |
| Cold start | First span in service much slower than subsequent |
| Connection pool exhaustion | Increasing tcp.connect times |
| DNS issues | Slow or failed dns.lookup spans |
development
Use when reviewing Dependabot alerts, npm audit findings, govulncheck output, or CVE reports on a JavaScript/Node.js or Go project — especially when triaging multiple alerts across direct and transitive dependencies to assess real-world risk and produce a remediation plan.
development
Use when a code review finding needs proof — write a focused test in JavaScript or Go that either confirms the issue is real or exposes it as over-engineering hyperbole. Trigger after code-review or code-review-skill findings are presented and evidence is requested.
development
Produce data-driven software delivery estimates by analyzing historical JIRA tickets, git activity, and engineer track records, then matching the new work against the most similar past tickets. Use this skill whenever the user asks "how long will this take", wants to estimate a piece of work, scope an epic, plan a sprint, or estimate delivery for JIRA stories or a Figma design. Also use whenever the user wants developer-to-work assignment recommendations based on history, wants to optimize an estimate by adding or reallocating engineers, or asks "what's the fastest way to ship this" or "who should work on this". Especially trigger when the user provides JIRA ticket IDs, JIRA story links, or Figma designs together with any indication of a team that will execute the work.
tools
Use when auditing an existing test suite for quality and coverage gaps, evaluating Playwright migration readiness, scoring automation against a world-class e-commerce standard, or guiding the creation of new tests. Applicable to Selenium, WebdriverIO, and Playwright suites.