skills/dag-execution-tracer/SKILL.md
Traces complete execution paths through DAG workflows. Records timing, inputs, outputs, and state transitions for all nodes. Activate on 'execution trace', 'trace execution', 'execution path', 'debug execution', 'execution log'. NOT for performance analysis (use dag-performance-profiler) or failure investigation (use dag-failure-analyzer).
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add curiositech/windags-skills dag-execution-tracer
You are a DAG Execution Tracer. You instrument, record, and query execution traces through DAG workflows. You make broken pipelines debuggable.
Purpose of tracing request?
├── Quick diagnosis (default)
│ └── Node-level only: start/end/status/duration (~2KB per DAG)
├── Deep debugging (specific failure)
│ ├── Single node failing → Node + tool calls for that node only
│ └── Data flow corruption → Node + input/output hashes + state snapshots (~50KB per DAG)
├── Production monitoring
│ ├── <1000 DAGs/day → Sample 10%, node-level only
│ └── ≥1000 DAGs/day → Sample 5%, node-level only (~200B per DAG amortized)
└── Performance regression hunting
    └── Node + timing + input/output hashes (no payloads) (~5KB per DAG)
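The purpose tree above can be read as a lookup from tracing purpose to a capture level and a sampling rate. The following is a minimal sketch of that mapping, not part of the skill itself: `TracePurpose`, `TracePlan`, and `planTracing` are hypothetical names, and treating exactly 1000 DAGs/day as high volume is an assumption.

```typescript
// Hypothetical names for illustration; the skill does not define this API.
type TracePurpose =
  | 'quick-diagnosis'
  | 'single-node-failure'
  | 'data-flow-corruption'
  | 'production-monitoring'
  | 'performance-regression';

interface TracePlan {
  level: 'node' | 'node+tools' | 'node+hashes+snapshots' | 'node+timing+hashes';
  sampleRate: number; // fraction of DAG executions traced, 0..1
}

function planTracing(purpose: TracePurpose, dagsPerDay: number = 0): TracePlan {
  switch (purpose) {
    case 'quick-diagnosis':
      return { level: 'node', sampleRate: 1 };
    case 'single-node-failure':
      return { level: 'node+tools', sampleRate: 1 };
    case 'data-flow-corruption':
      return { level: 'node+hashes+snapshots', sampleRate: 1 };
    case 'production-monitoring':
      // Assumption: exactly 1000 DAGs/day counts as high volume.
      return { level: 'node', sampleRate: dagsPerDay < 1000 ? 0.1 : 0.05 };
    case 'performance-regression':
      return { level: 'node+timing+hashes', sampleRate: 1 };
  }
}
```

For example, `planTracing('production-monitoring', 5000)` yields node-level tracing at a 5% sample rate.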
DAG execution state?
├── Normal execution
│ ├── <10 nodes → Full tracing (minimal overhead)
│ └── ≥10 nodes → Sample instrumentation every 3rd node unless debugging specific failure
├── Parallel execution detected
│ ├── Independent parallel waves → Trace each wave separately with wave-id
│ └── Interdependent parallel nodes → Full tracing required for race condition detection
├── Retry scenario active
│ ├── First retry → Add retry-attempt=1 to all spans, keep previous attempt trace
│ └── Multiple retries → Archive previous attempts, trace only current attempt
└── Abort signal received
    └── Emergency flush: end all active spans immediately, mark as 'cancelled', preserve partial trace
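Two branches of the execution-state tree are mechanical enough to sketch: the every-3rd-node sampling rule and the emergency flush on abort. The code below is an illustrative sketch only. `LiveSpan`, `shouldInstrument`, and `emergencyFlush` are hypothetical names, and the `'running'`/`'ok'`/`'error'` status values are assumptions (only `'cancelled'` comes from the tree).

```typescript
// Hypothetical shapes and names; the source does not define them.
type SpanState = 'running' | 'ok' | 'error' | 'cancelled';

interface LiveSpan {
  nodeId: string;
  startTime: number;
  endTime?: number;
  status: SpanState;
}

// Instrument every node in small DAGs or while debugging; otherwise every 3rd node.
function shouldInstrument(nodeIndex: number, nodeCount: number, debugging: boolean): boolean {
  if (debugging || nodeCount < 10) return true;
  return nodeIndex % 3 === 0;
}

// Emergency flush on abort: close every span that is still open and mark it
// cancelled. Finished spans are left untouched so the partial trace survives.
function emergencyFlush(spans: LiveSpan[], now: number): number {
  let flushed = 0;
  for (const span of spans) {
    if (span.endTime === undefined) {
      span.endTime = now;
      span.status = 'cancelled';
      flushed++;
    }
  }
  return flushed;
}
```

The caller supplies `now` so that the same clock used for `startTime` also stamps the flushed spans.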
Is trace overhead acceptable?
├── Trace overhead <2% of execution time → Continue full instrumentation
├── Trace overhead 2-5% → Switch to sampling mode (every 3rd span)
├── Trace overhead 5-10% → Hash-only mode (no payloads, just metadata)
└── Trace overhead >10% → Minimal mode (start/end times only)
Check overhead: if (traceTimeMs / totalExecutionMs) > 0.02 → reduce granularity
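The overhead thresholds can be expressed as a single function of the measured ratio. This is a sketch under assumptions: `TraceMode` and `selectTraceMode` are hypothetical names, and the tree does not say which side owns the exact boundaries (2%, 5%, 10%), so the choice below is arbitrary.

```typescript
// Hypothetical names; the mode labels paraphrase the tree above.
type TraceMode = 'full' | 'sampled' | 'hash-only' | 'minimal';

function selectTraceMode(traceTimeMs: number, totalExecutionMs: number): TraceMode {
  if (totalExecutionMs <= 0) return 'full'; // nothing measured yet, keep full tracing
  const ratio = traceTimeMs / totalExecutionMs;
  if (ratio < 0.02) return 'full';       // under 2%: continue full instrumentation
  if (ratio < 0.05) return 'sampled';    // 2-5%: every 3rd span
  if (ratio <= 0.10) return 'hash-only'; // 5-10%: metadata and hashes, no payloads
  return 'minimal';                      // over 10%: start/end times only
}
```

With 18000 ms of trace time in a 45000 ms run (40%), the function returns `'minimal'`.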
Symptom: Gaps in execution timeline where nodes show no trace entries or spans end abruptly mid-execution.
Root cause: Node executor fails to propagate trace context through async boundaries, or abort signals flush incomplete spans.
Detection rule: If trace.spans.length < dag.nodes.length and execution status is 'completed'
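The detection rule compares counts, which says that something is missing but not what. A slightly more informative check, sketched here with the hypothetical names `SpanRecord` and `findTraceGaps`, reports which nodes have no span at all and which spans never ended. Note that if node sampling is active, missing spans are expected and this check would need to account for that.

```typescript
// Hypothetical minimal span shape: only the fields this check needs.
interface SpanRecord {
  nodeId: string;
  endTime?: number;
}

function findTraceGaps(
  dagNodeIds: string[],
  spans: SpanRecord[],
): { missing: string[]; unfinished: string[] } {
  const traced = new Set<string>();
  const unfinished: string[] = [];
  for (const span of spans) {
    traced.add(span.nodeId);
    // A span without an endTime ended abruptly or was never closed.
    if (span.endTime === undefined) unfinished.push(span.nodeId);
  }
  // Nodes in the DAG definition that never produced a span.
  const missing = dagNodeIds.filter(id => !traced.has(id));
  return { missing, unfinished };
}
```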
Fix procedure:
signal.addEventListener('abort', () => tracer.flushPending());
Symptom: Process memory grows linearly with trace count, eventually causing OOM on long-running systems.
Root cause: Trace store Map<traceId, ExecutionTrace> never evicts completed traces.
Detection rule: If tracer.getActiveTraceCount() > 50 or heap usage from traces exceeds 10MB
Fix procedure:
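One possible way to bound the store is to evict completed traces, oldest first, whenever the limit is exceeded. The sketch below is an assumption about how that could look, not the skill's own procedure. `BoundedTraceStore` and `StoredTrace` are hypothetical names, and the default of 50 mirrors the threshold in the detection rule.

```typescript
// Hypothetical trace record: only the fields eviction needs.
interface StoredTrace {
  traceId: string;
  completed: boolean;
}

class BoundedTraceStore {
  private traces = new Map<string, StoredTrace>();
  private maxTraces: number;

  constructor(maxTraces: number = 50) {
    this.maxTraces = maxTraces;
  }

  add(trace: StoredTrace): void {
    this.traces.set(trace.traceId, trace);
    this.evictCompleted();
  }

  has(traceId: string): boolean {
    return this.traces.has(traceId);
  }

  get size(): number {
    return this.traces.size;
  }

  // Map iterates in insertion order, so the oldest completed traces go first.
  // Active traces are never evicted, so the store can exceed the limit while
  // more than maxTraces executions are in flight.
  private evictCompleted(): void {
    this.traces.forEach((trace, traceId) => {
      if (this.traces.size > this.maxTraces && trace.completed) {
        this.traces.delete(traceId);
      }
    });
  }
}
```

In a real system, evicted traces would presumably be exported to durable storage before deletion.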
Symptom: JSON.stringify() throws "Converting circular structure" when exporting traces or logging.
Root cause: Node outputs contain objects that reference the node/DAG itself, creating cycles.
Detection rule: If JSON.stringify(span.attributes) throws TypeError about circular structure
Fix procedure:
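A common remedy is a `JSON.stringify` replacer that tracks the chain of ancestor objects and substitutes a placeholder when a value points back into that chain. The sketch below follows that approach. `safeStringify` is a hypothetical name and the `"[Circular]"` marker is an arbitrary choice.

```typescript
// Serializes a value, replacing any reference back to an ancestor with "[Circular]".
// Objects that are merely shared (referenced twice without a cycle) are kept intact.
function safeStringify(value: unknown): string {
  const ancestors: unknown[] = [];
  return JSON.stringify(value, function (this: unknown, _key: string, val: unknown): unknown {
    if (typeof val !== 'object' || val === null) return val;
    // `this` is the object that holds `val`; unwind the stack until it is on top.
    while (ancestors.length > 0 && ancestors[ancestors.length - 1] !== this) {
      ancestors.pop();
    }
    if (ancestors.indexOf(val) !== -1) return '[Circular]';
    ancestors.push(val);
    return val;
  });
}
```

Calling `safeStringify(span.attributes)` in place of `JSON.stringify(span.attributes)` keeps export and logging from throwing on cyclic node outputs.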
Symptom: Child node startTime appears before parent node endTime in trace timeline.
Root cause: Timestamps come from the Date.now() wall clock, which can jump backwards and is not synchronized across processes/threads.
Detection rule: If any span.startTime < parent.endTime for sequential dependencies
Fix procedure:
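Within a single process, a monotonic clock such as `performance.now()` avoids backwards jumps. Across processes, monotonic readings are not directly comparable, so an ordering check on the recorded spans is still useful. The following sketch implements the detection rule over explicit dependency edges. `TimedSpan` and `findOrderingViolations` are hypothetical names.

```typescript
// Hypothetical span shape: timestamps are assumed to share one time base.
interface TimedSpan {
  nodeId: string;
  startTime: number;
  endTime: number;
}

// Returns every [parent, child] edge whose child started before its parent ended.
function findOrderingViolations(
  spans: TimedSpan[],
  edges: Array<[string, string]>,
): Array<[string, string]> {
  const byNode = new Map<string, TimedSpan>();
  for (const span of spans) byNode.set(span.nodeId, span);
  return edges.filter(([parentId, childId]) => {
    const parent = byNode.get(parentId);
    const child = byNode.get(childId);
    // Edges with an untraced endpoint are skipped rather than flagged.
    return parent !== undefined && child !== undefined && child.startTime < parent.endTime;
  });
}
```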
Symptom: DAG execution time increases significantly (>5%) when tracing is enabled.
Root cause: Capturing too much data (full payloads) or inefficient serialization in hot path.
Detection rule: If (totalTraceTime / totalExecutionTime) > 0.05
Fix procedure:
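One way to cut payload cost is to store a short digest and a size in the span instead of the output itself. The sketch below uses 32-bit FNV-1a purely as a dependency-free stand-in; it is an assumption, not the hash the skill uses, and a production tracer would more likely use a cryptographic hash. `fnv1a` and `summarizeOutput` are hypothetical names.

```typescript
// 32-bit FNV-1a over UTF-16 code units (identical to the byte-wise variant for ASCII input).
function fnv1a(text: string): string {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return ('00000000' + (hash >>> 0).toString(16)).slice(-8);
}

// Replaces a full payload with a digest and its serialized length.
// Assumes the output is JSON-serializable and free of cycles.
function summarizeOutput(output: unknown): { hash: string; chars: number } {
  const json = JSON.stringify(output) ?? 'undefined';
  return { hash: fnv1a(json), chars: json.length };
}
```

Serialization still happens once per output here; if that step dominates the overhead, hashing only a truncated prefix or skipping large outputs entirely are possible further reductions.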
Scenario: 4-node DAG where nodes B and C run in parallel after A, then D combines their outputs. D receives input from B but C's output is missing/null.
Step 1: Identify the trace
const trace = tracer.getTrace('exec-2024-0324-parallel-fail');
// Check wave structure
console.log(trace.waves); // Should show: Wave 0: [A], Wave 1: [B,C], Wave 2: [D]
Step 2: Examine parallel execution timing
const spanB = trace.spans.find(s => s.nodeId === 'B');
const spanC = trace.spans.find(s => s.nodeId === 'C');
// Check if they actually ran in parallel
if (Math.abs(spanB.startTime - spanC.startTime) > 100) { // start times differ by more than 100 ms
  // Not truly parallel - scheduler issue, not trace issue
}
Step 3: Inspect node C's execution
// spanC was already looked up in Step 2
if (spanC.status === 'ERROR') {
  // C failed but error was swallowed
  console.log(spanC.attributes['error.message']); // "Timeout after 30s"
  console.log(spanC.attributes['error.type']); // "TimeoutError"
  // Root cause: C hit timeout, DAG continued without its output
}
Step 4: Verify data flow
// Check what D received
const spanD = trace.spans.find(s => s.nodeId === 'D');
console.log(spanD.attributes['dag.input.hash']); // Hash of {B: "result", C: null}
// C's output was null because of timeout, but D continued execution
Decision point navigated: This trace revealed the real issue wasn't missing data - it was timeout handling. Node C timed out but the DAG continued. The fix is in timeout configuration or retry policy, not data flow.
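The manual inspection in Steps 3 and 4 can be generalized into a check for swallowed failures: any dependency edge where the upstream span ended in ERROR yet the downstream node still produced a span. This is a sketch with the hypothetical names `StatusSpan` and `findSwallowedFailures`. The `'ERROR'` status value comes from the example, while `'OK'` is an assumption.

```typescript
// Hypothetical span shape: only the fields this check needs.
interface StatusSpan {
  nodeId: string;
  status: 'OK' | 'ERROR';
}

// Returns [upstream, downstream] edges where the upstream node failed
// but the downstream node ran anyway.
function findSwallowedFailures(
  spans: StatusSpan[],
  edges: Array<[string, string]>,
): Array<[string, string]> {
  const statusByNode = new Map<string, string>();
  for (const span of spans) statusByNode.set(span.nodeId, span.status);
  return edges.filter(
    ([upstream, downstream]) =>
      statusByNode.get(upstream) === 'ERROR' && statusByNode.has(downstream),
  );
}
```

For the 4-node DAG in this scenario, the check would flag the C → D edge.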
Scenario: 50-node code analysis DAG that normally runs in 30 seconds now takes 45 seconds with tracing enabled.
Step 1: Measure trace overhead per operation
// Check trace timing breakdown
const trace = tracer.getTrace('exec-large-dag-slow');
const traceTime = trace.spans.reduce((sum, span) => sum + span.traceOverheadMs, 0);
const totalTime = trace.endTime - trace.startTime;
console.log(`Trace overhead: ${traceTime}ms / ${totalTime}ms = ${(traceTime/totalTime*100).toFixed(1)}%`);
// Output: "Trace overhead: 18000ms / 45000ms = 40.0%"
Step 2: Identify expensive trace operations
// Find spans with highest trace overhead
const expensive = trace.spans
  .filter(s => s.traceOverheadMs > 200)
  .sort((a, b) => b.traceOverheadMs - a.traceOverheadMs);
// Output shows: file analysis nodes with large outputs are being fully serialized
Step 3: Apply decision tree for overhead reduction
Since overhead is 40% (well above the 10% threshold), switch to minimal mode:
// Reconfigure tracer for this DAG type
tracer.setConfig({
  granularity: 'minimal', // start/end times only
  captureOutput: 'hash-only', // no full payloads
  sampleRate: 0.2 // trace only 20% of tool calls
});
Step 4: Verify fix with re-run
The new trace shows 2% overhead, which is acceptable, and still captures enough data to debug flow issues.
Trace system is complete when:
This skill should NOT be used for:
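Bottleneck identification, resource usage, or optimization recommendations: use dag-performance-profiler instead
Error correlation, failure pattern detection, or recovery recommendations: use dag-failure-analyzer instead
Real-time alerting, SLA tracking, or uptime monitoring: use dag-health-monitor instead
Compute cost, memory usage, or efficiency metrics: use dag-resource-analyzer instead
Execution pattern analysis or optimization suggestions: use dag-pattern-learner instead

// ExecutionTrace, TraceSpan, TraceConfig, and SpanStatus are assumed to be defined elsewhere.
class ExecutionTracer {
  private traces = new Map<string, ExecutionTrace>();
  private config: TraceConfig = { granularity: 'node-level', maxTraces: 50 };

  // Open a span for a node. performance.now() is monotonic within a process.
  startSpan(traceId: string, nodeId: string, operation: string): TraceSpan {
    const span: TraceSpan = {
      spanId: crypto.randomUUID(),
      nodeId,
      operation,
      startTime: performance.now(),
      attributes: {},
      events: []
    };
    this.getOrCreateTrace(traceId).spans.push(span);
    return span;
  }

  // Close a span, record its status, and attach error attributes if given.
  endSpan(traceId: string, spanId: string, status: SpanStatus, errorAttrs?: Record<string, any>): void {
    const trace = this.traces.get(traceId);
    const span = trace?.spans.find(s => s.spanId === spanId);
    if (span) {
      span.endTime = performance.now();
      span.status = status;
      if (errorAttrs) Object.assign(span.attributes, errorAttrs);
    }
    // Bound memory: evict once the store exceeds maxTraces.
    if (this.traces.size > this.config.maxTraces) {
      this.evictOldestCompletedTrace();
    }
  }

  // The two helpers below are referenced above but not shown in the original;
  // these are sketches that assume ExecutionTrace has the shape { traceId, spans }.
  private getOrCreateTrace(traceId: string): ExecutionTrace {
    let trace = this.traces.get(traceId);
    if (!trace) {
      trace = { traceId, spans: [] } as ExecutionTrace;
      this.traces.set(traceId, trace);
    }
    return trace;
  }

  // Deletes the first (oldest-inserted) trace whose spans have all ended.
  private evictOldestCompletedTrace(): void {
    for (const [traceId, trace] of this.traces) {
      if (trace.spans.every(s => s.endTime !== undefined)) {
        this.traces.delete(traceId);
        return;
      }
    }
  }
}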
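The class above relies on several types that are not shown. The shapes below are a guess reconstructed from the fields the class and the worked examples touch. Every field beyond those (such as the `events` entry shape and the `'OK'` status) is an assumption, and `spanDurationMs` is a hypothetical helper added only to show the optional fields in use.

```typescript
// Assumed status values: 'ERROR' and 'cancelled' appear in this document, 'OK' is a guess.
type SpanStatus = 'OK' | 'ERROR' | 'cancelled';

interface TraceSpan {
  spanId: string;
  nodeId: string;
  operation: string;
  startTime: number;
  endTime?: number;    // unset until the span is ended
  status?: SpanStatus; // unset until the span is ended
  attributes: Record<string, unknown>;
  events: Array<{ name: string; timestamp: number }>; // entry shape is an assumption
}

interface ExecutionTrace {
  traceId: string;
  spans: TraceSpan[];
}

interface TraceConfig {
  granularity: 'node-level' | 'minimal';
  maxTraces: number;
}

// Duration of a finished span, or undefined while it is still open.
function spanDurationMs(span: TraceSpan): number | undefined {
  return span.endTime === undefined ? undefined : span.endTime - span.startTime;
}
```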