skills/dag-ops/SKILL.md
Operations, debugging, and optimization for DAG workflows. Performs root cause analysis on failures, profiles execution performance, aggregates results from parallel branches, bridges context between nodes, and learns patterns from execution history. Activate on "DAG failed", "why did it fail", "root cause", "performance profile", "aggregate results", "merge branches", "execution patterns", "optimize DAG". NOT for planning DAGs (use dag-planner), executing DAGs (use dag-runtime), or validating outputs (use dag-quality).
Operations, debugging, and optimization for DAG workflows. Handles failure analysis, performance profiling, result aggregation, and pattern learning.
If failure is transient (timeout, rate limit):
├── Immediate retry with exponential backoff
├── If retries exhausted → escalate with timeline
If failure is model-related (refusal, format error):
├── Try alternative model with same tier
├── If all models fail → escalate with prompt review needed
If failure is contract violation (schema mismatch):
├── Check upstream node outputs for corruption
├── If upstream OK → retry with explicit schema validation
├── If upstream corrupt → trace to root cause
If failure is cascade (downstream propagation):
├── Find first failing node in dependency chain
├── If root cause confidence ≥ 80% → auto-fix and re-execute
├── If root cause confidence < 80% → escalate with partial diagnosis
If failure is resource constraint (cost/token limits):
├── Check if downgrade possible (Sonnet → Haiku)
├── If downgrade viable → auto-apply and retry
├── If no viable downgrade → escalate with resource request
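The failure routing above can be sketched as a small dispatch function. This is an illustration under stated assumptions, not the skill's actual implementation: `FailureKind`, `route_failure`, its keyword parameters, and the returned action strings are hypothetical names. Only the branch logic and the 80% threshold come from the tree.

```python
from enum import Enum


class FailureKind(Enum):
    TRANSIENT = "transient"   # timeout, rate limit
    MODEL = "model"           # refusal, format error
    CONTRACT = "contract"     # schema mismatch
    CASCADE = "cascade"       # downstream propagation
    RESOURCE = "resource"     # cost/token limits


AUTO_FIX_THRESHOLD = 0.80  # from the cascade branch of the tree


def route_failure(kind, *, retries_left=0, models_left=0,
                  upstream_ok=True, confidence=0.0, downgrade_viable=False):
    """Map a classified failure to the next action (action names are illustrative)."""
    if kind is FailureKind.TRANSIENT:
        return "retry_with_backoff" if retries_left > 0 else "escalate_with_timeline"
    if kind is FailureKind.MODEL:
        return "try_alternative_model" if models_left > 0 else "escalate_prompt_review"
    if kind is FailureKind.CONTRACT:
        return "retry_with_schema_validation" if upstream_ok else "trace_root_cause"
    if kind is FailureKind.CASCADE:
        # Assumption of this sketch: exactly 0.80 counts as meeting the threshold.
        if confidence >= AUTO_FIX_THRESHOLD:
            return "auto_fix_and_reexecute"
        return "escalate_partial_diagnosis"
    if kind is FailureKind.RESOURCE:
        return "downgrade_and_retry" if downgrade_viable else "escalate_resource_request"
    raise ValueError(f"unknown failure kind: {kind!r}")
```

Keeping the routing as a pure function makes each branch easy to unit test separately from the code that actually retries or escalates.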
If bottleneck is on critical path:
├── Duration > 30s → check for parallelization opportunities
├── Cost > $0.50 per node → evaluate model downgrade
├── Retry count > 2 → flag for prompt optimization
If resource utilization is suboptimal:
├── Parallel capacity unused → recommend dag-planner restructure
├── Model overkill detected → auto-route to cheaper alternative
├── Queue wait time > execution time → flag resource scaling
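The numeric profiling thresholds above can be checked mechanically. The sketch below assumes a hypothetical `NodeProfile` record and flag names. The 30 s, $0.50, and 2-retry limits are taken from the rules above, and the unused-capacity and model-overkill checks are left out because the text gives no measurable criterion for them.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class NodeProfile:
    # Hypothetical per-node metrics; field names are assumptions.
    name: str
    duration_s: float
    cost_usd: float
    retry_count: int
    queue_wait_s: float
    on_critical_path: bool


def profile_flags(p: NodeProfile) -> list[str]:
    """Return the optimization flags that apply to one node."""
    flags = []
    if p.on_critical_path:
        if p.duration_s > 30:
            flags.append("check_parallelization")
        if p.cost_usd > 0.50:
            flags.append("evaluate_model_downgrade")
        if p.retry_count > 2:
            flags.append("flag_prompt_optimization")
    # Queue wait longer than execution suggests a capacity problem.
    if p.queue_wait_s > p.duration_s:
        flags.append("flag_resource_scaling")
    return flags
```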
If parallel branches produce same data type:
├── Content similarity > 90% → deduplicate and merge
├── Content similarity 50-90% → synthesize with conflict resolution
├── Content similarity < 50% → concatenate with clear attribution
If parallel branches produce different formats:
├── Compatible schemas → normalize and merge
├── Incompatible schemas → escalate for format reconciliation
If results conflict (contradictory facts):
├── Confidence scores available → select highest confidence
├── No confidence scores → escalate for human resolution
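A minimal sketch of the aggregation rules above. The source does not say how content similarity is measured, so `text_similarity` uses `difflib.SequenceMatcher` from the Python standard library purely as a stand-in. The function names and returned strategy labels are also assumptions.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Stand-in similarity metric in [0, 1]; the real metric is not specified."""
    return SequenceMatcher(None, a, b).ratio()


def merge_strategy(similarity: float) -> str:
    """Pick a merge strategy for two branches that produce the same data type."""
    if similarity > 0.90:
        return "deduplicate_and_merge"
    if similarity >= 0.50:
        return "synthesize_with_conflict_resolution"
    return "concatenate_with_attribution"


def resolve_conflict(candidates):
    """candidates: list of (text, confidence or None).

    Returns the highest-confidence text, or None when any score is missing,
    which signals that the conflict should be escalated to a human.
    """
    if not candidates or any(conf is None for _, conf in candidates):
        return None
    return max(candidates, key=lambda c: c[1])[0]
```

Returning `None` instead of guessing keeps the "no confidence scores" branch explicit, so the caller has to handle escalation.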
Detection: Multiple downstream failures after single upstream error, with separate remediation attempts for each failure. Diagnosis: Not tracing failures back to root cause, treating symptoms as independent problems. Fix: Always trace backward through dependency graph until finding first deviation from expected output.
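The backward trace described in this fix can be sketched as a graph walk. The code assumes the DAG is available as a mapping from each node to its upstream nodes, and that a set of nodes whose output deviated from expectations has already been computed. Both inputs and the function name `find_root_causes` are hypothetical.

```python
def find_root_causes(failed_node, upstream, deviated):
    """Walk upstream from failed_node and return the earliest deviating nodes.

    upstream: dict mapping node -> list of nodes it depends on
    deviated: set of nodes whose output deviated from what was expected
    """
    roots, seen, stack = [], set(), [failed_node]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        bad_parents = [p for p in upstream.get(node, []) if p in deviated]
        if bad_parents:
            # This node is a symptom: keep tracing through its bad inputs.
            stack.extend(bad_parents)
        else:
            # No deviating input, so the deviation starts here.
            roots.append(node)
    return roots
```

Remediation then targets only the returned nodes. Everything downstream of them is re-executed rather than repaired individually.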
Detection: Automatic remediation applied when root cause confidence < 70%, leading to repeated failures. Diagnosis: Acting on weak diagnosis without escalating for human review. Fix: Set confidence threshold at 80% for auto-fix, escalate below threshold with partial analysis.
Detection: Downstream nodes failing due to missing context that was available in earlier waves. Diagnosis: Not bridging context across non-adjacent nodes in the DAG. Fix: Maintain context registry with node dependencies, propagate relevant context forward.
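One way to picture the context registry from this fix is a store keyed by node that can be queried for all transitive ancestors, not just direct parents. The `ContextRegistry` class and its methods are hypothetical, and the sketch returns every ancestor's context without filtering for relevance, which a real implementation would need to add.

```python
class ContextRegistry:
    """Illustrative registry; not the skill's actual data structure."""

    def __init__(self, upstream):
        self.upstream = upstream   # node -> list of nodes it depends on
        self._store = {}           # node -> published context dict

    def publish(self, node, context):
        """Record the context a node produced when it finished."""
        self._store[node] = dict(context)

    def context_for(self, node):
        """Collect published context from every transitive ancestor of node."""
        collected, seen = {}, set()
        stack = list(self.upstream.get(node, []))
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            if ancestor in self._store:
                collected[ancestor] = self._store[ancestor]
            stack.extend(self.upstream.get(ancestor, []))
        return collected
```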
Detection: Parallel branch results merged without conflict detection, producing incoherent output. Diagnosis: Assuming parallel results are always compatible without validation. Fix: Always run similarity analysis and conflict detection before merging parallel results.
Detection: Optimizing individual node performance while ignoring overall DAG efficiency. Diagnosis: Focusing on local metrics without considering critical path and resource allocation. Fix: Analyze critical path first, then optimize bottlenecks that actually impact total execution time.
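Analyzing the critical path first amounts to finding the longest-duration path through the DAG. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+). The `upstream` and `duration` mappings and the function name are assumptions about how execution data might be represented.

```python
from graphlib import TopologicalSorter


def critical_path(upstream, duration):
    """Return (path, total_duration) for the longest path through the DAG.

    upstream: dict mapping node -> list of nodes it depends on
    duration: dict mapping node -> measured execution time in seconds
    """
    finish, prev = {}, {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(upstream).static_order():
        parents = upstream.get(node, [])
        slowest = max(parents, key=lambda p: finish[p], default=None)
        start = finish[slowest] if slowest is not None else 0.0
        finish[node] = start + duration[node]
        prev[node] = slowest
    end = max(finish, key=finish.get)
    total = finish[end]
    path = []
    while end is not None:
        path.append(end)
        end = prev[end]
    return path[::-1], total
```

Only nodes on the returned path affect total execution time, so speeding up a node off this path does not shorten the run.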
DAG State: research-node → analysis-node → summary-node
Failure: summary-node returns "Error: Cannot summarize incoherent analysis"
Step 1: Trace backward
- Check analysis-node output: "The data is unclear and contradictory..."
- Check research-node output: Mix of valid research + API error responses
Step 2: Classify failure
- Root cause: research-node partially failed (got some API errors)
- Symptom: analysis-node tried to work with corrupted data
- Downstream: summary-node failed on corrupted analysis
Step 3: Calculate confidence
- Research-node error pattern clear (API timeouts) → confidence 85%
- Remediation path clear (retry research with backoff) → confidence 90%
- Overall confidence: 85% > 80% threshold
Step 4: Auto-remediation
- Retry research-node with exponential backoff
- Re-execute analysis-node and summary-node
- Result: Full DAG completion without escalation
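The retry in Step 4 can be sketched as follows. This is a minimal example under stated assumptions: transient failures are modeled as `TimeoutError`, the delay doubles from `base_delay` with no jitter, and the function name and parameters are hypothetical. The `sleep` argument is injectable so the schedule can be tested without waiting.

```python
import time


def retry_with_backoff(fn, *, max_attempts=4, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying on TimeoutError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == max_attempts - 1:
                # Retries exhausted: let the caller escalate.
                raise
            # Delays grow as base_delay * 1, 2, 4, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

After the retried node succeeds, its downstream nodes (here analysis-node and summary-node) are re-executed in dependency order.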
Scenario: Parallel code review branches (security-review + performance-review)
Security output: "Function validate_input() needs input sanitization"
Performance output: "Function validate_input() should be removed for speed"
Step 1: Detect conflict
- Similarity analysis: Both mention same function → 60% overlap
- Contradiction detection: "needs X" vs "remove" → conflict flagged
Step 2: Conflict resolution routing
- No confidence scores in outputs
- Conflicting recommendations on same code element
- Route: Escalate for human resolution with structured conflict summary
Step 3: Structure escalation
- Conflict: Function validate_input() handling
- Security perspective: Add input sanitization
- Performance perspective: Remove for speed optimization
- Human decision needed: Security vs performance tradeoff
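The structured escalation in Step 3 could be carried as a small record. The `ConflictEscalation` dataclass and its field names are hypothetical. The values in the usage example are taken from the scenario above.

```python
from dataclasses import asdict, dataclass


@dataclass
class ConflictEscalation:
    conflict: str          # what the branches disagree about
    positions: dict        # perspective name -> recommendation
    decision_needed: str   # the tradeoff a human must resolve


def build_escalation(conflict, positions, decision_needed):
    """Build a serializable escalation summary for human review."""
    if len(positions) < 2:
        raise ValueError("a conflict needs at least two positions")
    return asdict(ConflictEscalation(conflict, dict(positions), decision_needed))


summary = build_escalation(
    "Function validate_input() handling",
    {
        "security": "Add input sanitization",
        "performance": "Remove for speed optimization",
    },
    "Security vs performance tradeoff",
)
```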
What this skill should NOT handle:
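- Planning DAGs: use dag-planner instead
- Executing DAGs: use dag-runtime instead
- Validating outputs: use dag-quality instead
Delegation rules:
- Delegate to dag-planner with optimization recommendations
- Delegate to dag-runtime with resource requirement updates
- Delegate to dag-quality with failure pattern analysis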