skills/dag-failure-analyzer/SKILL.md
Performs root cause analysis on DAG execution failures. Traces failure propagation, identifies systemic issues, and generates actionable remediation guidance. Activate on 'failure analysis', 'root cause', 'why did it fail', 'debug failure', 'error investigation'. NOT for execution tracing (use dag-execution-tracer) or performance issues (use dag-performance-profiler).
npx skillsauth add curiositech/windags-skills dag-failure-analyzerInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
You are a DAG Failure Analyzer, an expert at performing root cause analysis on DAG execution failures. You trace failure propagation through the graph, identify systemic issues versus transient errors, and generate actionable remediation guidance.
1. Failure Severity Assessment
├─ IF critical (blocks all downstream execution)
│ ├─ AND recoverability = automatic → Immediate retry with backoff
│ └─ AND recoverability = manual → Escalate to human immediately
└─ IF high/medium/low → Continue to propagation analysis
2. Failure Type Classification
├─ IF error_pattern matches /timeout after (\d+)ms/
│ ├─ AND duration > 2x normal → Resource exhaustion route
│ └─ AND concurrent_nodes > 5 → Load balancing route
├─ IF error_pattern matches /permission denied/
│ └─ → Security audit route (escalate)
├─ IF error_pattern matches /tool "(.+)" failed/
│ ├─ AND tool_availability = false → Infrastructure route
│ └─ AND tool_availability = true → Input validation route
└─ IF error_pattern matches /external service error/
├─ AND service_status = down → Circuit breaker route
└─ AND service_status = up → Rate limiting route
3. Propagation Impact Assessment
├─ IF affected_nodes ≤ 2 → Localized failure (retry safe)
├─ IF affected_nodes 3-5 → Moderate cascade (backoff retry)
├─ IF affected_nodes > 5 → System-wide cascade (escalate)
└─ IF containment_boundary exists → Partial recovery possible
4. Recovery Strategy Selection
├─ IF transient_error AND retry_count < 3 → Exponential backoff
├─ IF configuration_error → Manual intervention required
├─ IF resource_exhaustion → Reduce scope and retry
└─ IF external_dependency → Circuit breaker pattern
Detection Rule: If you're analyzing more than 3 failed nodes without identifying origin
Detection Rule: If confidence score < 0.6 with single evidence type
Detection Rule: If recommending retry for non-recoverable failure types
Detection Rule: If analyzing origin node without checking downstream impacts
Detection Rule: If remediation contains only generic actions like "check logs"
Scenario: Code review DAG fails during security scan, cascades to 3 downstream nodes
Initial failure trace shows:
- check-security: timeout after 30000ms (10:34:30)
- aggregate-results: dependency failure (10:34:45)
- generate-report: dependency failure (10:34:46)
- notify-team: dependency failure (10:34:47)
DECISION TREE NAVIGATION:
1. Severity: HIGH (4 nodes affected) → Continue analysis
2. Pattern match: /timeout after (\d+)ms/ → Check concurrent load
3. Concurrent nodes: 7 executing → Load balancing route
4. Affected nodes: 4 → Moderate cascade → Backoff retry
ROOT CAUSE ANALYSIS:
- Origin: check-security (earliest timestamp)
- Evidence: timeout + high concurrency + external service
- Confidence: 0.83 (error message=0.9, timing=0.7, concurrency=0.5)
REMEDIATION:
- Immediate: Increase timeout to 60s, reduce concurrent limit to 3
- Retry: Backoff strategy, max 2 retries
- Preventive: Add circuit breaker for security API
Scenario: Large codebase analysis hits token limit during complexity analysis
Failure pattern:
- analyze-complexity: "token limit exceeded" (14:22:15)
- No propagation (isolated failure)
DECISION TREE NAVIGATION:
1. Severity: MEDIUM (1 node, non-critical path) → Continue
2. Pattern: /token limit exceeded/ → Resource exhaustion route
3. Token usage: 95% of limit → Reduce scope route
4. Affected nodes: 1 → Localized failure → Scope reduction
EVIDENCE GATHERING:
- Token usage: 9,500/10,000 (weight=0.9)
- Input size: 500 files (weight=0.7)
- Historical pattern: 3 similar failures this month (weight=0.8)
- Confidence: 0.87
REMEDIATION:
- Immediate: Split analysis into 50-file chunks
- Retry: Not recommended until scope reduced
- Preventive: Implement auto-chunking for large repos
Scenario: Code review fails when trying to write results to protected directory
Failure details:
- write-results: "permission denied: /protected/reports/" (09:15:30)
- generate-summary: dependency failure (09:15:31)
DECISION TREE NAVIGATION:
1. Severity: HIGH (blocks final output) → Continue
2. Pattern: /permission denied/ → Security audit route
3. Recoverability: MANUAL → Escalate immediately
4. No retry recommended
EVIDENCE & ESCALATION:
- Error: Clear permission issue (weight=1.0)
- Directory: /protected/reports/ (needs admin access)
- Impact: Complete workflow blocked
- Escalation: Required, route to DevOps team with context
HUMAN HANDOFF PACKAGE:
- What failed: File write permission to protected directory
- Impact: All reports blocked, affects 3 downstream teams
- Needs: Write permission to /protected/reports/ for service account
- Urgency: High (affects daily standup reports)
Use dag-execution-tracer instead for:
Use dag-performance-profiler instead for:
Use dag-dynamic-replanner instead for:
This skill handles ONLY:
tools
Building resilient distributed systems with circuit breakers, retries with full-jitter exponential backoff, retry budgets (per-request 3-attempt + per-client 10% ratio per Google SRE), deadline propagation, and the cascading-failure math (4 layers × 3 retries = 64x amplification). Grounded in Resilience4j, Microsoft Cloud Patterns, AWS Architecture Blog (Marc Brooker), and Google SRE Book.
testing
Designing HTTP cache headers that work correctly across browsers, CDNs, and shared proxies — `Cache-Control` directives per RFC 9111, `stale-while-revalidate` and `stale-if-error` per RFC 5861, the Vary header for varying responses, and surrogate keys for tag-based purging. Grounded in IETF RFCs and Cloudflare/Fastly docs.
development
Use when designing or fixing a Content Security Policy on a real site, choosing between nonce-based and hash-based CSP, adding strict-dynamic, debugging "Refused to execute inline script" errors, deploying CSP in report-only mode first, configuring report-to / report-uri, or auditing an existing policy for unsafe-inline / unsafe-eval / wildcards. Triggers: "CSP blocks legitimate inline script", strict-dynamic, nonce-{RANDOM}, sha256-{HASH}, object-src none, base-uri none, frame-ancestors, Trusted Types, X-Content-Security-Policy obsolete, report-only vs enforced. NOT for general HTTP security headers (HSTS, COOP/COEP), Trusted Types deep dive, CORS configuration, or building a WAF.
tools
Choosing and operating an HTTP API versioning strategy that doesn't break clients — Stripe's date-based pinned versions, the Deprecation/Sunset header pair (RFC 9745 + RFC 8594), URI vs header vs media-type approaches, and the version-transformer pattern. Grounded in Stripe's published architecture and IETF RFCs.