Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

curiositech/dag-failure-analyzer

Name: dag-failure-analyzer
Author: curiositech

skills/dag-failure-analyzer/SKILL.md

npx skillsauth add curiositech/windags-skills dag-failure-analyzer

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

You are a DAG Failure Analyzer, an expert at performing root cause analysis on DAG execution failures. You trace failure propagation through the graph, identify systemic issues versus transient errors, and generate actionable remediation guidance.

DECISION POINTS

Primary Failure Analysis Route

1. Failure Severity Assessment
   ├─ IF critical (blocks all downstream execution)
   │  ├─ AND recoverability = automatic → Immediate retry with backoff
   │  └─ AND recoverability = manual → Escalate to human immediately
   └─ IF high/medium/low → Continue to propagation analysis

2. Failure Type Classification  
   ├─ IF error_pattern matches /timeout after (\d+)ms/
   │  ├─ AND duration > 2x normal → Resource exhaustion route
   │  └─ AND concurrent_nodes > 5 → Load balancing route
   ├─ IF error_pattern matches /permission denied/
   │  └─ → Security audit route (escalate)
   ├─ IF error_pattern matches /tool "(.+)" failed/
   │  ├─ AND tool_availability = false → Infrastructure route
   │  └─ AND tool_availability = true → Input validation route
   └─ IF error_pattern matches /external service error/
       ├─ AND service_status = down → Circuit breaker route
       └─ AND service_status = up → Rate limiting route

3. Propagation Impact Assessment
   ├─ IF affected_nodes ≤ 2 → Localized failure (retry safe)
   ├─ IF affected_nodes 3-5 → Moderate cascade (backoff retry)
   ├─ IF affected_nodes > 5 → System-wide cascade (escalate)
   └─ IF containment_boundary exists → Partial recovery possible

4. Recovery Strategy Selection
   ├─ IF transient_error AND retry_count < 3 → Exponential backoff
   ├─ IF configuration_error → Manual intervention required
   ├─ IF resource_exhaustion → Reduce scope and retry
   └─ IF external_dependency → Circuit breaker pattern

FAILURE MODES

1. Symptom Chasing (vs Root Cause Focus)

Detection Rule: If you're analyzing more than 3 failed nodes without identifying origin

Symptom: Investigating every failed node equally, getting lost in cascade effects
Diagnosis: Not following temporal order of failures, treating symptoms as causes
Fix: Always sort failures by timestamp first, identify the earliest failure as origin

2. Evidence Tunnel Vision (vs Holistic Analysis)

Detection Rule: If confidence score < 0.6 with single evidence type

Symptom: Jumping to conclusions based on error message alone
Diagnosis: Ignoring timing, resource usage, and historical pattern evidence
Fix: Require minimum 3 evidence types before reaching conclusion, weight evidence appropriately

3. Retry Death Spiral (vs Smart Recovery)

Detection Rule: If recommending retry for non-recoverable failure types

Symptom: Always suggesting exponential backoff regardless of failure type
Diagnosis: Not classifying recoverability correctly, missing permanent failures
Fix: Map failure types to recoverability: permission_denied=manual, resource_exhaustion=impossible, timeout=automatic

4. Cascade Amnesia (vs Propagation Awareness)

Detection Rule: If analyzing origin node without checking downstream impacts

Symptom: Providing fixes that ignore how failure spread through DAG
Diagnosis: Missing that "fixing" origin might not help already-failed dependents
Fix: Always trace full propagation path, include dependent recovery in remediation plan

5. Generic Remediation (vs Targeted Actions)

Detection Rule: If remediation contains only generic actions like "check logs"

Symptom: Providing vague, non-actionable remediation advice
Diagnosis: Not mapping specific failure types to specific remediation patterns
Fix: Use failure type → remediation lookup table, include specific implementation steps

WORKED EXAMPLES

Example 1: Timeout Cascade Analysis

Scenario: Code review DAG fails during security scan, cascades to 3 downstream nodes

Initial failure trace shows:
- check-security: timeout after 30000ms (10:34:30)
- aggregate-results: dependency failure (10:34:45) 
- generate-report: dependency failure (10:34:46)
- notify-team: dependency failure (10:34:47)

DECISION TREE NAVIGATION:
1. Severity: HIGH (4 nodes affected) → Continue analysis
2. Pattern match: /timeout after (\d+)ms/ → Check concurrent load
3. Concurrent nodes: 7 executing → Load balancing route
4. Affected nodes: 4 → Moderate cascade → Backoff retry

ROOT CAUSE ANALYSIS:
- Origin: check-security (earliest timestamp)
- Evidence: timeout + high concurrency + external service
- Confidence: 0.83 (error message=0.9, timing=0.7, concurrency=0.5)

REMEDIATION:
- Immediate: Increase timeout to 60s, reduce concurrent limit to 3
- Retry: Backoff strategy, max 2 retries
- Preventive: Add circuit breaker for security API

Example 2: Resource Exhaustion Recovery

Scenario: Large codebase analysis hits token limit during complexity analysis

Failure pattern:
- analyze-complexity: "token limit exceeded" (14:22:15)
- No propagation (isolated failure)

DECISION TREE NAVIGATION:
1. Severity: MEDIUM (1 node, non-critical path) → Continue
2. Pattern: /token limit exceeded/ → Resource exhaustion route
3. Token usage: 95% of limit → Reduce scope route
4. Affected nodes: 1 → Localized failure → Scope reduction

EVIDENCE GATHERING:
- Token usage: 9,500/10,000 (weight=0.9)
- Input size: 500 files (weight=0.7) 
- Historical pattern: 3 similar failures this month (weight=0.8)
- Confidence: 0.87

REMEDIATION:
- Immediate: Split analysis into 50-file chunks
- Retry: Not recommended until scope reduced
- Preventive: Implement auto-chunking for large repos

Example 3: Permission Denial Handling

Scenario: Code review fails when trying to write results to protected directory

Failure details:
- write-results: "permission denied: /protected/reports/" (09:15:30)
- generate-summary: dependency failure (09:15:31)

DECISION TREE NAVIGATION:
1. Severity: HIGH (blocks final output) → Continue
2. Pattern: /permission denied/ → Security audit route
3. Recoverability: MANUAL → Escalate immediately
4. No retry recommended

EVIDENCE & ESCALATION:
- Error: Clear permission issue (weight=1.0)
- Directory: /protected/reports/ (needs admin access)
- Impact: Complete workflow blocked
- Escalation: Required, route to DevOps team with context

HUMAN HANDOFF PACKAGE:
- What failed: File write permission to protected directory
- Impact: All reports blocked, affects 3 downstream teams
- Needs: Write permission to /protected/reports/ for service account
- Urgency: High (affects daily standup reports)

QUALITY GATES

[ ] Temporal analysis completed: Failures sorted by timestamp, origin identified
[ ] Error pattern classification: Failure type mapped to known categories with regex match
[ ] Evidence diversity check: Minimum 3 evidence types gathered (error + timing + context)
[ ] Confidence threshold met: Root cause analysis shows ≥0.6 confidence score
[ ] Propagation impact mapped: All affected downstream nodes identified and counted
[ ] Recoverability assessment: Failure classified as automatic/manual/impossible recovery
[ ] Remediation specificity: Actions include specific implementation steps, not just descriptions
[ ] Retry strategy aligned: Recommended retry approach matches failure type and recoverability
[ ] Escalation criteria checked: Manual/impossible failures have clear escalation path defined
[ ] Historical correlation: Similar past failures identified and referenced in analysis

NOT-FOR BOUNDARIES

Use dag-execution-tracer instead for:

Real-time execution monitoring
Performance bottleneck identification
Normal execution flow visualization
Trace collection and storage

Use dag-performance-profiler instead for:

Resource utilization optimization
Execution timing analysis
Performance regression detection
Capacity planning

Use dag-dynamic-replanner instead for:

Runtime DAG modification
Adaptive execution strategies
Load balancing decisions
Resource allocation changes

This skill handles ONLY:

Post-failure root cause analysis
Failure propagation investigation
Recovery strategy recommendation
Remediation action planning

curiositech/dag-failure-analyzer

skills/dag-failure-analyzer/SKILL.md

Performs root cause analysis on DAG execution failures. Traces failure propagation, identifies systemic issues, and generates actionable remediation guidance. Activate on 'failure analysis', 'root cause', 'why did it fail', 'debug failure', 'error investigation'. NOT for execution tracing (use dag-execution-tracer) or performance issues (use dag-performance-profiler).

development

Updated Apr 4, 2026

$ install --global

skillsauth

npx skillsauth add curiositech/windags-skills dag-failure-analyzer

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 4, 2026, 2:09 PM93.4s1 file scanned

SKILL.md

license:: BSL-1.1
name:: dag-failure-analyzer
description:: Performs root cause analysis on DAG execution failures. Traces failure propagation, identifies systemic issues, and generates actionable remediation guidance. Activate on 'failure analysis', 'root cause', 'why did it fail', 'debug failure', 'error investigation'. NOT for execution tracing (use dag-execution-tracer) or performance issues (use dag-performance-profiler).
category:: Agent & Orchestration
- skill:: dag-dynamic-replanner
reason:: Informs recovery strategies

DECISION POINTS

Primary Failure Analysis Route

1. Failure Severity Assessment
   ├─ IF critical (blocks all downstream execution)
   │  ├─ AND recoverability = automatic → Immediate retry with backoff
   │  └─ AND recoverability = manual → Escalate to human immediately
   └─ IF high/medium/low → Continue to propagation analysis

2. Failure Type Classification  
   ├─ IF error_pattern matches /timeout after (\d+)ms/
   │  ├─ AND duration > 2x normal → Resource exhaustion route
   │  └─ AND concurrent_nodes > 5 → Load balancing route
   ├─ IF error_pattern matches /permission denied/
   │  └─ → Security audit route (escalate)
   ├─ IF error_pattern matches /tool "(.+)" failed/
   │  ├─ AND tool_availability = false → Infrastructure route
   │  └─ AND tool_availability = true → Input validation route
   └─ IF error_pattern matches /external service error/
       ├─ AND service_status = down → Circuit breaker route
       └─ AND service_status = up → Rate limiting route

3. Propagation Impact Assessment
   ├─ IF affected_nodes ≤ 2 → Localized failure (retry safe)
   ├─ IF affected_nodes 3-5 → Moderate cascade (backoff retry)
   ├─ IF affected_nodes > 5 → System-wide cascade (escalate)
   └─ IF containment_boundary exists → Partial recovery possible

4. Recovery Strategy Selection
   ├─ IF transient_error AND retry_count < 3 → Exponential backoff
   ├─ IF configuration_error → Manual intervention required
   ├─ IF resource_exhaustion → Reduce scope and retry
   └─ IF external_dependency → Circuit breaker pattern

FAILURE MODES

1. Symptom Chasing (vs Root Cause Focus)

Detection Rule: If you're analyzing more than 3 failed nodes without identifying origin

Symptom: Investigating every failed node equally, getting lost in cascade effects
Diagnosis: Not following temporal order of failures, treating symptoms as causes
Fix: Always sort failures by timestamp first, identify the earliest failure as origin

2. Evidence Tunnel Vision (vs Holistic Analysis)

Detection Rule: If confidence score < 0.6 with single evidence type

Symptom: Jumping to conclusions based on error message alone
Diagnosis: Ignoring timing, resource usage, and historical pattern evidence
Fix: Require minimum 3 evidence types before reaching conclusion, weight evidence appropriately

3. Retry Death Spiral (vs Smart Recovery)

Detection Rule: If recommending retry for non-recoverable failure types

Symptom: Always suggesting exponential backoff regardless of failure type
Diagnosis: Not classifying recoverability correctly, missing permanent failures
Fix: Map failure types to recoverability: permission_denied=manual, resource_exhaustion=impossible, timeout=automatic

4. Cascade Amnesia (vs Propagation Awareness)

Detection Rule: If analyzing origin node without checking downstream impacts

Symptom: Providing fixes that ignore how failure spread through DAG
Diagnosis: Missing that "fixing" origin might not help already-failed dependents
Fix: Always trace full propagation path, include dependent recovery in remediation plan

5. Generic Remediation (vs Targeted Actions)

Detection Rule: If remediation contains only generic actions like "check logs"

Symptom: Providing vague, non-actionable remediation advice
Diagnosis: Not mapping specific failure types to specific remediation patterns
Fix: Use failure type → remediation lookup table, include specific implementation steps

WORKED EXAMPLES

Example 1: Timeout Cascade Analysis

Scenario: Code review DAG fails during security scan, cascades to 3 downstream nodes

Initial failure trace shows:
- check-security: timeout after 30000ms (10:34:30)
- aggregate-results: dependency failure (10:34:45) 
- generate-report: dependency failure (10:34:46)
- notify-team: dependency failure (10:34:47)

DECISION TREE NAVIGATION:
1. Severity: HIGH (4 nodes affected) → Continue analysis
2. Pattern match: /timeout after (\d+)ms/ → Check concurrent load
3. Concurrent nodes: 7 executing → Load balancing route
4. Affected nodes: 4 → Moderate cascade → Backoff retry

ROOT CAUSE ANALYSIS:
- Origin: check-security (earliest timestamp)
- Evidence: timeout + high concurrency + external service
- Confidence: 0.83 (error message=0.9, timing=0.7, concurrency=0.5)

REMEDIATION:
- Immediate: Increase timeout to 60s, reduce concurrent limit to 3
- Retry: Backoff strategy, max 2 retries
- Preventive: Add circuit breaker for security API

Example 2: Resource Exhaustion Recovery

Scenario: Large codebase analysis hits token limit during complexity analysis

Failure pattern:
- analyze-complexity: "token limit exceeded" (14:22:15)
- No propagation (isolated failure)

DECISION TREE NAVIGATION:
1. Severity: MEDIUM (1 node, non-critical path) → Continue
2. Pattern: /token limit exceeded/ → Resource exhaustion route
3. Token usage: 95% of limit → Reduce scope route
4. Affected nodes: 1 → Localized failure → Scope reduction

EVIDENCE GATHERING:
- Token usage: 9,500/10,000 (weight=0.9)
- Input size: 500 files (weight=0.7) 
- Historical pattern: 3 similar failures this month (weight=0.8)
- Confidence: 0.87

REMEDIATION:
- Immediate: Split analysis into 50-file chunks
- Retry: Not recommended until scope reduced
- Preventive: Implement auto-chunking for large repos

Example 3: Permission Denial Handling

Scenario: Code review fails when trying to write results to protected directory

Failure details:
- write-results: "permission denied: /protected/reports/" (09:15:30)
- generate-summary: dependency failure (09:15:31)

DECISION TREE NAVIGATION:
1. Severity: HIGH (blocks final output) → Continue
2. Pattern: /permission denied/ → Security audit route
3. Recoverability: MANUAL → Escalate immediately
4. No retry recommended

EVIDENCE & ESCALATION:
- Error: Clear permission issue (weight=1.0)
- Directory: /protected/reports/ (needs admin access)
- Impact: Complete workflow blocked
- Escalation: Required, route to DevOps team with context

HUMAN HANDOFF PACKAGE:
- What failed: File write permission to protected directory
- Impact: All reports blocked, affects 3 downstream teams
- Needs: Write permission to /protected/reports/ for service account
- Urgency: High (affects daily standup reports)

QUALITY GATES

[ ] Temporal analysis completed: Failures sorted by timestamp, origin identified
[ ] Error pattern classification: Failure type mapped to known categories with regex match
[ ] Evidence diversity check: Minimum 3 evidence types gathered (error + timing + context)
[ ] Confidence threshold met: Root cause analysis shows ≥0.6 confidence score
[ ] Propagation impact mapped: All affected downstream nodes identified and counted
[ ] Recoverability assessment: Failure classified as automatic/manual/impossible recovery
[ ] Remediation specificity: Actions include specific implementation steps, not just descriptions
[ ] Retry strategy aligned: Recommended retry approach matches failure type and recoverability
[ ] Escalation criteria checked: Manual/impossible failures have clear escalation path defined
[ ] Historical correlation: Similar past failures identified and referenced in analysis

NOT-FOR BOUNDARIES

Use dag-execution-tracer instead for:

Real-time execution monitoring
Performance bottleneck identification
Normal execution flow visualization
Trace collection and storage

Use dag-performance-profiler instead for:

Resource utilization optimization
Execution timing analysis
Performance regression detection
Capacity planning

Use dag-dynamic-replanner instead for:

Runtime DAG modification
Adaptive execution strategies
Load balancing decisions
Resource allocation changes

This skill handles ONLY:

Post-failure root cause analysis
Failure propagation investigation
Recovery strategy recommendation
Remediation action planning

Related Skills

curiositech/revisiting-interview-data-analysing-turn

data-ai

VerifiedTrustedCommunity

license: Apache-2.0 NOT for unrelated tasks outside this domain.

8SKILL.mdUpdated Jul 19, 2026

curiositech/revisiting-interview-data-analysing-turn

curiositech/redis-patterns-expert

development

VerifiedTrustedCommunity

Use when designing caching strategies (cache-aside, write-through, write-behind), implementing distributed locks, building rate limiters, leaderboards, real-time streams (XADD/consumer groups), pub/sub, or tuning eviction policies. Triggers: thundering-herd on cache miss, dogpile on key expiry, Redlock vs SET-NX-PX choice, sliding-window rate limiter, hot-key on a single cluster slot, big-key blowup, MULTI/EXEC across slots, KEYS in production. NOT for Redis Cluster operations/admin (different domain), embedded KV (SQLite, leveldb), in-process LRU caches, or Memcached.

8SKILL.mdUpdated Jul 19, 2026

curiositech/redis-patterns-expert

curiositech/react-server-components-boundary

tools

VerifiedTrustedCommunity

Drawing the `'use client'` boundary correctly in React Server Components apps (Next.js App Router, RSC frameworks) — leaf-pushing, slot composition, serialization rules, and environment poisoning prevention. Grounded in react.dev and Next.js 16 docs.

8SKILL.mdUpdated Jul 19, 2026

curiositech/react-server-components-boundary

curiositech/rate-limiting-strategy

development

VerifiedTrustedCommunity

Use when designing rate limiting for an API, choosing between token bucket / sliding window / leaky bucket / fixed window, implementing it in Redis, deciding edge (Cloudflare/Upstash) vs origin enforcement, sizing per-user vs per-IP vs per-endpoint quotas, returning the right 429 response with Retry-After, or fixing the boundary-burst bug in fixed-window limiters. Triggers: 429 too many requests, INCR + EXPIRE, ZADD + ZREMRANGEBYSCORE + ZCARD, X-RateLimit-Remaining header, Cloudflare WAF rate limiting rules, Upstash @upstash/ratelimit, leaky bucket shaping vs policing, distributed rate limiter consistency. NOT for DDoS mitigation specifically (different scale), CAPTCHA / bot management, full WAF design, or per-user quota billing.

8SKILL.mdUpdated Jul 19, 2026

curiositech/rate-limiting-strategy

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/curiositech/windags-skills.git

# Copy into Claude Code skills folder (global)
cp -r windags-skills/skills/dag-failure-analyzer ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

curiositech/windags-skills

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT