skills/windags-premortem/SKILL.md
Failure pattern scanning and risk assessment for WinDAGs DAGs. Runs after decomposition but before execution. Performs a lightweight scan on EVERY DAG (BC-PLAN-004), escalating to deep analysis when recognition confidence is low or known failure patterns are found. Activate on "premortem", "failure scan", "risk assessment", "DAG validation", "timing analysis", "failure patterns", "pre-execution check". NOT for post-execution learning (use windags-curator), retrospective analysis (use windags-looking-back), or DAG construction (use windags-architect).
npx skillsauth add curiositech/windags-skills windags-premortem
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Scan every DAG for failure patterns before execution begins. Run the lightweight scan unconditionally. Escalate to deep analysis when warranted. Produce a PreMortemResult that the Executor uses to decide whether to proceed, monitor, or halt.
Model Tier: Tier 1 (Haiku-class)
Behavioral Contract: BC-PLAN-004
Use this skill when:
Do NOT use for:
- Retrospective analysis (use windags-looking-back)
- Post-execution learning (use windags-curator)
- […] (use windags-mutator)

The PreMortem lightweight scan runs on EVERY DAG, including trivial depth-1 DAGs.
No exceptions. No "this DAG is too simple to scan." A depth-1 DAG with a single node still gets the lightweight scan. The cost is negligible (Tier 1 model, structured checks). The risk of skipping is not.
flowchart TD
DAG[DAG received from Decomposer] --> LW[Lightweight Scan]
LW --> RC{Recognition confidence >= 0.7?}
RC -->|Yes| FP{Failure patterns found?}
RC -->|No| DEEP[Deep Scan]
FP -->|None| PROCEED[Recommend: PROCEED]
FP -->|Found| DEEP
DEEP --> SEV{Severity assessment}
SEV -->|Low| MONITOR[Recommend: ACCEPT_WITH_MONITORING]
SEV -->|Medium| MONITOR
SEV -->|High| ESCALATE[Recommend: ESCALATE_TO_HUMAN]
Execute these checks in order. Each check is a structured pattern match, not a generative task.
Cascade Depth Check: Walk the DAG and measure the longest path. Flag if depth > 3 with no isolation boundaries between failure domains.
Shared Failure Domain Check: For each parallel batch (wave), identify nodes that share a failure domain (same model provider, same API, same file system resource). Flag if any batch has > 50% of nodes in the same failure domain.
Single Point of Failure Check: Identify any node where count(dependents) >= 3. That node's failure cascades to 3+ downstream nodes. Flag it.
Resource Contention Check: Identify nodes in the same wave that require the same expensive resource (GPU, large model, exclusive file lock, rate-limited API). Flag contention.
Timing Risk Check: Estimate per-node duration using historical data or heuristics. Identify nodes on the critical path where a 2x slowdown would push total execution past the user's time expectation.
Known Pattern Match: Compare the DAG topology against the failure pattern library (see Failure Pattern Categories below). Log any matches with confidence scores.
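Three of the structural checks above can be sketched as plain graph functions. This is a minimal illustration, not the skill's implementation: the DAG representation (a dict mapping each node id to the node ids it depends on), the `domain_of` mapping, and all function names are assumptions made for the example. Depth is counted in nodes, so a single-node DAG has depth 1.

```python
from collections import defaultdict

# Hypothetical DAG representation: node id -> list of node ids it depends on.
Dag = dict[str, list[str]]


def longest_path_depth(dag: Dag) -> int:
    """Return the number of nodes on the longest dependency chain."""
    memo: dict[str, int] = {}

    def depth(node: str) -> int:
        if node not in memo:
            memo[node] = 1 + max((depth(dep) for dep in dag[node]), default=0)
        return memo[node]

    return max((depth(node) for node in dag), default=0)


def single_points_of_failure(dag: Dag, threshold: int = 3) -> list[str]:
    """Return nodes that at least `threshold` other nodes depend on directly."""
    dependents: dict[str, int] = defaultdict(int)
    for deps in dag.values():
        for dep in deps:
            dependents[dep] += 1
    return sorted(n for n, count in dependents.items() if count >= threshold)


def crowded_failure_domains(wave: list[str], domain_of: dict[str, str]) -> list[str]:
    """Return failure domains that hold more than 50% of the nodes in a wave."""
    counts: dict[str, int] = defaultdict(int)
    for node in wave:
        counts[domain_of[node]] += 1
    return sorted(d for d, c in counts.items() if c / len(wave) > 0.5)


# Illustrative DAG: three nodes fan out from "fetch" and rejoin at "report".
example_dag: Dag = {
    "fetch": [],
    "parse": ["fetch"],
    "summarize": ["fetch"],
    "index": ["fetch"],
    "report": ["parse", "summarize", "index"],
}
example_domains = {"parse": "provider_a", "summarize": "provider_a", "index": "local_fs"}
```

With this example, `fetch` is flagged as a single point of failure, and `provider_a` is flagged for the middle wave because two of its three nodes share that domain.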
Trigger deep scan when recognition confidence is below 0.7 or the lightweight scan finds a known failure pattern.
Deep scan adds:
Dependency Chain Analysis: Trace every path from root to leaf. Score each path for fragility: fragility = (path_length * max_fan_out) / isolation_boundaries. Flag paths with fragility > 5.0.
Failure Propagation Simulation: For each flagged node, estimate the blast radius (how many downstream nodes fail if this node fails). Rank nodes by blast radius.
Resource Budget Projection: Sum estimated costs (tokens, API calls, time) across all nodes. Compare against the user's budget constraints. Flag if projected cost exceeds 80% of budget.
Alternative Topology Suggestions: If failure patterns are severe, propose specific mitigations:
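The fragility score and blast-radius ranking from the deep scan can be sketched as follows. This is an illustration under stated assumptions: the function names and the `dependents` mapping (node id to its direct dependents) are hypothetical, blast radius is taken to mean all transitive dependents, and the divisor is clamped to 1 because the formula as written is undefined for a path with zero isolation boundaries.

```python
def fragility(path_length: int, max_fan_out: int, isolation_boundaries: int) -> float:
    """fragility = (path_length * max_fan_out) / isolation_boundaries."""
    # Assumption: clamp the divisor so a path with no boundaries does not
    # divide by zero. The source does not say how that case is handled.
    return (path_length * max_fan_out) / max(isolation_boundaries, 1)


def blast_radius(dependents: dict[str, list[str]], node: str) -> int:
    """Count every node downstream of `node` (transitive dependents)."""
    seen: set[str] = set()
    stack = list(dependents.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(dependents.get(current, []))
    return len(seen)


def rank_by_blast_radius(dependents: dict[str, list[str]], flagged: list[str]) -> list[str]:
    """Order flagged nodes from largest to smallest blast radius."""
    return sorted(flagged, key=lambda n: blast_radius(dependents, n), reverse=True)


# Illustrative diamond: "a" feeds "b" and "c", which both feed "d".
example_dependents = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
```

A path of length 4 with a maximum fan-out of 3 and 2 isolation boundaries scores 6.0, which exceeds the 5.0 threshold and would be flagged.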
Maintain and match against these five categories.
Pattern: Linear chain of depth > 3 with no isolation boundaries. Risk: A single early failure wastes all downstream computation. Mitigation: Insert checkpoint gates. Move high-risk nodes earlier. Add fallback paths.
Pattern: Multiple nodes in the same wave depend on the same external resource (API provider, model endpoint, file system). Risk: One provider outage takes down the entire wave. Mitigation: Distribute nodes across failure domains. Stagger API calls. Add circuit breakers per domain.
Pattern: A node with fan-out >= 3 (three or more nodes depend on it). Risk: This node's failure cascades broadly. Mitigation: Add retry logic with backoff. Consider running the critical node with a more reliable (higher-tier) model. Add a fallback skill.
Pattern: Multiple nodes in the same wave compete for the same scarce resource. Risk: Serialization, timeouts, or resource exhaustion. Mitigation: Stagger execution within the wave. Reduce parallelism for contended resources. Pre-allocate resources.
Pattern: A slow node sits on the critical path with tight-deadline dependents downstream. Risk: Delay cascades and the user's time expectation is violated. Mitigation: Estimate critical path duration. Flag if critical path > 80% of user's time expectation. Consider parallel alternatives or faster model tiers for bottleneck nodes.
Perform timing analysis on every DAG (lightweight level). This addresses the Chef's concern from the Constitutional Convention: users need realistic time expectations before execution begins.
Assign each node an estimated duration:
Compute the critical path using longest-path algorithm on the DAG.
Compute total wall-clock estimate: critical_path_duration + (wave_count * wave_overhead).
For each node on the critical path, compute cascade_impact:
cascade_impact = (node_duration / critical_path_duration) * count(downstream_nodes)
Flag nodes where cascade_impact > 0.3 -- these are the nodes where a delay hurts the most.
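The critical path, wall-clock estimate, and cascade_impact calculations can be sketched as below. The data shapes (`deps` mapping each node to its dependencies, `duration` mapping each node to seconds) and the function names are assumptions for illustration, and count(downstream_nodes) is interpreted here as the number of transitive dependents. As written, the formula is not capped at 1.0, so a long node with many dependents can score higher.

```python
def critical_path(deps: dict[str, list[str]], duration: dict[str, float]) -> tuple[float, list[str]]:
    """Return (total duration, node ids) for the longest-duration path."""
    best: dict[str, tuple[float, list[str]]] = {}

    def finish(node: str) -> tuple[float, list[str]]:
        if node not in best:
            prior = max((finish(d) for d in deps[node]), key=lambda r: r[0], default=(0.0, []))
            best[node] = (prior[0] + duration[node], prior[1] + [node])
        return best[node]

    return max((finish(n) for n in deps), key=lambda r: r[0], default=(0.0, []))


def downstream_count(deps: dict[str, list[str]], node: str) -> int:
    """Count transitive dependents of `node` (an assumed reading of the formula)."""
    dependents: dict[str, list[str]] = {n: [] for n in deps}
    for n, ds in deps.items():
        for d in ds:
            dependents[d].append(n)
    seen: set[str] = set()
    stack = list(dependents[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(dependents[current])
    return len(seen)


def cascade_impacts(deps: dict[str, list[str]], duration: dict[str, float]) -> dict[str, float]:
    """cascade_impact = (node_duration / critical_path_duration) * count(downstream_nodes)."""
    total, path = critical_path(deps, duration)
    if total == 0:
        return {}
    return {n: (duration[n] / total) * downstream_count(deps, n) for n in path}


def wall_clock_estimate(critical_path_duration: float, wave_count: int, wave_overhead: float) -> float:
    """critical_path_duration + (wave_count * wave_overhead)."""
    return critical_path_duration + wave_count * wave_overhead


# Illustrative DAG with made-up durations in seconds.
example_deps = {"fetch": [], "parse": ["fetch"], "lint": ["fetch"], "report": ["parse", "lint"]}
example_duration = {"fetch": 10.0, "parse": 30.0, "lint": 5.0, "report": 10.0}
flagged = sorted(n for n, v in cascade_impacts(example_deps, example_duration).items() if v > 0.3)
```

In this example the critical path is fetch, parse, report at 50 seconds. Both `fetch` (short, but three nodes wait on it) and `parse` (long, one dependent) score about 0.6 and are flagged.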
Compare estimated total duration against user expectation (if provided). Report one of:
- WITHIN_BUDGET: Estimate < 80% of user expectation
- TIGHT: Estimate is 80-100% of user expectation
- OVER_BUDGET: Estimate > user expectation (flag specific bottleneck nodes)

Produce a PreMortemResult with these fields:
PreMortemResult:
  scan_level: "lightweight" | "deep"
  failure_patterns_found:
    - pattern: string                 # Category name
      severity: "low" | "medium" | "high"
      affected_nodes: [NodeId]
      description: string             # Human-readable explanation
      mitigation: string              # Suggested fix
  timing_analysis:
    critical_path_nodes: [NodeId]
    estimated_duration_seconds: number
    delay_cascade_nodes:
      - node_id: NodeId
        cascade_impact: number        # 0.0 to 1.0
    time_budget_status: "WITHIN_BUDGET" | "TIGHT" | "OVER_BUDGET"
  resource_analysis:
    contention_points:
      - resource: string
        competing_nodes: [NodeId]
        wave: number
    projected_cost: number            # Estimated total cost in dollars
    budget_utilization: number        # 0.0 to 1.0
  recommendation: "PROCEED" | "ACCEPT_WITH_MONITORING" | "ESCALATE_TO_HUMAN"
  recommendation_rationale: string
flowchart TD
START[Scan complete] --> HIGH{Any HIGH severity patterns?}
HIGH -->|Yes| ESC[ESCALATE_TO_HUMAN]
HIGH -->|No| MED{Any MEDIUM severity patterns?}
MED -->|Yes| MON[ACCEPT_WITH_MONITORING]
MED -->|No| OVER{Time OVER_BUDGET?}
OVER -->|Yes| MON
OVER -->|No| LOW{Any LOW severity patterns?}
LOW -->|Yes| PRO_MON[PROCEED with notes]
LOW -->|No| PRO[PROCEED]
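The decision flow above, together with the time budget thresholds, can be written as two small functions. This is a sketch with hypothetical function names. Because the result schema defines only three recommendation values, "PROCEED with notes" is represented here as PROCEED plus a rationale string, which is an assumption about how the notes are carried.

```python
from typing import Optional


def time_budget_status(estimate_s: float, expectation_s: float) -> str:
    """Classify the duration estimate against the user's time expectation."""
    if estimate_s > expectation_s:
        return "OVER_BUDGET"
    if estimate_s >= 0.8 * expectation_s:
        return "TIGHT"
    return "WITHIN_BUDGET"


def recommend(severities: list[str], budget_status: Optional[str] = None) -> tuple[str, str]:
    """Walk the decision flow top to bottom; return (recommendation, rationale)."""
    if "high" in severities:
        return "ESCALATE_TO_HUMAN", "high-severity pattern found"
    if "medium" in severities:
        return "ACCEPT_WITH_MONITORING", "medium-severity pattern found"
    if budget_status == "OVER_BUDGET":
        return "ACCEPT_WITH_MONITORING", "estimated duration exceeds time expectation"
    if "low" in severities:
        return "PROCEED", "low-severity patterns noted"
    return "PROCEED", "no failure patterns found"
```

Passing `budget_status=None` covers the case where the user gave no time expectation, so the budget branch is simply skipped.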
The PreMortem sits between the Decomposer and the Executor in the meta-DAG pipeline:
flowchart LR
SM[Sensemaker] --> DC[Decomposer]
DC --> PM[PreMortem]
PM -->|PROCEED| EX[Executor]
PM -->|ACCEPT_WITH_MONITORING| EX
PM -->|ESCALATE_TO_HUMAN| HG[Human Gate]
HG -->|Approved| EX
HG -->|Rejected| DC
When the PreMortem recommends ESCALATE_TO_HUMAN, the Executor pauses and presents:
| Operation | Target |
|-----------|--------|
| Lightweight scan | < 2s for DAGs with <= 20 nodes |
| Deep scan | < 8s for DAGs with <= 20 nodes |
| Failure pattern match | < 200ms per pattern category |
| Timing estimation | < 500ms |
| Total PreMortem overhead | < 3% of total execution cost |
The PreMortem must never become a bottleneck. If the scan itself takes longer than 10% of the estimated DAG execution time, truncate to lightweight-only and note the truncation in the result.
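The truncation rule can be expressed as a small guard. This is a sketch only: the function name and the idea of passing in the scan time as a number are assumptions, and the source does not specify whether the check uses elapsed or projected scan time.

```python
def effective_scan_level(requested: str, scan_seconds: float, estimated_dag_seconds: float) -> tuple[str, bool]:
    """Return (scan_level, truncated) after applying the 10% overhead rule."""
    over_limit = scan_seconds > 0.10 * estimated_dag_seconds
    if requested == "deep" and over_limit:
        # Fall back to lightweight-only; the caller should record the
        # truncation in the PreMortemResult.
        return "lightweight", True
    return requested, False
```

For a DAG estimated at 60 seconds, a deep scan costing 8 seconds exceeds the 6-second limit and is truncated, while one costing 5 seconds is not.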
Maintain a persistent library of failure patterns observed across executions. The Curator updates this library post-execution. The PreMortem reads it pre-execution.
Each pattern entry contains:
- pattern_id: Unique identifier
- category: One of the five categories above
- topology_signature: Graph structure that triggers the pattern
- frequency: How often this pattern has been observed
- severity_distribution: Historical severity outcomes
- effective_mitigations: Mitigations that worked in the past

The library starts with the five built-in categories and grows through execution experience. This is part of the learning loop (Principle 10).