skills/windags-evaluator/SKILL.md
Two-stage review engine with four-layer quality model for the WinDAGs meta-DAG. Receives completed node outputs and produces ReviewResult containing QualityVector. Stage 1 (Haiku) checks Floor + Wall on every node. Stage 2 (Sonnet) runs Ceiling evaluation conditionally using economic escalation formula. Enforces BC-EVAL-001 through BC-EVAL-006. Activate when operating as the Evaluator role in the meta-DAG, when reviewing node outputs, when computing quality vectors, or when deciding Stage 2 escalation.
npx skillsauth add curiositech/windags-skills windags-evaluatorInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Receive completed node outputs. Produce a ReviewResult containing a QualityVector. Enforce the four-layer quality model as a runtime protocol gate. Run Stage 1 on every node. Escalate to Stage 2 only when the economic formula justifies it.
1. Node output received
├─ Floor check passes?
│ ├─ YES → Continue to Wall check
│ └─ NO → Return floor_failed ReviewResult (skip Wall, Ceiling, Envelope)
│
├─ Wall check passes?
│ ├─ YES → Evaluate Stage 2 escalation
│ └─ NO → Skip Ceiling, compute Envelope only
│
└─ Stage 2 escalation needed?
├─ YES (feeds human gate) → Run Stage 2 Ceiling
├─ YES (final deliverable) → Run Stage 2 Ceiling
├─ YES (irreversible) → Run Stage 2 Ceiling
├─ YES (P(fail) * waste > cost) → Run Stage 2 Ceiling
└─ NO → Stage 1 only, compute Envelope
Wall grade is 0.4 (partial pass):
├─ Grade >= 0.6? → Treat as PASS, continue to Stage 2 decision
├─ Grade < 0.3? → Treat as FAIL, skip Ceiling
└─ Grade 0.3-0.6? → Check downstream criticality:
├─ Node feeds critical path? → Treat as FAIL
└─ Node feeds non-critical? → Treat as PASS with warning
Peer=0.8, Downstream=0.3 (high disagreement):
├─ Difference > 0.4? → Flag for human review
├─ Downstream consumer is critical? → Weight Downstream 0.5, Peer 0.2
├─ Peer has domain expertise? → Weight Peer 0.4, Downstream 0.25
├─ Historical Peer reliability < 0.7? → Trust Downstream more
└─ Default → Use standard weights: Peer 0.25, Downstream 0.35
Cost-benefit calculation:
├─ downstreamWaste = sum(dependent_node_costs) * cascade_probability
├─ failureProbability = skill_failure_rate * stage_factor * breaker_penalty
├─ expectedWaste = failureProbability * downstreamWaste
└─ escalate = expectedWaste > stage2_cost
Symptoms: ReviewResult returned without envelope field populated Diagnosis: Floor check short-circuited the evaluation flow improperly Fix: Envelope computation is mandatory for ALL nodes regardless of Floor result; run envelope calculation before returning any ReviewResult
Symptoms: All evaluations scoring 0.9+ with minimal variation across different node types Diagnosis: Evaluator not applying layer-specific criteria; treating all outputs as equally good Fix: Implement per-layer scoring rubrics; require justification for scores above 0.8; cross-check with historical distributions
Symptoms: >60% of nodes escalating to Stage 2 when historical rate should be 15-20% Diagnosis: Economic formula over-weighting failure probability or downstream costs Fix: Recalibrate failure probability floor to 0.01; cap downstream waste calculation at 5x node cost; audit cascade_probability multiplier
Symptoms: Position-swapped evaluations showing >0.3 variance consistently
Diagnosis: Neural evaluators not properly applying bias mitigation techniques
Fix: Enforce position swapping on ALL pairwise comparisons; apply length normalization; flag and retry when position bias detected
Symptoms: Multi-dimensional quality getting converted to single scalar scores in downstream processing Diagnosis: Consuming systems incorrectly aggregating QualityVector dimensions Fix: Audit all QualityVector consumers; enforce per-dimension access patterns; remove any automatic scalar conversion utilities
Scenario: Code generation node produces syntactically invalid Python
Input: "Generate a Flask route handler"
Output: "def handle_route(
return jsonify({'status': 'ok')" // Missing closing paren, incomplete
Stage 1 Floor Check:
1. Parse against Python AST → FAIL (SyntaxError)
2. Schema validation → FAIL (incomplete function)
3. Contract check → FAIL (not valid Python)
Decision: Floor failed → Skip Wall, Ceiling, Envelope
Action: Return ReviewResult{stage1: {passed: false, floor: {passed: false, violations: ["SyntaxError", "IncompleteFunction"]}}, envelope: null}
Expert catches: Floor failure should immediately terminate evaluation - don't waste compute on Wall/Ceiling for fundamentally broken output Novice misses: Trying to evaluate "style" or "approach" of syntactically invalid code
Scenario: API documentation generation for internal service
Node: doc-generator (skill success_rate: 0.85, novice stage)
Downstream: 3 dependent nodes worth $0.12 total cost
Stage 2 review cost: $0.045
Calculation:
- failureProbability = (1 - 0.85) * 1.5 * 1.0 = 0.225
- downstreamWaste = $0.12 * 0.6 = $0.072
- expectedWaste = 0.225 * $0.072 = $0.016
- escalate? $0.016 < $0.045 → NO
Decision: Stage 1 only
Reasoning: Low downstream impact, novice penalty not enough to justify deep review
Expert catches: Economic formula must account for cascade probability (0.6) not just raw downstream costs Novice misses: Forgetting to apply stage factor for novice skills, leading to under-escalation
Scenario: SQL query optimization node evaluated by peer optimizer and downstream execution engine
Peer evaluation: 0.85 ("Elegant use of window functions, good indexing strategy")
Downstream evaluation: 0.35 ("Query times out on large datasets, missing LIMIT clause")
Resolution Decision Tree:
1. Difference = |0.85 - 0.35| = 0.5 > 0.4 → Flag for human review
2. Downstream is execution engine (critical for performance) → Weight downstream higher
3. Check peer historical reliability: 0.82 (good)
4. Final weights: Peer 0.15, Downstream 0.45
Composite Score: (0.15 * 0.85) + (0.45 * 0.35) = 0.285
Action: Mark as failing Wall check due to performance issues despite elegant structure
Expert catches: Performance trumps elegance; downstream execution failures are more costly than aesthetic suboptimality Novice misses: Over-weighting peer review because it "sounds more technical"; ignoring real-world execution constraints
Do not use windags-evaluator for:
Delegate to other skills when:
tools
Building resilient distributed systems with circuit breakers, retries with full-jitter exponential backoff, retry budgets (per-request 3-attempt + per-client 10% ratio per Google SRE), deadline propagation, and the cascading-failure math (4 layers × 3 retries = 64x amplification). Grounded in Resilience4j, Microsoft Cloud Patterns, AWS Architecture Blog (Marc Brooker), and Google SRE Book.
testing
Designing HTTP cache headers that work correctly across browsers, CDNs, and shared proxies — `Cache-Control` directives per RFC 9111, `stale-while-revalidate` and `stale-if-error` per RFC 5861, the Vary header for varying responses, and surrogate keys for tag-based purging. Grounded in IETF RFCs and Cloudflare/Fastly docs.
development
Use when designing or fixing a Content Security Policy on a real site, choosing between nonce-based and hash-based CSP, adding strict-dynamic, debugging "Refused to execute inline script" errors, deploying CSP in report-only mode first, configuring report-to / report-uri, or auditing an existing policy for unsafe-inline / unsafe-eval / wildcards. Triggers: "CSP blocks legitimate inline script", strict-dynamic, nonce-{RANDOM}, sha256-{HASH}, object-src none, base-uri none, frame-ancestors, Trusted Types, X-Content-Security-Policy obsolete, report-only vs enforced. NOT for general HTTP security headers (HSTS, COOP/COEP), Trusted Types deep dive, CORS configuration, or building a WAF.
tools
Choosing and operating an HTTP API versioning strategy that doesn't break clients — Stripe's date-based pinned versions, the Deprecation/Sunset header pair (RFC 9745 + RFC 8594), URI vs header vs media-type approaches, and the version-transformer pattern. Grounded in Stripe's published architecture and IETF RFCs.