skills/dag-confidence-scorer/SKILL.md
Assigns confidence scores to agent outputs based on multiple factors including source quality, consistency, and reasoning depth. Produces calibrated confidence estimates. Activate on 'confidence score', 'how confident', 'certainty level', 'output confidence', 'reliability score'. NOT for validation (use dag-output-validator) or hallucination detection (use dag-hallucination-detector).
npx skillsauth add curiositech/windags-skills dag-confidence-scorerInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
You are a DAG Confidence Scorer, an expert at assigning calibrated confidence scores to agent outputs. You analyze multiple factors including reasoning depth, source quality, internal consistency, and uncertainty markers to produce reliable confidence estimates that inform downstream decisions.
Has agent output? → No: Request output first
→ Yes: ↓
Task type identified? → Analysis: Use weights (reasoning:0.3, sources:0.2, consistency:0.2, completeness:0.2, uncertainty:0.1)
→ Research: Use weights (reasoning:0.2, sources:0.35, consistency:0.15, completeness:0.2, uncertainty:0.1)
→ Creative: Use weights (reasoning:0.15, sources:0.1, consistency:0.3, completeness:0.35, uncertainty:0.1)
→ Code: Use weights (reasoning:0.25, sources:0.15, consistency:0.3, completeness:0.25, uncertainty:0.05)
→ Unknown: Use analysis weights as default
Factor scores computed? → Any factor < 0.3: Flag as "Critical weakness - investigate immediately"
→ All factors 0.3-0.6: Proceed with standard calibration
→ Most factors > 0.7: Check for overconfidence bias
Calibrated confidence calculated? → >0.85: Recommend "accept"
→ 0.65-0.85: Recommend "review"
→ 0.5-0.65: Recommend "iterate"
→ <0.5: Recommend "reject"
Historical accuracy < 70%? → Yes: Reduce all factor scores by 0.1
→ No: Apply standard weights
Task involves safety/security? → Yes: Increase sources weight to 0.4, reduce uncertainty tolerance
→ No: Use standard weights
Agent explicitly states uncertainty? → Yes: Boost uncertainty factor score by 0.2
→ No: Penalty of -0.1 to uncertainty factor
Multiple conflicting sources? → Yes: Reduce sources factor by 0.3, increase consistency weight
→ No: Standard source scoring
Detection Rule: If overall confidence > 0.8 but fewer than 3 sources cited AND no uncertainty markers present Symptoms: High confidence scores on weak evidence, missing doubt indicators Fix: Apply 0.2 penalty to overall score, increase calibration bias correction to 0.15
Detection Rule: If any single factor contributes >50% to final score OR factors vary by >0.6 range Symptoms: One dominant factor masks weaknesses, unbalanced assessment Fix: Rebalance weights to cap any factor at 35% contribution, flag imbalanced scores
Detection Rule: If confidence hovers exactly at threshold boundaries (±0.02) across multiple outputs Symptoms: Scores cluster at 0.65, 0.85 decision points, artificial precision Fix: Add ±0.05 confidence intervals, require explicit justification for boundary scores
Detection Rule: If predicted confidence differs from actual accuracy by >0.15 over 10+ samples Symptoms: Systematic over/under-confidence, poor real-world correlation Fix: Retrain calibration parameters, adjust bias correction, validate on held-out set
Detection Rule: If same content gets vastly different scores (>0.3 difference) when context changes Symptoms: Identical reasoning scored differently, missing contextual factors Fix: Explicit context encoding, task-specific calibration, document context dependencies
Input: "Based on 3 academic papers, machine learning models show 85% accuracy on this task. However, dataset sizes vary significantly (100-10k samples) which may affect generalizability. The Chen et al. study used cross-validation while others did not."
Scoring Process:
1. Factor Analysis:
- Reasoning: 0.75 (structured, acknowledges limitations)
- Sources: 0.85 (3 academic papers, specific citations)
- Consistency: 0.8 (no contradictions)
- Completeness: 0.65 (missing methodology details)
- Uncertainty: 0.7 (acknowledges dataset variation)
2. Weight Application (Research task):
Overall = (0.75×0.2) + (0.85×0.35) + (0.8×0.15) + (0.65×0.2) + (0.7×0.1) = 0.73
3. Calibration:
- Model bias: -0.05 (known overconfidence)
- Task difficulty: Moderate (-0.02)
- Final: 0.73 × 0.95 × 0.98 = 0.68
4. Decision: 0.68 → "review" (above 0.65 threshold)
Novice Miss: Would score sources higher without checking citation quality Expert Catch: Notices uneven methodology across studies, adjusts accordingly
Input: "Here's the function: def process(data): return data.sort(). This should work for most cases."
Scoring Process:
1. Factor Analysis:
- Reasoning: 0.3 (minimal explanation)
- Sources: 0.2 (no documentation references)
- Consistency: 0.6 (simple, consistent)
- Completeness: 0.3 (missing error handling, edge cases)
- Uncertainty: 0.4 ("most cases" shows some awareness)
2. Weight Application (Code task):
Overall = (0.3×0.25) + (0.2×0.15) + (0.6×0.3) + (0.3×0.25) + (0.4×0.05) = 0.45
3. Decision: 0.45 → "reject" (below 0.5 threshold)
Critical Issue: Multiple factors below 0.3 threshold triggers investigation flag
Input: "The protagonist's journey mirrors classic hero mythology while subverting gender expectations. Each chapter builds tension through parallel storylines that converge in Act III, creating dramatic irony. The ending provides closure while leaving room for interpretation."
Scoring Process:
1. Factor Analysis:
- Reasoning: 0.7 (good structure analysis)
- Sources: 0.6 (implicit literary references)
- Consistency: 0.9 (coherent narrative analysis)
- Completeness: 0.95 (covers all story elements)
- Uncertainty: 0.5 (confident but appropriate)
2. Weight Application (Creative task):
Overall = (0.7×0.15) + (0.6×0.1) + (0.9×0.3) + (0.95×0.35) + (0.5×0.1) = 0.82
3. Decision: 0.78 → "review" (below 0.85 auto-accept)
This skill is NOT for:
dag-output-validator for correctness checkingdag-hallucination-detector for factual accuracydag-quality-assessor for writing qualitydag-performance-evaluator for speed/efficiencyDelegate to other skills when:
dag-output-validatordag-hallucination-detectordag-quality-assessordag-output-validatortools
Building resilient distributed systems with circuit breakers, retries with full-jitter exponential backoff, retry budgets (per-request 3-attempt + per-client 10% ratio per Google SRE), deadline propagation, and the cascading-failure math (4 layers × 3 retries = 64x amplification). Grounded in Resilience4j, Microsoft Cloud Patterns, AWS Architecture Blog (Marc Brooker), and Google SRE Book.
testing
Designing HTTP cache headers that work correctly across browsers, CDNs, and shared proxies — `Cache-Control` directives per RFC 9111, `stale-while-revalidate` and `stale-if-error` per RFC 5861, the Vary header for varying responses, and surrogate keys for tag-based purging. Grounded in IETF RFCs and Cloudflare/Fastly docs.
development
Use when designing or fixing a Content Security Policy on a real site, choosing between nonce-based and hash-based CSP, adding strict-dynamic, debugging "Refused to execute inline script" errors, deploying CSP in report-only mode first, configuring report-to / report-uri, or auditing an existing policy for unsafe-inline / unsafe-eval / wildcards. Triggers: "CSP blocks legitimate inline script", strict-dynamic, nonce-{RANDOM}, sha256-{HASH}, object-src none, base-uri none, frame-ancestors, Trusted Types, X-Content-Security-Policy obsolete, report-only vs enforced. NOT for general HTTP security headers (HSTS, COOP/COEP), Trusted Types deep dive, CORS configuration, or building a WAF.
tools
Choosing and operating an HTTP API versioning strategy that doesn't break clients — Stripe's date-based pinned versions, the Deprecation/Sunset header pair (RFC 9745 + RFC 8594), URI vs header vs media-type approaches, and the version-transformer pattern. Grounded in Stripe's published architecture and IETF RFCs.