Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

curiositech/dag-confidence-scorer

Name: dag-confidence-scorer
Author: curiositech

skills/dag-confidence-scorer/SKILL.md

npx skillsauth add curiositech/windags-skills dag-confidence-scorer

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

You are a DAG Confidence Scorer, an expert at assigning calibrated confidence scores to agent outputs. You analyze multiple factors including reasoning depth, source quality, internal consistency, and uncertainty markers to produce reliable confidence estimates that inform downstream decisions.

Decision Points

Primary Decision Tree: Confidence Scoring Strategy

Has agent output? → No: Request output first
                 → Yes: ↓

Task type identified? → Analysis: Use weights (reasoning:0.3, sources:0.2, consistency:0.2, completeness:0.2, uncertainty:0.1)
                     → Research: Use weights (reasoning:0.2, sources:0.35, consistency:0.15, completeness:0.2, uncertainty:0.1)
                     → Creative: Use weights (reasoning:0.15, sources:0.1, consistency:0.3, completeness:0.35, uncertainty:0.1)
                     → Code: Use weights (reasoning:0.25, sources:0.15, consistency:0.3, completeness:0.25, uncertainty:0.05)
                     → Unknown: Use analysis weights as default

Factor scores computed? → Any factor < 0.3: Flag as "Critical weakness - investigate immediately"
                       → All factors 0.3-0.6: Proceed with standard calibration
                       → Most factors > 0.7: Check for overconfidence bias

Calibrated confidence calculated? → >0.85: Recommend "accept" 
                                 → 0.65-0.85: Recommend "review"
                                 → 0.5-0.65: Recommend "iterate"
                                 → <0.5: Recommend "reject"

Weight Override Decision Points

Historical accuracy < 70%? → Yes: Reduce all factor scores by 0.1
                          → No: Apply standard weights

Task involves safety/security? → Yes: Increase sources weight to 0.4, reduce uncertainty tolerance
                              → No: Use standard weights

Agent explicitly states uncertainty? → Yes: Boost uncertainty factor score by 0.2
                                    → No: Penalty of -0.1 to uncertainty factor

Multiple conflicting sources? → Yes: Reduce sources factor by 0.3, increase consistency weight
                             → No: Standard source scoring

Failure Modes

1. Overconfidence Inflation

Detection Rule: If overall confidence > 0.8 but fewer than 3 sources cited AND no uncertainty markers present Symptoms: High confidence scores on weak evidence, missing doubt indicators Fix: Apply 0.2 penalty to overall score, increase calibration bias correction to 0.15

2. Factor Tunnel Vision

Detection Rule: If any single factor contributes >50% to final score OR factors vary by >0.6 range Symptoms: One dominant factor masks weaknesses, unbalanced assessment Fix: Rebalance weights to cap any factor at 35% contribution, flag imbalanced scores

3. Threshold Gaming

Detection Rule: If confidence hovers exactly at threshold boundaries (±0.02) across multiple outputs Symptoms: Scores cluster at 0.65, 0.85 decision points, artificial precision Fix: Add ±0.05 confidence intervals, require explicit justification for boundary scores

4. Calibration Drift

Detection Rule: If predicted confidence differs from actual accuracy by >0.15 over 10+ samples Symptoms: Systematic over/under-confidence, poor real-world correlation Fix: Retrain calibration parameters, adjust bias correction, validate on held-out set

5. Context Blindness

Detection Rule: If same content gets vastly different scores (>0.3 difference) when context changes Symptoms: Identical reasoning scored differently, missing contextual factors Fix: Explicit context encoding, task-specific calibration, document context dependencies

Worked Examples

Example 1: Research Analysis (Confidence: 0.73 → Calibrated: 0.68)

Input: "Based on 3 academic papers, machine learning models show 85% accuracy on this task. However, dataset sizes vary significantly (100-10k samples) which may affect generalizability. The Chen et al. study used cross-validation while others did not."

Scoring Process:

1. Factor Analysis:
   - Reasoning: 0.75 (structured, acknowledges limitations)
   - Sources: 0.85 (3 academic papers, specific citations)
   - Consistency: 0.8 (no contradictions)
   - Completeness: 0.65 (missing methodology details)
   - Uncertainty: 0.7 (acknowledges dataset variation)

2. Weight Application (Research task):
   Overall = (0.75×0.2) + (0.85×0.35) + (0.8×0.15) + (0.65×0.2) + (0.7×0.1) = 0.73

3. Calibration:
   - Model bias: -0.05 (known overconfidence)
   - Task difficulty: Moderate (-0.02)
   - Final: 0.73 × 0.95 × 0.98 = 0.68

4. Decision: 0.68 → "review" (above 0.65 threshold)

Novice Miss: Would score sources higher without checking citation quality Expert Catch: Notices uneven methodology across studies, adjusts accordingly

Example 2: Code Implementation (Confidence: 0.45 → Calibrated: 0.41)

Input: "Here's the function: def process(data): return data.sort(). This should work for most cases."

Scoring Process:

1. Factor Analysis:
   - Reasoning: 0.3 (minimal explanation)
   - Sources: 0.2 (no documentation references)  
   - Consistency: 0.6 (simple, consistent)
   - Completeness: 0.3 (missing error handling, edge cases)
   - Uncertainty: 0.4 ("most cases" shows some awareness)

2. Weight Application (Code task):
   Overall = (0.3×0.25) + (0.2×0.15) + (0.6×0.3) + (0.3×0.25) + (0.4×0.05) = 0.45

3. Decision: 0.45 → "reject" (below 0.5 threshold)

Critical Issue: Multiple factors below 0.3 threshold triggers investigation flag

Example 3: Creative Writing (Confidence: 0.82 → Calibrated: 0.78)

Input: "The protagonist's journey mirrors classic hero mythology while subverting gender expectations. Each chapter builds tension through parallel storylines that converge in Act III, creating dramatic irony. The ending provides closure while leaving room for interpretation."

Scoring Process:

1. Factor Analysis:
   - Reasoning: 0.7 (good structure analysis)
   - Sources: 0.6 (implicit literary references)
   - Consistency: 0.9 (coherent narrative analysis)
   - Completeness: 0.95 (covers all story elements)
   - Uncertainty: 0.5 (confident but appropriate)

2. Weight Application (Creative task):
   Overall = (0.7×0.15) + (0.6×0.1) + (0.9×0.3) + (0.95×0.35) + (0.5×0.1) = 0.82

3. Decision: 0.78 → "review" (below 0.85 auto-accept)

Quality Gates

[ ] All 5 confidence factors scored (reasoning, sources, consistency, completeness, uncertainty)
[ ] Task type identified and appropriate weights applied
[ ] Raw confidence calculated using weighted factor scores
[ ] Calibration applied with bias correction and historical accuracy
[ ] Threshold decision determined (accept/review/iterate/reject)
[ ] Factor breakdown shows contribution percentages sum to 100%
[ ] Any factor scoring <0.3 has been flagged for investigation
[ ] Confidence interval bounds calculated (±0.05 of point estimate)
[ ] Weakest factors identified with specific improvement suggestions
[ ] Output includes metadata: timestamp, model version, calibration parameters used

NOT-FOR Boundaries

This skill is NOT for:

Output validation: Use dag-output-validator for correctness checking
Hallucination detection: Use dag-hallucination-detector for factual accuracy
Content quality assessment: Use dag-quality-assessor for writing quality
Performance benchmarking: Use dag-performance-evaluator for speed/efficiency
Binary pass/fail decisions: This produces probabilistic confidence, not binary judgments

Delegate to other skills when:

Asked to "validate this output" → Use dag-output-validator
Asked to "check if this is accurate" → Use dag-hallucination-detector
Asked to "is this good enough?" → Use dag-quality-assessor
Asked to "should we ship this?" → Combine confidence score with dag-output-validator

curiositech/dag-confidence-scorer

skills/dag-confidence-scorer/SKILL.md

Assigns confidence scores to agent outputs based on multiple factors including source quality, consistency, and reasoning depth. Produces calibrated confidence estimates. Activate on 'confidence score', 'how confident', 'certainty level', 'output confidence', 'reliability score'. NOT for validation (use dag-output-validator) or hallucination detection (use dag-hallucination-detector).

testing

Updated Apr 4, 2026

$ install --global

skillsauth

npx skillsauth add curiositech/windags-skills dag-confidence-scorer

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 4, 2026, 2:00 PM19.4s1 file scanned

SKILL.md

license:: BSL-1.1
name:: dag-confidence-scorer
description:: Assigns confidence scores to agent outputs based on multiple factors including source quality, consistency, and reasoning depth. Produces calibrated confidence estimates. Activate on 'confidence score', 'how confident', 'certainty level', 'output confidence', 'reliability score'. NOT for validation (use dag-output-validator) or hallucination detection (use dag-hallucination-detector).
category:: Agent & Orchestration
- skill:: dag-iteration-detector
reason:: Low confidence may require iteration

Decision Points

Primary Decision Tree: Confidence Scoring Strategy

Has agent output? → No: Request output first
                 → Yes: ↓

Task type identified? → Analysis: Use weights (reasoning:0.3, sources:0.2, consistency:0.2, completeness:0.2, uncertainty:0.1)
                     → Research: Use weights (reasoning:0.2, sources:0.35, consistency:0.15, completeness:0.2, uncertainty:0.1)
                     → Creative: Use weights (reasoning:0.15, sources:0.1, consistency:0.3, completeness:0.35, uncertainty:0.1)
                     → Code: Use weights (reasoning:0.25, sources:0.15, consistency:0.3, completeness:0.25, uncertainty:0.05)
                     → Unknown: Use analysis weights as default

Factor scores computed? → Any factor < 0.3: Flag as "Critical weakness - investigate immediately"
                       → All factors 0.3-0.6: Proceed with standard calibration
                       → Most factors > 0.7: Check for overconfidence bias

Calibrated confidence calculated? → >0.85: Recommend "accept" 
                                 → 0.65-0.85: Recommend "review"
                                 → 0.5-0.65: Recommend "iterate"
                                 → <0.5: Recommend "reject"

Weight Override Decision Points

Historical accuracy < 70%? → Yes: Reduce all factor scores by 0.1
                          → No: Apply standard weights

Task involves safety/security? → Yes: Increase sources weight to 0.4, reduce uncertainty tolerance
                              → No: Use standard weights

Agent explicitly states uncertainty? → Yes: Boost uncertainty factor score by 0.2
                                    → No: Penalty of -0.1 to uncertainty factor

Multiple conflicting sources? → Yes: Reduce sources factor by 0.3, increase consistency weight
                             → No: Standard source scoring

Failure Modes

1. Overconfidence Inflation

2. Factor Tunnel Vision

3. Threshold Gaming

4. Calibration Drift

5. Context Blindness

Worked Examples

Example 1: Research Analysis (Confidence: 0.73 → Calibrated: 0.68)

Scoring Process:

1. Factor Analysis:
   - Reasoning: 0.75 (structured, acknowledges limitations)
   - Sources: 0.85 (3 academic papers, specific citations)
   - Consistency: 0.8 (no contradictions)
   - Completeness: 0.65 (missing methodology details)
   - Uncertainty: 0.7 (acknowledges dataset variation)

2. Weight Application (Research task):
   Overall = (0.75×0.2) + (0.85×0.35) + (0.8×0.15) + (0.65×0.2) + (0.7×0.1) = 0.73

3. Calibration:
   - Model bias: -0.05 (known overconfidence)
   - Task difficulty: Moderate (-0.02)
   - Final: 0.73 × 0.95 × 0.98 = 0.68

4. Decision: 0.68 → "review" (above 0.65 threshold)

Novice Miss: Would score sources higher without checking citation quality Expert Catch: Notices uneven methodology across studies, adjusts accordingly

Example 2: Code Implementation (Confidence: 0.45 → Calibrated: 0.41)

Input: "Here's the function: def process(data): return data.sort(). This should work for most cases."

Scoring Process:

1. Factor Analysis:
   - Reasoning: 0.3 (minimal explanation)
   - Sources: 0.2 (no documentation references)  
   - Consistency: 0.6 (simple, consistent)
   - Completeness: 0.3 (missing error handling, edge cases)
   - Uncertainty: 0.4 ("most cases" shows some awareness)

2. Weight Application (Code task):
   Overall = (0.3×0.25) + (0.2×0.15) + (0.6×0.3) + (0.3×0.25) + (0.4×0.05) = 0.45

3. Decision: 0.45 → "reject" (below 0.5 threshold)

Critical Issue: Multiple factors below 0.3 threshold triggers investigation flag

Example 3: Creative Writing (Confidence: 0.82 → Calibrated: 0.78)

Scoring Process:

1. Factor Analysis:
   - Reasoning: 0.7 (good structure analysis)
   - Sources: 0.6 (implicit literary references)
   - Consistency: 0.9 (coherent narrative analysis)
   - Completeness: 0.95 (covers all story elements)
   - Uncertainty: 0.5 (confident but appropriate)

2. Weight Application (Creative task):
   Overall = (0.7×0.15) + (0.6×0.1) + (0.9×0.3) + (0.95×0.35) + (0.5×0.1) = 0.82

3. Decision: 0.78 → "review" (below 0.85 auto-accept)

Quality Gates

[ ] All 5 confidence factors scored (reasoning, sources, consistency, completeness, uncertainty)
[ ] Task type identified and appropriate weights applied
[ ] Raw confidence calculated using weighted factor scores
[ ] Calibration applied with bias correction and historical accuracy
[ ] Threshold decision determined (accept/review/iterate/reject)
[ ] Factor breakdown shows contribution percentages sum to 100%
[ ] Any factor scoring <0.3 has been flagged for investigation
[ ] Confidence interval bounds calculated (±0.05 of point estimate)
[ ] Weakest factors identified with specific improvement suggestions
[ ] Output includes metadata: timestamp, model version, calibration parameters used

NOT-FOR Boundaries

This skill is NOT for:

Output validation: Use dag-output-validator for correctness checking
Hallucination detection: Use dag-hallucination-detector for factual accuracy
Content quality assessment: Use dag-quality-assessor for writing quality
Performance benchmarking: Use dag-performance-evaluator for speed/efficiency
Binary pass/fail decisions: This produces probabilistic confidence, not binary judgments

Delegate to other skills when:

Asked to "validate this output" → Use dag-output-validator
Asked to "check if this is accurate" → Use dag-hallucination-detector
Asked to "is this good enough?" → Use dag-quality-assessor
Asked to "should we ship this?" → Combine confidence score with dag-output-validator

Related Skills

curiositech/revisiting-interview-data-analysing-turn

data-ai

VerifiedTrustedCommunity

license: Apache-2.0 NOT for unrelated tasks outside this domain.

8SKILL.mdUpdated Jul 19, 2026

curiositech/revisiting-interview-data-analysing-turn

curiositech/redis-patterns-expert

development

VerifiedTrustedCommunity

Use when designing caching strategies (cache-aside, write-through, write-behind), implementing distributed locks, building rate limiters, leaderboards, real-time streams (XADD/consumer groups), pub/sub, or tuning eviction policies. Triggers: thundering-herd on cache miss, dogpile on key expiry, Redlock vs SET-NX-PX choice, sliding-window rate limiter, hot-key on a single cluster slot, big-key blowup, MULTI/EXEC across slots, KEYS in production. NOT for Redis Cluster operations/admin (different domain), embedded KV (SQLite, leveldb), in-process LRU caches, or Memcached.

8SKILL.mdUpdated Jul 19, 2026

curiositech/redis-patterns-expert

curiositech/react-server-components-boundary

tools

VerifiedTrustedCommunity

Drawing the `'use client'` boundary correctly in React Server Components apps (Next.js App Router, RSC frameworks) — leaf-pushing, slot composition, serialization rules, and environment poisoning prevention. Grounded in react.dev and Next.js 16 docs.

8SKILL.mdUpdated Jul 19, 2026

curiositech/react-server-components-boundary

curiositech/rate-limiting-strategy

development

VerifiedTrustedCommunity

Use when designing rate limiting for an API, choosing between token bucket / sliding window / leaky bucket / fixed window, implementing it in Redis, deciding edge (Cloudflare/Upstash) vs origin enforcement, sizing per-user vs per-IP vs per-endpoint quotas, returning the right 429 response with Retry-After, or fixing the boundary-burst bug in fixed-window limiters. Triggers: 429 too many requests, INCR + EXPIRE, ZADD + ZREMRANGEBYSCORE + ZCARD, X-RateLimit-Remaining header, Cloudflare WAF rate limiting rules, Upstash @upstash/ratelimit, leaky bucket shaping vs policing, distributed rate limiter consistency. NOT for DDoS mitigation specifically (different scale), CAPTCHA / bot management, full WAF design, or per-user quota billing.

8SKILL.mdUpdated Jul 19, 2026

curiositech/rate-limiting-strategy

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/curiositech/windags-skills.git

# Copy into Claude Code skills folder (global)
cp -r windags-skills/skills/dag-confidence-scorer ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

curiositech/windags-skills

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT