curiositech/wei-2022-chain-of-thought

Name: wei-2022-chain-of-thought
Author: curiositech

skills/wei-2022-chain-of-thought/SKILL.md

npx skillsauth add curiositech/windags-skills wei-2022-chain-of-thought

Chain-of-Thought Reasoning for Agent Systems

Strategic framework for eliciting and orchestrating reasoning in LLM-based agents through structured decomposition

Decision Points

Primary Routing Decision: Decompose vs. Direct

Task Complexity × Agent Capability → Action

IF agent_parameters < 100B AND task_requires_multi_step_reasoning:
  → DIRECT prompting (decomposition hurts below emergence threshold)
  
ELIF baseline_accuracy > 80% AND reasoning_steps < 3:
  → DIRECT prompting (minimal gains, added latency cost)
  
ELIF task_complexity = HIGH AND multi_step_reasoning = TRUE:
  → CHAIN_OF_THOUGHT prompting (gains scale with complexity)
  
ELIF task_distribution = MIXED_COMPLEXITY:
  → SELF_SELECT mode (provide examples, let agent choose when to decompose)
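The routing rules above can be sketched as a small function. This is a minimal illustration, not the skill's own implementation: the `Task` fields, the `HIGH_COMPLEXITY_STEPS` cutoff, and the final fallback to direct prompting are assumptions added here, since the rules do not define how "HIGH" complexity is measured or what happens when no branch matches.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    DIRECT = "direct"
    CHAIN_OF_THOUGHT = "chain_of_thought"
    SELF_SELECT = "self_select"


@dataclass
class Task:
    # Hypothetical inputs; the caller has to estimate or measure these.
    reasoning_steps: int            # estimated number of reasoning steps
    baseline_accuracy: float        # accuracy with direct prompting, in [0, 1]
    mixed_complexity: bool = False  # True when the task stream mixes easy and hard items


EMERGENCE_THRESHOLD_PARAMS = 100e9  # ~100B parameters, as stated in the rules
HIGH_COMPLEXITY_STEPS = 3           # assumption: 3 or more steps counts as HIGH


def route(agent_parameters: float, task: Task) -> Mode:
    """Apply the routing rules in the order they are listed."""
    multi_step = task.reasoning_steps > 1
    if agent_parameters < EMERGENCE_THRESHOLD_PARAMS and multi_step:
        return Mode.DIRECT
    if task.baseline_accuracy > 0.80 and task.reasoning_steps < 3:
        return Mode.DIRECT
    if task.reasoning_steps >= HIGH_COMPLEXITY_STEPS:
        return Mode.CHAIN_OF_THOUGHT
    if task.mixed_complexity:
        return Mode.SELF_SELECT
    # Assumption: the rules name no default, so fall back to the cheapest option.
    return Mode.DIRECT
```

Keeping the branches in the listed order matters: a sub-threshold model is routed to direct prompting before task complexity is even considered.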

Failure Classification Decision Tree

Observe failure type:

IF error_type IN [calculator_mistake, symbol_mapping_error, missing_single_step]:
  → SHALLOW failure
  → Add external tools (calculator, validator)
  → Implement consistency checks
  → Re-run with augmentation
  
ELIF error_type IN [semantic_misunderstanding, incoherent_logic, fundamental_reasoning_flaw]:
  → DEEP failure
  → Check agent capability threshold
  → Re-route to more capable agent OR reject task
  
ELIF correct_answer = TRUE:
  → Verify reasoning chain (98% of correct answers have sound reasoning)
  → High confidence in output quality
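One way to encode the shallow/deep split is a lookup from error type to failure depth, then from depth to recovery steps. The sketch below uses the error-type names from the tree; the recovery step identifiers and the `ValueError` for unknown types are hypothetical choices made for illustration.

```python
from enum import Enum


class FailureDepth(Enum):
    SHALLOW = "shallow"
    DEEP = "deep"


# Error-type names are taken from the decision tree above.
SHALLOW_ERRORS = {"calculator_mistake", "symbol_mapping_error", "missing_single_step"}
DEEP_ERRORS = {"semantic_misunderstanding", "incoherent_logic", "fundamental_reasoning_flaw"}

# Hypothetical step identifiers mirroring the arrows in the tree.
RECOVERY = {
    FailureDepth.SHALLOW: [
        "add_external_tools",
        "add_consistency_checks",
        "rerun_with_augmentation",
    ],
    FailureDepth.DEEP: [
        "check_capability_threshold",
        "reroute_or_reject",
    ],
}


def classify_failure(error_type: str) -> FailureDepth:
    """Map an observed error type to SHALLOW or DEEP."""
    if error_type in SHALLOW_ERRORS:
        return FailureDepth.SHALLOW
    if error_type in DEEP_ERRORS:
        return FailureDepth.DEEP
    # Assumption: unknown error types are surfaced rather than guessed at.
    raise ValueError(f"unclassified error type: {error_type!r}")


def recovery_plan(error_type: str) -> list:
    """Return the ordered recovery steps for an error type."""
    return RECOVERY[classify_failure(error_type)]
```

Classifying before choosing a recovery step is the point of the tree: tool augmentation and re-routing are never applied blindly to the same failure.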

Reasoning Structure Selection Matrix

Task Type × Semantic Requirements → Structure

IF task = SYMBOLIC_MANIPULATION:
  → HYBRID: Natural language setup → formal operations → natural language verification
  
ELIF semantic_understanding_across_steps = TRUE:
  → FULL natural language chain-of-thought
  → Don't optimize for brevity (semantic grounding requires explicit articulation)
  
ELIF multi_agent_coordination = TRUE:
  → Natural language for coordination medium
  → Formal protocols within individual agents only
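The selection matrix can be read as an ordered series of checks. The following sketch assumes three boolean flags supplied by the caller (hypothetical names) and returns `None` when none of the listed conditions applies, because the matrix does not define a default.

```python
from enum import Enum
from typing import Optional


class Structure(Enum):
    HYBRID = "natural-language setup, formal operations, natural-language verification"
    FULL_NATURAL_LANGUAGE = "full natural-language chain of thought"
    NL_COORDINATION = "natural language between agents, formal protocols within each agent"


def select_structure(
    symbolic_manipulation: bool,
    semantic_across_steps: bool,
    multi_agent: bool,
) -> Optional[Structure]:
    """Evaluate the matrix conditions in the order they are listed."""
    if symbolic_manipulation:
        return Structure.HYBRID
    if semantic_across_steps:
        return Structure.FULL_NATURAL_LANGUAGE
    if multi_agent:
        return Structure.NL_COORDINATION
    return None  # the matrix does not cover this case
```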

Failure Modes

1. Premature Decomposition Syndrome

Detection Rule: If small model (<100B params) + chain-of-thought prompting → worse performance than baseline
Symptom: Fluent but illogical reasoning chains, performance degradation
Fix: Verify emergence threshold empirically before deploying decomposition; use direct prompting below threshold

2. Raw Compute Fallacy

Detection Rule: If adding compute time/tokens without structured intermediate steps → no performance gain
Symptom: Longer outputs with dots/padding but same accuracy
Fix: Structure compute through meaningful semantic intermediate states, not just duration

3. Uniform Error Handling

Detection Rule: If same recovery strategy applied to calculator errors and semantic failures
Symptom: Tool augmentation fails on deep reasoning errors; model re-routing wastes cycles on shallow errors
Fix: Classify failures first (shallow vs. deep), then route to appropriate intervention

4. Premature Optimization

Detection Rule: If natural language reasoning shortened to save tokens → coherence loss
Symptom: Broken semantic grounding, reasoning chain loses logical connection
Fix: Preserve explicit articulation; the "inefficiency" maintains semantic coherence

5. Brittleness Blindness

Detection Rule: If demo performance (95%) ≫ production performance (70%) across task variations
Symptom: High variance across annotators/exemplar sets on low-complexity tasks
Fix: Test robustness envelope before production; expect brittleness on tasks with high baseline accuracy

Worked Examples

Example 1: Math Word Problem Routing Decision

Scenario: Multi-step arithmetic problem arrives at LaMDA 137B agent

Decision Process:

Check capability: 137B > 100B threshold? Yes, above the emergence threshold
Check task complexity: 4 reasoning steps required = HIGH
Check baseline accuracy: ~6% without chain-of-thought
Decision: Use chain-of-thought (expected jump from 6% to ~14%)

Execution:

Prompt: "Let's think step by step. [problem]"
Agent produces: "First I need to find... Then I calculate... Next I..."
Expert catches: Verify each calculation step has semantic grounding
Novice misses: Would skip intermediate verification, trust fluent output

Result: 14% accuracy vs. 6% baseline (emergence threshold effect)

Example 2: Failure Classification and Recovery

Scenario: Agent produces wrong answer: "The total cost is $47" (correct: $51)

Analysis Process:

Examine reasoning chain: "First, 3 items × $12 = $36. Then add $15 tax = $51. Wait, I mean $47."
Classify failure: Calculator error in final step (36 + 15 = 51, but output 47)
Failure type: SHALLOW (arithmetic mistake, reasoning structure sound)
Decision: Add calculator tool, re-run

Recovery:

Augment with external calculator
Re-run same reasoning chain with tool verification
Expert insight: Don't re-route to bigger model for arithmetic errors
Novice error: Would assume "model isn't smart enough" and upgrade unnecessarily
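The "external calculator" step can be approximated by recomputing every explicit arithmetic claim in the reasoning chain. The sketch below is an assumption about how such a check might look: it only handles simple `a op b = c` claims with `+`, `-`, and multiplication written as `×`, `x`, or `*`, and it expects the chain to be normalized into that form first (currency symbols and words like "items" removed).

```python
import re
from operator import add, mul, sub

# Matches simple claims such as "3 × 12 = 36" or "36 + 15 = 47".
STEP_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*([+\-×x*])\s*(\d+(?:\.\d+)?)\s*=\s*(\d+(?:\.\d+)?)"
)

# Division is deliberately left out to keep the sketch free of zero-division handling.
OPS = {"+": add, "-": sub, "×": mul, "x": mul, "*": mul}


def check_arithmetic(chain: str, tolerance: float = 1e-9) -> list:
    """Return (claimed_step, recomputed_value) for every step whose arithmetic is wrong."""
    errors = []
    for match in STEP_PATTERN.finditer(chain):
        left, op, right, claimed = match.groups()
        actual = OPS[op](float(left), float(right))
        if abs(actual - float(claimed)) > tolerance:
            errors.append((match.group(0), actual))
    return errors


# Normalized form of the chain from this example, with the wrong final total.
chain = "3 × 12 = 36. 36 + 15 = 47."
print(check_arithmetic(chain))  # the second step is flagged; the first passes
```

A non-empty result with an otherwise coherent chain points to a shallow failure, which is the case where tool augmentation is the appropriate fix.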

Example 3: Natural Language vs. Formal Protocol Choice

Scenario: Multi-agent system coordinating complex financial analysis

Decision Point: Use formal API calls or natural language coordination?

Analysis:

Task requires semantic understanding across 6 reasoning steps
Each step builds on previous semantic context
Formal protocol would lose grounding between symbolic operations and real-world meaning

Decision: Natural language coordination despite token overhead

Implementation:

Agent A: "Given the Q3 earnings show 15% revenue growth but 8% margin compression, 
         I need to analyze if this indicates sustainable growth or pricing pressure..."
Agent B: "Building on your margin analysis, the compression aligns with our competitive 
         positioning data showing 3 new market entrants..."

Expert reasoning: The "inefficiency" of natural language maintains semantic coherence
Novice mistake: Would optimize for concise formal protocols and lose reasoning grounding

Example 4: Production Robustness Testing

Scenario: Prompting technique achieves 95% accuracy on test exemplars

Pre-deployment Process:

Variance test: Run across 5 different annotator sets
Results: Performance range 71% - 96% (high variance = brittleness indicator)
Task analysis: Coin flip reasoning (low complexity, high baseline)
Decision: Brittle application, implement stabilization or accept variance

Production Strategy:

Deploy with variance monitoring
Flag high-variance outputs for review
Expert insight: Brittleness correlates with low reasoning complexity
Novice assumption: Would expect consistent performance based on demo results
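The variance test can be sketched as a summary over per-exemplar-set accuracies. Everything here is illustrative: the function name, the `max_spread` cutoff of 10 points, and the three middle accuracy values are assumptions; only the 71% and 96% endpoints come from the example.

```python
from statistics import mean, pstdev


def robustness_envelope(accuracies: list, max_spread: float = 0.10) -> dict:
    """Summarize accuracy across exemplar sets and flag a wide spread as brittle."""
    low, high = min(accuracies), max(accuracies)
    spread = high - low
    return {
        "min": low,
        "max": high,
        "mean": mean(accuracies),
        "stdev": pstdev(accuracies),
        "spread": spread,
        # Assumption: a spread above max_spread is treated as a brittleness indicator.
        "brittle": spread > max_spread,
    }


# Five annotator sets; 0.71 and 0.96 match the range above, the rest are made up.
runs = [0.71, 0.96, 0.88, 0.90, 0.83]
report = robustness_envelope(runs)
print(report["brittle"], round(report["spread"], 2))
```

The same summary could be recomputed periodically in production to support the variance monitoring step.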

Quality Gates

Capability Assessment

[ ] Agent parameter count verified (>100B for chain-of-thought effectiveness)
[ ] Emergence threshold confirmed empirically on sample tasks
[ ] Baseline accuracy measured without decomposition
[ ] Task complexity categorized (steps required, semantic depth)

Failure Mode Preparation

[ ] Error classification system implemented (shallow vs. deep detection rules)
[ ] Tool augmentation ready for shallow failures (calculator, validator)
[ ] Re-routing paths defined for deep failures
[ ] Recovery strategy mapped to failure type

Reasoning Structure Validation

[ ] Natural language vs. formal protocol choice justified
[ ] Semantic grounding maintained across reasoning steps
[ ] Intermediate states contain meaningful semantic content
[ ] Brevity optimization balanced against coherence requirements

Production Readiness

[ ] Robustness tested across multiple annotator/exemplar sets
[ ] Variance envelope mapped for task distribution
[ ] Brittleness indicators identified and monitored
[ ] Performance degradation thresholds defined

Quality Assurance

[ ] Sample outputs manually verified for reasoning chain coherence
[ ] Correct answers verified to have sound reasoning (98% confidence rule)
[ ] Edge cases tested at capability boundaries
[ ] Fallback strategies defined for out-of-distribution tasks

NOT-FOR Boundaries

This skill is NOT for:

Simple lookup tasks: Use direct retrieval instead of reasoning decomposition
Tasks with >90% baseline accuracy: Minimal gains don't justify latency overhead
Sub-100B parameter models on complex reasoning: Use [model-scaling-strategy] to upgrade capability first
Pure symbolic computation: Use [formal-reasoning-systems] for equation solving without semantic context
Real-time applications: Decomposition latency incompatible with <100ms response requirements

Delegate to other skills:

For model selection decisions → Use [capability-threshold-analysis]
For tool integration strategy → Use [agent-tool-coordination]
For multi-agent protocol design → Use [agent-communication-patterns]
For performance optimization → Use [latency-accuracy-tradeoffs]
For error monitoring systems → Use [agent-observability-patterns]

Category: development
Updated: Apr 4, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add curiositech/windags-skills wei-2022-chain-of-thought

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status.

Trivy (Clean, 95%): Container and dependency vulnerability scanner
Semgrep (Clean, 95%): Static code analysis for vulnerabilities
mcp-scan (Snyk) (Clean, 95%): Model Context Protocol security validation
Snyk (dep) (Skipped, 50%): Open source security scanning
Socket.dev (Skipped, 50%): Supply chain security analysis
VirusTotal (Skipped, 50%): Multi-engine malware detection
CrowdStrike (Skipped, 50%): Advanced threat intelligence
OSV-Scanner (Skipped, 50%): Open Source Vulnerability database check
OWASP Dep-Check (Skipped, 50%)

Last scanned: Apr 4, 2026, 2:55 PM (4.3s, 14 files scanned)

SKILL.md

license: Apache-2.0
name: wei-2022-chain-of-thought
description: Strategic framework for eliciting and orchestrating reasoning in LLM-based agents through structured decomposition
category: Research & Academic

Install manually

# Clone the repo
git clone https://github.com/curiositech/windags-skills.git

# Copy into Claude Code skills folder (global)
cp -r windags-skills/skills/wei-2022-chain-of-thought ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository: curiositech/windags-skills

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT