skills/huang-et-al-2026-six-sigma-agent/SKILL.md
Application of Six Sigma quality methodology to AI agent process improvement and reliability
Install this skill globally with one command: `npx skillsauth add curiositech/windags-skills huang-et-al-2026-six-sigma-agent`. Works with Claude Code, Cursor, and Windsurf.
license: Apache-2.0
name: six-sigma-agent
description: Enterprise-grade reliability engineering for LLM agent systems through consensus-driven decomposed execution
triggers:
- "agent reliability"
- "production AI systems"
- "multi-step workflow"
- "agent error rates"
- "consensus voting"
- "task decomposition"
- "six sigma"
- "Byzantine fault tolerance"
- "agent coordination"
- "enterprise AI deployment"
version: 1.0
source: "The Six Sigma Agent (Lyzr Research, 2024)"
Load this skill when facing:
This framework applies when you need mathematical guarantees about system reliability rather than empirical "it usually works" confidence.
The fundamental challenge: In sequential workflows, reliability degrades exponentially with each step.
P(workflow_success) = (1 - p)^m
where p = per-step error rate, m = number of steps
Implications:
Key insight: The problem isn't that models are bad; it's that multiplicative compound decay dominates at scale, regardless of base accuracy.
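A minimal sketch of the compounding formula above, assuming each step fails independently with the same probability. The function name `workflow_success` is illustrative and not part of the framework.

```python
def workflow_success(p: float, m: int) -> float:
    """Probability that all m sequential steps succeed.

    Assumes every step fails independently with probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if m < 0:
        raise ValueError("m must be non-negative")
    return (1.0 - p) ** m


# Tabulate how quickly reliability decays as the workflow grows.
for p in (0.01, 0.001):
    for m in (10, 50, 100):
        print(f"p={p:.3f}  m={m:3d}  success={workflow_success(p, m):.3f}")
```

At a 1% per-step error rate, a 100-step workflow succeeds only about 37% of the time; at 0.1% it is still only about 90%.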
The mathematical breakthrough: Sample n independent executions and take the majority vote.
P(system_error) = O(p^⌈n/2⌉)
where p = individual agent error rate, n = number of voters (odd)
Concrete example:
Why it works:
Critical requirement: Only works when tasks are truly atomic (see Mental Model #3).
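The O(p^⌈n/2⌉) bound hides a binomial coefficient, so it can help to compute the exact majority-vote error. The sketch below assumes independent errors and, pessimistically, that every wrong voter lands on the same wrong answer. The names `majority_error` and `smallest_odd_n` are illustrative.

```python
from math import comb


def majority_error(p: float, n: int) -> float:
    """Probability that more than half of n voters are wrong.

    Assumes independent errors with rate p, and that all wrong voters
    agree on the same wrong answer (a worst case for the vote).
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    k_min = n // 2 + 1  # smallest number of wrong votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def smallest_odd_n(p: float, target: float, n_max: int = 99) -> int:
    """Smallest odd number of voters whose majority error is at most target."""
    for n in range(1, n_max + 1, 2):
        if majority_error(p, n) <= target:
            return n
    raise ValueError("target not reachable within n_max voters")


print(f"{majority_error(0.05, 5):.6f}")  # 0.001158
print(smallest_odd_n(0.05, 3.4e-6))      # 13
```

Under these assumptions, five voters at a 5% error rate give roughly 0.12% system error, and 13 voters are the first odd count to fall below 3.4 defects per million.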
The core principle: Consensus voting requires tasks decomposed to minimal, verifiable, functionally deterministic units.
Why atomicity matters:
Bad decomposition (too coarse):
Task: "Analyze this dataset and create a report"
Problem: Too many degrees of freedom — agents will produce incomparable outputs
Good decomposition (atomic):
Task 1: "Extract the numerical values from column 'Revenue' in rows 10-20"
Task 2: "Calculate the mean of these values"
Task 3: "Compare this mean to 150000 and return 'above' or 'below'"
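One way to picture the atomic version is as three small functions whose outputs can be compared exactly. This is a sketch only: in the framework each task would be executed by agents, and the row format, the function names, and the tie rule (a mean exactly equal to the threshold counts as "below") are assumptions made for illustration.

```python
from statistics import mean


def extract_revenue(rows: list[dict], start: int, end: int) -> list[float]:
    """Task 1: values of column 'Revenue' in rows start..end (1-indexed, inclusive)."""
    return [float(row["Revenue"]) for row in rows[start - 1 : end]]


def mean_of(values: list[float]) -> float:
    """Task 2: arithmetic mean of the extracted values."""
    return mean(values)


def compare_to_threshold(value: float, threshold: float = 150000) -> str:
    """Task 3: a two-label answer that voters can match exactly."""
    return "above" if value > threshold else "below"


# Hypothetical dataset: 25 rows with steadily increasing revenue.
rows = [{"Revenue": 100000 + 10000 * i} for i in range(25)]
values = extract_revenue(rows, 10, 20)
print(compare_to_threshold(mean_of(values)))  # above
```

Each step returns a list of numbers, a single number, or one of two labels, so independent executions either match or they do not.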
Quality criteria:
The adaptive mechanism: Use initial vote disagreement as an uncertainty signal.
How it works:
Why this is elegant:
Practical insight: Contested votes reveal genuinely ambiguous cases where additional verification is worthwhile — the system's "doubt" is information.
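The adaptive mechanism can be sketched as a loop that adds voters only while the vote is contested. The contested rule used here (the lead of the top answer over the runner-up is below `min_margin`), the step size of two, and the `run_agent` callable are assumptions for illustration; the base of 5 and the cap of 13 follow the values used elsewhere in this skill.

```python
from collections import Counter
from typing import Callable, Hashable


def dynamic_vote(
    run_agent: Callable[[], Hashable],
    base_n: int = 5,
    step: int = 2,
    max_n: int = 13,
    min_margin: int = 2,
) -> tuple[Hashable, int]:
    """Return (winning answer, number of votes cast).

    run_agent is a hypothetical callable that executes one atomic action
    and returns a comparable answer.
    """
    votes = [run_agent() for _ in range(base_n)]
    while True:
        ranked = Counter(votes).most_common()
        top_answer, top_count = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        decided = top_count - runner_up >= min_margin
        if decided or len(votes) >= max_n:
            return top_answer, len(votes)
        # Contested vote: treat disagreement as uncertainty and add voters.
        extra = min(step, max_n - len(votes))
        votes.extend(run_agent() for _ in range(extra))


# A 3-2 split on the first five votes triggers two more voters.
scripted = iter(["a", "b", "a", "b", "a", "a", "a"])
print(dynamic_vote(lambda: next(scripted)))  # ('a', 7)
```

Unanimous or lopsided votes stop at the base redundancy, so the extra cost is paid only on actions where the agents disagree.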
The counterintuitive finding: Multi-agent systems where agents perform different roles often underperform single agents.
Why collaboration fails:
Why redundancy succeeds:
Design principle: Identical parallel execution > Differentiated collaborative execution
IF you have a multi-step workflow with reliability requirements, THEN:
Calculate compound error risk:
Evaluate atomicity feasibility:
(see atomic-decomposition-consensus-effectiveness.md)
Assess independence assumption:
IF implementing consensus voting, THEN:
Start with n=5 (standard baseline)
Use dynamic scaling if:
Use fixed n=13 if:
IF optimizing cost vs. reliability, THEN:
Counterintuitive result: 5× cheap models often > 1× expensive model
Decision matrix:
Frontier model alone (GPT-4 Turbo, Claude 3.5):
Mid-tier consensus (5× GPT-3.5 or similar):
See cost-efficiency-through-model-diversity.md for detailed economics.
Hybrid approach:
IF your existing multi-agent system underperforms expectations, THEN:
Identify failure category (see MAST-Data taxonomy):
Consider architectural pivot:
See multi-agent-coordination-failures.md for detailed analysis.
If coordination is necessary:
| File | Description | Load When... |
|------|-------------|--------------|
| error-compounding-and-workflow-reliability.md | Mathematical foundations of exponential decay in multi-step workflows; formal proofs of why model improvement alone cannot achieve Six Sigma | Designing production workflows; justifying architectural investments; calculating reliability requirements |
| consensus-voting-exponential-reliability.md | Core mathematical framework for consensus voting; proofs of O(p^⌈n/2⌉) reliability; analysis of error correlation constraints | Implementing consensus mechanisms; choosing redundancy levels; understanding independence requirements |
| atomic-decomposition-consensus-effectiveness.md | Detailed criteria for task decomposition; formalization of atomicity properties; examples of good vs. bad decomposition | Breaking down complex tasks; troubleshooting why consensus isn't working; training agents to decompose effectively |
| dynamic-scaling-uncertainty-detection.md | Mechanism design for adaptive redundancy; contested vote patterns as uncertainty signals; cost optimization strategies | Implementing dynamic scaling; balancing cost and reliability; handling variable-difficulty tasks |
| cost-efficiency-through-model-diversity.md | Economic analysis of model selection; case studies showing when cheap consensus beats expensive single models; ROI calculations | Budget planning; model selection decisions; justifying consensus overhead to stakeholders |
| multi-agent-coordination-failures.md | Empirical evidence that collaboration often fails; taxonomy of failure modes; comparison of collaboration vs. redundancy architectures | Diagnosing multi-agent system failures; deciding between collaborative vs. consensus architectures; understanding coordination overhead |
| task-verification-failure-prevention.md | Architectural patterns for preventing verification failures; relationship between atomicity and verifiability; design principles for testable actions | Designing verification mechanisms; troubleshooting false positives/negatives; ensuring atomic actions are truly verifiable |
Mistake: Believing GPT-5 or Claude-4 will solve reliability problems.
Why it fails: Error compounding is exponential. Even a 0.1% per-step error rate leaves only 90.5% workflow success at 100 steps. You can't model your way out of (1 - p)^m decay.
Correct approach: Accept that all models are probabilistic; build architecture for reliability.
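To see how demanding the per-step requirement becomes, the compounding formula can be inverted. This is a sketch under the same independence assumption; `max_step_error` is an illustrative name.

```python
def max_step_error(target_success: float, m: int) -> float:
    """Largest per-step error rate p such that (1 - p)^m >= target_success."""
    if not 0.0 < target_success <= 1.0:
        raise ValueError("target_success must be in (0, 1]")
    if m < 1:
        raise ValueError("m must be at least 1")
    return 1.0 - target_success ** (1.0 / m)


# Per-step budget for a 99%-reliable workflow at different lengths.
for m in (10, 100, 1000):
    print(f"m={m:4d}  max per-step error={max_step_error(0.99, m):.6%}")
```

A 100-step workflow that must succeed 99% of the time needs a per-step error rate of about 0.01%, which is why the redundancy has to come from the architecture rather than from the model alone.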
Mistake: Applying consensus voting to complex, multi-faceted tasks.
Example: "Write a financial analysis report" × 5 agents → vote on best report
Why it fails: Outputs are too diverse to meaningfully "vote" on. No clear majority emerges. Consensus requires comparable atomic outputs.
Correct approach: Decompose to atomic units first, then apply consensus. Vote on specific extractable facts, calculations, or classifications.
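Voting is only meaningful when answers can be compared for equality. The sketch below normalizes short atomic answers and requires a strict majority; the normalization rule and the function names are assumptions for illustration.

```python
from collections import Counter
from typing import Optional


def normalize(answer: str) -> str:
    """Collapse case and whitespace so equivalent answers compare equal."""
    return " ".join(answer.strip().lower().split())


def strict_majority(answers: list[str]) -> Optional[str]:
    """Return the answer given by more than half the voters, or None."""
    if not answers:
        return None
    counts = Counter(normalize(a) for a in answers)
    winner, count = counts.most_common(1)[0]
    return winner if count > len(answers) / 2 else None


# Atomic classification: a clear majority emerges.
print(strict_majority(["Above", " above ", "below", "ABOVE", "below"]))  # above

# Free-form reports: every output differs, so there is nothing to vote on.
reports = [f"Report draft {i}: revenue analysis ..." for i in range(5)]
print(strict_majority(reports))  # None
```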
Mistake: Designing systems where agents have differentiated roles (researcher, writer, critic) without considering redundancy.
Why it fails: Coordination overhead, task verification failures, and handoff brittleness often negate benefits. Empirically underperforms in many cases.
Correct approach: Start with redundant consensus architecture. Only add collaboration when coordination complexity is demonstrably worth it.
Mistake: Assuming consensus works with any set of agents, ignoring that they might make the same systematic errors.
Example: All agents misinterpret ambiguous instruction identically → consensus on wrong answer.
Why it fails: Mathematical guarantees assume independent errors. High correlation (ρ > 0.99) breaks exponential improvement.
Correct approach: Ensure task specifications are unambiguous. Test for systematic errors. Framework tolerates ρ ≤ 0.99, but lower is better.
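One way to test for systematic errors is to run the agents on a labeled evaluation set and measure how often they fail together. The sketch below computes the mean pairwise Pearson correlation of per-task error indicators. The input layout (one list of 0/1 flags per agent) and the function names are assumptions, and the framework's own correlation analysis may define ρ differently.

```python
from itertools import combinations
from statistics import mean


def pearson(x: list[int], y: list[int]) -> float:
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # An agent that never (or always) errs has no defined correlation;
        # this sketch treats that case as 0.
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def mean_pairwise_error_correlation(errors: list[list[int]]) -> float:
    """errors[i][j] is 1 if agent i got evaluation task j wrong, else 0."""
    if len(errors) < 2:
        raise ValueError("need at least two agents")
    return mean(pearson(a, b) for a, b in combinations(errors, 2))


# Hypothetical results for three agents on six labeled tasks.
errors = [
    [1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
]
print(round(mean_pairwise_error_correlation(errors), 3))
```

A task that every agent gets wrong (the first column here) is a sign of an ambiguous specification rather than random noise.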
Mistake: Always using n=13 agents (maximum reliability) for every action.
Why it fails: Massive unnecessary cost. Most actions don't need that much redundancy.
Correct approach: Use dynamic scaling. Start with n=5, scale only on contested votes. Achieves Six Sigma with 89% of actions at base redundancy.
Mistake: Feeding complex, end-to-end tasks directly to consensus system.
Example: "Build a customer segmentation model" × 5 agents → vote
Why it fails: Task is not atomic. Agents will diverge in approach. No meaningful consensus possible.
Correct approach: Decompose ruthlessly. Every action should be minimal, verifiable, functionally deterministic.
Surface: "We need 99% accuracy, so let's improve the model."
Deep: "With 99% per-step accuracy, we're at about 37% for 100 steps, since (0.99)^100 ≈ 0.366. The exponential dominates everything."
Surface: "Let's use GPT-4 for reliability."
Deep: "First, can we decompose this to atomic actions? If yes, 5× GPT-3.5 with consensus will likely outperform 1× GPT-4 and cost less."
Surface: "We have a multi-agent system with 5 agents working together."
Deep: "Are they executing different tasks collaboratively or identical tasks redundantly? The mathematics only hold for redundant execution with consensus."
Surface: "The vote was 3-2, so majority wins."
Deep: "A 3-2 vote signals genuine uncertainty. We should scale to n=7 for additional verification on this specific action."
Surface: "Each agent has 5% error rate."
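Deep: "With n=5 consensus and independent errors, our system error is on the order of (0.05)^3. Counting the ways three or more of five voters can be wrong, it comes to about 0.12%. That's the number that matters for workflow reliability."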
Surface: "This task is too complex for the model."
Deep: "Is this task atomic enough that we can verify correctness? If not, how do we decompose it until verification becomes feasible?"
Surface: "Consensus voting solves everything."
Deep: "Consensus requires deterministic atomic tasks. For creative tasks with legitimate multiple valid outputs, we need different approaches. This is a scalpel, not a hammer."
Surface: "Agents need to be independent."
Deep: "The framework achieves Six Sigma even with error correlation ρ ≤ 0.99. Perfect independence isn't required—we need to stay below that threshold."
Error Compounding:
P(workflow_success) = (1 - p)^m
Consensus Reliability:
P(system_error) = O(p^⌈n/2⌉)
Six Sigma Target:
3.4 DPMO = 0.00034% error rate = 99.99966% reliability
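The unit conversions behind the Six Sigma target are simple enough to check directly. The helper names below are illustrative.

```python
def dpmo_to_error_rate(dpmo: float) -> float:
    """Defects per million opportunities -> error rate as a fraction."""
    return dpmo / 1_000_000


def error_rate_to_dpmo(rate: float) -> float:
    """Error rate as a fraction -> defects per million opportunities."""
    return rate * 1_000_000


six_sigma = dpmo_to_error_rate(3.4)
print(f"error rate:  {six_sigma * 100:.5f}%")        # 0.00034%
print(f"reliability: {(1 - six_sigma) * 100:.5f}%")  # 99.99966%
print(f"5% error = {error_rate_to_dpmo(0.05):,.0f} DPMO")
```

A single agent with a 5% error rate sits at 50,000 DPMO, so reaching 3.4 DPMO is an improvement of roughly 14,700×.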
Cost-Reliability Trade-off:
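5 agents @ 5% error > 1 agent @ 1% error (about 0.12% vs. 1% system error, assuming independent errors). The 14,700× figure matches the ratio of a 5% single-agent error rate to the 3.4 DPMO target: 0.05 / 0.0000034 ≈ 14,700.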
This framework connects to:
For implementation details, mathematical proofs, and empirical validation, load the appropriate reference file from the table above.