skills/adaptive-confidence-gating-multi-agent/SKILL.md
Multi-agent code generation using structured debate with adaptive confidence gating. Three specialized agents (User/Product, Technical, QA) debate before generating code, with a 95% confidence threshold to skip debate on simple tasks and a reviewer-guided debugging loop for post-generation refinement. Use when: 'generate code with multi-agent debate', 'use DebateCoder approach', 'code generation with confidence gating', 'multi-agent collaborative coding', 'debate-driven code synthesis', 'adaptive confidence code generation'.
npx skillsauth add ndpvt-web/arxiv-claude-skills adaptive-confidence-gating-multi-agent
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill implements the DebateCoder framework for code generation: a three-agent collaborative protocol where a User Agent (product owner), Technical Agent (architect), and QA Agent (quality engineer) engage in structured pre-generation debate to produce high-quality code. An Adaptive Confidence Gating mechanism with a 95% threshold lets simple problems bypass debate entirely, while complex problems get up to 3 rounds of deliberation. After code generation, a reviewer-guided debugging loop analyzes failures and directs targeted fixes rather than blind rewrites. This approach improves code quality by 11+ percentage points over direct generation while reducing unnecessary computation on easy tasks by ~35%.
Pre-Generation Debate with Confidence Gating: Before writing any code, simulate three agent perspectives on the problem. Each agent independently produces a candidate plan and a confidence score (0-100) reflecting perceived task solvability. Compute the collective confidence as the mean: Gamma = (1/|A|) * sum(c_i). If Gamma >= 95, the problem is simple enough to skip debate and proceed directly to code synthesis. Otherwise, run up to 3 rounds of deliberation where each agent reviews the others' plans, identifies weaknesses, and refines its own approach. This gating mechanism avoids wasting reasoning effort on straightforward problems like "return the sum of a list" while investing deeply in problems with tricky edge cases or algorithmic complexity.
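The gating rule above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names `collective_confidence` and `should_skip_debate` are hypothetical, and the range check on scores is an added assumption.

```python
from statistics import mean

CONFIDENCE_THRESHOLD = 95  # the 95% gate described in the text


def collective_confidence(scores):
    """Gamma = (1/|A|) * sum(c_i): the mean of the per-agent scores."""
    if not scores:
        raise ValueError("need at least one confidence score")
    for score in scores:
        # Assumption: scores outside 0-100 are treated as invalid input.
        if not 0 <= score <= 100:
            raise ValueError(f"confidence out of range: {score}")
    return mean(scores)


def should_skip_debate(scores, threshold=CONFIDENCE_THRESHOLD):
    """True when the agents are collectively confident enough to skip debate."""
    return collective_confidence(scores) >= threshold


if __name__ == "__main__":
    print(should_skip_debate([98, 99, 96]))  # high scores: gate opens
    print(should_skip_debate([70, 75, 65]))  # low scores: debate needed
```

Because the gate uses the mean, one very unsure agent can be outvoted by two confident ones. Whether that is desirable depends on the task, and a stricter variant could gate on the minimum score instead.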
Orthogonal Pre/Post Refinement: The framework separates concerns into two phases. Pre-generation debate resolves design disagreements (algorithm choice, edge case strategy, data structure selection) before any code is written. Post-generation refinement uses a reviewer-guided debugging loop: when code fails tests, a "Code Reviewer" role analyzes the failure triplet (problem prompt + code + error/test log) to produce a root cause analysis and fix plan—critically, it does NOT rewrite the code itself. A separate "Debugging Agent" then applies the targeted fix. This separation prevents the common failure pattern where a model blindly rewrites code and introduces new bugs.
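One way to picture the reviewer/debugger separation is a loop in which the reviewer only ever returns a diagnosis and the debugger only ever edits code. The sketch below is an assumption-laden illustration: `FailureTriplet`, `FixPlan`, `debug_loop`, and the `max_iters=3` limit are hypothetical names and values, and the three callables stand in for model calls and a test harness that the text does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class FailureTriplet:
    problem: str    # original problem prompt
    code: str       # code that failed
    error_log: str  # error message or failing test output


@dataclass(frozen=True)
class FixPlan:
    root_cause: str  # why the code failed
    fix_plan: str    # what to change; contains no rewritten code


# Hypothetical stand-ins for the two roles and the test harness.
Reviewer = Callable[[FailureTriplet], FixPlan]
Debugger = Callable[[str, FixPlan], str]
TestRunner = Callable[[str], Optional[str]]  # error log, or None if all pass


def debug_loop(
    problem: str,
    code: str,
    run_tests: TestRunner,
    review: Reviewer,
    apply_fix: Debugger,
    max_iters: int = 3,  # assumption: the text gives no iteration limit
) -> Tuple[str, bool]:
    """Return (final_code, passed). The reviewer never edits code directly."""
    for _ in range(max_iters):
        error_log = run_tests(code)
        if error_log is None:
            return code, True
        plan = review(FailureTriplet(problem, code, error_log))
        code = apply_fix(code, plan)
    return code, run_tests(code) is None
```

Keeping `FixPlan` free of code makes the separation explicit in the types: the only path from a failure to new code goes through `apply_fix`.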
Role Specialization: The three agents have distinct evaluation priorities. The User Agent focuses on functionality completeness and usability (does the code do what was asked?). The Technical Agent prioritizes technical feasibility and performance efficiency (is the algorithm correct and fast?). The QA Agent emphasizes robustness and reliability (what about edge cases, invalid inputs, boundary conditions?). This structured disagreement surfaces problems that a single-perspective approach misses.
1. Parse the problem statement. Extract the function signature, input/output types, constraints, and any provided test cases. Identify ambiguities or underspecified requirements.
2. Generate three independent plans with confidence scores. For each agent role (User, Technical, QA), produce a candidate approach and rate your confidence (0-100).
3. Compute collective confidence and gate. Average the three scores. If the mean >= 95, the problem is straightforward; skip to step 6 (direct synthesis). Otherwise, proceed to deliberation.
4. Run multi-turn deliberation (up to 3 rounds). In each round, each agent reviews the others' plans, identifies weaknesses, and refines its own approach.
5. Resolve remaining disagreements. If agents still disagree after 3 rounds, prioritize by: correctness (Technical Agent) > robustness (QA Agent) > usability (User Agent). Choose the approach that satisfies the most critical concerns.
6. Synthesize the final plan into code. Merge the terminal plans, resolving any contradictions by keeping the most robust and functionally complete components. Write the code implementing the agreed-upon approach.
7. Validate against test cases. Run the generated code against all available test cases (provided examples and inferred edge cases).
8. If tests fail, invoke the reviewer-guided debugging loop. The Code Reviewer analyzes the failure triplet (problem prompt, code, error/test log) and produces a root cause analysis and a specific fix plan (e.g., "<= not < because the range is inclusive"). Do NOT rewrite the code at this stage; a separate Debugging Agent applies the targeted fix.
9. Return the final code with a brief summary of the debate outcome: which perspectives influenced the final solution and what edge cases were caught during deliberation.
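The deliberation and tie-break steps of the workflow above can be sketched as follows. This is a simplified illustration under stated assumptions: `Plan`, `deliberate`, `resolve`, and the `refine` callable are hypothetical; only refinement rounds are counted toward the 3-round limit; and `resolve` reduces "most critical concerns" to picking the highest-priority role's approach.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Tuple

THRESHOLD = 95   # confidence gate from the text
MAX_ROUNDS = 3   # deliberation limit from the text
# Priority order from the text: correctness > robustness > usability.
PRIORITY = ("technical", "qa", "user")


@dataclass
class Plan:
    approach: str
    confidence: float  # 0-100


# Hypothetical stand-in for a model call: (role, own plan, all plans) -> new plan.
Refine = Callable[[str, Plan, Dict[str, Plan]], Plan]


def deliberate(plans: Dict[str, Plan], refine: Refine) -> Tuple[Dict[str, Plan], int]:
    """Refine plans until mean confidence reaches THRESHOLD or rounds run out."""
    rounds = 0
    while rounds < MAX_ROUNDS and mean([p.confidence for p in plans.values()]) < THRESHOLD:
        snapshot = dict(plans)  # every agent sees the same prior-round plans
        plans = {role: refine(role, plan, snapshot) for role, plan in snapshot.items()}
        rounds += 1
    return plans, rounds


def resolve(plans: Dict[str, Plan]) -> str:
    """Pick one approach; on disagreement, defer to the highest-priority role."""
    approaches = {p.approach for p in plans.values()}
    if len(approaches) == 1:
        return approaches.pop()
    for role in PRIORITY:
        if role in plans:
            return plans[role].approach
    raise ValueError("no plans to resolve")
```

When the initial mean is already at or above the threshold, `deliberate` returns immediately with zero rounds, which corresponds to the skip in step 3.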
Example 1: Simple problem (confidence gating triggers early exit)
User: Write a function that returns the reverse of a string.
Agent Plans:
- User Agent (confidence: 98): Use slicing s[::-1]. Handles all string types.
- Technical Agent (confidence: 99): s[::-1] is O(n) time, O(n) space. Optimal.
- QA Agent (confidence: 96): Edge cases: empty string returns "", single char returns itself. Slicing handles both.
Collective confidence: 97.7 >= 95 → SKIP DEBATE
Synthesized output:
def reverse_string(s: str) -> str:
    return s[::-1]
Example 2: Complex problem (full debate cycle)
User: Write a function that finds the longest substring without repeating characters.
Round 1 Plans:
- User Agent (confidence: 70): Brute force check all substrings. Simple but complete.
- Technical Agent (confidence: 75): Sliding window with a set. O(n) time.
- QA Agent (confidence: 65): What about empty strings? Unicode? Must track window boundaries carefully.
Collective confidence: 70 < 95 → ENTER DEBATE
Round 2:
- User Agent: "Technical Agent's sliding window is better than brute force. Adopting it.
But the window shrink logic needs care—must move left pointer past the duplicate, not just by one."
- Technical Agent: "QA Agent raises valid point about empty input. Adding explicit empty check.
Using a dict to store last-seen index for O(1) left-pointer jumps."
- QA Agent: "Technical Agent's dict approach handles the shrink correctly.
Adding test cases: empty string, all-same chars ('aaaa'), full-unique ('abcd'), duplicate at end."
Collective confidence: 92 → continue to Round 3
Round 3:
- All agents converge on sliding window with dict. Confidence: 96 → STOP
Synthesized output:
def length_of_longest_substring(s: str) -> int:
    seen = {}
    start = 0
    max_len = 0
    for end, char in enumerate(s):
        if char in seen and seen[char] >= start:
            start = seen[char] + 1
        seen[char] = end
        max_len = max(max_len, end - start + 1)
    return max_len
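The test cases the QA Agent lists in Round 2 can be turned into a quick check. The sketch below repeats the synthesized function so that it runs on its own. The concrete inputs "abca" (duplicate at end) and "abba" (a repeat that falls before the window start) are illustrative choices, and `CASES` and `failing_cases` are hypothetical names.

```python
def length_of_longest_substring(s: str) -> int:
    seen = {}  # last index at which each character was seen
    start = 0  # left edge of the current window
    max_len = 0
    for end, char in enumerate(s):
        # Jump the left edge only if the repeat lies inside the window.
        if char in seen and seen[char] >= start:
            start = seen[char] + 1
        seen[char] = end
        max_len = max(max_len, end - start + 1)
    return max_len


# (input, expected length) pairs covering the QA Agent's categories.
CASES = [
    ("", 0),      # empty string
    ("aaaa", 1),  # all-same characters
    ("abcd", 4),  # fully unique
    ("abca", 3),  # duplicate at end
    ("abba", 2),  # stale entry in `seen` must not move `start` backwards
]


def failing_cases():
    """Return (input, actual, expected) for every case that does not match."""
    results = []
    for text, expected in CASES:
        actual = length_of_longest_substring(text)
        if actual != expected:
            results.append((text, actual, expected))
    return results


if __name__ == "__main__":
    failures = failing_cases()
    print("all cases pass" if not failures else failures)
```

The "abba" case exercises the `seen[char] >= start` guard: without it, the final "a" would pull the window start back to index 1 and overcount.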
Example 3: Debugging loop after test failure
User: Write a function to merge two sorted lists into one sorted list.
[After debate, code generated:]
def merge_sorted(a, b):
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    return result
Test failure: merge_sorted([1,3,5], [2,4,6]) returns [1,2,3,4,5] — missing 6.
Reviewer analysis:
Root cause: After the while loop exits, remaining elements in the non-exhausted
list are never appended. The loop exits when either pointer reaches the end,
but leftover elements are discarded.
Fix plan: After the while loop, append a[i:] and b[j:] to result.
Debugger applies fix:
def merge_sorted(a, b):
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:  # also fixed: use <= for stability
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    result.extend(a[i:])
    result.extend(b[j:])
    return result
Re-test: All cases pass.
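The re-test step could look like the sketch below, which repeats the fixed function so that it runs standalone. Only the first input pair comes from the failure report above; the other cases and the helper name `check_merge` are assumptions added for illustration.

```python
def merge_sorted(a, b):
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:  # <= takes ties from `a` first, keeping the merge stable
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    # Append whatever remains in the list that was not exhausted.
    result.extend(a[i:])
    result.extend(b[j:])
    return result


# Input pairs; the expected output is the sorted concatenation.
CASES = [
    ([1, 3, 5], [2, 4, 6]),  # the originally failing case
    ([], []),                # both empty
    ([1, 2], []),            # one side empty
    ([], [3]),
    ([1, 2, 2], [2, 3]),     # duplicates across lists
    ([1, 2, 3], [10, 20]),   # no interleaving
]


def check_merge():
    """Return the input pairs for which merge_sorted disagrees with sorted()."""
    return [(a, b) for a, b in CASES if merge_sorted(a, b) != sorted(a + b)]


if __name__ == "__main__":
    bad = check_merge()
    print("all cases pass" if not bad else bad)
```

Comparing against `sorted(a + b)` gives an independent reference result, so the expected outputs do not have to be written by hand.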
return a + b is wasteful. For truly trivial one-liners, skip this framework entirely.
Adaptive Confidence Gating in Multi-Agent Collaboration for Efficient and Optimized Code Generation, Zhang et al. (2026). Look for: Section 3 (DebateCoder framework architecture), the confidence formula Gamma = mean(c_i) with tau=95 threshold, Table 1 (Pass@1 results showing 70.12% on HumanEval), and Table 3 (ablation study showing +4.87% from debate alone, +7.92% from the debugging loop).