skills/corefine-confidence-guided-self-refinement-adaptiv/SKILL.md
Confidence-guided self-refinement for adaptive reasoning. Implements the CoRefine pattern: assess confidence in each reasoning step, then decide whether to halt, re-examine, or try a different approach -- reducing wasted compute while maintaining accuracy. Use when: 'solve this step by step with self-correction', 'refine your reasoning until confident', 'adaptively debug this problem', 'use confidence-guided refinement', 'self-correct with backtracking', 'try different approaches if stuck'.
npx skillsauth add ndpvt-web/arxiv-claude-skills corefine-confidence-guided-self-refinement-adaptiv
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to apply the CoRefine self-refinement protocol from [Jin et al., 2026] when tackling complex reasoning, debugging, or multi-step problem-solving tasks. Instead of generating many candidate solutions in parallel or blindly iterating, Claude monitors its own confidence at each reasoning step and uses that signal to decide: halt (accept the answer), re-examine (revisit the current approach with fresh scrutiny), or pivot (abandon the current path and try a fundamentally different strategy). This produces higher-quality answers with far fewer wasted tokens, mirroring the paper's ~190x token reduction over brute-force parallel sampling.
Confidence as a control signal, not a correctness guarantee. The core insight of CoRefine is that tracking how confident you are across a chain of reasoning steps reveals when you are likely wrong -- even without ground truth. A sudden drop in confidence mid-derivation, hedging language, or reliance on unverified assumptions are all observable signals that the current path may be failing. Rather than completing a shaky chain and hoping for the best, CoRefine uses these signals to trigger corrective actions in real time.
Three-action controller. At each reasoning checkpoint, exactly one action is chosen: (1) Halt -- confidence is high and stable across recent steps; accept the current answer. (2) Re-examine -- confidence dipped on the most recent step but the overall approach seems sound; re-derive or verify that specific step before continuing. (3) Pivot -- confidence has been declining across multiple steps or a fundamental assumption looks wrong; abandon the current strategy and try a qualitatively different approach. The paper's Conv1D controller learns these decision boundaries from confidence trajectories; in our LLM-native adaptation, Claude applies the same logic by explicitly tracking and reasoning about its confidence at each checkpoint.
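The decision logic described above can be sketched as a small function over a confidence trajectory. This is an illustrative sketch only, not the paper's Conv1D controller: the names `Confidence`, `Action`, and `choose_action` are hypothetical, and the threshold of two consecutive drops standing in for "declining across multiple steps" is an assumption.

```python
from enum import Enum, IntEnum


class Confidence(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


class Action(Enum):
    HALT = "halt"
    RE_EXAMINE = "re-examine"
    PIVOT = "pivot"


def choose_action(trajectory):
    """Pick exactly one action from a list of Confidence values (oldest first)."""
    if not trajectory:
        raise ValueError("need at least one checkpoint")

    # Count how many consecutive drops the trajectory ends with.
    trailing_drops = 0
    for i in range(len(trajectory) - 1, 0, -1):
        if trajectory[i] < trajectory[i - 1]:
            trailing_drops += 1
        else:
            break

    # Assumption: two or more consecutive drops means "declining across steps".
    if trailing_drops >= 2:
        return Action.PIVOT
    # The latest step is shaky but there is no sustained decline.
    if trajectory[-1] is not Confidence.HIGH:
        return Action.RE_EXAMINE
    # The latest step is HIGH and the trajectory is not in decline.
    return Action.HALT
```

With these assumptions, a trajectory ending in HIGH, MEDIUM, LOW maps to a pivot, a single dip to MEDIUM maps to a re-examination, and a trajectory that recovers to HIGH maps to a halt.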
CoRefine-Tree for exploration vs. exploitation. When a problem admits multiple valid solution strategies, the refinement loop can branch: maintain 2-3 live approaches in parallel (exploration), but invest deeper refinement only in the most promising branch (exploitation). This hybrid avoids both tunnel vision on a single bad path and the waste of fully parallel sampling.
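A minimal sketch of the explore-then-exploit pattern, under stated assumptions: `corefine_tree`, `sketches`, and `refine` are hypothetical names, and the numeric "promise" score returned by each sketch is an illustrative stand-in for whatever qualitative judgment ranks the branches.

```python
def corefine_tree(sketches, refine, max_branches=3):
    """Explore candidate strategies cheaply, then refine the best one first.

    sketches: dict mapping strategy name -> zero-argument callable that
              returns a rough promise score (higher is better; hypothetical).
    refine:   callable taking a strategy name and returning an answer,
              or None if full refinement of that branch failed.
    Returns (answer_or_None, list_of_branches_refined_in_order).
    """
    # Exploration: score every candidate once, keep the top few branches live.
    scores = {name: sketch() for name, sketch in sketches.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)[:max_branches]

    # Exploitation: invest full refinement in the most promising branch,
    # falling back to the next branch only if the current one fails.
    tried = []
    for name in ranked:
        tried.append(name)
        answer = refine(name)
        if answer is not None:
            return answer, tried
    return None, tried
```

Only the top-ranked branch pays the full refinement cost in the common case; the remaining branches cost one cheap sketch each.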
Parse the problem and identify checkpoints. Break the task into discrete reasoning steps or subtasks. Each step becomes a confidence checkpoint. For code, this means: understand requirements, design approach, implement each function, verify correctness. For math, this means: each derivation step.
Generate the first reasoning step and assess confidence. Produce the initial step. Then explicitly evaluate: "How confident am I in this step? What assumptions am I making? Is anything uncertain?" Assign a qualitative confidence level: HIGH (no doubts, well-understood domain), MEDIUM (plausible but unverified), or LOW (guessing, unfamiliar territory, multiple valid alternatives).
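Apply the three-action decision rule. If confidence is high and stable, accept the step; if it dipped on the most recent step but the overall approach still seems sound, re-examine that step; if it has been declining across multiple steps or a fundamental assumption looks wrong, pivot.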
Track the confidence trajectory. Maintain a running record of confidence levels across steps. A pattern like [HIGH, HIGH, MEDIUM, LOW] is a clear pivot signal. A pattern like [HIGH, HIGH, HIGH, MEDIUM, HIGH] after re-examination indicates successful recovery.
On re-examine: verify the specific step. Do not repeat the entire chain. Isolate the uncertain step. Test it with a concrete input, check it against a known identity, or derive it via an independent method. If it holds up, upgrade confidence and continue. If it fails, fix it or pivot.
On pivot: try a qualitatively different approach. Do not make minor tweaks to the same strategy. Change the algorithm, the data structure, the mathematical technique, or the interpretation of the problem. Reference the failed approach to avoid repeating the same mistake.
Halt when confidence is HIGH across all steps. Once every checkpoint in the chain reads HIGH, present the final answer. State the key verification that gave you confidence.
Cap refinement at 4 iterations. If you have pivoted 3 times without reaching high confidence, present the best available answer with an explicit statement of remaining uncertainty. The paper averages 2.7 refinement steps; exceeding 4 suggests the problem may need external input.
For multi-strategy problems, use CoRefine-Tree. When the problem clearly admits 2-3 valid approaches (e.g., recursive vs. iterative, SQL vs. pandas, analytical vs. numerical), briefly sketch each approach (exploration), then invest full refinement in the most promising one (exploitation). Fall back to the second approach only if the first fails after re-examination.
Summarize the refinement trace. After halting, briefly report: how many refinement steps were taken, what was re-examined or pivoted, and why the final answer is trustworthy. This gives the user transparency into the reasoning process.
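The steps above can be sketched as a bounded loop. This is a simplified illustration, not the paper's implementation: `refine_with_cap`, `attempt`, `verify`, and `Trace` are hypothetical names, each strategy is collapsed into a single attempt, and a failed re-examination is treated as a pivot trigger (the "fix it" option is omitted for brevity).

```python
from dataclasses import dataclass, field

RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}


@dataclass
class Trace:
    confidences: list = field(default_factory=list)
    re_examinations: int = 0
    pivots: int = 0
    confident: bool = False

    def summary(self):
        status = "halted at HIGH confidence" if self.confident else "uncertainty remains"
        return (f"{self.re_examinations} re-examination(s), "
                f"{self.pivots} pivot(s); {status}")


def refine_with_cap(attempt, verify, max_pivots=3):
    """attempt(i) -> (answer, "HIGH" | "MEDIUM" | "LOW") for the i-th strategy.
    verify(answer) -> bool, an independent check used when re-examining."""
    trace = Trace()
    best_answer, best_rank = None, -1
    for strategy in range(max_pivots + 1):  # initial try plus up to 3 pivots
        answer, confidence = attempt(strategy)
        if confidence == "MEDIUM":
            # Re-examine: check only the uncertain result, not the whole chain.
            trace.re_examinations += 1
            confidence = "HIGH" if verify(answer) else "LOW"
        trace.confidences.append(confidence)
        if RANK[confidence] > best_rank:
            best_answer, best_rank = answer, RANK[confidence]
        if confidence == "HIGH":
            trace.confident = True
            break
        if strategy < max_pivots:
            trace.pivots += 1  # move to a qualitatively different strategy
    return best_answer, trace
```

If the cap is reached without a HIGH result, the function returns the best answer seen and `trace.confident` stays `False`, which corresponds to presenting the answer with an explicit statement of remaining uncertainty.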
Example 1: Debugging a subtle off-by-one error
User: "This function is supposed to return the k-th largest element but it's returning wrong results for some inputs. Fix it."
def kth_largest(nums, k):
    nums.sort()
    return nums[len(nums) - k + 1]
Approach:
Trace a concrete input: nums = [1,2,3,4,5], k=1 (largest). len(nums) - k + 1 = 5. Index 5 is out of bounds. Confidence: LOW -- the index formula is wrong.
Re-examine: the index should be len(nums) - k, which gives index 4 = value 5. Verify: k=2 gives index 3 = value 4. Confidence: HIGH.
Output:
def kth_largest(nums, k):
    nums.sort()
    return nums[len(nums) - k]  # Fixed: removed erroneous +1
Refinement trace: 1 re-examination on index math. Verified with k=1 and k=2 on [1,2,3,4,5].
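One way to carry out the "derive it via an independent method" style of re-examination for this fix is a brute-force cross-check on small inputs. The sketch below repeats the fixed function so it runs standalone; `kth_largest_reference` and `cross_check` are hypothetical helper names introduced for illustration.

```python
import itertools


def kth_largest(nums, k):
    nums.sort()
    return nums[len(nums) - k]  # the fixed index formula


def kth_largest_reference(nums, k):
    # Independent derivation: sort descending and count from the front.
    return sorted(nums, reverse=True)[k - 1]


def cross_check(max_len=5, values=(0, 1, 2)):
    """Return the first (nums, k) where the two methods disagree, else None."""
    for n in range(1, max_len + 1):
        for combo in itertools.product(values, repeat=n):
            for k in range(1, n + 1):
                got = kth_largest(list(combo), k)
                want = kth_largest_reference(list(combo), k)
                if got != want:
                    return list(combo), k
    return None
```

Drawing values from a small set forces duplicates, which exercises ties as well as the boundary cases k=1 and k=len(nums).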
Example 2: Solving an algorithmic problem with a pivot
User: "Given a list of intervals, merge all overlapping intervals. Optimize for clarity and correctness."
Approach:
Sort the intervals by start, interval[0]. Initialize the result with the first interval. For each subsequent interval, if it overlaps with the last merged interval (start <= last_end), extend; otherwise append. Confidence: HIGH.
Output:
def merge_intervals(intervals):
    if not intervals:
        return []
    intervals.sort(key=lambda x: x[0])
    merged = [intervals[0]]
    for start, end in intervals[1:]:
        if start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged
Refinement trace: 1 pivot from brute-force to sort-then-sweep for clarity. Edge cases verified.
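The version above sorts the caller's list in place and, because the first interval is reused as the first merged entry, can also widen that interval inside the caller's data. Where that matters, a non-mutating variant might look like the following sketch; `merge_intervals_copy` is a hypothetical name, and the checks listed after it are examples of edge cases one could verify, not the checks the original trace refers to.

```python
def merge_intervals_copy(intervals):
    """Merge overlapping intervals without modifying the input.

    Accepts any iterable of (start, end) pairs, including tuples.
    """
    ordered = sorted(intervals, key=lambda iv: iv[0])  # new list, input untouched
    merged = []
    for start, end in ordered:
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the last merged interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # Start a fresh interval; a new list avoids aliasing the input.
            merged.append([start, end])
    return merged
```

Cases worth checking include the empty list, intervals that only touch at an endpoint such as [1,4] and [4,5], an interval fully contained in another, and unsorted input.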
Example 3: CoRefine-Tree for a design decision
User: "Implement a rate limiter that allows 100 requests per minute per user."
Approach (Tree -- explore 2 strategies):
Exploit Branch B, the token bucket (more robust for production use, handles bursts gracefully):
Output:
import time


class RateLimiter:
    def __init__(self, capacity=100, refill_rate=100/60):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.buckets = {}  # user_id -> (tokens, last_check)

    def allow(self, user_id):
        now = time.monotonic()
        tokens, last_check = self.buckets.get(user_id, (self.capacity, now))
        elapsed = now - last_check
        tokens = min(self.capacity, tokens + elapsed * self.refill_rate)
        if tokens >= 1:
            self.buckets[user_id] = (tokens - 1, now)
            return True
        self.buckets[user_id] = (tokens, now)
        return False
Refinement trace: CoRefine-Tree explored sliding window vs. token bucket, invested in token bucket. 1 re-examination on thread safety (documented, not over-engineered). 0 pivots.
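The trace mentions thread safety. If the limiter is shared across threads, one possible approach is to guard the read-modify-write in `allow` with a lock. The sketch below is a hypothetical variant, not part of the original output: the class name `LockedRateLimiter` and the injectable `clock` parameter (added so the behavior can be tested deterministically) are assumptions.

```python
import threading
import time


class LockedRateLimiter:
    def __init__(self, capacity=100, refill_rate=100 / 60, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self._clock = clock             # injectable for tests (assumption)
        self._buckets = {}              # user_id -> (tokens, last_check)
        self._lock = threading.Lock()

    def allow(self, user_id):
        # Hold the lock for the whole read-refill-spend-write sequence so two
        # threads cannot both spend the same token.
        with self._lock:
            now = self._clock()
            tokens, last_check = self._buckets.get(user_id, (self.capacity, now))
            tokens = min(self.capacity, tokens + (now - last_check) * self.refill_rate)
            allowed = tokens >= 1
            self._buckets[user_id] = (tokens - 1 if allowed else tokens, now)
            return allowed
```

A single global lock keeps the sketch short; per-user locks would reduce contention at the cost of more bookkeeping. Note also that a token bucket starting full can admit more than `capacity` requests within one minute, because tokens keep refilling while the initial burst is spent; if a strict per-window cap is required, the sliding-window branch may be the better fit.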
CoRefine: Confidence-Guided Self-Refinement for Adaptive Test-Time Compute -- Jin, Tanno, Diethe, Teare (2026). Read Section 3 for the three-action controller design and Section 4 for CoRefine-Tree's hybrid exploration-exploitation strategy.