skills/discovering-process-outcome-credit-multi-step/SKILL.md
Apply Step-wise Marginal Information Gain (MIG) credit assignment to multi-step reasoning tasks. Evaluates each reasoning step by its marginal contribution toward the correct answer rather than by position or final outcome alone. Use this skill when asked to: "evaluate my chain-of-thought", "score each reasoning step", "find where my logic goes wrong", "credit assign my solution steps", "debug my multi-step reasoning", "identify which steps actually matter in this derivation".
npx skillsauth add ndpvt-web/arxiv-claude-skills discovering-process-outcome-credit-multi-step
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to evaluate, score, and improve multi-step reasoning chains by applying the Marginal Information Gain (MIG) framework from Wang et al. (2026). Instead of judging reasoning only by its final answer (outcome credit) or by step position (naive process credit), this approach measures each step's intrinsic semantic contribution — how much closer it moves toward the correct solution compared to the best progress achieved so far. This produces precise, position-independent credit assignment that identifies pivotal breakthroughs, flags redundant steps, and pinpoints where reasoning derails.
Traditional evaluation gives binary feedback: the final answer is right or wrong. This discards all signal about which intermediate steps drove correctness or failure. Position-based heuristics (weighting later steps higher) also fail — empirically they correlate poorly (Spearman rho=0.254) with true step value because pivotal breakthroughs often happen early.
The MIG framework instead computes a step-conditioned likelihood for each step: how much does the reasoning prefix through this step increase confidence in the correct answer? Crucially, each step is compared against a Monotonic Historical Watermark, the highest confidence achieved by any prior step. Only steps that exceed this watermark receive positive credit. This means that a step which merely restates or repeats earlier progress earns zero credit, while a genuine breakthrough earns credit wherever it occurs in the chain.
The framework separates two concerns: (1) process quality — are the reasoning steps individually sound and progressive? and (2) outcome correctness — does the final answer match ground truth? These are evaluated with independent masks. This decoupling prevents a correct final answer from masking flawed intermediate reasoning, and prevents a wrong final answer from discrediting individually valid reasoning steps. A third gate activates only when both structural validity and answer correctness hold, identifying gold-standard reasoning chains suitable for direct reuse.
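The decoupling of the two gates can be sketched in a few lines of Python. This is an illustrative sketch only: the `GateResult` class, its field names, and the diagnosis strings are hypothetical and do not come from the paper. It assumes the two gate values have already been judged separately.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    """Hypothetical container for the two independent gate outcomes."""
    process_ok: bool   # are the steps well-formed and logically connected?
    outcome_ok: bool   # does the final answer match ground truth?

    @property
    def gold(self) -> bool:
        # Third gate: opens only when both independent gates pass.
        return self.process_ok and self.outcome_ok

    def diagnosis(self) -> str:
        # Each gate is reported on its own, so a correct answer cannot
        # mask a flawed process, and a wrong answer cannot discredit
        # a sound one.
        if self.gold:
            return "gold-standard chain: suitable for direct reuse"
        if self.process_ok:
            return "sound process, wrong outcome"
        if self.outcome_ok:
            return "correct outcome, flawed process"
        return "flawed process and wrong outcome"


print(GateResult(process_ok=True, outcome_ok=False).diagnosis())
```

Keeping the gates as two separate booleans, rather than one combined score, is what lets a report say which side failed.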
1. Parse the reasoning chain into discrete steps. Segment the solution at natural boundaries: sentence breaks, numbered items, paragraph breaks, or explicit markers like <step> tags. Each segment becomes an independently evaluable unit.
2. Identify the target outcome. Determine the correct final answer or desired conclusion. If the user provides it, use it directly. If not, solve the problem independently first to establish ground truth.
3. Compute step-conditioned confidence for each step. For each step k, evaluate: "Given steps 1 through k, how likely is the correct answer?" Rate this on a 0-1 scale based on how much of the necessary information, logic, and computation has been established so far.
4. Maintain the Historical Watermark. Track the running maximum of step-conditioned confidence. Initialize at 0. After evaluating each step, update: watermark_k = max(watermark_{k-1}, confidence_k).
5. Compute Marginal Information Gain per step. For each step: MIG_k = max(0, confidence_k - watermark_{k-1}). Steps that exceed the prior watermark receive positive credit; steps that do not exceed it receive zero. Steps where confidence drops below the prior watermark are flagged as potentially harmful.
6. Flag regression points. Any step where confidence_k < watermark_{k-1} indicates a reasoning regression: the step introduced error, confusion, or irrelevant content. Mark these explicitly for the user.
7. Apply the dual-gate assessment. Check two independent conditions: (a) Is the reasoning chain structurally valid (well-formed steps, logical connectives, no gaps)? (b) Is the final answer correct? Report the conjunction: chains passing both gates are high-quality exemplars; chains failing one reveal whether the problem lies in process or outcome.
8. Produce the annotated credit report. For each step, output: the step text, its MIG score, whether it is a breakthrough, neutral, or a regression, and a brief explanation of why.
9. Recommend improvements. Based on the credit distribution, suggest: removing zero-credit steps (redundant), repairing regression steps (erroneous), and potentially reordering for clarity (though the method is position-agnostic for scoring).
10. Verify with equivalence awareness. If the problem admits multiple valid answer forms (e.g., "1.6" vs "8/5" vs "1 3/5"), ensure confidence scoring accepts all semantically equivalent forms to avoid false negatives.
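The watermark, MIG, and regression rules above, together with the equivalence check, can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation. The function names (`parse_number`, `answers_equivalent`, `mig_report`) are hypothetical, the per-step confidences are assumed to have been estimated already, and the equivalence helper only handles plain numeric forms.

```python
from fractions import Fraction


def parse_number(text: str) -> Fraction:
    """Parse '1.6', '8/5', or a mixed number like '1 3/5' exactly.

    Hypothetical helper: covers numeric answer forms only.
    """
    parts = text.strip().split()
    if len(parts) == 2:  # mixed number: whole part plus fraction
        whole, frac = Fraction(parts[0]), Fraction(parts[1])
        return whole + frac if whole >= 0 else whole - frac
    return Fraction(parts[0])


def answers_equivalent(a: str, b: str) -> bool:
    """True when two answer strings denote the same number."""
    return parse_number(a) == parse_number(b)


def mig_report(confidences):
    """Apply the watermark rule to a list of step confidences in [0, 1]."""
    watermark = 0.0  # initialized at 0
    report = []
    for k, conf in enumerate(confidences, start=1):
        # MIG_k = max(0, confidence_k - watermark_{k-1})
        mig = max(0.0, conf - watermark)
        if conf < watermark:
            label = "regression"    # flagged, but scored as zero credit
        elif mig > 0:
            label = "breakthrough"
        else:
            label = "neutral"
        # watermark_k = max(watermark_{k-1}, confidence_k)
        watermark = max(watermark, conf)
        report.append({"step": k, "mig": round(mig, 2),
                       "watermark": watermark, "label": label})
    return report


# Illustrative confidences only; step 4 falls below the running maximum.
for row in mig_report([0.10, 0.30, 0.55, 0.35, 0.70]):
    print(row)
```

Because the watermark never decreases, the fourth step in this illustrative run is labeled a regression with zero credit, and the fifth step is credited only for the amount by which it exceeds the earlier maximum of 0.55.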
Example 1: Debugging a Math Derivation
User: "Score each step of my solution to: Solve 3x + 7 = 22"
Step 1: "We need to solve for x in 3x + 7 = 22"
Step 2: "First, note that 3x + 7 is a linear equation"
Step 3: "Subtract 7 from both sides: 3x = 15"
Step 4: "Recall that division is the inverse of multiplication"
Step 5: "Divide both sides by 3: x = 5"
Approach:
Credit Report:
Step 1: MIG = 0.05 | Neutral — Restates the problem, minimal new progress
Watermark: 0.05
Step 2: MIG = 0.00 | Redundant — Classifies equation type, does not advance solution
Watermark: 0.05
Step 3: MIG = 0.55 | BREAKTHROUGH — Key algebraic move, isolates 3x
Watermark: 0.60
Step 4: MIG = 0.00 | Redundant — States a general fact without applying it
Watermark: 0.60
Step 5: MIG = 0.40 | Breakthrough — Completes the solution, reaches x = 5
Watermark: 1.00
Dual-gate: Structure PASS | Outcome PASS (x = 5 is correct)
Summary: Steps 3 and 5 carry 95% of the credit. Steps 2 and 4 can be
removed without loss. The reasoning is correct but inefficient.
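The 95% figure in the summary follows directly from the watermark rule. As a worked check, write $c_k$ for the confidence after step $k$ and $w_k$ for the watermark. These symbols are introduced here for brevity, and the confidence values are the ones implied by the watermarks in the report:

$$
\begin{aligned}
\mathrm{MIG}_3 &= \max(0,\; c_3 - w_2) = \max(0,\; 0.60 - 0.05) = 0.55 \\
\mathrm{MIG}_5 &= \max(0,\; c_5 - w_4) = \max(0,\; 1.00 - 0.60) = 0.40 \\
\frac{\mathrm{MIG}_3 + \mathrm{MIG}_5}{\sum_k \mathrm{MIG}_k} &= \frac{0.55 + 0.40}{0.05 + 0 + 0.55 + 0 + 0.40} = \frac{0.95}{1.00} = 95\%
\end{aligned}
$$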
Example 2: Finding Where Code Debugging Logic Derails
User: "I'm debugging why my binary search returns -1 for target=7 in [1,3,5,7,9]. Walk through my reasoning and find the flaw."
Step 1: "Initialize low=0, high=4"
Step 2: "mid = (0+4)//2 = 2, arr[2]=5, 5 < 7, so search right: low=3"
Step 3: "mid = (3+4)//2 = 3, arr[3]=7, found! Return 3"
Step 4: "Wait, my code returns -1, so maybe the array is 0-indexed wrong"
Step 5: "Let me assume 1-indexed: arr[1]=1, arr[2]=3, arr[3]=5, arr[4]=7, arr[5]=9"
Step 6: "With 1-indexing: mid=2, arr[2]=3, 3<7, low=3, mid=3, arr[3]=5, 5<7, low=4, mid=4, arr[4]=7, found"
Step 7: "So indexing isn't the problem. The bug must be in my comparison operator."
Credit Report:
Step 1: MIG = 0.10 | Neutral — Correct setup
Watermark: 0.10
Step 2: MIG = 0.20 | Progress — Correct trace of first iteration
Watermark: 0.30
Step 3: MIG = 0.25 | Progress — Correctly shows algorithm should find target
Watermark: 0.55
Step 4: MIG = 0.00 | REGRESSION flag — Introduces a false hypothesis (indexing
error) when step 3 already proved the algorithm logic works.
Confidence drops to 0.35.
Watermark: 0.55 (unchanged)
Step 5: MIG = 0.00 | Redundant — Pursues the false hypothesis from step 4
Watermark: 0.55
Step 6: MIG = 0.00 | Redundant — Confirms 1-indexing also works, which was
already evident
Watermark: 0.55
Step 7: MIG = 0.15 | Partial recovery — Correctly discards the false hypothesis
but jumps to another guess without evidence
Watermark: 0.70
Dual-gate: Structure PASS | Outcome FAIL (root cause not identified)
Recommendation: After step 3 established the algorithm is correct in
theory, the user should examine their actual code (comparison operator,
loop bounds, return statement) rather than hypothesizing about indexing.
Steps 4-6 are a dead-end detour.
Example 3: Comparing Two Solution Paths
User: "Which proof that sqrt(2) is irrational is more efficient — proof A or proof B?"
Approach:
Output summary:
Proof A: 6 steps, 4 breakthroughs, 0 regressions, credit density = 0.78
Proof B: 9 steps, 4 breakthroughs, 1 regression, credit density = 0.49
Proof A is more efficient — same number of key insights delivered in
fewer steps with no regressions. Proof B's steps 5 and 7 are redundant
restatements, and step 8 introduces an unnecessary case split that is
immediately resolved.
MIG is clipped at max(0, ...): regressions are flagged separately but scored as zero credit, not negative credit.
Wang, X., Wang, W., Chen, K., Nimalsiri, N., & Halgamuge, S. (2026). Discovering Process-Outcome Credit in Multi-Step LLM Reasoning. arXiv:2602.01034v1. https://arxiv.org/abs/2602.01034v1
Key insight to look for: The Monotonic Historical Watermark mechanism and the empirical finding that content-aware step valuation (rho=0.623) dramatically outperforms position-based heuristics (rho=0.254) for identifying reasoning breakthroughs.