skills/enhancing-mathematical-problem-solving/SKILL.md
Solve mathematical problems using IIPC (Iteratively Improved Program Construction) -- a dual-branch approach that combines iterative code execution with independent chain-of-thought reasoning, then merges both for a verified answer. Trigger phrases: "solve this math problem with code verification", "math reasoning with execution feedback", "verify this calculation programmatically", "solve and check with code", "iterative math solving", "math problem with dual reasoning"
npx skillsauth add ndpvt-web/arxiv-claude-skills enhancing-mathematical-problem-solving
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to solve mathematical problems using Iteratively Improved Program Construction (IIPC), a technique from Basarkar et al. (2026). Instead of generating a single answer or a single code solution, IIPC runs two parallel reasoning branches -- an executable program branch that iteratively refines code with real execution feedback, and an independent chain-of-thought branch that reasons without seeing program output. The two branches merge only at the final step, preventing over-reliance on either approach and catching errors that one branch alone would miss.
Dual-branch architecture with execution-driven refinement. IIPC treats generated programs not as disposable scripts but as representations of the model's reasoning chain. The program branch generates executable code, runs it, inspects the output, and iteratively corrects errors using a failure memory that prevents repeating mistakes. Simultaneously, an independent chain-of-thought branch reasons through the problem in natural language without seeing any program output. This independence is critical -- it prevents the "program bias" where an LLM anchors on a possibly-wrong code result and rationalizes it instead of catching the error.
Failure memory and iterative refinement. Each iteration that produces an error or a failed validation generates a failure descriptor stored in memory M_t. On the next iteration, the model sees these descriptors alongside the problem, preventing it from regenerating the same flawed approach. This transforms blind trial-and-error into informed trial-and-improvement. The system allows up to two validation checks and two error corrections per validation, creating a bounded but thorough refinement loop.
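A minimal sketch of how such a failure memory might be held in code. The class name `FailureMemory`, its methods, and the prompt wording are assumptions made for illustration; the source only specifies that descriptors are stored and shown to the model on the next iteration.

```python
from dataclasses import dataclass, field


@dataclass
class FailureMemory:
    """Hypothetical store for failure descriptors (M_t in the text)."""

    descriptors: list[str] = field(default_factory=list)

    def record(self, descriptor: str) -> None:
        # Skip exact duplicates so the context does not grow with repeats.
        if descriptor not in self.descriptors:
            self.descriptors.append(descriptor)

    def as_context(self, problem: str) -> str:
        # Render the problem together with every recorded failure.
        if not self.descriptors:
            return problem
        notes = "\n".join(f"- {d}" for d in self.descriptors)
        return f"{problem}\n\nPrevious failed attempts (do not repeat):\n{notes}"


memory = FailureMemory()
memory.record("Floating point arithmetic produced approximate result.")
memory.record("Floating point arithmetic produced approximate result.")
print(len(memory.descriptors))
print(memory.as_context("Find P(derangement) for n = 10."))
```

Keeping descriptors as short plain strings makes them cheap to include in every later prompt.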
Structured integration at the final step. Only after both branches complete does a structured integration prompt combine their outputs. The model weighs evidence from the deterministic execution branch against the independent reasoning branch, resolving conflicts and producing a final verified answer. This late fusion is what gives IIPC its edge over methods that mix code and reasoning throughout.
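The source describes integration as a structured prompt, so the weighing of evidence is done by the model. The sketch below covers only a mechanical pre-check that could precede that prompt: deciding whether the two branch answers are exactly equal even when they are written differently. The function name `integrate` and the returned dictionary shape are hypothetical.

```python
from fractions import Fraction


def integrate(program_answer: str, reasoning_answer: str) -> dict:
    """Compare two branch answers exactly (hypothetical helper)."""
    try:
        p = Fraction(program_answer)
        r = Fraction(reasoning_answer)
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers fall back to a plain string comparison.
        agree = program_answer.strip() == reasoning_answer.strip()
        return {"agree": agree, "answer": program_answer.strip() if agree else None}
    return {"agree": p == r, "answer": str(p) if p == r else None}


print(integrate("4.0", "4"))
print(integrate("1334961/3628800", "16481/44800"))
print(integrate("286", "285"))
```

A disagreement (`agree` is `False`) is the signal to trace both branches rather than to pick one answer automatically.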
Extract propositions from the problem. Parse the math problem into its core components: given quantities, constraints, what is being asked, and the mathematical domain (algebra, geometry, number theory, combinatorics, probability, etc.).
Generate an executable program (Branch 1). Write Python code that computes the answer. Use sympy for symbolic math, itertools for combinatorics, numpy for numerical work. The code must print a clear final answer.
Execute the program and capture output. Run the code, capturing both stdout and any errors (syntax errors, runtime exceptions, wrong types).
Validate or correct based on execution result.
Update failure memory. Append each failure descriptor to the memory store. On subsequent iterations, include these descriptors in context so the same mistakes are not repeated.
Produce independent chain-of-thought reasoning (Branch 2). Solve the same problem using pure mathematical reasoning -- no reference to any program output. Work through the problem step by step with symbolic manipulation, logical deduction, and mathematical identities.
Integrate both branches. Compare the program's computed answer with the chain-of-thought answer:
Format and present the final answer. State the answer clearly, show the verified computation, and briefly explain the reasoning path that confirmed it.
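The execute-and-capture step above could be sketched as follows, using only the standard library. The helper name `run_program`, the 10-second default timeout, and the result dictionary are assumptions; this runs code in a child process but is not a security sandbox.

```python
import subprocess
import sys


def run_program(source: str, timeout_s: float = 10.0) -> dict:
    """Run Python source in a child process and capture its output."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "error": f"timeout after {timeout_s}s"}
    if proc.returncode != 0:
        # Keep only the last traceback line as a compact failure descriptor.
        lines = proc.stderr.strip().splitlines()
        error = lines[-1] if lines else "unknown error"
        return {"ok": False, "stdout": proc.stdout, "error": error}
    return {"ok": True, "stdout": proc.stdout.strip(), "error": None}


good = run_program("print(sum(range(5)))")
bad = run_program("print(1 / 0)")
print(good)
print(bad["error"])
```

The `error` field is a natural candidate for the failure descriptor appended to memory before the next iteration.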
Example 1: Competition-style number theory
User: How many positive integers n < 1000 satisfy n^2 + n + 1 being divisible by 7?
Approach:
Branch 1 -- Program:
```python
count = 0
for n in range(1, 1000):
    if (n**2 + n + 1) % 7 == 0:
        count += 1
print(count)  # Output: 286
```
Execution succeeds. Output: 286. Validation: the answer is a positive integer below 999, which is plausible.
Branch 2 -- Chain-of-thought: n^2 + n + 1 ≡ 0 (mod 7). Solving the quadratic mod 7: discriminant = 1 - 4 = -3 ≡ 4 (mod 7), whose square roots mod 7 are 2 and 5. So n ≡ (-1 + 2)/2 = 1/2 or n ≡ (-1 + 5)/2 = 2 (mod 7). Since 2*4 = 8 ≡ 1 (mod 7), 1/2 ≡ 4 (mod 7). So n ≡ 2 or n ≡ 4 (mod 7). Among 1..999: 999 = 7*142 + 5, so the 142 complete blocks of seven contribute 142*2 = 284 solutions, and the remaining integers 995..999 (residues 1 through 5) contribute 2 more: 284 + 2 = 286.
Integration: Both branches produce 286. Answer: 286.
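As a further check on this example, the residue argument can be turned into a closed-form count and compared with the brute-force loop. The helper `count_by_residue` is a hypothetical name, and it assumes every residue is given in the range 0..modulus-1.

```python
def count_by_residue(limit: int, modulus: int, residues: set[int]) -> int:
    """Count n in 1..limit with n % modulus in residues, without looping over n."""
    full_blocks, remainder = divmod(limit, modulus)
    # Each complete block of `modulus` consecutive integers hits every residue once.
    total = full_blocks * len(residues)
    # The leftover integers have residues 1..remainder.
    total += sum(1 for r in residues if 1 <= r <= remainder)
    return total


brute = sum(1 for n in range(1, 1000) if (n * n + n + 1) % 7 == 0)
closed = count_by_residue(999, 7, {2, 4})
print(brute, closed)  # 286 286
```

Here 999 = 7*142 + 5, so the tail covers residues 1 through 5 and both 2 and 4 are counted once more.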
Example 2: Geometry with iterative correction
User: A triangle has sides 13, 14, 15. Find the radius of its inscribed circle.
Approach:
Branch 1 -- Program (iteration 1):
```python
import math

a, b, c = 13, 14, 15
s = (a + b + c) / 2  # semi-perimeter = 21
area = math.sqrt(s * (s - a) * (s - b) * (s - c))  # Heron's formula
r = area / s
print(r)  # Output: 4.0
```
Execution succeeds. Output: 4.0. Validation: positive, less than shortest side, plausible.
Branch 2 -- Chain-of-thought: s = (13+14+15)/2 = 21. Area by Heron's: sqrt(21 * 8 * 7 * 6) = sqrt(7056) = 84. Inradius r = Area/s = 84/21 = 4.
Integration: Both branches yield r = 4. Answer: 4.
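For integer-sided triangles with rational area, the same computation can be done without floating point. This is a sketch under that assumption; `inradius_exact` is a hypothetical helper, and it expects side lengths that form a valid triangle.

```python
from fractions import Fraction
from math import isqrt


def inradius_exact(a: int, b: int, c: int) -> Fraction:
    """Inradius r = Area / s for an integer-sided triangle with rational area."""
    # 16 * Area^2 = (a+b+c)(-a+b+c)(a-b+c)(a+b-c) avoids the half-integer s.
    sixteen_area_sq = (a + b + c) * (-a + b + c) * (a - b + c) * (a + b - c)
    root = isqrt(sixteen_area_sq)
    if root * root != sixteen_area_sq:
        raise ValueError("area is irrational; use a symbolic library instead")
    area = Fraction(root, 4)
    return area / Fraction(a + b + c, 2)


print(inradius_exact(13, 14, 15))  # 4
```

For sides 13, 14, 15 the product is 112896 = 336^2, so the area is 336/4 = 84 and r = 84/21 = 4, matching both branches.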
Example 3: Probability with failure recovery
User: In a random permutation of {1,...,10}, what is the probability that no element is in its original position (a derangement)?
Approach:
Branch 1 -- Program (iteration 1):
```python
import math

n = 10
# Derangement count D(n) = n! * sum_{k=0}^{n} (-1)^k / k!
D = 0
for k in range(n + 1):
    D += (-1)**k / math.factorial(k)
D *= math.factorial(n)
print(D / math.factorial(n))
```
Output: 0.3678794642857143. Validation: this is an approximation due to floating point. The exact answer should be a fraction.
Failure descriptor: "Floating point arithmetic produced approximate result. Need exact integer arithmetic for D(n), then express as fraction."
Branch 1 -- Program (iteration 2):
```python
from fractions import Fraction
from math import factorial

n = 10
# D(n) = sum_{k=0}^{n} (-1)^k * n! / k!; each term is an exact integer because k! divides n!
derangements = sum((-1)**k * (factorial(n) // factorial(k)) for k in range(n + 1))
prob = Fraction(derangements, factorial(n))  # Fraction reduces to lowest terms
print(prob)  # Output: 16481/44800
```
Output: 16481/44800, the reduced form of 1334961/3628800 (Fraction divides out the common factor 81). Validation: 16481/44800 ~ 0.3679, close to 1/e and consistent with the known value D(10) = 1334961.
Branch 2 -- Chain-of-thought: D(10) = 10! * (1 - 1 + 1/2! - 1/3! + ... + 1/10!) = 1334961. P = 1334961 / 3628800 = 16481/44800.
Integration: Both agree. Answer: 16481/44800 (equivalently 1334961/3628800).
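An independent way to confirm the derangement count is the standard recurrence D(n) = (n - 1)(D(n - 1) + D(n - 2)), which avoids the alternating sum entirely. The function name below is a hypothetical choice for this sketch.

```python
from fractions import Fraction
from math import factorial


def derangements(n: int) -> int:
    """D(n) via the recurrence D(n) = (n - 1) * (D(n - 1) + D(n - 2))."""
    if n == 0:
        return 1
    prev, curr = 1, 0  # D(0), D(1)
    for m in range(2, n + 1):
        prev, curr = curr, (m - 1) * (curr + prev)
    return curr


d10 = derangements(10)
prob = Fraction(d10, factorial(10))
print(d10, prob)  # 1334961 16481/44800
```

Because `Fraction` always reduces to lowest terms, the printed probability is 16481/44800 rather than the unreduced 1334961/3628800.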
Use exact arithmetic (fractions.Fraction, sympy.Rational, integer arithmetic) for problems requiring exact answers. Floating point is a common source of subtle errors.

| Error Type | Detection | Recovery |
|---|---|---|
| Syntax/runtime error in code | Execution fails with traceback | Record error, fix specific issue, re-execute |
| Incorrect output (wrong type/range) | Validation against constraints | Add failure descriptor, regenerate with corrected logic |
| Branch disagreement | Integration step finds mismatch | Trace both branches step-by-step to find the error; re-run the faulty branch |
| Timeout on large computation | Execution exceeds time limit | Optimize algorithm (reduce brute-force range, use mathematical shortcuts) |
| Problem ambiguity | Both branches produce different valid interpretations | Present both interpretations and answers, ask user to clarify |
When failure memory accumulates 3+ descriptors for the same subproblem, abandon the current approach and try a fundamentally different mathematical formulation.
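One way to apply this rule mechanically is to tag each descriptor with the subproblem it belongs to and count per tag. The (subproblem, descriptor) pairing and the names below are assumptions of this sketch; the source does not specify how subproblems are identified.

```python
from collections import Counter

SWITCH_THRESHOLD = 3  # "3+ descriptors for the same subproblem"


def subproblems_to_rethink(failures: list[tuple[str, str]]) -> list[str]:
    """Return subproblems whose failure count has reached the threshold."""
    counts = Counter(subproblem for subproblem, _ in failures)
    return sorted(s for s, c in counts.items() if c >= SWITCH_THRESHOLD)


failures = [
    ("count derangements", "float rounding"),
    ("count derangements", "off-by-one in range"),
    ("count derangements", "integer overflow assumption"),
    ("format answer", "printed decimal instead of fraction"),
]
print(subproblems_to_rethink(failures))  # ['count derangements']
```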
Basarkar, A., Tabarsi, B., Barnes, T., & Xu, D. (2026). Enhancing Mathematical Problem Solving in LLMs through Execution-Driven Reasoning Augmentation. arXiv:2602.03950v2. https://arxiv.org/abs/2602.03950v2
Key insight from the paper: the dual-branch architecture with late fusion outperforms single-pipeline approaches (PoT, CoT, ReAct) because execution feedback provides deterministic verification while independent reasoning prevents program bias -- look for the ablation study showing that removing either branch degrades performance.