skills/skillxiv-v0.0.2-claude-opus-4.6/cort-code-reasoning/SKILL.md
Enhance reasoning models by integrating executable code within thinking traces, enabling grounded computation verification and reducing hallucination in mathematical and logical reasoning.
Install this skill globally with one command: `npx skillsauth add ADu2021/skillXiv cort-code-reasoning`. Works with Claude Code, Cursor, and Windsurf.
CoRT augments a reasoning model's thinking process by embedding executable code directly into its reasoning traces. Instead of relying on purely symbolic reasoning, which is prone to arithmetic errors and logical fallacies, the model interleaves reasoning steps with executable code blocks whose outputs are checked as soon as they are produced. This grounds abstract reasoning in concrete computation, reducing hallucination and improving accuracy on mathematical, logical, and programming tasks. It also enables self-correction: when a code execution result contradicts an assumption made in the reasoning, the model can revise that reasoning.
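As a rough illustration of the idea described above, the following minimal sketch takes a reasoning trace, runs each embedded Python block in a fresh interpreter, and writes the real output back into the trace next to the block. The helper names (`run_block`, `annotate_trace`) and the `[output]` marker are assumptions made for this example, not part of the skill itself, and a subprocess with a timeout is not a sandbox.

```python
import re
import subprocess
import sys

# The fence string is built at runtime so this example nests cleanly in Markdown.
FENCE = "`" * 3
BLOCK_RE = re.compile(FENCE + r"python\n(.*?)\n" + FENCE, re.DOTALL)


def run_block(code: str, timeout: float = 5.0) -> str:
    """Run one snippet in a fresh interpreter; return stdout, or stderr on failure."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.stdout if proc.returncode == 0 else proc.stderr


def annotate_trace(trace: str) -> str:
    """Insert each block's actual output directly after the block it came from."""
    def _append_output(match: re.Match) -> str:
        output = run_block(match.group(1)).strip()
        return f"{match.group(0)}\n[output] {output}"

    return BLOCK_RE.sub(_append_output, trace)


# A trace whose prose guesses wrong; the executed code supplies the real value.
trace = (
    "<think>\n"
    "17 * 23 should be 381, but let me verify.\n"
    f"{FENCE}python\nprint(17 * 23)\n{FENCE}\n"
    "</think>"
)
print(annotate_trace(trace))
```

The annotated trace now carries the computed value (391) next to the mistaken claim (381), which is the kind of mismatch a correction step can act on.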
```python
import os
import re
import subprocess
import sys
import tempfile
import textwrap
from typing import Dict, List

import torch.nn as nn


class CodeIntegratedThinkingModule(nn.Module):
    """
    Augments reasoning traces with executable code blocks.
    Interleaves natural language reasoning with Python code for verification.
    """

    def __init__(self, base_model, execution_timeout=5):
        super().__init__()
        self.base_model = base_model
        self.execution_timeout = execution_timeout
        self.execution_history = []

    def generate_with_code_reasoning(self, question: str, max_thinking_tokens: int = 8000):
        """
        Generate reasoning trace with embedded code blocks.

        Format:
            <think>
            Natural language reasoning...
            ```python
            # Code block
            code here
            ```
            More reasoning...
            </think>
        """
        # Dedent the template so the prompt is flush-left, then fill in the
        # question; `{{result}}` renders as a literal `{result}`.
        thinking_prompt = textwrap.dedent("""
            Solve this problem with integrated reasoning and code verification.
            Use this format:
            <think>
            Explain your approach.
            ```python
            # Code to verify computations
            result = ...
            print(f"Result: {{result}}")
            ```
            Interpret the code output and continue reasoning...
            </think>
            Problem: {question}
        """).format(question=question)

        # Generate thinking trace
        thinking_trace = self._generate_thinking(
            thinking_prompt,
            max_tokens=max_thinking_tokens
        )

        # Extract and execute code blocks
        code_blocks = self._extract_code_blocks(thinking_trace)
        execution_results = []
        for code in code_blocks:
            result = self._execute_code_block(code)
            execution_results.append({
                'code': code,
                'output': result['output'],
                'error': result['error'],
                'success': result['success']
            })
            self.execution_history.append(result)

        return {
            'thinking_trace': thinking_trace,
            'code_blocks': code_blocks,
            'execution_results': execution_results
        }

    def _generate_thinking(self, prompt: str, max_tokens: int) -> str:
        """Generate extended thinking trace."""
        # Placeholder: would call actual model
        return "<think>\nReasoning process...\n</think>"

    def _extract_code_blocks(self, thinking_trace: str) -> List[str]:
        """Extract Python code blocks from thinking trace."""
        pattern = r"```python\n(.*?)\n```"
        return re.findall(pattern, thinking_trace, re.DOTALL)

    def _execute_code_block(self, code: str) -> Dict:
        """
        Execute a Python code block in a subprocess with a timeout.
        Returns output, error text, return code, and success flag.
        Note: a subprocess with a timeout is not a sandbox.
        """
        try:
            # Write code to temporary file
            with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
                f.write(code)
                temp_path = f.name
            try:
                # Execute with timeout, using the current interpreter rather
                # than whatever `python` happens to be on PATH
                result = subprocess.run(
                    [sys.executable, temp_path],
                    capture_output=True,
                    text=True,
                    timeout=self.execution_timeout
                )
                return {
                    'output': result.stdout,
                    'error': result.stderr,
                    'return_code': result.returncode,
                    'success': result.returncode == 0
                }
            finally:
                # Clean up temporary file
                if os.path.exists(temp_path):
                    os.remove(temp_path)
        except subprocess.TimeoutExpired:
            # Same keys as the normal path so callers can rely on the schema
            return {
                'output': '',
                'error': 'Execution timeout exceeded',
                'return_code': None,
                'success': False
            }
        except Exception as e:
            return {
                'output': '',
                'error': str(e),
                'return_code': None,
                'success': False
            }
```
```python
from typing import Dict, List, Optional


class SelfCorrectionMechanism:
    """
    Detects contradictions between reasoning and code execution results.
    Triggers re-reasoning when assumptions prove incorrect.
    """

    def __init__(self, model, max_correction_rounds=3):
        self.model = model
        self.max_correction_rounds = max_correction_rounds

    def identify_contradictions(self, reasoning_trace: str,
                                execution_results: List[Dict]) -> List[Dict]:
        """
        Identify where reasoning contradicts code execution results.
        Returns list of contradiction locations.
        """
        contradictions = []
        for i, exec_result in enumerate(execution_results):
            if not exec_result['success']:
                contradictions.append({
                    'block_index': i,
                    'type': 'execution_error',
                    'error': exec_result['error'],
                    'severity': 'high'
                })
                continue

            # Parse output for claims
            output = exec_result['output']

            # Check if output contradicts preceding reasoning
            if self._contradicts_reasoning(reasoning_trace, output):
                contradictions.append({
                    'block_index': i,
                    'type': 'logical_contradiction',
                    'output': output,
                    'severity': 'medium'
                })
        return contradictions

    def trigger_correction(self, original_question: str,
                           initial_reasoning: str,
                           contradictions: List[Dict]) -> Dict:
        """
        When contradictions are detected, regenerate reasoning with guidance.
        """
        if not contradictions:
            return {'corrected': False}

        # Build correction prompt highlighting issues
        correction_prompt = (
            "\nPrevious reasoning attempt:\n"
            f"{initial_reasoning}\n"
            "Issues identified:\n"
        )
        for contradiction in contradictions[:3]:  # Focus on the first 3
            if contradiction['type'] == 'execution_error':
                correction_prompt += f"\n- Code error: {contradiction['error']}"
            else:
                correction_prompt += (
                    f"\n- Output contradicts reasoning: {contradiction['output'][:100]}"
                )
        correction_prompt += (
            "\nPlease revise your reasoning, being more careful about:\n"
            "1. Arithmetic and logical operations\n"
            "2. Variable definitions and scope\n"
            "3. Proper code syntax\n"
            f"Original problem: {original_question}\n"
        )

        # Generate corrected reasoning (assumes the model exposes `generate`)
        corrected_trace = self.model.generate(correction_prompt, max_tokens=5000)
        return {
            'corrected': True,
            'original_trace': initial_reasoning,
            'corrected_trace': corrected_trace,
            'contradictions_addressed': len(contradictions)
        }

    def _contradicts_reasoning(self, reasoning: str, code_output: str) -> bool:
        """Check if code output contradicts stated reasoning."""
        # Placeholder: would use semantic similarity or rule-based checking
        return False

    def iterative_correction(self, question: str, max_rounds: Optional[int] = None) -> Dict:
        """
        Iteratively correct reasoning until no contradictions remain
        or max_rounds is reached. Defaults to self.max_correction_rounds.
        """
        if max_rounds is None:
            max_rounds = self.max_correction_rounds

        current_reasoning = None
        execution_results = []
        contradictions = []  # defined up front so max_rounds == 0 cannot raise NameError
        correction_count = 0

        for round_num in range(max_rounds):
            # Generate reasoning with code
            if current_reasoning is None:
                # Initial generation
                result = self._generate_initial(question)
            else:
                # Correction round
                result = self._generate_correction(question, current_reasoning)

            current_reasoning = result['thinking']
            execution_results = result['execution_results']

            # Check for contradictions
            contradictions = self.identify_contradictions(
                current_reasoning,
                execution_results
            )
            if not contradictions:
                # No contradictions; converged
                break
            correction_count += 1

        return {
            'final_reasoning': current_reasoning,
            'execution_results': execution_results,
            'correction_rounds': correction_count,
            # Converged only if at least one round ran and it ended clean
            'converged': current_reasoning is not None and not contradictions
        }

    def _generate_initial(self, question: str) -> Dict:
        """Generate initial reasoning (placeholder)."""
        return {'thinking': '', 'execution_results': []}

    def _generate_correction(self, question: str, previous: str) -> Dict:
        """Generate correction based on previous attempt (placeholder)."""
        return {'thinking': '', 'execution_results': []}
```
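The `_contradicts_reasoning` method above is a placeholder that always returns `False`. One very simple rule-based way to fill it in, sketched here under strong assumptions, is to compare the last number stated in a claim with the last number printed by the code. The function names, the number pattern, and the tolerance are all hypothetical choices for illustration; real traces would likely need a more careful pairing of claims with outputs.

```python
import re
from typing import Optional

# Matches integers and simple decimals; a hyphen directly before a digit is
# read as a minus sign, which can misfire on text like "step-2".
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def last_number(text: str) -> Optional[float]:
    """Return the last number mentioned in the text, or None if there is none."""
    matches = NUMBER_RE.findall(text)
    return float(matches[-1]) if matches else None


def contradicts(claim: str, code_output: str, tol: float = 1e-9) -> bool:
    """True when both sides state a number and the two numbers disagree."""
    claimed = last_number(claim)
    computed = last_number(code_output)
    if claimed is None or computed is None:
        return False  # nothing comparable, so no contradiction is reported
    return abs(claimed - computed) > tol


print(contradicts("So the product is 381.", "Result: 391\n"))  # True
print(contradicts("So the product is 391.", "Result: 391\n"))  # False
```

Returning `False` when either side has no number keeps the check conservative: it only flags a `logical_contradiction` when there is an explicit numeric mismatch.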
```python
import ast
from typing import Dict


class CodeValidator:
    """
    Validates code blocks before execution for common errors.
    Catches logical issues and improves error messages.
    """

    def __init__(self):
        self.allowed_builtins = {
            'print', 'len', 'range', 'sum', 'min', 'max',
            'int', 'float', 'str', 'list', 'dict', 'set',
            'abs', 'round', 'sorted', 'enumerate', 'zip'
        }
        self.forbidden_imports = {'os', 'sys', 'subprocess', '__main__'}

    def validate_code_block(self, code: str) -> Dict:
        """
        Comprehensive validation before execution.
        Returns validation status and identified issues.
        """
        issues = []

        # Check 1: Parse validity
        try:
            tree = ast.parse(code)
        except SyntaxError as e:
            issues.append({
                'type': 'syntax_error',
                'message': str(e),
                'severity': 'critical'
            })
            return {
                'valid': False,
                'issues': issues,
                'safe_to_execute': False
            }

        # Check 2: Forbidden imports (compare the top-level package so that
        # `import os.path` is caught as well as `import os`)
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    if alias.name.split('.')[0] in self.forbidden_imports:
                        issues.append({
                            'type': 'forbidden_import',
                            'module': alias.name,
                            'severity': 'high'
                        })
            elif isinstance(node, ast.ImportFrom):
                # node.module is None for relative imports such as `from . import x`
                if (node.module or '').split('.')[0] in self.forbidden_imports:
                    issues.append({
                        'type': 'forbidden_import',
                        'module': node.module,
                        'severity': 'high'
                    })

        # Check 3: Infinite loops (heuristic)
        for node in ast.walk(tree):
            if isinstance(node, ast.While):
                # Flag `while True` only when no `break` appears anywhere inside it
                is_while_true = isinstance(node.test, ast.Constant) and node.test.value is True
                has_break = any(isinstance(n, ast.Break) for n in ast.walk(node))
                if is_while_true and not has_break:
                    issues.append({
                        'type': 'infinite_loop',
                        'message': 'Detected while True without break',
                        'severity': 'high'
                    })

        # Check 4: Undefined variables (simple check: only plain assignments
        # count as definitions, so loop variables and imports may be flagged)
        defined_vars = set()
        used_vars = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.Assign):
                for target in node.targets:
                    if isinstance(target, ast.Name):
                        defined_vars.add(target.id)
            elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                used_vars.add(node.id)

        undefined = used_vars - defined_vars - set(self.allowed_builtins)
        for var in undefined:
            issues.append({
                'type': 'undefined_variable',
                'variable': var,
                'severity': 'medium'
            })

        safe = len([i for i in issues if i['severity'] in ['high', 'critical']]) == 0
        return {
            'valid': True,
            'issues': issues,
            'safe_to_execute': safe,
            'defined_variables': defined_vars,
            'used_variables': used_vars
        }
```
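The undefined-variable check in the validator above treats only plain assignments as definitions, so a loop variable in `for i in range(3)` or an imported module name is reported as undefined. The sketch below shows one possible broader heuristic that also counts loop targets, function and class names, arguments, imports, and exception names as bound. The function name `undefined_names` is hypothetical, and the check ignores ordering and scope, so it will not catch a name used before its definition.

```python
import ast
import builtins


def undefined_names(code: str) -> set:
    """Names that are read somewhere but never bound anywhere in the snippet."""
    tree = ast.parse(code)
    bound = set(dir(builtins))  # every builtin counts as already defined
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                used.add(node.id)
            else:
                # Store/Del contexts cover assignments, loop targets,
                # `with ... as x`, and comprehension variables
                bound.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # `import a.b` binds `a`; `import a.b as c` binds `c`
                bound.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.ExceptHandler) and node.name:
            bound.add(node.name)
    return used - bound


print(undefined_names("for i in range(3):\n    print(i)\n"))    # set()
print(undefined_names("import math\nprint(math.sqrt(x))\n"))   # {'x'}
```

Because the result is only a heuristic, keeping its findings at `medium` severity, as the validator does, avoids blocking code that is actually fine.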
```python
from typing import Dict


class CodeVerifiedReasoning:
    """
    High-level orchestration of code-integrated reasoning.
    Manages generation, validation, execution, and correction.
    """

    def __init__(self, model):
        self.model = model
        self.thinking_module = CodeIntegratedThinkingModule(model)
        self.correction_mechanism = SelfCorrectionMechanism(model)
        self.validator = CodeValidator()

    def reason(self, question: str, max_attempts: int = 3) -> Dict:
        """
        Complete reasoning pipeline with code verification.
        """
        attempt = 0
        current_result = None

        while attempt < max_attempts:
            # Step 1: Generate reasoning with code
            # NOTE: generate_with_code_reasoning already executes the extracted
            # blocks, so the validation below gates acceptance of the result
            # rather than preventing execution.
            current_result = self.thinking_module.generate_with_code_reasoning(question)

            # Step 2: Validate code blocks
            validation_status = {}
            for i, code in enumerate(current_result['code_blocks']):
                validation = self.validator.validate_code_block(code)
                validation_status[i] = validation
                if not validation['safe_to_execute']:
                    # Unsafe code; need correction
                    attempt += 1
                    break
            else:
                # for/else: runs only when no block was rejected.
                # All code blocks valid; check for logical contradictions
                contradictions = self.correction_mechanism.identify_contradictions(
                    current_result['thinking_trace'],
                    current_result['execution_results']
                )
                if not contradictions:
                    # Success: no errors, no contradictions
                    return {
                        'success': True,
                        'result': current_result,
                        'attempts': attempt + 1
                    }
                attempt += 1

        return {
            'success': False,
            'result': current_result,
            'attempts': max_attempts
        }
```
Code-Reasoning Integration:
Code Execution Safety:
Correction Strategy:
Performance Improvements:
When to Use CoRT: