skills/skillxiv-v0.0.2-claude-opus-4.6/dissecting-tool-integrated-reasoning/SKILL.md
Evaluate and optimize tool-integrated reasoning in LLMs through empirical benchmarking, performance-cost metrics (PAC, AUC-PCC), and measurement frameworks for diverse reasoning tasks.
npx skillsauth add ADu2021/skillXiv dissecting-tool-integrated-reasoningInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Tool-Integrated Reasoning (TIR) enables language models to offload computational tasks to external tools (calculators, code interpreters, etc.) rather than solving everything through token generation. This approach reduces "overthinking" and produces more streamlined reasoning traces while improving accuracy. The key innovation is measuring TIR effectiveness beyond raw accuracy through efficiency metrics like Performance-At-Cost (PAC) and Area-Under-Curve-PCC (AUC-PCC), enabling systematic evaluation across diverse reasoning domains.
Create comprehensive benchmark covering diverse reasoning types:
from enum import Enum
from dataclasses import dataclass
class ReasoningCategory(Enum):
ARITHMETIC = "arithmetic"
ALGEBRA = "algebra"
GEOMETRY = "geometry"
LOGIC = "logic"
COUNTING = "counting"
KNOWLEDGE = "knowledge"
SYMBOLIC = "symbolic"
COMMONSENSE = "commonsense"
MULTI_STEP = "multi_step"
@dataclass
class BenchmarkTask:
task_id: str
category: ReasoningCategory
question: str
expected_answer: str
tools_required: list[str] # e.g., ["calculator", "code_interpreter"]
requires_reasoning: bool
expected_tool_calls: int
def create_reasonzoo_benchmark() -> list[BenchmarkTask]:
"""
Construct comprehensive benchmark across nine reasoning categories.
"""
tasks = []
# Arithmetic examples
tasks.append(BenchmarkTask(
task_id="arith_001",
category=ReasoningCategory.ARITHMETIC,
question="What is 47 * 13 + 892?",
expected_answer="1503",
tools_required=["calculator"],
requires_reasoning=False,
expected_tool_calls=1
))
# Multi-step reasoning
tasks.append(BenchmarkTask(
task_id="multi_001",
category=ReasoningCategory.MULTI_STEP,
question="If a store has 50 items and sells 12 on Monday and 18 on Tuesday, how many remain?",
expected_answer="20",
tools_required=["calculator"],
requires_reasoning=True,
expected_tool_calls=2
))
return tasks
Create abstraction for external tool binding:
from abc import ABC, abstractmethod
from typing import Any, Dict
class ExternalTool(ABC):
@abstractmethod
def execute(self, input_str: str) -> str:
pass
class Calculator(ExternalTool):
def execute(self, expression: str) -> str:
"""Evaluate mathematical expressions safely."""
try:
result = eval(expression, {"__builtins__": {}}, {})
return str(result)
except Exception as e:
return f"Error: {str(e)}"
class CodeInterpreter(ExternalTool):
def execute(self, code: str) -> str:
"""Execute Python code safely in sandbox."""
import subprocess
try:
result = subprocess.run(
["python", "-c", code],
capture_output=True,
text=True,
timeout=5
)
return result.stdout or result.stderr
except Exception as e:
return f"Error: {str(e)}"
class ToolRouter:
def __init__(self):
self.tools = {
"calculator": Calculator(),
"code_interpreter": CodeInterpreter(),
"search": SearchTool(),
}
def route_and_execute(self, tool_name: str, tool_input: str) -> str:
"""Route execution request to appropriate tool."""
if tool_name not in self.tools:
return f"Unknown tool: {tool_name}"
return self.tools[tool_name].execute(tool_input)
Capture detailed reasoning execution for analysis:
@dataclass
class ReasoningTrace:
task_id: str
model_response: str
tool_calls: list[Dict[str, Any]]
reasoning_steps: list[str]
final_answer: str
correct: bool
tokens_generated: int
tool_execution_time: float
total_time: float
class ReasoningTraceCollector:
def __init__(self, tool_router: ToolRouter):
self.tool_router = tool_router
self.traces = []
def collect_trace(
self,
task: BenchmarkTask,
model: "LLM"
) -> ReasoningTrace:
"""
Execute task and collect detailed reasoning trace.
"""
trace = ReasoningTrace(
task_id=task.task_id,
model_response="",
tool_calls=[],
reasoning_steps=[],
final_answer="",
correct=False,
tokens_generated=0,
tool_execution_time=0.0,
total_time=0.0
)
# Generate response with tool integration
start_time = time.time()
response, tool_calls, steps = model.generate_with_tools(
task.question,
available_tools=task.tools_required
)
trace.model_response = response
trace.tokens_generated = len(response.split())
trace.reasoning_steps = steps
# Execute tool calls
tool_start = time.time()
for tool_call in tool_calls:
result = self.tool_router.route_and_execute(
tool_call["tool_name"],
tool_call["input"]
)
tool_call["result"] = result
trace.tool_calls.append(tool_call)
trace.tool_execution_time = time.time() - tool_start
trace.total_time = time.time() - start_time
# Extract final answer and check correctness
trace.final_answer = extract_answer(response)
trace.correct = trace.final_answer == task.expected_answer
return trace
Measure efficiency beyond accuracy:
def compute_pac_metric(
traces: list[ReasoningTrace],
cost_weights: Dict[str, float] = None
) -> float:
"""
Compute Performance-At-Cost metric balancing accuracy and efficiency.
PAC = Accuracy / (1 + normalized_cost)
where cost includes tokens and tool calls
"""
if cost_weights is None:
cost_weights = {
"token": 0.001, # per token cost
"tool_call": 0.1, # per tool invocation cost
"time": 0.01 # per second cost
}
total_accuracy = 0.0
total_cost = 0.0
for trace in traces:
# Accuracy component
total_accuracy += 1.0 if trace.correct else 0.0
# Cost component
token_cost = trace.tokens_generated * cost_weights["token"]
tool_cost = len(trace.tool_calls) * cost_weights["tool_call"]
time_cost = trace.total_time * cost_weights["time"]
total_cost += token_cost + tool_cost + time_cost
avg_accuracy = total_accuracy / len(traces)
normalized_cost = total_cost / len(traces)
pac = avg_accuracy / (1.0 + normalized_cost)
return pac
Measure efficiency frontier:
from sklearn.metrics import auc
def compute_auc_pcc(
traces: list[ReasoningTrace],
cost_budgets: list[float]
) -> float:
"""
Compute AUC of Performance vs Cost Curve.
Shows how accuracy scales with different cost budgets
(token limits, time limits, tool call limits)
"""
accuracies = []
costs = []
# Sort by total cost
sorted_traces = sorted(traces, key=lambda t: t.total_time)
cumulative_cost = 0.0
cumulative_correct = 0
for i, trace in enumerate(sorted_traces):
cumulative_cost += trace.total_time
if trace.correct:
cumulative_correct += 1
accuracy = cumulative_correct / (i + 1)
accuracies.append(accuracy)
costs.append(cumulative_cost / (i + 1))
# Normalize costs to [0, 1]
min_cost = min(costs)
max_cost = max(costs)
normalized_costs = [(c - min_cost) / (max_cost - min_cost) for c in costs]
# Compute AUC
auc_pcc = auc(normalized_costs, accuracies)
return auc_pcc
def evaluate_tool_integration(
model: "LLM",
benchmark: list[BenchmarkTask]
) -> Dict[str, float]:
"""
Comprehensive evaluation of tool-integrated reasoning.
"""
collector = ReasoningTraceCollector(ToolRouter())
traces = [collector.collect_trace(task, model) for task in benchmark]
metrics = {
"accuracy": sum(1.0 for t in traces if t.correct) / len(traces),
"avg_tokens": sum(t.tokens_generated for t in traces) / len(traces),
"avg_tool_calls": sum(len(t.tool_calls) for t in traces) / len(traces),
"avg_time": sum(t.total_time for t in traces) / len(traces),
"pac": compute_pac_metric(traces),
"auc_pcc": compute_auc_pcc(traces, cost_budgets=[0.5, 1.0, 2.0])
}
return metrics
Researchers. (2024). Dissecting Tool-Integrated Reasoning: An Empirical Study and Analysis. arXiv preprint arXiv:2508.15754.
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.