skills/skillxiv-v0.0.2-claude-opus-4.6/alphapollo-reasoning/SKILL.md
Enable LLMs to solve complex problems through multi-turn agentic reasoning with tool-assisted verification and iterative refinement loops. Trigger: improve reasoning reliability on long-horizon tasks by combining RL with verification.
AlphaApollo frames complex problem-solving as a multi-turn agentic process with three integrated levels: reasoning (multi-turn interactions), learning (turn-level RL), and evolution (multi-round refinement with verification). By separating actions from tool responses during RL training and implementing a propose-judge-update loop, the system achieves reliable reasoning on tasks requiring tool use, verification, and iterative correction.
The key insight: decoupling actions from tool responses and learning from turn-level feedback lets the model build strategies that survive individual tool failures.
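One common way to realize this decoupling, sketched here as an assumption rather than as the paper's exact recipe, is to mask tool-response tokens out of the policy loss so that only tokens the model itself produced receive gradient. The `Segment` type and the `build_loss_mask` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """A contiguous span of tokens in one episode (hypothetical type)."""
    source: str           # "model" for agent actions, "tool" for tool responses
    token_ids: List[int]


def build_loss_mask(segments: List[Segment]) -> List[int]:
    """Return 1 for model-generated tokens and 0 for tool-response tokens."""
    mask: List[int] = []
    for segment in segments:
        keep = 1 if segment.source == "model" else 0
        mask.extend([keep] * len(segment.token_ids))
    return mask


episode = [
    Segment("model", [11, 12, 13]),  # reasoning and tool call written by the model
    Segment("tool", [21, 22]),       # tool output appended by the environment
    Segment("model", [31]),          # final answer written by the model
]
print(build_loss_mask(episode))  # [1, 1, 1, 0, 0, 1]
```

Multiplying per-token losses by this mask keeps the tool output in the context while excluding it from the gradient.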
Define how the model interacts with tools and receives feedback.
```python
class AgentAction:
    """Represents a single turn's action."""

    def __init__(self, action_type, content, tool_call=None,
                 log_prob=None, entropy=None):
        self.action_type = action_type  # "reason", "call_tool", or "output"
        self.content = content          # Reasoning text, tool name, or final answer
        self.tool_call = tool_call      # {"name": ..., "args": ...} for tool calls
        # Read during RL training; expected to be set by the sampling code
        self.log_prob = log_prob
        self.entropy = entropy

    def to_prompt(self):
        if self.action_type == "call_tool":
            return f"TOOL_CALL: {self.tool_call['name']}({self.tool_call['args']})"
        return f"{self.action_type.upper()}: {self.content}"


class ToolResponse:
    """Represents tool execution result."""

    def __init__(self, tool_name, status, result, error=None):
        self.tool_name = tool_name
        self.status = status  # "success", "error", or "timeout"
        self.result = result
        self.error = error

    def to_prompt(self):
        if self.status == "success":
            return f"Tool {self.tool_name} returned: {self.result}"
        return f"Tool {self.tool_name} error: {self.error}"


class MultiTurnTrajectory:
    """Tracks a complete problem-solving episode."""

    def __init__(self, problem):
        self.problem = problem
        self.turns = []         # (action, response) pairs; response is None for non-tool turns
        self.turn_rewards = []  # Immediate reward per turn
        self.final_reward = None
        self.solution = None

    def add_turn(self, action, response, immediate_reward=0):
        self.turns.append((action, response))
        self.turn_rewards.append(immediate_reward)

    def finalize(self, final_reward, solution):
        self.final_reward = final_reward
        self.solution = solution
```
The agent reasons iteratively, calling tools when needed and refining based on responses.
```python
class MultiTurnReasoner:
    def __init__(self, model, tools_registry, max_turns=10):
        self.model = model
        self.tools = tools_registry  # Maps tool name -> callable
        self.max_turns = max_turns

    def reason(self, problem):
        """
        Multi-turn reasoning loop.

        Args:
            problem: Problem statement

        Returns:
            MultiTurnTrajectory with complete episode
        """
        trajectory = MultiTurnTrajectory(problem)
        context = f"Problem: {problem}\n\nReasoning:\n"

        for _ in range(self.max_turns):
            # Model generates reasoning and decides next action.
            # The markers are not used as stop tokens: the text that follows
            # "TOOL_CALL:" or "OUTPUT:" must be present in `output` to be parsed.
            output = self.model.generate(context, max_tokens=256)

            # Parse output to determine action type
            if "TOOL_CALL:" in output:
                # parse_tool_call is assumed to return {"name": str, "args": dict}
                tool_spec = parse_tool_call(output)
                action = AgentAction(
                    "call_tool",
                    tool_spec["name"],
                    tool_call=tool_spec
                )

                # Execute tool
                try:
                    result = self.tools[tool_spec["name"]](**tool_spec["args"])
                    response = ToolResponse(tool_spec["name"], "success", result)
                    immediate_reward = 0.1  # Reward for a successful tool call
                except Exception as e:
                    response = ToolResponse(tool_spec["name"], "error", None, str(e))
                    immediate_reward = -0.05  # Penalty for a failed tool call

            elif "OUTPUT:" in output:
                # Final answer
                answer = output.split("OUTPUT:")[-1].strip()
                action = AgentAction("output", answer)
                trajectory.add_turn(action, None, 0)  # Scored at finalization
                # Record the answer so callers can read trajectory.solution;
                # the final reward is unknown until the answer is judged
                trajectory.finalize(final_reward=None, solution=answer)
                break

            else:
                # Continue reasoning
                action = AgentAction("reason", output)
                response = None
                immediate_reward = 0

            trajectory.add_turn(action, response, immediate_reward)

            # Update context for next turn
            context += f"\n{action.to_prompt()}"
            if response:
                context += f"\n{response.to_prompt()}"

        return trajectory
```
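The reasoning loop calls `parse_tool_call`, which the source does not define. The following is a minimal sketch of such a helper, assuming the model emits tool calls in the same `TOOL_CALL: name({...})` shape that `AgentAction.to_prompt` produces, with the arguments written as a Python dict literal. Both the format and the regular expression are assumptions made for illustration.

```python
import ast
import re

# Matches "TOOL_CALL: name(<args>)"; DOTALL lets the arguments span several lines.
_TOOL_CALL_RE = re.compile(r"TOOL_CALL:\s*(\w+)\((.*)\)", re.DOTALL)


def parse_tool_call(output: str) -> dict:
    """Parse 'TOOL_CALL: name({...})' into {"name": str, "args": dict}."""
    match = _TOOL_CALL_RE.search(output)
    if match is None:
        raise ValueError("no TOOL_CALL marker found in model output")
    name, raw_args = match.group(1), match.group(2).strip()
    # literal_eval accepts only Python literals, so it never executes code
    args = ast.literal_eval(raw_args) if raw_args else {}
    if not isinstance(args, dict):
        raise ValueError("tool arguments must be a dict literal")
    return {"name": name, "args": args}


spec = parse_tool_call(
    "I should check this.\nTOOL_CALL: calculator({'expression': '6 * 7'})"
)
print(spec)  # {'name': 'calculator', 'args': {'expression': '6 * 7'}}
```

The greedy match runs to the last closing parenthesis in the output, so this sketch expects the tool call to be the final thing the model writes in a turn. A production parser would more likely use a structured format such as JSON.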
Train the model using rewards at each turn, not just the final outcome.
```python
def compute_turn_rewards(trajectory, final_correctness_reward, discount_factor=0.99):
    """
    Distribute the final reward across turns with discounting.

    Each turn is credited with its discounted return-to-go:
    G_t = r_t + discount_factor * G_{t+1}, where r_t is the tool-use
    shaping reward and the final correctness reward arrives on the last turn.
    Successful tool calls add credit; failed calls lower the return of
    that turn and of every earlier turn.
    """
    num_turns = len(trajectory.turns)
    turn_rewards = [0.0] * num_turns

    # Backward credit assignment with discount
    return_to_go = 0.0
    for t in reversed(range(num_turns)):
        action, response = trajectory.turns[t]

        # Tool calls get a shaping reward based on whether they succeeded
        if action.action_type == "call_tool":
            step_reward = 0.1 if response.status == "success" else -0.05
        else:
            step_reward = 0.0

        # The final correctness reward is earned on the last turn
        if t == num_turns - 1:
            step_reward += final_correctness_reward

        return_to_go = step_reward + discount_factor * return_to_go
        turn_rewards[t] = return_to_go

    return turn_rewards
```
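To make backward credit assignment concrete, here is a small worked example of the standard discounted return-to-go, G_t = r_t + γ·G_{t+1}. It is an illustration of the general technique, not a statement of the paper's exact reward scheme; `discounted_returns` is a hypothetical helper, and γ = 0.5 is chosen only to keep the arithmetic easy to follow.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every turn t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Turn 0: reasoning (no shaping reward)
# Turn 1: successful tool call (+0.1)
# Turn 2: final answer judged correct (+1.0)
returns = discounted_returns([0.0, 0.1, 1.0], gamma=0.5)
print([round(g, 3) for g in returns])  # [0.3, 0.6, 1.0]
```

The reasoning turn earns nothing on its own but still receives 0.3 of credit, because it led to the tool call and the correct answer that followed.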
```python
import torch


def train_on_trajectory(model, trajectory, optimizer, final_reward):
    """
    Compute RL loss and update model parameters.
    """
    if not trajectory.turns:
        return 0.0  # Nothing to learn from an empty episode

    # Compute per-turn rewards
    turn_rewards = compute_turn_rewards(trajectory, final_reward)

    total_loss = 0
    for t, (action, response) in enumerate(trajectory.turns):
        # Log probability of this action, stored on the action at sampling
        # time; it must be a tensor still attached to the model's graph
        log_prob = action.log_prob

        # Policy gradient: scale the log-probability by the turn's reward
        loss = -turn_rewards[t] * log_prob

        # Entropy bonus (negative loss term) to encourage exploration
        entropy_bonus = -0.01 * action.entropy

        total_loss += loss + entropy_bonus

    # Backward pass
    optimizer.zero_grad()
    total_loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()

    return total_loss.item()
```
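The sign conventions in the loss are easy to get backwards, so the sketch below evaluates the same expression on plain floats. It assumes a REINFORCE-style objective with an entropy bonus, as in the training function above; `reinforce_loss` is a hypothetical helper that only computes the scalar value and performs no gradient update.

```python
from typing import Sequence


def reinforce_loss(returns: Sequence[float],
                   log_probs: Sequence[float],
                   entropies: Sequence[float],
                   entropy_coef: float = 0.01) -> float:
    """Sum over turns of -(return * log_prob) - entropy_coef * entropy."""
    if not (len(returns) == len(log_probs) == len(entropies)):
        raise ValueError("inputs must have one entry per turn")
    loss = 0.0
    for g, logp, h in zip(returns, log_probs, entropies):
        loss += -g * logp          # minimizing this raises log_prob when g > 0
        loss += -entropy_coef * h  # minimizing this raises entropy
    return loss


loss = reinforce_loss(
    returns=[1.0, -0.5],
    log_probs=[-2.0, -1.0],
    entropies=[1.0, 2.0],
)
print(round(loss, 4))  # 1.47
```

The first turn has a positive return, so lowering the loss pushes its log-probability up. The second has a negative return, so lowering the loss pushes its log-probability down.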
For complex problems, iterate: generate solution → verify → refine.
```python
class IterativeRefinement:
    def __init__(self, model, tools_registry, judge_model):
        self.reasoner = MultiTurnReasoner(model, tools_registry)
        self.judge = judge_model  # Separate model for verification

    def propose_judge_update(self, problem, max_rounds=3):
        """
        Iteratively refine solution through propose-judge-update cycles.

        Args:
            problem: Problem statement
            max_rounds: Maximum refinement iterations

        Returns:
            Best solution found across rounds
        """
        best_solution = None
        best_score = -1.0
        refinement_history = []
        # Critiques accumulate in the prompt; the judge always sees the
        # original problem statement
        prompt = problem

        for round_num in range(max_rounds):
            # PROPOSE: Generate solution
            trajectory = self.reasoner.reason(prompt)
            solution = trajectory.solution

            # JUDGE: Verify solution quality
            verification_result = self.judge.verify(problem, solution)
            score = verification_result["score"]
            critique = verification_result["critique"]

            refinement_history.append({
                "round": round_num,
                "solution": solution,
                "score": score,
                "critique": critique
            })

            if score > best_score:
                best_score = score
                best_solution = solution

            # UPDATE: If not perfect, refine
            if score < 1.0:
                # Add critique to problem context for next round
                prompt = f"{prompt}\n\nPrior attempt critique: {critique}\nRefined approach:"
            else:
                # Perfect solution found
                break

        return {
            "best_solution": best_solution,
            "best_score": best_score,
            "refinement_history": refinement_history
        }
```
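The propose-judge-update cycle can be exercised without any model by plugging in stub functions. The sketch below is a toy illustration under stated assumptions: `refine`, `toy_propose`, and `toy_judge` are hypothetical names, and the judge is a simple exact-match check standing in for a verifier model.

```python
from typing import Callable, Dict, List, Tuple

Proposer = Callable[[str], str]                  # prompt -> solution
Judge = Callable[[str, str], Tuple[float, str]]  # (problem, solution) -> (score, critique)


def refine(propose: Proposer, judge: Judge, problem: str, max_rounds: int = 3) -> Dict:
    prompt = problem
    best_solution, best_score = None, float("-inf")
    history: List[Dict] = []
    for round_num in range(max_rounds):
        solution = propose(prompt)                  # PROPOSE
        score, critique = judge(problem, solution)  # JUDGE (sees the original problem)
        history.append({"round": round_num, "solution": solution, "score": score})
        if score > best_score:
            best_solution, best_score = solution, score
        if score >= 1.0:
            break
        prompt = f"{prompt}\nPrior attempt critique: {critique}"  # UPDATE
    return {"best_solution": best_solution, "best_score": best_score, "history": history}


def toy_propose(prompt: str) -> str:
    # Stand-in for the reasoner: answers correctly only after seeing a critique
    return "42" if "critique" in prompt else "41"


def toy_judge(problem: str, solution: str) -> Tuple[float, str]:
    # Stand-in for the judge model: exact-match check against a known answer
    return (1.0, "") if solution == "42" else (0.0, "recheck the multiplication")


result = refine(toy_propose, toy_judge, "What is 6 * 7?")
print(result["best_solution"], result["best_score"], len(result["history"]))  # 42 1.0 2
```

The first round fails, its critique is appended to the prompt, and the second round succeeds and ends the loop early.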
Combine multi-turn RL with propose-judge-update refinement.
```python
import torch


def train_alphapollo(model, judge_model, dataset, config):
    """
    Full AlphaApollo training pipeline.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    reasoner = MultiTurnReasoner(model, tools_registry=config.tools)
    refiner = IterativeRefinement(model, config.tools, judge_model)

    for epoch in range(config.num_epochs):
        for problem_id, problem in enumerate(dataset):
            # Phase 1: Multi-turn agentic reasoning
            trajectory = reasoner.reason(problem)

            # Phase 2: Multi-round agentic evolution (propose-judge-update)
            evolution_result = refiner.propose_judge_update(problem)

            # Phase 3: Train on trajectories, using the best judged score
            # from the refinement rounds as the final reward
            final_reward = evolution_result["best_score"]
            loss = train_on_trajectory(
                model,
                trajectory,
                optimizer,
                final_reward
            )

            # Logging
            if problem_id % 100 == 0:
                print(f"Epoch {epoch}, Problem {problem_id}: "
                      f"loss={loss:.4f}, best_score={final_reward:.4f}")

    return model
```
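The constants hard-coded in the snippets above can be gathered into a single configuration object so they are easier to tune. This is a sketch only: the `AlphaApolloConfig` name and its field names are hypothetical, and the defaults are simply the literal values that appear in the code above. The `tools` and `num_epochs` fields that `train_alphapollo` reads are left out because the source gives no values for them.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AlphaApolloConfig:
    """Hypothetical container for the constants used in the snippets above."""
    max_turns: int = 10              # MultiTurnReasoner turn budget
    max_tokens_per_turn: int = 256   # generation cap per turn
    max_rounds: int = 3              # propose-judge-update iterations
    learning_rate: float = 1e-5      # Adam step size
    discount_factor: float = 0.99    # credit-assignment discount
    entropy_coef: float = 0.01       # exploration bonus weight
    grad_clip_norm: float = 1.0      # maximum gradient norm
    tool_success_reward: float = 0.1
    tool_failure_reward: float = -0.05


config = AlphaApolloConfig(max_rounds=5)
print(asdict(config)["max_rounds"], config.discount_factor)  # 5 0.99
```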
Hyperparameters:
When to Use:
When NOT to Use:
AlphaApollo: Orchestrating Foundation Models and Tools for Agentic Reasoning — arXiv:2510.06261