skills/skillxiv-v0.0.2-claude-opus-4.6/at2po-agentic-tree-search-optimization/SKILL.md
Optimize multi-turn agent policies via entropy-guided tree expansion and turn-level credit assignment. AT²PO addresses exploration diversity, sparse credit signal, and policy misalignment problems in LLM agents through structured tree search and turn-aware policy updates.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add ADu2021/skillXiv at2po-agentic-tree-search-optimization
Multi-turn agent reinforcement learning faces three critical challenges: (1) limited exploration diversity when policy entropy is low, (2) sparse credit assignment where only final success provides feedback (no signal for intermediate steps), and (3) misaligned policy optimization—token-level policy updates may not reflect the turn-level decisions agents actually make. These problems compound in multi-hop reasoning where agents must gather information across multiple tool calls before arriving at answers.
Combine entropy-guided tree exploration with turn-aware policy optimization that aligns updates to actual agent decision structure.
import torch
import torch.nn.functional as F

# Placeholder hyperparameters: the names come from the code below,
# the values are assumptions and should be tuned.
BRANCHING_PENALTY = 0.1
CLIP_EPS = 0.2


def compute_entropy(logits):
    """Mean Shannon entropy (nats) of softmax(logits) over the last dimension."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1).mean().item()


def weighted_average(values, weights):
    """Weighted mean; falls back to a plain mean when all weights are zero."""
    total = sum(weights)
    if total == 0:
        return sum(values) / len(values)
    return sum(v * w for v, w in zip(values, weights)) / total


class AT2PO:
    def __init__(self, model, tree_depth=2, branching_factor=10):
        self.model = model
        self.tree_depth = tree_depth
        self.max_branches = branching_factor

    def sample_continuation(self, leaf):
        """Roll out the policy from `leaf` and attach the new node to the tree.

        Environment-specific, so it is left abstract here.
        """
        raise NotImplementedError

    def entropy_guided_tree_expansion(self, tree, num_iterations=2):
        """Expand tree from uncertain turns to promote diverse exploration"""
        expanded_nodes = []
        for _ in range(num_iterations):
            # Score all leaf nodes by policy entropy
            leaf_scores = []
            for leaf in tree.leaves:
                with torch.no_grad():
                    action_logits = self.model(leaf.state)
                entropy = compute_entropy(action_logits)
                # Apply a depth-scaled penalty to prevent over-expansion
                score = entropy - BRANCHING_PENALTY * leaf.depth
                leaf_scores.append((leaf, score))
            # Select the K highest-scoring (most uncertain) leaves
            K = min(6, len(leaf_scores))
            selected_leaves = sorted(leaf_scores, key=lambda x: x[1], reverse=True)[:K]
            # Expand each selected leaf
            for leaf, _score in selected_leaves:
                # Sample M = max_branches continuations from high-entropy positions
                for _ in range(self.max_branches):
                    new_trajectory = self.sample_continuation(leaf)
                    expanded_nodes.append(new_trajectory)
        return expanded_nodes

    def turn_wise_credit_assignment(self, tree):
        """Compute node values via Monte Carlo bootstrapping"""
        node_values = {}
        # Bottom-up value propagation; assumes tree.nodes lists parents
        # before their children, so the reversed order visits children first.
        for node in reversed(tree.nodes):
            if node.is_leaf:
                # Terminal value: task success/failure
                node_values[node] = node.reward
            else:
                # Internal node value: entropy-weighted aggregate of descendants
                descendant_rewards = [
                    node_values[child] for child in node.children
                ]
                entropy_weights = [
                    compute_entropy(child.action_logits)
                    for child in node.children
                ]
                # Weighted average emphasizes uncertain paths
                node_values[node] = weighted_average(
                    descendant_rewards, entropy_weights
                )
        return node_values

    def turn_based_policy_optimization(self, trajectories, node_values):
        """Importance sampling + clipping at TURN level (not token level)"""
        turn_losses = []
        for trajectory in trajectories:
            for turn in trajectory.turns:
                # Turn-level advantage. Assumption: the baseline is the value
                # of the node the turn starts from.
                turn_advantage = (
                    node_values[turn.end_node] - node_values[turn.start_node]
                )
                # One importance ratio per turn. Assumption: length-normalized
                # (mean) log-ratio over the turn's tokens.
                log_ratio = (turn.new_log_probs - turn.old_log_probs).mean()
                turn_ratio = torch.exp(log_ratio)
                # Apply clipping to prevent divergence
                clipped_ratio = torch.clamp(
                    turn_ratio, 1 - CLIP_EPS, 1 + CLIP_EPS
                )
                # Minimize clipped surrogate loss at turn granularity
                turn_loss = -torch.min(
                    turn_ratio * turn_advantage,
                    clipped_ratio * turn_advantage,
                )
                turn_losses.append(turn_loss)
        return torch.stack(turn_losses).mean()
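The credit-assignment and turn-level update steps can be traced by hand on a toy tree. The sketch below uses only the standard library and is not the authors' implementation: the dictionary-based tree, the entropy values, the choice of the parent value as baseline, and the length-normalized (mean) log-ratio per turn are all assumptions made for illustration.

```python
import math


def weighted_average(values, weights):
    """Weighted mean; plain mean if every weight is zero."""
    total = sum(weights)
    if total == 0:
        return sum(values) / len(values)
    return sum(v * w for v, w in zip(values, weights)) / total


def propagate_values(children, leaf_reward, entropy):
    """Bottom-up node values for a tree given as {node: [child, ...]}.

    children:    maps each internal node to its list of children
    leaf_reward: maps each leaf to its terminal reward
    entropy:     maps each non-root node to its policy entropy (assumed given)
    """
    values = dict(leaf_reward)

    def value(node):
        if node not in values:
            kids = children[node]
            values[node] = weighted_average(
                [value(k) for k in kids], [entropy[k] for k in kids]
            )
        return values[node]

    for node in children:
        value(node)
    return values


def turn_clipped_loss(new_logprobs, old_logprobs, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate with a single importance ratio per turn."""
    # Assumption: mean log-ratio over the turn's tokens (length-normalized).
    diffs = [n - o for n, o in zip(new_logprobs, old_logprobs)]
    ratio = math.exp(sum(diffs) / len(diffs))
    clipped = min(max(ratio, 1 - clip_eps), 1 + clip_eps)
    return -min(ratio * advantage, clipped * advantage)


# Toy tree: root -> {a, b}; a -> {a1, a2}; b, a1 and a2 are leaves.
children = {"root": ["a", "b"], "a": ["a1", "a2"]}
leaf_reward = {"a1": 1.0, "a2": 0.0, "b": 0.0}
entropy = {"a": 1.0, "b": 1.0, "a1": 3.0, "a2": 1.0}

values = propagate_values(children, leaf_reward, entropy)
# V(a)    = (1.0 * 3 + 0.0 * 1) / 4 = 0.75
# V(root) = (0.75 * 1 + 0.0 * 1) / 2 = 0.375

# Turn root -> a, with the parent value as baseline (an assumption).
advantage = values["a"] - values["root"]
loss = turn_clipped_loss([-0.5, -0.5], [-1.0, -1.0], advantage)
```

Here the turn ratio is exp(0.5), which exceeds 1 + 0.2, so with a positive advantage the clipped term is the smaller one and the update is capped. With a negative advantage the unclipped term is kept, as in standard PPO clipping.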
Architecture & Training Configuration:
Reward Function: Binary exact-match scoring with format validation constraints:
def compute_turn_reward(trajectory, gold_answer, max_turns):
    """Reward depends on task success and format compliance"""
    if trajectory.final_answer == gold_answer:
        # Bonus if reached answer efficiently (fewer turns, larger bonus)
        num_turns = len(trajectory.turns)
        efficiency_bonus = 1.0 - (num_turns / max_turns) * 0.2
        return 1.0 + efficiency_bonus
    elif trajectory.final_answer_format_valid():
        # Partial credit for correct format, wrong answer
        return 0.5
    else:
        return 0.0
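To make the reward scale concrete, here is a small worked example of the same arithmetic. It is a sketch under assumptions: `turn_reward` is a hypothetical standalone variant that takes plain values instead of a trajectory object, and the turn counts and answers are made up.

```python
def turn_reward(final_answer, gold_answer, format_valid, num_turns, max_turns):
    """Same reward arithmetic as above, on plain values (hypothetical helper)."""
    if final_answer == gold_answer:
        efficiency_bonus = 1.0 - (num_turns / max_turns) * 0.2
        return 1.0 + efficiency_bonus
    if format_valid:
        return 0.5
    return 0.0


# Correct answer in 2 of 10 turns: 1.0 + (1.0 - 0.2 * 0.2)
fast = turn_reward("Paris", "Paris", True, num_turns=2, max_turns=10)
# Correct answer using all 10 turns: 1.0 + (1.0 - 0.2 * 1.0)
slow = turn_reward("Paris", "Paris", True, num_turns=10, max_turns=10)
# Well-formed but wrong answer, and a malformed answer
wrong = turn_reward("Lyon", "Paris", True, num_turns=3, max_turns=10)
malformed = turn_reward("", "Paris", False, num_turns=3, max_turns=10)
```

As written, correct answers score between 1.8 and 2.0 depending on turn count, well-formed wrong answers score 0.5, and everything else scores 0.0, so the efficiency term only shifts the reward among successful trajectories.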
Entropy-Guided Expansion Rationale: Rather than branching uniformly at random, prioritize turns where policy entropy is high (the model is uncertain). This concentrates search effort on genuinely difficult decisions and skips turns where the policy is already confident.
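The selection rule described above can be illustrated with a few hand-made logit vectors. This is a minimal standard-library sketch, not the released code: the leaf names, logits, and the `depth_penalty` value are assumptions chosen so the ranking is easy to verify.

```python
import math


def softmax_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_leaves(leaf_logits, leaf_depth, k, depth_penalty=0.1):
    """Return the k leaf ids with the highest (entropy - penalty * depth)."""
    scores = {
        leaf: softmax_entropy(logits) - depth_penalty * leaf_depth[leaf]
        for leaf, logits in leaf_logits.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


leaf_logits = {
    "confident": [8.0, 0.0, 0.0, 0.0],  # one action dominates, entropy near 0
    "unsure": [1.0, 1.0, 1.0, 1.0],     # uniform policy, entropy = ln 4
    "mixed": [2.0, 1.0, 0.0, 0.0],      # in between
}
leaf_depth = {"confident": 1, "unsure": 1, "mixed": 1}

chosen = select_leaves(leaf_logits, leaf_depth, k=2)
```

With equal depths the ranking follows entropy alone, so the uniform leaf is expanded first and the near-deterministic one is skipped. The depth penalty only changes the order when leaves sit at different depths.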
Multi-Hop Reasoning (HotpotQA, 2WikiMultiHop, MuSiQue):
Single-Hop Tasks (NQ, TriviaQA, PopQA):
Benchmark Average:
The authors released code supporting HotpotQA, 2WikiMultiHop, and BAMBOOGLE benchmarks with pre-trained checkpoints.