skills/skillxiv-v0.0.2-claude-opus-4.6/aspo-advantage-shaping-policy-optimization/SKILL.md
Train LLMs to effectively integrate tools through advantage shaping, directly modifying advantage functions to guide policy without compromising training stability.
npx skillsauth add ADu2021/skillXiv aspo-advantage-shaping-policy-optimizationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
ASPO enables LLMs to develop effective tool integration strategies through novel advantage shaping that directly modifies the advantage function guiding policy behavior. Unlike standard RL approaches that may ignore tool-specific patterns, ASPO explicitly shapes advantages to reward early tool invocation, interactive turns, and strategic tool selection. The method achieves superior tool usage patterns and mathematical performance without training instability.
Create advantages that encourage tool usage:
import torch
import torch.nn.functional as F
from typing import Dict, List, Tuple
class ToolAwareAdvantageShaper:
"""Shape advantages to encourage effective tool integration."""
def __init__(
self,
tool_names: List[str],
early_invocation_bonus: float = 0.5,
interactivity_bonus: float = 0.3,
strategic_penalty: float = 0.1
):
self.tool_names = tool_names
self.early_invocation_bonus = early_invocation_bonus
self.interactivity_bonus = interactivity_bonus
self.strategic_penalty = strategic_penalty
def shape_advantages(
self,
trajectory: Dict, # Contains states, actions, rewards, etc.
base_advantages: torch.Tensor, # (trajectory_length,)
gamma: float = 0.99
) -> torch.Tensor:
"""
Shape advantages to encourage tool integration patterns.
"""
shaped_advantages = base_advantages.clone()
states = trajectory["states"]
actions = trajectory["actions"]
rewards = trajectory["rewards"]
for t, (state, action, reward) in enumerate(zip(states, actions, rewards)):
# Bonus 1: Early tool invocation
if self._is_tool_call(action):
# Bonus inversely proportional to step number
early_bonus = self.early_invocation_bonus / (1.0 + t / 10.0)
shaped_advantages[t] += early_bonus
# Bonus 2: Interactivity (tool use + reasoning)
if t > 0 and self._is_reasoning_step(actions[t-1]) and self._is_tool_call(action):
shaped_advantages[t] += self.interactivity_bonus
# Penalty: Poor tool selection (using wrong tool)
if self._is_tool_call(action) and not self._is_appropriate_tool(action, state):
shaped_advantages[t] -= self.strategic_penalty
# Advantage is cumulative: early decisions affect downstream
if t < len(shaped_advantages) - 1:
shaped_advantages[t+1] += 0.5 * shaped_advantages[t]
return shaped_advantages
def _is_tool_call(self, action: str) -> bool:
"""Check if action is a tool invocation."""
for tool_name in self.tool_names:
if tool_name in action.lower():
return True
return False
def _is_reasoning_step(self, action: str) -> bool:
"""Check if action is a thinking/reasoning step."""
reasoning_keywords = ["think", "reason", "because", "let's", "note that"]
return any(kw in action.lower() for kw in reasoning_keywords)
def _is_appropriate_tool(self, action: str, state: Dict) -> bool:
"""Check if tool selection matches task context."""
task_description = state.get("task", "")
# Rule-based matching
if "math" in task_description.lower() and "calculator" in action.lower():
return True
if "code" in task_description.lower() and "interpreter" in action.lower():
return True
if "search" in task_description.lower() and "search" in action.lower():
return True
return False
Perform policy optimization with shaped advantages:
class ASPOOptimizer:
"""Advantage Shaping Policy Optimization."""
def __init__(
self,
model: "LLM",
advantage_shaper: ToolAwareAdvantageShaper,
learning_rate: float = 1e-5,
clip_ratio: float = 0.2,
entropy_coef: float = 0.01
):
self.model = model
self.advantage_shaper = advantage_shaper
self.optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
self.clip_ratio = clip_ratio
self.entropy_coef = entropy_coef
def compute_aspo_loss(
self,
trajectory: Dict,
old_log_probs: torch.Tensor, # Probability under old policy
shaped_advantages: torch.Tensor,
value_estimates: torch.Tensor = None
) -> Tuple[torch.Tensor, Dict[str, float]]:
"""
Compute ASPO loss using shaped advantages.
"""
states = trajectory["states"]
actions = trajectory["actions"]
returns = trajectory["returns"]
# Get new log probabilities under current policy
new_log_probs_list = []
entropy = 0.0
for state, action in zip(states, actions):
logits = self.model.get_action_logits(state)
log_probs = F.log_softmax(logits, dim=-1)
action_idx = self._encode_action(action)
new_log_probs = log_probs[action_idx]
new_log_probs_list.append(new_log_probs)
# Entropy for exploration
entropy += -(log_probs * torch.exp(log_probs)).sum()
new_log_probs = torch.stack(new_log_probs_list)
# PPO-style clipped objective
ratio = torch.exp(new_log_probs - old_log_probs)
# Clipped objective with shaped advantages
surr1 = ratio * shaped_advantages
surr2 = torch.clamp(ratio, 1 - self.clip_ratio, 1 + self.clip_ratio) * shaped_advantages
policy_loss = -torch.min(surr1, surr2).mean()
# Value loss (if using value estimates)
value_loss = 0.0
if value_estimates is not None:
value_loss = F.mse_loss(value_estimates, returns)
# Entropy bonus for exploration
entropy_loss = -self.entropy_coef * entropy / len(states)
# Total loss
total_loss = policy_loss + 0.5 * value_loss + entropy_loss
metrics = {
"policy_loss": policy_loss.item(),
"value_loss": value_loss.item() if isinstance(value_loss, torch.Tensor) else value_loss,
"entropy": entropy.item() / len(states),
"total_loss": total_loss.item(),
"avg_ratio": ratio.mean().item()
}
return total_loss, metrics
def train_step(
self,
trajectory: Dict,
num_updates: int = 3
) -> Dict[str, float]:
"""
Execute ASPO training step with multiple passes over trajectory.
"""
# Compute base advantages (TD or GAE)
base_advantages = self._compute_advantages(trajectory)
# Shape advantages for tool integration
shaped_advantages = self.advantage_shaper.shape_advantages(
trajectory, base_advantages
)
# Normalize advantages
shaped_advantages = (shaped_advantages - shaped_advantages.mean()) / \
(shaped_advantages.std() + 1e-8)
# Store old log probs
old_log_probs = self._get_old_log_probs(trajectory)
# Multiple passes for stable optimization
all_metrics = []
for update in range(num_updates):
loss, metrics = self.compute_aspo_loss(
trajectory,
old_log_probs,
shaped_advantages
)
self.optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
self.optimizer.step()
all_metrics.append(metrics)
# Average metrics across updates
avg_metrics = {
key: sum(m[key] for m in all_metrics) / len(all_metrics)
for key in all_metrics[0].keys()
}
return avg_metrics
def _compute_advantages(
self,
trajectory: Dict,
gamma: float = 0.99,
lambda_: float = 0.95
) -> torch.Tensor:
"""Compute Generalized Advantage Estimation (GAE)."""
rewards = torch.tensor(trajectory["rewards"], dtype=torch.float32)
values = self._estimate_values(trajectory["states"])
advantages = []
gae = 0
for t in reversed(range(len(rewards))):
if t == len(rewards) - 1:
next_value = 0
else:
next_value = values[t + 1]
delta = rewards[t] + gamma * next_value - values[t]
gae = delta + gamma * lambda_ * gae
advantages.insert(0, gae)
return torch.tensor(advantages, dtype=torch.float32)
def _estimate_values(self, states: List[Dict]) -> torch.Tensor:
"""Estimate state values for advantage computation."""
values = []
for state in states:
# Use model's internal value estimate if available
value = self.model.estimate_value(state)
values.append(value)
return torch.tensor(values)
def _get_old_log_probs(self, trajectory: Dict) -> torch.Tensor:
"""Get log probs of actions under policy at collection time."""
# In practice: these would be stored during trajectory collection
return torch.zeros(len(trajectory["actions"]))
def _encode_action(self, action: str) -> int:
"""Encode action string to index."""
# In practice: implement action encoding
return 0
Measure tool usage quality:
class ToolIntegrationEvaluator:
"""Evaluate quality of tool integration patterns."""
def __init__(self, task_verifier: "TaskVerifier"):
self.task_verifier = task_verifier
def evaluate_tool_usage(
self,
trajectory: Dict,
task_description: str
) -> Dict[str, float]:
"""
Evaluate tool integration quality.
"""
states = trajectory["states"]
actions = trajectory["actions"]
final_answer = trajectory.get("final_answer", "")
metrics = {
"early_invocation_step": None,
"num_tool_calls": 0,
"num_interactive_turns": 0,
"tool_appropriateness": 0.0,
"task_success": False,
"efficiency": 0.0
}
# Count tool invocations
tool_call_steps = []
interactive_turns = 0
for t, action in enumerate(actions):
if self._is_tool_call(action):
metrics["num_tool_calls"] += 1
tool_call_steps.append(t)
# Check if preceded by reasoning
if t > 0 and self._is_reasoning(actions[t-1]):
interactive_turns += 1
# Early invocation metric
if tool_call_steps:
metrics["early_invocation_step"] = tool_call_steps[0]
metrics["efficiency"] = 1.0 / (1.0 + tool_call_steps[0] / 10.0)
metrics["num_interactive_turns"] = interactive_turns
# Tool appropriateness (simplified)
metrics["tool_appropriateness"] = min(
1.0, metrics["num_tool_calls"] / 3.0 # Target: 3 tool calls
)
# Task success
metrics["task_success"] = self.task_verifier.verify(final_answer, task_description)
return metrics
def _is_tool_call(self, action: str) -> bool:
tool_keywords = ["calculator", "interpreter", "search", "api", "tool"]
return any(kw in action.lower() for kw in tool_keywords)
def _is_reasoning(self, action: str) -> bool:
reasoning_keywords = ["think", "reason", "note", "consider", "because"]
return any(kw in action.lower() for kw in reasoning_keywords)
def evaluate_on_benchmark(
self,
model: "LLM",
benchmark: List[Dict],
num_trajectories_per_task: int = 5
) -> Dict[str, float]:
"""Evaluate on benchmark dataset."""
all_metrics = []
for task in benchmark:
for _ in range(num_trajectories_per_task):
trajectory = model.collect_trajectory(task["description"])
metrics = self.evaluate_tool_usage(trajectory, task["description"])
all_metrics.append(metrics)
# Aggregate
return {
"avg_early_invocation": sum(m["early_invocation_step"] or float('inf') for m in all_metrics) / len(all_metrics),
"avg_tool_calls": sum(m["num_tool_calls"] for m in all_metrics) / len(all_metrics),
"avg_interactive_turns": sum(m["num_interactive_turns"] for m in all_metrics) / len(all_metrics),
"success_rate": sum(1 for m in all_metrics if m["task_success"]) / len(all_metrics),
"avg_efficiency": sum(m["efficiency"] for m in all_metrics) / len(all_metrics)
}
Researchers. (2024). Understanding Tool-Integrated Reasoning. arXiv preprint arXiv:2508.19201.
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.