ASPO: Advantage Shaping Policy Optimization

Core Concept

ASPO enables LLMs to develop effective tool integration strategies through novel advantage shaping that directly modifies the advantage function guiding policy behavior. Unlike standard RL approaches that may ignore tool-specific patterns, ASPO explicitly shapes advantages to reward early tool invocation, interactive turns, and strategic tool selection. The method achieves superior tool usage patterns and mathematical performance without training instability.

Architecture Overview

Advantage Function Shaping: Direct modification of advantage estimates
Tool-Specific Reward Design: Early invocation and interactivity bonuses
Policy Optimization: Stable gradient updates based on shaped advantages
Tool Integration Patterns: Early code invocation, interactive reasoning
Training Stability: No degradation vs. standard RL

Implementation Steps

1. Design Tool-Aware Advantage Shaping

Create advantages that encourage tool usage:

import torch
import torch.nn.functional as F
from typing import Dict, List, Tuple

class ToolAwareAdvantageShaper:
    """Shape advantages to encourage effective tool integration."""

    def __init__(
        self,
        tool_names: List[str],
        early_invocation_bonus: float = 0.5,
        interactivity_bonus: float = 0.3,
        strategic_penalty: float = 0.1
    ):
        self.tool_names = tool_names
        self.early_invocation_bonus = early_invocation_bonus
        self.interactivity_bonus = interactivity_bonus
        self.strategic_penalty = strategic_penalty

    def shape_advantages(
        self,
        trajectory: Dict,  # Contains states, actions, rewards, etc.
        base_advantages: torch.Tensor,  # (trajectory_length,)
        gamma: float = 0.99
    ) -> torch.Tensor:
        """
        Shape advantages to encourage tool integration patterns.
        """
        shaped_advantages = base_advantages.clone()
        states = trajectory["states"]
        actions = trajectory["actions"]
        rewards = trajectory["rewards"]

        for t, (state, action, reward) in enumerate(zip(states, actions, rewards)):
            # Bonus 1: Early tool invocation
            if self._is_tool_call(action):
                # Bonus inversely proportional to step number
                early_bonus = self.early_invocation_bonus / (1.0 + t / 10.0)
                shaped_advantages[t] += early_bonus

            # Bonus 2: Interactivity (tool use + reasoning)
            if t > 0 and self._is_reasoning_step(actions[t-1]) and self._is_tool_call(action):
                shaped_advantages[t] += self.interactivity_bonus

            # Penalty: Poor tool selection (using wrong tool)
            if self._is_tool_call(action) and not self._is_appropriate_tool(action, state):
                shaped_advantages[t] -= self.strategic_penalty

            # Advantage is cumulative: early decisions affect downstream
            if t < len(shaped_advantages) - 1:
                shaped_advantages[t+1] += 0.5 * shaped_advantages[t]

        return shaped_advantages

    def _is_tool_call(self, action: str) -> bool:
        """Check if action is a tool invocation."""
        for tool_name in self.tool_names:
            if tool_name in action.lower():
                return True
        return False

    def _is_reasoning_step(self, action: str) -> bool:
        """Check if action is a thinking/reasoning step."""
        reasoning_keywords = ["think", "reason", "because", "let's", "note that"]
        return any(kw in action.lower() for kw in reasoning_keywords)

    def _is_appropriate_tool(self, action: str, state: Dict) -> bool:
        """Check if tool selection matches task context."""
        task_description = state.get("task", "")

        # Rule-based matching
        if "math" in task_description.lower() and "calculator" in action.lower():
            return True
        if "code" in task_description.lower() and "interpreter" in action.lower():
            return True
        if "search" in task_description.lower() and "search" in action.lower():
            return True

        return False

2. Implement ASPO Optimization

Perform policy optimization with shaped advantages:

class ASPOOptimizer:
    """Advantage Shaping Policy Optimization."""

    def __init__(
        self,
        model: "LLM",
        advantage_shaper: ToolAwareAdvantageShaper,
        learning_rate: float = 1e-5,
        clip_ratio: float = 0.2,
        entropy_coef: float = 0.01
    ):
        self.model = model
        self.advantage_shaper = advantage_shaper
        self.optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
        self.clip_ratio = clip_ratio
        self.entropy_coef = entropy_coef

    def compute_aspo_loss(
        self,
        trajectory: Dict,
        old_log_probs: torch.Tensor,  # Probability under old policy
        shaped_advantages: torch.Tensor,
        value_estimates: torch.Tensor = None
    ) -> Tuple[torch.Tensor, Dict[str, float]]:
        """
        Compute ASPO loss using shaped advantages.
        """
        states = trajectory["states"]
        actions = trajectory["actions"]
        returns = trajectory["returns"]

        # Get new log probabilities under current policy
        new_log_probs_list = []
        entropy = 0.0

        for state, action in zip(states, actions):
            logits = self.model.get_action_logits(state)
            log_probs = F.log_softmax(logits, dim=-1)
            action_idx = self._encode_action(action)

            new_log_probs = log_probs[action_idx]
            new_log_probs_list.append(new_log_probs)

            # Entropy for exploration
            entropy += -(log_probs * torch.exp(log_probs)).sum()

        new_log_probs = torch.stack(new_log_probs_list)

        # PPO-style clipped objective
        ratio = torch.exp(new_log_probs - old_log_probs)

        # Clipped objective with shaped advantages
        surr1 = ratio * shaped_advantages
        surr2 = torch.clamp(ratio, 1 - self.clip_ratio, 1 + self.clip_ratio) * shaped_advantages

        policy_loss = -torch.min(surr1, surr2).mean()

        # Value loss (if using value estimates)
        value_loss = 0.0
        if value_estimates is not None:
            value_loss = F.mse_loss(value_estimates, returns)

        # Entropy bonus for exploration
        entropy_loss = -self.entropy_coef * entropy / len(states)

        # Total loss
        total_loss = policy_loss + 0.5 * value_loss + entropy_loss

        metrics = {
            "policy_loss": policy_loss.item(),
            "value_loss": value_loss.item() if isinstance(value_loss, torch.Tensor) else value_loss,
            "entropy": entropy.item() / len(states),
            "total_loss": total_loss.item(),
            "avg_ratio": ratio.mean().item()
        }

        return total_loss, metrics

    def train_step(
        self,
        trajectory: Dict,
        num_updates: int = 3
    ) -> Dict[str, float]:
        """
        Execute ASPO training step with multiple passes over trajectory.
        """
        # Compute base advantages (TD or GAE)
        base_advantages = self._compute_advantages(trajectory)

        # Shape advantages for tool integration
        shaped_advantages = self.advantage_shaper.shape_advantages(
            trajectory, base_advantages
        )

        # Normalize advantages
        shaped_advantages = (shaped_advantages - shaped_advantages.mean()) / \
                           (shaped_advantages.std() + 1e-8)

        # Store old log probs
        old_log_probs = self._get_old_log_probs(trajectory)

        # Multiple passes for stable optimization
        all_metrics = []
        for update in range(num_updates):
            loss, metrics = self.compute_aspo_loss(
                trajectory,
                old_log_probs,
                shaped_advantages
            )

            self.optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
            self.optimizer.step()

            all_metrics.append(metrics)

        # Average metrics across updates
        avg_metrics = {
            key: sum(m[key] for m in all_metrics) / len(all_metrics)
            for key in all_metrics[0].keys()
        }

        return avg_metrics

    def _compute_advantages(
        self,
        trajectory: Dict,
        gamma: float = 0.99,
        lambda_: float = 0.95
    ) -> torch.Tensor:
        """Compute Generalized Advantage Estimation (GAE)."""
        rewards = torch.tensor(trajectory["rewards"], dtype=torch.float32)
        values = self._estimate_values(trajectory["states"])

        advantages = []
        gae = 0

        for t in reversed(range(len(rewards))):
            if t == len(rewards) - 1:
                next_value = 0
            else:
                next_value = values[t + 1]

            delta = rewards[t] + gamma * next_value - values[t]
            gae = delta + gamma * lambda_ * gae
            advantages.insert(0, gae)

        return torch.tensor(advantages, dtype=torch.float32)

    def _estimate_values(self, states: List[Dict]) -> torch.Tensor:
        """Estimate state values for advantage computation."""
        values = []
        for state in states:
            # Use model's internal value estimate if available
            value = self.model.estimate_value(state)
            values.append(value)
        return torch.tensor(values)

    def _get_old_log_probs(self, trajectory: Dict) -> torch.Tensor:
        """Get log probs of actions under policy at collection time."""
        # In practice: these would be stored during trajectory collection
        return torch.zeros(len(trajectory["actions"]))

    def _encode_action(self, action: str) -> int:
        """Encode action string to index."""
        # In practice: implement action encoding
        return 0

3. Implement Tool Integration Evaluation

Measure tool usage quality:

class ToolIntegrationEvaluator:
    """Evaluate quality of tool integration patterns."""

    def __init__(self, task_verifier: "TaskVerifier"):
        self.task_verifier = task_verifier

    def evaluate_tool_usage(
        self,
        trajectory: Dict,
        task_description: str
    ) -> Dict[str, float]:
        """
        Evaluate tool integration quality.
        """
        states = trajectory["states"]
        actions = trajectory["actions"]
        final_answer = trajectory.get("final_answer", "")

        metrics = {
            "early_invocation_step": None,
            "num_tool_calls": 0,
            "num_interactive_turns": 0,
            "tool_appropriateness": 0.0,
            "task_success": False,
            "efficiency": 0.0
        }

        # Count tool invocations
        tool_call_steps = []
        interactive_turns = 0

        for t, action in enumerate(actions):
            if self._is_tool_call(action):
                metrics["num_tool_calls"] += 1
                tool_call_steps.append(t)

                # Check if preceded by reasoning
                if t > 0 and self._is_reasoning(actions[t-1]):
                    interactive_turns += 1

        # Early invocation metric
        if tool_call_steps:
            metrics["early_invocation_step"] = tool_call_steps[0]
            metrics["efficiency"] = 1.0 / (1.0 + tool_call_steps[0] / 10.0)

        metrics["num_interactive_turns"] = interactive_turns

        # Tool appropriateness (simplified)
        metrics["tool_appropriateness"] = min(
            1.0, metrics["num_tool_calls"] / 3.0  # Target: 3 tool calls
        )

        # Task success
        metrics["task_success"] = self.task_verifier.verify(final_answer, task_description)

        return metrics

    def _is_tool_call(self, action: str) -> bool:
        tool_keywords = ["calculator", "interpreter", "search", "api", "tool"]
        return any(kw in action.lower() for kw in tool_keywords)

    def _is_reasoning(self, action: str) -> bool:
        reasoning_keywords = ["think", "reason", "note", "consider", "because"]
        return any(kw in action.lower() for kw in reasoning_keywords)

    def evaluate_on_benchmark(
        self,
        model: "LLM",
        benchmark: List[Dict],
        num_trajectories_per_task: int = 5
    ) -> Dict[str, float]:
        """Evaluate on benchmark dataset."""
        all_metrics = []

        for task in benchmark:
            for _ in range(num_trajectories_per_task):
                trajectory = model.collect_trajectory(task["description"])
                metrics = self.evaluate_tool_usage(trajectory, task["description"])
                all_metrics.append(metrics)

        # Aggregate
        return {
            "avg_early_invocation": sum(m["early_invocation_step"] or float('inf') for m in all_metrics) / len(all_metrics),
            "avg_tool_calls": sum(m["num_tool_calls"] for m in all_metrics) / len(all_metrics),
            "avg_interactive_turns": sum(m["num_interactive_turns"] for m in all_metrics) / len(all_metrics),
            "success_rate": sum(1 for m in all_metrics if m["task_success"]) / len(all_metrics),
            "avg_efficiency": sum(m["efficiency"] for m in all_metrics) / len(all_metrics)
        }

Practical Guidance

When to Use ASPO

Training models for tool integration
Mathematical reasoning with calculators
Code generation with interpreters
Complex multi-tool workflows
Tasks requiring external computation

When NOT to Use

Pure language generation without tools
Real-time systems with strict latency
Single-tool or no-tool scenarios
Offline policy optimization only

Key Hyperparameters

early_invocation_bonus: 0.3-0.7
interactivity_bonus: 0.2-0.5
strategic_penalty: 0.05-0.2
clip_ratio: 0.2 standard PPO value
entropy_coef: 0.001-0.01

Performance Expectations

Tool Call Improvement: Early and interactive patterns
Task Success: Measurable improvement on math benchmarks
Stability: No degradation vs. standard RL
Convergence: Typically 10-50 training episodes

Reference

Researchers. (2024). Understanding Tool-Integrated Reasoning. arXiv preprint arXiv:2508.19201.

ASPO: Advantage Shaping Policy Optimization

Core Concept

Architecture Overview

Advantage Function Shaping: Direct modification of advantage estimates
Tool-Specific Reward Design: Early invocation and interactivity bonuses
Policy Optimization: Stable gradient updates based on shaped advantages
Tool Integration Patterns: Early code invocation, interactive reasoning
Training Stability: No degradation vs. standard RL

Implementation Steps

1. Design Tool-Aware Advantage Shaping

Create advantages that encourage tool usage:

import torch
import torch.nn.functional as F
from typing import Dict, List, Tuple

class ToolAwareAdvantageShaper:
    """Shape advantages to encourage effective tool integration."""

    def __init__(
        self,
        tool_names: List[str],
        early_invocation_bonus: float = 0.5,
        interactivity_bonus: float = 0.3,
        strategic_penalty: float = 0.1
    ):
        self.tool_names = tool_names
        self.early_invocation_bonus = early_invocation_bonus
        self.interactivity_bonus = interactivity_bonus
        self.strategic_penalty = strategic_penalty

    def shape_advantages(
        self,
        trajectory: Dict,  # Contains states, actions, rewards, etc.
        base_advantages: torch.Tensor,  # (trajectory_length,)
        gamma: float = 0.99
    ) -> torch.Tensor:
        """
        Shape advantages to encourage tool integration patterns.
        """
        shaped_advantages = base_advantages.clone()
        states = trajectory["states"]
        actions = trajectory["actions"]
        rewards = trajectory["rewards"]

        for t, (state, action, reward) in enumerate(zip(states, actions, rewards)):
            # Bonus 1: Early tool invocation
            if self._is_tool_call(action):
                # Bonus inversely proportional to step number
                early_bonus = self.early_invocation_bonus / (1.0 + t / 10.0)
                shaped_advantages[t] += early_bonus

            # Bonus 2: Interactivity (tool use + reasoning)
            if t > 0 and self._is_reasoning_step(actions[t-1]) and self._is_tool_call(action):
                shaped_advantages[t] += self.interactivity_bonus

            # Penalty: Poor tool selection (using wrong tool)
            if self._is_tool_call(action) and not self._is_appropriate_tool(action, state):
                shaped_advantages[t] -= self.strategic_penalty

            # Advantage is cumulative: early decisions affect downstream
            if t < len(shaped_advantages) - 1:
                shaped_advantages[t+1] += 0.5 * shaped_advantages[t]

        return shaped_advantages

    def _is_tool_call(self, action: str) -> bool:
        """Check if action is a tool invocation."""
        for tool_name in self.tool_names:
            if tool_name in action.lower():
                return True
        return False

    def _is_reasoning_step(self, action: str) -> bool:
        """Check if action is a thinking/reasoning step."""
        reasoning_keywords = ["think", "reason", "because", "let's", "note that"]
        return any(kw in action.lower() for kw in reasoning_keywords)

    def _is_appropriate_tool(self, action: str, state: Dict) -> bool:
        """Check if tool selection matches task context."""
        task_description = state.get("task", "")

        # Rule-based matching
        if "math" in task_description.lower() and "calculator" in action.lower():
            return True
        if "code" in task_description.lower() and "interpreter" in action.lower():
            return True
        if "search" in task_description.lower() and "search" in action.lower():
            return True

        return False

2. Implement ASPO Optimization

Perform policy optimization with shaped advantages:

class ASPOOptimizer:
    """Advantage Shaping Policy Optimization."""

    def __init__(
        self,
        model: "LLM",
        advantage_shaper: ToolAwareAdvantageShaper,
        learning_rate: float = 1e-5,
        clip_ratio: float = 0.2,
        entropy_coef: float = 0.01
    ):
        self.model = model
        self.advantage_shaper = advantage_shaper
        self.optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
        self.clip_ratio = clip_ratio
        self.entropy_coef = entropy_coef

    def compute_aspo_loss(
        self,
        trajectory: Dict,
        old_log_probs: torch.Tensor,  # Probability under old policy
        shaped_advantages: torch.Tensor,
        value_estimates: torch.Tensor = None
    ) -> Tuple[torch.Tensor, Dict[str, float]]:
        """
        Compute ASPO loss using shaped advantages.
        """
        states = trajectory["states"]
        actions = trajectory["actions"]
        returns = trajectory["returns"]

        # Get new log probabilities under current policy
        new_log_probs_list = []
        entropy = 0.0

        for state, action in zip(states, actions):
            logits = self.model.get_action_logits(state)
            log_probs = F.log_softmax(logits, dim=-1)
            action_idx = self._encode_action(action)

            new_log_probs = log_probs[action_idx]
            new_log_probs_list.append(new_log_probs)

            # Entropy for exploration
            entropy += -(log_probs * torch.exp(log_probs)).sum()

        new_log_probs = torch.stack(new_log_probs_list)

        # PPO-style clipped objective
        ratio = torch.exp(new_log_probs - old_log_probs)

        # Clipped objective with shaped advantages
        surr1 = ratio * shaped_advantages
        surr2 = torch.clamp(ratio, 1 - self.clip_ratio, 1 + self.clip_ratio) * shaped_advantages

        policy_loss = -torch.min(surr1, surr2).mean()

        # Value loss (if using value estimates)
        value_loss = 0.0
        if value_estimates is not None:
            value_loss = F.mse_loss(value_estimates, returns)

        # Entropy bonus for exploration
        entropy_loss = -self.entropy_coef * entropy / len(states)

        # Total loss
        total_loss = policy_loss + 0.5 * value_loss + entropy_loss

        metrics = {
            "policy_loss": policy_loss.item(),
            "value_loss": value_loss.item() if isinstance(value_loss, torch.Tensor) else value_loss,
            "entropy": entropy.item() / len(states),
            "total_loss": total_loss.item(),
            "avg_ratio": ratio.mean().item()
        }

        return total_loss, metrics

    def train_step(
        self,
        trajectory: Dict,
        num_updates: int = 3
    ) -> Dict[str, float]:
        """
        Execute ASPO training step with multiple passes over trajectory.
        """
        # Compute base advantages (TD or GAE)
        base_advantages = self._compute_advantages(trajectory)

        # Shape advantages for tool integration
        shaped_advantages = self.advantage_shaper.shape_advantages(
            trajectory, base_advantages
        )

        # Normalize advantages
        shaped_advantages = (shaped_advantages - shaped_advantages.mean()) / \
                           (shaped_advantages.std() + 1e-8)

        # Store old log probs
        old_log_probs = self._get_old_log_probs(trajectory)

        # Multiple passes for stable optimization
        all_metrics = []
        for update in range(num_updates):
            loss, metrics = self.compute_aspo_loss(
                trajectory,
                old_log_probs,
                shaped_advantages
            )

            self.optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
            self.optimizer.step()

            all_metrics.append(metrics)

        # Average metrics across updates
        avg_metrics = {
            key: sum(m[key] for m in all_metrics) / len(all_metrics)
            for key in all_metrics[0].keys()
        }

        return avg_metrics

    def _compute_advantages(
        self,
        trajectory: Dict,
        gamma: float = 0.99,
        lambda_: float = 0.95
    ) -> torch.Tensor:
        """Compute Generalized Advantage Estimation (GAE)."""
        rewards = torch.tensor(trajectory["rewards"], dtype=torch.float32)
        values = self._estimate_values(trajectory["states"])

        advantages = []
        gae = 0

        for t in reversed(range(len(rewards))):
            if t == len(rewards) - 1:
                next_value = 0
            else:
                next_value = values[t + 1]

            delta = rewards[t] + gamma * next_value - values[t]
            gae = delta + gamma * lambda_ * gae
            advantages.insert(0, gae)

        return torch.tensor(advantages, dtype=torch.float32)

    def _estimate_values(self, states: List[Dict]) -> torch.Tensor:
        """Estimate state values for advantage computation."""
        values = []
        for state in states:
            # Use model's internal value estimate if available
            value = self.model.estimate_value(state)
            values.append(value)
        return torch.tensor(values)

    def _get_old_log_probs(self, trajectory: Dict) -> torch.Tensor:
        """Get log probs of actions under policy at collection time."""
        # In practice: these would be stored during trajectory collection
        return torch.zeros(len(trajectory["actions"]))

    def _encode_action(self, action: str) -> int:
        """Encode action string to index."""
        # In practice: implement action encoding
        return 0

3. Implement Tool Integration Evaluation

Measure tool usage quality:

class ToolIntegrationEvaluator:
    """Evaluate quality of tool integration patterns."""

    def __init__(self, task_verifier: "TaskVerifier"):
        self.task_verifier = task_verifier

    def evaluate_tool_usage(
        self,
        trajectory: Dict,
        task_description: str
    ) -> Dict[str, float]:
        """
        Evaluate tool integration quality.
        """
        states = trajectory["states"]
        actions = trajectory["actions"]
        final_answer = trajectory.get("final_answer", "")

        metrics = {
            "early_invocation_step": None,
            "num_tool_calls": 0,
            "num_interactive_turns": 0,
            "tool_appropriateness": 0.0,
            "task_success": False,
            "efficiency": 0.0
        }

        # Count tool invocations
        tool_call_steps = []
        interactive_turns = 0

        for t, action in enumerate(actions):
            if self._is_tool_call(action):
                metrics["num_tool_calls"] += 1
                tool_call_steps.append(t)

                # Check if preceded by reasoning
                if t > 0 and self._is_reasoning(actions[t-1]):
                    interactive_turns += 1

        # Early invocation metric
        if tool_call_steps:
            metrics["early_invocation_step"] = tool_call_steps[0]
            metrics["efficiency"] = 1.0 / (1.0 + tool_call_steps[0] / 10.0)

        metrics["num_interactive_turns"] = interactive_turns

        # Tool appropriateness (simplified)
        metrics["tool_appropriateness"] = min(
            1.0, metrics["num_tool_calls"] / 3.0  # Target: 3 tool calls
        )

        # Task success
        metrics["task_success"] = self.task_verifier.verify(final_answer, task_description)

        return metrics

    def _is_tool_call(self, action: str) -> bool:
        tool_keywords = ["calculator", "interpreter", "search", "api", "tool"]
        return any(kw in action.lower() for kw in tool_keywords)

    def _is_reasoning(self, action: str) -> bool:
        reasoning_keywords = ["think", "reason", "note", "consider", "because"]
        return any(kw in action.lower() for kw in reasoning_keywords)

    def evaluate_on_benchmark(
        self,
        model: "LLM",
        benchmark: List[Dict],
        num_trajectories_per_task: int = 5
    ) -> Dict[str, float]:
        """Evaluate on benchmark dataset."""
        all_metrics = []

        for task in benchmark:
            for _ in range(num_trajectories_per_task):
                trajectory = model.collect_trajectory(task["description"])
                metrics = self.evaluate_tool_usage(trajectory, task["description"])
                all_metrics.append(metrics)

        # Aggregate
        return {
            "avg_early_invocation": sum(m["early_invocation_step"] or float('inf') for m in all_metrics) / len(all_metrics),
            "avg_tool_calls": sum(m["num_tool_calls"] for m in all_metrics) / len(all_metrics),
            "avg_interactive_turns": sum(m["num_interactive_turns"] for m in all_metrics) / len(all_metrics),
            "success_rate": sum(1 for m in all_metrics if m["task_success"]) / len(all_metrics),
            "avg_efficiency": sum(m["efficiency"] for m in all_metrics) / len(all_metrics)
        }

Practical Guidance

When to Use ASPO

Training models for tool integration
Mathematical reasoning with calculators
Code generation with interpreters
Complex multi-tool workflows
Tasks requiring external computation

When NOT to Use

Pure language generation without tools
Real-time systems with strict latency
Single-tool or no-tool scenarios
Offline policy optimization only

Key Hyperparameters

early_invocation_bonus: 0.3-0.7
interactivity_bonus: 0.2-0.5
strategic_penalty: 0.05-0.2
clip_ratio: 0.2 standard PPO value
entropy_coef: 0.001-0.01

Performance Expectations

Tool Call Improvement: Early and interactive patterns
Task Success: Measurable improvement on math benchmarks
Stability: No degradation vs. standard RL
Convergence: Typically 10-50 training episodes

Reference

Researchers. (2024). Understanding Tool-Integrated Reasoning. arXiv preprint arXiv:2508.19201.

Adoption

ADu2021/aspo-advantage-shaping-policy-optimization

$ install --global

Security Scan Results

SKILL.md

ASPO: Advantage Shaping Policy Optimization

Core Concept

Architecture Overview

Implementation Steps

1. Design Tool-Aware Advantage Shaping

2. Implement ASPO Optimization

3. Implement Tool Integration Evaluation

Practical Guidance

When to Use ASPO

When NOT to Use

Key Hyperparameters

Performance Expectations

Reference

Related Skills

ADu2021/flow-map-trajectory-tilting

ADu2021/flexible-data-mixture-of-experts

ADu2021/flexibility-trap-diffusion-reasoning

ADu2021/flex-continuous-agent-evolution

ADu2021/aspo-advantage-shaping-policy-optimization

$ install --global

Security Scan Results

SKILL.md

ASPO: Advantage Shaping Policy Optimization

Core Concept

Architecture Overview

Implementation Steps

1. Design Tool-Aware Advantage Shaping

2. Implement ASPO Optimization

3. Implement Tool Integration Evaluation

Practical Guidance

When to Use ASPO

When NOT to Use

Key Hyperparameters

Performance Expectations

Reference

Related Skills

ADu2021/flow-map-trajectory-tilting

ADu2021/flexible-data-mixture-of-experts

ADu2021/flexibility-trap-diffusion-reasoning

ADu2021/flex-continuous-agent-evolution