skills/skillxiv-v0.0.2-claude-opus-4.6/balanced-policy-optimization-rl/SKILL.md
Stabilize off-policy RL for LLMs using adaptive clipping that dynamically rebalances positive/negative gradients and preserves entropy, improving mathematical reasoning performance vs standard PPO.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

`npx skillsauth add ADu2021/skillXiv balanced-policy-optimization-rl`
Off-policy reinforcement learning for LLMs faces two fundamental instability problems: positive-advantage samples get drowned out by negative-advantage samples during gradient updates, and fixed clipping (PPO-style) systematically suppresses entropy-increasing updates, leading to premature convergence and over-exploitation.
BAPO solves both problems through adaptive clipping: instead of fixed clip ratios, dynamically adjust clipping bounds to balance gradients across positive and negative samples while explicitly preserving entropy. This enables more stable training and better final performance on complex reasoning tasks.
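As a rough sketch of what "clipping bounds" means here, the PPO-style surrogate can be written with separate lower and upper bounds. The symbols $c_{\mathrm{low}}$ and $c_{\mathrm{high}}$ are notation assumed for this illustration, not taken from the text above; standard PPO is the special case $c_{\mathrm{low}} = 1 - \epsilon$, $c_{\mathrm{high}} = 1 + \epsilon$.

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{old}}(a_t \mid s_t)},
\qquad
J(\theta) = \mathbb{E}_t\!\left[
  \min\!\Big( r_t(\theta)\, A_t,\;
  \operatorname{clip}\big(r_t(\theta),\, c_{\mathrm{low}},\, c_{\mathrm{high}}\big)\, A_t \Big)
\right]
```

Under this objective a sample contributes no gradient when $A_t > 0$ and $r_t > c_{\mathrm{high}}$, or when $A_t < 0$ and $r_t < c_{\mathrm{low}}$. The upper bound therefore controls how many positive-advantage samples are updated and the lower bound controls how many negative-advantage samples are, which is why moving the two bounds independently can shift the balance between them.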
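BAPO operates on three principles: balancing the gradient contributions of positive- and negative-advantage samples, adjusting the clipping bounds dynamically instead of fixing them, and preserving policy entropy.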
The result is more stable training curves and stronger final performance on mathematical reasoning, including state-of-the-art results on AIME.
The key innovation is replacing fixed PPO clipping with dynamic bounds that balance gradients. This example shows the core algorithm.
import torch
import torch.nn as nn
from typing import Tuple


class BalancedPolicyOptimizer:
    """
    Adaptive clipping for stable off-policy RL training.
    """

    def __init__(
        self,
        model: nn.Module,
        initial_clip_ratio: float = 0.2,
        entropy_weight: float = 0.01,
        gradient_balance_target: float = 1.0
    ):
        self.model = model
        self.clip_ratio = initial_clip_ratio
        self.entropy_weight = entropy_weight
        self.gradient_balance_target = gradient_balance_target
        self.optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

    def compute_advantages(
        self,
        rewards: torch.Tensor,
        values: torch.Tensor,
        gamma: float = 0.99,
        gae_lambda: float = 0.95
    ) -> torch.Tensor:
        """
        Compute generalized advantage estimates (GAE).

        Args:
            rewards: (batch, seq_len)
            values: (batch, seq_len) value estimates
        Returns:
            advantages: (batch, seq_len)
        """
        advantages = torch.zeros_like(rewards)
        seq_len = rewards.shape[1]
        # Running GAE accumulator, one entry per sequence in the batch
        gae = torch.zeros_like(rewards[:, 0])
        # Walk backwards over time steps (dim 1), not over the batch (dim 0)
        for t in reversed(range(seq_len)):
            if t == seq_len - 1:
                # No bootstrap value past the final step
                next_value = torch.zeros_like(values[:, t])
            else:
                next_value = values[:, t + 1]
            delta = rewards[:, t] + gamma * next_value - values[:, t]
            gae = delta + gamma * gae_lambda * gae
            advantages[:, t] = gae
        return advantages

    def compute_policy_loss_with_adaptive_clipping(
        self,
        old_log_probs: torch.Tensor,
        new_log_probs: torch.Tensor,
        advantages: torch.Tensor,
        entropy: torch.Tensor
    ) -> Tuple[torch.Tensor, dict]:
        """
        Compute policy loss with adaptive clipping.
        """
        # Importance ratio for off-policy correction
        ratio = torch.exp(new_log_probs - old_log_probs)

        # Separate positive and negative advantages
        positive_mask = advantages > 0
        negative_mask = advantages <= 0
        zeros = torch.zeros_like(advantages)

        # Standard PPO clipping with the current symmetric range
        clipped_ratio = torch.clamp(ratio, 1 - self.clip_ratio, 1 + self.clip_ratio)
        # Pessimistic PPO surrogate: min() is the bound for both advantage signs
        surrogate = torch.min(ratio * advantages, clipped_ratio * advantages)
        positive_loss = torch.where(positive_mask, surrogate, zeros)
        negative_loss = torch.where(negative_mask, surrogate, zeros)

        # Dominance check: compare surrogate magnitudes as a proxy for the
        # gradient contribution of negative vs positive samples
        positive_grad = positive_loss.sum().detach()
        negative_grad = negative_loss.sum().detach()
        gradient_ratio = torch.abs(negative_grad) / (torch.abs(positive_grad) + 1e-8)

        # Adaptive clipping: if negatives dominate, tighten their clip range.
        # A narrower range clips more negative-advantage samples, so fewer of
        # them pass gradient and they contribute less to the update.
        adaptive_clip_negative = self.clip_ratio
        if gradient_ratio > self.gradient_balance_target:
            adaptive_clip_negative = self.clip_ratio / (1.0 + 0.5 * gradient_ratio.item())
            adaptive_ratio_neg = torch.clamp(
                ratio,
                1 - adaptive_clip_negative,
                1 + adaptive_clip_negative
            )
            negative_loss = torch.where(
                negative_mask,
                torch.min(ratio * advantages, adaptive_ratio_neg * advantages),
                zeros
            )

        # Combine losses (negated because the optimizer minimizes)
        policy_loss = -(positive_loss + negative_loss).mean()

        # Entropy bonus: encourage exploration
        entropy_loss = -self.entropy_weight * entropy.mean()
        total_loss = policy_loss + entropy_loss

        return total_loss, {
            "policy_loss": policy_loss.item(),
            "entropy_bonus": entropy_loss.item(),
            "gradient_ratio": gradient_ratio.item(),
            # Clip range actually applied to negative samples in this step
            "adaptive_clip": adaptive_clip_negative
        }

    def update_adaptive_clip_ratio(self, gradient_ratio: float):
        """
        Adjust clipping ratio based on gradient balance.
        Higher gradient_ratio -> more negative dominance -> increase clip range.
        """
        if gradient_ratio > self.gradient_balance_target:
            # Negative samples dominate: increase clip ratio to let positives breathe
            self.clip_ratio = min(self.clip_ratio * 1.1, 0.5)
        elif gradient_ratio < self.gradient_balance_target * 0.5:
            # Positive samples dominate: decrease clip ratio to reduce variance
            self.clip_ratio = max(self.clip_ratio * 0.9, 0.1)


def train_step_with_bapo(
    model: nn.Module,
    optimizer: BalancedPolicyOptimizer,
    old_log_probs: torch.Tensor,
    new_log_probs: torch.Tensor,
    rewards: torch.Tensor,
    values: torch.Tensor,
    entropy: torch.Tensor
) -> dict:
    """
    Single training step with BAPO.
    """
    # Compute advantages
    advantages = optimizer.compute_advantages(rewards, values)

    # Normalize advantages for stability
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # Compute loss with adaptive clipping
    loss, metrics = optimizer.compute_policy_loss_with_adaptive_clipping(
        old_log_probs,
        new_log_probs,
        advantages,
        entropy
    )

    # Update model
    optimizer.optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.optimizer.step()

    # Adapt clipping ratio for next step
    optimizer.update_adaptive_clip_ratio(metrics["gradient_ratio"])
    return metrics
The key innovation is monitoring gradient contributions from positive vs negative samples and dynamically adjusting clipping to balance them. This prevents negative samples from suppressing beneficial positive updates.
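The balancing idea can also be illustrated without any tensors. The following is a minimal sketch, not the reference implementation: it assumes each sample is a pair of importance ratio and advantage, measures "contribution" as the magnitude of the unclipped surrogate term, and raises the two bounds in small steps until positive samples reach a target share. The names `positive_share` and `adapt_bounds`, the step size, the target, and the caps are all hypothetical choices made for this example.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (importance ratio, advantage); hypothetical toy format


def positive_share(samples: List[Sample], clip_low: float, clip_high: float) -> float:
    """Share of unclipped surrogate magnitude that comes from positive-advantage samples."""
    pos = neg = 0.0
    for ratio, adv in samples:
        if adv > 0 and ratio <= clip_high:
            # Positive samples lose their gradient only above clip_high
            pos += abs(ratio * adv)
        elif adv < 0 and ratio >= clip_low:
            # Negative samples lose their gradient only below clip_low
            neg += abs(ratio * adv)
    total = pos + neg
    return pos / total if total > 0 else 0.0


def adapt_bounds(
    samples: List[Sample],
    clip_low: float = 0.8,
    clip_high: float = 1.2,
    target: float = 0.5,    # assumed target share for positive samples
    step: float = 0.05,     # assumed step size for both bounds
    max_low: float = 0.95,  # assumed caps so the loop always terminates
    max_high: float = 2.0,
) -> Tuple[float, float]:
    """Raise the bounds step by step until positives reach the target share."""
    while positive_share(samples, clip_low, clip_high) < target:
        moved = False
        if clip_high + step <= max_high:
            clip_high += step  # admit more positive samples
            moved = True
        if positive_share(samples, clip_low, clip_high) >= target:
            break
        if clip_low + step <= max_low:
            clip_low += step  # filter out more negative samples
            moved = True
        if not moved:
            break  # both bounds are at their caps
    return clip_low, clip_high


samples = [(1.28, 1.0), (1.0, 1.0), (0.9, -1.0), (1.1, -1.0), (0.75, -1.0)]
low, high = adapt_bounds(samples)
print("fixed bounds share:", positive_share(samples, 0.8, 1.2))
print("adapted bounds:", (low, high), "share:", positive_share(samples, low, high))
```

On this toy batch the fixed range leaves the positive sample with ratio 1.28 clipped, so negatives supply most of the update; after adaptation that sample is admitted and the positive share rises above the target.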
| Setting | Metric | BAPO | Standard PPO |
|---------|--------|------|-------------|
| AIME 2024 | Accuracy | 59.6% | 52.1% |
| AIME 2025 | Accuracy | 55.4% | 48.3% |
| Training stability | Variance | Lower | Higher |
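To make the size of the gap concrete, the short sketch below derives the absolute and relative accuracy gains from the figures in the table. The dictionary layout and the helper name `gains` are assumptions for illustration; no numbers beyond those in the table are used.

```python
# Accuracy figures copied from the results table (percent)
results = {
    "AIME 2024": {"BAPO": 59.6, "Standard PPO": 52.1},
    "AIME 2025": {"BAPO": 55.4, "Standard PPO": 48.3},
}


def gains(table):
    """Return {benchmark: (absolute gain in points, relative gain in percent)}."""
    out = {}
    for name, row in table.items():
        absolute = row["BAPO"] - row["Standard PPO"]
        relative = 100.0 * absolute / row["Standard PPO"]
        out[name] = (round(absolute, 1), round(relative, 1))
    return out


for benchmark, (abs_gain, rel_gain) in gains(results).items():
    print(f"{benchmark}: +{abs_gain} points ({rel_gain}% relative)")
```

Both benchmarks show a gain of roughly seven accuracy points over the PPO baseline reported in the table.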
When to Use:
When NOT to Use:
Common Pitfalls:
BAPO: Stabilizing Off-Policy RL for LLMs via Balanced Policy Optimization with Adaptive Clipping