skills/skillxiv-v0.0.2-claude-opus-4.6/epo-entropy-regularized-policy-optimization/SKILL.md
Stabilize multi-turn LLM agent training with entropy-regularized policy optimization that prevents exploration-exploitation cascade failures in sparse-reward environments through trajectory-level entropy regulation, historical smoothing, and adaptive phase-based weighting. Achieve up to 152% performance improvement on scientific reasoning tasks and 19.8% on embodied control by maintaining controlled entropy oscillations across 30+ interaction turns.
npx skillsauth add ADu2021/skillXiv epo-entropy-regularized-policy-optimizationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Training language model agents in multi-turn environments creates a unique instability challenge absent from single-turn reinforcement learning. In sparse-reward settings (where feedback comes only at episode completion), agents must explore effectively across 30+ interaction turns while sharing policy parameters across all steps. Standard entropy regularization techniques fail because entropy oscillates severely across training iterations—early in training, the policy under-explores and converges to suboptimal strategies; later, it over-explores chaotically, destabilizing learning dynamics.
This "exploration-exploitation cascade failure" occurs because per-turn entropy adjustments cannot independently control exploration and exploitation when parameters are shared. Intermediate trajectories provide no reward signal to guide adjustment, forcing the agent into oscillatory dynamics that prevent convergence. Existing methods either apply insufficient regularization (allowing oscillations) or rigid entropy targets (causing instability when optimal entropy varies).
EPO addresses this through three coordinated mechanisms that work together: trajectory-level entropy aggregation (computing entropy across entire episodes rather than individual turns), entropy smoothing via historical anchoring (maintaining entropy within dynamically bounded corridors), and adaptive phase-based weighting (controlling regularization strength across training phases).
The key insight is treating entropy regularization as a constrained optimization problem where the constraint corridor adapts based on historical entropy patterns. Early in training, tight corridors enforce controlled exploration; as training progresses, the corridor relaxes, allowing the policy to settle into its natural entropy range. This curriculum-style approach stabilizes dynamics while preserving learning efficiency.
Entropy must be aggregated across all turns to capture the agent's global exploration state. Rather than tracking entropy per turn (which creates misaligned gradient signals), compute the mean entropy across the entire trajectory batch.
import torch
import torch.nn.functional as F
def compute_trajectory_entropy(logits_per_turn, attention_mask=None):
"""
Compute entropy across all turns in trajectories.
Args:
logits_per_turn: Dict mapping turn_id -> [batch_size, seq_len, vocab_size]
attention_mask: Dict mapping turn_id -> [batch_size, seq_len] (optional)
Returns:
trajectory_entropy: Scalar tensor, mean entropy across batch and turns
"""
entropies = []
for turn_id, logits in logits_per_turn.items():
# Compute entropy per token
probs = F.softmax(logits, dim=-1)
token_entropy = -torch.sum(probs * F.log_softmax(logits, dim=-1), dim=-1)
# Apply attention mask if provided
if attention_mask is not None:
mask = attention_mask[turn_id]
token_entropy = token_entropy * mask
token_entropy = token_entropy.sum() / mask.sum()
else:
token_entropy = token_entropy.mean()
entropies.append(token_entropy)
# Average entropy across all turns in trajectory
trajectory_entropy = torch.stack(entropies).mean()
return trajectory_entropy
This approach captures whether the agent is exploring (high entropy across all steps) or exploiting (low entropy), independent of which individual turn contributes entropy.
The entropy history window stores past entropy values to compute adaptive corridor bounds. This anchoring mechanism dampens oscillations by establishing reference points for what "normal" entropy looks like during training.
class EntropyHistoryBuffer:
"""Maintain rolling entropy history for corridor-based regularization."""
def __init__(self, window_size=20):
self.window_size = window_size
self.history = []
def add(self, entropy_value):
"""Add current trajectory entropy to history."""
self.history.append(float(entropy_value.detach().cpu()))
# Keep only recent window
if len(self.history) > self.window_size:
self.history.pop(0)
def get_corridor_bounds(self, kappa_left=0.5, kappa_right=1.5):
"""
Compute dynamic corridor bounds based on historical entropy.
Args:
kappa_left: Left bound as fraction of historical mean
kappa_right: Right bound as fraction of historical mean
Returns:
lower_bound, upper_bound: Adaptive entropy corridor
"""
if len(self.history) < 5:
# Early training: use constant bounds
return None, None
mean_entropy = sum(self.history) / len(self.history)
lower_bound = kappa_left * mean_entropy
upper_bound = kappa_right * mean_entropy
return lower_bound, upper_bound
def get_mean_entropy(self):
"""Return mean entropy from history."""
if not self.history:
return 0.0
return sum(self.history) / len(self.history)
The corridor bounds dynamically adjust as the policy's natural entropy settles, allowing the regularization to become less restrictive as training progresses.
The smoothing loss penalizes entropy deviations from the corridor. Critically, penalties apply only when entropy falls outside bounds—not every update requires correction, reducing unnecessary gradient noise.
def compute_entropy_smoothing_loss(current_entropy, entropy_buffer,
kappa_left=0.5, kappa_right=1.5):
"""
Compute penalty for entropy deviations from historical corridor.
Args:
current_entropy: Scalar tensor, trajectory entropy from current batch
entropy_buffer: EntropyHistoryBuffer instance
kappa_left, kappa_right: Corridor bound multipliers
Returns:
smoothing_loss: Scalar tensor (0 if within bounds, penalty if outside)
"""
lower_bound, upper_bound = entropy_buffer.get_corridor_bounds(kappa_left, kappa_right)
if lower_bound is None:
# Insufficient history: no smoothing penalty yet
return torch.tensor(0.0, device=current_entropy.device)
smoothing_loss = torch.tensor(0.0, device=current_entropy.device)
# Penalize deviations beyond corridor bounds
if current_entropy < lower_bound:
deviation = lower_bound - current_entropy
smoothing_loss = deviation ** 2
elif current_entropy > upper_bound:
deviation = current_entropy - upper_bound
smoothing_loss = deviation ** 2
return smoothing_loss
This formulation differs from hard constraints—soft penalties allow occasional deviations while still discouraging sustained oscillation.
The dynamic schedule controls how strongly entropy smoothing is enforced throughout training. Early phases apply strong constraints to establish controlled exploration; later phases relax constraints as the policy stabilizes.
def compute_adaptive_weight(k, k_mid=60, gamma=1.0):
"""
Compute dynamic weighting coefficient βₖ across training phases.
The schedule implements: βₖ = 1 + e^(-γk/k_mid)
- Early phase (k << k_mid): βₖ ≈ 2 (strong regularization)
- Mid phase (k ≈ k_mid): βₖ ≈ 1.37 (moderate regularization)
- Late phase (k >> k_mid): βₖ ≈ 1 (minimal regularization)
Args:
k: Current training step
k_mid: Midpoint step where transition occurs (suggests k_mid ≈ 60% of total steps)
gamma: Steepness of transition (higher = sharper transition)
Returns:
beta_k: Adaptive weighting coefficient
"""
exponent = -gamma * k / k_mid
beta_k = 1.0 + torch.exp(torch.tensor(exponent, dtype=torch.float32))
return beta_k.item()
The exponential schedule naturally decays regularization without requiring threshold tuning—phase transitions happen automatically based on training progress.
The final EPO objective integrates entropy regularization with standard policy loss. The multi-turn policy loss (from PPO or GRPO) provides gradient direction; entropy components modulate exploration behavior.
def compute_epo_loss(policy_loss, entropy_loss, smoothing_loss,
beta_k, lambda_entropy=0.1, alpha=0.1):
"""
Compute final EPO loss combining multi-turn RL with entropy regularization.
Loss formula: L_EPO = L_policy - λ[L_entropy - β_k * α * L_smooth]
Args:
policy_loss: Tensor from PPO/GRPO (multi-turn RL objective)
entropy_loss: Trajectory-level entropy regularization loss
smoothing_loss: Penalty for corridor deviations
beta_k: Adaptive weight for current training phase
lambda_entropy: Entropy regularization coefficient (default: 0.1)
alpha: Smoothing importance relative to entropy (default: 0.1)
Returns:
epo_loss: Final scalar loss for backpropagation
loss_dict: Dict with component losses for logging
"""
# Entropy regularization: encourage exploration via entropy
entropy_reg = lambda_entropy * entropy_loss
# Smoothing penalty: prevent entropy oscillation
smoothing_penalty = beta_k * alpha * smoothing_loss
# Combined: entropy encourages exploration, smoothing prevents oscillation
entropy_component = entropy_reg - smoothing_penalty
# Final EPO loss: policy gradient minus entropy modulation
epo_loss = policy_loss - entropy_component
loss_dict = {
'policy_loss': policy_loss.item(),
'entropy_loss': entropy_loss.item(),
'smoothing_loss': smoothing_loss.item(),
'beta_k': beta_k,
'total_entropy_component': entropy_component.item(),
'epo_loss': epo_loss.item()
}
return epo_loss, loss_dict
The structure preserves policy gradient signals while using entropy as a separate regularization channel—this maintains credit assignment integrity compared to approaches that reshape advantages.
The complete training loop coordinates entropy history tracking, corridor adaptation, and loss computation. This integration is where EPO prevents cascade failures.
def training_step(model, policy_optimizer, batch_trajectories,
entropy_buffer, k, total_steps,
kappa_left=0.5, kappa_right=1.5,
lambda_entropy=0.1, alpha=0.1, gamma=1.0):
"""
Single EPO training step with adaptive entropy regularization.
Args:
model: LLM agent model
policy_optimizer: PyTorch optimizer
batch_trajectories: List of complete trajectories from rollout
entropy_buffer: EntropyHistoryBuffer for corridor computation
k: Current training step
total_steps: Total training steps (for scheduling)
kappa_left, kappa_right: Corridor bound parameters
lambda_entropy, alpha: Loss weighting hyperparameters
gamma: Schedule steepness parameter
Returns:
loss_dict: Logged metrics for monitoring
"""
policy_optimizer.zero_grad()
# 1. Collect logits from all turns across batch
logits_per_turn = {}
all_rewards = []
for trajectory in batch_trajectories:
# Forward pass through model for each turn
for turn_idx, turn_data in enumerate(trajectory['turns']):
if turn_idx not in logits_per_turn:
logits_per_turn[turn_idx] = []
logits = model(turn_data['input_ids'],
attention_mask=turn_data['attention_mask'])
logits_per_turn[turn_idx].append(logits)
all_rewards.append(trajectory['reward'])
# Stack logits by turn and convert rewards to tensor
for turn_idx in logits_per_turn:
logits_per_turn[turn_idx] = torch.stack(logits_per_turn[turn_idx])
rewards = torch.tensor(all_rewards, dtype=torch.float32)
# 2. Compute trajectory-level entropy
trajectory_entropy = compute_trajectory_entropy(logits_per_turn)
entropy_loss = -trajectory_entropy # Negative because we subtract entropy in loss
# 3. Compute smoothing loss with current corridor
smoothing_loss = compute_entropy_smoothing_loss(
trajectory_entropy, entropy_buffer, kappa_left, kappa_right
)
# 4. Compute policy loss (PPO/GRPO-style)
# This is simplified; real implementation needs advantage estimation
policy_loss = compute_policy_loss(logits_per_turn, rewards)
# 5. Get adaptive weight for current training phase
k_mid = int(total_steps * 0.6)
beta_k = compute_adaptive_weight(k, k_mid, gamma)
# 6. Combine into EPO loss
epo_loss, loss_dict = compute_epo_loss(
policy_loss, entropy_loss, smoothing_loss, beta_k,
lambda_entropy, alpha
)
# 7. Backward pass and optimizer step
epo_loss.backward()
policy_optimizer.step()
# 8. Update entropy history with current trajectory entropy
entropy_buffer.add(trajectory_entropy)
# Add step info to logged metrics
loss_dict['training_step'] = k
return loss_dict
This integration shows how entropy history, adaptive weights, and corridor bounds coordinate during training to prevent oscillations.
| Parameter | Default | Range | Role | |-----------|---------|-------|------| | λ (lambda_entropy) | 0.1 | 0.01-1.0 | Entropy regularization strength; higher encourages more exploration | | κₗ (kappa_left) | 0.5 | 0.1-0.7 | Lower corridor bound as fraction of historical mean entropy | | κᵣ (kappa_right) | 1.5 | 1.2-3.0 | Upper corridor bound as fraction of historical mean entropy | | γ (gamma) | 1.0 | 0.5-2.0 | Schedule steepness; higher transitions faster from exploration to exploitation | | k_mid | 60% of total steps | 50-70% | Midpoint where phase transition occurs; set to ~60% of total training steps | | α (alpha) | 0.1 | 0.05-0.5 | Smoothing penalty relative to entropy regularization | | window_size | 20 | 10-50 | Historical window for corridor bounds; larger = more stable bounds, slower adaptation |
Setting k_mid: This should typically be set to 60% of your total training steps. For a 100-step training run, use k_mid=60. For 200 steps, use k_mid=120. This places the strongest regularization in the first 40% of training while allowing relaxation afterward.
Choosing corridor bounds: Start with κₗ=0.5 and κᵣ=1.5. These values create a 3× width corridor around historical mean entropy (0.5× to 1.5×). If entropy oscillates too much, tighten bounds (reduce κᵣ to 1.2, increase κₗ to 0.6). If policy under-explores, widen bounds.
λ parameter tuning: Begin with λ=0.1. Check training logs for entropy trends: if entropy steadily increases, reduce λ; if it plateaus, increase λ. The smoothing mechanism should prevent extreme oscillations even with modest λ values.
Setting k_mid too low: If phase transition happens too early (k_mid < 40% of steps), the policy relaxes constraints before stabilizing. Use k_mid = 0.6 × total_steps as default.
Corridor bounds too tight: κₗ=0.7, κᵣ=1.2 severely restricts entropy changes. This produces rigid exploration that may prevent optimal behavior discovery. Start wide (0.5-1.5), tighten only if oscillations persist.
Not tracking entropy in logs: Without visibility into entropy trends, you cannot diagnose whether oscillation or under-exploration is occurring. Always log trajectory_entropy at every step.
Using constant window_size: Early in training with only 3-5 history samples, bounds computed from small windows are unstable. Either skip smoothing until window has 10+ samples, or use adaptive window sizing.
Forgetting reward scaling: If rewards vary wildly across episodes, entropy regularization strength becomes inconsistent relative to policy gradients. Normalize or clip rewards before computing policy loss.
Applying to non-multi-turn settings: EPO's trajectory aggregation assumes long episodes. For single-step or very short tasks, use standard per-step entropy regularization instead.
Reference: EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning - Wujiang Xu et al. GitHub: https://github.com/WujiangXu/EPO
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.