EPO: Entropy-Regularized Policy Optimization for LLM Agents

Problem Context

Training language model agents in multi-turn environments creates a unique instability challenge absent from single-turn reinforcement learning. In sparse-reward settings (where feedback comes only at episode completion), agents must explore effectively across 30+ interaction turns while sharing policy parameters across all steps. Standard entropy regularization techniques fail because entropy oscillates severely across training iterations—early in training, the policy under-explores and converges to suboptimal strategies; later, it over-explores chaotically, destabilizing learning dynamics.

This "exploration-exploitation cascade failure" occurs because per-turn entropy adjustments cannot independently control exploration and exploitation when parameters are shared. Intermediate trajectories provide no reward signal to guide adjustment, forcing the agent into oscillatory dynamics that prevent convergence. Existing methods either apply insufficient regularization (allowing oscillations) or rigid entropy targets (causing instability when optimal entropy varies).

Core Concept

EPO addresses this through three coordinated mechanisms that work together: trajectory-level entropy aggregation (computing entropy across entire episodes rather than individual turns), entropy smoothing via historical anchoring (maintaining entropy within dynamically bounded corridors), and adaptive phase-based weighting (controlling regularization strength across training phases).

The key insight is treating entropy regularization as a constrained optimization problem where the constraint corridor adapts based on historical entropy patterns. Early in training, tight corridors enforce controlled exploration; as training progresses, the corridor relaxes, allowing the policy to settle into its natural entropy range. This curriculum-style approach stabilizes dynamics while preserving learning efficiency.

Architecture Overview

Trajectory-level entropy computation: Aggregate entropy across all dialogue turns within each trajectory, then average over batch to capture global exploration state rather than per-step entropy
Entropy corridor mechanism: Maintain rolling window of historical entropy values with left bound κₗ and right bound κᵣ; apply penalty only when current entropy deviates beyond these bounds
Dynamic weighting schedule: Use exponential annealing with midpoint γ to modulate penalty strength βₖ = 1 + e^(-γk/k_mid), applying strong regularization early and relaxing as training stabilizes
Integration with policy optimizers: Compatible with PPO and GRPO—entropy loss separates from policy gradient, preserving credit assignment integrity
Adaptive penalty application: Penalize deviations from corridor only when needed, reducing unnecessary optimization overhead compared to constant regularization

Implementation

Step 1: Compute Trajectory-Level Entropy

Entropy must be aggregated across all turns to capture the agent's global exploration state. Rather than tracking entropy per turn (which creates misaligned gradient signals), compute the mean entropy across the entire trajectory batch.

import torch
import torch.nn.functional as F

def compute_trajectory_entropy(logits_per_turn, attention_mask=None):
    """
    Compute entropy across all turns in trajectories.

    Args:
        logits_per_turn: Dict mapping turn_id -> [batch_size, seq_len, vocab_size]
        attention_mask: Dict mapping turn_id -> [batch_size, seq_len] (optional)

    Returns:
        trajectory_entropy: Scalar tensor, mean entropy across batch and turns
    """
    entropies = []

    for turn_id, logits in logits_per_turn.items():
        # Compute entropy per token
        probs = F.softmax(logits, dim=-1)
        token_entropy = -torch.sum(probs * F.log_softmax(logits, dim=-1), dim=-1)

        # Apply attention mask if provided
        if attention_mask is not None:
            mask = attention_mask[turn_id]
            token_entropy = token_entropy * mask
            token_entropy = token_entropy.sum() / mask.sum()
        else:
            token_entropy = token_entropy.mean()

        entropies.append(token_entropy)

    # Average entropy across all turns in trajectory
    trajectory_entropy = torch.stack(entropies).mean()
    return trajectory_entropy

This approach captures whether the agent is exploring (high entropy across all steps) or exploiting (low entropy), independent of which individual turn contributes entropy.

Step 2: Initialize and Maintain Entropy History

The entropy history window stores past entropy values to compute adaptive corridor bounds. This anchoring mechanism dampens oscillations by establishing reference points for what "normal" entropy looks like during training.

class EntropyHistoryBuffer:
    """Maintain rolling entropy history for corridor-based regularization."""

    def __init__(self, window_size=20):
        self.window_size = window_size
        self.history = []

    def add(self, entropy_value):
        """Add current trajectory entropy to history."""
        self.history.append(float(entropy_value.detach().cpu()))
        # Keep only recent window
        if len(self.history) > self.window_size:
            self.history.pop(0)

    def get_corridor_bounds(self, kappa_left=0.5, kappa_right=1.5):
        """
        Compute dynamic corridor bounds based on historical entropy.

        Args:
            kappa_left: Left bound as fraction of historical mean
            kappa_right: Right bound as fraction of historical mean

        Returns:
            lower_bound, upper_bound: Adaptive entropy corridor
        """
        if len(self.history) < 5:
            # Early training: use constant bounds
            return None, None

        mean_entropy = sum(self.history) / len(self.history)
        lower_bound = kappa_left * mean_entropy
        upper_bound = kappa_right * mean_entropy

        return lower_bound, upper_bound

    def get_mean_entropy(self):
        """Return mean entropy from history."""
        if not self.history:
            return 0.0
        return sum(self.history) / len(self.history)

The corridor bounds dynamically adjust as the policy's natural entropy settles, allowing the regularization to become less restrictive as training progresses.

Step 3: Compute Entropy Smoothing Loss

The smoothing loss penalizes entropy deviations from the corridor. Critically, penalties apply only when entropy falls outside bounds—not every update requires correction, reducing unnecessary gradient noise.

def compute_entropy_smoothing_loss(current_entropy, entropy_buffer,
                                   kappa_left=0.5, kappa_right=1.5):
    """
    Compute penalty for entropy deviations from historical corridor.

    Args:
        current_entropy: Scalar tensor, trajectory entropy from current batch
        entropy_buffer: EntropyHistoryBuffer instance
        kappa_left, kappa_right: Corridor bound multipliers

    Returns:
        smoothing_loss: Scalar tensor (0 if within bounds, penalty if outside)
    """
    lower_bound, upper_bound = entropy_buffer.get_corridor_bounds(kappa_left, kappa_right)

    if lower_bound is None:
        # Insufficient history: no smoothing penalty yet
        return torch.tensor(0.0, device=current_entropy.device)

    smoothing_loss = torch.tensor(0.0, device=current_entropy.device)

    # Penalize deviations beyond corridor bounds
    if current_entropy < lower_bound:
        deviation = lower_bound - current_entropy
        smoothing_loss = deviation ** 2
    elif current_entropy > upper_bound:
        deviation = current_entropy - upper_bound
        smoothing_loss = deviation ** 2

    return smoothing_loss

This formulation differs from hard constraints—soft penalties allow occasional deviations while still discouraging sustained oscillation.

Step 4: Implement Adaptive Phase-Based Weighting

The dynamic schedule controls how strongly entropy smoothing is enforced throughout training. Early phases apply strong constraints to establish controlled exploration; later phases relax constraints as the policy stabilizes.

def compute_adaptive_weight(k, k_mid=60, gamma=1.0):
    """
    Compute dynamic weighting coefficient βₖ across training phases.

    The schedule implements: βₖ = 1 + e^(-γk/k_mid)
    - Early phase (k << k_mid): βₖ ≈ 2 (strong regularization)
    - Mid phase (k ≈ k_mid): βₖ ≈ 1.37 (moderate regularization)
    - Late phase (k >> k_mid): βₖ ≈ 1 (minimal regularization)

    Args:
        k: Current training step
        k_mid: Midpoint step where transition occurs (suggests k_mid ≈ 60% of total steps)
        gamma: Steepness of transition (higher = sharper transition)

    Returns:
        beta_k: Adaptive weighting coefficient
    """
    exponent = -gamma * k / k_mid
    beta_k = 1.0 + torch.exp(torch.tensor(exponent, dtype=torch.float32))
    return beta_k.item()

The exponential schedule naturally decays regularization without requiring threshold tuning—phase transitions happen automatically based on training progress.

Step 5: Combine Components into EPO Loss

The final EPO objective integrates entropy regularization with standard policy loss. The multi-turn policy loss (from PPO or GRPO) provides gradient direction; entropy components modulate exploration behavior.

def compute_epo_loss(policy_loss, entropy_loss, smoothing_loss,
                     beta_k, lambda_entropy=0.1, alpha=0.1):
    """
    Compute final EPO loss combining multi-turn RL with entropy regularization.

    Loss formula: L_EPO = L_policy - λ[L_entropy - β_k * α * L_smooth]

    Args:
        policy_loss: Tensor from PPO/GRPO (multi-turn RL objective)
        entropy_loss: Trajectory-level entropy regularization loss
        smoothing_loss: Penalty for corridor deviations
        beta_k: Adaptive weight for current training phase
        lambda_entropy: Entropy regularization coefficient (default: 0.1)
        alpha: Smoothing importance relative to entropy (default: 0.1)

    Returns:
        epo_loss: Final scalar loss for backpropagation
        loss_dict: Dict with component losses for logging
    """
    # Entropy regularization: encourage exploration via entropy
    entropy_reg = lambda_entropy * entropy_loss

    # Smoothing penalty: prevent entropy oscillation
    smoothing_penalty = beta_k * alpha * smoothing_loss

    # Combined: entropy encourages exploration, smoothing prevents oscillation
    entropy_component = entropy_reg - smoothing_penalty

    # Final EPO loss: policy gradient minus entropy modulation
    epo_loss = policy_loss - entropy_component

    loss_dict = {
        'policy_loss': policy_loss.item(),
        'entropy_loss': entropy_loss.item(),
        'smoothing_loss': smoothing_loss.item(),
        'beta_k': beta_k,
        'total_entropy_component': entropy_component.item(),
        'epo_loss': epo_loss.item()
    }

    return epo_loss, loss_dict

The structure preserves policy gradient signals while using entropy as a separate regularization channel—this maintains credit assignment integrity compared to approaches that reshape advantages.

Step 6: Training Loop Integration

The complete training loop coordinates entropy history tracking, corridor adaptation, and loss computation. This integration is where EPO prevents cascade failures.

def training_step(model, policy_optimizer, batch_trajectories,
                  entropy_buffer, k, total_steps,
                  kappa_left=0.5, kappa_right=1.5,
                  lambda_entropy=0.1, alpha=0.1, gamma=1.0):
    """
    Single EPO training step with adaptive entropy regularization.

    Args:
        model: LLM agent model
        policy_optimizer: PyTorch optimizer
        batch_trajectories: List of complete trajectories from rollout
        entropy_buffer: EntropyHistoryBuffer for corridor computation
        k: Current training step
        total_steps: Total training steps (for scheduling)
        kappa_left, kappa_right: Corridor bound parameters
        lambda_entropy, alpha: Loss weighting hyperparameters
        gamma: Schedule steepness parameter

    Returns:
        loss_dict: Logged metrics for monitoring
    """
    policy_optimizer.zero_grad()

    # 1. Collect logits from all turns across batch
    logits_per_turn = {}
    all_rewards = []

    for trajectory in batch_trajectories:
        # Forward pass through model for each turn
        for turn_idx, turn_data in enumerate(trajectory['turns']):
            if turn_idx not in logits_per_turn:
                logits_per_turn[turn_idx] = []

            logits = model(turn_data['input_ids'],
                          attention_mask=turn_data['attention_mask'])
            logits_per_turn[turn_idx].append(logits)

        all_rewards.append(trajectory['reward'])

    # Stack logits by turn and convert rewards to tensor
    for turn_idx in logits_per_turn:
        logits_per_turn[turn_idx] = torch.stack(logits_per_turn[turn_idx])
    rewards = torch.tensor(all_rewards, dtype=torch.float32)

    # 2. Compute trajectory-level entropy
    trajectory_entropy = compute_trajectory_entropy(logits_per_turn)
    entropy_loss = -trajectory_entropy  # Negative because we subtract entropy in loss

    # 3. Compute smoothing loss with current corridor
    smoothing_loss = compute_entropy_smoothing_loss(
        trajectory_entropy, entropy_buffer, kappa_left, kappa_right
    )

    # 4. Compute policy loss (PPO/GRPO-style)
    # This is simplified; real implementation needs advantage estimation
    policy_loss = compute_policy_loss(logits_per_turn, rewards)

    # 5. Get adaptive weight for current training phase
    k_mid = int(total_steps * 0.6)
    beta_k = compute_adaptive_weight(k, k_mid, gamma)

    # 6. Combine into EPO loss
    epo_loss, loss_dict = compute_epo_loss(
        policy_loss, entropy_loss, smoothing_loss, beta_k,
        lambda_entropy, alpha
    )

    # 7. Backward pass and optimizer step
    epo_loss.backward()
    policy_optimizer.step()

    # 8. Update entropy history with current trajectory entropy
    entropy_buffer.add(trajectory_entropy)

    # Add step info to logged metrics
    loss_dict['training_step'] = k

    return loss_dict

This integration shows how entropy history, adaptive weights, and corridor bounds coordinate during training to prevent oscillations.

Practical Guidance

Hyperparameter Configuration

| Parameter | Default | Range | Role | |-----------|---------|-------|------| | λ (lambda_entropy) | 0.1 | 0.01-1.0 | Entropy regularization strength; higher encourages more exploration | | κₗ (kappa_left) | 0.5 | 0.1-0.7 | Lower corridor bound as fraction of historical mean entropy | | κᵣ (kappa_right) | 1.5 | 1.2-3.0 | Upper corridor bound as fraction of historical mean entropy | | γ (gamma) | 1.0 | 0.5-2.0 | Schedule steepness; higher transitions faster from exploration to exploitation | | k_mid | 60% of total steps | 50-70% | Midpoint where phase transition occurs; set to ~60% of total training steps | | α (alpha) | 0.1 | 0.05-0.5 | Smoothing penalty relative to entropy regularization | | window_size | 20 | 10-50 | Historical window for corridor bounds; larger = more stable bounds, slower adaptation |

Setting k_mid: This should typically be set to 60% of your total training steps. For a 100-step training run, use k_mid=60. For 200 steps, use k_mid=120. This places the strongest regularization in the first 40% of training while allowing relaxation afterward.

Choosing corridor bounds: Start with κₗ=0.5 and κᵣ=1.5. These values create a 3× width corridor around historical mean entropy (0.5× to 1.5×). If entropy oscillates too much, tighten bounds (reduce κᵣ to 1.2, increase κₗ to 0.6). If policy under-explores, widen bounds.

λ parameter tuning: Begin with λ=0.1. Check training logs for entropy trends: if entropy steadily increases, reduce λ; if it plateaus, increase λ. The smoothing mechanism should prevent extreme oscillations even with modest λ values.

When to Use EPO

Multi-turn environments (20+ interaction steps per episode) where shared parameters create coordination challenges
Sparse reward settings where trajectories complete before meaningful intermediate feedback occurs
Text-based reasoning tasks like scientific question-answering where exploration across reasoning paths is necessary
Embodied control with multiple decision points per episode requiring sustained exploration
Any RL training showing entropy oscillation patterns (entropy jumps wildly between iterations)

When NOT to Use EPO

Single-turn or short-horizon tasks (< 5 steps); standard entropy regularization suffices
Dense reward environments where intermediate rewards guide exploration naturally
Fully-explored task distributions where standard policies naturally converge to low entropy
Imitation learning settings where entropy regularization typically conflicts with behavioral cloning objectives
Tasks where over-exploration is harmful (e.g., safety-critical systems); EPO encourages more exploration than some alternatives

Common Pitfalls

Setting k_mid too low: If phase transition happens too early (k_mid < 40% of steps), the policy relaxes constraints before stabilizing. Use k_mid = 0.6 × total_steps as default.
Corridor bounds too tight: κₗ=0.7, κᵣ=1.2 severely restricts entropy changes. This produces rigid exploration that may prevent optimal behavior discovery. Start wide (0.5-1.5), tighten only if oscillations persist.
Not tracking entropy in logs: Without visibility into entropy trends, you cannot diagnose whether oscillation or under-exploration is occurring. Always log trajectory_entropy at every step.
Using constant window_size: Early in training with only 3-5 history samples, bounds computed from small windows are unstable. Either skip smoothing until window has 10+ samples, or use adaptive window sizing.
Forgetting reward scaling: If rewards vary wildly across episodes, entropy regularization strength becomes inconsistent relative to policy gradients. Normalize or clip rewards before computing policy loss.
Applying to non-multi-turn settings: EPO's trajectory aggregation assumes long episodes. For single-step or very short tasks, use standard per-step entropy regularization instead.

Reference: EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning - Wujiang Xu et al. GitHub: https://github.com/WujiangXu/EPO

EPO: Entropy-Regularized Policy Optimization for LLM Agents

Problem Context

Core Concept

Architecture Overview

Trajectory-level entropy computation: Aggregate entropy across all dialogue turns within each trajectory, then average over batch to capture global exploration state rather than per-step entropy
Entropy corridor mechanism: Maintain rolling window of historical entropy values with left bound κₗ and right bound κᵣ; apply penalty only when current entropy deviates beyond these bounds
Dynamic weighting schedule: Use exponential annealing with midpoint γ to modulate penalty strength βₖ = 1 + e^(-γk/k_mid), applying strong regularization early and relaxing as training stabilizes
Integration with policy optimizers: Compatible with PPO and GRPO—entropy loss separates from policy gradient, preserving credit assignment integrity
Adaptive penalty application: Penalize deviations from corridor only when needed, reducing unnecessary optimization overhead compared to constant regularization

Implementation

Step 1: Compute Trajectory-Level Entropy

import torch
import torch.nn.functional as F

def compute_trajectory_entropy(logits_per_turn, attention_mask=None):
    """
    Compute entropy across all turns in trajectories.

    Args:
        logits_per_turn: Dict mapping turn_id -> [batch_size, seq_len, vocab_size]
        attention_mask: Dict mapping turn_id -> [batch_size, seq_len] (optional)

    Returns:
        trajectory_entropy: Scalar tensor, mean entropy across batch and turns
    """
    entropies = []

    for turn_id, logits in logits_per_turn.items():
        # Compute entropy per token
        probs = F.softmax(logits, dim=-1)
        token_entropy = -torch.sum(probs * F.log_softmax(logits, dim=-1), dim=-1)

        # Apply attention mask if provided
        if attention_mask is not None:
            mask = attention_mask[turn_id]
            token_entropy = token_entropy * mask
            token_entropy = token_entropy.sum() / mask.sum()
        else:
            token_entropy = token_entropy.mean()

        entropies.append(token_entropy)

    # Average entropy across all turns in trajectory
    trajectory_entropy = torch.stack(entropies).mean()
    return trajectory_entropy

This approach captures whether the agent is exploring (high entropy across all steps) or exploiting (low entropy), independent of which individual turn contributes entropy.

Step 2: Initialize and Maintain Entropy History

class EntropyHistoryBuffer:
    """Maintain rolling entropy history for corridor-based regularization."""

    def __init__(self, window_size=20):
        self.window_size = window_size
        self.history = []

    def add(self, entropy_value):
        """Add current trajectory entropy to history."""
        self.history.append(float(entropy_value.detach().cpu()))
        # Keep only recent window
        if len(self.history) > self.window_size:
            self.history.pop(0)

    def get_corridor_bounds(self, kappa_left=0.5, kappa_right=1.5):
        """
        Compute dynamic corridor bounds based on historical entropy.

        Args:
            kappa_left: Left bound as fraction of historical mean
            kappa_right: Right bound as fraction of historical mean

        Returns:
            lower_bound, upper_bound: Adaptive entropy corridor
        """
        if len(self.history) < 5:
            # Early training: use constant bounds
            return None, None

        mean_entropy = sum(self.history) / len(self.history)
        lower_bound = kappa_left * mean_entropy
        upper_bound = kappa_right * mean_entropy

        return lower_bound, upper_bound

    def get_mean_entropy(self):
        """Return mean entropy from history."""
        if not self.history:
            return 0.0
        return sum(self.history) / len(self.history)

The corridor bounds dynamically adjust as the policy's natural entropy settles, allowing the regularization to become less restrictive as training progresses.

Step 3: Compute Entropy Smoothing Loss

def compute_entropy_smoothing_loss(current_entropy, entropy_buffer,
                                   kappa_left=0.5, kappa_right=1.5):
    """
    Compute penalty for entropy deviations from historical corridor.

    Args:
        current_entropy: Scalar tensor, trajectory entropy from current batch
        entropy_buffer: EntropyHistoryBuffer instance
        kappa_left, kappa_right: Corridor bound multipliers

    Returns:
        smoothing_loss: Scalar tensor (0 if within bounds, penalty if outside)
    """
    lower_bound, upper_bound = entropy_buffer.get_corridor_bounds(kappa_left, kappa_right)

    if lower_bound is None:
        # Insufficient history: no smoothing penalty yet
        return torch.tensor(0.0, device=current_entropy.device)

    smoothing_loss = torch.tensor(0.0, device=current_entropy.device)

    # Penalize deviations beyond corridor bounds
    if current_entropy < lower_bound:
        deviation = lower_bound - current_entropy
        smoothing_loss = deviation ** 2
    elif current_entropy > upper_bound:
        deviation = current_entropy - upper_bound
        smoothing_loss = deviation ** 2

    return smoothing_loss

This formulation differs from hard constraints—soft penalties allow occasional deviations while still discouraging sustained oscillation.

Step 4: Implement Adaptive Phase-Based Weighting

def compute_adaptive_weight(k, k_mid=60, gamma=1.0):
    """
    Compute dynamic weighting coefficient βₖ across training phases.

    The schedule implements: βₖ = 1 + e^(-γk/k_mid)
    - Early phase (k << k_mid): βₖ ≈ 2 (strong regularization)
    - Mid phase (k ≈ k_mid): βₖ ≈ 1.37 (moderate regularization)
    - Late phase (k >> k_mid): βₖ ≈ 1 (minimal regularization)

    Args:
        k: Current training step
        k_mid: Midpoint step where transition occurs (suggests k_mid ≈ 60% of total steps)
        gamma: Steepness of transition (higher = sharper transition)

    Returns:
        beta_k: Adaptive weighting coefficient
    """
    exponent = -gamma * k / k_mid
    beta_k = 1.0 + torch.exp(torch.tensor(exponent, dtype=torch.float32))
    return beta_k.item()

The exponential schedule naturally decays regularization without requiring threshold tuning—phase transitions happen automatically based on training progress.

Step 5: Combine Components into EPO Loss

def compute_epo_loss(policy_loss, entropy_loss, smoothing_loss,
                     beta_k, lambda_entropy=0.1, alpha=0.1):
    """
    Compute final EPO loss combining multi-turn RL with entropy regularization.

    Loss formula: L_EPO = L_policy - λ[L_entropy - β_k * α * L_smooth]

    Args:
        policy_loss: Tensor from PPO/GRPO (multi-turn RL objective)
        entropy_loss: Trajectory-level entropy regularization loss
        smoothing_loss: Penalty for corridor deviations
        beta_k: Adaptive weight for current training phase
        lambda_entropy: Entropy regularization coefficient (default: 0.1)
        alpha: Smoothing importance relative to entropy (default: 0.1)

    Returns:
        epo_loss: Final scalar loss for backpropagation
        loss_dict: Dict with component losses for logging
    """
    # Entropy regularization: encourage exploration via entropy
    entropy_reg = lambda_entropy * entropy_loss

    # Smoothing penalty: prevent entropy oscillation
    smoothing_penalty = beta_k * alpha * smoothing_loss

    # Combined: entropy encourages exploration, smoothing prevents oscillation
    entropy_component = entropy_reg - smoothing_penalty

    # Final EPO loss: policy gradient minus entropy modulation
    epo_loss = policy_loss - entropy_component

    loss_dict = {
        'policy_loss': policy_loss.item(),
        'entropy_loss': entropy_loss.item(),
        'smoothing_loss': smoothing_loss.item(),
        'beta_k': beta_k,
        'total_entropy_component': entropy_component.item(),
        'epo_loss': epo_loss.item()
    }

    return epo_loss, loss_dict

The structure preserves policy gradient signals while using entropy as a separate regularization channel—this maintains credit assignment integrity compared to approaches that reshape advantages.

Step 6: Training Loop Integration

The complete training loop coordinates entropy history tracking, corridor adaptation, and loss computation. This integration is where EPO prevents cascade failures.

def training_step(model, policy_optimizer, batch_trajectories,
                  entropy_buffer, k, total_steps,
                  kappa_left=0.5, kappa_right=1.5,
                  lambda_entropy=0.1, alpha=0.1, gamma=1.0):
    """
    Single EPO training step with adaptive entropy regularization.

    Args:
        model: LLM agent model
        policy_optimizer: PyTorch optimizer
        batch_trajectories: List of complete trajectories from rollout
        entropy_buffer: EntropyHistoryBuffer for corridor computation
        k: Current training step
        total_steps: Total training steps (for scheduling)
        kappa_left, kappa_right: Corridor bound parameters
        lambda_entropy, alpha: Loss weighting hyperparameters
        gamma: Schedule steepness parameter

    Returns:
        loss_dict: Logged metrics for monitoring
    """
    policy_optimizer.zero_grad()

    # 1. Collect logits from all turns across batch
    logits_per_turn = {}
    all_rewards = []

    for trajectory in batch_trajectories:
        # Forward pass through model for each turn
        for turn_idx, turn_data in enumerate(trajectory['turns']):
            if turn_idx not in logits_per_turn:
                logits_per_turn[turn_idx] = []

            logits = model(turn_data['input_ids'],
                          attention_mask=turn_data['attention_mask'])
            logits_per_turn[turn_idx].append(logits)

        all_rewards.append(trajectory['reward'])

    # Stack logits by turn and convert rewards to tensor
    for turn_idx in logits_per_turn:
        logits_per_turn[turn_idx] = torch.stack(logits_per_turn[turn_idx])
    rewards = torch.tensor(all_rewards, dtype=torch.float32)

    # 2. Compute trajectory-level entropy
    trajectory_entropy = compute_trajectory_entropy(logits_per_turn)
    entropy_loss = -trajectory_entropy  # Negative because we subtract entropy in loss

    # 3. Compute smoothing loss with current corridor
    smoothing_loss = compute_entropy_smoothing_loss(
        trajectory_entropy, entropy_buffer, kappa_left, kappa_right
    )

    # 4. Compute policy loss (PPO/GRPO-style)
    # This is simplified; real implementation needs advantage estimation
    policy_loss = compute_policy_loss(logits_per_turn, rewards)

    # 5. Get adaptive weight for current training phase
    k_mid = int(total_steps * 0.6)
    beta_k = compute_adaptive_weight(k, k_mid, gamma)

    # 6. Combine into EPO loss
    epo_loss, loss_dict = compute_epo_loss(
        policy_loss, entropy_loss, smoothing_loss, beta_k,
        lambda_entropy, alpha
    )

    # 7. Backward pass and optimizer step
    epo_loss.backward()
    policy_optimizer.step()

    # 8. Update entropy history with current trajectory entropy
    entropy_buffer.add(trajectory_entropy)

    # Add step info to logged metrics
    loss_dict['training_step'] = k

    return loss_dict

This integration shows how entropy history, adaptive weights, and corridor bounds coordinate during training to prevent oscillations.

Practical Guidance

Hyperparameter Configuration

When to Use EPO

Multi-turn environments (20+ interaction steps per episode) where shared parameters create coordination challenges
Sparse reward settings where trajectories complete before meaningful intermediate feedback occurs
Text-based reasoning tasks like scientific question-answering where exploration across reasoning paths is necessary
Embodied control with multiple decision points per episode requiring sustained exploration
Any RL training showing entropy oscillation patterns (entropy jumps wildly between iterations)

When NOT to Use EPO

Single-turn or short-horizon tasks (< 5 steps); standard entropy regularization suffices
Dense reward environments where intermediate rewards guide exploration naturally
Fully-explored task distributions where standard policies naturally converge to low entropy
Imitation learning settings where entropy regularization typically conflicts with behavioral cloning objectives
Tasks where over-exploration is harmful (e.g., safety-critical systems); EPO encourages more exploration than some alternatives

Common Pitfalls

Setting k_mid too low: If phase transition happens too early (k_mid < 40% of steps), the policy relaxes constraints before stabilizing. Use k_mid = 0.6 × total_steps as default.
Corridor bounds too tight: κₗ=0.7, κᵣ=1.2 severely restricts entropy changes. This produces rigid exploration that may prevent optimal behavior discovery. Start wide (0.5-1.5), tighten only if oscillations persist.
Not tracking entropy in logs: Without visibility into entropy trends, you cannot diagnose whether oscillation or under-exploration is occurring. Always log trajectory_entropy at every step.
Using constant window_size: Early in training with only 3-5 history samples, bounds computed from small windows are unstable. Either skip smoothing until window has 10+ samples, or use adaptive window sizing.
Forgetting reward scaling: If rewards vary wildly across episodes, entropy regularization strength becomes inconsistent relative to policy gradients. Normalize or clip rewards before computing policy loss.
Applying to non-multi-turn settings: EPO's trajectory aggregation assumes long episodes. For single-step or very short tasks, use standard per-step entropy regularization instead.

Reference: EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning - Wujiang Xu et al. GitHub: https://github.com/WujiangXu/EPO

Adoption

ADu2021/epo-entropy-regularized-policy-optimization

$ install --global

Security Scan Results

SKILL.md

EPO: Entropy-Regularized Policy Optimization for LLM Agents

Problem Context

Core Concept

Architecture Overview

Implementation

Step 1: Compute Trajectory-Level Entropy

Step 2: Initialize and Maintain Entropy History

Step 3: Compute Entropy Smoothing Loss

Step 4: Implement Adaptive Phase-Based Weighting

Step 5: Combine Components into EPO Loss

Step 6: Training Loop Integration

Practical Guidance

Hyperparameter Configuration

When to Use EPO

When NOT to Use EPO

Common Pitfalls

Related Skills

ADu2021/flow-map-trajectory-tilting

ADu2021/flexible-data-mixture-of-experts

ADu2021/flexibility-trap-diffusion-reasoning

ADu2021/flex-continuous-agent-evolution

ADu2021/epo-entropy-regularized-policy-optimization

$ install --global

Security Scan Results

SKILL.md

EPO: Entropy-Regularized Policy Optimization for LLM Agents

Problem Context

Core Concept

Architecture Overview

Implementation

Step 1: Compute Trajectory-Level Entropy

Step 2: Initialize and Maintain Entropy History

Step 3: Compute Entropy Smoothing Loss

Step 4: Implement Adaptive Phase-Based Weighting

Step 5: Combine Components into EPO Loss

Step 6: Training Loop Integration

Practical Guidance

Hyperparameter Configuration

When to Use EPO

When NOT to Use EPO

Common Pitfalls

Related Skills

ADu2021/flow-map-trajectory-tilting

ADu2021/flexible-data-mixture-of-experts

ADu2021/flexibility-trap-diffusion-reasoning

ADu2021/flex-continuous-agent-evolution