skills/skillxiv-v0.0.2-claude-opus-4.6/ce-gppo-gradient-preserving-entropy-control/SKILL.md
Control policy entropy dynamics in RL by reweighting gradients from clipped tokens. CE-GPPO preserves out-of-clip gradients with beta parameters to stabilize exploration-exploitation balance, preventing entropy collapse while maintaining training stability in LLM fine-tuning.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf): `npx skillsauth add ADu2021/skillXiv ce-gppo-gradient-preserving-entropy-control`
CE-GPPO enables fine-grained control over policy entropy dynamics during reinforcement learning training, preventing entropy collapse while maintaining training stability. This is critical for LLM fine-tuning where premature convergence to deterministic policies undermines exploration and model reasoning capabilities.
Standard policy gradient methods like PPO clip the importance sampling ratio to constrain policy updates, but this creates a hidden cost: tokens whose ratio falls outside the clipping interval contribute zero gradient, so their learning signal is discarded. Which tokens are dropped depends on the sign of the advantage, and this asymmetric gradient loss accelerates entropy collapse.
In LLM training, entropy collapse manifests as the model converging too quickly to high-probability tokens, losing the exploration capability needed for complex reasoning tasks.
The covariance between token log-probabilities and advantages drives entropy change: when already-likely tokens receive positive advantages, the covariance is positive and entropy falls. When clipped gradients are dropped, this covariance is pushed up, forcing rapid policy convergence.
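The covariance claim can be checked numerically on a toy problem. The sketch below is illustrative only: it assumes a single-state softmax policy over four tokens and a natural-policy-gradient-style update that adds `lr * advantage` directly to the logits. The helper names (`softmax`, `entropy`, `predicted_entropy_change`) and all numbers are made up for this example. Under those assumptions, the first-order entropy change is `-lr * Cov(log pi, A)`.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a plain Python list
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(logits):
    # Shannon entropy of the softmax distribution
    return -sum(p * math.log(p) for p in softmax(logits))

def predicted_entropy_change(logits, advantages, lr):
    # First-order estimate: -lr * Cov_pi(log pi, A)
    probs = softmax(logits)
    log_probs = [math.log(p) for p in probs]
    mean_logp = sum(p * lp for p, lp in zip(probs, log_probs))
    mean_adv = sum(p * a for p, a in zip(probs, advantages))
    cov = sum(p * (lp - mean_logp) * (a - mean_adv)
              for p, lp, a in zip(probs, log_probs, advantages))
    return -lr * cov

# Toy setup (assumed values): the most likely token has the largest advantage
logits = [2.0, 0.5, -1.0, 0.0]
advantages = [1.0, -0.5, 0.2, -0.3]
lr = 1e-3

updated = [z + lr * a for z, a in zip(logits, advantages)]
actual = entropy(updated) - entropy(logits)
predicted = predicted_entropy_change(logits, advantages, lr)
print(f"actual={actual:.3e} predicted={predicted:.3e}")
```

Because the likely token is also the rewarded one, the covariance is positive and both values come out negative: entropy shrinks. Flipping the sign of the advantages reverses the effect.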
CE-GPPO reframes entropy control as a gradient reweighting problem. Instead of discarding clipped-token gradients, the method incorporates them with tunable weights beta_1 and beta_2.
The key insight: policy entropy change is governed by the covariance between log-probabilities and advantages. By reweighting these specific gradient sources, CE-GPPO directly controls entropy dynamics without separate entropy regularization terms. This preserves gradient information while maintaining the pessimistic clipping behavior of standard PPO, ensuring stable optimization.
The method maintains PPO's pessimistic update mechanism: when advantages are positive but ratios are too high, updates are clipped; when advantages are negative but ratios are too low, updates are clipped. CE-GPPO adds controlled gradient flow from these clipped regions.
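The reweighting described above can be written as a per-token objective. The following is a sketch reconstructed from the description in this document, not a verbatim copy of the paper's equation. The symbols $\delta_t$ (importance ratio), $\hat{A}_t$ (advantage) and $\operatorname{sg}(\cdot)$ (stop-gradient) are notational assumptions.

$$
\delta_t(\theta) = \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})}
$$

$$
\ell_t(\theta) =
\begin{cases}
\beta_1 \, \dfrac{1-\epsilon}{\operatorname{sg}(\delta_t)} \, \delta_t \, \hat{A}_t & \text{if } \delta_t < 1-\epsilon \text{ and } \hat{A}_t < 0 \quad \text{(NA\&LP)} \\[2ex]
\beta_2 \, \dfrac{1+\epsilon}{\operatorname{sg}(\delta_t)} \, \delta_t \, \hat{A}_t & \text{if } \delta_t > 1+\epsilon \text{ and } \hat{A}_t > 0 \quad \text{(PA\&LP)} \\[2ex]
\delta_t \, \hat{A}_t & \text{otherwise}
\end{cases}
$$

In the "otherwise" branch this coincides with PPO's $\min(\delta_t \hat{A}_t, \operatorname{clip}(\delta_t, 1-\epsilon, 1+\epsilon)\, \hat{A}_t)$, since that minimum selects the unclipped term for every token outside the two clipped regions. In the clipped branches the forward value stays at the bound (scaled by $\beta$), while the gradient becomes $\beta_1 (1-\epsilon)\, \hat{A}_t \nabla_\theta \log \pi_\theta$ or $\beta_2 (1+\epsilon)\, \hat{A}_t \nabla_\theta \log \pi_\theta$ instead of zero.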
For each token in a batch, compute the importance sampling ratio (new policy probability / old policy probability) and classify it relative to clipping bounds [1-epsilon, 1+epsilon].
```python
# Token-level importance sampling and classification
import torch
import torch.nn.functional as F


def compute_token_importance_ratios(
    log_probs_new: torch.Tensor,  # [batch, seq_len]
    log_probs_old: torch.Tensor,  # [batch, seq_len]
    advantages: torch.Tensor,     # [batch, seq_len]
    epsilon: float = 0.2
) -> tuple:
    """
    Compute importance ratios and token classifications.

    Returns:
        ratios: importance sampling ratios
        is_in_clip: boolean mask for tokens inside clipping interval
        is_pa_lp: boolean mask for PA&LP tokens (pos advantage, ratio above upper bound)
        is_na_lp: boolean mask for NA&LP tokens (neg advantage, ratio below lower bound)
    """
    # Importance sampling: exp(log_new - log_old); the old policy is treated as a constant
    ratios = torch.exp(log_probs_new - log_probs_old.detach())

    # Clipping bounds
    lower_bound = 1.0 - epsilon
    upper_bound = 1.0 + epsilon

    # Determine which tokens are inside the clipping interval
    is_in_clip = (ratios >= lower_bound) & (ratios <= upper_bound)

    # PA&LP: positive advantage and ratio too high (above upper bound), clipped by PPO
    is_pa_lp = (advantages > 0) & (ratios > upper_bound)

    # NA&LP: negative advantage and ratio too low (below lower bound), clipped by PPO
    is_na_lp = (advantages < 0) & (ratios < lower_bound)

    return ratios, is_in_clip, is_pa_lp, is_na_lp
```
The CE-GPPO loss combines standard clipped losses with weighted out-of-clip gradients.
```python
def ce_gppo_loss(
    log_probs_new: torch.Tensor,
    log_probs_old: torch.Tensor,
    advantages: torch.Tensor,
    epsilon: float = 0.2,
    beta_1: float = 0.5,
    beta_2: float = 1.0
) -> torch.Tensor:
    """
    CE-GPPO loss with gradient-preserving clipping.

    Args:
        log_probs_new: log probabilities under new policy [batch, seq_len]
        log_probs_old: log probabilities under old policy [batch, seq_len]
        advantages: advantage estimates [batch, seq_len]
        epsilon: PPO clipping parameter
        beta_1: weight for NA&LP token gradients
        beta_2: weight for PA&LP token gradients

    Returns:
        loss: scalar CE-GPPO loss to minimize (the negated surrogate objective)
    """
    ratios, _, is_pa_lp, is_na_lp = compute_token_importance_ratios(
        log_probs_new, log_probs_old, advantages, epsilon
    )

    # Standard PPO clipped surrogate; its gradient is zero for clipped tokens
    clipped_ratio = torch.clamp(ratios, 1.0 - epsilon, 1.0 + epsilon)
    surrogate = torch.min(ratios * advantages, clipped_ratio * advantages)

    # Gradient-preserving term: the rescaling factor is detached (stop-gradient),
    # so the forward value equals the clipped bound times the advantage while
    # the gradient still flows through `ratios`
    preserved = (clipped_ratio / ratios).detach() * ratios * advantages

    # Weighted gradients for PA&LP tokens (encourage exploration)
    surrogate = torch.where(is_pa_lp, beta_2 * preserved, surrogate)

    # Weighted gradients for NA&LP tokens (accelerate convergence)
    surrogate = torch.where(is_na_lp, beta_1 * preserved, surrogate)

    # Average once over all tokens and negate so an optimizer can minimize it
    return -surrogate.mean()
```
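To see what "gradient-preserving" means in practice, the sketch below compares the gradient of a single clipped token under the standard PPO surrogate and under a stop-gradient rescaling. It is a minimal illustration: `clipped_token_grad` is a hypothetical helper written for this example, and the probabilities, advantage and `beta` are made-up values.

```python
import math
import torch

def clipped_token_grad(log_prob_new, log_prob_old, advantage, epsilon=0.2, beta=None):
    """Return d(surrogate)/d(log_prob_new) for a single token.

    beta=None uses the standard PPO clipped surrogate; a float uses the
    stop-gradient rescaling with that weight.
    """
    lp = torch.tensor(log_prob_new, dtype=torch.float64, requires_grad=True)
    ratio = torch.exp(lp - log_prob_old)
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    if beta is None:
        # PPO: the min selects the clamped branch, whose gradient is zero
        surrogate = torch.min(ratio * advantage, clipped * advantage)
    else:
        # Detached factor keeps the value at the bound but lets gradient through
        surrogate = beta * (clipped / ratio).detach() * ratio * advantage
    surrogate.backward()
    return lp.grad.item()

# Token whose probability doubled (ratio = 2.0 > 1 + epsilon) with positive advantage
ppo_grad = clipped_token_grad(math.log(0.2), math.log(0.1), advantage=1.0)
kept_grad = clipped_token_grad(math.log(0.2), math.log(0.1), advantage=1.0, beta=0.5)
print(ppo_grad, kept_grad)
```

With the standard surrogate the token contributes no gradient at all. With the rescaling it contributes `beta * (1 + epsilon) * advantage` times the gradient of its log-probability, so `beta` directly sets how much of the clipped signal is let back in.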
Integrate CE-GPPO into a standard RL training loop with value function updates.
```python
def ce_gppo_train_step(
    model: torch.nn.Module,
    batch_log_probs_new: torch.Tensor,
    batch_log_probs_old: torch.Tensor,
    batch_advantages: torch.Tensor,
    batch_values_new: torch.Tensor,
    batch_returns: torch.Tensor,
    optimizer: torch.optim.Optimizer,
    epsilon: float = 0.2,
    beta_1: float = 0.5,
    beta_2: float = 1.0,
    value_coeff: float = 0.5,
    entropy_coeff: float = 0.0
) -> dict:
    """
    Single CE-GPPO training step combining policy and value updates.

    Args:
        model: policy and value network
        batch_log_probs_new: new policy log probs [num_samples, seq_len],
            produced by a forward pass of `model` with gradients enabled
        batch_log_probs_old: old policy log probs [num_samples, seq_len]
        batch_advantages: advantage estimates [num_samples, seq_len]
        batch_values_new: value predictions [num_samples]
        batch_returns: cumulative returns [num_samples]
        optimizer: torch optimizer
        epsilon: PPO clipping parameter
        beta_1: weight for NA&LP gradients
        beta_2: weight for PA&LP gradients
        value_coeff: coefficient for value loss
        entropy_coeff: coefficient for entropy regularization (0 disables it)

    Returns:
        metrics: dict with loss components
    """
    optimizer.zero_grad()

    # Compute policy loss
    policy_loss = ce_gppo_loss(
        batch_log_probs_new,
        batch_log_probs_old,
        batch_advantages,
        epsilon,
        beta_1,
        beta_2
    )

    # Compute value loss (standard MSE)
    value_loss = F.mse_loss(batch_values_new, batch_returns)

    # Optional entropy regularization (usually disabled with CE-GPPO).
    # Only the sampled tokens' log-probs are available here, so this is a rough
    # sample-based estimate: the mean negative log-probability of sampled tokens
    # approximates the entropy while the policy stays close to the sampling policy
    entropy = -batch_log_probs_new.mean()

    # Combined loss
    total_loss = policy_loss + value_coeff * value_loss - entropy_coeff * entropy

    # Backward pass and optimization
    total_loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

    return {
        "policy_loss": policy_loss.item(),
        "value_loss": value_loss.item(),
        "entropy": entropy.item(),
        "total_loss": total_loss.item()
    }
```
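Since the point of the method is to steer entropy, it helps to monitor the exact per-token entropy whenever the full logits are available, rather than relying on sampled log-probabilities alone. The sketch below is one way to do that with `torch.distributions.Categorical`. The helper name `token_entropy` and the toy logits are assumptions for illustration.

```python
import math
import torch
from torch.distributions import Categorical

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Exact entropy of the next-token distribution at every position.

    logits: [batch, seq_len, vocab_size] -> returns [batch, seq_len]
    """
    return Categorical(logits=logits).entropy()

# Uniform logits over a 4-token vocabulary give the maximum entropy log(4)
uniform_logits = torch.zeros(2, 3, 4)
uniform_entropy = token_entropy(uniform_logits)

# A sharply peaked distribution has entropy close to zero
peaked_logits = torch.tensor([[[20.0, 0.0, 0.0, 0.0]]])
peaked_entropy = token_entropy(peaked_logits)

print(uniform_entropy.mean().item(), peaked_entropy.item())
```

Logging the mean of this quantity over non-padding positions at each step gives a direct view of whether a given `beta_1` / `beta_2` setting is holding entropy steady or letting it collapse.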
Adapt CE-GPPO for transformer-based LLMs, handling token sequences and attention masks.
```python
def compute_ce_gppo_gradients_for_lm(
    model: torch.nn.Module,
    input_ids: torch.Tensor,         # [batch, seq_len]
    attention_mask: torch.Tensor,    # [batch, seq_len]
    advantage_scores: torch.Tensor,  # [batch, seq_len-1]
    old_logits: torch.Tensor,        # [batch, seq_len-1, vocab_size]
    epsilon: float = 0.2,
    beta_1: float = 0.5,
    beta_2: float = 1.0
) -> torch.Tensor:
    """
    Compute CE-GPPO loss for language model fine-tuning.

    Args:
        model: transformer language model (its output must expose `.logits`)
        input_ids: token indices
        attention_mask: padding mask
        advantage_scores: token-level advantages from reward model
        old_logits: cached logits from old policy
        epsilon, beta_1, beta_2: CE-GPPO hyperparameters

    Returns:
        loss: scalar loss to minimize via backpropagation
    """
    # Forward pass to get new logits
    outputs = model(input_ids, attention_mask=attention_mask, output_hidden_states=False)
    new_logits = outputs.logits[:, :-1, :]  # [batch, seq_len-1, vocab_size]

    # Get target token indices (next tokens in sequence)
    target_tokens = input_ids[:, 1:]  # [batch, seq_len-1]

    # Compute log probabilities; the old policy is a constant, so detach it
    log_probs_new = F.log_softmax(new_logits, dim=-1)
    log_probs_old = F.log_softmax(old_logits.detach(), dim=-1)

    # Extract log probabilities for target tokens
    log_probs_new = torch.gather(log_probs_new, -1, target_tokens.unsqueeze(-1)).squeeze(-1)
    log_probs_old = torch.gather(log_probs_old, -1, target_tokens.unsqueeze(-1)).squeeze(-1)

    # Mask out padding tokens: a zero advantage yields zero loss and zero gradient
    mask = attention_mask[:, 1:].to(advantage_scores.dtype)
    advantage_scores = advantage_scores * mask

    # Compute CE-GPPO loss
    loss = ce_gppo_loss(
        log_probs_new,
        log_probs_old,
        advantage_scores,
        epsilon=epsilon,
        beta_1=beta_1,
        beta_2=beta_2
    )
    return loss
```
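One detail worth checking when handling attention masks: zeroing the advantages of padding tokens removes their gradient, but an average over all positions still counts them in the denominator. That shrinks the effective loss scale for batches with heavy padding. A masked mean avoids this. The sketch below is an optional refinement, not part of the method as described; `masked_mean` is a hypothetical helper and the numbers are made up.

```python
import torch

def masked_mean(values: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean of `values` over positions where mask == 1 (padding is ignored)."""
    mask = mask.to(values.dtype)
    # clamp guards against division by zero when a batch is entirely padding
    return (values * mask).sum() / mask.sum().clamp(min=1.0)

per_token_loss = torch.tensor([[1.0, 3.0, 5.0],
                               [2.0, 7.0, 9.0]])
mask = torch.tensor([[1, 1, 0],
                     [1, 0, 0]])  # 3 real tokens, 3 padding positions

naive = (per_token_loss * mask).mean()      # divides by all 6 positions
masked = masked_mean(per_token_loss, mask)  # divides by the 3 real tokens
print(naive.item(), masked.item())
```

Here the naive average is half the masked one, purely because half of the positions are padding.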
| Parameter | Default | Range | Impact |
|-----------|---------|-------|--------|
| beta_1 | 0.5 | [0.0, 1.0] | Weight for NA&LP (negative advantage, low probability) gradients. Higher values accelerate convergence but may increase the entropy decay rate. |
| beta_2 | 1.0 | [0.0, 1.0] | Weight for PA&LP (positive advantage, low probability) gradients. Higher values encourage exploration by preserving encouraging-but-clipped gradients. |
| epsilon | 0.2 | [0.15, 0.3] | PPO clipping range. A larger epsilon widens the unclipped interval, so fewer tokens fall outside it and have their gradients reweighted. |
| entropy_coeff | 0.0 | [0.0, 0.01] | Coefficient for entropy regularization. CE-GPPO typically sets this to 0 since entropy is controlled via the betas. |
| value_coeff | 0.5 | [0.1, 1.0] | Coefficient for value function loss in the combined objective. |
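The defaults and ranges in the table can be captured in a small configuration object so that out-of-range values are caught early. This is a convenience sketch only: `CEGPPOConfig` is a hypothetical name, and treating the table's ranges as hard limits is an assumption made for illustration (they read as recommended ranges, not strict constraints).

```python
from dataclasses import dataclass

# Recommended ranges copied from the hyperparameter table
_RANGES = {
    "beta_1": (0.0, 1.0),
    "beta_2": (0.0, 1.0),
    "epsilon": (0.15, 0.3),
    "entropy_coeff": (0.0, 0.01),
    "value_coeff": (0.1, 1.0),
}

@dataclass(frozen=True)
class CEGPPOConfig:
    beta_1: float = 0.5         # weight for NA&LP gradients
    beta_2: float = 1.0         # weight for PA&LP gradients
    epsilon: float = 0.2        # PPO clipping parameter
    entropy_coeff: float = 0.0  # entropy bonus, typically disabled
    value_coeff: float = 0.5    # value loss coefficient

    def __post_init__(self):
        # Reject values outside the ranges listed in the table
        for name, (low, high) in _RANGES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")

config = CEGPPOConfig()
print(config)
```

A stricter or looser policy (for example, warning instead of raising) may suit experimentation better, since exploring values outside the recommended ranges can be legitimate.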
CE-GPPO is recommended for:
Example scenarios: training models on AIME, mathematical problem-solving, reasoning-heavy instruction following.
Do not use CE-GPPO when:
The paper reports that CE-GPPO achieves the best balance when it maintains "relatively high and stable entropy" throughout training, which corresponds to placing greater weight on PA&LP gradients (higher beta_2) than on NA&LP gradients (lower beta_1), as in the defaults beta_1 = 0.5 and beta_2 = 1.0.
Paper: CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning
Authors: Zhenpeng Su, Leiyu Pan, Minxuan Lv, Yuntao Li, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou
arXiv: https://arxiv.org/abs/2509.20712
Citation:
@article{su2025cegppo,
title={CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning},
author={Su, Zhenpeng and Pan, Leiyu and Lv, Minxuan and Li, Yuntao and Hu, Wenping and Zhang, Fuzheng and Gai, Kun and Zhou, Guorui},
journal={arXiv preprint arXiv:2509.20712},
year={2025}
}
On mathematical reasoning benchmarks (AIME24, AIME25, HMMT25, MATH500, AMC23):
Compared to:
Implementation tested on DeepSeek-R1-Distill models with mathematical reasoning datasets (30k samples). Gradients remain stable with KL divergence and gradient norms within expected ranges relative to standard PPO.