skills/skillxiv-v0.0.2-claude-opus-4.6/dcpo-dynamic-clipping-policy-optimization/SKILL.md
DCPO eliminates zero-gradient dead zones in policy optimization by adaptively adjusting token-level clipping bounds based on prior probabilities and smoothing advantage standardization across cumulative training steps, achieving 28% improvement in effective response utilization and 10x reduction in token clipping ratio on mathematical reasoning benchmarks.
npx skillsauth add ADu2021/skillXiv dcpo-dynamic-clipping-policy-optimizationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful framework for enhancing reasoning capabilities in large language models. However, existing policy optimization methods like GRPO and DAPO suffer from critical inefficiencies that limit their effectiveness. The primary issues are: (1) zero-gradient regions caused by fixed, symmetric clipping bounds that prevent gradient flow for tokens with high or low probabilities, (2) inefficient advantage standardization across identical rewards that creates dead zones where gradients vanish, and (3) response-level inefficiency where batch-level averaging dilutes the relative advantage structure among responses to the same prompt.
These limitations result in significant computational waste and poor training efficiency. For instance, GRPO exhibits only 50-60% utilization of generated responses for gradient updates, meaning half the training data provides no gradient signal. The fixed clipping approach further reduces the effective gradient computation, necessitating more training iterations to achieve equivalent performance.
DCPO (Dynamic Clipping Policy Optimization) replaces fixed, symmetric clipping bounds with an adaptive mechanism that tailors clipping thresholds to individual token probabilities. Rather than using a uniform clipping interval like [1-ε, 1+ε], DCPO computes token-specific bounds that expand or contract based on the old policy's token probability. This approach permits greater exploration in low-probability regions where the model lacks confidence, while maintaining stability in high-probability regions.
Simultaneously, DCPO introduces Smooth Advantage Standardization (SAS) that blends step-specific and cumulative advantage statistics. Instead of standardizing advantages using only current-step statistics (which creates identical advantages for identical rewards), SAS incorporates historical advantage data from all previous responses to the same prompt. This mixture reduces variance while preserving gradient information, effectively eliminating the zero-gradient dead zones.
The DCPO framework operates on top of standard policy optimization pipelines and consists of three integrated components:
Dynamic-Adaptive Clipping Bounds (DAC): Computes token-specific lower and upper clipping bounds that depend on the old policy's token probability, enabling variance control tailored to each token's probability distribution.
Smooth Advantage Standardization (SAS): Implements a weighted mixture of step-specific and cumulative advantage standardization, with mixture weights that adapt as training progresses to prioritize current-step information during later training phases.
Response-Level Loss Computation: Computes loss independently for each response rather than averaging across the batch, preserving relative advantage magnitudes and preventing batch-level dilution effects.
Together these components form a cohesive optimization strategy that dramatically reduces zero-gradient occurrences, increases response utilization, and stabilizes training dynamics.
The core innovation lies in computing adaptive clipping bounds that respond to each token's prior probability. The dynamic bounds formula adjusts the clipping thresholds based on the old policy probability to prevent gradient starvation.
import torch
import torch.nn.functional as F
def compute_dynamic_clipping_bounds(old_logprobs, clip_coeff=0.2):
"""
Compute dynamic clipping bounds based on token-specific old probabilities.
This replaces fixed bounds [1-eps, 1+eps] with adaptive bounds that expand
in low-probability regions (where exploration is needed) and contract in
high-probability regions (where stability matters). The formula derives from
variance-bias tradeoff analysis.
Args:
old_logprobs: Log probabilities from old policy, shape [batch_size, seq_len]
clip_coeff: Coefficient controlling bound width (typically 0.2)
Returns:
lower_bound: Token-specific lower clipping bound
upper_bound: Token-specific upper clipping bound
"""
# Convert log probabilities to probabilities for bound computation
old_probs = torch.exp(old_logprobs).clamp(min=1e-7, max=1.0)
# Dynamic bound formula: sqrt(p_old) determines the bound width
# This ensures bounds are tighter for high-probability tokens and wider for low-prob
prob_sqrt = torch.sqrt(old_probs)
# Compute symmetric bounds around 1.0
bound_width = clip_coeff / (prob_sqrt + 1e-8)
lower_bound = torch.maximum(
(1.0 - bound_width) * torch.ones_like(old_logprobs),
torch.zeros_like(old_logprobs) + 1e-7
)
upper_bound = 1.0 + bound_width
return lower_bound, upper_bound
Apply the dynamic bounds to clip the probability ratio between new and old policies, replacing standard PPO clipping.
def clip_by_dynamic_bounds(probability_ratio, lower_bound, upper_bound):
"""
Clip probability ratios using token-specific dynamic bounds.
This function replaces the fixed clipping clamp(ratio, 1-eps, 1+eps) with
adaptive clipping. The dynamic bounds permit the policy to move more freely
in low-confidence regions while maintaining stability in confident regions.
Args:
probability_ratio: New policy prob / old policy prob, shape [batch_size, seq_len]
lower_bound: Token-specific lower bounds from dynamic computation
upper_bound: Token-specific upper bounds from dynamic computation
Returns:
clipped_ratio: Ratio clipped by dynamic bounds
"""
clipped_ratio = torch.clamp(probability_ratio, min=lower_bound, max=upper_bound)
return clipped_ratio
Smooth Advantage Standardization (SAS) blends current-step and cumulative advantage statistics to eliminate zero-gradient dead zones.
class SmoothAdvantageStandardizer:
"""
Implements Smooth Advantage Standardization that mixes step-specific and
cumulative statistics to reduce zero gradients.
Instead of standardizing advantages using only current-step statistics
(which makes identical rewards have identical advantages and zero gradients),
SAS incorporates historical advantage data from all previous responses to
the same prompt.
"""
def __init__(self, smoothing_coeff=0.9):
"""
Initialize the standardizer.
Args:
smoothing_coeff: Balance between current-step (low) and cumulative (high) stats
"""
self.smoothing_coeff = smoothing_coeff
# Store cumulative statistics per prompt
self.cumulative_advantage_stats = {}
def compute_advantages(self, rewards, values, gamma=0.99, gae_lambda=0.95):
"""
Compute generalized advantage estimates using GAE.
Args:
rewards: Reward signal, shape [batch_size, seq_len]
values: Value function estimates, shape [batch_size, seq_len]
gamma: Discount factor
gae_lambda: GAE lambda parameter
Returns:
advantages: Computed advantages, shape [batch_size, seq_len]
"""
batch_size, seq_len = rewards.shape
advantages = torch.zeros_like(rewards)
# Standard GAE computation
gae = 0
for t in reversed(range(seq_len)):
if t == seq_len - 1:
next_value = 0
else:
next_value = values[:, t+1]
delta = rewards[:, t] + gamma * next_value - values[:, t]
gae = delta + gamma * gae_lambda * gae
advantages[:, t] = gae
return advantages
def standardize_step_level(self, advantages):
"""
Compute step-specific standardization using current batch statistics.
Args:
advantages: Advantages to standardize
Returns:
standardized: Step-level standardized advantages
"""
mean = advantages.mean(dim=0, keepdim=True)
std = advantages.std(dim=0, keepdim=True) + 1e-8
standardized = (advantages - mean) / std
return standardized
def update_cumulative_stats(self, advantages, prompt_ids):
"""
Update cumulative statistics for each prompt across training iterations.
Args:
advantages: Current advantages
prompt_ids: Prompt identifiers for grouping
"""
for prompt_id in torch.unique(prompt_ids):
mask = prompt_ids == prompt_id
prompt_advantages = advantages[mask]
if prompt_id.item() not in self.cumulative_advantage_stats:
self.cumulative_advantage_stats[prompt_id.item()] = {
'mean': prompt_advantages.mean(),
'std': prompt_advantages.std() + 1e-8,
'count': 1
}
else:
stats = self.cumulative_advantage_stats[prompt_id.item()]
# Update running statistics
stats['mean'] = (stats['mean'] * stats['count'] +
prompt_advantages.mean()) / (stats['count'] + 1)
stats['std'] = torch.sqrt((stats['std']**2 * stats['count'] +
prompt_advantages.std()**2) /
(stats['count'] + 1)) + 1e-8
stats['count'] += 1
def standardize_smooth(self, advantages, prompt_ids, current_step, total_steps):
"""
Compute smooth standardization that blends step-specific and cumulative stats.
This is the core of SAS: instead of purely step-specific standardization which
creates zero gradients for identical rewards, we blend in cumulative statistics
that incorporate historical data.
Args:
advantages: Current advantages
prompt_ids: Prompt identifiers for cumulative lookup
current_step: Current training step (affects mixture weight)
total_steps: Total training steps (for scheduling)
Returns:
standardized: Smooth advantage standardization
"""
# Get step-level standardization
step_standardized = self.standardize_step_level(advantages)
# Compute cumulative standardization
cumulative_standardized = torch.zeros_like(advantages)
for i, prompt_id in enumerate(prompt_ids):
if prompt_id.item() in self.cumulative_advantage_stats:
stats = self.cumulative_advantage_stats[prompt_id.item()]
cumulative_standardized[i] = (advantages[i] - stats['mean']) / stats['std']
else:
cumulative_standardized[i] = step_standardized[i]
# Adaptive mixture weight: weight more toward current distribution as training progresses
# Early training: use more cumulative info; late training: use more current info
progress = current_step / (total_steps + 1)
mixture_weight = self.smoothing_coeff + (1 - self.smoothing_coeff) * progress
# Blend the two standardizations
smooth_standardized = (mixture_weight * step_standardized +
(1 - mixture_weight) * cumulative_standardized)
return smooth_standardized
Combine dynamic clipping and smooth advantage standardization into a complete policy optimization step.
def dcpo_policy_update(
policy_model,
value_model,
trajectories,
old_logprobs,
clip_coeff=0.2,
value_coeff=1.0,
entropy_coeff=0.01,
num_epochs=3,
batch_size=32
):
"""
Execute a DCPO policy optimization update.
This function orchestrates the complete DCPO update including dynamic clipping
bound computation, smooth advantage standardization, and response-level loss
aggregation. It replaces standard PPO update loops.
Args:
policy_model: Language model policy to optimize
value_model: Value function for advantage estimation
trajectories: Dict with keys: 'input_ids', 'attention_mask', 'rewards', 'prompt_ids'
old_logprobs: Log probabilities from old policy
clip_coeff: Dynamic clipping coefficient
value_coeff: Value loss coefficient
entropy_coeff: Entropy bonus coefficient
num_epochs: Number of update epochs
batch_size: Minibatch size
Returns:
logs: Dictionary of training metrics
"""
device = next(policy_model.parameters()).device
# Compute advantages
with torch.no_grad():
values = value_model(
input_ids=trajectories['input_ids'],
attention_mask=trajectories['attention_mask']
).squeeze(-1)
# Standard advantage computation
advantages = trajectories['rewards'] - values.detach()
# Initialize smooth standardizer
standardizer = SmoothAdvantageStandardizer()
# Prepare for updates
total_policy_loss = 0.0
total_value_loss = 0.0
num_updates = 0
for epoch in range(num_epochs):
# Compute dynamic clipping bounds
lower_bound, upper_bound = compute_dynamic_clipping_bounds(
old_logprobs, clip_coeff=clip_coeff
)
# Apply smooth advantage standardization
smooth_advantages = standardizer.standardize_smooth(
advantages,
prompt_ids=trajectories['prompt_ids'],
current_step=epoch,
total_steps=num_epochs
)
# Update cumulative statistics
standardizer.update_cumulative_stats(
advantages,
trajectories['prompt_ids']
)
# Create minibatches
num_samples = len(trajectories['input_ids'])
indices = torch.randperm(num_samples, device=device)
for start_idx in range(0, num_samples, batch_size):
batch_indices = indices[start_idx:start_idx + batch_size]
# Forward pass
logits = policy_model(
input_ids=trajectories['input_ids'][batch_indices],
attention_mask=trajectories['attention_mask'][batch_indices]
).logits
# Compute new log probabilities
log_probs = F.log_softmax(logits, dim=-1)
# Gather log probs for selected actions
new_logprobs = log_probs[
torch.arange(len(batch_indices)),
trajectories['actions'][batch_indices]
]
# Compute probability ratio
prob_ratio = torch.exp(new_logprobs - old_logprobs[batch_indices])
# Apply dynamic clipping
clipped_ratio = clip_by_dynamic_bounds(
prob_ratio,
lower_bound[batch_indices],
upper_bound[batch_indices]
)
# Compute policy loss (response-level aggregation)
policy_loss = -(
torch.min(prob_ratio * smooth_advantages[batch_indices],
clipped_ratio * smooth_advantages[batch_indices])
).mean()
# Value loss
new_values = value_model(
input_ids=trajectories['input_ids'][batch_indices],
attention_mask=trajectories['attention_mask'][batch_indices]
).squeeze(-1)
value_loss = F.mse_loss(new_values, trajectories['rewards'][batch_indices])
# Entropy bonus
entropy = -(log_probs * torch.exp(log_probs)).sum(dim=-1).mean()
# Combined loss
total_loss = policy_loss + value_coeff * value_loss - entropy_coeff * entropy
# Backprop and optimize
optimizer = torch.optim.Adam(policy_model.parameters(), lr=1e-5)
optimizer.zero_grad()
total_loss.backward()
torch.nn.utils.clip_grad_norm_(policy_model.parameters(), 1.0)
optimizer.step()
total_policy_loss += policy_loss.item()
total_value_loss += value_loss.item()
num_updates += 1
logs = {
'policy_loss': total_policy_loss / num_updates,
'value_loss': total_value_loss / num_updates,
'nonzero_advantage_ratio': (smooth_advantages.abs() > 1e-6).float().mean().item(),
}
return logs
| Parameter | Default | Range | Notes | |-----------|---------|-------|-------| | clip_coeff | 0.2 | 0.1-0.5 | Controls dynamic bound width; lower = tighter bounds | | smoothing_coeff | 0.9 | 0.7-0.99 | Balance toward cumulative stats; higher = more cumulative | | value_coeff | 1.0 | 0.5-2.0 | Weight of value loss; controls value function training | | entropy_coeff | 0.01 | 0.0-0.1 | Entropy bonus magnitude; prevents premature convergence | | gae_lambda | 0.95 | 0.9-0.99 | GAE parameter; higher = lower bias, higher variance | | batch_size | 32 | 16-128 | Minibatch size; affects variance of gradient estimates |
DCPO excels in scenarios where existing policy optimization methods suffer from inefficiency:
Avoid DCPO when:
Zero-Gradient Dominance Persists: If nonzero_advantage_ratio remains below 0.5, increase smoothing_coeff or decrease initial clip_coeff. The mixture weight scheduler may be converging too quickly to current-step standardization.
Unstable Value Function: The value function is not constrained by dynamic clipping. Ensure separate value updates with lower learning rate than policy. Consider auxiliary loss for value function stability.
Memory Blowup with Large Batches: Cumulative statistics are stored per prompt. In high-diversity datasets, limit the number of unique prompts or use a rolling window of recent prompts instead of all historical statistics.
Divergence During Early Training: If policy diverges in first epoch, reduce clip_coeff from 0.2 to 0.1 or increase value_coeff to stabilize value function before aggressive policy updates.
Response Utilization Not Improving: Verify that standardization is actually creating gradient diversity. Log the distribution of smooth advantages—if they're still clustered around zero, the cumulative statistics may not be diverse enough. Mix in more diverse reward signals or increase prompt diversity.
Reference: https://arxiv.org/abs/2509.02333
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.