Replace hard clipping in policy gradients with smooth quadratic penalties derived from Total Variation divergence constraints. Eliminates zero-gradient regions and training instability while maintaining stable policy evolution.
Policy gradient methods like PPO and GRPO use hard clipping to enforce trust regions, but this creates discontinuous gradients that cause zero-gradient regions, reward hacking, and training instability at scale. Models exploit superficial reward correlates like verbosity and degrade rapidly. CFPO replaces hard clipping with smooth convex penalties derived from Total Variation divergence constraints, providing everywhere-differentiable gradients that smoothly pull the policy toward the trust region without artificial boundaries.
The key insight is that TV divergence permits larger policy improvements than KL while remaining tractable, and smooth penalty-based enforcement is more stable than clipping.
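As background for this comparison, two standard identities (general facts about the divergences, not results specific to CFPO) show how TV relates to the probability ratio and to KL. The symbol $\rho$ for the ratio is notation introduced here for illustration, and the first identity assumes $\pi_{\text{old}}(a \mid s) > 0$ wherever $\pi_\theta(a \mid s) > 0$.

```latex
% Total variation written in terms of the probability ratio
D_{\mathrm{TV}}(\pi_{\text{old}}, \pi_\theta)
  = \tfrac{1}{2} \sum_a \bigl| \pi_\theta(a \mid s) - \pi_{\text{old}}(a \mid s) \bigr|
  = \tfrac{1}{2}\, \mathbb{E}_{a \sim \pi_{\text{old}}} \bigl[ \lvert \rho(a) - 1 \rvert \bigr],
  \qquad \rho(a) = \frac{\pi_\theta(a \mid s)}{\pi_{\text{old}}(a \mid s)}

% Pinsker's inequality
D_{\mathrm{TV}}(\pi_{\text{old}}, \pi_\theta)
  \le \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}(\pi_{\text{old}} \,\|\, \pi_\theta)}
```

By the second line, every policy inside a KL ball of radius $\delta$ also lies inside the TV ball of radius $\sqrt{\delta / 2}$, so a TV constraint of that radius admits at least as many updates as the KL constraint. The first line is why a penalty on $(\rho - 1)$ is a natural way to control TV.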
CFPO replaces the clipped objective with a smooth penalty. Below, ratio = π_new(a|s) / π_old(a|s) and A is the advantage:
Traditional GRPO (with hard clip):
L_GRPO = min(ratio * A, clip(ratio, 1-ε, 1+ε) * A)
CFPO (with smooth penalty):
L_CFPO = ratio * A - |A| / (2ε) * (ratio - 1)²
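The penalty term is quadratic in the ratio, so the objective is differentiable everywhere. Its gradient with respect to the ratio vanishes only at a single point, ratio = 1 ± ε (the sign follows the sign of the advantage), which is the edge of the trust region. Beyond that point the gradient pulls the ratio back toward the trust region instead of dropping to zero. This avoids the cliff-like behavior of clipping while still enforcing the trust region.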
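The difference in gradient behavior can be checked numerically. The sketch below is illustrative only: the helper names (`clipped_objective`, `cfpo_objective`, `grad_wrt_ratio`) are hypothetical, and it evaluates both per-sample objectives as written above and differentiates them with respect to the ratio using autograd.

```python
import torch

def clipped_objective(ratio, advantage, epsilon=0.2):
    """PPO/GRPO-style clipped surrogate for a single sample."""
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    return torch.min(ratio * advantage, clipped * advantage)

def cfpo_objective(ratio, advantage, epsilon=0.2):
    """Smooth-penalty surrogate: ratio * A - |A| / (2 eps) * (ratio - 1)^2."""
    return ratio * advantage - advantage.abs() / (2 * epsilon) * (ratio - 1) ** 2

def grad_wrt_ratio(objective, ratio_value, advantage_value, epsilon=0.2):
    """Gradient of a per-sample objective with respect to the ratio."""
    ratio = torch.tensor(ratio_value, requires_grad=True)
    advantage = torch.tensor(advantage_value)
    objective(ratio, advantage, epsilon).backward()
    return ratio.grad.item()

# Two samples that have already left the trust region in the "rewarded" direction:
# ratio 1.5 with a positive advantage, ratio 0.5 with a negative advantage.
for ratio_value, advantage_value in [(1.5, 1.0), (0.5, -1.0)]:
    g_clip = grad_wrt_ratio(clipped_objective, ratio_value, advantage_value)
    g_cfpo = grad_wrt_ratio(cfpo_objective, ratio_value, advantage_value)
    print(f"ratio={ratio_value}, A={advantage_value}: clip grad={g_clip}, CFPO grad={g_cfpo}")
```

In both cases the clipped objective contributes no gradient, while the smooth penalty produces a gradient whose sign moves the ratio back toward 1.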
The method involves computing policy ratios and applying the smooth penalty objective.
Compute policy ratios and advantages:
import torch
import torch.nn.functional as F

def compute_policy_ratios(logprobs_new, logprobs_ref, logprobs_old):
    """Compute probability ratios for policy gradient."""
    # Log probability differences
    log_ratio = logprobs_new - logprobs_old
    ratio = torch.exp(log_ratio)
    return ratio, log_ratio
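
def compute_advantages(rewards, values, gamma=0.99, gae_lambda=0.95):
    """Compute advantages using Generalized Advantage Estimation.

    Expects 1-D tensors for a single trajectory and assumes the episode
    terminates after the last step (bootstrap value of 0).
    """
    advantages = torch.zeros_like(values)
    gae = 0.0
    next_value = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * gae_lambda * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages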

# Group-relative advantage (for reasoning tasks)
def compute_group_relative_advantages(rewards):
    """Normalize advantages within each group."""
    group_mean = rewards.mean()
    group_std = rewards.std() + 1e-8
    advantages = (rewards - group_mean) / group_std
    return advantages
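`compute_group_relative_advantages` takes the mean and standard deviation over the whole tensor it receives, so it normalizes correctly only when called with the rewards of a single group. A possible batched variant is sketched below; the function name `batched_group_relative_advantages` and the assumed layout (one row per prompt, one column per sampled completion) are illustrative and not part of the original skill.

```python
import torch

def batched_group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within each row.

    Assumes rewards has shape (num_prompts, completions_per_prompt).
    """
    group_mean = rewards.mean(dim=-1, keepdim=True)
    group_std = rewards.std(dim=-1, keepdim=True)
    return (rewards - group_mean) / (group_std + eps)

# Two prompts with four sampled completions each (made-up reward values).
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.2, 0.9, 0.4, 0.5]])
advantages = batched_group_relative_advantages(rewards)
print(advantages)

# A group in which every completion receives the same reward yields
# all-zero advantages, so it contributes no learning signal.
print(batched_group_relative_advantages(torch.ones(1, 4)))
```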
Implement CFPO objective with smooth penalty:
def cfpo_loss(logprobs_new, logprobs_ref, logprobs_old, advantages,
              epsilon=0.2, use_reward_scale=True):
    """CFPO objective with smooth TV-constrained penalty."""
    # Compute ratio
    ratio = torch.exp(logprobs_new - logprobs_old)
    # Reward signal
    reward_term = ratio * advantages
    # Advantage magnitude (used for penalty scaling)
    if use_reward_scale:
        advantage_mag = torch.abs(advantages)
    else:
        advantage_mag = 1.0
    # Smooth penalty: quadratic cost for deviating from 1.0
    # penalty = |advantage| / (2ε) * (ratio - 1)²
    penalty_term = (advantage_mag / (2 * epsilon)) * (ratio - 1) ** 2
    # Combined objective
    objective = reward_term - penalty_term
    return -objective.mean()  # Negative because we minimize

def cfpo_training_step(model, batch, ref_model, optimizer, epsilon=0.2):
    """Single CFPO training step."""
    states, actions, rewards, old_logprobs = batch
    # Forward pass with new policy
    logprobs_new = model.compute_logprobs(states, actions)
    # Reference policy (for KL divergence monitoring)
    with torch.no_grad():
        logprobs_ref = ref_model.compute_logprobs(states, actions)
    # Advantages (group-relative for reasoning tasks)
    advantages = compute_group_relative_advantages(rewards)
    # CFPO loss
    loss = cfpo_loss(logprobs_new, logprobs_ref, old_logprobs,
                     advantages, epsilon=epsilon)
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    return loss.item()
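The training step calls `model.compute_logprobs(states, actions)` and unpacks the batch as `(states, actions, rewards, old_logprobs)`, but neither the method nor the rollout that produces `old_logprobs` is defined here. The following is a minimal sketch of what they might look like for a discrete action space. `TinyPolicy`, its layer sizes, and the random placeholder rewards are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    """Hypothetical discrete-action policy exposing compute_logprobs."""

    def __init__(self, state_dim=4, num_actions=3):
        super().__init__()
        self.net = nn.Linear(state_dim, num_actions)

    def compute_logprobs(self, states, actions):
        # Log-probability of each taken action under the current policy.
        logits = self.net(states)
        dist = torch.distributions.Categorical(logits=logits)
        return dist.log_prob(actions)

torch.manual_seed(0)
policy = TinyPolicy()
states = torch.randn(8, 4)

# Rollout phase: sample actions and record their log-probs without gradients.
with torch.no_grad():
    rollout_dist = torch.distributions.Categorical(logits=policy.net(states))
    actions = rollout_dist.sample()
    old_logprobs = rollout_dist.log_prob(actions)

rewards = torch.rand(8)  # placeholder rewards, one per sample
batch = (states, actions, rewards, old_logprobs)

# Before any parameter update the ratio is 1 (up to floating point),
# so the quadratic penalty is zero on the first gradient step.
ratio = torch.exp(policy.compute_logprobs(states, actions) - old_logprobs)
print(ratio)
```

Because `old_logprobs` is recorded under `torch.no_grad()`, gradients flow only through the new log-probabilities, which is what the ratio in `cfpo_loss` expects.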
Monitor training stability and policy divergence:
def compute_training_metrics(logprobs_new, logprobs_old, advantages):
    """Monitor CFPO training dynamics."""
    # Policy divergence (approximate KL)
    kl_div = (logprobs_old - logprobs_new).mean()
    # Clipping ratio (for PPO comparison)
    ratio = torch.exp(logprobs_new - logprobs_old)
    clipped_mask = (ratio < 0.8) | (ratio > 1.2)  # Threshold for clipping
    clipping_ratio = clipped_mask.float().mean()
    # Entropy estimate
    entropy = -logprobs_new.mean()
    # Advantage magnitude
    advantage_mag = torch.abs(advantages).mean()
    return {
        "kl_divergence": kl_div.item(),
        "clipping_ratio": clipping_ratio.item(),
        "entropy": entropy.item(),
        "advantage_magnitude": advantage_mag.item()
    }
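
# Training loop with monitoring (num_epochs and train_loader are supplied by the caller)
for epoch in range(num_epochs):
    for batch in train_loader:
        loss = cfpo_training_step(model, batch, ref_model, optimizer)
        # cfpo_training_step returns only the loss, so recompute the
        # quantities needed for monitoring from the same batch
        states, actions, rewards, old_logprobs = batch
        with torch.no_grad():
            logprobs_new = model.compute_logprobs(states, actions)
        advantages = compute_group_relative_advantages(rewards)
        metrics = compute_training_metrics(
            logprobs_new, old_logprobs, advantages
        )
    if epoch % 10 == 0:
        print(f"Epoch {epoch}: KL={metrics['kl_divergence']:.4f}, "
              f"Clipping={metrics['clipping_ratio']:.4f}")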
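The approximate KL above is the mean of `logprobs_old - logprobs_new`, which can come out negative on a finite batch. Since CFPO's constraint is stated in terms of total variation, it may also help to track a TV estimate directly. The sketch below is one possible addition, not part of the original skill: `divergence_estimates` is a hypothetical helper, and it assumes the actions were sampled from the old policy.

```python
import torch

def divergence_estimates(logprobs_new, logprobs_old):
    """Sample-based divergence estimates, assuming actions were drawn from the old policy."""
    log_ratio = (logprobs_new - logprobs_old).detach()
    ratio = log_ratio.exp()
    # Low-variance KL(old || new) estimator: (ratio - 1) - log(ratio).
    # Each term is non-negative, unlike the plain -log(ratio) estimator.
    kl_k3 = ((ratio - 1) - log_ratio).mean()
    # Total variation estimate: 0.5 * E_old[|ratio - 1|].
    tv = 0.5 * (ratio - 1).abs().mean()
    return {"kl_k3": kl_k3.item(), "tv": tv.item()}

# Made-up log-probabilities for three sampled actions.
logprobs_old = torch.log(torch.tensor([0.5, 0.25, 0.25]))
logprobs_new = logprobs_old + torch.tensor([0.1, -0.2, 0.05])

print(divergence_estimates(logprobs_old, logprobs_old))  # identical policies
print(divergence_estimates(logprobs_new, logprobs_old))  # slightly shifted policy
```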
| Aspect | Recommendation | Notes |
|--------|----------------|-------|
| Epsilon (TV constraint) | 0.15-0.25 | Controls trust region width |
| Learning Rate | 1e-5 to 1e-4 | Typical RL scales |
| Num Epochs | 2-4 per batch | Standard PPO range |
| KL Target | 0.01-0.05 | Monitor divergence from reference |
| Clipping Ratio | Should be < 0.05 | Indicates smooth learning |
| Entropy Decay | Monitor for collapse | Track during training |
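The role of epsilon as the trust region width can be read off the per-sample objective used in `cfpo_loss`. The short derivation below is a sketch for illustration: it uses $\rho$ for the ratio and $A$ for the advantage, and treats each sample independently.

```latex
% Per-sample objective with |A|-scaled penalty (use_reward_scale=True)
L(\rho) = \rho A - \frac{|A|}{2\epsilon} (\rho - 1)^2

\frac{\partial L}{\partial \rho} = A - \frac{|A|}{\epsilon} (\rho - 1) = 0
\quad\Longrightarrow\quad
\rho^{*} = 1 + \epsilon \,\operatorname{sign}(A)

% Unit penalty scale (use_reward_scale=False)
L(\rho) = \rho A - \frac{1}{2\epsilon} (\rho - 1)^2
\quad\Longrightarrow\quad
\rho^{*} = 1 + \epsilon A
```

With the default scaling and $\epsilon = 0.2$, the per-sample optimum sits at a ratio of 1.2 for positive advantages and 0.8 for negative ones, regardless of the advantage magnitude. With `use_reward_scale=False` the optimum shifts with the size of $A$, so large advantages can pull the ratio further from 1.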
When to use: For alignment tasks where GRPO exhibits instability. When you observe reward hacking or verbosity exploitation. For reasoning tasks where clipping causes training stalls.
When NOT to use: For tasks already stable with GRPO. When computational efficiency of clipping is critical.
Common pitfalls:
Clipping-Free Policy Optimization for Large Language Models https://arxiv.org/abs/2601.22801