skills/skillxiv-v0.0.2-claude-opus-4.6/dppo-divergence-policy/SKILL.md
Replace PPO's heuristic ratio-based clipping with Divergence Proximal Policy Optimization (DPPO) that directly constrains policy divergence using either Total Variation or KL, enabling lightweight approximations (Binary, Top-K) for vocabulary-scale computations while improving stability and efficiency.
npx skillsauth add ADu2021/skillXiv dppo-divergence-policy
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
PPO's probability ratio clipping creates problematic asymmetries in LLM training: low-probability tokens trigger aggressive clipping despite negligible distributional impact, while high-probability tokens shift substantially without penalty. DPPO replaces heuristic ratio clipping with direct divergence constraints, measuring actual policy divergence rather than noisy single-sample estimates. Lightweight approximations enable vocabulary-scale divergence computation.
The key insight is that PPO's clipping mechanism is poorly matched to LLM training. PPO uses token-level probability ratios as proxies for policy divergence, but each ratio is a noisy single-sample estimate that does not reflect the true difference between distributions. DPPO instead measures the divergence (Total Variation or KL) between the old and new policies directly, so the constraint asks whether the whole next-token distribution has shifted too far rather than penalizing individual sampled tokens.
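The asymmetry described above can be checked by hand on two tiny distributions. The following is a minimal sketch in plain Python, not code from the skill: the three-token vocabularies, the clip width of 0.2, and the helper names `tv_distance` and `ppo_would_clip` are all assumptions chosen for illustration. Whether PPO's clip actually takes effect also depends on the sign of the advantage, which this sketch ignores.

```python
def tv_distance(p, q):
    """Total Variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def ppo_would_clip(ratio, clip_eps=0.2):
    """True when the ratio falls outside the band [1 - clip_eps, 1 + clip_eps]."""
    return ratio < 1.0 - clip_eps or ratio > 1.0 + clip_eps


# Case A: the sampled token (index 2) is rare and its probability triples.
old_a = [0.900, 0.099, 0.001]
new_a = [0.898, 0.099, 0.003]
ratio_a = new_a[2] / old_a[2]
tv_a = tv_distance(old_a, new_a)

# Case B: the sampled token (index 0) is dominant and loses 0.15 of its mass.
old_b = [0.90, 0.05, 0.05]
new_b = [0.75, 0.20, 0.05]
ratio_b = new_b[0] / old_b[0]
tv_b = tv_distance(old_b, new_b)

for name, ratio, tv in [("A", ratio_a, tv_a), ("B", ratio_b, tv_b)]:
    print(f"case {name}: ratio={ratio:.3f} "
          f"outside_clip_band={ppo_would_clip(ratio)} TV={tv:.3f}")
```

In case A the ratio is about 3 and falls outside the clip band although the distribution moved by a TV distance of only 0.002. In case B the ratio stays inside the band while the TV distance is 0.15, which is 75 times larger.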
Analyze why probability ratios are problematic for policy divergence control.
import torch


def analyze_ppo_asymmetry(old_logits, new_logits, sampled_tokens):
    """
    Demonstrate PPO's asymmetry: ratio clipping poorly controls actual divergence.

    old_logits, new_logits: (batch, vocab) tensors; sampled_tokens: (batch,) token ids.
    """
    batch_size, vocab_size = old_logits.shape
    rows = torch.arange(batch_size, device=old_logits.device)
    # Full next-token distributions under each policy
    old_probs = torch.softmax(old_logits, dim=-1)
    new_probs = torch.softmax(new_logits, dim=-1)
    # Probabilities of the tokens that were actually sampled
    sampled_old_probs = old_probs[rows, sampled_tokens]
    sampled_new_probs = new_probs[rows, sampled_tokens]
    # Probability ratios: the quantity PPO clips
    prob_ratios = sampled_new_probs / (sampled_old_probs + 1e-8)
    # Problem: ratio depends on old probability magnitude
    print("Low probability tokens: high ratios but small distributional impact")
    print("High probability tokens: small ratios but large distributional impact")
    # Actual KL(old || new), averaged over the batch.
    # kl_div takes log-probabilities as input and probabilities as target.
    actual_kl = torch.nn.functional.kl_div(
        torch.log_softmax(new_logits, dim=-1),
        torch.softmax(old_logits, dim=-1),
        reduction='batchmean',
    )
    return prob_ratios, actual_kl
Measure policy divergence via Total Variation rather than probability ratios.
def compute_total_variation_divergence(old_logits, new_logits, sampled_tokens=None):
    """
    Measure TV divergence: 0.5 * sum(|p_old - p_new|)
    Lower cost than KL, symmetric.
    sampled_tokens is unused; it is accepted only so all divergence
    functions share one call signature.
    """
    old_probs = torch.softmax(old_logits, dim=-1)
    new_probs = torch.softmax(new_logits, dim=-1)
    # TV divergence per row, in [0, 1]
    tv_divergence = 0.5 * torch.abs(old_probs - new_probs).sum(dim=-1)
    return tv_divergence


def compute_kl_divergence(old_logits, new_logits):
    """
    Full KL divergence KL(old || new): expensive but precise.
    """
    # log_softmax is numerically safer than log(softmax(x) + eps)
    old_log_probs = torch.log_softmax(old_logits, dim=-1)
    new_log_probs = torch.log_softmax(new_logits, dim=-1)
    kl = (old_log_probs.exp() * (old_log_probs - new_log_probs)).sum(dim=-1)
    return kl
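Because the two measures live on different scales, a limit chosen for one does not transfer to the other. The sketch below, in plain Python with made-up three-token distributions, computes both and checks Pinsker's inequality, a standard result (not taken from this skill) stating that TV is at most sqrt(KL / 2). It also shows that TV is symmetric while KL is not. The helper names are assumptions for illustration.

```python
import math


def tv_distance(p, q):
    """Total Variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


old = [0.70, 0.20, 0.10]
new = [0.60, 0.25, 0.15]

tv = tv_distance(old, new)
kl_forward = kl_divergence(old, new)   # KL(old || new)
kl_reverse = kl_divergence(new, old)   # KL(new || old), generally different
pinsker_bound = math.sqrt(kl_forward / 2.0)

print(f"TV={tv:.4f}  KL(old||new)={kl_forward:.4f}  KL(new||old)={kl_reverse:.4f}")
print(f"Pinsker bound sqrt(KL/2)={pinsker_bound:.4f}")
```

One practical reading: a KL limit of delta guarantees a TV of at most sqrt(delta / 2), so a KL threshold of 0.05 and a TV threshold of 0.05 are not equivalent constraints.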
Approximate divergence via binary classification: sampled token vs. all others.
def binary_divergence_approximation(old_logits, new_logits, sampled_tokens):
    """
    Binary approximation: treat sampled token as 1, all others as 0.
    Compute the Bernoulli KL divergence between:
    - (p_old[sampled], 1 - p_old[sampled])
    - (p_new[sampled], 1 - p_new[sampled])
    """
    batch_size, vocab_size = old_logits.shape
    rows = torch.arange(batch_size, device=old_logits.device)
    old_probs = torch.softmax(old_logits, dim=-1)
    new_probs = torch.softmax(new_logits, dim=-1)
    # Probability of sampled token under each policy
    p_old_sampled = old_probs[rows, sampled_tokens]
    p_new_sampled = new_probs[rows, sampled_tokens]
    # Binary divergence (simplified KL)
    # D(p_old || p_new) for Bernoulli; eps keeps the logs finite
    eps = 1e-8
    divergence = (
        p_old_sampled * (torch.log(p_old_sampled + eps) - torch.log(p_new_sampled + eps)) +
        (1 - p_old_sampled) * (torch.log(1 - p_old_sampled + eps) - torch.log(1 - p_new_sampled + eps))
    )
    return divergence
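The same two-bucket idea applies to Total Variation, where it reduces to the absolute change in the sampled token's probability. The sketch below is plain Python on a made-up four-token distribution, with hypothetical helper names. It shows that the binary value never exceeds the full TV distance, since merging categories cannot increase TV, and that the value depends on which token happened to be sampled.

```python
def tv_distance(p, q):
    """Total Variation distance over the full vocabulary."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def binary_tv(p_old, p_new, token):
    """TV between Bernoulli(p_old[token]) and Bernoulli(p_new[token]).

    For two-outcome distributions this is just |p_old[token] - p_new[token]|.
    """
    return abs(p_old[token] - p_new[token])


old = [0.50, 0.30, 0.15, 0.05]
new = [0.40, 0.30, 0.25, 0.05]

full_tv = tv_distance(old, new)
per_token = [binary_tv(old, new, t) for t in range(len(old))]

print(f"full TV = {full_tv:.3f}")
for token, value in enumerate(per_token):
    print(f"sampled token {token}: binary TV = {value:.3f}")
```

Here mass moved from token 0 to token 2. If the sampled token is 0 or 2 the binary value matches the full TV distance, but if the sampled token is 1 or 3 the binary value is zero even though the policy changed. The approximation is cheap, and it is a lower bound rather than an exact measurement.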
Approximate divergence by tracking K highest-probability tokens.
def topk_divergence_approximation(old_logits, new_logits, sampled_tokens, k=10):
    """
    Top-K approximation: track K highest tokens + sampled token.
    Aggregate remaining tokens as 'other' category.
    """
    old_probs = torch.softmax(old_logits, dim=-1)
    new_probs = torch.softmax(new_logits, dim=-1)
    # Top-K tokens under the old policy, and the same tokens under the new one
    top_k_old, top_k_indices_old = torch.topk(old_probs, k=k, dim=-1)
    top_k_new = torch.gather(new_probs, -1, top_k_indices_old)
    # Per row: is the sampled token already one of the top-K tokens?
    sampled = sampled_tokens.unsqueeze(-1)
    sampled_in_topk = (top_k_indices_old == sampled).any(dim=-1, keepdim=True)
    # Sampled token gets its own bucket, zeroed when top-K already covers it
    sampled_old = torch.gather(old_probs, -1, sampled) * (~sampled_in_topk)
    sampled_new = torch.gather(new_probs, -1, sampled) * (~sampled_in_topk)
    # Aggregate probabilities of tokens not in top-k or sampled
    other_old = (1 - top_k_old.sum(dim=-1, keepdim=True) - sampled_old).clamp_min(0.0)
    other_new = (1 - top_k_new.sum(dim=-1, keepdim=True) - sampled_new).clamp_min(0.0)
    # Approximate KL(old || new) over the K + 2 buckets
    eps = 1e-8
    p = torch.cat([top_k_old, sampled_old, other_old], dim=-1)
    q = torch.cat([top_k_new, sampled_new, other_new], dim=-1)
    kl = (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=-1)
    return kl
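How much the Top-K bucketing loses relative to the full KL can be seen on a small example. The sketch below is plain Python on a made-up six-token vocabulary and assumes the sampled token always gets its own bucket. Under that assumption the binary partition, the Top-K partition, and the full vocabulary are nested, so the three KL values are ordered: merging buckets can only lower KL. The helper names `coarsen` and `topk_kl` are hypothetical.

```python
import math


def kl_divergence(p, q, floor=1e-12):
    """KL(p || q) in nats; q is floored so an empty 'other' bucket stays finite."""
    return sum(a * math.log(a / max(b, floor)) for a, b in zip(p, q) if a > 0)


def coarsen(p, kept):
    """Keep the listed token indices and merge everything else into one bucket."""
    kept_mass = [p[i] for i in kept]
    return kept_mass + [max(0.0, 1.0 - sum(kept_mass))]


def topk_kl(p_old, p_new, sampled, k):
    """KL over the k most likely old-policy tokens, the sampled token, and 'other'."""
    top = sorted(range(len(p_old)), key=lambda i: p_old[i], reverse=True)[:k]
    kept = top if sampled in top else top + [sampled]
    return kl_divergence(coarsen(p_old, kept), coarsen(p_new, kept))


old = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]
new = [0.30, 0.30, 0.15, 0.10, 0.05, 0.10]
sampled = 5  # a rare token that is not among the top two

kl_binary = topk_kl(old, new, sampled, k=0)      # sampled token vs. everything else
kl_top2 = topk_kl(old, new, sampled, k=2)        # two top tokens + sampled + other
kl_full = topk_kl(old, new, sampled, k=len(old)) # every token kept

print(f"binary={kl_binary:.4f}  top-2={kl_top2:.4f}  full={kl_full:.4f}")
```

On these numbers two tracked tokens already recover most of the full KL, while the binary value captures roughly half of it. How large K needs to be in practice depends on how concentrated the policy's distribution is.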
Implement GRPO-style training with direct divergence constraints.
def dppo_training_step(model, batch, reward_fn, optimizer, divergence_limit=0.05, divergence_type='tv'):
    """
    DPPO training: constrain policy divergence directly.

    partition_into_groups, compute_group_advantages and compute_policy_loss
    are helpers this document does not define; supply your own versions.
    """
    input_ids = batch['input_ids']
    # Logits recorded when the rollouts were sampled; no gradient flows through them
    old_logits = batch['old_logits'].detach()
    sampled_tokens = batch['sampled_tokens']
    # Forward pass of the current policy on the stored rollouts
    new_logits = model(input_ids)
    rewards = reward_fn(batch)
    # Compute policy divergence
    if divergence_type == 'tv':
        divergence = compute_total_variation_divergence(old_logits, new_logits, sampled_tokens)
    elif divergence_type == 'binary':
        divergence = binary_divergence_approximation(old_logits, new_logits, sampled_tokens)
    elif divergence_type == 'topk':
        divergence = topk_divergence_approximation(old_logits, new_logits, sampled_tokens, k=10)
    else:
        # Any other value falls back to full KL
        divergence = compute_kl_divergence(old_logits, new_logits)
    # GRPO-style advantage estimation
    groups = partition_into_groups(sampled_tokens, rewards)
    advantages = compute_group_advantages(groups)
    # Policy loss: optimize rewards subject to divergence constraint
    policy_loss = compute_policy_loss(new_logits, advantages)
    # Divergence penalty: only penalize if divergence exceeds limit
    divergence_penalty = torch.clamp(divergence - divergence_limit, min=0.0).mean()
    # Total loss
    total_loss = policy_loss + divergence_penalty
    # Backprop: clear stale gradients, then update
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item(), divergence.mean().item()
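The training step relies on a group-advantage helper that is not shown. As a rough guide, the sketch below gives one common GRPO-style formulation in plain Python: each reward is normalized by the mean and standard deviation of its own group of rollouts. The function names, the use of the population standard deviation, and the `eps` value are assumptions for illustration, not this skill's implementation. The second function reproduces the hinge penalty from the training step on plain numbers.

```python
from statistics import mean, pstdev


def group_relative_advantages(group_rewards, eps=1e-6):
    """Normalize each reward against the rollouts generated for the same prompt."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)  # population std; some codebases use sample std
    return [(r - mu) / (sigma + eps) for r in group_rewards]


def divergence_hinge(divergences, limit):
    """Mean of max(0, d - limit): zero while every divergence is within the limit."""
    return mean([max(0.0, d - limit) for d in divergences])


# One prompt, four rollouts, binary rewards.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A group where every rollout scored the same carries no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

penalty = divergence_hinge([0.01, 0.04, 0.09], limit=0.05)
no_penalty = divergence_hinge([0.01, 0.04], limit=0.05)

print("mixed group advantages:  ", [round(a, 3) for a in mixed])
print("uniform group advantages:", uniform)
print(f"penalty={penalty:.4f}  no_penalty={no_penalty:.4f}")
```

Only the third divergence exceeds the limit, so it alone contributes to the penalty. Positions that stay within the limit receive no gradient from the penalty term, which is what separates this hinge from an always-on KL regularizer.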
| Aspect | Recommendation | Notes |
|--------|----------------|-------|
| Divergence Type | TV for speed, KL for precision | Binary/Top-K approximate while remaining efficient |
| Divergence Limit | 0.01-0.1 | Lower = tighter constraint; task-dependent |
| Approximation | Top-K for large vocab, Binary for simplicity | Top-K better preserves probability mass |
| K Value | 10-50 tokens | Trade-off between accuracy and computation |
| Comparison to PPO | Typically 10-20% stability improvement | DPPO avoids asymmetric penalty structure |
When to Use:
When Not to Use:
DPPO demonstrates superior stability and efficiency compared to GRPO and other baselines across multiple model sizes and tasks, with lightweight approximations enabling vocabulary-scale divergence computation.