Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

ADu2021/dppo-divergence-policy

Name: dppo-divergence-policy
Author: ADu2021

skills/skillxiv-v0.0.2-claude-opus-4.6/dppo-divergence-policy/SKILL.md

npx skillsauth add ADu2021/skillXiv dppo-divergence-policy

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

DPPO: Direct Divergence Constraints for Stable LLM RL

PPO's probability ratio clipping creates problematic asymmetries in LLM training: low-probability tokens trigger aggressive clipping despite negligible distributional impact, while high-probability tokens shift substantially without penalty. DPPO replaces heuristic ratio clipping with direct divergence constraints, measuring actual policy divergence rather than noisy single-sample estimates. Lightweight approximations enable vocabulary-scale divergence computation.

Core Concept

The key insight is that PPO's mechanism is fundamentally misaligned with LLM training objectives. PPO uses token-level probability ratios as proxies for policy divergence, but these are noisy single-sample estimates that don't reflect true distributional differences. DPPO directly measures divergence (Total Variation or KL) between old and new policies, constraining whether the entire distribution has shifted too far rather than penalizing individual tokens.

Architecture Overview

Direct Divergence Measurement: Compute actual policy divergence rather than probability ratios
Lightweight Approximations: Binary (Bernoulli) and Top-K approximations for efficient divergence estimation
Vocabulary-Scale Feasibility: Avoid expensive full-vocabulary KL computation via smart approximations
Asymmetry-Free Updates: Uniform penalty structure regardless of token probability
Joint Optimization: Train both policy and reward model with aligned divergence constraints

Implementation

Step 1: Understand PPO's Ratio-Based Clipping Problem

Analyze why probability ratios are problematic for policy divergence control.

def analyze_ppo_asymmetry(old_logits, new_logits, sampled_tokens):
    """
    Demonstrate PPO's asymmetry: ratio clipping poorly controls actual divergence.
    """
    batch_size, vocab_size = old_logits.shape
    
    # Compute probability ratios for sampled tokens
    old_probs = torch.softmax(old_logits, dim=-1)
    new_probs = torch.softmax(new_logits, dim=-1)
    
    # Get sampled token probabilities
    sampled_old_probs = old_probs[torch.arange(batch_size), sampled_tokens]
    sampled_new_probs = new_probs[torch.arange(batch_size), sampled_tokens]
    
    # Probability ratios
    prob_ratios = sampled_new_probs / (sampled_old_probs + 1e-8)
    
    # Problem: ratio depends on old probability magnitude
    print("Low probability tokens: high ratios but small distributional impact")
    print("High probability tokens: small ratios but large distributional impact")
    
    # Actual KL divergence
    actual_kl = torch.nn.functional.kl_div(
        torch.log_softmax(new_logits, dim=-1),
        torch.softmax(old_logits, dim=-1),
        reduction='batchmean'
    )
    
    return prob_ratios, actual_kl

Step 2: Implement Direct Total Variation Divergence

Measure policy divergence via Total Variation rather than probability ratios.

def compute_total_variation_divergence(old_logits, new_logits, sampled_tokens):
    """
    Measure TV divergence: 0.5 * sum(|p_old - p_new|)
    Lower cost than KL, symmetric.
    """
    batch_size, vocab_size = old_logits.shape
    
    old_probs = torch.softmax(old_logits, dim=-1)
    new_probs = torch.softmax(new_logits, dim=-1)
    
    # TV divergence
    tv_divergence = 0.5 * torch.abs(old_probs - new_probs).sum(dim=-1)
    
    return tv_divergence

def compute_kl_divergence(old_logits, new_logits):
    """
    Full KL divergence: expensive but precise.
    """
    old_probs = torch.softmax(old_logits, dim=-1)
    new_log_probs = torch.log_softmax(new_logits, dim=-1)
    
    kl = (old_probs * (torch.log(old_probs + 1e-8) - new_log_probs)).sum(dim=-1)
    
    return kl

Step 3: Implement Binary (Bernoulli) Approximation

Approximate divergence via binary classification: sampled token vs. all others.

def binary_divergence_approximation(old_logits, new_logits, sampled_tokens):
    """
    Binary approximation: treat sampled token as 1, all others as 0.
    Compute divergence between:
    - (p_old[sampled], 1 - p_old[sampled])
    - (p_new[sampled], 1 - p_new[sampled])
    """
    batch_size, vocab_size = old_logits.shape
    
    old_probs = torch.softmax(old_logits, dim=-1)
    new_probs = torch.softmax(new_logits, dim=-1)
    
    # Probability of sampled token
    p_old_sampled = old_probs[torch.arange(batch_size), sampled_tokens]
    p_new_sampled = new_probs[torch.arange(batch_size), sampled_tokens]
    
    # Binary divergence (simplified KL)
    # D(p_old || p_new) for Bernoulli
    eps = 1e-8
    divergence = (
        p_old_sampled * (torch.log(p_old_sampled + eps) - torch.log(p_new_sampled + eps)) +
        (1 - p_old_sampled) * (torch.log(1 - p_old_sampled + eps) - torch.log(1 - p_new_sampled + eps))
    )
    
    return divergence

Step 4: Implement Top-K Approximation

Approximate divergence by tracking K highest-probability tokens.

def topk_divergence_approximation(old_logits, new_logits, sampled_tokens, k=10):
    """
    Top-K approximation: track K highest tokens + sampled token.
    Aggregate remaining tokens as 'other' category.
    """
    batch_size, vocab_size = old_logits.shape
    
    old_probs = torch.softmax(old_logits, dim=-1)
    new_probs = torch.softmax(new_logits, dim=-1)
    
    # Get top-K probabilities
    top_k_old, top_k_indices_old = torch.topk(old_probs, k=k, dim=-1)
    top_k_new = torch.gather(new_probs, -1, top_k_indices_old)
    
    # Find position of sampled token in top-k
    sampled_in_topk = torch.isin(top_k_indices_old, sampled_tokens.unsqueeze(-1))
    
    # Aggregate probabilities of tokens not in top-k or sampled
    other_old = (1 - top_k_old.sum(dim=-1, keepdim=True))
    other_new = (1 - top_k_new.sum(dim=-1, keepdim=True))
    
    # Approximate KL divergence
    eps = 1e-8
    kl = (
        (top_k_old * (torch.log(top_k_old + eps) - torch.log(top_k_new + eps))).sum(dim=-1) +
        other_old.squeeze() * (torch.log(other_old.squeeze() + eps) - torch.log(other_new.squeeze() + eps))
    )
    
    return kl

Step 5: DPPO Training with Divergence Constraints

Implement GRPO-style training with direct divergence constraints.

def dppo_training_step(model, batch, reward_fn, divergence_limit=0.05, divergence_type='tv'):
    """
    DPPO training: constrain policy divergence directly.
    """
    input_ids = batch['input_ids']
    old_logits = batch['old_logits']
    
    # Generate rollouts with new policy
    new_logits = model(input_ids)
    sampled_tokens = batch['sampled_tokens']
    rewards = reward_fn(batch)
    
    # Compute policy divergence
    if divergence_type == 'tv':
        divergence = compute_total_variation_divergence(old_logits, new_logits, sampled_tokens)
    elif divergence_type == 'binary':
        divergence = binary_divergence_approximation(old_logits, new_logits, sampled_tokens)
    elif divergence_type == 'topk':
        divergence = topk_divergence_approximation(old_logits, new_logits, sampled_tokens, k=10)
    else:
        divergence = compute_kl_divergence(old_logits, new_logits)
    
    # GRPO-style advantage estimation
    groups = partition_into_groups(sampled_tokens, rewards)
    advantages = compute_group_advantages(groups)
    
    # Policy loss: optimize rewards subject to divergence constraint
    policy_loss = compute_policy_loss(new_logits, advantages)
    
    # Divergence penalty: only penalize if divergence exceeds limit
    divergence_penalty = torch.clamp(divergence - divergence_limit, min=0.0).mean()
    
    # Total loss
    total_loss = policy_loss + divergence_penalty
    
    # Backprop
    total_loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    
    return total_loss.item(), divergence.mean().item()

Practical Guidance

| Aspect | Recommendation | Notes | |--------|----------------|-------| | Divergence Type | TV for speed, KL for precision | Binary/Top-K approximate while remaining efficient | | Divergence Limit | 0.01-0.1 | Lower = tighter constraint; task-dependent | | Approximation | Top-K for large vocab, Binary for simplicity | Top-K better preserves probability mass | | K Value | 10-50 tokens | Trade-off between accuracy and computation | | Comparison to PPO | Typically 10-20% stability improvement | DPPO avoids asymmetric penalty structure |

When to Use:

LLM RL training where PPO stability is problematic
Scenarios with sparse reward signals (divergence constraints help)
Models with large vocabularies (approximations become essential)

When Not to Use:

Well-tuned PPO with careful hyperparameter selection
Small vocabulary tasks (ratio-based methods may be sufficient)
Systems requiring extremely tight divergence control (need full KL)

Reference

Demonstrates superior stability and efficiency compared to GRPO and other baselines across multiple model sizes and tasks, with lightweight approximations enabling vocabulary-scale divergence computation.

ADu2021/dppo-divergence-policy

skills/skillxiv-v0.0.2-claude-opus-4.6/dppo-divergence-policy/SKILL.md

Replace PPO's heuristic ratio-based clipping with Divergence Proximal Policy Optimization (DPPO) that directly constrains policy divergence using either Total Variation or KL, enabling lightweight approximations (Binary, Top-K) for vocabulary-scale computations while improving stability and efficiency.

2 stars

tools

Updated Apr 17, 2026

$ install --global

skillsauth

npx skillsauth add ADu2021/skillXiv dppo-divergence-policy

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 17, 2026, 5:33 AM8.2s1 file scanned

SKILL.md

name:: dppo-divergence-policy
title:: Rethinking the Trust Region in LLM Reinforcement Learning: DPPO
version:: 0.0.2
engine:: skillxiv-v0.0.2-claude-opus-4.6
license:: MIT
url:: https://arxiv.org/abs/2602.04879
keywords:: [Reinforcement Learning, PPO, Policy Divergence, Trust Regions, LLM Training]
description:: Replace PPO's heuristic ratio-based clipping with Divergence Proximal Policy Optimization (DPPO) that directly constrains policy divergence using either Total Variation or KL, enabling lightweight approximations (Binary, Top-K) for vocabulary-scale computations while improving stability and efficiency.

DPPO: Direct Divergence Constraints for Stable LLM RL

Core Concept

Architecture Overview

Direct Divergence Measurement: Compute actual policy divergence rather than probability ratios
Lightweight Approximations: Binary (Bernoulli) and Top-K approximations for efficient divergence estimation
Vocabulary-Scale Feasibility: Avoid expensive full-vocabulary KL computation via smart approximations
Asymmetry-Free Updates: Uniform penalty structure regardless of token probability
Joint Optimization: Train both policy and reward model with aligned divergence constraints

Implementation

Step 1: Understand PPO's Ratio-Based Clipping Problem

Analyze why probability ratios are problematic for policy divergence control.

def analyze_ppo_asymmetry(old_logits, new_logits, sampled_tokens):
    """
    Demonstrate PPO's asymmetry: ratio clipping poorly controls actual divergence.
    """
    batch_size, vocab_size = old_logits.shape
    
    # Compute probability ratios for sampled tokens
    old_probs = torch.softmax(old_logits, dim=-1)
    new_probs = torch.softmax(new_logits, dim=-1)
    
    # Get sampled token probabilities
    sampled_old_probs = old_probs[torch.arange(batch_size), sampled_tokens]
    sampled_new_probs = new_probs[torch.arange(batch_size), sampled_tokens]
    
    # Probability ratios
    prob_ratios = sampled_new_probs / (sampled_old_probs + 1e-8)
    
    # Problem: ratio depends on old probability magnitude
    print("Low probability tokens: high ratios but small distributional impact")
    print("High probability tokens: small ratios but large distributional impact")
    
    # Actual KL divergence
    actual_kl = torch.nn.functional.kl_div(
        torch.log_softmax(new_logits, dim=-1),
        torch.softmax(old_logits, dim=-1),
        reduction='batchmean'
    )
    
    return prob_ratios, actual_kl

Step 2: Implement Direct Total Variation Divergence

Measure policy divergence via Total Variation rather than probability ratios.

def compute_total_variation_divergence(old_logits, new_logits, sampled_tokens):
    """
    Measure TV divergence: 0.5 * sum(|p_old - p_new|)
    Lower cost than KL, symmetric.
    """
    batch_size, vocab_size = old_logits.shape
    
    old_probs = torch.softmax(old_logits, dim=-1)
    new_probs = torch.softmax(new_logits, dim=-1)
    
    # TV divergence
    tv_divergence = 0.5 * torch.abs(old_probs - new_probs).sum(dim=-1)
    
    return tv_divergence

def compute_kl_divergence(old_logits, new_logits):
    """
    Full KL divergence: expensive but precise.
    """
    old_probs = torch.softmax(old_logits, dim=-1)
    new_log_probs = torch.log_softmax(new_logits, dim=-1)
    
    kl = (old_probs * (torch.log(old_probs + 1e-8) - new_log_probs)).sum(dim=-1)
    
    return kl

Step 3: Implement Binary (Bernoulli) Approximation

Approximate divergence via binary classification: sampled token vs. all others.

def binary_divergence_approximation(old_logits, new_logits, sampled_tokens):
    """
    Binary approximation: treat sampled token as 1, all others as 0.
    Compute divergence between:
    - (p_old[sampled], 1 - p_old[sampled])
    - (p_new[sampled], 1 - p_new[sampled])
    """
    batch_size, vocab_size = old_logits.shape
    
    old_probs = torch.softmax(old_logits, dim=-1)
    new_probs = torch.softmax(new_logits, dim=-1)
    
    # Probability of sampled token
    p_old_sampled = old_probs[torch.arange(batch_size), sampled_tokens]
    p_new_sampled = new_probs[torch.arange(batch_size), sampled_tokens]
    
    # Binary divergence (simplified KL)
    # D(p_old || p_new) for Bernoulli
    eps = 1e-8
    divergence = (
        p_old_sampled * (torch.log(p_old_sampled + eps) - torch.log(p_new_sampled + eps)) +
        (1 - p_old_sampled) * (torch.log(1 - p_old_sampled + eps) - torch.log(1 - p_new_sampled + eps))
    )
    
    return divergence

Step 4: Implement Top-K Approximation

Approximate divergence by tracking K highest-probability tokens.

def topk_divergence_approximation(old_logits, new_logits, sampled_tokens, k=10):
    """
    Top-K approximation: track K highest tokens + sampled token.
    Aggregate remaining tokens as 'other' category.
    """
    batch_size, vocab_size = old_logits.shape
    
    old_probs = torch.softmax(old_logits, dim=-1)
    new_probs = torch.softmax(new_logits, dim=-1)
    
    # Get top-K probabilities
    top_k_old, top_k_indices_old = torch.topk(old_probs, k=k, dim=-1)
    top_k_new = torch.gather(new_probs, -1, top_k_indices_old)
    
    # Find position of sampled token in top-k
    sampled_in_topk = torch.isin(top_k_indices_old, sampled_tokens.unsqueeze(-1))
    
    # Aggregate probabilities of tokens not in top-k or sampled
    other_old = (1 - top_k_old.sum(dim=-1, keepdim=True))
    other_new = (1 - top_k_new.sum(dim=-1, keepdim=True))
    
    # Approximate KL divergence
    eps = 1e-8
    kl = (
        (top_k_old * (torch.log(top_k_old + eps) - torch.log(top_k_new + eps))).sum(dim=-1) +
        other_old.squeeze() * (torch.log(other_old.squeeze() + eps) - torch.log(other_new.squeeze() + eps))
    )
    
    return kl

Step 5: DPPO Training with Divergence Constraints

Implement GRPO-style training with direct divergence constraints.

def dppo_training_step(model, batch, reward_fn, divergence_limit=0.05, divergence_type='tv'):
    """
    DPPO training: constrain policy divergence directly.
    """
    input_ids = batch['input_ids']
    old_logits = batch['old_logits']
    
    # Generate rollouts with new policy
    new_logits = model(input_ids)
    sampled_tokens = batch['sampled_tokens']
    rewards = reward_fn(batch)
    
    # Compute policy divergence
    if divergence_type == 'tv':
        divergence = compute_total_variation_divergence(old_logits, new_logits, sampled_tokens)
    elif divergence_type == 'binary':
        divergence = binary_divergence_approximation(old_logits, new_logits, sampled_tokens)
    elif divergence_type == 'topk':
        divergence = topk_divergence_approximation(old_logits, new_logits, sampled_tokens, k=10)
    else:
        divergence = compute_kl_divergence(old_logits, new_logits)
    
    # GRPO-style advantage estimation
    groups = partition_into_groups(sampled_tokens, rewards)
    advantages = compute_group_advantages(groups)
    
    # Policy loss: optimize rewards subject to divergence constraint
    policy_loss = compute_policy_loss(new_logits, advantages)
    
    # Divergence penalty: only penalize if divergence exceeds limit
    divergence_penalty = torch.clamp(divergence - divergence_limit, min=0.0).mean()
    
    # Total loss
    total_loss = policy_loss + divergence_penalty
    
    # Backprop
    total_loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    
    return total_loss.item(), divergence.mean().item()

Practical Guidance

When to Use:

LLM RL training where PPO stability is problematic
Scenarios with sparse reward signals (divergence constraints help)
Models with large vocabularies (approximations become essential)

When Not to Use:

Well-tuned PPO with careful hyperparameter selection
Small vocabulary tasks (ratio-based methods may be sufficient)
Systems requiring extremely tight divergence control (need full KL)

Reference

Related Skills

ADu2021/flow-map-trajectory-tilting

testing

VerifiedTrustedCommunity

Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flow-map-trajectory-tilting

ADu2021/flexible-data-mixture-of-experts

testing

VerifiedTrustedCommunity

Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flexible-data-mixture-of-experts

ADu2021/flexibility-trap-diffusion-reasoning

data-ai

VerifiedTrustedCommunity

Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flexibility-trap-diffusion-reasoning

ADu2021/flex-continuous-agent-evolution

devops

VerifiedTrustedCommunity

Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flex-continuous-agent-evolution

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/ADu2021/skillXiv.git

# Copy into Claude Code skills folder (global)
cp -r skillXiv/skills/skillxiv-v0.0.2-claude-opus-4.6/dppo-divergence-policy ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

ADu2021/skillXiv

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT