skills/skillxiv-v0.0.2-claude-opus-4.6/atr-grpo-policy-divergence/SKILL.md
Replace ratio-based clipping in GRPO with KL-divergence constraints using the KL3 estimator, improving exploration and training stability with asymmetric clipping that requires no additional computation.
npx skillsauth add ADu2021/skillXiv atr-grpo-policy-divergenceInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Current reinforcement learning methods for LLMs rely on ratio-based clipping (PPO, GRPO) to ensure training stability, but this represents a specific design choice with potential limitations. The mechanism that controls how much the policy can diverge from the reference is critical for balancing improvement and stability. Recent evidence shows that exploration and final performance are highly sensitive to how policy divergence constraints are defined.
This framework proposes [KL-divergence-based constraints, KL3 estimator, asymmetric clipping] as a more principled alternative to ratio-based clipping. KL3 divergence (r - 1 - log r) provides effective trust-region approximation without needing full action-space expectations, enabling computationally efficient constraints that promote stronger exploration.
Analyze different divergence measures and their computational properties.
# Policy divergence measures
class PolicyDivergenceMeasures:
"""
Compare different policy divergence constraint formulations.
"""
@staticmethod
def ratio_based_divergence(log_prob_ratio):
"""
Standard PPO/GRPO: ratio r(θ) = exp(log_prob_ratio)
Constraint: r in [1-ε, 1+ε] (symmetric clipping)
"""
ratio = torch.exp(log_prob_ratio)
return ratio
@staticmethod
def kl_divergence(log_prob_ratio):
"""
KL divergence: D_KL(p_old || p_new) ≈ -log_prob_ratio
(for small changes)
"""
return -log_prob_ratio
@staticmethod
def kl3_divergence(log_prob_ratio):
"""
KL3 divergence: r(θ) - 1 - log r(θ)
where r(θ) = exp(log_prob_ratio)
Properties:
- Symmetric and bounded below by 0
- Equals KL in first-order approximation
- Computationally efficient (no expectations needed)
"""
ratio = torch.exp(log_prob_ratio)
kl3 = ratio - 1 - log_prob_ratio
return kl3
@staticmethod
def js_divergence(log_prob_ratio):
"""
Jensen-Shannon divergence (symmetric KL variant)
"""
kl_forward = -log_prob_ratio
kl_reverse = log_prob_ratio
js = 0.5 * (kl_forward + kl_reverse)
return js
@staticmethod
def compare_divergences(log_prob_ratio_range=None):
"""
Visualize divergence measures across different probability changes.
"""
if log_prob_ratio_range is None:
log_prob_ratio_range = torch.linspace(-1, 1, 100)
ratio_div = PolicyDivergenceMeasures.ratio_based_divergence(
log_prob_ratio_range
)
kl_div = PolicyDivergenceMeasures.kl_divergence(log_prob_ratio_range)
kl3_div = PolicyDivergenceMeasures.kl3_divergence(log_prob_ratio_range)
return {
'log_prob_ratio': log_prob_ratio_range,
'ratio': ratio_div,
'kl': kl_div,
'kl3': kl3_div
}
Formulate the KL3-based constraint for policy updates.
# KL3-based constraint
def compute_kl3_constraint(log_prob_ratio, max_kl3=0.1):
"""
Compute KL3 divergence and check against constraint.
Args:
log_prob_ratio: log(p_new) - log(p_old)
max_kl3: Maximum allowed KL3 divergence
Returns:
kl3_values: Computed KL3 for each token
mask: Boolean mask indicating which tokens violate constraint
"""
ratio = torch.exp(log_prob_ratio)
kl3 = ratio - 1 - log_prob_ratio
# Check constraint
violates_constraint = kl3 > max_kl3
return kl3, violates_constraint
def asymmetric_clipping_from_kl3(log_prob_ratio, max_kl3=0.1):
"""
Derive asymmetric clipping ranges from KL3 constraint.
KL3(δ) = exp(δ) - 1 - δ ≤ max_kl3
This gives asymmetric bounds:
- Allow larger increases: exp(δ) can be larger
- Constrain decreases more: prevent large probability drops
"""
# Solve for bounds numerically
# (In practice, use precomputed lookup table)
def solve_kl3_bound(kl_max, direction='increase'):
"""
Numerically solve for δ such that exp(δ) - 1 - δ = kl_max
"""
if direction == 'increase':
# Find upper bound (positive δ)
delta_candidate = torch.linspace(0, 2.0, 100)
else:
# Find lower bound (negative δ)
delta_candidate = torch.linspace(-2.0, 0, 100)
kl3_vals = torch.exp(delta_candidate) - 1 - delta_candidate
closest_idx = torch.argmin(torch.abs(kl3_vals - kl_max))
return delta_candidate[closest_idx].item()
# Compute bounds
upper_bound = solve_kl3_bound(max_kl3, direction='increase')
lower_bound = solve_kl3_bound(max_kl3, direction='decrease')
# Asymmetric ratio bounds
ratio_upper = torch.exp(torch.tensor(upper_bound))
ratio_lower = torch.exp(torch.tensor(lower_bound))
return ratio_lower, ratio_upper
Integrate KL3 constraints into the GRPO training step.
# ATR-GRPO: GRPO with KL3-based constraints
class ATRGRPO:
def __init__(
self,
model,
optimizer,
max_kl3=0.1,
group_size=8
):
self.model = model
self.optimizer = optimizer
self.max_kl3 = max_kl3
self.group_size = group_size
def compute_loss(
self,
log_probs,
rewards,
log_probs_old=None
):
"""
Compute ATR-GRPO loss with KL3 constraints.
Args:
log_probs: Log probabilities from current policy
rewards: Task rewards
log_probs_old: Log probabilities from reference policy
"""
if log_probs_old is None:
log_probs_old = log_probs.detach()
# Compute log probability ratio
log_prob_ratio = log_probs - log_probs_old
# Compute KL3 divergence
kl3, violates = compute_kl3_constraint(
log_prob_ratio, max_kl3=self.max_kl3
)
# Get asymmetric clipping bounds from KL3
ratio_lower, ratio_upper = asymmetric_clipping_from_kl3(
log_prob_ratio, max_kl3=self.max_kl3
)
# Compute ratio
ratio = torch.exp(log_prob_ratio)
# Apply asymmetric clipping
clipped_ratio = torch.clamp(
ratio, ratio_lower.item(), ratio_upper.item()
)
# Standard GRPO advantage computation
batch_size = len(rewards)
num_groups = batch_size // self.group_size
advantages = []
for group_idx in range(num_groups):
group_start = group_idx * self.group_size
group_end = (group_idx + 1) * self.group_size
group_rewards = rewards[group_start:group_end]
mean_reward = group_rewards.mean()
for i in range(self.group_size):
adv = group_rewards[i] - mean_reward
advantages.append(adv)
advantages = torch.stack(advantages)
# ATR-GRPO: use asymmetric clipping
loss = -torch.min(
log_prob_ratio * advantages,
torch.log(clipped_ratio) * advantages
).mean()
return loss, kl3
Implement side-by-side comparison to validate improvements.
# Comparison utility
def compare_grpo_vs_atr_grpo(
model, test_prompts, verifier,
num_samples=8, max_kl3=0.1
):
"""
Compare standard GRPO and ATR-GRPO on same test set.
"""
results = {'standard_grpo': [], 'atr_grpo': []}
for prompt in test_prompts:
# Generate samples
samples = []
log_probs_list = []
for _ in range(num_samples):
sample, log_prob = model.generate_with_logprobs(
prompt, max_tokens=200
)
samples.append(sample)
log_probs_list.append(log_prob)
log_probs = torch.stack(log_probs_list)
# Compute rewards
rewards = torch.tensor([
verifier(s) for s in samples
], dtype=torch.float32)
# Standard GRPO loss
standard_grpo = GRPO()
loss_standard = standard_grpo.compute_loss(log_probs, rewards)
# ATR-GRPO loss
atr_grpo = ATRGRPO(model, None, max_kl3=max_kl3)
loss_atr, kl3 = atr_grpo.compute_loss(log_probs, rewards)
results['standard_grpo'].append(loss_standard.item())
results['atr_grpo'].append(loss_atr.item())
return results
Full training loop using ATR-GRPO instead of standard GRPO.
# Training with ATR-GRPO
def train_with_atr_grpo(
model, train_loader, verifier, optimizer,
num_epochs=3, max_kl3=0.1, group_size=8, device='cuda'
):
"""
Train LLM using ATR-GRPO optimization.
"""
atr_grpo = ATRGRPO(model, optimizer, max_kl3=max_kl3, group_size=group_size)
for epoch in range(num_epochs):
total_loss = 0.0
total_kl3 = 0.0
num_batches = 0
for batch_idx, batch in enumerate(train_loader):
prompts = batch['prompts']
# Generate responses
responses = []
log_probs_list = []
for prompt in prompts:
response, log_prob = model.generate_with_logprobs(
prompt, max_tokens=200
)
responses.append(response)
log_probs_list.append(log_prob)
log_probs = torch.stack(log_probs_list).to(device)
# Compute rewards
rewards = torch.tensor([
verifier(r) for r in responses
], dtype=torch.float32, device=device)
# Compute loss with ATR-GRPO
loss, kl3 = atr_grpo.compute_loss(log_probs, rewards)
# Backward pass
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
total_loss += loss.item()
total_kl3 += kl3.mean().item()
num_batches += 1
if (batch_idx + 1) % 10 == 0:
print(f" Batch {batch_idx + 1}: loss={loss.item():.4f}, "
f"avg_kl3={kl3.mean().item():.6f}")
avg_loss = total_loss / num_batches
avg_kl3 = total_kl3 / num_batches
print(f"Epoch {epoch + 1}: Loss={avg_loss:.4f}, KL3={avg_kl3:.6f}")
return model
When to use: Reasoning tasks (math, code) where exploration and diversity matter. Apply to standard GRPO setups as drop-in replacement.
Hyperparameters:
Key advantages:
Common pitfalls:
Scaling: Negligible overhead (one extra computation per token). Scales to any model size.
Paper: https://arxiv.org/abs/2602.05494 Code: Available at author's repository Related work: GRPO, policy divergence, trust region optimization Benchmarks: AIME, MATH, reasoning tasks
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.