skills/skillxiv-v0.0.2-claude-opus-4.6/f-grpo-divergence-alignment/SKILL.md
Unify LLM alignment methods through f-divergence theory. f-GRPO extends GRPO to handle any divergence measure (KL, Jensen-Shannon, Hellinger), enabling tailored alignment objectives. f-HAL combines on-policy and off-policy preference learning to prevent reward hacking while maintaining safety alignment.
Install this skill globally with one command: `npx skillsauth add ADu2021/skillXiv f-grpo-divergence-alignment`. Works with Claude Code, Cursor, and Windsurf.
Popular LLM alignment methods each optimize a divergence between aligned and unaligned distributions, yet they lack a unified framework. f-GRPO grounds GRPO in f-divergence theory, so you can select the divergence measure that matches your alignment objective. For verifiable rewards (correct/incorrect), f-GRPO concentrates probability on high-reward responses. For preference-based alignment, f-HAL balances on-policy exploration with off-policy preference learning to prevent reward hacking.
f-GRPO generalizes GRPO by parameterizing the divergence: min_π D_f(π || π_ref), where D_f is any f-divergence (KL, Jensen-Shannon, Hellinger, etc.). Different f-divergences yield different concentration behaviors.
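To make "any f-divergence" concrete, the sketch below uses the standard textbook definition D_f(P || Q) = Σ q(x) · f(p(x) / q(x)) for a convex generator f with f(1) = 0. The names `GENERATORS` and `f_divergence` are illustrative and not taken from the paper, and the paper's exact parameterization may differ.

```python
import math

# Generator functions f (convex, f(1) = 0) for three common f-divergences.
# Hypothetical helper names; standard definitions, not the paper's code.
GENERATORS = {
    # KL(P || Q): f(t) = t log t
    "kl": lambda t: t * math.log(t),
    # Jensen-Shannon: f(t) = 0.5 * [t log t - (t + 1) log((t + 1) / 2)]
    "js": lambda t: 0.5 * (t * math.log(t) - (t + 1.0) * math.log((t + 1.0) / 2.0)),
    # Squared Hellinger: f(t) = 0.5 * (sqrt(t) - 1)^2
    "hellinger": lambda t: 0.5 * (math.sqrt(t) - 1.0) ** 2,
}


def f_divergence(p, q, name):
    """D_f(P || Q) for strictly positive discrete distributions p and q."""
    f = GENERATORS[name]
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]
for name in GENERATORS:
    print(name, f_divergence(p, q, name))
```

Swapping the generator is the only change needed to move between divergences, which is the sense in which the objective is parameterized by f.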
For verifiable rewards, f-GRPO estimates the divergence between the distributions of responses with above-average and below-average rewards. For preference-based alignment, f-HAL combines on-policy RL (exploration) with off-policy preference signals (exploitation).
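As a small illustration of the above-average versus below-average split, the sketch below partitions one group of sampled responses by the group's mean reward and computes mean-centered advantages. The function name `split_by_group_mean` is hypothetical, and placing responses with exactly average reward in the "below" set is an assumption of this sketch.

```python
def split_by_group_mean(rewards):
    """Return (advantages, above_idx, below_idx) for one group of sampled responses."""
    mean_reward = sum(rewards) / len(rewards)
    # Mean-centered advantage: positive for above-average responses
    advantages = [r - mean_reward for r in rewards]
    above = [i for i, r in enumerate(rewards) if r > mean_reward]
    # Assumption: responses with exactly average reward count as "below"
    below = [i for i, r in enumerate(rewards) if r <= mean_reward]
    return advantages, above, below


# Verifiable rewards: 1.0 for a correct response, 0.0 for an incorrect one
rewards = [1.0, 0.0, 0.0, 1.0]
advantages, above, below = split_by_group_mean(rewards)
print(advantages, above, below)
```

With binary rewards the two sets are simply the correct and incorrect responses in the group.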
Implement f-GRPO for verifiable rewards:
```python
import math

import torch
import torch.nn.functional as F


def compute_f_divergence(p_logits, q_logits, divergence_type='kl'):
    """
    Compute an f-divergence between two categorical distributions.

    Args:
        p_logits: Logits from policy distribution [batch, vocab_size]
        q_logits: Logits from reference distribution [batch, vocab_size]
        divergence_type: 'kl', 'js', or 'hellinger'

    Returns:
        divergence: Scalar divergence value (mean over the batch)
    """
    # Work in log space; torch.log(softmax(x)) can underflow to -inf
    log_p = F.log_softmax(p_logits, dim=-1)
    log_q = F.log_softmax(q_logits, dim=-1)
    p, q = log_p.exp(), log_q.exp()

    if divergence_type == 'kl':
        # KL(p || q) = sum p * log(p/q)
        return (p * (log_p - log_q)).sum(dim=-1).mean()
    elif divergence_type == 'js':
        # Jensen-Shannon: symmetric divergence with mixture m = (p + q) / 2
        log_m = torch.logaddexp(log_p, log_q) - math.log(2.0)
        return (0.5 * (p * (log_p - log_m)).sum(dim=-1)
                + 0.5 * (q * (log_q - log_m)).sum(dim=-1)).mean()
    elif divergence_type == 'hellinger':
        # Squared Hellinger distance: 1 - sum sqrt(p * q)
        return (1.0 - torch.exp(0.5 * (log_p + log_q)).sum(dim=-1)).mean()
    else:
        raise ValueError(f"Unknown divergence: {divergence_type}")


def f_grpo_advantage(rewards, divergence_strength=1.0):
    """Compute advantages for f-GRPO with reward-based concentration."""
    # Center on the group mean: above-average responses get a positive
    # advantage, below-average responses a negative one
    advantage = rewards - rewards.mean()
    return advantage * divergence_strength


def f_grpo_loss(policy_logits, reference_logits, actions, rewards, divergence_type='kl',
                divergence_strength=1.0, entropy_coef=0.01):
    """Compute f-GRPO loss combining divergence and reward optimization.

    `actions` holds the sampled token ids [batch] that earned `rewards` [batch].
    """
    # Compute divergence penalty toward the reference policy
    divergence = compute_f_divergence(policy_logits, reference_logits, divergence_type)

    # Advantages are constants with respect to the policy parameters
    advantages = f_grpo_advantage(rewards, divergence_strength).detach()

    # Policy gradient on the log-probability of the sampled actions
    # (the log-probabilities must stay attached to the graph)
    log_probs = F.log_softmax(policy_logits, dim=-1)
    action_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    policy_loss = -(action_log_probs * advantages).mean()

    # Entropy regularization (prevent mode collapse)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()

    # Combined loss; entropy is subtracted so that minimizing the loss raises it
    loss = policy_loss + divergence - entropy_coef * entropy
    return loss


def f_grpo_training_step(policy, reference_model, batch, optimizer, divergence_type='js'):
    """Single f-GRPO training step."""
    inputs, actions, rewards = batch

    # Get logits (assumed here to have shape [batch, vocab_size])
    policy_logits = policy(inputs).logits
    with torch.no_grad():
        ref_logits = reference_model(inputs).logits

    # Compute loss
    loss = f_grpo_loss(policy_logits, ref_logits, actions, rewards,
                       divergence_type=divergence_type)

    # Update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Implement f-HAL for preference-based alignment:

```python
def f_hal_loss(policy_logits, reference_logits, actions, preference_pairs, on_policy_rewards,
               divergence_type='js', on_policy_weight=0.5, off_policy_weight=0.5):
    """
    f-HAL: Hybrid on-policy + off-policy preference alignment.

    Args:
        policy_logits: [batch, vocab_size]
        reference_logits: [batch, vocab_size]
        actions: Sampled token ids from on-policy rollouts [batch]
        preference_pairs: [(preferred_ids, dispreferred_ids), ...] from offline data.
            This sketch assumes each element is a LongTensor of token ids with
            one entry per row of policy_logits.
        on_policy_rewards: Rewards from on-policy rollouts [batch]
        on_policy_weight: Weight for the on-policy f-GRPO term
        off_policy_weight: Weight for the off-policy preference term
    """
    # On-policy component: standard f-GRPO with on-policy rewards
    on_policy_loss = f_grpo_loss(policy_logits, reference_logits, actions, on_policy_rewards,
                                 divergence_type=divergence_type)

    # Off-policy component: preference learning from offline pairs
    log_probs = F.log_softmax(policy_logits, dim=-1)
    off_policy_loss = policy_logits.new_zeros(())
    for preferred, dispreferred in preference_pairs:
        # Sequence log-probability: sum of the log-probs of its tokens
        pref_logprob = log_probs.gather(-1, preferred.unsqueeze(-1)).sum()
        dispref_logprob = log_probs.gather(-1, dispreferred.unsqueeze(-1)).sum()
        # Bradley-Terry: P(preferred beats dispreferred) = sigmoid(log-prob margin)
        off_policy_loss = off_policy_loss - F.logsigmoid(pref_logprob - dispref_logprob)
    off_policy_loss = off_policy_loss / max(len(preference_pairs), 1)

    # Combine with weight balance
    total_loss = on_policy_weight * on_policy_loss + off_policy_weight * off_policy_loss
    return total_loss
```
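The Bradley-Terry model named in the off-policy term is usually written as a logistic loss on the log-probability margin, which keeps the loss bounded below by zero instead of rewarding an ever-growing margin. The following is a minimal stdlib sketch of that loss for a single pair; `bradley_terry_loss` is a hypothetical helper name, and the inputs are assumed to be sequence log-probabilities under the policy.

```python
import math


def bradley_terry_loss(pref_logprob, dispref_logprob):
    """Negative log-likelihood that the preferred response wins: -log sigmoid(margin)."""
    margin = pref_logprob - dispref_logprob
    # -log sigmoid(margin) = log(1 + exp(-margin)), split by sign to avoid overflow
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Equal log-probabilities: the model is indifferent, loss = log 2
print(bradley_terry_loss(-5.0, -5.0))
# Preferred response is more likely: loss shrinks toward 0
print(bradley_terry_loss(-3.0, -8.0))
# Dispreferred response is more likely: loss grows roughly linearly
print(bradley_terry_loss(-8.0, -3.0))
```

Because the loss flattens once the preferred response is clearly ahead, already-satisfied pairs contribute little gradient.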
| Parameter | Recommendation | Notes |
|-----------|----------------|-------|
| Divergence type | Start with 'js' | Jensen-Shannon balances exploration; tune if needed. |
| Divergence strength | 0.5-2.0 | Higher values strengthen alignment pressure. |
| On-policy weight | 0.5-0.7 | Balance exploration (on-policy) vs. preference (off-policy). |
| Off-policy weight | 0.3-0.5 | Too high causes reward hacking; too low loses preference signal. |
| Entropy coefficient | 0.01-0.05 | Prevents mode collapse; adjust if needed. |
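One way to keep these recommendations next to the training code is a small config object that flags values outside the suggested ranges. The class `FGRPOConfig` and its method `out_of_range` are hypothetical names introduced for this sketch. The defaults are picked from inside the table's ranges, and the accepted divergence names are the three mentioned in this document.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FGRPOConfig:
    """Hypothetical container for the hyperparameters in the table above."""
    divergence_type: str = "js"
    divergence_strength: float = 1.0
    on_policy_weight: float = 0.5
    off_policy_weight: float = 0.5
    entropy_coef: float = 0.01

    def out_of_range(self) -> List[str]:
        """Return the names of fields outside the recommended ranges."""
        ranges = {
            "divergence_strength": (0.5, 2.0),
            "on_policy_weight": (0.5, 0.7),
            "off_policy_weight": (0.3, 0.5),
            "entropy_coef": (0.01, 0.05),
        }
        problems = [name for name, (low, high) in ranges.items()
                    if not low <= getattr(self, name) <= high]
        if self.divergence_type not in ("kl", "js", "hellinger"):
            problems.append("divergence_type")
        return problems


config = FGRPOConfig(off_policy_weight=0.9)
print(config.out_of_range())
```

The ranges are starting points, so the check reports names for review instead of raising an error.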
When to Use
When NOT to Use
Common Pitfalls
See https://arxiv.org/abs/2602.05946 for theoretical analysis of f-divergence, convergence proofs, and empirical validation on math reasoning and safety alignment benchmarks.