skills/skillxiv-v0.0.2-claude-opus-4.6/direct-group-preference-optimization-diffusion/SKILL.md
Optimize diffusion models for preference alignment by learning directly from group-level preferences without stochastic policies, enabling efficient and stable training.
npx skillsauth add ADu2021/skillXiv direct-group-preference-optimization-diffusionInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Alignment of generative models with human preferences typically requires stochastic policies and rollout-based learning, which is inefficient for diffusion models. DGPO (Direct Group Preference Optimization) eliminates this requirement by learning preferences directly from groups of samples, leveraging the determinism of ODE-based samplers.
The key insight is that preference optimization doesn't require stochastic policies. By partitioning generated samples into preferred and non-preferred groups based on advantage scores, the model learns to increase likelihood for the preferred group while decreasing it for the non-preferred group. This enables efficient training without trajectory sampling overhead.
DGPO operates through three mechanisms:
Implement the group partitioning mechanism.
def partition_samples_by_advantage(samples, rewards, partition_ratio=0.5):
"""
Partition samples into positive and negative groups by advantage.
Args:
samples: List of generated samples
rewards: Float array of reward scores
partition_ratio: Fraction of samples for positive group
Returns:
positive_samples: High-advantage samples
negative_samples: Low-advantage samples
weights: Normalized weights for each group
"""
import numpy as np
rewards = np.array(rewards)
sorted_indices = np.argsort(rewards)[::-1] # Sort descending
# Split at partition point
split_idx = int(len(samples) * partition_ratio)
positive_indices = sorted_indices[:split_idx]
negative_indices = sorted_indices[split_idx:]
positive_samples = [samples[i] for i in positive_indices]
negative_samples = [samples[i] for i in negative_indices]
# Equal weighting for now; could be adjusted by advantage magnitude
positive_weights = np.ones(len(positive_samples)) / len(positive_samples)
negative_weights = np.ones(len(negative_samples)) / len(negative_samples)
return positive_samples, negative_samples, \
(positive_weights, negative_weights)
Implement the direct preference loss with balanced weighting.
def dgpo_preference_loss(model, positive_samples, negative_samples,
positive_weights, negative_weights, timesteps=None):
"""
Compute DGPO preference loss without partition functions.
Args:
model: Diffusion model
positive_samples: Preferred samples
negative_samples: Non-preferred samples
positive_weights: Weights for positive samples
negative_weights: Weights for negative samples
timesteps: Timesteps to evaluate (for curriculum)
Returns:
loss: Scalar loss to minimize
"""
import torch
import torch.nn.functional as F
if timesteps is None:
timesteps = list(range(50, 1000, 50)) # Later timesteps by default
total_loss = 0.0
for t in timesteps:
# Score positive samples
pos_scores = []
for sample, weight in zip(positive_samples, positive_weights):
score = model.score(sample, timestep=t)
weighted_score = score * weight
pos_scores.append(weighted_score)
# Score negative samples
neg_scores = []
for sample, weight in zip(negative_samples, negative_weights):
score = model.score(sample, timestep=t)
weighted_score = score * weight
neg_scores.append(weighted_score)
# Preference loss: maximize log(pos) - log(neg)
pos_score_mean = sum(pos_scores) / len(pos_scores) if pos_scores else 0
neg_score_mean = sum(neg_scores) / len(neg_scores) if neg_scores else 0
# Loss: minimize negative log odds ratio
loss = -torch.log(torch.sigmoid(pos_score_mean - neg_score_mean) + 1e-8)
total_loss += loss
return total_loss / len(timesteps)
Implement timestep clipping strategy.
def get_clipped_timestep_schedule(model, num_inference_steps=50,
clip_early_steps=5):
"""
Create timestep schedule that avoids early steps during training.
Args:
model: Diffusion model
num_inference_steps: Total inference steps
clip_early_steps: Number of early steps to skip
Returns:
timesteps: Array of timestep indices for training
"""
import numpy as np
# Full timestep schedule (e.g., 0-999 for 1000-step schedule)
full_schedule = np.linspace(0, model.num_timesteps - 1,
num_inference_steps, dtype=int)
# Remove early steps
clipped_schedule = full_schedule[clip_early_steps:]
return clipped_schedule
Implement efficient multi-sample generation and filtering.
def generate_and_filter_samples(model, batch_input, num_samples=8,
reward_model=None):
"""
Generate multiple samples and score them efficiently.
Args:
model: Diffusion model
batch_input: Input conditioning (e.g., text prompt or image)
num_samples: Number of samples to generate
reward_model: Optional reward model for scoring
Returns:
samples: List of generated samples
rewards: List of reward scores
"""
import torch
samples = []
rewards = []
with torch.no_grad():
for _ in range(num_samples):
# Generate using ODE-based sampler (deterministic given seed)
sample = model.generate(batch_input, sampler='euler_ode',
temperature=0.0)
samples.append(sample)
# Score with reward model
if reward_model:
reward = reward_model.score(batch_input, sample)
rewards.append(reward)
return samples, rewards
Implement the full DGPO training loop.
def train_dgpo(model, reward_model, train_loader, optimizer,
num_epochs=3, partition_ratio=0.5):
"""
Full DGPO training procedure.
Args:
model: Diffusion model
reward_model: Reward model for scoring
train_loader: DataLoader with conditioning inputs
optimizer: PyTorch optimizer
num_epochs: Training epochs
partition_ratio: Split ratio for positive/negative groups
Returns:
losses: Training loss curve
"""
losses = []
for epoch in range(num_epochs):
epoch_loss = 0.0
for batch in train_loader:
# Generate multiple samples per input
batch_size = batch.shape[0]
all_samples = []
all_rewards = []
for i in range(batch_size):
# Generate samples
samples, rewards = generate_and_filter_samples(
model, batch[i], num_samples=4, reward_model=reward_model
)
all_samples.extend(samples)
all_rewards.extend(rewards)
# Partition into preference groups
positive_samples, negative_samples, weights = \
partition_samples_by_advantage(
all_samples, all_rewards, partition_ratio
)
# Get timestep schedule with clipping
timesteps = get_clipped_timestep_schedule(model, clip_early_steps=5)
# Compute loss
loss = dgpo_preference_loss(
model, positive_samples, negative_samples,
weights[0], weights[1], timesteps=timesteps
)
# Update
optimizer.zero_grad()
loss.backward()
optimizer.step()
epoch_loss += loss.item()
avg_loss = epoch_loss / len(train_loader)
losses.append(avg_loss)
print(f"Epoch {epoch+1}: Loss={avg_loss:.4f}")
return losses
| Aspect | Recommendation | Notes | |--------|---------------|-------| | Partition ratio | 0.3-0.5 | Balance between groups; extreme ratios can cause instability | | Timestep clipping | Skip first 5-10 steps | Prevents overfitting to early noise-dependent steps | | Batch size | 32-64 | Larger batches improve reward estimation stability | | Reward model quality | High-quality training data | Poor reward model hurts preference learning | | ODE sampler | Use deterministic sampling | Avoids stochasticity from SDE samplers | | When to use | Preference-aligned image/video generation | Text-to-image, conditional generation | | When NOT to use | Unconditional generation or diversity-critical tasks | Direct preference may reduce diversity | | Common pitfall | Reward model overfitting | Validate on held-out test set |
Paper: https://arxiv.org/abs/2510.08425
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.