skills/skillxiv-v0.0.2-claude-opus-4.6/densegrpo-flow-matching/SKILL.md
Improve diffusion model alignment by assigning step-wise rewards during denoising instead of terminal rewards. Fixes sparse reward signal mismatch in multi-step generation processes through ODE-based reward estimation.
npx skillsauth add ADu2021/skillXiv densegrpo-flow-matchingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Diffusion models generate sequences through iterative refinement across hundreds of denoising steps, yet most RL-based alignment methods assign a single terminal reward to all intermediate steps. This creates a fundamental mismatch: individual steps receive feedback proportional to global performance, obscuring which denoising decisions actually contributed to quality. DenseGRPO fixes this by computing step-wise rewards that align feedback signals with each step's contribution.
The key innovation is using ODE-based denoising to recover clean outputs at intermediate steps, enabling reward models to evaluate progressive quality and compute step-specific reward gains. This transforms diffusion alignment from guessing trajectories to supervised step-by-step refinement.
DenseGRPO introduces two complementary mechanisms:
Step-Wise Dense Reward Estimation: For each intermediate latent at timestep t, use ODE denoising to obtain the partially-denoised output, evaluate it with a reward model, and compute the gain from this step: ΔR_t = R_{t-1} - R_t
Exploration Space Calibration: Standard diffusion samplers apply uniform noise across timesteps, creating imbalanced reward distributions. Adaptively adjust timestep-specific stochasticity (ψ(t)) to maintain balanced exploration while preserving diversity.
Together, these ensure feedback aligns with actual step contributions, enabling effective preference learning without trajectory-level averaging.
The method involves three stages: intermediate recovery, reward computation, and exploration calibration.
Use ODE-based denoising to recover intermediate denoised outputs:
# ODE solver for intermediate latent recovery
import torch
from torchdiffeq import odeint
class ODEDenoiser:
def __init__(self, model, noise_schedule):
self.model = model
self.noise_schedule = noise_schedule
def recover_clean_latent(self, noisy_latent, current_step, target_step=0):
"""Recover partially-denoised latent via ODE integration."""
# Define ODE: dz/dt = -score_theta(z_t, t)
def ode_func(t, z):
t_scaled = torch.tensor([t], device=z.device)
# Score function from diffusion model
score = -self.model.predict_noise(z, t_scaled)
return score
# Integrate from current_step to target_step
t_span = torch.linspace(current_step, target_step, steps=10)
solution = odeint(ode_func, noisy_latent, t_span)
return solution[-1] # Clean latent at target timestep
denoiser = ODEDenoiser(diffusion_model, noise_schedule)
clean_t = denoiser.recover_clean_latent(latent_t, current_t, target_t=0)
Compute step-wise reward gains using recovered intermediates:
# Dense reward computation at each step
def compute_dense_rewards(initial_latent, trajectory_steps, reward_models):
"""Compute per-step reward gains during denoising."""
dense_rewards = []
prev_reward = 0
for i, latent_t in enumerate(trajectory_steps):
# Recover clean output at this step
clean_output = denoiser.recover_clean_latent(latent_t, step=i)
# Evaluate with reward models (composition, aesthetics, text)
scores = {}
for name, model in reward_models.items():
scores[name] = model.score(clean_output)
# Aggregate reward (weighted combination)
current_reward = (
0.5 * scores['composition'] +
0.3 * scores['aesthetics'] +
0.2 * scores['text_accuracy']
)
# Step-wise gain
reward_gain = current_reward - prev_reward
dense_rewards.append(reward_gain)
prev_reward = current_reward
return torch.tensor(dense_rewards)
rewards = compute_dense_rewards(latent, trajectory, reward_models)
Adaptively adjust noise injection to balance exploration:
# Adaptive exploration calibration
class ExplorationCalibrator:
def __init__(self, num_steps=50):
self.num_steps = num_steps
self.step_stochasticity = torch.ones(num_steps)
def calibrate(self, reward_trajectory):
"""Adjust noise schedule to balance reward distribution."""
# Compute reward statistics per timestep
neg_mask = reward_trajectory < 0
pos_mask = reward_trajectory > 0
# Count positive/negative outcomes per step
neg_counts = neg_mask.sum(dim=0)
pos_counts = pos_mask.sum(dim=0)
# Target 40% negative, 60% positive across steps
target_ratio = 0.4
current_ratio = neg_counts / (neg_counts + pos_counts + 1e-8)
# Increase noise where ratio is too skewed
adjustment = torch.where(
current_ratio < target_ratio,
torch.ones_like(current_ratio),
1.0 / (current_ratio + 1e-8)
)
self.step_stochasticity *= adjustment
return self.step_stochasticity
calibrator = ExplorationCalibrator()
adjusted_psi = calibrator.calibrate(reward_batch)
| Aspect | Recommendation | Notes | |--------|-----------------|-------| | ODE Steps | n=t (same as current step) | Balances quality and speed | | Reward Models | 2-4 specialized models | Composition, aesthetics, text accuracy | | Reward Weights | 0.5/0.3/0.2 split | Domain-dependent; tune per task | | Exploration Target | 40% negative rewards | Maintains diversity in sampling | | LoRA Rank | 16-32 | Sufficient for fine-tuning | | Batch Size | 64-128 per GPU | Memory constraints from ODE decoding |
When to use: When training diffusion models for preference alignment (image/video generation). Effective for multi-modal reward objectives (composition, text accuracy).
When NOT to use: For simple scalar reward functions—sparse rewards sufficient. When computational cost of ODE recovery is prohibitive.
Common pitfalls:
DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment https://arxiv.org/abs/2601.20218
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.