ADu2021/densegrpo-flow-matching

Name: densegrpo-flow-matching
Author: ADu2021

skills/skillxiv-v0.0.2-claude-opus-4.6/densegrpo-flow-matching/SKILL.md

npx skillsauth add ADu2021/skillXiv densegrpo-flow-matching

Trivy (container and dependency vulnerability scanner): Clean
Semgrep (static code analysis for vulnerabilities): Clean
mcp-scan (Snyk) (Model Context Protocol security validation): Clean
Snyk (dep) (open source security scanning): Skipped
Socket.dev (supply chain security analysis): Skipped
VirusTotal (multi-engine malware detection): Skipped
CrowdStrike (advanced threat intelligence): Skipped
OSV-Scanner (Open Source Vulnerability database check): Skipped
OWASP Dep-Check: Skipped

DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment

Diffusion and flow matching models generate samples through iterative refinement across many denoising steps, yet most RL-based alignment methods assign a single terminal reward to every intermediate step. This creates a mismatch: each step receives feedback that reflects the quality of the whole trajectory, which obscures which denoising decisions actually contributed to it. DenseGRPO addresses this by computing step-wise rewards that match the feedback signal to each step's contribution.

The key idea is to use ODE-based denoising to recover clean outputs from intermediate latents, so that reward models can score quality at each step and a per-step reward gain can be computed. Alignment then assigns credit step by step instead of spreading one trajectory-level score across all steps.

Core Concept

DenseGRPO introduces two complementary mechanisms:

Step-Wise Dense Reward Estimation: For each intermediate latent at timestep t, use ODE denoising to recover a clean output, score it with a reward model to obtain R_t, and compute the gain contributed by the step from t to t-1: ΔR_t = R_{t-1} - R_t
Exploration Space Calibration: Standard diffusion samplers apply uniform noise across timesteps, creating imbalanced reward distributions. Adaptively adjust timestep-specific stochasticity (ψ(t)) to maintain balanced exploration while preserving diversity.

Together, these ensure feedback aligns with actual step contributions, enabling effective preference learning without trajectory-level averaging.
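As a quick check on the sign convention, assume R_t is the reward of the clean output recovered from the latent at timestep t, and that sampling runs from t = T (pure noise) down to t = 0 (final sample). Under those assumptions the step-wise gains telescope:

```latex
\sum_{t=1}^{T} \Delta R_t
  = \sum_{t=1}^{T} \left( R_{t-1} - R_t \right)
  = R_0 - R_T
```

Here R_0 is the terminal reward of the finished sample and R_T is the reward of the output recovered from the initial noise alone, so the dense gains redistribute the same total signal across steps rather than adding a new one.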

Architecture Overview

Intermediate Reward Evaluation: ODE-based recovery of partial denoising results at each step
Reward Model Stacking: Multiple specialized reward models (composition, aesthetic, text accuracy)
Step-Wise Gain Computation: Calculate ΔR_t = R_{t-1} - R_t for each timestep
Exploration Rebalancing: Adaptive noise scheduling to create balanced positive/negative rewards
LoRA Fine-Tuning: Efficient parameter updates on pretrained diffusion backbones
GRPO Integration: Standard policy gradient updates using dense per-step advantages
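The last item can be illustrated with a minimal sketch. It assumes that "dense per-step advantages" means normalizing the reward gains across a group of trajectories at each timestep, in the spirit of GRPO's group-relative normalization; the source does not spell out the exact formula, and the function name `per_step_advantages` and the toy numbers are hypothetical.

```python
from statistics import mean, pstdev

def per_step_advantages(gains, eps=1e-8):
    """Group-normalize reward gains at each timestep.

    gains[g][t] is the reward gain of trajectory g at step t.
    Returns a list of the same shape holding the advantages.
    """
    num_steps = len(gains[0])
    advantages = [[0.0] * num_steps for _ in gains]
    for t in range(num_steps):
        # Compare all trajectories in the group at the same step
        column = [trajectory[t] for trajectory in gains]
        mu, sigma = mean(column), pstdev(column)
        for g, value in enumerate(column):
            advantages[g][t] = (value - mu) / (sigma + eps)
    return advantages

# Toy group of 3 trajectories with 3 denoising steps each (made-up values)
group_gains = [
    [0.10, 0.05, -0.02],
    [0.00, 0.07, 0.04],
    [-0.05, 0.01, 0.01],
]
adv = per_step_advantages(group_gains)
```

Each column of `adv` is centered at zero, so a trajectory is rewarded at a given step only when its gain beats the group average at that same step.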

Implementation

The method involves three stages: intermediate recovery, reward computation, and exploration calibration.

Use ODE-based denoising to recover intermediate denoised outputs:

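# ODE solver for intermediate latent recovery
import torch
from torchdiffeq import odeint

class ODEDenoiser:
    def __init__(self, model, noise_schedule):
        self.model = model
        self.noise_schedule = noise_schedule  # Unused in this simplified sketch

    @torch.no_grad()  # Recovery only feeds reward scoring, so no gradients are kept
    def recover_clean_latent(self, noisy_latent, current_step, target_step=0):
        """Recover a denoised latent by integrating the ODE to target_step."""
        # Nothing to integrate; odeint also requires strictly monotonic times
        if current_step == target_step:
            return noisy_latent

        # Simplified drift: dz/dt = -eps_theta(z_t, t). The exact sign and
        # scaling depend on how the model is parameterized.
        def ode_func(t, z):
            # torchdiffeq passes t as a 0-dim tensor; the model gets shape (1,)
            t_scaled = t.reshape(1).to(z.device)
            return -self.model.predict_noise(z, t_scaled)

        # Integrate from current_step to target_step on a 10-point time grid
        t_span = torch.linspace(
            float(current_step), float(target_step), steps=10,
            device=noisy_latent.device,
        )
        solution = odeint(ode_func, noisy_latent, t_span)

        return solution[-1]  # Latent at target_step

# diffusion_model, noise_schedule, latent_t, and current_t are defined elsewhere
denoiser = ODEDenoiser(diffusion_model, noise_schedule)
clean_t = denoiser.recover_clean_latent(latent_t, current_t, target_step=0)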
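
To make the idea of ODE-based recovery concrete without a trained model, here is a toy sketch in plain Python. It assumes a one-dimensional straight-line (rectified-flow style) path x_t = (1 - t) * x0 + t * eps with an oracle velocity; the names `euler_recover` and `velocity_fn` are hypothetical and the fixed-step Euler integrator stands in for a real ODE solver.

```python
def euler_recover(x_t, t, velocity_fn, num_steps=10):
    """Integrate dx/ds = velocity_fn(x, s) from time t back to 0 with Euler steps."""
    dt = t / num_steps
    x, s = x_t, t
    for _ in range(num_steps):
        x = x - dt * velocity_fn(x, s)  # Step backwards in time toward s = 0
        s -= dt
    return x

# Toy setup: clean value x0, noise sample eps, intermediate time t
x0, eps, t = 2.0, -1.0, 0.6
x_t = (1 - t) * x0 + t * eps  # Noisy intermediate "latent"

# Oracle velocity of the straight path; a trained model would approximate this
velocity_fn = lambda x, s: eps - x0

recovered = euler_recover(x_t, t, velocity_fn)
```

Because the toy velocity is constant along the path, Euler integration recovers `x0` up to floating-point error. With a learned velocity field the recovery is only approximate, and its accuracy depends on the number of solver steps.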

Compute step-wise reward gains using recovered intermediates:

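# Dense reward computation at each step
import torch

# Weights for combining reward models (domain-dependent; tune per task)
REWARD_WEIGHTS = {'composition': 0.5, 'aesthetics': 0.3, 'text_accuracy': 0.2}

def compute_dense_rewards(initial_latent, trajectory_steps, reward_models):
    """Compute per-step reward gains during denoising.

    Assumes trajectory_steps[i] is the latent after i + 1 denoising steps,
    the model's time axis is the number of remaining steps, each
    model.score(...) returns a scalar, and the module-level `denoiser`
    from the previous snippet is available.
    """
    num_steps = len(trajectory_steps)

    def aggregate_reward(latent, remaining_steps):
        # Recover a clean output, then score it with every reward model
        clean_output = denoiser.recover_clean_latent(latent, current_step=remaining_steps)
        return sum(
            weight * float(reward_models[name].score(clean_output))
            for name, weight in REWARD_WEIGHTS.items()
        )

    # Baseline: reward of the output recovered from the initial noise
    prev_reward = aggregate_reward(initial_latent, num_steps)

    dense_rewards = []
    for i, latent_t in enumerate(trajectory_steps):
        # After i + 1 steps, num_steps - (i + 1) steps remain
        current_reward = aggregate_reward(latent_t, num_steps - (i + 1))

        # Step-wise gain
        dense_rewards.append(current_reward - prev_reward)
        prev_reward = current_reward

    return torch.tensor(dense_rewards)

rewards = compute_dense_rewards(latent, trajectory, reward_models)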

Adaptively adjust noise injection to balance exploration:

# Adaptive exploration calibration
import torch

class ExplorationCalibrator:
    def __init__(self, num_steps=50, target_ratio=0.4):
        self.num_steps = num_steps
        self.target_ratio = target_ratio  # Target 40% negative, 60% positive
        self.step_stochasticity = torch.ones(num_steps)

    def calibrate(self, reward_trajectory):
        """Adjust per-step noise toward the target share of negative gains.

        reward_trajectory: tensor of shape (batch, num_steps) with step-wise gains.
        """
        # Count positive/negative outcomes per step
        neg_counts = (reward_trajectory < 0).sum(dim=0).float()
        pos_counts = (reward_trajectory > 0).sum(dim=0).float()

        # Fraction of negative gains at each step
        current_ratio = neg_counts / (neg_counts + pos_counts + 1e-8)

        # Assumption: more injected noise produces more negative gains, so
        # scale noise down where negatives exceed the target and up where
        # they fall short. Clamping keeps each update within [0.5, 2.0].
        adjustment = (self.target_ratio / (current_ratio + 1e-8)).clamp(0.5, 2.0)

        self.step_stochasticity *= adjustment.to(self.step_stochasticity.device)
        return self.step_stochasticity

# reward_batch: tensor of shape (batch, 50), defined elsewhere
calibrator = ExplorationCalibrator()
adjusted_psi = calibrator.calibrate(reward_batch)

Practical Guidance

| Aspect | Recommendation | Notes |
|--------|----------------|-------|
| ODE Steps | n=t (same as current step) | Balances quality and speed |
| Reward Models | 2-4 specialized models | Composition, aesthetics, text accuracy |
| Reward Weights | 0.5/0.3/0.2 split | Domain-dependent; tune per task |
| Exploration Target | 40% negative rewards | Maintains diversity in sampling |
| LoRA Rank | 16-32 | Sufficient for fine-tuning |
| Batch Size | 64-128 per GPU | Memory constraints from ODE decoding |

When to use: Training diffusion or flow matching models for preference alignment (image or video generation). Most effective when the reward combines several objectives, such as composition and text accuracy.

When NOT to use: When a simple scalar reward function is enough and sparse rewards suffice, or when the computational cost of ODE recovery is prohibitive.

Common pitfalls:

ODE recovery is expensive; cache partial denoising results across trajectories.
Imbalanced reward distributions collapse exploration early; monitor calibration metrics.
Multiple reward models may conflict; normalize their scores and validate the weights empirically.
Dense optimization can overfit to the reward models; monitor a held-out validation set.
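
The pitfall about conflicting reward models can be sketched as follows. This assumes z-score normalization across a batch is an acceptable way to put scores on a common scale before weighting; the source only says to normalize, so the choice of z-scores, the helper names, and the toy scores are all hypothetical.

```python
from statistics import mean, pstdev

def zscore(values, eps=1e-8):
    """Center and scale a list of scores to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def combine_rewards(scores_by_model, weights):
    """Return one weighted reward per sample from per-model score lists."""
    normalized = {name: zscore(vals) for name, vals in scores_by_model.items()}
    num_samples = len(next(iter(normalized.values())))
    return [
        sum(weights[name] * normalized[name][i] for name in weights)
        for i in range(num_samples)
    ]

# Made-up scores for 3 samples; aesthetics is on a different scale on purpose
scores = {
    "composition": [0.2, 0.8, 0.5],
    "aesthetics": [5.1, 6.3, 5.7],
    "text_accuracy": [0.9, 0.4, 0.6],
}
weights = {"composition": 0.5, "aesthetics": 0.3, "text_accuracy": 0.2}
combined = combine_rewards(scores, weights)
```

Without the normalization step, the aesthetics scores would dominate the sum simply because their raw values are larger, regardless of the 0.5/0.3/0.2 weights.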

Reference

DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment https://arxiv.org/abs/2601.20218


Improve diffusion model alignment by assigning step-wise rewards during denoising instead of terminal rewards. Fixes sparse reward signal mismatch in multi-step generation processes through ODE-based reward estimation.

2 stars

data-ai

Updated Apr 17, 2026


Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 17, 2026, 5:33 AM (13.8s, 1 file scanned)

SKILL.md

name: densegrpo-flow-matching
title: DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment
version: 0.0.2
engine: skillxiv-v0.0.2-claude-opus-4.6
license: MIT
url: https://arxiv.org/abs/2601.20218
keywords: [Flow Matching, Dense Rewards, RLVR, Diffusion Models, Alignment]
description: Improve diffusion model alignment by assigning step-wise rewards during denoising instead of terminal rewards. Fixes sparse reward signal mismatch in multi-step generation processes through ODE-based reward estimation.


Related Skills

ADu2021/flow-map-trajectory-tilting (testing)
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
2 · SKILL.md · Updated Apr 17, 2026

ADu2021/flexible-data-mixture-of-experts (testing)
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
2 · SKILL.md · Updated Apr 17, 2026

ADu2021/flexibility-trap-diffusion-reasoning (data-ai)
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
2 · SKILL.md · Updated Apr 17, 2026

ADu2021/flex-continuous-agent-evolution (devops)
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures, achieving 23% improvement on reasoning without gradient-based parameter updates or external training.
2 · SKILL.md · Updated Apr 17, 2026


Install manually


# Clone the repo
git clone https://github.com/ADu2021/skillXiv.git

# Copy into Claude Code skills folder (global)
cp -r skillXiv/skills/skillxiv-v0.0.2-claude-opus-4.6/densegrpo-flow-matching ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository

ADu2021/skillXiv

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT