DiffuCoder: Optimizing Masked Diffusion for Code Generation

Standard autoregressive code models generate left-to-right, committing to each token before seeing what comes next. This is efficient at inference but limits the model's ability to plan globally—if the first tokens are wrong, the model propagates errors forward. Masked diffusion models offer an alternative: denoise all tokens simultaneously across multiple iterations, enabling global planning and potentially better code quality.

However, diffusion models are harder to train and optimize than autoregressive models. DiffuCoder demonstrates that coupled-GRPO, a reinforcement learning method designed specifically for non-autoregressive generation, significantly improves diffusion model code quality. The approach introduces "autoregressive-ness" metrics that measure how much a diffusion model exploits sequential structure, revealing that models can learn to adjust their causality dynamically.

Core Concept

The key insights are:

Diffusion models can learn to be semi-autoregressive: Rather than always denoising randomly, models learn when left-to-right sequential decoding is beneficial
Coupled sampling improves RL training: Standard policy gradient methods have high variance for non-autoregressive generation; coupled sampling (paired complementary masks) reduces variance without requiring semi-autoregressive decoding
Temperature controls both selection and order: In diffusion models, temperature affects not just which token is chosen but also the order tokens are denoised—unlike autoregressive models where it only affects token selection

These insights lead to better training and more efficient sampling.

Architecture Overview

DiffuCoder modifies a masked diffusion model for code with:

Masked Language Model: Corrupts code tokens randomly; model predicts all simultaneously
Iterative Refinement: Multiple denoising steps gradually clean up predictions
Coupled-GRPO Training: Uses complementary mask pairs to reduce RL gradient variance
Autoregressive-ness Tracking: Monitors when the model transitions between parallel and sequential generation
Temperature Adjustment: Dynamically set during sampling to balance quality and diversity

Implementation

Step 1: Implement masked diffusion denoising process

Create a model that predicts all tokens simultaneously from corrupted input.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedDiffusionCodeModel(nn.Module):
    """
    Masked diffusion model for code generation.
    Predicts all code tokens from partially masked input in a single forward pass.
    Multiple iterations refine predictions progressively.
    """

    def __init__(self, vocab_size, hidden_dim=768, num_layers=12):
        super().__init__()
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim

        # Token and position embeddings
        self.token_embed = nn.Embedding(vocab_size, hidden_dim)
        self.pos_embed = nn.Embedding(2048, hidden_dim)  # Max sequence length

        # Transformer layers
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=hidden_dim,
                nhead=12,
                dim_feedforward=3072,
                batch_first=True
            )
            for _ in range(num_layers)
        ])

        # Output head
        self.output_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, input_ids, mask_positions=None):
        """
        Forward pass for masked denoising.
        input_ids: code with some tokens masked (e.g., token_id = 0)
        mask_positions: indices of masked tokens (or None to use token_id == 0)
        """
        batch_size, seq_length = input_ids.shape

        # Embed tokens
        embeddings = self.token_embed(input_ids)

        # Add position embeddings
        positions = torch.arange(seq_length, device=input_ids.device).unsqueeze(0)
        embeddings = embeddings + self.pos_embed(positions)

        # Apply transformer layers
        hidden_states = embeddings
        for layer in self.layers:
            hidden_states = layer(hidden_states)

        # Predict logits for all positions
        logits = self.output_head(hidden_states)  # [batch, seq_length, vocab_size]

        return logits

    def denoise_step(self, input_ids, num_iterations=10, temperature=1.0):
        """
        Iteratively refine predictions by masking uncertain positions.
        """
        current_ids = input_ids.clone()
        confidence_history = []

        for iteration in range(num_iterations):
            # Forward pass: predict all tokens
            logits = self.forward(current_ids)
            probs = torch.softmax(logits, dim=-1)

            # Get confidence for current predictions
            predicted_tokens = logits.argmax(dim=-1)
            max_probs = probs.max(dim=-1)[0]  # [batch, seq_length]

            confidence_history.append(max_probs)

            # Mask uncertain positions for next iteration
            # Lower confidence = more likely to be re-predicted
            uncertainty = 1.0 - max_probs
            threshold = uncertainty.quantile(0.5, dim=1, keepdim=True)  # Mask top 50%

            mask_next = uncertainty > threshold
            current_ids = current_ids.clone()
            current_ids[mask_next] = 0  # Mask

            # Early stopping if confident enough
            if (max_probs > 0.95).float().mean() > 0.95:
                break

        return current_ids, confidence_history

Step 2: Analyze autoregressive-ness of diffusion models

Measure the degree to which models exploit left-to-right sequential structure.

def compute_local_autoregressive_ness(logits, input_ids):
    """
    Measure how much the model uses left-to-right sequential structure.
    Local: position i depends more on positions < i than > i.
    """
    batch_size, seq_length, vocab_size = logits.shape

    # Compute attention patterns (for transformer models)
    # Attention to left positions vs right positions
    left_attention = torch.zeros(batch_size, seq_length, device=logits.device)
    right_attention = torch.zeros(batch_size, seq_length, device=logits.device)

    # Approximation: use gradient information
    # Positions that affect i most are those with highest gradient contribution
    for i in range(seq_length):
        # Hypothetically mask position i, measure logit change
        logits_without_i = logits.clone()
        logits_without_i[:, i, :] = 0

        for j in range(i):
            logits_left = logits.clone()
            logits_left[:, j, :] = 0
            change_left = (logits - logits_left).abs().sum(dim=-1)[:, i]
            left_attention[:, i] += change_left

        for j in range(i + 1, seq_length):
            logits_right = logits.clone()
            logits_right[:, j, :] = 0
            change_right = (logits - logits_right).abs().sum(dim=-1)[:, i]
            right_attention[:, i] += change_right

    # Autoregressive-ness: fraction of influence from left
    total_influence = left_attention + right_attention
    autoregressive_ness = left_attention / (total_influence + 1e-8)

    return autoregressive_ness.mean().item()

def global_autoregressive_ness(model, test_inputs, num_iterations=10):
    """
    Measure global autoregressive-ness: how much does iteration order follow left-to-right?
    """
    # Run diffusion process, track which positions get denoised in which order
    denoising_order = []

    for _ in range(num_iterations):
        # Get model predictions
        logits = model(test_inputs)
        probs = torch.softmax(logits, dim=-1)
        confidence = probs.max(dim=-1)[0]

        # Positions denoised this iteration (high confidence)
        denoised = confidence > confidence.quantile(0.5, dim=-1, keepdim=True)

        # Measure: are leftmost positions denoised first?
        denoised_positions = torch.where(denoised[0])[0]
        if len(denoised_positions) > 0:
            avg_position = denoised_positions.float().mean().item()
            denoising_order.append(avg_position / test_inputs.shape[1])

    # Autoregressive-ness: positions denoised earlier are more leftward
    order_correlation = np.corrcoef(
        np.arange(len(denoising_order)),
        denoising_order
    )[0, 1]

    return order_correlation  # Positive = left-to-right denoising order

Step 3: Implement coupled-GRPO for non-autoregressive optimization

Use complementary mask pairs to reduce gradient variance in RL training.

def coupled_grpo_training(model, code_prompts, correctness_reward_fn,
                         learning_rate=1e-5, num_steps=1000):
    """
    Coupled-GRPO: Use paired complementary masks to reduce variance.
    For each prompt, generate two completions with different mask patterns.
    This provides a variance-reduced gradient estimate.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

    for step in range(num_steps):
        # Sample a batch of prompts
        batch_prompts = random.sample(code_prompts, 16)

        grpo_losses = []

        for prompt in batch_prompts:
            input_ids = tokenize(prompt)

            # Generate with complementary mask pairs
            # Mask pair 1: mask positions {0, 2, 4, ...}
            mask_pattern_1 = torch.arange(input_ids.shape[1]) % 2 == 0
            input_masked_1 = input_ids.clone()
            input_masked_1[mask_pattern_1] = 0

            # Mask pair 2: mask positions {1, 3, 5, ...} (complement)
            mask_pattern_2 = ~mask_pattern_1
            input_masked_2 = input_ids.clone()
            input_masked_2[mask_pattern_2] = 0

            # Get model predictions for both masks
            with torch.no_grad():
                logits_1 = model(input_masked_1)
                logits_2 = model(input_masked_2)

            # Sample outputs
            probs_1 = torch.softmax(logits_1, dim=-1)
            probs_2 = torch.softmax(logits_2, dim=-1)

            samples_1 = torch.multinomial(
                probs_1.view(-1, probs_1.shape[-1]),
                num_samples=1
            ).view(probs_1.shape[:-1])

            samples_2 = torch.multinomial(
                probs_2.view(-1, probs_2.shape[-1]),
                num_samples=1
            ).view(probs_2.shape[:-1])

            # Combine samples: where mask_1, use samples_1; where mask_2, use samples_2
            combined_sample = torch.where(mask_pattern_1, samples_1, samples_2)

            # Score the combined sample
            code = detokenize(combined_sample)
            reward = correctness_reward_fn(prompt, code)

            # Compute log probabilities under both masks
            log_prob_1 = F.log_softmax(logits_1, dim=-1)
            log_prob_2 = F.log_softmax(logits_2, dim=-1)

            # Extract log probs for the actual samples
            log_prob_1_sample = log_prob_1[mask_indices_1].sum()
            log_prob_2_sample = log_prob_2[mask_indices_2].sum()

            # GRPO loss: maximize log probability weighted by reward
            # Use both masks to get variance-reduced estimate
            policy_loss = -(log_prob_1_sample + log_prob_2_sample) * reward / 2

            grpo_losses.append(policy_loss)

        # Backprop
        total_loss = sum(grpo_losses) / len(grpo_losses)
        optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()

        if step % 100 == 0:
            print(f"Step {step}: Loss {total_loss:.4f}")

    return model

Step 4: Evaluate code generation performance

Test on standard code benchmarks with metrics like pass@k.

def evaluate_code_generation(model, test_prompts, reference_solutions,
                            num_samples_per_prompt=5, timeout_seconds=10):
    """
    Evaluate code generation on standard benchmarks.
    Metric: pass@k (does at least one of k samples pass tests)
    """
    passes = []

    for prompt, reference in zip(test_prompts, reference_solutions):
        sample_passes = 0

        for _ in range(num_samples_per_prompt):
            # Generate code
            input_ids = tokenize(prompt)
            logits = model(input_ids)
            probs = torch.softmax(logits, dim=-1)

            # Sample until we get valid code
            for attempt in range(5):
                samples = torch.multinomial(
                    probs.view(-1, probs.shape[-1]),
                    num_samples=1
                ).view(probs.shape[:-1])

                code = detokenize(samples)

                # Check if code is valid and passes tests
                try:
                    exec_globals = {}
                    exec(code, exec_globals)

                    # Run test
                    test_result = exec_globals.get('test', lambda: False)()

                    if test_result:
                        sample_passes += 1
                        break
                except:
                    # Invalid code, try again
                    continue

        # Pass@k: did at least one sample pass?
        passes.append(sample_passes >= 1)

    pass_rate = np.mean(passes)
    return {'pass_rate': pass_rate, 'pass_count': sum(passes)}

Practical Guidance

| Hyperparameter | Recommended Value | Notes | |---|---|---| | Denoising iterations | 10-20 | More iterations = higher quality but slower | | Masking ratio per step | 50% | Gradually denoise by re-predicting uncertain positions | | Coupled mask overlap | 0.0 | Truly complementary masks minimize correlation | | Temperature for sampling | 0.8-1.2 | Higher = more diverse; affects both selection and order | | GRPO reward weight | 0.5-1.0 | Scale of reward signal | | Training steps | 2K-5K | Converges faster than autoregressive models |

When to use DiffuCoder:

You want to improve upon autoregressive code models using global planning
You have a correctness reward function (unit tests, static analysis)
You can afford 10-20 denoising steps per sample (slower than AR at inference)
You want to reduce autoregressive bias in code generation

When NOT to use DiffuCoder:

Inference latency is critical (diffusion is 5-10× slower than autoregressive)
You only care about speed, not quality
Your code generation tasks are simple enough for autoregressive models
You can't define good reward functions for your task

Common pitfalls:

Slow inference: Diffusion requires many iterations. Optimize by early stopping when uncertainty drops below threshold.
Mode collapse: If the model learns to always denoise in left-to-right order, it becomes autoregressive and loses the benefit. Monitor global autoregressive-ness and penalize high values.
Coupled sampling complexity: Ensure mask pairs are truly complementary (no overlap). Test by checking that both masks cover all positions exactly once.
Reward function instability: If GRPO training oscillates, reduce the learning rate or increase batch size for more stable advantage estimates.

Reference

DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation https://arxiv.org/abs/2506.20639

DiffuCoder: Optimizing Masked Diffusion for Code Generation

Core Concept

The key insights are:

Diffusion models can learn to be semi-autoregressive: Rather than always denoising randomly, models learn when left-to-right sequential decoding is beneficial
Coupled sampling improves RL training: Standard policy gradient methods have high variance for non-autoregressive generation; coupled sampling (paired complementary masks) reduces variance without requiring semi-autoregressive decoding
Temperature controls both selection and order: In diffusion models, temperature affects not just which token is chosen but also the order tokens are denoised—unlike autoregressive models where it only affects token selection

These insights lead to better training and more efficient sampling.

Architecture Overview

DiffuCoder modifies a masked diffusion model for code with:

Masked Language Model: Corrupts code tokens randomly; model predicts all simultaneously
Iterative Refinement: Multiple denoising steps gradually clean up predictions
Coupled-GRPO Training: Uses complementary mask pairs to reduce RL gradient variance
Autoregressive-ness Tracking: Monitors when the model transitions between parallel and sequential generation
Temperature Adjustment: Dynamically set during sampling to balance quality and diversity

Implementation

Step 1: Implement masked diffusion denoising process

Create a model that predicts all tokens simultaneously from corrupted input.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedDiffusionCodeModel(nn.Module):
    """
    Masked diffusion model for code generation.
    Predicts all code tokens from partially masked input in a single forward pass.
    Multiple iterations refine predictions progressively.
    """

    def __init__(self, vocab_size, hidden_dim=768, num_layers=12):
        super().__init__()
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim

        # Token and position embeddings
        self.token_embed = nn.Embedding(vocab_size, hidden_dim)
        self.pos_embed = nn.Embedding(2048, hidden_dim)  # Max sequence length

        # Transformer layers
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=hidden_dim,
                nhead=12,
                dim_feedforward=3072,
                batch_first=True
            )
            for _ in range(num_layers)
        ])

        # Output head
        self.output_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, input_ids, mask_positions=None):
        """
        Forward pass for masked denoising.
        input_ids: code with some tokens masked (e.g., token_id = 0)
        mask_positions: indices of masked tokens (or None to use token_id == 0)
        """
        batch_size, seq_length = input_ids.shape

        # Embed tokens
        embeddings = self.token_embed(input_ids)

        # Add position embeddings
        positions = torch.arange(seq_length, device=input_ids.device).unsqueeze(0)
        embeddings = embeddings + self.pos_embed(positions)

        # Apply transformer layers
        hidden_states = embeddings
        for layer in self.layers:
            hidden_states = layer(hidden_states)

        # Predict logits for all positions
        logits = self.output_head(hidden_states)  # [batch, seq_length, vocab_size]

        return logits

    def denoise_step(self, input_ids, num_iterations=10, temperature=1.0):
        """
        Iteratively refine predictions by masking uncertain positions.
        """
        current_ids = input_ids.clone()
        confidence_history = []

        for iteration in range(num_iterations):
            # Forward pass: predict all tokens
            logits = self.forward(current_ids)
            probs = torch.softmax(logits, dim=-1)

            # Get confidence for current predictions
            predicted_tokens = logits.argmax(dim=-1)
            max_probs = probs.max(dim=-1)[0]  # [batch, seq_length]

            confidence_history.append(max_probs)

            # Mask uncertain positions for next iteration
            # Lower confidence = more likely to be re-predicted
            uncertainty = 1.0 - max_probs
            threshold = uncertainty.quantile(0.5, dim=1, keepdim=True)  # Mask top 50%

            mask_next = uncertainty > threshold
            current_ids = current_ids.clone()
            current_ids[mask_next] = 0  # Mask

            # Early stopping if confident enough
            if (max_probs > 0.95).float().mean() > 0.95:
                break

        return current_ids, confidence_history

Step 2: Analyze autoregressive-ness of diffusion models

Measure the degree to which models exploit left-to-right sequential structure.

def compute_local_autoregressive_ness(logits, input_ids):
    """
    Measure how much the model uses left-to-right sequential structure.
    Local: position i depends more on positions < i than > i.
    """
    batch_size, seq_length, vocab_size = logits.shape

    # Compute attention patterns (for transformer models)
    # Attention to left positions vs right positions
    left_attention = torch.zeros(batch_size, seq_length, device=logits.device)
    right_attention = torch.zeros(batch_size, seq_length, device=logits.device)

    # Approximation: use gradient information
    # Positions that affect i most are those with highest gradient contribution
    for i in range(seq_length):
        # Hypothetically mask position i, measure logit change
        logits_without_i = logits.clone()
        logits_without_i[:, i, :] = 0

        for j in range(i):
            logits_left = logits.clone()
            logits_left[:, j, :] = 0
            change_left = (logits - logits_left).abs().sum(dim=-1)[:, i]
            left_attention[:, i] += change_left

        for j in range(i + 1, seq_length):
            logits_right = logits.clone()
            logits_right[:, j, :] = 0
            change_right = (logits - logits_right).abs().sum(dim=-1)[:, i]
            right_attention[:, i] += change_right

    # Autoregressive-ness: fraction of influence from left
    total_influence = left_attention + right_attention
    autoregressive_ness = left_attention / (total_influence + 1e-8)

    return autoregressive_ness.mean().item()

def global_autoregressive_ness(model, test_inputs, num_iterations=10):
    """
    Measure global autoregressive-ness: how much does iteration order follow left-to-right?
    """
    # Run diffusion process, track which positions get denoised in which order
    denoising_order = []

    for _ in range(num_iterations):
        # Get model predictions
        logits = model(test_inputs)
        probs = torch.softmax(logits, dim=-1)
        confidence = probs.max(dim=-1)[0]

        # Positions denoised this iteration (high confidence)
        denoised = confidence > confidence.quantile(0.5, dim=-1, keepdim=True)

        # Measure: are leftmost positions denoised first?
        denoised_positions = torch.where(denoised[0])[0]
        if len(denoised_positions) > 0:
            avg_position = denoised_positions.float().mean().item()
            denoising_order.append(avg_position / test_inputs.shape[1])

    # Autoregressive-ness: positions denoised earlier are more leftward
    order_correlation = np.corrcoef(
        np.arange(len(denoising_order)),
        denoising_order
    )[0, 1]

    return order_correlation  # Positive = left-to-right denoising order

Step 3: Implement coupled-GRPO for non-autoregressive optimization

Use complementary mask pairs to reduce gradient variance in RL training.

def coupled_grpo_training(model, code_prompts, correctness_reward_fn,
                         learning_rate=1e-5, num_steps=1000):
    """
    Coupled-GRPO: Use paired complementary masks to reduce variance.
    For each prompt, generate two completions with different mask patterns.
    This provides a variance-reduced gradient estimate.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

    for step in range(num_steps):
        # Sample a batch of prompts
        batch_prompts = random.sample(code_prompts, 16)

        grpo_losses = []

        for prompt in batch_prompts:
            input_ids = tokenize(prompt)

            # Generate with complementary mask pairs
            # Mask pair 1: mask positions {0, 2, 4, ...}
            mask_pattern_1 = torch.arange(input_ids.shape[1]) % 2 == 0
            input_masked_1 = input_ids.clone()
            input_masked_1[mask_pattern_1] = 0

            # Mask pair 2: mask positions {1, 3, 5, ...} (complement)
            mask_pattern_2 = ~mask_pattern_1
            input_masked_2 = input_ids.clone()
            input_masked_2[mask_pattern_2] = 0

            # Get model predictions for both masks
            with torch.no_grad():
                logits_1 = model(input_masked_1)
                logits_2 = model(input_masked_2)

            # Sample outputs
            probs_1 = torch.softmax(logits_1, dim=-1)
            probs_2 = torch.softmax(logits_2, dim=-1)

            samples_1 = torch.multinomial(
                probs_1.view(-1, probs_1.shape[-1]),
                num_samples=1
            ).view(probs_1.shape[:-1])

            samples_2 = torch.multinomial(
                probs_2.view(-1, probs_2.shape[-1]),
                num_samples=1
            ).view(probs_2.shape[:-1])

            # Combine samples: where mask_1, use samples_1; where mask_2, use samples_2
            combined_sample = torch.where(mask_pattern_1, samples_1, samples_2)

            # Score the combined sample
            code = detokenize(combined_sample)
            reward = correctness_reward_fn(prompt, code)

            # Compute log probabilities under both masks
            log_prob_1 = F.log_softmax(logits_1, dim=-1)
            log_prob_2 = F.log_softmax(logits_2, dim=-1)

            # Extract log probs for the actual samples
            log_prob_1_sample = log_prob_1[mask_indices_1].sum()
            log_prob_2_sample = log_prob_2[mask_indices_2].sum()

            # GRPO loss: maximize log probability weighted by reward
            # Use both masks to get variance-reduced estimate
            policy_loss = -(log_prob_1_sample + log_prob_2_sample) * reward / 2

            grpo_losses.append(policy_loss)

        # Backprop
        total_loss = sum(grpo_losses) / len(grpo_losses)
        optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()

        if step % 100 == 0:
            print(f"Step {step}: Loss {total_loss:.4f}")

    return model

Step 4: Evaluate code generation performance

Test on standard code benchmarks with metrics like pass@k.

def evaluate_code_generation(model, test_prompts, reference_solutions,
                            num_samples_per_prompt=5, timeout_seconds=10):
    """
    Evaluate code generation on standard benchmarks.
    Metric: pass@k (does at least one of k samples pass tests)
    """
    passes = []

    for prompt, reference in zip(test_prompts, reference_solutions):
        sample_passes = 0

        for _ in range(num_samples_per_prompt):
            # Generate code
            input_ids = tokenize(prompt)
            logits = model(input_ids)
            probs = torch.softmax(logits, dim=-1)

            # Sample until we get valid code
            for attempt in range(5):
                samples = torch.multinomial(
                    probs.view(-1, probs.shape[-1]),
                    num_samples=1
                ).view(probs.shape[:-1])

                code = detokenize(samples)

                # Check if code is valid and passes tests
                try:
                    exec_globals = {}
                    exec(code, exec_globals)

                    # Run test
                    test_result = exec_globals.get('test', lambda: False)()

                    if test_result:
                        sample_passes += 1
                        break
                except:
                    # Invalid code, try again
                    continue

        # Pass@k: did at least one sample pass?
        passes.append(sample_passes >= 1)

    pass_rate = np.mean(passes)
    return {'pass_rate': pass_rate, 'pass_count': sum(passes)}

Practical Guidance

When to use DiffuCoder:

You want to improve upon autoregressive code models using global planning
You have a correctness reward function (unit tests, static analysis)
You can afford 10-20 denoising steps per sample (slower than AR at inference)
You want to reduce autoregressive bias in code generation

When NOT to use DiffuCoder:

Inference latency is critical (diffusion is 5-10× slower than autoregressive)
You only care about speed, not quality
Your code generation tasks are simple enough for autoregressive models
You can't define good reward functions for your task

Common pitfalls:

Slow inference: Diffusion requires many iterations. Optimize by early stopping when uncertainty drops below threshold.
Mode collapse: If the model learns to always denoise in left-to-right order, it becomes autoregressive and loses the benefit. Monitor global autoregressive-ness and penalize high values.
Coupled sampling complexity: Ensure mask pairs are truly complementary (no overlap). Test by checking that both masks cover all positions exactly once.
Reward function instability: If GRPO training oscillates, reduce the learning rate or increase batch size for more stable advantage estimates.

Reference

DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation https://arxiv.org/abs/2506.20639

Adoption

ADu2021/diffucoder-diffusion-code

$ install --global

Security Scan Results

SKILL.md

DiffuCoder: Optimizing Masked Diffusion for Code Generation

Core Concept

Architecture Overview

Implementation

Practical Guidance

Reference

Related Skills

ADu2021/flow-map-trajectory-tilting

ADu2021/flexible-data-mixture-of-experts

ADu2021/flexibility-trap-diffusion-reasoning

ADu2021/flex-continuous-agent-evolution

ADu2021/diffucoder-diffusion-code

$ install --global

Security Scan Results

SKILL.md

DiffuCoder: Optimizing Masked Diffusion for Code Generation

Core Concept

Architecture Overview

Implementation

Practical Guidance

Reference

Related Skills

ADu2021/flow-map-trajectory-tilting

ADu2021/flexible-data-mixture-of-experts

ADu2021/flexibility-trap-diffusion-reasoning

ADu2021/flex-continuous-agent-evolution