skills/skillxiv-v0.0.2-claude-opus-4.6/diffucoder-diffusion-code/SKILL.md
Train masked diffusion models for code generation using coupled-GRPO to optimize non-autoregressive generation. Achieves 4.4% improvement on code benchmarks while reducing autoregressive bias.
npx skillsauth add ADu2021/skillXiv diffucoder-diffusion-codeInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Standard autoregressive code models generate left-to-right, committing to each token before seeing what comes next. This is efficient at inference but limits the model's ability to plan globally—if the first tokens are wrong, the model propagates errors forward. Masked diffusion models offer an alternative: denoise all tokens simultaneously across multiple iterations, enabling global planning and potentially better code quality.
However, diffusion models are harder to train and optimize than autoregressive models. DiffuCoder demonstrates that coupled-GRPO, a reinforcement learning method designed specifically for non-autoregressive generation, significantly improves diffusion model code quality. The approach introduces "autoregressive-ness" metrics that measure how much a diffusion model exploits sequential structure, revealing that models can learn to adjust their causality dynamically.
The key insights are:
These insights lead to better training and more efficient sampling.
DiffuCoder modifies a masked diffusion model for code with:
Step 1: Implement masked diffusion denoising process
Create a model that predicts all tokens simultaneously from corrupted input.
import torch
import torch.nn as nn
import torch.nn.functional as F
class MaskedDiffusionCodeModel(nn.Module):
"""
Masked diffusion model for code generation.
Predicts all code tokens from partially masked input in a single forward pass.
Multiple iterations refine predictions progressively.
"""
def __init__(self, vocab_size, hidden_dim=768, num_layers=12):
super().__init__()
self.vocab_size = vocab_size
self.hidden_dim = hidden_dim
# Token and position embeddings
self.token_embed = nn.Embedding(vocab_size, hidden_dim)
self.pos_embed = nn.Embedding(2048, hidden_dim) # Max sequence length
# Transformer layers
self.layers = nn.ModuleList([
nn.TransformerEncoderLayer(
d_model=hidden_dim,
nhead=12,
dim_feedforward=3072,
batch_first=True
)
for _ in range(num_layers)
])
# Output head
self.output_head = nn.Linear(hidden_dim, vocab_size)
def forward(self, input_ids, mask_positions=None):
"""
Forward pass for masked denoising.
input_ids: code with some tokens masked (e.g., token_id = 0)
mask_positions: indices of masked tokens (or None to use token_id == 0)
"""
batch_size, seq_length = input_ids.shape
# Embed tokens
embeddings = self.token_embed(input_ids)
# Add position embeddings
positions = torch.arange(seq_length, device=input_ids.device).unsqueeze(0)
embeddings = embeddings + self.pos_embed(positions)
# Apply transformer layers
hidden_states = embeddings
for layer in self.layers:
hidden_states = layer(hidden_states)
# Predict logits for all positions
logits = self.output_head(hidden_states) # [batch, seq_length, vocab_size]
return logits
def denoise_step(self, input_ids, num_iterations=10, temperature=1.0):
"""
Iteratively refine predictions by masking uncertain positions.
"""
current_ids = input_ids.clone()
confidence_history = []
for iteration in range(num_iterations):
# Forward pass: predict all tokens
logits = self.forward(current_ids)
probs = torch.softmax(logits, dim=-1)
# Get confidence for current predictions
predicted_tokens = logits.argmax(dim=-1)
max_probs = probs.max(dim=-1)[0] # [batch, seq_length]
confidence_history.append(max_probs)
# Mask uncertain positions for next iteration
# Lower confidence = more likely to be re-predicted
uncertainty = 1.0 - max_probs
threshold = uncertainty.quantile(0.5, dim=1, keepdim=True) # Mask top 50%
mask_next = uncertainty > threshold
current_ids = current_ids.clone()
current_ids[mask_next] = 0 # Mask
# Early stopping if confident enough
if (max_probs > 0.95).float().mean() > 0.95:
break
return current_ids, confidence_history
Step 2: Analyze autoregressive-ness of diffusion models
Measure the degree to which models exploit left-to-right sequential structure.
def compute_local_autoregressive_ness(logits, input_ids):
"""
Measure how much the model uses left-to-right sequential structure.
Local: position i depends more on positions < i than > i.
"""
batch_size, seq_length, vocab_size = logits.shape
# Compute attention patterns (for transformer models)
# Attention to left positions vs right positions
left_attention = torch.zeros(batch_size, seq_length, device=logits.device)
right_attention = torch.zeros(batch_size, seq_length, device=logits.device)
# Approximation: use gradient information
# Positions that affect i most are those with highest gradient contribution
for i in range(seq_length):
# Hypothetically mask position i, measure logit change
logits_without_i = logits.clone()
logits_without_i[:, i, :] = 0
for j in range(i):
logits_left = logits.clone()
logits_left[:, j, :] = 0
change_left = (logits - logits_left).abs().sum(dim=-1)[:, i]
left_attention[:, i] += change_left
for j in range(i + 1, seq_length):
logits_right = logits.clone()
logits_right[:, j, :] = 0
change_right = (logits - logits_right).abs().sum(dim=-1)[:, i]
right_attention[:, i] += change_right
# Autoregressive-ness: fraction of influence from left
total_influence = left_attention + right_attention
autoregressive_ness = left_attention / (total_influence + 1e-8)
return autoregressive_ness.mean().item()
def global_autoregressive_ness(model, test_inputs, num_iterations=10):
"""
Measure global autoregressive-ness: how much does iteration order follow left-to-right?
"""
# Run diffusion process, track which positions get denoised in which order
denoising_order = []
for _ in range(num_iterations):
# Get model predictions
logits = model(test_inputs)
probs = torch.softmax(logits, dim=-1)
confidence = probs.max(dim=-1)[0]
# Positions denoised this iteration (high confidence)
denoised = confidence > confidence.quantile(0.5, dim=-1, keepdim=True)
# Measure: are leftmost positions denoised first?
denoised_positions = torch.where(denoised[0])[0]
if len(denoised_positions) > 0:
avg_position = denoised_positions.float().mean().item()
denoising_order.append(avg_position / test_inputs.shape[1])
# Autoregressive-ness: positions denoised earlier are more leftward
order_correlation = np.corrcoef(
np.arange(len(denoising_order)),
denoising_order
)[0, 1]
return order_correlation # Positive = left-to-right denoising order
Step 3: Implement coupled-GRPO for non-autoregressive optimization
Use complementary mask pairs to reduce gradient variance in RL training.
def coupled_grpo_training(model, code_prompts, correctness_reward_fn,
learning_rate=1e-5, num_steps=1000):
"""
Coupled-GRPO: Use paired complementary masks to reduce variance.
For each prompt, generate two completions with different mask patterns.
This provides a variance-reduced gradient estimate.
"""
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
for step in range(num_steps):
# Sample a batch of prompts
batch_prompts = random.sample(code_prompts, 16)
grpo_losses = []
for prompt in batch_prompts:
input_ids = tokenize(prompt)
# Generate with complementary mask pairs
# Mask pair 1: mask positions {0, 2, 4, ...}
mask_pattern_1 = torch.arange(input_ids.shape[1]) % 2 == 0
input_masked_1 = input_ids.clone()
input_masked_1[mask_pattern_1] = 0
# Mask pair 2: mask positions {1, 3, 5, ...} (complement)
mask_pattern_2 = ~mask_pattern_1
input_masked_2 = input_ids.clone()
input_masked_2[mask_pattern_2] = 0
# Get model predictions for both masks
with torch.no_grad():
logits_1 = model(input_masked_1)
logits_2 = model(input_masked_2)
# Sample outputs
probs_1 = torch.softmax(logits_1, dim=-1)
probs_2 = torch.softmax(logits_2, dim=-1)
samples_1 = torch.multinomial(
probs_1.view(-1, probs_1.shape[-1]),
num_samples=1
).view(probs_1.shape[:-1])
samples_2 = torch.multinomial(
probs_2.view(-1, probs_2.shape[-1]),
num_samples=1
).view(probs_2.shape[:-1])
# Combine samples: where mask_1, use samples_1; where mask_2, use samples_2
combined_sample = torch.where(mask_pattern_1, samples_1, samples_2)
# Score the combined sample
code = detokenize(combined_sample)
reward = correctness_reward_fn(prompt, code)
# Compute log probabilities under both masks
log_prob_1 = F.log_softmax(logits_1, dim=-1)
log_prob_2 = F.log_softmax(logits_2, dim=-1)
# Extract log probs for the actual samples
log_prob_1_sample = log_prob_1[mask_indices_1].sum()
log_prob_2_sample = log_prob_2[mask_indices_2].sum()
# GRPO loss: maximize log probability weighted by reward
# Use both masks to get variance-reduced estimate
policy_loss = -(log_prob_1_sample + log_prob_2_sample) * reward / 2
grpo_losses.append(policy_loss)
# Backprop
total_loss = sum(grpo_losses) / len(grpo_losses)
optimizer.zero_grad()
total_loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
if step % 100 == 0:
print(f"Step {step}: Loss {total_loss:.4f}")
return model
Step 4: Evaluate code generation performance
Test on standard code benchmarks with metrics like pass@k.
def evaluate_code_generation(model, test_prompts, reference_solutions,
num_samples_per_prompt=5, timeout_seconds=10):
"""
Evaluate code generation on standard benchmarks.
Metric: pass@k (does at least one of k samples pass tests)
"""
passes = []
for prompt, reference in zip(test_prompts, reference_solutions):
sample_passes = 0
for _ in range(num_samples_per_prompt):
# Generate code
input_ids = tokenize(prompt)
logits = model(input_ids)
probs = torch.softmax(logits, dim=-1)
# Sample until we get valid code
for attempt in range(5):
samples = torch.multinomial(
probs.view(-1, probs.shape[-1]),
num_samples=1
).view(probs.shape[:-1])
code = detokenize(samples)
# Check if code is valid and passes tests
try:
exec_globals = {}
exec(code, exec_globals)
# Run test
test_result = exec_globals.get('test', lambda: False)()
if test_result:
sample_passes += 1
break
except:
# Invalid code, try again
continue
# Pass@k: did at least one sample pass?
passes.append(sample_passes >= 1)
pass_rate = np.mean(passes)
return {'pass_rate': pass_rate, 'pass_count': sum(passes)}
| Hyperparameter | Recommended Value | Notes | |---|---|---| | Denoising iterations | 10-20 | More iterations = higher quality but slower | | Masking ratio per step | 50% | Gradually denoise by re-predicting uncertain positions | | Coupled mask overlap | 0.0 | Truly complementary masks minimize correlation | | Temperature for sampling | 0.8-1.2 | Higher = more diverse; affects both selection and order | | GRPO reward weight | 0.5-1.0 | Scale of reward signal | | Training steps | 2K-5K | Converges faster than autoregressive models |
When to use DiffuCoder:
When NOT to use DiffuCoder:
Common pitfalls:
DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation https://arxiv.org/abs/2506.20639
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.