skills/skillxiv-v0.0.2-claude-opus-4.6/dflash-block-diffusion-speculative-decoding/SKILL.md
Accelerate LLM inference by up to 6x using block diffusion for parallel token drafting, tightly coupled to the target model's hidden representations. The skill claims higher speedups than existing speculative decoding methods with no loss in output quality.
Autoregressive LLMs generate tokens sequentially, creating inference bottlenecks. Speculative decoding helps but is limited by its reliance on autoregressive drafting, an inherently sequential process that caps speedups at around 2-3x. Diffusion models enable parallel generation but typically underperform in quality. DFlash bridges this gap: a diffusion drafter proposes tokens in parallel, and the target model verifies them so that output quality matches the target model.
DFlash combines parallel block diffusion, target model conditioning, and KV injection to enable fast, high-quality token drafting. The insight is "the target knows best": leveraging frozen target model representations substantially improves draft quality without needing massive draft models.
Freeze the target LLM and extract intermediate layer hidden states to guide draft generation. Cache these for efficiency.
```python
import torch


# Extract conditioning from the target model
def extract_target_conditioning(target_model, input_ids, layer_indices):
    """
    Run the frozen target model and collect hidden states at the given layers.
    Returns a dict mapping layer index -> projected conditioning features.
    """
    conditioning = {}
    with torch.no_grad():
        output = target_model(
            input_ids,
            output_hidden_states=True,
            return_dict=True,
        )
    for layer_idx in layer_indices:
        # hidden_states[0] is the embedding output, so index i is the
        # output of the i-th decoder layer (Hugging Face convention).
        h = output.hidden_states[layer_idx]
        # project_to_kv is not defined in this skill; it is assumed to be a
        # learned projection to the draft model's key-value dimension.
        conditioning[layer_idx] = project_to_kv(h)
    return conditioning
```
Inject target model context directly into the draft model's Key-Value cache at every decoder layer, replacing standard feature fusion.
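```python
# KV injection into the draft model's cache
def inject_target_conditioning_to_kv(
    draft_kv_cache, target_conditioning, layer_idx,
    injection_strength=0.5
):
    """
    Blend projected target hidden states into the cached values of one
    draft layer. Call once per decoder layer that receives conditioning.
    Intended to let acceptance scale with depth.
    """
    k, v = draft_kv_cache[layer_idx]
    # Assumes target_conditioning is keyed by the same layer_idx and that
    # target_feat has the same shape as v (or broadcasts to it).
    target_feat = target_conditioning[layer_idx]
    # Blend target features into the draft values; keys are left unchanged
    v_augmented = v + injection_strength * target_feat
    # Updates draft_kv_cache in place and also returns it
    draft_kv_cache[layer_idx] = (k, v_augmented)
    return draft_kv_cache
```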
Initialize the diffusion process with random anchor tokens at block boundaries. Use scheduled denoising to generate multiple tokens in parallel.
```python
import torch


# Block diffusion generation
def block_diffusion_generate(
    draft_model, context, block_size=4, num_steps=15
):
    """
    Generate block_size tokens in parallel by iterative denoising.
    draft_model.denoise and sample_with_noise are assumed helpers that
    this skill does not define. The block starts from random token ids;
    anchoring at block boundaries is not implemented in this sketch.
    """
    batch_size = context.shape[0]
    # Start from random token ids, on the same device as the context
    block_tokens = torch.randint(
        0, draft_model.vocab_size,
        (batch_size, block_size),
        device=context.device,
    )
    # Reverse diffusion: denoise tokens over num_steps
    with torch.no_grad():
        for step in range(num_steps):
            # Noise level decays linearly from 1.0 down to 1 / num_steps
            noise_schedule = 1.0 - (step / num_steps)
            # Predict token logits for every position in the block
            logits = draft_model.denoise(
                context, block_tokens,
                timestep=step, num_steps=num_steps
            )
            # Sample new tokens with scheduled noise
            block_tokens = sample_with_noise(
                logits, block_tokens, noise_schedule
            )
    return block_tokens
```
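The denoising loop relies on `sample_with_noise`, which this skill does not define. The sketch below is one plausible reading, written in plain Python so it runs without a model: each position keeps its highest-scoring token, except that with probability equal to the current noise level it is resampled uniformly at random. The function name `sample_with_noise_sketch`, the list-based inputs, and the resampling rule are all assumptions made for illustration, not the paper's sampler.

```python
import random


def sample_with_noise_sketch(logits, noise_level, rng=None):
    """Hypothetical stand-in for sample_with_noise (an assumption).

    logits: list of per-position score lists, one score per vocabulary id.
    noise_level: probability in [0, 1] of resampling a position at random.
    """
    rng = rng or random.Random()
    vocab_size = len(logits[0])
    tokens = []
    for position_scores in logits:
        if rng.random() < noise_level:
            # Noisy step: ignore the scores and draw a random token id
            tokens.append(rng.randrange(vocab_size))
        else:
            # Clean step: take the highest-scoring token id
            tokens.append(
                max(range(vocab_size), key=position_scores.__getitem__)
            )
    return tokens


# Linear schedule matching 1.0 - step / num_steps from the loop
num_steps = 15
schedule = [1.0 - step / num_steps for step in range(num_steps)]

toy_logits = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1]]
print(sample_with_noise_sketch(toy_logits, noise_level=0.0))  # [1, 0]
```

With `noise_level=0.0` the result is the greedy choice at every position. Note that the linear schedule ends at `1 / num_steps` rather than exactly 0, so the final denoising step still carries a small chance of random resampling under this reading.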
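During training, weight the loss to prioritize early tokens in the block. These are the most predictable and, because verification keeps only a contiguous accepted prefix, the most valuable to get right.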
```python
import torch
import torch.nn.functional as F


# Loss weighting for block diffusion training
def weighted_diffusion_loss(
    predictions, targets, position_weights=None
):
    """
    Cross-entropy with per-position weights that favor early block positions.
    predictions: (batch, block_size, vocab) logits.
    targets: (batch, block_size) token ids.
    """
    if position_weights is None:
        # Default: exponential decay favoring early positions
        block_size = targets.shape[-1]
        position_weights = torch.exp(
            -torch.arange(
                block_size, dtype=torch.float32, device=predictions.device
            ) * 0.3
        )
        position_weights = position_weights / position_weights.sum()
    # Per-token loss, reshaped back to (batch, block_size)
    loss = F.cross_entropy(
        predictions.reshape(-1, predictions.shape[-1]),
        targets.reshape(-1),
        reduction='none'
    )
    loss = loss.reshape(targets.shape)
    # The default weights sum to 1, so sum over positions (a weighted
    # average per sequence), then average over the batch. Calling .mean()
    # over all elements would shrink the loss by a factor of block_size.
    weighted_loss = (loss * position_weights).sum(dim=-1).mean()
    return weighted_loss
```
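To make the default weighting concrete, the following plain-Python sketch reproduces the exponential-decay weights for a block of four tokens with the decay constant 0.3 used in the loss function. The helper name `default_position_weights` is hypothetical and exists only for this illustration.

```python
import math


def default_position_weights(block_size, decay=0.3):
    """Normalized exponential-decay weights, mirroring the default rule."""
    raw = [math.exp(-decay * i) for i in range(block_size)]
    total = sum(raw)
    return [w / total for w in raw]


weights = default_position_weights(4)
print([round(w, 3) for w in weights])  # [0.371, 0.275, 0.204, 0.151]
```

For a block of four, the first position therefore counts roughly 2.5 times as much as the last. A larger decay constant would concentrate even more of the loss on the earliest positions.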
Generate drafts in parallel blocks; verify and accept valid tokens; fall back to the target model only when needed.
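```python
import torch


# Speculative decoding with DFlash
def flash_speculative_decode(
    target_model, draft_model, input_ids,
    max_new_tokens=100, block_size=4
):
    """
    Generate tokens using DFlash speculative decoding.
    Assumes batch size 1; verify_tokens is an assumed helper that returns
    a boolean tensor of shape (1, block_size).
    """
    generated = input_ids.clone()
    max_len = input_ids.shape[1] + max_new_tokens
    # Layer indices are model-dependent. The conditioning still has to be
    # injected into the draft model's KV cache (see
    # inject_target_conditioning_to_kv); that wiring is omitted here.
    target_conditioning = extract_target_conditioning(
        target_model, input_ids, layer_indices=[12, 24]
    )
    while generated.shape[1] < max_len:
        # Draft: generate block in parallel
        draft_block = block_diffusion_generate(
            draft_model, generated, block_size=block_size
        )
        # Verify against the target model in a single forward pass
        with torch.no_grad():
            target_logits = target_model(
                torch.cat([generated, draft_block], dim=1)
            ).logits
        # Logits at position i predict token i + 1, so the logits that
        # score the draft block start one position before it.
        verify_logits = target_logits[:, -block_size - 1:-1]
        # Acceptance checking: compare probabilities
        accepted = verify_tokens(
            draft_block,
            verify_logits,
            temperature=1.0
        )
        # Only the leading run of accepted tokens can be kept
        num_accepted = int(accepted[0].long().cumprod(dim=0).sum())
        if num_accepted > 0:
            generated = torch.cat([
                generated,
                draft_block[:, :num_accepted]
            ], dim=1)
        else:
            # Fall back to the target's own next token, reusing the
            # verification logits instead of a second forward pass
            next_token = verify_logits[:, 0].argmax(dim=-1)
            generated = torch.cat([generated, next_token.unsqueeze(1)], dim=1)
    # The last block may overshoot the requested length
    return generated[:, :max_len]
```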
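The decode loop calls `verify_tokens`, which this skill does not define. Under greedy decoding, verification reduces to comparing each drafted token with the target model's top choice at that position and keeping only the longest matching prefix. The sketch below shows that prefix rule in plain Python; the function name `accepted_prefix_length` and the list-based inputs are assumptions for illustration, and sampling-based verification (temperature above zero) would need a probabilistic acceptance test instead.

```python
def accepted_prefix_length(draft_tokens, target_tokens):
    """Count leading draft tokens that match the target's greedy choice."""
    count = 0
    for drafted, expected in zip(draft_tokens, target_tokens):
        if drafted != expected:
            break
        count += 1
    return count


draft = [5, 7, 9, 2]
target_choice = [5, 7, 1, 2]  # target's argmax at each drafted position
print(accepted_prefix_length(draft, target_choice))  # 2
```

The fourth token matches but is still discarded: it was drafted after a rejected token, so the target never scored it in the right context.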
When to use: High-throughput inference settings (serving, batch processing) where latency is critical. Less beneficial for interactive single-token-at-a-time scenarios.
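Hyperparameters (defaults used in the code above): `block_size=4`, `num_steps=15`, `injection_strength=0.5`, position-weight decay `0.3`, and target `layer_indices=[12, 24]`.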
Common pitfalls:
Scaling: Speedup scales with block size and acceptance rate, with typical speedups of 4-6x on standard benchmarks. It works best with strong target models and sufficient compute for parallel drafting.
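The interaction between block size and acceptance rate can be approximated with a simple geometric model. The sketch below assumes that every drafted token is accepted independently with the same probability and that acceptance stops at the first rejection; real acceptance rates vary by position and task, so treat the numbers as rough intuition rather than measured results. The function name `expected_accepted_tokens` is hypothetical.

```python
def expected_accepted_tokens(acceptance_rate, block_size):
    """Expected length of the accepted prefix in one drafted block.

    Assumes each token is accepted independently with the same
    probability, and that acceptance stops at the first rejection.
    """
    a = acceptance_rate
    if a >= 1.0:
        return float(block_size)
    # Sum of a^1 + a^2 + ... + a^block_size (geometric series)
    return a * (1.0 - a ** block_size) / (1.0 - a)


for rate in (0.6, 0.8, 0.9):
    print(rate, round(expected_accepted_tokens(rate, block_size=4), 3))
```

Under this simplified model, a larger block pays off only when acceptance is high: at a rate of 0.6, going from 4 to 8 drafted tokens adds less than 0.2 expected accepted tokens per block.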
Paper: https://arxiv.org/abs/2602.06036

Code: Available at author's repository

Related work: Speculative decoding, EAGLE, diffusion-based generation, fast inference