Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

ADu2021/dflash-block-diffusion-speculative-decoding

Name: dflash-block-diffusion-speculative-decoding
Author: ADu2021

skills/skillxiv-v0.0.2-claude-opus-4.6/dflash-block-diffusion-speculative-decoding/SKILL.md

npx skillsauth add ADu2021/skillXiv dflash-block-diffusion-speculative-decoding

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

DFlash: Block Diffusion for Flash Speculative Decoding

Problem Context

Autoregressive LLMs generate tokens sequentially, creating inference bottlenecks. Speculative decoding helps but is limited by reliance on autoregressive drafting—inherently sequential processes that cap speedups around 2-3x. Diffusion models enable parallel generation but typically underperform in quality. DFlash bridges this gap: parallel drafting with quality matching the target model.

Core Concept

DFlash combines [parallel block diffusion, target model conditioning, KV injection] to enable fast high-quality token drafting. The insight is "the target knows best"—leveraging frozen target model representations substantially improves draft quality without needing massive draft models.

Architecture Overview

Draft model: Small diffusion model generating multiple tokens in parallel
Target conditioning: Extract hidden representations from frozen target model; inject into draft KV cache per layer
KV injection design: Directly insert target context features into cache rather than standard fusion
Training: Random anchor sampling for block initialization; loss weighting prioritizing early positions
Verification: Standard acceptance checking; discard rejected tokens

Implementation

Step 1: Extract target model representations for conditioning

Freeze the target LLM and extract intermediate layer hidden states to guide draft generation. Cache these for efficiency.

# Extract conditioning from target model
def extract_target_conditioning(target_model, input_ids, layer_indices):
    """
    Forward target model; extract hidden states at specified layers.
    Returns conditioning features for each layer.
    """
    conditioning = {}
    with torch.no_grad():
        output = target_model(
            input_ids,
            output_hidden_states=True,
            return_dict=True
        )
        for layer_idx in layer_indices:
            # Extract and compress hidden state
            h = output.hidden_states[layer_idx]
            # Project to draft model's key-value dimension
            conditioning[layer_idx] = project_to_kv(h)
    return conditioning

Step 2: Design KV injection mechanism

Inject target model context directly into draft model's Key-Value cache at every decoder layer, replacing standard feature fusion.

# KV injection into draft model cache
def inject_target_conditioning_to_kv(
    draft_kv_cache, target_conditioning, layer_idx,
    injection_strength=0.5
):
    """
    Inject target model hidden states directly into KV cache.
    Enables acceptance scaling with depth.
    """
    k, v = draft_kv_cache[layer_idx]

    # Target features as additional context in value
    target_feat = target_conditioning[layer_idx]

    # Blend target features with draft KV
    v_augmented = v + injection_strength * target_feat

    draft_kv_cache[layer_idx] = (k, v_augmented)
    return draft_kv_cache

Step 3: Implement block diffusion with anchor sampling

Initialize diffusion process with random anchor tokens at block boundaries. Use scheduled denoising to generate multiple tokens in parallel.

# Block diffusion generation
def block_diffusion_generate(
    draft_model, context, block_size=4, num_steps=15
):
    """
    Generate block_size tokens in parallel using diffusion.
    Anchor tokens at block boundaries guide generation.
    """
    batch_size = context.shape[0]
    seq_len = context.shape[1]

    # Initialize block: random anchors at boundaries
    block_tokens = torch.randint(
        0, draft_model.vocab_size,
        (batch_size, block_size)
    )

    # Reverse diffusion: denoise tokens over num_steps
    for step in range(num_steps):
        noise_schedule = 1.0 - (step / num_steps)

        # Predict token logits
        logits = draft_model.denoise(
            context, block_tokens,
            timestep=step, num_steps=num_steps
        )

        # Sample new tokens with scheduled noise
        block_tokens = sample_with_noise(
            logits, block_tokens, noise_schedule
        )

    return block_tokens

Step 4: Apply loss weighting favoring early positions

During training, weight loss to prioritize early tokens in block, which are most predictable.

# Loss weighting for block diffusion training
def weighted_diffusion_loss(
    predictions, targets, position_weights=None
):
    """
    Weight tokens early in block more heavily.
    Early tokens are easier to predict; later tokens need guidance.
    """
    if position_weights is None:
        # Default: exponential decay favoring early positions
        block_size = targets.shape[-1]
        position_weights = torch.exp(
            -torch.arange(block_size, dtype=torch.float32) * 0.3
        )
        position_weights = position_weights / position_weights.sum()

    # Per-position loss weighting
    loss = F.cross_entropy(
        predictions.reshape(-1, predictions.shape[-1]),
        targets.reshape(-1),
        reduction='none'
    )
    loss = loss.reshape(targets.shape)

    # Apply position weighting
    weighted_loss = (loss * position_weights).mean()
    return weighted_loss

Step 5: Integrate speculative decoding with verification

Generate drafts in parallel blocks; verify and accept valid tokens; fall back to target model only when needed.

# Speculative decoding with DFlash
def flash_speculative_decode(
    target_model, draft_model, input_ids,
    max_new_tokens=100, block_size=4
):
    """
    Generate tokens using DFlash speculative decoding.
    """
    generated = input_ids.clone()
    target_conditioning = extract_target_conditioning(
        target_model, input_ids, layer_indices=[12, 24]
    )

    while generated.shape[1] < input_ids.shape[1] + max_new_tokens:
        # Draft: generate block in parallel
        draft_block = block_diffusion_generate(
            draft_model, generated, block_size=block_size
        )

        # Verify against target model
        with torch.no_grad():
            target_logits = target_model(
                torch.cat([generated, draft_block], dim=1)
            ).logits

        # Acceptance checking: compare probabilities
        accepted = verify_tokens(
            draft_block,
            target_logits[:, -block_size:],
            temperature=1.0
        )

        # Add verified tokens; fall back to target if none accepted
        num_accepted = accepted.sum().item()
        if num_accepted > 0:
            generated = torch.cat([
                generated,
                draft_block[:, :num_accepted]
            ], dim=1)
        else:
            # Fall back to target model
            next_token = target_model(generated).logits[:, -1].argmax(dim=-1)
            generated = torch.cat([generated, next_token.unsqueeze(1)], dim=1)

    return generated

Practical Guidance

When to use: High-throughput inference settings (serving, batch processing) where latency is critical. Less beneficial for interactive single-token-at-a-time scenarios.

Hyperparameters:

Block size (2-8): larger blocks → more parallelism but lower acceptance rates; 4 is sweet spot
Diffusion steps (10-20): more steps → better quality but slower; 15 typical
Injection strength (0.3-0.7): balance target guidance with draft autonomy
Layer indices for conditioning: use middle-to-later layers (12-24 in 32-layer models)

Common pitfalls:

Target model hidden state dimension must align with draft model; use projection layers
Block boundaries matter: random initialization worse than structured patterns; try position-aware initialization
Acceptance rate drops if draft model too weak; consider larger draft (7B-13B for 70B target)
Memory overhead from storing target conditioning; cache selectively for long sequences

Scaling: Speedup scales with block size and acceptance rate. Typical 4-6x speedup on standard benchmarks. Works best with strong target models and sufficient compute for parallel drafting.

Reference

Paper: https://arxiv.org/abs/2602.06036 Code: Available at author's repository Related work: Speculative decoding, EAGLE, diffusion-based generation, fast inference

ADu2021/dflash-block-diffusion-speculative-decoding

skills/skillxiv-v0.0.2-claude-opus-4.6/dflash-block-diffusion-speculative-decoding/SKILL.md

Accelerate LLM inference 6x by using block diffusion for parallel token drafting with tight coupling to the target model's hidden representations, achieving higher speedups than existing speculative methods without quality loss.

2 stars

testing

Updated Apr 17, 2026

$ install --global

skillsauth

npx skillsauth add ADu2021/skillXiv dflash-block-diffusion-speculative-decoding

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 17, 2026, 5:33 AM10.8s1 file scanned

SKILL.md

name:: dflash-block-diffusion-speculative-decoding
title:: DFlash: Block Diffusion for Flash Speculative Decoding
version:: 0.0.2
engine:: skillxiv-v0.0.2-claude-opus-4.6
license:: MIT
url:: https://arxiv.org/abs/2602.06036
keywords:: [Speculative Decoding, Diffusion Models, Parallel Generation, Inference Acceleration, Target Model Conditioning]
description:: Accelerate LLM inference 6x by using block diffusion for parallel token drafting with tight coupling to the target model's hidden representations, achieving higher speedups than existing speculative methods without quality loss.

DFlash: Block Diffusion for Flash Speculative Decoding

Problem Context

Core Concept

Architecture Overview

Draft model: Small diffusion model generating multiple tokens in parallel
Target conditioning: Extract hidden representations from frozen target model; inject into draft KV cache per layer
KV injection design: Directly insert target context features into cache rather than standard fusion
Training: Random anchor sampling for block initialization; loss weighting prioritizing early positions
Verification: Standard acceptance checking; discard rejected tokens

Implementation

Step 1: Extract target model representations for conditioning

Freeze the target LLM and extract intermediate layer hidden states to guide draft generation. Cache these for efficiency.

# Extract conditioning from target model
def extract_target_conditioning(target_model, input_ids, layer_indices):
    """
    Forward target model; extract hidden states at specified layers.
    Returns conditioning features for each layer.
    """
    conditioning = {}
    with torch.no_grad():
        output = target_model(
            input_ids,
            output_hidden_states=True,
            return_dict=True
        )
        for layer_idx in layer_indices:
            # Extract and compress hidden state
            h = output.hidden_states[layer_idx]
            # Project to draft model's key-value dimension
            conditioning[layer_idx] = project_to_kv(h)
    return conditioning

Step 2: Design KV injection mechanism

Inject target model context directly into draft model's Key-Value cache at every decoder layer, replacing standard feature fusion.

# KV injection into draft model cache
def inject_target_conditioning_to_kv(
    draft_kv_cache, target_conditioning, layer_idx,
    injection_strength=0.5
):
    """
    Inject target model hidden states directly into KV cache.
    Enables acceptance scaling with depth.
    """
    k, v = draft_kv_cache[layer_idx]

    # Target features as additional context in value
    target_feat = target_conditioning[layer_idx]

    # Blend target features with draft KV
    v_augmented = v + injection_strength * target_feat

    draft_kv_cache[layer_idx] = (k, v_augmented)
    return draft_kv_cache

Step 3: Implement block diffusion with anchor sampling

Initialize diffusion process with random anchor tokens at block boundaries. Use scheduled denoising to generate multiple tokens in parallel.

# Block diffusion generation
def block_diffusion_generate(
    draft_model, context, block_size=4, num_steps=15
):
    """
    Generate block_size tokens in parallel using diffusion.
    Anchor tokens at block boundaries guide generation.
    """
    batch_size = context.shape[0]
    seq_len = context.shape[1]

    # Initialize block: random anchors at boundaries
    block_tokens = torch.randint(
        0, draft_model.vocab_size,
        (batch_size, block_size)
    )

    # Reverse diffusion: denoise tokens over num_steps
    for step in range(num_steps):
        noise_schedule = 1.0 - (step / num_steps)

        # Predict token logits
        logits = draft_model.denoise(
            context, block_tokens,
            timestep=step, num_steps=num_steps
        )

        # Sample new tokens with scheduled noise
        block_tokens = sample_with_noise(
            logits, block_tokens, noise_schedule
        )

    return block_tokens

Step 4: Apply loss weighting favoring early positions

During training, weight loss to prioritize early tokens in block, which are most predictable.

# Loss weighting for block diffusion training
def weighted_diffusion_loss(
    predictions, targets, position_weights=None
):
    """
    Weight tokens early in block more heavily.
    Early tokens are easier to predict; later tokens need guidance.
    """
    if position_weights is None:
        # Default: exponential decay favoring early positions
        block_size = targets.shape[-1]
        position_weights = torch.exp(
            -torch.arange(block_size, dtype=torch.float32) * 0.3
        )
        position_weights = position_weights / position_weights.sum()

    # Per-position loss weighting
    loss = F.cross_entropy(
        predictions.reshape(-1, predictions.shape[-1]),
        targets.reshape(-1),
        reduction='none'
    )
    loss = loss.reshape(targets.shape)

    # Apply position weighting
    weighted_loss = (loss * position_weights).mean()
    return weighted_loss

Step 5: Integrate speculative decoding with verification

Generate drafts in parallel blocks; verify and accept valid tokens; fall back to target model only when needed.

# Speculative decoding with DFlash
def flash_speculative_decode(
    target_model, draft_model, input_ids,
    max_new_tokens=100, block_size=4
):
    """
    Generate tokens using DFlash speculative decoding.
    """
    generated = input_ids.clone()
    target_conditioning = extract_target_conditioning(
        target_model, input_ids, layer_indices=[12, 24]
    )

    while generated.shape[1] < input_ids.shape[1] + max_new_tokens:
        # Draft: generate block in parallel
        draft_block = block_diffusion_generate(
            draft_model, generated, block_size=block_size
        )

        # Verify against target model
        with torch.no_grad():
            target_logits = target_model(
                torch.cat([generated, draft_block], dim=1)
            ).logits

        # Acceptance checking: compare probabilities
        accepted = verify_tokens(
            draft_block,
            target_logits[:, -block_size:],
            temperature=1.0
        )

        # Add verified tokens; fall back to target if none accepted
        num_accepted = accepted.sum().item()
        if num_accepted > 0:
            generated = torch.cat([
                generated,
                draft_block[:, :num_accepted]
            ], dim=1)
        else:
            # Fall back to target model
            next_token = target_model(generated).logits[:, -1].argmax(dim=-1)
            generated = torch.cat([generated, next_token.unsqueeze(1)], dim=1)

    return generated

Practical Guidance

When to use: High-throughput inference settings (serving, batch processing) where latency is critical. Less beneficial for interactive single-token-at-a-time scenarios.

Hyperparameters:

Block size (2-8): larger blocks → more parallelism but lower acceptance rates; 4 is sweet spot
Diffusion steps (10-20): more steps → better quality but slower; 15 typical
Injection strength (0.3-0.7): balance target guidance with draft autonomy
Layer indices for conditioning: use middle-to-later layers (12-24 in 32-layer models)

Common pitfalls:

Target model hidden state dimension must align with draft model; use projection layers
Block boundaries matter: random initialization worse than structured patterns; try position-aware initialization
Acceptance rate drops if draft model too weak; consider larger draft (7B-13B for 70B target)
Memory overhead from storing target conditioning; cache selectively for long sequences

Scaling: Speedup scales with block size and acceptance rate. Typical 4-6x speedup on standard benchmarks. Works best with strong target models and sufficient compute for parallel drafting.

Reference

Paper: https://arxiv.org/abs/2602.06036 Code: Available at author's repository Related work: Speculative decoding, EAGLE, diffusion-based generation, fast inference

Related Skills

ADu2021/flow-map-trajectory-tilting

testing

VerifiedTrustedCommunity

Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flow-map-trajectory-tilting

ADu2021/flexible-data-mixture-of-experts

testing

VerifiedTrustedCommunity

Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flexible-data-mixture-of-experts

ADu2021/flexibility-trap-diffusion-reasoning

data-ai

VerifiedTrustedCommunity

Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flexibility-trap-diffusion-reasoning

ADu2021/flex-continuous-agent-evolution

devops

VerifiedTrustedCommunity

Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flex-continuous-agent-evolution

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/ADu2021/skillXiv.git

# Copy into Claude Code skills folder (global)
cp -r skillXiv/skills/skillxiv-v0.0.2-claude-opus-4.6/dflash-block-diffusion-speculative-decoding ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

ADu2021/skillXiv

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT