BlockFFN: Sparse Mixture-of-Experts Optimized for End-Device Deployment

Standard Mixture-of-Experts architectures suffer from two fundamental constraints that prevent efficient on-device deployment: non-differentiable routing mechanisms that limit training convergence, and low chunk-level sparsity where consecutive tokens activate overlapping expert sets, blocking parallelization opportunities. BlockFFN addresses both problems through differentiable routers and specialized loss functions that enforce structured sparsity patterns compatible with hardware acceleration.

The core insight is that chunk-level sparsity (whether the same expert is active across consecutive tokens) is what enables acceleration kernels, not just token-level sparsity. By learning when to deactivate experts across sequences of tokens, the model becomes compatible with speculative decoding and GEMM-based acceleration while remaining highly parameter-efficient.

Core Concept

BlockFFN replaces TopK routing with a differentiable linear router combined with ReLU gating, enabling end-to-end optimization. Critically, it introduces two novel loss functions: Activation Locality Loss encourages neighboring tokens to activate similar expert sets, and Chunk Sparsification Loss directly minimizes the probability that an expert activates within L-token windows.

This architecture design enables hardware accelerators to compute only the union of activated experts across n tokens in parallel, masking unnecessary activations post-computation. The result is genuine speedup on edge devices rather than theoretical parameter reduction.

Architecture Overview

Router Module: Linear transformation with ReLU activation followed by RMSNorm, enabling differentiable expert selection unlike non-differentiable TopK
Expert Modules: Non-gated MLPs with Swish activation, sized smaller than typical (d_e << d_h) for fine-grained sparsity control
Activation Pattern Learning: Sigmoid-sharpened activation patterns for smooth gradients during Activation Locality Loss computation
Chunk-Level Sparsity Mechanism: Probabilistic computation of expert activation within L-token windows using binary models
CUTLASS Acceleration Kernel: GEMM-based implementation processing n=32 tokens simultaneously, computing only the union of activated experts

Implementation

Router Architecture Setup

The router combines simplicity with differentiability, avoiding TopK bottlenecks while maintaining effective expert selection.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockFFNRouter(nn.Module):
    def __init__(self, hidden_dim, num_experts, sharpness=1.0):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_experts = num_experts
        self.sharpness = sharpness

        # Linear router without bias (simpler gradients)
        self.router_linear = nn.Linear(hidden_dim, num_experts, bias=False)
        self.rms_norm = nn.RMSNorm(num_experts)

    def forward(self, x):
        # x: (batch, seq_len, hidden_dim)

        # Linear routing scores
        scores = self.router_linear(x)  # (batch, seq_len, num_experts)

        # ReLU activation for sparsity
        scores = F.relu(scores)

        # RMSNorm for numerical stability
        scores = self.rms_norm(scores)

        # Compute activation probabilities via sigmoid with sharpness
        probs = torch.sigmoid(self.sharpness * scores)

        return probs, scores

class BlockFFNExpert(nn.Module):
    def __init__(self, hidden_dim, expert_dim, dropout=0.1):
        super().__init__()
        # Smaller expert dimensions for finer granularity
        self.fc1 = nn.Linear(hidden_dim, expert_dim)
        self.fc2 = nn.Linear(expert_dim, hidden_dim)
        self.dropout = nn.Dropout(dropout)
        self.activation = nn.SiLU()  # Swish

    def forward(self, x):
        x = self.fc1(x)
        x = self.activation(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

The router avoids gradient blocking by using ReLU and normalization instead of TopK, enabling reliable training.

Activation Locality Loss

This loss encourages neighboring tokens to select similar expert sets, which creates structured sparsity compatible with acceleration.

def activation_locality_loss(activation_probs, alpha=2.0):
    """
    Minimize divergence between neighboring token activation patterns.

    Args:
        activation_probs: (batch, seq_len, num_experts) probabilities from sigmoid
        alpha: sharpness parameter for approximation (higher = sharper transition)

    Returns:
        locality loss encouraging similar patterns in adjacent tokens
    """
    batch_size, seq_len, num_experts = activation_probs.shape

    # Sharpen probabilities for binary-like patterns
    sharpened = torch.sigmoid(alpha * (2 * activation_probs - 1))

    # Compute BCE between adjacent tokens
    locality_loss = 0.0
    for i in range(seq_len - 1):
        # Binary cross entropy between token i and i+1
        curr = sharpened[:, i, :]  # (batch, num_experts)
        next_token = sharpened[:, i + 1, :]

        # BCE: -[p*log(q) + (1-p)*log(1-q)]
        bce = -(curr * torch.log(next_token + 1e-8) +
                (1 - curr) * torch.log(1 - next_token + 1e-8))
        locality_loss += bce.mean()

    return locality_loss / max(seq_len - 1, 1)

Chunk Sparsification Loss

This loss directly minimizes the probability that an expert activates anywhere within an L-token window, enabling acceleration kernel compatibility.

def chunk_sparsification_loss(activation_probs, chunk_length=8):
    """
    Minimize probability that expert i activates within L-token chunk.

    Probability that expert i does NOT activate in any token: ∏(1 - p_ik)
    Probability that expert i DOES activate: 1 - ∏(1 - p_ik)

    Args:
        activation_probs: (batch, seq_len, num_experts)
        chunk_length: L in the paper (typically 8 for speculative decoding)

    Returns:
        sparsification loss encouraging deactivation across chunks
    """
    batch_size, seq_len, num_experts = activation_probs.shape
    num_chunks = (seq_len + chunk_length - 1) // chunk_length

    sparsif_loss = 0.0
    for chunk_idx in range(num_chunks):
        start = chunk_idx * chunk_length
        end = min(start + chunk_length, seq_len)

        # Get probabilities for this chunk
        chunk_probs = activation_probs[:, start:end, :]  # (batch, L, num_experts)

        # Compute probability of NOT activating: ∏(1 - p_ik)
        prob_no_activation = torch.prod(1 - chunk_probs, dim=1)  # (batch, num_experts)

        # Probability of activation: 1 - ∏(1 - p_ik)
        prob_activation = 1 - prob_no_activation

        # Loss: minimize activation probability (encourage deactivation)
        sparsif_loss += prob_activation.mean()

    return sparsif_loss / num_chunks

Full BlockFFN Layer with Combined Losses

Integrate the router, experts, and loss functions into a complete layer.

class BlockFFNLayer(nn.Module):
    def __init__(self, hidden_dim, num_experts, expert_dim, chunk_length=8):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_experts = num_experts
        self.chunk_length = chunk_length

        self.router = BlockFFNRouter(hidden_dim, num_experts)
        self.experts = nn.ModuleList(
            [BlockFFNExpert(hidden_dim, expert_dim) for _ in range(num_experts)]
        )
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, x):
        batch_size, seq_len, hidden_dim = x.shape

        # Get activation probabilities from router
        probs, _ = self.router(x)  # (batch, seq_len, num_experts)

        # Apply experts with gating
        expert_outputs = []
        for expert in self.experts:
            expert_output = expert(x)  # (batch, seq_len, hidden_dim)
            expert_outputs.append(expert_output)

        # Stack expert outputs
        expert_outputs = torch.stack(expert_outputs, dim=2)  # (batch, seq_len, num_experts, hidden_dim)

        # Gate with probabilities
        gated = probs.unsqueeze(-1) * expert_outputs  # (batch, seq_len, num_experts, hidden_dim)

        # Sum across experts
        output = gated.sum(dim=2)  # (batch, seq_len, hidden_dim)

        # Normalize and residual
        output = self.norm(output + x)

        return output, probs

# Training loop with combined losses
def train_block_ffn_step(model, batch, optimizer, lambda_al=0.1, lambda_cs=0.1):
    x = batch['input']  # (batch, seq_len, hidden_dim)
    targets = batch['targets']

    # Forward pass
    output, activation_probs = model.layer(x)

    # Main language modeling loss
    logits = model.lm_head(output)
    lm_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

    # Auxiliary losses
    al_loss = activation_locality_loss(activation_probs)
    cs_loss = chunk_sparsification_loss(activation_probs, chunk_length=8)

    # Combined loss
    total_loss = lm_loss + lambda_al * al_loss + lambda_cs * cs_loss

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

    return total_loss.item()

Adaptive Factor Scheduling

As training progresses, gradually increase λ_cs to enforce stronger chunk-level sparsity.

def schedule_sparsity_factor(step, total_steps, lambda_cs_max=0.5):
    """
    Linearly increase sparsification weight from 0 to lambda_cs_max.
    Prevents early-stage training instability.
    """
    return lambda_cs_max * (step / total_steps)

# In training loop
for step, batch in enumerate(train_loader):
    lambda_cs = schedule_sparsity_factor(step, total_steps, lambda_cs_max=0.5)
    loss = train_block_ffn_step(model, batch, optimizer, lambda_cs=lambda_cs)

Practical Guidance

Key Hyperparameters

| Parameter | Value | Notes | |-----------|-------|-------| | Router Sharpness α | 2.0 | Increases with training; controls activation discretization | | λ_al (Activation Locality) | 0.1-0.3 | Weight for locality loss; higher = more aligned neighboring patterns | | λ_cs (Chunk Sparsification) | 0.1-0.5 | Weight for chunk loss; schedule to increase over training | | Chunk Length L | 8 tokens | Aligned with speculative decoding batches (32 tokens = 4 chunks) | | Number of Experts | 40 (optimal) | Empirically optimal for sparsity-performance tradeoff | | Expert Dimension | d_e << d_h | 256-512 for 2048-dim hidden states | | Token-Level Sparsity Target | 80-84% | Achieved via router without explicit TopK | | Chunk-Level Sparsity Target | 70%+ | Primary metric for acceleration compatibility |

When to Use

Deploying LLMs to edge devices (Jetson Orin, mobile hardware) with strict memory constraints
Requiring real-time inference speedup on consumer-grade GPUs (RTX 4080, A100)
Building systems that need both parameter efficiency AND hardware-friendly sparsity patterns
Implementing speculative decoding where the model handles 32-token batches simultaneously
Creating energy-efficient inference pipelines for mobile or embedded applications

When NOT to Use

Training from scratch with small models (<0.1B parameters); sparsity overhead exceeds benefits
Applications with no hardware acceleration support (pure CPU inference gets minimal speedup)
Scenarios where absolute accuracy is critical and you cannot tolerate 1-2% performance drops
Highly specialized domains where model capacity is already tight (sparsity compounds scaling down)

Common Pitfalls

Over-aggressive chunk length: Using L=4 instead of L=8 causes unbalanced expert loads and training instability; match your hardware's speculative batch size
Neglecting locality warmup: Applying full λ_al from step 0 causes optimization conflicts; start with λ_al=0, gradually increase to 0.1 over first 1000 steps
Mismatched expert dimensionality: Setting expert_dim too large (>d_h/2) negates sparsity benefits; keep d_e < 1/4 of hidden dimension
Ignoring chunk boundary effects: Edge chunks (last partial chunk) behave differently; pad seq_len to multiples of chunk_length for consistency
Static routing weights: Using fixed λ_cs throughout training leads to degraded initial convergence; implement adaptive scheduling to increase gradually

Reference

Tian, Y., Wang, Z., Liu, S., et al. (2024). BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity. arXiv preprint arXiv:2507.08771.

BlockFFN: Sparse Mixture-of-Experts Optimized for End-Device Deployment

Core Concept

Architecture Overview

Router Module: Linear transformation with ReLU activation followed by RMSNorm, enabling differentiable expert selection unlike non-differentiable TopK
Expert Modules: Non-gated MLPs with Swish activation, sized smaller than typical (d_e << d_h) for fine-grained sparsity control
Activation Pattern Learning: Sigmoid-sharpened activation patterns for smooth gradients during Activation Locality Loss computation
Chunk-Level Sparsity Mechanism: Probabilistic computation of expert activation within L-token windows using binary models
CUTLASS Acceleration Kernel: GEMM-based implementation processing n=32 tokens simultaneously, computing only the union of activated experts

Implementation

Router Architecture Setup

The router combines simplicity with differentiability, avoiding TopK bottlenecks while maintaining effective expert selection.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockFFNRouter(nn.Module):
    def __init__(self, hidden_dim, num_experts, sharpness=1.0):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_experts = num_experts
        self.sharpness = sharpness

        # Linear router without bias (simpler gradients)
        self.router_linear = nn.Linear(hidden_dim, num_experts, bias=False)
        self.rms_norm = nn.RMSNorm(num_experts)

    def forward(self, x):
        # x: (batch, seq_len, hidden_dim)

        # Linear routing scores
        scores = self.router_linear(x)  # (batch, seq_len, num_experts)

        # ReLU activation for sparsity
        scores = F.relu(scores)

        # RMSNorm for numerical stability
        scores = self.rms_norm(scores)

        # Compute activation probabilities via sigmoid with sharpness
        probs = torch.sigmoid(self.sharpness * scores)

        return probs, scores

class BlockFFNExpert(nn.Module):
    def __init__(self, hidden_dim, expert_dim, dropout=0.1):
        super().__init__()
        # Smaller expert dimensions for finer granularity
        self.fc1 = nn.Linear(hidden_dim, expert_dim)
        self.fc2 = nn.Linear(expert_dim, hidden_dim)
        self.dropout = nn.Dropout(dropout)
        self.activation = nn.SiLU()  # Swish

    def forward(self, x):
        x = self.fc1(x)
        x = self.activation(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

The router avoids gradient blocking by using ReLU and normalization instead of TopK, enabling reliable training.

Activation Locality Loss

This loss encourages neighboring tokens to select similar expert sets, which creates structured sparsity compatible with acceleration.

def activation_locality_loss(activation_probs, alpha=2.0):
    """
    Minimize divergence between neighboring token activation patterns.

    Args:
        activation_probs: (batch, seq_len, num_experts) probabilities from sigmoid
        alpha: sharpness parameter for approximation (higher = sharper transition)

    Returns:
        locality loss encouraging similar patterns in adjacent tokens
    """
    batch_size, seq_len, num_experts = activation_probs.shape

    # Sharpen probabilities for binary-like patterns
    sharpened = torch.sigmoid(alpha * (2 * activation_probs - 1))

    # Compute BCE between adjacent tokens
    locality_loss = 0.0
    for i in range(seq_len - 1):
        # Binary cross entropy between token i and i+1
        curr = sharpened[:, i, :]  # (batch, num_experts)
        next_token = sharpened[:, i + 1, :]

        # BCE: -[p*log(q) + (1-p)*log(1-q)]
        bce = -(curr * torch.log(next_token + 1e-8) +
                (1 - curr) * torch.log(1 - next_token + 1e-8))
        locality_loss += bce.mean()

    return locality_loss / max(seq_len - 1, 1)

Chunk Sparsification Loss

This loss directly minimizes the probability that an expert activates anywhere within an L-token window, enabling acceleration kernel compatibility.

def chunk_sparsification_loss(activation_probs, chunk_length=8):
    """
    Minimize probability that expert i activates within L-token chunk.

    Probability that expert i does NOT activate in any token: ∏(1 - p_ik)
    Probability that expert i DOES activate: 1 - ∏(1 - p_ik)

    Args:
        activation_probs: (batch, seq_len, num_experts)
        chunk_length: L in the paper (typically 8 for speculative decoding)

    Returns:
        sparsification loss encouraging deactivation across chunks
    """
    batch_size, seq_len, num_experts = activation_probs.shape
    num_chunks = (seq_len + chunk_length - 1) // chunk_length

    sparsif_loss = 0.0
    for chunk_idx in range(num_chunks):
        start = chunk_idx * chunk_length
        end = min(start + chunk_length, seq_len)

        # Get probabilities for this chunk
        chunk_probs = activation_probs[:, start:end, :]  # (batch, L, num_experts)

        # Compute probability of NOT activating: ∏(1 - p_ik)
        prob_no_activation = torch.prod(1 - chunk_probs, dim=1)  # (batch, num_experts)

        # Probability of activation: 1 - ∏(1 - p_ik)
        prob_activation = 1 - prob_no_activation

        # Loss: minimize activation probability (encourage deactivation)
        sparsif_loss += prob_activation.mean()

    return sparsif_loss / num_chunks

Full BlockFFN Layer with Combined Losses

Integrate the router, experts, and loss functions into a complete layer.

class BlockFFNLayer(nn.Module):
    def __init__(self, hidden_dim, num_experts, expert_dim, chunk_length=8):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_experts = num_experts
        self.chunk_length = chunk_length

        self.router = BlockFFNRouter(hidden_dim, num_experts)
        self.experts = nn.ModuleList(
            [BlockFFNExpert(hidden_dim, expert_dim) for _ in range(num_experts)]
        )
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, x):
        batch_size, seq_len, hidden_dim = x.shape

        # Get activation probabilities from router
        probs, _ = self.router(x)  # (batch, seq_len, num_experts)

        # Apply experts with gating
        expert_outputs = []
        for expert in self.experts:
            expert_output = expert(x)  # (batch, seq_len, hidden_dim)
            expert_outputs.append(expert_output)

        # Stack expert outputs
        expert_outputs = torch.stack(expert_outputs, dim=2)  # (batch, seq_len, num_experts, hidden_dim)

        # Gate with probabilities
        gated = probs.unsqueeze(-1) * expert_outputs  # (batch, seq_len, num_experts, hidden_dim)

        # Sum across experts
        output = gated.sum(dim=2)  # (batch, seq_len, hidden_dim)

        # Normalize and residual
        output = self.norm(output + x)

        return output, probs

# Training loop with combined losses
def train_block_ffn_step(model, batch, optimizer, lambda_al=0.1, lambda_cs=0.1):
    x = batch['input']  # (batch, seq_len, hidden_dim)
    targets = batch['targets']

    # Forward pass
    output, activation_probs = model.layer(x)

    # Main language modeling loss
    logits = model.lm_head(output)
    lm_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

    # Auxiliary losses
    al_loss = activation_locality_loss(activation_probs)
    cs_loss = chunk_sparsification_loss(activation_probs, chunk_length=8)

    # Combined loss
    total_loss = lm_loss + lambda_al * al_loss + lambda_cs * cs_loss

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

    return total_loss.item()

Adaptive Factor Scheduling

As training progresses, gradually increase λ_cs to enforce stronger chunk-level sparsity.

def schedule_sparsity_factor(step, total_steps, lambda_cs_max=0.5):
    """
    Linearly increase sparsification weight from 0 to lambda_cs_max.
    Prevents early-stage training instability.
    """
    return lambda_cs_max * (step / total_steps)

# In training loop
for step, batch in enumerate(train_loader):
    lambda_cs = schedule_sparsity_factor(step, total_steps, lambda_cs_max=0.5)
    loss = train_block_ffn_step(model, batch, optimizer, lambda_cs=lambda_cs)

Practical Guidance

Key Hyperparameters

When to Use

Deploying LLMs to edge devices (Jetson Orin, mobile hardware) with strict memory constraints
Requiring real-time inference speedup on consumer-grade GPUs (RTX 4080, A100)
Building systems that need both parameter efficiency AND hardware-friendly sparsity patterns
Implementing speculative decoding where the model handles 32-token batches simultaneously
Creating energy-efficient inference pipelines for mobile or embedded applications

When NOT to Use

Training from scratch with small models (<0.1B parameters); sparsity overhead exceeds benefits
Applications with no hardware acceleration support (pure CPU inference gets minimal speedup)
Scenarios where absolute accuracy is critical and you cannot tolerate 1-2% performance drops
Highly specialized domains where model capacity is already tight (sparsity compounds scaling down)

Common Pitfalls

Over-aggressive chunk length: Using L=4 instead of L=8 causes unbalanced expert loads and training instability; match your hardware's speculative batch size
Neglecting locality warmup: Applying full λ_al from step 0 causes optimization conflicts; start with λ_al=0, gradually increase to 0.1 over first 1000 steps
Mismatched expert dimensionality: Setting expert_dim too large (>d_h/2) negates sparsity benefits; keep d_e < 1/4 of hidden dimension
Ignoring chunk boundary effects: Edge chunks (last partial chunk) behave differently; pad seq_len to multiples of chunk_length for consistency
Static routing weights: Using fixed λ_cs throughout training leads to degraded initial convergence; implement adaptive scheduling to increase gradually

Reference

Tian, Y., Wang, Z., Liu, S., et al. (2024). BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity. arXiv preprint arXiv:2507.08771.

Adoption

ADu2021/block-ffn-sparse-mixture-experts

$ install --global

Security Scan Results

SKILL.md

BlockFFN: Sparse Mixture-of-Experts Optimized for End-Device Deployment

Core Concept

Architecture Overview

Implementation

Router Architecture Setup

Activation Locality Loss

Chunk Sparsification Loss

Full BlockFFN Layer with Combined Losses

Adaptive Factor Scheduling

Practical Guidance

Key Hyperparameters

When to Use

When NOT to Use

Common Pitfalls

Reference

Related Skills

ADu2021/flow-map-trajectory-tilting

ADu2021/flexible-data-mixture-of-experts

ADu2021/flexibility-trap-diffusion-reasoning

ADu2021/flex-continuous-agent-evolution

ADu2021/block-ffn-sparse-mixture-experts

$ install --global

Security Scan Results

SKILL.md

BlockFFN: Sparse Mixture-of-Experts Optimized for End-Device Deployment

Core Concept

Architecture Overview

Implementation

Router Architecture Setup

Activation Locality Loss

Chunk Sparsification Loss

Full BlockFFN Layer with Combined Losses

Adaptive Factor Scheduling

Practical Guidance

Key Hyperparameters

When to Use

When NOT to Use

Common Pitfalls

Reference

Related Skills

ADu2021/flow-map-trajectory-tilting

ADu2021/flexible-data-mixture-of-experts

ADu2021/flexibility-trap-diffusion-reasoning

ADu2021/flex-continuous-agent-evolution