skills/skillxiv-v0.0.2-claude-opus-4.6/block-ffn-sparse-mixture-experts/SKILL.md
Deploy efficient MoE models on resource-constrained edge devices by learning chunk-level activation sparsity that achieves 3.67× speedup. Use when you need to compress LLMs for on-device inference while maintaining reasoning quality and supporting speculative decoding acceleration.
npx skillsauth add ADu2021/skillXiv block-ffn-sparse-mixture-expertsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Standard Mixture-of-Experts architectures suffer from two fundamental constraints that prevent efficient on-device deployment: non-differentiable routing mechanisms that limit training convergence, and low chunk-level sparsity where consecutive tokens activate overlapping expert sets, blocking parallelization opportunities. BlockFFN addresses both problems through differentiable routers and specialized loss functions that enforce structured sparsity patterns compatible with hardware acceleration.
The core insight is that chunk-level sparsity (whether the same expert is active across consecutive tokens) is what enables acceleration kernels, not just token-level sparsity. By learning when to deactivate experts across sequences of tokens, the model becomes compatible with speculative decoding and GEMM-based acceleration while remaining highly parameter-efficient.
BlockFFN replaces TopK routing with a differentiable linear router combined with ReLU gating, enabling end-to-end optimization. Critically, it introduces two novel loss functions: Activation Locality Loss encourages neighboring tokens to activate similar expert sets, and Chunk Sparsification Loss directly minimizes the probability that an expert activates within L-token windows.
This architecture design enables hardware accelerators to compute only the union of activated experts across n tokens in parallel, masking unnecessary activations post-computation. The result is genuine speedup on edge devices rather than theoretical parameter reduction.
The router combines simplicity with differentiability, avoiding TopK bottlenecks while maintaining effective expert selection.
import torch
import torch.nn as nn
import torch.nn.functional as F
class BlockFFNRouter(nn.Module):
def __init__(self, hidden_dim, num_experts, sharpness=1.0):
super().__init__()
self.hidden_dim = hidden_dim
self.num_experts = num_experts
self.sharpness = sharpness
# Linear router without bias (simpler gradients)
self.router_linear = nn.Linear(hidden_dim, num_experts, bias=False)
self.rms_norm = nn.RMSNorm(num_experts)
def forward(self, x):
# x: (batch, seq_len, hidden_dim)
# Linear routing scores
scores = self.router_linear(x) # (batch, seq_len, num_experts)
# ReLU activation for sparsity
scores = F.relu(scores)
# RMSNorm for numerical stability
scores = self.rms_norm(scores)
# Compute activation probabilities via sigmoid with sharpness
probs = torch.sigmoid(self.sharpness * scores)
return probs, scores
class BlockFFNExpert(nn.Module):
def __init__(self, hidden_dim, expert_dim, dropout=0.1):
super().__init__()
# Smaller expert dimensions for finer granularity
self.fc1 = nn.Linear(hidden_dim, expert_dim)
self.fc2 = nn.Linear(expert_dim, hidden_dim)
self.dropout = nn.Dropout(dropout)
self.activation = nn.SiLU() # Swish
def forward(self, x):
x = self.fc1(x)
x = self.activation(x)
x = self.dropout(x)
x = self.fc2(x)
return x
The router avoids gradient blocking by using ReLU and normalization instead of TopK, enabling reliable training.
This loss encourages neighboring tokens to select similar expert sets, which creates structured sparsity compatible with acceleration.
def activation_locality_loss(activation_probs, alpha=2.0):
"""
Minimize divergence between neighboring token activation patterns.
Args:
activation_probs: (batch, seq_len, num_experts) probabilities from sigmoid
alpha: sharpness parameter for approximation (higher = sharper transition)
Returns:
locality loss encouraging similar patterns in adjacent tokens
"""
batch_size, seq_len, num_experts = activation_probs.shape
# Sharpen probabilities for binary-like patterns
sharpened = torch.sigmoid(alpha * (2 * activation_probs - 1))
# Compute BCE between adjacent tokens
locality_loss = 0.0
for i in range(seq_len - 1):
# Binary cross entropy between token i and i+1
curr = sharpened[:, i, :] # (batch, num_experts)
next_token = sharpened[:, i + 1, :]
# BCE: -[p*log(q) + (1-p)*log(1-q)]
bce = -(curr * torch.log(next_token + 1e-8) +
(1 - curr) * torch.log(1 - next_token + 1e-8))
locality_loss += bce.mean()
return locality_loss / max(seq_len - 1, 1)
This loss directly minimizes the probability that an expert activates anywhere within an L-token window, enabling acceleration kernel compatibility.
def chunk_sparsification_loss(activation_probs, chunk_length=8):
"""
Minimize probability that expert i activates within L-token chunk.
Probability that expert i does NOT activate in any token: ∏(1 - p_ik)
Probability that expert i DOES activate: 1 - ∏(1 - p_ik)
Args:
activation_probs: (batch, seq_len, num_experts)
chunk_length: L in the paper (typically 8 for speculative decoding)
Returns:
sparsification loss encouraging deactivation across chunks
"""
batch_size, seq_len, num_experts = activation_probs.shape
num_chunks = (seq_len + chunk_length - 1) // chunk_length
sparsif_loss = 0.0
for chunk_idx in range(num_chunks):
start = chunk_idx * chunk_length
end = min(start + chunk_length, seq_len)
# Get probabilities for this chunk
chunk_probs = activation_probs[:, start:end, :] # (batch, L, num_experts)
# Compute probability of NOT activating: ∏(1 - p_ik)
prob_no_activation = torch.prod(1 - chunk_probs, dim=1) # (batch, num_experts)
# Probability of activation: 1 - ∏(1 - p_ik)
prob_activation = 1 - prob_no_activation
# Loss: minimize activation probability (encourage deactivation)
sparsif_loss += prob_activation.mean()
return sparsif_loss / num_chunks
Integrate the router, experts, and loss functions into a complete layer.
class BlockFFNLayer(nn.Module):
def __init__(self, hidden_dim, num_experts, expert_dim, chunk_length=8):
super().__init__()
self.hidden_dim = hidden_dim
self.num_experts = num_experts
self.chunk_length = chunk_length
self.router = BlockFFNRouter(hidden_dim, num_experts)
self.experts = nn.ModuleList(
[BlockFFNExpert(hidden_dim, expert_dim) for _ in range(num_experts)]
)
self.norm = nn.LayerNorm(hidden_dim)
def forward(self, x):
batch_size, seq_len, hidden_dim = x.shape
# Get activation probabilities from router
probs, _ = self.router(x) # (batch, seq_len, num_experts)
# Apply experts with gating
expert_outputs = []
for expert in self.experts:
expert_output = expert(x) # (batch, seq_len, hidden_dim)
expert_outputs.append(expert_output)
# Stack expert outputs
expert_outputs = torch.stack(expert_outputs, dim=2) # (batch, seq_len, num_experts, hidden_dim)
# Gate with probabilities
gated = probs.unsqueeze(-1) * expert_outputs # (batch, seq_len, num_experts, hidden_dim)
# Sum across experts
output = gated.sum(dim=2) # (batch, seq_len, hidden_dim)
# Normalize and residual
output = self.norm(output + x)
return output, probs
# Training loop with combined losses
def train_block_ffn_step(model, batch, optimizer, lambda_al=0.1, lambda_cs=0.1):
x = batch['input'] # (batch, seq_len, hidden_dim)
targets = batch['targets']
# Forward pass
output, activation_probs = model.layer(x)
# Main language modeling loss
logits = model.lm_head(output)
lm_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
# Auxiliary losses
al_loss = activation_locality_loss(activation_probs)
cs_loss = chunk_sparsification_loss(activation_probs, chunk_length=8)
# Combined loss
total_loss = lm_loss + lambda_al * al_loss + lambda_cs * cs_loss
optimizer.zero_grad()
total_loss.backward()
optimizer.step()
return total_loss.item()
As training progresses, gradually increase λ_cs to enforce stronger chunk-level sparsity.
def schedule_sparsity_factor(step, total_steps, lambda_cs_max=0.5):
"""
Linearly increase sparsification weight from 0 to lambda_cs_max.
Prevents early-stage training instability.
"""
return lambda_cs_max * (step / total_steps)
# In training loop
for step, batch in enumerate(train_loader):
lambda_cs = schedule_sparsity_factor(step, total_steps, lambda_cs_max=0.5)
loss = train_block_ffn_step(model, batch, optimizer, lambda_cs=lambda_cs)
| Parameter | Value | Notes | |-----------|-------|-------| | Router Sharpness α | 2.0 | Increases with training; controls activation discretization | | λ_al (Activation Locality) | 0.1-0.3 | Weight for locality loss; higher = more aligned neighboring patterns | | λ_cs (Chunk Sparsification) | 0.1-0.5 | Weight for chunk loss; schedule to increase over training | | Chunk Length L | 8 tokens | Aligned with speculative decoding batches (32 tokens = 4 chunks) | | Number of Experts | 40 (optimal) | Empirically optimal for sparsity-performance tradeoff | | Expert Dimension | d_e << d_h | 256-512 for 2048-dim hidden states | | Token-Level Sparsity Target | 80-84% | Achieved via router without explicit TopK | | Chunk-Level Sparsity Target | 70%+ | Primary metric for acceleration compatibility |
Tian, Y., Wang, Z., Liu, S., et al. (2024). BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity. arXiv preprint arXiv:2507.08771.
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.