skills/skillxiv-v0.0.2-claude-opus-4.6/efficient-long-context-reasoning/SKILL.md
Achieve 10× higher decoding throughput on long prompts by replacing 50% of cross-attention layers with gated memory units (GMUs) combining SSMs and attention. Maintains reasoning capability while reducing memory I/O bottleneck from O(d_kv·N) to O(d_h).
npx skillsauth add ADu2021/skillXiv efficient-long-context-reasoningInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Standard transformer decoders using full attention struggle with long generation because decoding is I/O-bound: each new token must read cached key-value tensors proportional to prompt length, causing memory bandwidth bottlenecks. For a 32K generation length following a 2K prompt, the model spends most time moving data rather than computing. State Space Models (SSMs) solve I/O through linear recurrence but lose expressiveness of full attention. The decoder-hybrid-decoder architecture elegantly balances these concerns using Gated Memory Units (GMUs): a simple mechanism that mixes SSM efficiency with selective attention, replacing 50% of cross-attention layers while preserving reasoning capability and reducing memory I/O by 16× per layer.
When generating long reasoning traces, extended outputs, or multi-step planning, efficiency per generated token becomes critical. A 10× decoding speedup translates to seconds saved per query in production, or enables longer generation within latency budgets. The hybrid approach maintains strong performance on reasoning benchmarks—models with GMU layers outperform standard decoders on AIME and GPT-4 analogs.
The decoder-hybrid-decoder architecture consists of a standard Samba decoder (using SSMs for efficient self-attention during generation) paired with a cross-decoder handling prompt context. The innovation is selective GMU replacement: approximately 50% of cross-attention layers that attend to cached prompts become GMU layers. Each GMU takes two inputs—current layer hidden states and mixed representations from previous layers—and combines them through learnable projections and element-wise multiplication with SiLU activation. This design reduces memory I/O from O(d_kv·N) for full attention to O(d_h) for GMU (where N is prompt length and d_h is head dimension), achieving dramatic speedups. The architecture preserves linear pre-filling time complexity while reducing decoding memory bottleneck, enabling strong scaling to long contexts without sacrificing reasoning ability.
This example demonstrates the Gated Memory Unit that combines SSM representations with selective attention through gating.
# Gated Memory Unit for efficient long-context attention
import torch
import torch.nn as nn
import torch.nn.functional as F
class GatedMemoryUnit(nn.Module):
"""Hybrid SSM-attention mechanism reducing memory I/O for long contexts."""
def __init__(self, hidden_dim: int, num_heads: int = 32):
super().__init__()
self.hidden_dim = hidden_dim
self.head_dim = hidden_dim // num_heads
self.num_heads = num_heads
# SSM-style recurrent processing
self.recurrent_proj = nn.Linear(hidden_dim, hidden_dim)
self.recurrent_state = None # Maintains state across positions
# Gating mechanism: combines current layer and mixed representations
self.current_proj = nn.Linear(hidden_dim, hidden_dim)
self.mixed_proj = nn.Linear(hidden_dim, hidden_dim)
self.gate_proj = nn.Linear(2 * hidden_dim, hidden_dim)
self.out_proj = nn.Linear(hidden_dim, hidden_dim)
def forward(self, current_hidden: torch.Tensor, mixed_representations: torch.Tensor):
"""Process with gated memory mechanism.
current_hidden: [batch, seq_len, hidden_dim] current layer activations
mixed_representations: [batch, seq_len, hidden_dim] from previous GMU layers"""
batch_size, seq_len, hidden_dim = current_hidden.shape
# Project current layer to memory update signal
current_proj = self.current_proj(current_hidden)
# Project mixed representations (SSM-style state from previous layers)
mixed_proj = self.mixed_proj(mixed_representations)
# Gating: learn how much to take from each source
gate_input = torch.cat([current_proj, mixed_proj], dim=-1)
gate_logits = self.gate_proj(gate_input)
gate = torch.sigmoid(gate_logits) # [batch, seq_len, hidden_dim]
# Gated combination: selective mixing via element-wise multiplication
output = gate * current_proj + (1 - gate) * mixed_proj
# Apply SiLU activation for non-linearity
output = F.silu(output)
# Project to output
output = self.out_proj(output)
# Residual connection would be added by parent layer
return output
def memory_complexity(self, seq_len: int, batch_size: int) -> int:
"""Compute memory I/O for this layer vs standard attention."""
# GMU: O(d_h) = hidden_dim
gmu_io = batch_size * seq_len * self.head_dim
# Standard cross-attention: O(d_kv * N) = hidden_dim * seq_len
standard_io = batch_size * seq_len * self.hidden_dim
return {
'gmu_io_bytes': gmu_io * 4, # 4 bytes per float32
'standard_io_bytes': standard_io * 4,
'reduction_ratio': standard_io / gmu_io
}
This example shows the hybrid decoder combining SSM self-attention with GMU-enhanced cross-attention.
class DecoderHybridDecoder(nn.Module):
"""Decoder-hybrid-decoder architecture for efficient long-context generation."""
def __init__(self, hidden_dim: int = 4096, num_layers: int = 40, num_heads: int = 32):
super().__init__()
self.hidden_dim = hidden_dim
self.num_layers = num_layers
# Samba self-decoder: efficient on generated tokens
self.self_decoder_layers = nn.ModuleList([
SambaLayer(hidden_dim, num_heads)
for _ in range(num_layers // 2) # First half: self-attention on generation
])
# Cross-decoder: attends to prompt context
# Replace ~50% of cross-attention with GMU for efficiency
self.cross_decoder_layers = nn.ModuleList()
for i in range(num_layers // 2):
if i % 2 == 0:
# GMU layer: efficient, reduces I/O
self.cross_decoder_layers.append({
'type': 'gmu',
'module': GatedMemoryUnit(hidden_dim, num_heads)
})
else:
# Standard cross-attention: high quality but expensive
self.cross_decoder_layers.append({
'type': 'cross_attn',
'module': StandardCrossAttention(hidden_dim, num_heads)
})
self.norm = nn.LayerNorm(hidden_dim)
def forward_with_cache(self, generated_tokens: torch.Tensor,
prompt_cache: torch.Tensor,
position: int):
"""Efficient generation: attend to cached prompt, maintain state."""
# Process through self-decoder (on generated sequence)
hidden = generated_tokens
for layer in self.self_decoder_layers:
hidden = layer(hidden)
# Process through cross-decoder with GMU optimization
mixed_repr = None # Accumulated from previous GMU layers
for i, cross_layer_config in enumerate(self.cross_decoder_layers):
layer_type = cross_layer_config['type']
layer = cross_layer_config['module']
if layer_type == 'gmu':
# GMU: efficient, reduces memory I/O from O(d_kv*N) to O(d_h)
if mixed_repr is None:
mixed_repr = hidden.clone()
hidden = layer(hidden, mixed_repr)
mixed_repr = hidden # Accumulate for next GMU
else:
# Standard attention: higher quality but memory expensive
hidden = layer(hidden, prompt_cache, position)
# Final normalization
hidden = self.norm(hidden)
return hidden
def estimate_decoding_efficiency(self, prompt_length: int, generation_length: int, batch_size: int = 1):
"""Estimate I/O efficiency gains from GMU optimization."""
# Count GMU vs attention layers
num_gmu = sum(1 for layer_config in self.cross_decoder_layers
if layer_config['type'] == 'gmu')
num_attn = len(self.cross_decoder_layers) - num_gmu
# Memory I/O per generated token
attn_io_per_token = num_attn * batch_size * prompt_length * self.hidden_dim * 4
gmu_io_per_token = num_gmu * batch_size * prompt_length * self.hidden_dim // 32 * 4
# Total for full generation
total_io = (attn_io_per_token + gmu_io_per_token) * generation_length
# Baseline (all attention)
baseline_io = num_attn * batch_size * prompt_length * self.hidden_dim * 4 * generation_length
return {
'total_io_bytes': total_io,
'baseline_io_bytes': baseline_io,
'throughput_improvement': baseline_io / total_io if total_io > 0 else 1.0,
'gmu_layers': num_gmu,
'attention_layers': num_attn
}
This example demonstrates the μP++ hyperparameter scaling for fair architecture comparison.
class MuPPlusPlus:
"""Hyperparameter scaling laws accounting for depth and width."""
@staticmethod
def compute_scaled_lr(base_lr: float, width_ratio: float, depth_ratio: float) -> float:
"""Compute learning rate for scaled model using μP++ laws.
width_ratio: ratio of hidden_dim relative to base
depth_ratio: ratio of num_layers relative to base"""
# μP++ scaling: LR ∝ 1 / (width * depth)
scaled_lr = base_lr / (width_ratio * depth_ratio)
return scaled_lr
@staticmethod
def compute_scaled_hidden_dim(base_hidden: int, target_flops: float,
base_flops: float) -> int:
"""Compute hidden dimension for target compute budget."""
ratio = (target_flops / base_flops) ** 0.5
return int(base_hidden * ratio)
@staticmethod
def compute_scaled_depth(base_depth: int, target_flops: float,
base_flops: float, hidden_dim: int,
base_hidden: int) -> int:
"""Compute depth maintaining compute budget iso-parametrically."""
width_ratio = hidden_dim / base_hidden
target_depth_ratio = (target_flops / base_flops) / (width_ratio ** 2)
return int(base_depth * target_depth_ratio)
@staticmethod
def compare_architectures(arch1_config: dict, arch2_config: dict) -> dict:
"""Fair comparison of two architectures with μP++ scaling."""
# Estimate flops for each architecture
def estimate_flops(config):
return (config['hidden_dim'] ** 2 * config['num_layers'] *
config['seq_len'])
flops1 = estimate_flops(arch1_config)
flops2 = estimate_flops(arch2_config)
# Scale to same compute budget
if flops1 < flops2:
arch1_config['scaled'] = arch1_config
arch2_config['scaled'] = {
**arch2_config,
'hidden_dim': MuPPlusPlus.compute_scaled_hidden_dim(
arch2_config['hidden_dim'], flops1, flops2
)
}
else:
arch1_config['scaled'] = {
**arch1_config,
'hidden_dim': MuPPlusPlus.compute_scaled_hidden_dim(
arch1_config['hidden_dim'], flops2, flops1
)
}
arch2_config['scaled'] = arch2_config
return {
'arch1_flops': flops1,
'arch2_flops': flops2,
'scaled_arch1': arch1_config['scaled'],
'scaled_arch2': arch2_config['scaled']
}
| Hyperparameter | Recommended Value | Purpose | |---|---|---| | GMU replacement ratio | 50% (every other layer) | Balance efficiency and quality | | Samba self-decoder depth | num_layers / 2 | Efficient generation | | Cross-decoder GMU/attention ratio | 1:1 | Equal hybrid distribution | | Sliding window size (optional) | 512 tokens | Reduce attention memory for very long prompts | | Learning rate scaling | 1 / (width * depth) | μP++ stable scaling | | Hidden dimension | 4096+ | Maintain capacity despite GMU | | Batch size (inference) | 1-8 | Memory-constrained generation | | Generation length | 32K+ | Where efficiency gains matter |
When to use: Apply decoder-hybrid-decoder when generating long outputs (>4K tokens) where decoding latency is critical. Use for reasoning models that must maintain quality (AIME, code generation) while scaling to long contexts. Ideal for applications like long-form writing, extended planning, or retrieval-augmented generation with large context windows.
When NOT to use: Skip for short-generation tasks (< 512 tokens) where I/O overhead is negligible. Avoid if you need maximum reasoning capability and cannot afford any quality-efficiency tradeoff. Don't use if your sequences are so long that even linear attention becomes problematic. Skip for tasks where every token must attend to every prompt token—GMU naturally attends selectively.
Common pitfalls: Setting GMU ratio wrong—too many GMU layers (>60%) degrades reasoning; too few (< 40%) wastes efficiency. Not using μP++ for scaling causes unfair comparisons and training instability. Setting sliding window too aggressively loses critical context. Forgetting to tune learning rates per scaled model causes divergence. Not benchmarking decoding latency separately from accuracy—quality improvements can be offset by worse throughput. Mixing GMU and standard attention randomly instead of alternating causes layer-wise imbalance.
Phi4 Team. (2025). Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation. arXiv preprint arXiv:2507.06607. https://arxiv.org/abs/2507.06607
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.