skills/skillxiv-v0.0.2-claude-opus-4.6/flash-prefill/SKILL.md
Accelerates long-context LLM prefilling by identifying sparse attention patterns without expensive scoring, using block-level approximations and dynamic thresholding. Achieves 27.78x speedup at 256K tokens while maintaining accuracy.
Long-context LLM inference suffers from quadratic attention complexity during prefilling. Computing full attention over sequences of 100K+ tokens incurs enormous memory and compute costs. Existing sparse attention methods such as Top-k and Top-p require expensive score computation and sorting, and they handle long-tail distributions poorly; the resulting overhead partially cancels their efficiency gains.
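To make the quadratic cost concrete, here is a back-of-envelope sketch for a single attention head. It assumes that "256K" means 2**18 tokens, that scores are stored in fp16, and that the block size is 128; none of these specifics are stated as the evaluated configuration, so treat the numbers as illustrative only.

```python
# Back-of-envelope cost of materializing attention scores for one head.
# Assumptions (illustrative, not from the source): 256K = 2**18 tokens,
# fp16 scores, block size 128.
seq_len = 2**18          # 262,144 tokens
block_size = 128         # assumed block size B
bytes_per_score = 2      # fp16

# Dense attention materializes one score per (query, key) pair.
dense_entries = seq_len ** 2
dense_gib = dense_entries * bytes_per_score / 2**30

# A block-level proxy keeps one score per (query block, key block) pair.
num_blocks = seq_len // block_size
block_entries = num_blocks ** 2

print(f"dense score entries : {dense_entries:,}")
print(f"dense score memory  : {dense_gib:.0f} GiB per head")
print(f"block-level entries : {block_entries:,}")
print(f"reduction factor    : {dense_entries // block_entries:,}x")
```

Under these assumptions the block-level score matrix is B² = 16,384 times smaller than the token-level one, which is why a block-granularity estimate is cheap relative to full scoring.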
FlashPrefill eliminates this overhead through instantaneous pattern discovery: identifying which attention blocks matter without materializing full score matrices. By using block-level proxies and max-based thresholding, it achieves 27.78x speedup on 256K sequences while maintaining near-identical accuracy.
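Rather than computing all Q-K dot products and then sorting them (O(L²) or worse), FlashPrefill estimates block importance from pooled queries and keys, selects blocks with a max-based dynamic threshold, and computes exact attention only for the selected block pairs.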
The key insight: attention patterns are often predictable at block granularity without computing token-level scores. Vertical patterns (attending to few positions) and slash patterns (attending to sliding windows) can be identified from approximate block interactions.
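The two pattern types can be illustrated with a small synthetic example. The sketch below builds a causal token-level pattern from a sliding window (slash) plus a few fixed key positions (vertical) and reports which blocks contain any active pair. The function name `block_pattern` and its parameters are hypothetical and only serve the illustration; this is not the detection procedure itself.

```python
def block_pattern(seq_len, block_size, vertical_cols, window):
    """Return a [num_blocks][num_blocks] boolean grid of active blocks.

    Hypothetical helper for illustration: a block is active if any
    (query, key) pair inside it belongs to the slash or vertical pattern.
    """
    num_blocks = (seq_len + block_size - 1) // block_size
    mask = [[False] * num_blocks for _ in range(num_blocks)]
    for q in range(seq_len):
        # Slash pattern: the `window` most recent keys (causal sliding window).
        keys = set(range(max(0, q - window + 1), q + 1))
        # Vertical pattern: a few fixed key positions seen by every later query.
        keys.update(c for c in vertical_cols if c <= q)
        for k in keys:
            mask[q // block_size][k // block_size] = True
    return mask


grid = block_pattern(seq_len=8, block_size=2, vertical_cols=[0], window=2)
for row in grid:
    print("".join("#" if active else "." for active in row))
```

With these toy settings the slash pattern shows up as a band along the block diagonal and the vertical pattern as a filled first column, so 9 of the 16 blocks are active.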
The algorithm operates in three stages executed sequentially in a fused manner.
Stage 1: Block-Level Attention Approximation
Compute approximate block importance scores from pooled features. For a sequence of length L and block size B, form ⌈L/B⌉ block queries and block keys by averaging the queries and keys within each block.

    import math
    import torch

    # Compute block-level attention scores
    def block_attention_scores(Q, K, block_size):
        """
        Q: query tensor [seq_len, dim]
        K: key tensor [seq_len, dim]
        block_size: B (e.g., 64 or 128)
        Returns: score matrix [num_blocks, num_blocks]
        """
        seq_len = Q.shape[0]
        num_blocks = (seq_len + block_size - 1) // block_size
        # Pool queries and keys by block
        Q_blocks = []
        K_blocks = []
        for i in range(num_blocks):
            start = i * block_size
            end = min((i + 1) * block_size, seq_len)
            # Average pool within block (the last block may be shorter than B)
            Q_blocks.append(Q[start:end].mean(dim=0, keepdim=True))
            K_blocks.append(K[start:end].mean(dim=0, keepdim=True))
        Q_blocks = torch.cat(Q_blocks, dim=0)  # [num_blocks, dim]
        K_blocks = torch.cat(K_blocks, dim=0)  # [num_blocks, dim]
        # Scaled dot product between pooled queries and pooled keys
        block_scores = torch.matmul(Q_blocks, K_blocks.t()) / math.sqrt(Q.shape[-1])
        return block_scores  # [num_blocks, num_blocks]
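The per-block Python loop is easy to read but slow for long sequences. A vectorized variant might look like the sketch below, which zero-pads the sequence to a multiple of the block size and divides by the true token count of each block. The name `pooled_block_scores` is hypothetical, and this is an illustrative rewrite of the pooling step, not the fused kernel the method refers to.

```python
import math

import torch
import torch.nn.functional as F


def pooled_block_scores(Q, K, block_size):
    """Vectorized mean pooling of Q and K by block, then scaled dot product.

    Q, K: [seq_len, dim]. Returns [num_blocks, num_blocks].
    """
    seq_len, dim = Q.shape
    num_blocks = math.ceil(seq_len / block_size)
    pad = num_blocks * block_size - seq_len

    # Zero-pad along the sequence axis so it splits evenly into blocks.
    Qp = F.pad(Q, (0, 0, 0, pad))
    Kp = F.pad(K, (0, 0, 0, pad))

    # Real tokens per block; only the last block can be short.
    counts = torch.full((num_blocks, 1), float(block_size),
                        dtype=Q.dtype, device=Q.device)
    counts[-1] = block_size - pad

    # Sum within each block and divide by the true count, so padding
    # does not bias the mean of the last block.
    Q_blocks = Qp.reshape(num_blocks, block_size, dim).sum(dim=1) / counts
    K_blocks = Kp.reshape(num_blocks, block_size, dim).sum(dim=1) / counts

    return Q_blocks @ K_blocks.t() / math.sqrt(dim)
```

The pooling cost is linear in the sequence length, and the score matrix has only ⌈L/B⌉² entries, so this stage stays small next to the token-level attention it is meant to prune.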
Stage 2: Pattern Identification and Dynamic Thresholding
Identify which blocks to attend to using a threshold derived from single-pass max and min reductions over the block scores, with no sorting.

    # Identify sparse pattern via dynamic thresholding
    def identify_sparse_pattern(block_scores, alpha=0.15):
        """
        block_scores: [num_blocks, num_blocks]
        alpha: position of the cutoff between the min and max score
               (often 0.1-0.3 depending on desired sparsity)
        Returns: boolean mask [num_blocks, num_blocks] indicating which blocks to compute
        """
        # Single-pass reductions to establish the dynamic threshold
        max_score = block_scores.max()
        min_score = block_scores.min()
        # Dynamic threshold: min + alpha * (max - min);
        # alpha = 0.15 gives 0.15 * max + 0.85 * min
        threshold = min_score + alpha * (max_score - min_score)
        # Binary mask: attend to blocks above threshold
        pattern_mask = block_scores > threshold
        return pattern_mask
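The thresholding step as written does not account for causal masking, which decoder-only models apply during prefill. The sketch below is one possible way to combine the two, under assumptions that are not taken from the source: future blocks are excluded from the min/max statistics, and the diagonal block is always kept so every query block can attend to itself. The name `causal_threshold_mask` is hypothetical.

```python
import torch


def causal_threshold_mask(block_scores, alpha=0.15):
    """Threshold block scores while respecting block-level causality.

    Assumption (not from the source): statistics are taken over the
    lower triangle only, and diagonal blocks are always selected.
    """
    n = block_scores.shape[0]
    device = block_scores.device
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool, device=device))

    # Use only past-or-present blocks when computing min and max.
    valid = block_scores[causal]
    lo, hi = valid.min(), valid.max()
    threshold = lo + alpha * (hi - lo)

    # Keep blocks above the threshold, and never a future block.
    mask = (block_scores > threshold) & causal
    # Always keep each query block's own key block.
    mask |= torch.eye(n, dtype=torch.bool, device=device)
    return mask


scores = torch.tensor([[1.0, 9.0, 9.0],
                       [0.0, 2.0, 9.0],
                       [10.0, 0.5, 0.0]])
mask = causal_threshold_mask(scores)
print(mask)
print("kept fraction:", mask.float().mean().item())
```

Inside a selected diagonal block, token-level causal masking would still be needed; this sketch only decides which blocks are visited.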
Stage 3: Fused Block-Sparse Attention
Execute attention only over the identified sparse blocks. Token-level attention is materialized only for selected block pairs, and the softmax is normalized jointly over all keys selected for a query block.

    import math
    import torch

    # Compute sparse attention efficiently
    def sparse_block_attention(Q, K, V, pattern_mask, block_size):
        """
        Q, K, V: [seq_len, dim] tensors
        pattern_mask: [num_blocks, num_blocks] boolean mask
        block_size: B
        Returns: attended values [seq_len, dim]
        """
        seq_len, dim = Q.shape
        num_blocks = pattern_mask.shape[0]
        scale = 1.0 / math.sqrt(dim)
        output = torch.zeros_like(Q)
        for i in range(num_blocks):
            q_start, q_end = i * block_size, min((i + 1) * block_size, seq_len)
            Q_block = Q[q_start:q_end]
            n_q = q_end - q_start
            # Running softmax statistics (online softmax across key blocks)
            row_max = Q.new_full((n_q, 1), float("-inf"))
            row_sum = Q.new_zeros((n_q, 1))
            acc = torch.zeros_like(Q_block)
            # Iterate only over unmasked (attended) key blocks
            for j in range(num_blocks):
                if not pattern_mask[i, j]:
                    continue  # Skip masked blocks
                k_start, k_end = j * block_size, min((j + 1) * block_size, seq_len)
                # Token-level scores for this block pair only
                scores = torch.matmul(Q_block, K[k_start:k_end].t()) * scale
                new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
                # Rescale earlier partial results to the new running max
                correction = torch.exp(row_max - new_max)
                probs = torch.exp(scores - new_max)
                row_sum = row_sum * correction + probs.sum(dim=-1, keepdim=True)
                acc = acc * correction + torch.matmul(probs, V[k_start:k_end])
                row_max = new_max
            # Normalize once over all selected keys; rows with no selected block stay zero
            output[q_start:q_end] = acc / (row_sum + 1e-10)
        return output
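A block-sparse routine is easy to get subtly wrong, so a dense reference is useful for checking it. The sketch below expands the block mask to token level and runs ordinary masked softmax attention; with an all-True mask it should match full attention. The name `reference_block_sparse_attention` is hypothetical, and this is a correctness oracle for small inputs, not an efficient implementation.

```python
import math

import torch


def reference_block_sparse_attention(Q, K, V, pattern_mask, block_size):
    """Dense O(L^2) reference: softmax over exactly the keys in selected blocks.

    Q, K, V: [seq_len, dim]; pattern_mask: [num_blocks, num_blocks] bool.
    """
    seq_len, dim = Q.shape
    # Expand block mask to token level, then trim padding from the last block.
    token_mask = pattern_mask.repeat_interleave(block_size, dim=0)
    token_mask = token_mask.repeat_interleave(block_size, dim=1)
    token_mask = token_mask[:seq_len, :seq_len]

    scores = Q @ K.t() / math.sqrt(dim)
    scores = scores.masked_fill(~token_mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    # A row with no selected block is all -inf and yields NaN; define it as zero.
    weights = torch.nan_to_num(weights, nan=0.0)
    return weights @ V


torch.manual_seed(0)
Q, K, V = torch.randn(6, 4), torch.randn(6, 4), torch.randn(6, 4)
full_mask = torch.ones(3, 3, dtype=torch.bool)
dense = torch.softmax(Q @ K.t() / math.sqrt(4), dim=-1) @ V
out = reference_block_sparse_attention(Q, K, V, full_mask, block_size=2)
print(torch.allclose(out, dense, atol=1e-5))
```

Comparing a sparse implementation against this reference on random masks and short sequences catches normalization mistakes, such as applying softmax per block pair instead of over all selected keys.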
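Hyperparameters: block size B (e.g., 64 or 128) and threshold coefficient alpha (often 0.1-0.3, depending on desired sparsity).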
When to Apply:
When NOT to Apply:
Key Pitfalls:
Integration Notes: Works as a drop-in replacement for standard attention in prefilling phases; requires minimal code changes to attention kernels and integrates with Flash-Attention style implementations.
Evidence: Achieves 27.78x speedup at 256K token lengths on Llama-3 models; maintains <1% accuracy drop across diverse vision-language and text-only models; largest gains on retrieval-augmented generation tasks.
Reference: https://arxiv.org/abs/2603.06199