ADu2021/flash-prefill

Name: flash-prefill
Author: ADu2021

skills/skillxiv-v0.0.2-claude-opus-4.6/flash-prefill/SKILL.md

npx skillsauth add ADu2021/skillXiv flash-prefill

FlashPrefill: Achieving 27x Speedup via Instant Sparse Attention Pattern Discovery

Long-context LLM inference suffers from quadratic attention complexity during prefilling. Computing full attention over sequences of 100K+ tokens incurs enormous memory and compute costs. Existing sparse attention methods like Top-k and Top-p require expensive score computation, sorting operations, and handle long-tail distributions poorly, creating additional overhead that partially cancels efficiency gains.

FlashPrefill eliminates this overhead through instantaneous pattern discovery: identifying which attention blocks matter without materializing full score matrices. By using block-level proxies and max-based thresholding, it achieves 27.78x speedup on 256K sequences while maintaining near-identical accuracy.

Core Concept

Rather than computing every Q-K dot product and then sorting the scores (O(L²) work or worse), FlashPrefill relies on three techniques:

Instant Pattern Recognition: Uses uniformly distributed query probes to simultaneously identify vertical (column-sparse), slash (diagonal), and block-sparse patterns from block-level statistics alone
Block Approximation: Computes block-pair importance via fused 2D-reduction kernels, reducing memory traffic from O(L²/B) to O((L/B)²)
Dynamic Thresholding: Replaces sorting with single-pass max reduction to determine pruning thresholds, avoiding cumulative summation overhead

The key insight: attention patterns are often predictable at block granularity without computing token-level scores. Vertical patterns (attending to few positions) and slash patterns (attending to sliding windows) can be identified from approximate block interactions.
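To give a sense of scale for working at block granularity, the following sketch counts token pairs against block pairs for one sequence. The helper name `block_grid_sizes` is hypothetical, and the 256K length and block size of 128 are chosen purely for illustration.

```python
def block_grid_sizes(seq_len: int, block_size: int) -> dict:
    """Count score entries at token granularity and at block granularity."""
    num_blocks = -(-seq_len // block_size)  # ceiling division
    return {
        "num_blocks": num_blocks,
        "token_pairs": seq_len * seq_len,        # entries in a full score matrix
        "block_pairs": num_blocks * num_blocks,  # entries in the block-level proxy
    }


sizes = block_grid_sizes(256 * 1024, 128)
print(sizes)
print("reduction factor:", sizes["token_pairs"] // sizes["block_pairs"])
```

When the sequence length is a multiple of the block size, the block-level grid is smaller than the token-level one by a factor of B², which is why pattern discovery on the proxy is cheap compared with scoring every token pair.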

Architecture Overview

Block-Level Proxy Computation: Compute block-level attention scores using average-pooled keys/queries within each block
Three Pattern Types Detected: Vertical (few columns), Slash (diagonal bands), Block (rectangular dense regions)
Fused Kernel Implementation: Single-pass kernel computes all block interactions without intermediate materialization
Physical Index Jumping: Use identified sparse block indices to implement block-sparse attention efficiently
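As an illustration of what vertical and slash patterns look like on a block grid, the sketch below averages a small block score matrix along columns and along diagonals. It is not the detection procedure from the paper; the function name `column_and_diagonal_means` and the toy matrix are assumptions made for this example.

```python
def column_and_diagonal_means(block_scores):
    """Summarise a square block score matrix (list of lists).

    Returns (col_means, diag_means). col_means[j] averages column j, so a
    large value suggests a vertical pattern. diag_means[d] averages entries
    with i - j == d (lower triangle), so a large value suggests a slash pattern.
    """
    n = len(block_scores)
    col_means = [sum(block_scores[i][j] for i in range(n)) / n for j in range(n)]
    diag_means = [
        sum(block_scores[i][i - d] for i in range(d, n)) / (n - d)
        for d in range(n)
    ]
    return col_means, diag_means


# Toy grid: column 0 is always attended (vertical) and so is the main diagonal (slash).
scores = [
    [5.0, 0.0, 0.0, 0.0],
    [5.0, 5.0, 0.0, 0.0],
    [5.0, 0.0, 5.0, 0.0],
    [5.0, 0.0, 0.0, 5.0],
]
col_means, diag_means = column_and_diagonal_means(scores)
print("column means:", col_means)
print("diagonal means:", diag_means)
```

In this toy grid the first column and the main diagonal stand out from their neighbours, which is the kind of structure the block-level proxy is meant to expose.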

Implementation Steps

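The algorithm has three logical stages, which an optimized implementation fuses into a single kernel pass. The PyTorch snippets below spell out each stage separately for clarity and are not optimized.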

Stage 1: Block-Level Attention Approximation

Compute approximate block importance scores using pooled features. For a sequence of length L with block size B, create ⌈L/B⌉ "block queries" and "block keys" by averaging the queries and keys within each block.

import math

import torch


# Compute block-level attention scores
def block_attention_scores(Q, K, block_size):
    """
    Q: query tensor [seq_len, dim]
    K: key tensor [seq_len, dim]
    block_size: B (e.g., 64 or 128)

    Returns: score matrix [num_blocks, num_blocks]
    """
    seq_len = Q.shape[0]
    num_blocks = (seq_len + block_size - 1) // block_size

    # Pool queries and keys by block
    Q_blocks = []
    K_blocks = []

    for i in range(num_blocks):
        start = i * block_size
        end = min((i + 1) * block_size, seq_len)  # last block may be shorter

        # Average pool within block
        Q_blocks.append(Q[start:end].mean(dim=0, keepdim=True))
        K_blocks.append(K[start:end].mean(dim=0, keepdim=True))

    Q_blocks = torch.cat(Q_blocks, dim=0)  # [num_blocks, dim]
    K_blocks = torch.cat(K_blocks, dim=0)  # [num_blocks, dim]

    # Compute block-level attention scores
    block_scores = torch.matmul(Q_blocks, K_blocks.t()) / math.sqrt(Q.shape[-1])
    return block_scores  # [num_blocks, num_blocks]
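One way to read the pooled score: because the dot product is bilinear, the product of the mean query and the mean key equals the average of all token-level dot products in that block pair. The short check below confirms this on hand-picked vectors, using plain Python lists to stay dependency-free; the helper names `dot` and `mean_vec` are introduced only for this example.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vec(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


q_block = [[1.0, 2.0], [3.0, 0.0]]                # two queries in one block
k_block = [[0.5, 1.0], [1.5, -1.0], [2.0, 2.0]]   # three keys in one block

# Score of the pooled (averaged) query against the pooled key
pooled_score = dot(mean_vec(q_block), mean_vec(k_block))

# Average of every token-level query-key dot product in the block pair
avg_pair_score = sum(dot(q, k) for q in q_block for k in k_block) / (
    len(q_block) * len(k_block)
)

print(pooled_score, avg_pair_score)
```

The two numbers agree up to floating-point rounding. A consequence worth keeping in mind is that an average can hide a single large token-level score inside an otherwise unimportant block.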

Stage 2: Pattern Identification and Dynamic Thresholding

Identify which blocks to attend to using a threshold derived from one max and one min reduction over the block scores, with no sorting.

# Identify sparse pattern via dynamic thresholding
def identify_sparse_pattern(block_scores, alpha=0.15):
    """
    block_scores: [num_blocks, num_blocks]
    alpha: position of the threshold between the min and max block score
           (0 keeps nearly every block, 1 keeps none); 0.1-0.3 is typical

    Returns: boolean mask [num_blocks, num_blocks] indicating which blocks to compute
    """
    # One max and one min reduction establish the dynamic threshold
    max_score = block_scores.max()
    min_score = block_scores.min()

    # Dynamic threshold: min + alpha * (max - min)
    threshold = min_score + alpha * (max_score - min_score)

    # Binary mask: attend to blocks strictly above the threshold
    pattern_mask = block_scores > threshold

    return pattern_mask
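A small worked example may help show how the threshold falls out of the score range. The sketch below mirrors the thresholding rule on a 3x3 grid using plain Python lists; `max_based_mask` is a hypothetical name, and `alpha` stands for the 0.15 weight on the maximum.

```python
def max_based_mask(block_scores, alpha=0.15):
    """Keep blocks whose score exceeds min + alpha * (max - min)."""
    flat = [s for row in block_scores for s in row]
    hi, lo = max(flat), min(flat)        # two linear reductions, no sort
    threshold = lo + alpha * (hi - lo)
    mask = [[s > threshold for s in row] for row in block_scores]
    return mask, threshold


scores = [
    [4.0, -2.0, -1.5],
    [0.0, 3.0, -2.0],
    [-1.0, -1.2, 2.5],
]
mask, threshold = max_based_mask(scores)
kept = sum(keep for row in mask for keep in row)
print("threshold:", threshold)
print("kept blocks:", kept, "of", len(scores) ** 2)
```

Here the scores span -2 to 4, so the threshold lands near -1.1 and five of the nine blocks survive. Raising `alpha` moves the threshold toward the maximum and prunes more blocks.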

Stage 3: Fused Block-Sparse Attention

Execute attention only over identified sparse blocks. Materialize full token-level attention only for selected block pairs.

import math

import torch


# Compute sparse attention efficiently
def sparse_block_attention(Q, K, V, pattern_mask, block_size):
    """
    Q, K, V: [seq_len, dim] tensors
    pattern_mask: [num_blocks, num_blocks] boolean mask
    block_size: B

    Returns: attended values [seq_len, dim]

    Non-causal reference version; autoregressive models also need a causal mask.
    """
    seq_len, dim = Q.shape
    num_blocks = pattern_mask.shape[0]
    scale = 1.0 / math.sqrt(dim)

    # Initialize output
    output = torch.zeros_like(Q)

    for i in range(num_blocks):
        q_start, q_end = i * block_size, min((i + 1) * block_size, seq_len)
        Q_block = Q[q_start:q_end]
        n_q = q_end - q_start

        # Running softmax statistics shared across all selected key blocks,
        # so the result matches one softmax over the union of selected keys
        row_max = torch.full((n_q, 1), float("-inf"), dtype=Q.dtype, device=Q.device)
        row_sum = torch.zeros(n_q, 1, dtype=Q.dtype, device=Q.device)
        acc = torch.zeros(n_q, dim, dtype=Q.dtype, device=Q.device)

        # Iterate only over unmasked (attended) block pairs
        for j in range(num_blocks):
            if not pattern_mask[i, j]:
                continue  # Skip masked blocks

            k_start, k_end = j * block_size, min((j + 1) * block_size, seq_len)
            K_block = K[k_start:k_end]
            V_block = V[k_start:k_end]

            # Token-level scaled dot-product scores for this block pair only
            scores = torch.matmul(Q_block, K_block.t()) * scale

            # Rescale earlier partial sums to the new running max
            new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
            correction = torch.exp(row_max - new_max)
            weights = torch.exp(scores - new_max)
            row_sum = row_sum * correction + weights.sum(dim=-1, keepdim=True)
            acc = acc * correction + torch.matmul(weights, V_block)
            row_max = new_max

        # Normalize by the total softmax mass; rows with no selected block stay zero
        output[q_start:q_end] = acc / (row_sum + 1e-10)

    return output
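When attention is evaluated one key block at a time, the partial results have to share a single softmax normalisation. One common way to do this is a running-max accumulation, sketched below with scalar values and plain Python so it runs without dependencies. The function names are hypothetical; the example also shows that simply averaging independent per-block softmax outputs gives a different answer.

```python
import math


def softmax_weighted_sum(scores, values):
    """Reference: one softmax over all scores, then a weighted sum of values."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    return sum(wi * v for wi, v in zip(w, values)) / sum(w)


def blockwise_softmax_weighted_sum(score_blocks, value_blocks):
    """Same result, computed block by block with a running max and running sum."""
    run_max, run_sum, acc = float("-inf"), 0.0, 0.0
    for scores, values in zip(score_blocks, value_blocks):
        new_max = max(run_max, max(scores))
        correction = math.exp(run_max - new_max)  # 0.0 on the first block
        w = [math.exp(s - new_max) for s in scores]
        run_sum = run_sum * correction + sum(w)
        acc = acc * correction + sum(wi * v for wi, v in zip(w, values))
        run_max = new_max
    return acc / run_sum


score_blocks = [[1.0, 3.0], [0.5, 2.0]]
value_blocks = [[10.0, 20.0], [30.0, 40.0]]

full = softmax_weighted_sum(sum(score_blocks, []), sum(value_blocks, []))
blocked = blockwise_softmax_weighted_sum(score_blocks, value_blocks)
naive = sum(
    softmax_weighted_sum(s, v) for s, v in zip(score_blocks, value_blocks)
) / len(score_blocks)

print("full softmax:       ", full)
print("blockwise (running):", blocked)
print("per-block average:  ", naive)
```

The running-max version reproduces the full softmax up to rounding, while the per-block average weights every block equally regardless of how large its scores are.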

Practical Guidance

Hyperparameters:

Block size: Typically 64-256 tokens (larger blocks = faster but less precise pattern discovery)
Threshold α: 0.1-0.3 depending on desired sparsity; 0.15 is a good default
Pattern types: All three (vertical, slash, block) should be checked; enable/disable based on profiling
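To make the sparsity trade-off behind these settings concrete, the sketch below measures what fraction of block pairs a mask keeps and the corresponding best-case reduction in attention work. The helper names `mask_density` and `ideal_speedup` are hypothetical, and the bound ignores pattern-discovery cost and kernel overhead, so measured speedups will be lower.

```python
def mask_density(pattern_mask):
    """Fraction of block pairs that the mask keeps (list of lists of bool)."""
    total = sum(len(row) for row in pattern_mask)
    kept = sum(1 for row in pattern_mask for keep in row if keep)
    return kept / total


def ideal_speedup(pattern_mask):
    """Upper bound on attention speedup: total block pairs / kept block pairs."""
    density = mask_density(pattern_mask)
    return float("inf") if density == 0 else 1.0 / density


# 4x4 block grid that keeps only the main diagonal
mask = [[i == j for j in range(4)] for i in range(4)]
density = mask_density(mask)
speedup = ideal_speedup(mask)
print("density:", density, "ideal speedup:", speedup)
```

Profiling the density on representative prompts is a reasonable way to check whether a given block size and threshold prune enough blocks to be worth the extra pass.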

When to Apply:

Long-context inference (100K+ tokens) where attention is the bottleneck
Prefilling phase specifically—not decoding, where sparse patterns are less predictable
Models with regular, structured attention patterns (e.g., many retrieval-based tasks)

When NOT to Apply:

Short sequences (<4K tokens) where full attention is already fast
Decoding phase (generation token-by-token) where causal masking changes pattern structure
Tasks requiring dense attention across all positions (e.g., some translation tasks)

Key Pitfalls:

Setting block_size too large eliminates fine-grained patterns; too small creates overhead
Threshold too high = missing important attention; too low = no speedup
Pattern types not aligned with workload (e.g., slash patterns matter for retrieval but not generation)
Not accounting for causal masking if using in autoregressive settings
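For the causal-masking pitfall, one possible approach is to constrain the block mask before running sparse attention. The sketch below drops every block above the diagonal and always keeps the diagonal block; both the function name `apply_causal_block_mask` and the choice to force the diagonal on are assumptions for illustration.

```python
def apply_causal_block_mask(pattern_mask):
    """Restrict a square block mask (list of lists of bool) to causal blocks.

    Blocks with j > i lie entirely in the future and are dropped. The diagonal
    block (j == i) is always kept so that each token can attend to itself.
    """
    n = len(pattern_mask)
    return [
        [(j == i) or (j < i and pattern_mask[i][j]) for j in range(n)]
        for i in range(n)
    ]


mask = [
    [False, True, True],
    [True, False, True],
    [False, True, False],
]
causal_mask = apply_causal_block_mask(mask)
for row in causal_mask:
    print(row)
```

Diagonal blocks still need a token-level causal mask inside the block, since later queries in a block may attend to earlier keys in the same block but not the other way around.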

Integration Notes: Works as a drop-in replacement for standard attention in prefilling phases; requires minimal code changes to attention kernels and integrates with Flash-Attention style implementations.

Evidence: Achieves 27.78x speedup at 256K token lengths on Llama-3 models; maintains <1% accuracy drop across diverse vision-language and text-only models; largest gains on retrieval-augmented generation tasks.

Reference: https://arxiv.org/abs/2603.06199

Accelerates long-context LLM prefilling by identifying sparse attention patterns without expensive scoring, using block-level approximations and dynamic thresholding. Achieves 27.78x speedup at 256K tokens while maintaining accuracy.

2 stars

data-ai

Updated Apr 17, 2026

Install globally with one command (works with Claude Code, Cursor, and Windsurf):

npx skillsauth add ADu2021/skillXiv flash-prefill

Security Scan Results

3 of 9 scanners reported clean. The other six were skipped.

Clean (95%): Trivy, container and dependency vulnerability scanner
Clean (95%): Semgrep, static code analysis for vulnerabilities
Clean (95%): mcp-scan (Snyk), Model Context Protocol security validation
Skipped (50%): Snyk (dep), open source security scanning
Skipped (50%): Socket.dev, supply chain security analysis
Skipped (50%): VirusTotal, multi-engine malware detection
Skipped (50%): CrowdStrike, advanced threat intelligence
Skipped (50%): OSV-Scanner, Open Source Vulnerability database check
Skipped (50%): OWASP Dep-Check

Last scanned: Apr 17, 2026, 5:34 AM. Scan time: 4.3s. 1 file scanned.

SKILL.md

name: flash-prefill
title: FlashPrefill: Instantaneous Pattern Discovery for Ultra-Fast Long-Context Prefilling
version: 0.0.2
engine: skillxiv-v0.0.2-claude-opus-4.6
license: MIT
url: https://arxiv.org/abs/2603.06199
keywords: [LLM Inference, Attention, Sparse Attention, Long Context, Prefilling]
description: Accelerates long-context LLM prefilling by identifying sparse attention patterns without expensive scoring, using block-level approximations and dynamic thresholding. Achieves 27.78x speedup at 256K tokens while maintaining accuracy.

Install manually

# Clone the repo
git clone https://github.com/ADu2021/skillXiv.git

# Copy into Claude Code skills folder (global)
cp -r skillXiv/skills/skillxiv-v0.0.2-claude-opus-4.6/flash-prefill ~/.claude/skills/

Repository

ADu2021/skillXiv

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT