ADu2021/casa-vl-fusion

Name: casa-vl-fusion
Author: ADu2021

skills/skillxiv-v0.0.2-claude-opus-4.6/casa-vl-fusion/SKILL.md

npx skillsauth add ADu2021/skillXiv casa-vl-fusion


Overview

CASA (Cross-Attention via Self-Attention) revisits cross-attention (CA) as a practical alternative to direct token insertion for vision-language fusion. Token insertion becomes prohibitively expensive for high-resolution images and video, whereas a carefully designed CA layer fuses the two modalities efficiently. Five key design differences restore CA's competitiveness with token insertion.
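To make the cost gap concrete, the sketch below counts query-key score pairs per attention layer under the usual quadratic attention cost model. With token insertion, all text and image tokens attend to one another; with cross-attention, only text tokens act as queries. The function names and the example sizes are illustrative assumptions, not figures from the paper, and causal masking is ignored for simplicity.

```python
def token_insertion_pairs(num_text: int, num_image: int) -> int:
    """Score pairs when image tokens are inserted into the text sequence."""
    total = num_text + num_image
    return total * total


def cross_attention_pairs(num_text: int, num_image: int) -> int:
    """Score pairs when only text tokens act as queries over text + image keys."""
    return num_text * (num_text + num_image)


if __name__ == "__main__":
    # Hypothetical sizes: a 256-token prompt and 4096 image tokens.
    num_text, num_image = 256, 4096
    inserted = token_insertion_pairs(num_text, num_image)
    crossed = cross_attention_pairs(num_text, num_image)
    print(f"token insertion: {inserted} pairs")
    print(f"cross-attention: {crossed} pairs")
    print(f"ratio: {inserted / crossed:.1f}x")
```

Under this model the gap widens as the number of image tokens grows relative to the text length, which is why the saving matters most for high-resolution images and video.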

Core Technique

The key insight is that cross-attention requires specific design choices to match or exceed token-insertion performance.

Five Critical Design Differences:

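# CASA architecture components (illustrative single-block sketch in PyTorch;
# module choices and default sizes are assumptions, not the paper's reference code)
import torch
import torch.nn as nn
import torch.nn.functional as F


class CASAVisionLanguageModel(nn.Module):
    def __init__(self, hidden_dim=256, num_heads=4, local_window_size=128, num_gist=8):
        super().__init__()
        # D1: Separate parameter layers for cross-attention
        self.text_self_attention = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.cross_attention = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)  # Not shared

        # D2: Joint text-image attention with local windows
        self.local_window_size = local_window_size

        # D3: Reduced self-attention layers for CA layers
        # (layer counts for a full stack; this sketch builds a single block)
        self.num_self_attn = 16
        self.num_cross_attn = 8  # Replaces some self-attn

        # D4: Optional image token FFN updates
        self.image_ffn = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim), nn.GELU(), nn.Linear(4 * hidden_dim, hidden_dim)
        )

        # D5: Visual history via gist tokens
        self.num_gist = num_gist

    def compute_gist(self, image_features):
        """Pool [batch, n_image, hidden] down to [batch, num_gist, hidden]."""
        pooled = F.adaptive_avg_pool1d(image_features.transpose(1, 2), self.num_gist)
        return pooled.transpose(1, 2)

    def forward(self, text_tokens, image_features, prev_gist=None):
        """
        Process text and image with CASA design principles.
        text_tokens: [batch, n_text, hidden]; image_features: [batch, n_image, hidden];
        prev_gist: optional [batch, n_prev, hidden] gist tokens from earlier images.
        """
        n_text, device = text_tokens.size(1), text_tokens.device

        # Maintain text self-attention for robustness (causal; True = blocked)
        offset = torch.arange(n_text, device=device)
        offset = offset[:, None] - offset[None, :]  # query index minus key index
        text_hidden, _ = self.text_self_attention(
            text_tokens, text_tokens, text_tokens, attn_mask=offset < 0
        )

        # Visual context: gists of earlier images (if any) plus the current image
        visual = image_features if prev_gist is None else torch.cat([prev_gist, image_features], dim=1)

        # Joint attention: text attends to image + preceding text
        # within local windows for efficiency (sliding window is an assumption)
        text_blocked = (offset < 0) | (offset >= self.local_window_size)
        image_blocked = torch.zeros(n_text, visual.size(1), dtype=torch.bool, device=device)
        keys = torch.cat([visual, text_hidden], dim=1)
        attended, _ = self.cross_attention(
            text_hidden, keys, keys, attn_mask=torch.cat([image_blocked, text_blocked], dim=1)
        )

        # Optional: update image embeddings via FFN
        image_features = self.image_ffn(image_features)

        # D5: Compress current image into gist tokens for next round
        gist_tokens = self.compute_gist(image_features)

        return attended, gist_tokens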
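The local-window rule in D2 can be shown on its own. The following is a minimal sketch, assuming a sliding window over preceding text and full visibility of image tokens; the paper's exact window definition may differ, and `build_joint_mask` is a hypothetical helper name.

```python
def build_joint_mask(num_image: int, num_text: int, window: int) -> list[list[bool]]:
    """Return mask[i][j] == True where text query i may attend to key j.

    Keys are ordered as [image tokens..., text tokens...]. Every text query
    sees all image tokens, but only itself and the window - 1 text tokens
    immediately before it.
    """
    mask = []
    for i in range(num_text):
        image_part = [True] * num_image
        text_part = [0 <= i - j < window for j in range(num_text)]
        mask.append(image_part + text_part)
    return mask


if __name__ == "__main__":
    for row in build_joint_mask(num_image=2, num_text=4, window=2):
        print("".join("x" if allowed else "." for allowed in row))
```

Here `True` means "may attend". PyTorch's `nn.MultiheadAttention` uses the opposite convention for boolean `attn_mask` (`True` means blocked), so the mask would need to be inverted before being passed to that module.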

Gist Tokens for Visual History: Preserve compressed representations of past images so that memory does not grow with the full token count of every earlier image.

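import torch
import torch.nn.functional as F


def compute_gist_tokens(image_features, projection, num_gist=8, pool_size=4):
    """
    Compress image features into small number of gist tokens
    representing essential visual information for future frames.

    image_features: [batch, height, width, hidden]
    projection: callable mapping [..., hidden] to [..., gist_dim], e.g. torch.nn.Linear
    Pooling to a pool_size x pool_size grid and norm-based importance are
    illustrative assumptions; a trained model may score tokens differently.
    """
    # Average pooling over spatial dimensions (down to a coarse grid)
    grid = F.adaptive_avg_pool2d(image_features.permute(0, 3, 1, 2), pool_size)
    candidates = grid.flatten(2).transpose(1, 2)  # [batch, pool_size**2, hidden]

    # Project to gist token dimension
    gist = projection(candidates)

    # Take top-k tokens by importance score
    importance_scores = gist.norm(dim=-1)  # [batch, pool_size**2]
    k = min(num_gist, gist.size(1))
    top_idx = importance_scores.topk(k, dim=1).indices
    gist_tokens = gist.gather(1, top_idx.unsqueeze(-1).expand(-1, -1, gist.size(-1)))

    return gist_tokens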
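For intuition about what "compressing into gist tokens" means at the data level, here is a dependency-free sketch that uses chunked mean pooling. This is a deliberately simple stand-in for a learned compressor, not the method described in the paper; `mean_pool_gist` is a hypothetical name.

```python
def mean_pool_gist(tokens: list[list[float]], num_gist: int) -> list[list[float]]:
    """Compress a sequence of token vectors into at most num_gist vectors.

    The sequence is split into num_gist contiguous chunks and each chunk is
    replaced by the element-wise mean of its vectors.
    """
    if not tokens:
        return []
    num_gist = min(num_gist, len(tokens))
    dim = len(tokens[0])
    gists = []
    for g in range(num_gist):
        start = g * len(tokens) // num_gist
        end = (g + 1) * len(tokens) // num_gist
        chunk = tokens[start:end]
        gists.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return gists


if __name__ == "__main__":
    image_tokens = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0], [6.0, 8.0]]
    print(mean_pool_gist(image_tokens, num_gist=2))
```

The output length depends only on `num_gist`, not on how many image tokens went in, which is the property the streaming design relies on.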

Streaming Efficiency with Near-Constant Memory: Unlike token insertion, the KV cache scales with the number of gist tokens, not with image resolution.

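import torch


def streaming_forward(model, text_query, new_frame, history_gist):
    """
    Process new frame without storing all prior image tokens.
    Memory is O(num_frames * gist_tokens), independent of image resolution.

    Assumes model.compute_gist returns [batch, gist_tokens, hidden],
    model.cross_attention follows the torch.nn.MultiheadAttention call
    convention (query, key, value) -> (output, weights), and history_gist
    is a list of gist tensors from earlier frames.
    """
    # Current image gist
    gist_current = model.compute_gist(new_frame)

    # Combine historical gists (each entry has a fixed, small size)
    gist_memory = history_gist + [gist_current]
    keys = torch.cat(gist_memory, dim=1)  # [batch, num_frames * gist_tokens, hidden]

    # Cross-attention over gists (efficient)
    output, _ = model.cross_attention(text_query, keys, keys)

    # Memory complexity: O(num_frames * gist_tokens)
    # vs O(num_frames * image_tokens) for token insertion,
    # where image_tokens grows with image resolution

    return output, gist_memory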
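The memory comparison in the comments above can be turned into rough numbers. The sketch below uses the standard KV-cache size estimate for plain multi-head attention (keys and values stored per layer, no grouped-query sharing); the layer count, hidden size, and per-frame token counts are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, hidden_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size: keys and values for every token in every layer."""
    return 2 * num_layers * num_tokens * hidden_dim * bytes_per_value


if __name__ == "__main__":
    num_frames = 100
    layers, hidden = 32, 4096        # hypothetical model size
    tokens_per_frame = 1024          # hypothetical full-resolution token count
    gist_per_frame = 8               # hypothetical gist budget

    full = kv_cache_bytes(num_frames * tokens_per_frame, layers, hidden)
    gist = kv_cache_bytes(num_frames * gist_per_frame, layers, hidden)
    print(f"token insertion: {full / 2**30:.1f} GiB")
    print(f"gist tokens:     {gist / 2**30:.2f} GiB")
```

Both quantities still grow linearly with the number of frames; the saving comes from the per-frame factor, which no longer depends on image resolution.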

When to Use This Technique

Use CASA when:

You are processing high-resolution images or video streams
Memory bandwidth is constrained
Multi-image conversations need to be handled in a streaming fashion
The memory cost of token insertion is prohibitive

When NOT to Use This Technique

Avoid this approach if:

The task involves a single low-resolution image (token insertion suffices)
The task needs fine-grained, pixel-level understanding (gist compression loses spatial detail)
Only a few images or frames are involved (token-insertion memory stays manageable)

Implementation Notes

The framework requires:

Separate cross-attention and self-attention layer implementations
Local windowing mechanism for joint text-image attention
Gist token computation and compression
Streaming inference pipeline for video
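As one possible shape for the streaming pipeline, the sketch below keeps per-frame gists in a fixed-capacity buffer. Dropping the oldest frames once the buffer is full is an assumption added here to make memory strictly bounded; the source only claims near-constant memory. `GistMemory`, `stream`, and the `compute_gist` callback are hypothetical names.

```python
from collections import deque
from typing import Callable, Iterable, Iterator


class GistMemory:
    """Fixed-capacity store of per-frame gist tokens (oldest frames are dropped)."""

    def __init__(self, max_frames: int):
        self._frames = deque(maxlen=max_frames)

    def add(self, gist_tokens: list) -> None:
        self._frames.append(gist_tokens)

    def keys(self) -> list:
        """Concatenate stored gists, oldest first, for use as attention keys."""
        return [token for frame in self._frames for token in frame]


def stream(frames: Iterable, compute_gist: Callable, max_frames: int = 4) -> Iterator[list]:
    """Yield the visual keys available to the text model after each new frame."""
    memory = GistMemory(max_frames)
    for frame in frames:
        memory.add(compute_gist(frame))
        yield memory.keys()


if __name__ == "__main__":
    # Toy gist function: two placeholder "tokens" per frame.
    toy_gist = lambda frame: [frame * 10, frame * 10 + 1]
    for keys in stream([1, 2, 3], toy_gist, max_frames=2):
        print(keys)
```

In a real model, the yielded keys would be tensors passed to the cross-attention layer alongside the current text query.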

Key Performance

Near-constant memory costs for streaming video
Comparable or superior performance to token insertion
Efficient multi-image conversation support
Strong baseline on various VLM benchmarks



Replaces token insertion for vision-language fusion with efficient cross-attention that keeps a separate text self-attention. Text tokens attend to images within local windows, gist tokens preserve information from prior images, and memory costs stay near-constant for streaming video, which makes the approach more practical than direct token insertion in resource-constrained applications.

2 stars

testing

Updated Apr 16, 2026


npx skillsauth add ADu2021/skillXiv casa-vl-fusion

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each entry below.

Trivy: container and dependency vulnerability scanner. Status: Clean (95%)
Semgrep: static code analysis for vulnerabilities. Status: Clean (95%)
mcp-scan (Snyk): Model Context Protocol security validation. Status: Clean (95%)
Snyk (dep): open source security scanning. Status: Skipped (50%)
Socket.dev: supply chain security analysis. Status: Skipped (50%)
VirusTotal: multi-engine malware detection. Status: Skipped (50%)
CrowdStrike: advanced threat intelligence. Status: Skipped (50%)
OSV-Scanner: Open Source Vulnerability database check. Status: Skipped (50%)
OWASP Dep-Check. Status: Skipped (50%)

Last scanned: Apr 24, 2026, 8:55 PM (1.8s, 1 file scanned)

SKILL.md

name: casa-vl-fusion
title: CASA: Cross-Attention via Self-Attention for Efficient VL Fusion
version: 0.0.2
engine: skillxiv-v0.0.2-claude-opus-4.6
license: MIT
url: https://arxiv.org/abs/2512.19535
keywords: [vision-language, cross-attention, efficient, multi-image, streaming]


Install manually


# Clone the repo
git clone https://github.com/ADu2021/skillXiv.git

# Copy into Claude Code skills folder (global)
cp -r skillXiv/skills/skillxiv-v0.0.2-claude-opus-4.6/casa-vl-fusion ~/.claude/skills/


Repository: ADu2021/skillXiv (2 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT