skills/skillxiv-v0.0.2-claude-opus-4.6/casa-vl-fusion/SKILL.md
Replace token insertion for vision-language fusion with efficient cross-attention that keeps text self-attention separate. Lets text tokens attend to images within local windows, preserves gist tokens from prior images, and keeps memory costs near-constant for streaming video, which makes it more practical than direct token insertion for resource-constrained applications.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add ADu2021/skillXiv casa-vl-fusion
CASA revisits cross-attention (CA) as a practical alternative to direct token insertion for vision-language fusion. Token insertion becomes prohibitively expensive for high-resolution images and video, whereas CA fuses the two modalities efficiently when it is designed carefully. Five design differences restore CA's competitiveness.
The key insight is that cross-attention requires specific design choices to match or exceed token-insertion performance.
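To make the cost argument concrete, here is a rough sketch (not taken from CASA itself) that counts the query-key pairs scored per attention layer under each scheme. It assumes full, unmasked, single-head attention and ignores the local windows introduced below; the function names are hypothetical.

```python
def token_insertion_pairs(num_text: int, num_image: int) -> int:
    """Query-key pairs when image tokens are inserted into the text sequence
    and self-attention runs over the concatenated sequence."""
    seq_len = num_text + num_image
    return seq_len * seq_len


def cross_attention_pairs(num_text: int, num_image: int) -> int:
    """Query-key pairs when text keeps its own self-attention and only
    text queries attend to the image tokens."""
    return num_text * num_text + num_text * num_image


# Ratio of the two counts for a fixed text length and growing image size.
ratios = {
    n: token_insertion_pairs(128, n) / cross_attention_pairs(128, n)
    for n in (256, 1024, 4096)
}
```

Under these assumptions the token-insertion count grows quadratically in the number of image tokens, while the cross-attention count grows linearly, so the gap widens as resolution increases.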
Five Critical Design Differences:
# CASA architecture components
class CASAVisionLanguageModel:
    def __init__(self):
        # D1: Separate parameter layers for cross-attention
        self.text_self_attention = SelfAttentionLayer()
        self.cross_attention = CrossAttentionLayer()  # Not shared
        # D2: Joint text-image attention with local windows
        self.local_window_size = 128
        # D3: Fewer self-attention layers; cross-attention layers take their place
        self.num_self_attn = 16
        self.num_cross_attn = 8  # Replaces some self-attn
        # D4: Optional image token FFN updates
        self.image_ffn = FFNLayer()
        # D5: Visual history via gist tokens
        self.gist_tokens = None

    def compute_gist(self, image_features):
        # Delegate to the module-level compute_gist_tokens helper
        return compute_gist_tokens(image_features)

    def forward(self, text_tokens, image_features, prev_gist=None):
        """
        Process text and image with CASA design principles.
        """
        # Maintain text self-attention for robustness
        text_hidden = self.text_self_attention(text_tokens)
        # Joint attention: text attends to image + preceding text
        # within local windows for efficiency
        attended = self.cross_attention(
            query=text_hidden,
            key_value_image=image_features,
            key_value_text=text_hidden,
            window_size=self.local_window_size
        )
        # Optional: update image embeddings via FFN
        image_features = self.image_ffn(image_features)
        # D5: Compress current image into gist tokens for next round
        gist_tokens = self.compute_gist(image_features)
        return attended, gist_tokens
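The joint attention pattern in D2 can be illustrated with a small mask builder. This is a minimal sketch under stated assumptions rather than CASA's actual masking: it assumes keys are laid out as image tokens followed by text tokens, that every text query may see all image tokens, and that the text window is a fixed number of preceding positions (mirroring `local_window_size` above). `build_joint_mask` is a hypothetical name.

```python
def build_joint_mask(num_text: int, num_image: int, window_size: int):
    """Return mask[i][j], True where text query i may attend to key j.

    Assumed key layout: [image_0 .. image_{num_image-1}, text_0 .. text_{num_text-1}].
    """
    mask = []
    for i in range(num_text):
        # Assumption: every text query sees all tokens of the current image.
        image_part = [True] * num_image
        # Causal local window: the query itself plus the preceding
        # window_size - 1 text positions.
        lower = max(0, i - window_size + 1)
        text_part = [lower <= j <= i for j in range(num_text)]
        mask.append(image_part + text_part)
    return mask


# Example: 4 text tokens, 2 image tokens, window of 2 text positions.
example_mask = build_joint_mask(num_text=4, num_image=2, window_size=2)
```

Each row has at most `num_image + window_size` true entries, so the per-query cost stays bounded regardless of how long the text grows.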
Gist Tokens for Visual History: Preserve compressed representations of past images without growing memory.
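import torch

def compute_gist_tokens(image_features, num_gist=8):
    """
    Compress image features of shape [batch, height, width, hidden] into a
    small number of gist tokens representing essential visual information
    for future frames.
    """
    hidden_dim = image_features.shape[-1]
    # Average pooling over spatial dimensions
    spatial_mean = torch.mean(image_features, dim=(1, 2))  # [batch, hidden]
    # Project to gist token dimension. apply_projection, compute_importance,
    # and select_top_k are helpers not defined here; apply_projection is
    # assumed to return candidate tokens of shape [batch, num_candidates, hidden].
    gist = apply_projection(spatial_mean, output_dim=hidden_dim)
    # Take top-k tokens by importance score
    importance_scores = compute_importance(gist)
    gist_tokens = select_top_k(gist, importance_scores, k=num_gist)
    return gist_tokens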
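The top-k selection step can be sketched in plain Python. This is an illustration under assumptions, not the method's actual scoring: the L2 norm stands in for whatever importance score the model would learn, tokens are plain lists of floats, and `select_gist_tokens` is a hypothetical name.

```python
import math


def select_gist_tokens(tokens, num_gist=8):
    """Keep the num_gist tokens with the largest L2 norm, in original order.

    The L2 norm is only a placeholder for a learned importance score.
    """
    scores = [math.sqrt(sum(x * x for x in tok)) for tok in tokens]
    # Indices sorted from most to least important.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore the original ordering among the selected tokens.
    keep = sorted(ranked[:num_gist])
    return [tokens[i] for i in keep]


candidates = [[1.0, 0.0], [3.0, 4.0], [0.0, 0.5], [2.0, 2.0]]
gist = select_gist_tokens(candidates, num_gist=2)
```

Whatever the scoring function, the output size is fixed by `num_gist`, which is what keeps the per-image memory footprint independent of resolution.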
Streaming Efficiency with Near-Constant Memory: Unlike with token insertion, the KV cache scales with the number of gist tokens, not with image resolution.
def streaming_forward(model, text_query, new_frame, history_gist):
    """
    Process a new frame without storing all prior image tokens.
    Each frame adds O(gist_tokens) to memory, not O(image_resolution²).
    """
    # Current image gist
    gist_current = model.compute_gist(new_frame)
    # Append to the historical gists (each entry has a fixed, small size)
    gist_memory = history_gist + [gist_current]
    # Cross-attention over gists (efficient)
    output = model.cross_attention(
        query=text_query,
        key_value=gist_memory
    )
    # Memory complexity: O(num_frames * gist_tokens)
    # vs O(num_frames * image_resolution²) for token insertion
    return output, gist_memory
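If the gist history must stay strictly bounded for long streams, one option is to cap the number of retained frames. The sketch below is an assumption layered on top of the description above, not something the source specifies: it uses a fixed-length `collections.deque` so the oldest gists are dropped first, and `GistMemory` and its parameters are hypothetical names.

```python
from collections import deque


class GistMemory:
    """Fixed-capacity store of per-frame gist tokens (oldest frames evicted)."""

    def __init__(self, max_frames: int, gist_per_frame: int):
        self.gist_per_frame = gist_per_frame
        # deque with maxlen discards the oldest entry once capacity is reached.
        self._frames = deque(maxlen=max_frames)

    def add_frame(self, gist_tokens):
        if len(gist_tokens) != self.gist_per_frame:
            raise ValueError("unexpected number of gist tokens for a frame")
        self._frames.append(list(gist_tokens))

    def keys(self):
        """Flatten the retained gists into one key/value sequence."""
        return [tok for frame in self._frames for tok in frame]

    def __len__(self):
        return len(self._frames) * self.gist_per_frame


memory = GistMemory(max_frames=3, gist_per_frame=2)
for frame_idx in range(5):
    memory.add_frame([(frame_idx, 0), (frame_idx, 1)])
```

With this cap the key/value length never exceeds `max_frames * gist_per_frame`, at the price of forgetting frames older than the window.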
Use CASA when:
Avoid this approach if:
The framework requires: