skills/c2rope-causal-continuous-rotary-positional/SKILL.md
Implement C²RoPE (Causal Continuous Rotary Positional Encoding) for multimodal transformers that process 2D/3D visual data alongside text. Replaces standard 1D RoPE with a triplet (m, x, y) positional index and Chebyshev causal masking to preserve spatial locality in vision-language models.

Trigger phrases:
- "implement C2RoPE positional encoding"
- "fix spatial locality loss in vision-language RoPE"
- "add 2D-aware rotary embeddings for image tokens"
- "implement Chebyshev causal masking for visual attention"
- "modify RoPE for multimodal 3D reasoning"
- "spatially-aware positional encoding for multi-view images"
npx skillsauth add ndpvt-web/arxiv-claude-skills c2rope-causal-continuous-rotary-positional

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to implement C²RoPE, a drop-in replacement for standard RoPE in vision-language models that fixes two fundamental problems: (1) spatial locality loss caused by flattening 2D image patches into 1D sequences, and (2) long-term attention decay that causes models to neglect earlier visual tokens as sequence length grows. C²RoPE achieves this by constructing a triplet positional index (temporal, x, y) with a frequency allocation strategy, and introducing Chebyshev distance-based causal masking for visual self-attention.
The core problem: Standard RoPE assigns each token a single integer position index m = 0, 1, 2, ..., then computes rotation matrices R(m) that cause attention to decay with distance |m_q - m_k|. When a 24x24 image grid is flattened row-by-row, two vertically adjacent patches (e.g., positions 0 and 24) receive distant indices, breaking spatial continuity along the column dimension. Additionally, image tokens placed early in the sequence receive progressively less attention as text tokens extend the sequence -- a temporal decay bias inherited from language modeling that is inappropriate for visual content.
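The index gap described above can be checked with a few lines of arithmetic. This is only an illustrative sketch; `flat_index` is a hypothetical helper, not part of any RoPE implementation.

```python
def flat_index(row, col, width):
    """Row-major index of a patch after flattening a 2D grid to 1D."""
    return row * width + col


W = 24  # 24x24 patch grid, as in the text

# Horizontally adjacent patches: (0, 0) and (0, 1)
horizontal_gap = abs(flat_index(0, 1, W) - flat_index(0, 0, W))
# Vertically adjacent patches: (0, 0) and (1, 0)
vertical_gap = abs(flat_index(1, 0, W) - flat_index(0, 0, W))

print(horizontal_gap)  # 1
print(vertical_gap)    # 24
```

Both pairs are one patch apart in the image, yet standard RoPE sees the vertical pair as 24 positions apart.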
C²RoPE's solution has two components. First, it replaces the scalar position index m with a triplet (m, x, y) where m is the original temporal index and (x, y) are Cartesian coordinates centered at the image. For a sqrt(v) x sqrt(v) image patch grid, coordinates range from (1 - sqrt(v)/2) to (sqrt(v)/2 - 1) along each axis. The embedding dimension d is then split: the first (d - d_spatial) dimensions encode temporal position m using standard RoPE frequencies theta_i = 10000^(-2(i-1)/d), while the remaining d_spatial dimensions interleave x and y coordinates. In the paper's configuration, d=128 uses 96 dimensions for temporal and 32 for spatial. The spatial dimensions use higher-frequency slots (lower dimension indices within their allocation) so they capture fine-grained spatial distinctions without disrupting the LLM's pretrained temporal position understanding in the lower frequencies.
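As a rough illustration of the split for d=128, the sketch below counts how many rotary frequency slots each component receives. It assumes one frequency per pair of dimensions (so d/2 slots in total), the 0-based formula theta_i = base^(-2i/d), and that the spatial allocation occupies the trailing slots with x and y alternating, as in the Example 1 code on this page; the paper's exact slot assignment may differ.

```python
d, d_temporal, d_spatial = 128, 96, 32
base = 10000.0

# One frequency per rotary pair: theta_i = base^(-2i/d), i = 0 .. d/2 - 1
theta = [base ** (-2 * i / d) for i in range(d // 2)]

temporal_slots = theta[: d_temporal // 2]  # pairs rotated by the temporal index m
spatial_slots = theta[d_temporal // 2:]    # pairs shared between x and y
x_slots = spatial_slots[0::2]              # even-indexed spatial pairs rotate by x
y_slots = spatial_slots[1::2]              # odd-indexed spatial pairs rotate by y

print(len(temporal_slots), len(x_slots), len(y_slots))  # 48 8 8
```

Under these assumptions, 48 of the 64 rotary pairs carry temporal position and the remaining 16 are split evenly between the two image axes.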
Second, Chebyshev Causal Masking replaces the standard causal (lower-triangular) attention mask for visual self-attention. Instead of enforcing that token i can only attend to tokens j <= i (which is arbitrary for 2D image patches), C²RoPE computes the Chebyshev distance from the image center for each token: d_cheb = max(|x|, |y|). Tokens at the same Chebyshev distance form a "ring" around the center and are treated as causally equivalent -- they can attend to each other freely. Tokens can attend to any token with equal or smaller Chebyshev distance (closer to center), but not to tokens with larger distance. This creates concentric square rings of causality emanating from the image center, which is a natural prior for visual processing where context flows from global (center) to peripheral regions.
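The ring structure is easiest to see on a small grid. The following pure-Python sketch uses a 4x4 grid and the centered coordinates x = col - (W-1)/2, y = (H-1)/2 - row from the instructions on this page; `centered_coords` and `chebyshev_mask` are hypothetical helper names chosen for illustration.

```python
def centered_coords(h, w):
    """(x, y) per patch in row-major order, with the origin at the grid center."""
    return [(col - (w - 1) / 2.0, (h - 1) / 2.0 - row)
            for row in range(h) for col in range(w)]


def chebyshev_mask(coords):
    """Return per-token Chebyshev distances and the mask[i][j] = d[j] <= d[i]."""
    d = [max(abs(x), abs(y)) for x, y in coords]
    n = len(d)
    mask = [[d[j] <= d[i] for j in range(n)] for i in range(n)]
    return d, mask


coords = centered_coords(4, 4)
d_cheb, mask = chebyshev_mask(coords)

print(sorted(set(d_cheb)))  # [0.5, 1.5] -> an inner ring and an outer ring
print(sum(mask[5]))         # 4: an inner-ring token sees only the inner ring
print(sum(mask[0]))         # 16: a corner token sees every patch
```

Token 5 (row 1, column 1) sits in the inner ring, so it attends to the four inner patches only, while the corner token 0 attends to the whole grid.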
1. Identify the existing RoPE implementation in your model codebase. In HuggingFace Transformers-based models, this is typically in modeling_llama.py or equivalent, in the LlamaRotaryEmbedding class and the rotate_half / apply_rotary_pos_emb functions. Note the per-head embedding dimension d (typically 128).
2. Define the triplet position index builder. For each image in the input, compute the spatial grid dimensions (e.g., 24x24 = 576 patches). Assign each patch coordinates (x, y) centered on the image: x = col - (W-1)/2, y = (H-1)/2 - row, where (row, col) is the patch's grid position. Retain the original temporal index m from the sequence position.
3. Implement the frequency allocation split. Partition the RoPE dimension d into temporal dimensions d_t and spatial dimensions d_s (the paper uses d_t=96, d_s=32 for d=128). For temporal dimensions, compute frequencies as standard: theta_i = base^(-2i/d) for i in 0..d_t/2 - 1. For spatial dimensions, interleave x and y: x takes the even-indexed (0-based) spatial slots and y the odd-indexed ones, using the frequencies at the corresponding dimension indices.
4. Build the extended rotation angles. For each token position, construct the rotation angles as:

       # Temporal component (first d_t dimensions)
       angles_t = m * theta[0:d_t//2] # shape: (seq_len, d_t//2)
       # Spatial component (last d_s dimensions, interleaved x/y)
       angles_x = x * theta[d_t//2::2] # even slots of spatial portion
       angles_y = y * theta[d_t//2+1::2] # odd slots of spatial portion
       angles_s = interleave(angles_x, angles_y) # shape: (seq_len, d_s//2)
       angles = concat(angles_t, angles_s) # shape: (seq_len, d//2)

5. Apply the rotation to queries and keys using the standard RoPE mechanism (cos/sin rotation), but now with the extended angle tensor from step 4. For text tokens, set x=0 and y=0 so the spatial slots apply no rotation and the temporal slots behave exactly as in standard RoPE.
6. Implement Chebyshev Causal Masking. For each image's visual tokens, compute d_cheb[i] = max(|x_i|, |y_i|) for every token i. Build the attention mask where mask[i][j] = 1 if d_cheb[j] <= d_cheb[i], else 0. Tokens at equal Chebyshev distance can attend to each other (both directions). This mask only applies to visual-to-visual attention; text-to-text and cross-modal attention retain standard causal masking.
7. Handle multi-image inputs. When multiple images appear in one sequence (e.g., multi-view 3D), reset the (x, y) coordinate system for each image independently. The temporal index m continues incrementing across images. Each image gets its own Chebyshev mask block; cross-image attention uses standard causal masking.
8. Integrate with the forward pass. Modify the model's attention module to accept the extended position IDs (shape: batch x seq_len x 3 instead of batch x seq_len) and route them through your modified RoPE. Inject the Chebyshev mask into the attention score computation alongside any existing masks.
9. Validate with a diagnostic test. Feed an image with a known spatial pattern (e.g., a checkerboard) and verify that (a) attention weights between vertically adjacent patches are comparable to those between horizontally adjacent ones (spatial continuity restored), and (b) center patches receive attention from peripheral ones but not vice versa (Chebyshev causality).
10. Fine-tune or evaluate. The modification is designed to be compatible with pretrained LLM weights, since text tokens keep standard RoPE behavior on the temporal dimensions. Image encoder weights may benefit from a short fine-tuning phase to adapt to the new positional structure.
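The masking rules for multi-image inputs (Chebyshev within one image, standard causal everywhere else) can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `build_mask` and the `image_ids` convention (-1 for text tokens, a non-negative id per image) are hypothetical, and a real model would build the same mask as a batched tensor.

```python
def build_mask(d_cheb, image_ids):
    """mask[i][j] is True when token i may attend to token j.

    d_cheb:    per-token Chebyshev distance (any value for text tokens)
    image_ids: per-token image id, -1 for text tokens (assumed convention)
    """
    n = len(d_cheb)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            same_image = image_ids[i] >= 0 and image_ids[i] == image_ids[j]
            if same_image:
                # Within one image: attend toward the center (equal or smaller ring)
                mask[i][j] = d_cheb[j] <= d_cheb[i]
            else:
                # Text and cross-image pairs: standard causal masking
                mask[i][j] = j <= i
    return mask


# Toy sequence: text, two tokens of image 0, two tokens of image 1, text
image_ids = [-1, 0, 0, 1, 1, -1]
d_cheb = [0.0, 1.5, 0.5, 0.5, 1.5, 0.0]
mask = build_mask(d_cheb, image_ids)

print(mask[1][2])  # True: outer-ring token attends to a later inner-ring token
print(mask[2][1])  # False: inner-ring token cannot attend outward
print(mask[3][1])  # True: cross-image attention stays causal
print(mask[1][3])  # False: no attention to a later token in another image
```

Keying the Chebyshev rule on a shared image id is what keeps each image's mask block independent.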
Example 1: Adding C²RoPE to a LLaVA-style model
User: "I'm building a LLaVA variant for 3D scene QA. Images get flattened to 576 tokens each but the model struggles with spatial questions like 'what is behind the chair'. Help me implement C²RoPE."
Approach:
LlamaAttention.forward)

    import torch
    import torch.nn as nn


    def build_c2rope_position_ids(input_ids, image_token_id, patch_grid_h=24, patch_grid_w=24):
        """Build triplet (m, x, y) position IDs for C²RoPE."""
        batch_size, seq_len = input_ids.shape
        device = input_ids.device
        # Default: temporal-only positions; x = y = 0 for text tokens
        pos_ids = torch.zeros(batch_size, seq_len, 3, dtype=torch.float, device=device)
        pos_ids[:, :, 0] = torch.arange(seq_len, device=device).unsqueeze(0)  # temporal m
        for b in range(batch_size):
            img_mask = (input_ids[b] == image_token_id)
            img_starts = torch.where(img_mask)[0]  # sequence positions of all image tokens
            if len(img_starts) == 0:
                continue
            # Group consecutive image tokens into images.
            # Assumes images are separated by at least one non-image token.
            splits = torch.where(torch.diff(img_starts) > 1)[0] + 1
            image_groups = torch.tensor_split(img_starts, splits.tolist())
            for group in image_groups:
                n_patches = len(group)
                h, w = patch_grid_h, patch_grid_w
                assert n_patches == h * w, "image token count must match the patch grid"
                for idx, seq_pos in enumerate(group):
                    row, col = idx // w, idx % w
                    x = col - (w - 1) / 2.0
                    y = (h - 1) / 2.0 - row
                    pos_ids[b, seq_pos, 1] = x  # spatial x
                    pos_ids[b, seq_pos, 2] = y  # spatial y
        return pos_ids


    class C2RotaryEmbedding(nn.Module):
        def __init__(self, dim, base=10000, d_temporal=96, d_spatial=32):
            super().__init__()
            assert d_temporal + d_spatial == dim
            # x and y must receive the same number of frequency slots
            assert (d_spatial // 2) % 2 == 0
            self.d_temporal = d_temporal
            self.d_spatial = d_spatial
            # Standard RoPE frequencies for all dim//2 slots
            inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
            self.register_buffer("inv_freq", inv_freq)

        def forward(self, position_ids):
            """position_ids: (batch, seq_len, 3) -> (m, x, y)"""
            m = position_ids[..., 0]  # (batch, seq_len)
            x = position_ids[..., 1]
            y = position_ids[..., 2]
            # Temporal angles: first d_temporal//2 frequency slots
            freq_t = self.inv_freq[:self.d_temporal // 2]
            angles_t = m.unsqueeze(-1) * freq_t  # (batch, seq, d_t//2)
            # Spatial angles: last d_spatial//2 frequency slots, interleaved
            freq_s = self.inv_freq[self.d_temporal // 2:]
            angles_x = x.unsqueeze(-1) * freq_s[0::2]  # even spatial slots
            angles_y = y.unsqueeze(-1) * freq_s[1::2]  # odd spatial slots
            # Interleave x and y: (x0, y0, x1, y1, ...)
            angles_s = torch.stack([angles_x, angles_y], dim=-1).flatten(-2)
            angles = torch.cat([angles_t, angles_s], dim=-1)  # (batch, seq, dim//2)
            # Note: cos/sin have width dim//2. HF-style apply_rotary_pos_emb expects
            # width dim, i.e. torch.cat((angles, angles), dim=-1) before cos/sin.
            cos = angles.cos()
            sin = angles.sin()
            return cos, sin


    def build_chebyshev_mask(position_ids, image_token_mask):
        """Chebyshev causal mask for visual self-attention.

        image_token_mask: bool tensor (batch, seq_len), True for image tokens.
        Note: this treats all visual tokens as one image. With several images
        per sequence, restrict both_visual to token pairs from the same image.
        """
        x = position_ids[..., 1]  # (batch, seq_len)
        y = position_ids[..., 2]
        d_cheb = torch.max(x.abs(), y.abs())  # (batch, seq_len)
        # For visual tokens: allow attention to tokens with <= Chebyshev distance
        # d_cheb[i] >= d_cheb[j] means token i can attend to token j
        vis_mask = d_cheb.unsqueeze(-1) >= d_cheb.unsqueeze(-2)  # (batch, seq, seq)
        # Standard causal mask for non-visual tokens
        seq_len = position_ids.shape[1]
        causal = torch.tril(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=position_ids.device)
        )
        # Combine: use Chebyshev where both tokens are visual, causal elsewhere
        both_visual = image_token_mask.unsqueeze(-1) & image_token_mask.unsqueeze(-2)
        mask = torch.where(both_visual, vis_mask, causal.unsqueeze(0))
        return mask

Output: The model now assigns spatially-aware positional encodings to image patches and uses center-outward causal masking for visual attention, improving spatial reasoning on 3D scene QA benchmarks.
Example 2: Diagnosing and fixing attention decay on early image tokens
User: "My vision-language model ignores details from the first image when given 16 multi-view images. Attention visualization shows the first few images get almost no attention from later tokens."
Approach:
Output: After applying C²RoPE, attention maps show that early images receive comparable attention weight to later images for visual self-attention, and spatial questions ("what is next to the sofa in view 1?") improve significantly.
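One way to put a number on the neglect described in this example is to sum the attention weight that a query token places on each image's span of tokens. The helper below is a hypothetical diagnostic, not part of C²RoPE; it assumes you have already extracted one row of softmax attention weights and know each image's (start, end) token span. The weights shown are made-up illustrative values.

```python
def attention_mass_per_image(attn_row, image_spans):
    """Sum the attention weight that falls on each image's token span.

    attn_row:    attention weights of one query token over the sequence
    image_spans: list of (start, end) index pairs, end exclusive
    """
    return [sum(attn_row[start:end]) for start, end in image_spans]


# Toy row: two 3-token images followed by two text tokens (illustrative values)
attn_row = [0.01, 0.01, 0.02, 0.20, 0.20, 0.20, 0.18, 0.18]
image_spans = [(0, 3), (3, 6)]

mass = attention_mass_per_image(attn_row, image_spans)
print([round(m, 2) for m in mass])  # [0.04, 0.6]
```

Comparing these per-image totals before and after the change gives a simple check on whether early images are still being ignored.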
Example 3: Minimal patch -- adding spatial continuity without Chebyshev masking
User: "I want the spatial continuity fix but don't want to change the attention mask. Can I use just the triplet positional encoding part?"
Approach:

    # Minimal change: just replace position_ids construction
    # In your model's prepare_inputs method:
    position_ids = build_c2rope_position_ids(input_ids, IMAGE_TOKEN_ID)
    # c2rope_embed is assumed to be a C2RotaryEmbedding instance (see Example 1)
    cos, sin = c2rope_embed(position_ids)
    # Apply cos, sin to Q, K as usual -- no mask changes needed

Output: Spatial reasoning improves partially (column-adjacency artifacts eliminated), but long-range visual token neglect persists without Chebyshev masking.
Paper: C²RoPE: Causal Continuous Rotary Positional Encoding for 3D Large Multimodal-Models Reasoning (ICRA 2026)

Code: github.com/ErikZ719/C2RoPE

Key sections to study:
- Section 3 (Method) for the triplet index construction and frequency allocation formulas
- Section 3.3 for the Chebyshev Causal Masking derivation
- Table 1 for benchmark comparisons showing +4.3 EM@1 on ScanQA and +1.2 on SQA3D over the LLaVA-3D baseline