skills/skillxiv-v0.0.2-claude-opus-4.6/avey-recurrence-free-ranker/SKILL.md
Avey architecture pairs a ranker with autoregressive processor to select relevant tokens, decoupling context window from sequence length for efficient long-range processing.
npx skillsauth add ADu2021/skillXiv avey-recurrence-free-rankerInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Avey is a recurrence-free and attention-free architecture that decouples sequence length from context width, enabling processing of arbitrarily long sequences. It pairs a ranker module that identifies top-k relevant token splits with a neural processor that contextualizes only selected tokens. This weighted-selective-split interaction mechanism preserves semantic information regardless of sequence position while avoiding the quadratic scaling of full attention.
Create a module that identifies top-k relevant token splits using MaxSim scoring:
import torch
import torch.nn as nn
class RankerModule(nn.Module):
"""
Identifies top-k relevant token splits from full sequence.
Uses MaxSim scoring: max similarity between query and candidate splits.
"""
def __init__(self, embedding_dim, split_size=16):
super().__init__()
self.split_size = split_size
self.embedding_dim = embedding_dim
# Query projection for relevance computation
self.query_proj = nn.Linear(embedding_dim, embedding_dim)
def forward(self, embeddings, k=32):
"""
Args:
embeddings: [batch_size, seq_len, embedding_dim]
k: number of top splits to select
Returns:
selected_splits: [batch_size, k, split_size, embedding_dim]
relevance_scores: [batch_size, k] normalized weights
"""
batch_size, seq_len, dim = embeddings.shape
# Partition sequence into splits
num_splits = (seq_len + self.split_size - 1) // self.split_size
padded_len = num_splits * self.split_size
# Pad if necessary
if padded_len > seq_len:
padding = torch.zeros(
batch_size, padded_len - seq_len, dim,
device=embeddings.device
)
padded_embeddings = torch.cat([embeddings, padding], dim=1)
else:
padded_embeddings = embeddings
# Reshape into splits
splits = padded_embeddings.reshape(
batch_size, num_splits, self.split_size, dim
)
# Compute split-level representations (average pooling)
split_reps = splits.mean(dim=2) # [batch, num_splits, dim]
# Compute query (from first token or learnable)
query = self.query_proj(embeddings[:, 0]) # [batch, dim]
query = query.unsqueeze(1) # [batch, 1, dim]
# MaxSim: max similarity of query to each split
similarities = torch.matmul(query, split_reps.transpose(1, 2))
# [batch, 1, num_splits]
similarities = similarities.squeeze(1) # [batch, num_splits]
# Select top-k splits
top_k = min(k, num_splits)
top_scores, top_indices = torch.topk(similarities, top_k, dim=1)
# Normalize scores to weights
relevance_weights = torch.softmax(top_scores, dim=1)
# Gather selected splits
selected_splits = splits[
torch.arange(batch_size).unsqueeze(1),
top_indices
] # [batch, k, split_size, dim]
return selected_splits, relevance_weights, top_indices
Expand token embeddings via position-wise networks:
class EnricherUnit(nn.Module):
"""
Expands embeddings using position-wise feed-forward network.
Projects to higher dimension then back down.
"""
def __init__(self, embedding_dim, expansion_ratio=4):
super().__init__()
hidden_dim = int(embedding_dim * expansion_ratio)
self.feed_forward = nn.Sequential(
nn.Linear(embedding_dim, hidden_dim),
nn.GELU(),
nn.Linear(hidden_dim, embedding_dim)
)
self.norm = nn.LayerNorm(embedding_dim)
def forward(self, x):
"""
Args:
x: [batch, seq_len, embedding_dim] or [batch, k, split_size, embedding_dim]
"""
# Flatten and process
shape = x.shape
x_flat = x.reshape(-1, shape[-1])
enriched = self.feed_forward(x_flat)
output = self.norm(enriched + x_flat)
return output.reshape(shape)
Enable inter-embedding interactions within selected splits:
class ContextualizerUnit(nn.Module):
"""
Enables interactions between selected embeddings.
Uses self-attention or simplified interaction mechanism.
"""
def __init__(self, embedding_dim, num_heads=4):
super().__init__()
self.attention = nn.MultiheadAttention(
embedding_dim,
num_heads,
batch_first=True,
dropout=0.1
)
self.norm = nn.LayerNorm(embedding_dim)
def forward(self, selected_splits):
"""
Args:
selected_splits: [batch, k, split_size, embedding_dim]
"""
batch, k, split_size, dim = selected_splits.shape
# Flatten k and split_size into sequence dimension
x = selected_splits.reshape(batch, k * split_size, dim)
# Self-attention
attn_out, _ = self.attention(x, x, x)
# Residual + norm
contextualized = self.norm(attn_out + x)
return contextualized.reshape(batch, k, split_size, dim)
Combine contextualized and original features:
class FuserUnit(nn.Module):
"""
Combines contextualized features with bypassed original features.
Learns optimal blending weights.
"""
def __init__(self, embedding_dim):
super().__init__()
self.gate = nn.Sequential(
nn.Linear(embedding_dim * 2, embedding_dim),
nn.Sigmoid()
)
self.norm = nn.LayerNorm(embedding_dim)
def forward(self, contextualized, original_splits):
"""
Args:
contextualized: [batch, k, split_size, embedding_dim]
original_splits: [batch, k, split_size, embedding_dim]
"""
# Concatenate features
combined = torch.cat([contextualized, original_splits], dim=-1)
# Learn gating weights
shape = combined.shape[:-1]
combined_flat = combined.reshape(-1, combined.shape[-1])
gate = self.gate(combined_flat).reshape(*shape, -1)
# Blend: gate * contextualized + (1 - gate) * original
fused = gate * contextualized + (1 - gate) * original_splits
return self.norm(fused)
Combine modules into end-to-end architecture:
class AveyProcessor(nn.Module):
"""
Complete Avey architecture: Ranker + Processor (Enricher + Contextualizer + Fuser).
"""
def __init__(self, embedding_dim, split_size=16, k=32):
super().__init__()
self.ranker = RankerModule(embedding_dim, split_size)
self.enricher = EnricherUnit(embedding_dim)
self.contextualizer = ContextualizerUnit(embedding_dim)
self.fuser = FuserUnit(embedding_dim)
self.k = k
def forward(self, embeddings):
"""
Args:
embeddings: [batch, seq_len, embedding_dim]
Returns:
output: [batch, seq_len] processed representations
"""
# Step 1: Rank and select splits
selected_splits, relevance_weights, indices = self.ranker(
embeddings, k=self.k
)
# Step 2: Enrich selected embeddings
enriched = self.enricher(selected_splits)
# Step 3: Contextualize within splits
contextualized = self.contextualizer(enriched)
# Step 4: Fuse with original splits
fused = self.fuser(contextualized, selected_splits)
# Step 5: Weight by relevance and reconstruct
batch, k, split_size, dim = fused.shape
weighted_fused = fused * relevance_weights.reshape(batch, k, 1, 1)
# Aggregate back to sequence
output = weighted_fused.reshape(batch, -1, dim)
return output[:, :embeddings.shape[1]]
Paper: arXiv:2506.11305 Key metrics: 64k generalization from 512-token training, superior long-context vs. Mamba/RWKV Related work: Sparse attention, mixture-of-experts, efficient transformers, retrieval-augmented generation
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.