skills/skillxiv-v0.0.2-claude-opus-4.6/dope-denoising-rotary-embeddings/SKILL.md
Improve long-context extrapolation by denoising instabilities in Rotary Position Embeddings (RoPE) through spectral analysis and selective head rewriting. A training-free, post-hoc intervention for longer context windows.
npx skillsauth add ADu2021/skillXiv dope-denoising-rotary-embeddings
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
RoPE (Rotary Position Embedding) encodes relative position by rotating query and key vectors, which in principle lets a model attend beyond its training context. In practice, its low-frequency components concentrate energy in narrow angular cones, creating over-aligned attention patterns that destabilize performance on long sequences. DoPE (Denoising Rotary Position Embedding) identifies and corrects the affected attention heads using spectral analysis, improving extrapolation without fine-tuning.
The core insight is that certain heads amplify positional noise: they develop attention sinks, where attention mass concentrates on a few specific tokens. By detecting these heads via truncated matrix entropy and reparameterizing their attention maps with isotropic Gaussian noise, DoPE stabilizes extrapolation and improves needle-in-a-haystack and in-context learning performance.
RoPE rotates each pair of query/key dimensions at its own frequency, so that attention scores depend on relative rather than absolute position. The low-frequency bands, however, create pathological activation patterns: they concentrate spectral energy, producing massive attention scores that collapse into sinks (single tokens grabbing nearly all attention) rather than distributing attention appropriately.
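To make "low-frequency bands" concrete, the sketch below computes the wavelength of each RoPE band and lists the bands that never complete a full rotation within a given context length. It assumes the common RoPE parameterization with base 10000 and per-band angle `base ** (-2 * i / head_dim)`; the function names, `head_dim=128`, and `context_len=4096` are illustrative choices, not values taken from the DoPE paper.

```python
import math

def rope_wavelengths(head_dim, base=10000.0):
    """Wavelength, in tokens, of each RoPE frequency band.

    Band i rotates by base ** (-2 * i / head_dim) radians per token,
    so one full turn takes 2 * pi divided by that angle.
    """
    return [2 * math.pi / base ** (-2 * i / head_dim)
            for i in range(head_dim // 2)]

def bands_without_full_turn(head_dim, context_len, base=10000.0):
    """Indices of bands whose wavelength exceeds context_len tokens."""
    return [i for i, w in enumerate(rope_wavelengths(head_dim, base))
            if w > context_len]

# Illustrative values only (assumptions, not from the source).
wavelengths = rope_wavelengths(head_dim=128)
slow = bands_without_full_turn(head_dim=128, context_len=4096)

print(f"shortest wavelength: {wavelengths[0]:.2f} tokens")
print(f"longest wavelength:  {wavelengths[-1]:.0f} tokens")
print(f"{len(slow)} of {len(wavelengths)} bands never complete a turn")
```

The highest-index bands are the slowest ones. Within the training window they only sweep part of a circle, which is one way to picture the narrow angular cones described above.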
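DoPE solves this in three steps:
1. Computing a truncated matrix entropy for each head's query/key representations.
2. Selecting the low-entropy heads as candidates for denoising.
3. Reparameterizing those heads at inference time, for example by replacing RoPE with isotropic Gaussian noise.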
This achieves long-context extrapolation gains while keeping the model fully frozen: denoising happens at inference time, with no retraining.
Step 1: Entropy Computation. Compute singular values of query/key representations and calculate truncated entropy to identify problematic heads.
import numpy as np

def compute_truncated_entropy(representations, r=32):
    """
    Compute truncated matrix entropy for head identification.
    representations: (seq_len, hidden_dim) array of queries or keys
    r: number of singular values to use for entropy calculation
    """
    # Only the singular values are needed, so skip computing U and Vt
    S = np.linalg.svd(representations, compute_uv=False)
    S_sorted = np.sort(S)[::-1]  # descending order
    S_trunc = S_sorted[:min(r, len(S_sorted))]
    S_norm = S_trunc / np.sum(S_trunc)  # normalize to a distribution
    entropy = -np.sum(S_norm * np.log(S_norm + 1e-8))
    return entropy
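As a rough illustration of what this entropy measures, the sketch below applies the same formula directly to a list of singular values, without NumPy. The helper name and the two example spectra are made up for illustration: a flat spectrum gives an entropy near `log(r)`, while a spectrum dominated by one singular value gives an entropy near zero, which is the kind of head Step 2 is meant to flag.

```python
import math

def truncated_entropy_from_singular_values(singular_values, r=32, eps=1e-8):
    """Entropy of the top-r singular values, normalized to sum to 1."""
    top = sorted(singular_values, reverse=True)[:r]
    total = sum(top)
    probs = [s / total for s in top]
    return -sum(p * math.log(p + eps) for p in probs)

# Hypothetical spectra: energy spread evenly vs. concentrated in one direction.
flat_spectrum = [1.0] * 32
spiked_spectrum = [100.0] + [0.01] * 31

flat_entropy = truncated_entropy_from_singular_values(flat_spectrum)
spiked_entropy = truncated_entropy_from_singular_values(spiked_spectrum)

print(f"flat:   {flat_entropy:.4f}")
print(f"spiked: {spiked_entropy:.4f}")
```

With `r=32` the entropy is bounded above by `log(32)`, so any threshold should be read relative to that ceiling.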
Step 2: Head Selection. After profiling on matched-length data, identify heads with low entropy (concentration) and mark them for denoising.
def identify_problematic_heads(model, calibration_data, entropy_threshold=0.5):
    """
    Identify heads to denoise by computing entropy across attention heads.
    Returns list of (layer_idx, head_idx) tuples for heads with entropy < threshold.

    Assumes `model.layers`, `layer.num_heads`, and a user-supplied
    `get_head_representations` helper; none of these are defined here.
    """
    problematic_heads = []
    for layer_idx, layer in enumerate(model.layers):
        entropies = []
        for head_idx in range(layer.num_heads):
            # Extract this head's query/key matrix from a forward pass
            head_reps = get_head_representations(layer, calibration_data, head_idx)
            ent = compute_truncated_entropy(head_reps)
            entropies.append(ent)
        # Flag heads below the threshold (lowest entropy = highest concentration)
        for head_idx, ent in enumerate(entropies):
            if ent < entropy_threshold:
                problematic_heads.append((layer_idx, head_idx))
    return problematic_heads
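A fixed threshold can be hard to choose across models, so one alternative is to rank heads by entropy and keep the k lowest. The sketch below is a hypothetical variant, not part of the source: it assumes the per-head entropies have already been collected into a dict keyed by `(layer_idx, head_idx)`, and the example values are invented.

```python
def select_bottom_k_heads(entropies, k):
    """Return the k heads with the lowest entropy.

    entropies: dict mapping (layer_idx, head_idx) -> entropy value
    k: number of heads to select
    """
    ranked = sorted(entropies.items(), key=lambda item: item[1])
    return [head for head, _ in ranked[:k]]

# Hypothetical entropies for a 2-layer, 2-head model.
example_entropies = {
    (0, 0): 2.9,
    (0, 1): 0.3,
    (1, 0): 1.1,
    (1, 1): 3.2,
}

selected = select_bottom_k_heads(example_entropies, k=2)
print(selected)
```

The returned list has the same `(layer_idx, head_idx)` shape as the output of `identify_problematic_heads`, so either selection rule can feed Step 3.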
Step 3: Gaussian Reparameterization. At inference, replace RoPE in problematic heads with isotropic Gaussian noise.
def apply_dope_denoising(model, problematic_heads, strategy='gaussian'):
    """
    Apply denoising strategies to identified heads at inference time.
    strategy: 'gaussian' (replace with noise), 'mask-parts' (suppress low freq), 'mask-all' (remove RoPE)

    The per-head flags set below are placeholders: the attention module must
    read them in its forward pass for them to have any effect.
    """
    for layer_idx, head_idx in problematic_heads:
        head = model.layers[layer_idx].attention.heads[head_idx]
        if strategy == 'gaussian':
            # Replace positional encoding with isotropic Gaussian noise
            head.use_rope = False
            head.use_gaussian_noise = True
        elif strategy == 'mask-parts':
            # Suppress low-frequency RoPE bands
            head.suppress_low_freq_rope = True
        elif strategy == 'mask-all':
            # Completely disable RoPE
            head.use_rope = False
        else:
            # Fail loudly instead of silently leaving the head unchanged
            raise ValueError(f"Unknown strategy: {strategy!r}")
    return model
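The flags above only record a decision; the rotation itself has to honor them. As one possible reading of the 'mask-parts' strategy, the sketch below applies RoPE to a single vector while leaving selected bands unrotated. It is an assumption-laden illustration, not the paper's implementation: the interleaved pairing of dimensions, the base of 10000, the `skip_bands` parameter, and the cutoff in the example are all hypothetical.

```python
import math

def apply_rope(vec, position, base=10000.0, skip_bands=frozenset()):
    """Rotate pairs (vec[2i], vec[2i+1]) by position * theta_i.

    theta_i = base ** (-2 * i / len(vec)). Bands listed in skip_bands are
    left unrotated, which removes their positional signal.
    """
    head_dim = len(vec)
    out = list(vec)
    for i in range(head_dim // 2):
        if i in skip_bands:
            continue  # masked band: keep the original values
        angle = position * base ** (-2 * i / head_dim)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out

# Hypothetical example: 8-dim head, mask the two lowest-frequency bands.
query = [1.0, 0.0, 0.5, 0.5, 0.0, 1.0, 0.3, -0.7]
low_freq_bands = frozenset({2, 3})  # higher index = lower frequency

full = apply_rope(query, position=5000)
masked = apply_rope(query, position=5000, skip_bands=low_freq_bands)
print(masked[4:])  # unrotated dimensions match the input
```

Masking every band reduces this to the 'mask-all' case, where the head sees no positional rotation at all.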
When to Use: Long-context reasoning tasks (document QA, long-form summarization, many-shot ICL) where models need to reference distant context or integrate information across 10K+ token windows.
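Hyperparameters:
- `r` (default 32 in the code above): number of leading singular values used in the truncated entropy.
- `entropy_threshold` (default 0.5 in the code above): heads with entropy below this value are marked for denoising.
- `strategy` (default 'gaussian'): one of 'gaussian', 'mask-parts', or 'mask-all'.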
Pitfalls:
When NOT to Use: Short-context or fixed-length tasks where extrapolation is not a concern; denoising adds inference overhead without benefit.
Integration: Apply at inference time, after loading the model; pairs well with ALiBi or similar length-extrapolation attention variants for complementary improvements.
Reference: https://arxiv.org/abs/2511.09146