Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

ADu2021/dope-denoising-rotary-embeddings

Name: dope-denoising-rotary-embeddings
Author: ADu2021

skills/skillxiv-v0.0.2-claude-opus-4.6/dope-denoising-rotary-embeddings/SKILL.md

npx skillsauth add ADu2021/skillXiv dope-denoising-rotary-embeddings

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Stabilize Long-Context Reasoning by Denoising Rotary Position Embeddings

RoPE (Rotary Position Embedding) enables models to extrapolate beyond training context, but its low-frequency components concentrate energy in narrow angular cones, creating over-aligned attention patterns that destabilize performance on long sequences. DoPE (Denoising Rotary Position Embedding) identifies and corrects these problematic attention heads using spectral analysis, improving extrapolation without fine-tuning.

The core insight is that certain heads amplify positional noise—they suffer from attention sinks where energy concentrates on specific tokens. By detecting these heads via entropy and reparameterizing their attention maps with isotropic noise, DoPE stabilizes extrapolation and improves needle-in-haystack and in-context learning tasks.

Core Concept

RoPE uses low-frequency sinusoidal encodings to maintain rotational equivariance across position shifts. However, these low frequencies create pathological activation patterns: they concentrate spectral energy, producing massive attention scores that collapse into sinks (single tokens grabbing all attention) rather than distributing appropriately.

DoPE solves this by:

Spectral Detection: Identify heads where RoPE induces concentrated energy using truncated matrix entropy on query/key representations
Head Selection: Sort by entropy and choose heads to denoise (typically 1-32 per layer)
Reparameterization: Replace problematic RoPE with isotropic Gaussian noise in selected heads, breaking the alignment pathology

This achieves long-context extrapolation gains while keeping the model fully frozen—denoising happens at inference without model retraining.

Architecture Overview

Spectral Analysis Layer: Compute truncated matrix entropy ℋ = 1/r Σᵢ₌₁ʳ λᵢ log λᵢ on RoPE-transformed representations to quantify concentration
Entropy Thresholding: Select heads with entropy below a calibrated threshold (indicating spectral amplification)
Three Denoising Strategies:
- DoPE-by-parts: Suppress low-frequency RoPE bands
- DoPE-by-all: Completely mask RoPE
- DoPE-by-Gaussian: Replace with isotropic Gaussian noise
Inference-time Application: Apply denoising only at inference; no training needed

Implementation Steps

Step 1: Entropy Computation. Compute singular values of query/key representations and calculate truncated entropy to identify problematic heads.

import numpy as np

def compute_truncated_entropy(representations, r=32):
    """
    Compute truncated matrix entropy for head identification.
    representations: (seq_len, hidden_dim) tensor of queries or keys
    r: number of singular values to use for entropy calculation
    """
    U, S, Vt = np.linalg.svd(representations, full_matrices=False)
    S_sorted = np.sort(S)[::-1]  # descending order
    S_trunc = S_sorted[:min(r, len(S_sorted))]
    S_norm = S_trunc / np.sum(S_trunc)  # normalize
    entropy = -np.sum(S_norm * np.log(S_norm + 1e-8))
    return entropy

Step 2: Head Selection. After profiling on matched-length data, identify heads with low entropy (concentration) and mark them for denoising.

def identify_problematic_heads(model, calibration_data, entropy_threshold=0.5):
    """
    Identify heads to denoise by computing entropy across attention heads.
    Returns list of (layer_idx, head_idx) tuples for heads with entropy < threshold.
    """
    problematic_heads = []
    for layer_idx, layer in enumerate(model.layers):
        entropies = []
        for head_idx in range(layer.num_heads):
            # Extract head representation during forward pass
            head_reps = get_head_representations(layer, calibration_data, head_idx)
            ent = compute_truncated_entropy(head_reps)
            entropies.append(ent)

        # Select bottom-k heads (lowest entropy = highest concentration)
        for head_idx, ent in enumerate(entropies):
            if ent < entropy_threshold:
                problematic_heads.append((layer_idx, head_idx))

    return problematic_heads

Step 3: Gaussian Reparameterization. At inference, replace RoPE in problematic heads with isotropic Gaussian noise.

def apply_dope_denoising(model, problematic_heads, strategy='gaussian'):
    """
    Apply denoising strategies to identified heads at inference time.
    strategy: 'gaussian' (replace with noise), 'mask-parts' (suppress low freq), 'mask-all' (remove RoPE)
    """
    for layer_idx, head_idx in problematic_heads:
        layer = model.layers[layer_idx]

        if strategy == 'gaussian':
            # Replace positional encoding with isotropic Gaussian noise
            layer.attention.heads[head_idx].use_rope = False
            layer.attention.heads[head_idx].use_gaussian_noise = True
        elif strategy == 'mask-parts':
            # Suppress low-frequency RoPE bands
            layer.attention.heads[head_idx].suppress_low_freq_rope = True
        elif strategy == 'mask-all':
            # Completely disable RoPE
            layer.attention.heads[head_idx].use_rope = False

    return model

Practical Guidance

When to Use: Long-context reasoning tasks (document QA, long-form summarization, many-shot ICL) where models need to reference distant context or integrate information across 10K+ token windows.

Hyperparameters:

Entropy threshold: calibrate on data of training length; typically 0.3–0.7
Number of heads to denoise: 1–32 per layer; start conservative (1–8) and increase if extrapolation degrades

Pitfalls:

Over-denoising (denoising too many heads) can discard useful positional information and hurt in-distribution performance
Entropy threshold must be calibrated on matched-length data; using mismatched data degrades selection accuracy
Does not solve fundamental context window limits; it improves extrapolation stability but not unbounded scaling

When NOT to Use: Short-context or fixed-length tasks where extrapolation is not a concern; denoising adds inference overhead without benefit.

Integration: Apply at inference after model loading; pairs well with ALiBi or other position-independent attention variants for complementary improvements.

Reference: https://arxiv.org/abs/2511.09146

ADu2021/dope-denoising-rotary-embeddings

skills/skillxiv-v0.0.2-claude-opus-4.6/dope-denoising-rotary-embeddings/SKILL.md

Improve long-context length extrapolation by denoising instabilities in Rotary Position Embeddings (RoPE) through spectral analysis and selective head rewriting—training-free post-hoc intervention for longer context windows.

2 stars

data-ai

Updated Apr 17, 2026

$ install --global

skillsauth

npx skillsauth add ADu2021/skillXiv dope-denoising-rotary-embeddings

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 17, 2026, 5:33 AM7.9s1 file scanned

SKILL.md

name:: dope-denoising-rotary-embeddings
title:: DoPE: Denoising Rotary Position Embedding
version:: 0.0.2
engine:: skillxiv-v0.0.2-claude-opus-4.6
license:: MIT
url:: https://arxiv.org/abs/2511.09146
keywords:: [Position Embeddings, RoPE, Long-context, Length Extrapolation, Attention Stability]
description:: Improve long-context length extrapolation by denoising instabilities in Rotary Position Embeddings (RoPE) through spectral analysis and selective head rewriting—training-free post-hoc intervention for longer context windows.

Stabilize Long-Context Reasoning by Denoising Rotary Position Embeddings

Core Concept

DoPE solves this by:

Spectral Detection: Identify heads where RoPE induces concentrated energy using truncated matrix entropy on query/key representations
Head Selection: Sort by entropy and choose heads to denoise (typically 1-32 per layer)
Reparameterization: Replace problematic RoPE with isotropic Gaussian noise in selected heads, breaking the alignment pathology

This achieves long-context extrapolation gains while keeping the model fully frozen—denoising happens at inference without model retraining.

Architecture Overview

Spectral Analysis Layer: Compute truncated matrix entropy ℋ = 1/r Σᵢ₌₁ʳ λᵢ log λᵢ on RoPE-transformed representations to quantify concentration
Entropy Thresholding: Select heads with entropy below a calibrated threshold (indicating spectral amplification)
Three Denoising Strategies:
- DoPE-by-parts: Suppress low-frequency RoPE bands
- DoPE-by-all: Completely mask RoPE
- DoPE-by-Gaussian: Replace with isotropic Gaussian noise
Inference-time Application: Apply denoising only at inference; no training needed

Implementation Steps

Step 1: Entropy Computation. Compute singular values of query/key representations and calculate truncated entropy to identify problematic heads.

import numpy as np

def compute_truncated_entropy(representations, r=32):
    """
    Compute truncated matrix entropy for head identification.
    representations: (seq_len, hidden_dim) tensor of queries or keys
    r: number of singular values to use for entropy calculation
    """
    U, S, Vt = np.linalg.svd(representations, full_matrices=False)
    S_sorted = np.sort(S)[::-1]  # descending order
    S_trunc = S_sorted[:min(r, len(S_sorted))]
    S_norm = S_trunc / np.sum(S_trunc)  # normalize
    entropy = -np.sum(S_norm * np.log(S_norm + 1e-8))
    return entropy

Step 2: Head Selection. After profiling on matched-length data, identify heads with low entropy (concentration) and mark them for denoising.

def identify_problematic_heads(model, calibration_data, entropy_threshold=0.5):
    """
    Identify heads to denoise by computing entropy across attention heads.
    Returns list of (layer_idx, head_idx) tuples for heads with entropy < threshold.
    """
    problematic_heads = []
    for layer_idx, layer in enumerate(model.layers):
        entropies = []
        for head_idx in range(layer.num_heads):
            # Extract head representation during forward pass
            head_reps = get_head_representations(layer, calibration_data, head_idx)
            ent = compute_truncated_entropy(head_reps)
            entropies.append(ent)

        # Select bottom-k heads (lowest entropy = highest concentration)
        for head_idx, ent in enumerate(entropies):
            if ent < entropy_threshold:
                problematic_heads.append((layer_idx, head_idx))

    return problematic_heads

Step 3: Gaussian Reparameterization. At inference, replace RoPE in problematic heads with isotropic Gaussian noise.

def apply_dope_denoising(model, problematic_heads, strategy='gaussian'):
    """
    Apply denoising strategies to identified heads at inference time.
    strategy: 'gaussian' (replace with noise), 'mask-parts' (suppress low freq), 'mask-all' (remove RoPE)
    """
    for layer_idx, head_idx in problematic_heads:
        layer = model.layers[layer_idx]

        if strategy == 'gaussian':
            # Replace positional encoding with isotropic Gaussian noise
            layer.attention.heads[head_idx].use_rope = False
            layer.attention.heads[head_idx].use_gaussian_noise = True
        elif strategy == 'mask-parts':
            # Suppress low-frequency RoPE bands
            layer.attention.heads[head_idx].suppress_low_freq_rope = True
        elif strategy == 'mask-all':
            # Completely disable RoPE
            layer.attention.heads[head_idx].use_rope = False

    return model

Practical Guidance

When to Use: Long-context reasoning tasks (document QA, long-form summarization, many-shot ICL) where models need to reference distant context or integrate information across 10K+ token windows.

Hyperparameters:

Entropy threshold: calibrate on data of training length; typically 0.3–0.7
Number of heads to denoise: 1–32 per layer; start conservative (1–8) and increase if extrapolation degrades

Pitfalls:

Over-denoising (denoising too many heads) can discard useful positional information and hurt in-distribution performance
Entropy threshold must be calibrated on matched-length data; using mismatched data degrades selection accuracy
Does not solve fundamental context window limits; it improves extrapolation stability but not unbounded scaling

When NOT to Use: Short-context or fixed-length tasks where extrapolation is not a concern; denoising adds inference overhead without benefit.

Integration: Apply at inference after model loading; pairs well with ALiBi or other position-independent attention variants for complementary improvements.

Reference: https://arxiv.org/abs/2511.09146

Related Skills

ADu2021/flow-map-trajectory-tilting

testing

VerifiedTrustedCommunity

Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flow-map-trajectory-tilting

ADu2021/flexible-data-mixture-of-experts

testing

VerifiedTrustedCommunity

Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flexible-data-mixture-of-experts

ADu2021/flexibility-trap-diffusion-reasoning

data-ai

VerifiedTrustedCommunity

Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flexibility-trap-diffusion-reasoning

ADu2021/flex-continuous-agent-evolution

devops

VerifiedTrustedCommunity

Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flex-continuous-agent-evolution

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/ADu2021/skillXiv.git

# Copy into Claude Code skills folder (global)
cp -r skillXiv/skills/skillxiv-v0.0.2-claude-opus-4.6/dope-denoising-rotary-embeddings ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

ADu2021/skillXiv

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT