Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

ADu2021/attnrl-attention-compass-reasoning-exploration

Name: attnrl-attention-compass-reasoning-exploration
Author: ADu2021

skills/skillxiv-v0.0.2-claude-opus-4.6/attnrl-attention-compass-reasoning-exploration/SKILL.md

npx skillsauth add ADu2021/skillXiv attnrl-attention-compass-reasoning-exploration

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

AttnRL: Attention as a Compass for Efficient Reasoning Exploration

Current reinforcement learning approaches to reasoning in LLMs sample from a broad action space—every possible continuation at every step—resulting in high variance and wasted computation. Blind exploration treats all tokens equally, but reasoning models naturally develop attention patterns that highlight critical decision points. AttnRL exploits this signal.

The core observation is that attention weights correlate with reasoning behavior. When an LLM attends strongly to certain tokens during reasoning, those positions are likely decision points worth exploring. By branching exploration from high-attention regions, we concentrate computational effort on promising paths rather than sampling uniformly.

Core Concept

AttnRL uses a three-component approach:

Attention-guided branching: Identify tokens with high attention scores (above a percentile threshold, e.g., 75th)
Adaptive sampling: Adjust rollout counts based on problem difficulty and batch statistics
One-step off-policy updates: Use importance weighting to train from non-uniformly sampled trajectories

The result is a focused exploration strategy that reduces computational cost while improving convergence on math benchmarks like AIME and Olympiad-style problems.

Architecture Overview

Attention extractor: Collect attention patterns from intermediate transformer layers
Branching policy: Decide which tokens to explore from (threshold-based or learned)
Sampler: Generate multiple rollouts, concentrating samples near high-attention positions
Off-policy corrector: Re-weight samples to account for non-uniform exploration distribution
Reward signal: Process supervision (step-level correctness signals) or outcome supervision

Implementation Steps

First, extract and aggregate attention patterns from the model. High attention scores signal tokens worth exploring from:

def extract_attention_branches(model, prompt, attention_percentile=75):
    """
    Identify high-attention tokens for exploration branching.

    Args:
        model: Language model with attention hooks
        prompt: Input prompt to analyze
        attention_percentile: Threshold for high-attention identification

    Returns:
        branch_positions: List of (step, token_id) tuples for exploration
    """
    attention_weights = []  # Shape: (num_layers, num_heads, seq_len, seq_len)

    # Forward pass with attention capture
    with torch.no_grad():
        _ = model(prompt)
        attention_weights = model.get_attention_weights()

    # Average across heads and layers
    avg_attention = attention_weights.mean(dim=(0, 1))  # Shape: (seq_len, seq_len)

    # For each token position, find max attention score
    max_attention_per_token = avg_attention.max(dim=1).values

    # Identify high-attention positions
    threshold = torch.quantile(max_attention_per_token, attention_percentile / 100)
    branch_positions = torch.where(max_attention_per_token > threshold)[0].tolist()

    return branch_positions

Next, implement adaptive sampling that adjusts rollout counts based on problem difficulty:

def adaptive_sampler(prompt, model, branch_positions, base_rollouts=4, max_rollouts=16):
    """
    Allocate rollout budget based on problem difficulty and history.

    Args:
        prompt: Input problem
        model: Policy model
        branch_positions: Tokens to branch from
        base_rollouts: Minimum rollouts per prompt
        max_rollouts: Maximum allowed rollouts

    Returns:
        rollouts: List of (trajectory, logprob) pairs
        rollout_counts: Actual number per position
    """
    rollouts = []
    rollout_counts = {}

    for pos in branch_positions:
        # Estimate difficulty from prompt length and reasoning structure
        difficulty_score = estimate_difficulty(prompt)  # 0-1 scale

        # Allocate more rollouts to harder problems
        num_rollouts = int(base_rollouts + difficulty_score * (max_rollouts - base_rollouts))

        # Generate rollouts from this branching point
        for _ in range(num_rollouts):
            # Continue generation from position `pos`
            continuation = model.generate_from_position(prompt, pos)
            logprob = model.get_logprob(continuation, prompt)
            rollouts.append((continuation, logprob))

        rollout_counts[pos] = num_rollouts

    return rollouts, rollout_counts

Finally, implement off-policy correction to account for non-uniform sampling:

def compute_off_policy_loss(rollouts, rewards, proposal_logprobs,
                           uniform_baseline_logprobs, clip_ratio=0.2):
    """
    Off-policy PPO-style loss with importance weighting.

    Args:
        rollouts: List of (trajectory, proposal_logprob) tuples
        rewards: Corresponding rewards (from process supervision)
        proposal_logprobs: Log probabilities from branching policy
        uniform_baseline_logprobs: Log prob under uniform distribution (for weighting)
        clip_ratio: PPO clipping threshold

    Returns:
        loss: Scalar loss for backpropagation
    """
    # Importance weights: correct for non-uniform sampling
    importance_weights = torch.exp(proposal_logprobs - uniform_baseline_logprobs)
    importance_weights = torch.clamp(importance_weights, max=5.0)  # Prevent extreme weights

    # Normalize advantages
    advantages = rewards - rewards.mean()
    advantages = advantages / (advantages.std() + 1e-8)

    # PPO-style clipped loss with importance weighting
    ratio = importance_weights
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * advantages

    loss = -torch.min(surr1, surr2).mean()
    return loss

Practical Guidance

When to use AttnRL:

Reasoning tasks with clear intermediate steps (math, logic, code generation)
Tasks where you have process supervision (step-level correctness signals)
Computationally constrained settings where efficient exploration matters
Models showing clear attention patterns during reasoning

When NOT to use:

Tasks without clear intermediate steps (e.g., one-shot generation)
Outcome-supervised settings without process rewards
Models with poor attention pattern alignment (e.g., some distilled models)
Tasks where exploration must cover low-attention regions (creative tasks)

Hyperparameter tuning:

| Parameter | Default | Range | Impact | |-----------|---------|-------|--------| | attention_percentile | 75 | 50-90 | Higher = explore fewer branches | | base_rollouts | 4 | 2-8 | Minimum samples per branch | | max_rollouts | 16 | 8-32 | Cap for high-difficulty problems | | importance_weight_clip | 5.0 | 2.0-10.0 | Stability of off-policy correction | | clip_ratio | 0.2 | 0.1-0.3 | PPO trust region width |

Common pitfalls:

Stale attention patterns: Attention distributions change during training. Re-extract branching positions every N steps rather than once at initialization.
Over-fitting to attention: Attention is a heuristic; it's sometimes wrong. Always maintain a small uniform random sampling baseline (5-10% of budget) to explore low-attention regions.
Ignoring off-policy bias: Without importance weighting, the policy diverges from the uniform branching distribution. Always apply off-policy correction.
Process supervision quality: Poor step-level rewards lead to bad gradients regardless of branch quality. Validate process rewards on simple examples first.

Integration checklist:

[ ] Verify model produces interpretable attention patterns on your task
[ ] Collect or generate process supervision labels (step correctness) for training set
[ ] Set attention_percentile based on average branches per problem (target 2-5 branches)
[ ] Validate importance weights stay in range [0.2, 5.0] to detect distribution mismatch
[ ] Compare efficiency (samples per correct output) vs. baseline uniform PSRL

Reference: https://arxiv.org/abs/2509.26628

ADu2021/attnrl-attention-compass-reasoning-exploration

skills/skillxiv-v0.0.2-claude-opus-4.6/attnrl-attention-compass-reasoning-exploration/SKILL.md

Guide LLM exploration in reasoning tasks using attention patterns as navigation signals. This technique branches exploration from high-attention tokens (likely reasoning steps) and applies adaptive sampling to maintain effective gradients, significantly improving training efficiency for mathematical reasoning.

2 stars

testing

Updated Apr 16, 2026

$ install --global

skillsauth

npx skillsauth add ADu2021/skillXiv attnrl-attention-compass-reasoning-exploration

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 24, 2026, 8:55 PM1.8s1 file scanned

SKILL.md

name:: attnrl-attention-compass-reasoning-exploration
title:: Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models
version:: 0.0.2
engine:: skillxiv-v0.0.2-claude-opus-4.6
license:: MIT
url:: https://arxiv.org/abs/2509.26628
keywords:: [process-supervision, RL-reasoning, attention-mechanism, efficient-exploration, math-reasoning]
description:: Guide LLM exploration in reasoning tasks using attention patterns as navigation signals. This technique branches exploration from high-attention tokens (likely reasoning steps) and applies adaptive sampling to maintain effective gradients, significantly improving training efficiency for mathematical reasoning.

AttnRL: Attention as a Compass for Efficient Reasoning Exploration

Core Concept

AttnRL uses a three-component approach:

Attention-guided branching: Identify tokens with high attention scores (above a percentile threshold, e.g., 75th)
Adaptive sampling: Adjust rollout counts based on problem difficulty and batch statistics
One-step off-policy updates: Use importance weighting to train from non-uniformly sampled trajectories

The result is a focused exploration strategy that reduces computational cost while improving convergence on math benchmarks like AIME and Olympiad-style problems.

Architecture Overview

Attention extractor: Collect attention patterns from intermediate transformer layers
Branching policy: Decide which tokens to explore from (threshold-based or learned)
Sampler: Generate multiple rollouts, concentrating samples near high-attention positions
Off-policy corrector: Re-weight samples to account for non-uniform exploration distribution
Reward signal: Process supervision (step-level correctness signals) or outcome supervision

Implementation Steps

First, extract and aggregate attention patterns from the model. High attention scores signal tokens worth exploring from:

def extract_attention_branches(model, prompt, attention_percentile=75):
    """
    Identify high-attention tokens for exploration branching.

    Args:
        model: Language model with attention hooks
        prompt: Input prompt to analyze
        attention_percentile: Threshold for high-attention identification

    Returns:
        branch_positions: List of (step, token_id) tuples for exploration
    """
    attention_weights = []  # Shape: (num_layers, num_heads, seq_len, seq_len)

    # Forward pass with attention capture
    with torch.no_grad():
        _ = model(prompt)
        attention_weights = model.get_attention_weights()

    # Average across heads and layers
    avg_attention = attention_weights.mean(dim=(0, 1))  # Shape: (seq_len, seq_len)

    # For each token position, find max attention score
    max_attention_per_token = avg_attention.max(dim=1).values

    # Identify high-attention positions
    threshold = torch.quantile(max_attention_per_token, attention_percentile / 100)
    branch_positions = torch.where(max_attention_per_token > threshold)[0].tolist()

    return branch_positions

Next, implement adaptive sampling that adjusts rollout counts based on problem difficulty:

def adaptive_sampler(prompt, model, branch_positions, base_rollouts=4, max_rollouts=16):
    """
    Allocate rollout budget based on problem difficulty and history.

    Args:
        prompt: Input problem
        model: Policy model
        branch_positions: Tokens to branch from
        base_rollouts: Minimum rollouts per prompt
        max_rollouts: Maximum allowed rollouts

    Returns:
        rollouts: List of (trajectory, logprob) pairs
        rollout_counts: Actual number per position
    """
    rollouts = []
    rollout_counts = {}

    for pos in branch_positions:
        # Estimate difficulty from prompt length and reasoning structure
        difficulty_score = estimate_difficulty(prompt)  # 0-1 scale

        # Allocate more rollouts to harder problems
        num_rollouts = int(base_rollouts + difficulty_score * (max_rollouts - base_rollouts))

        # Generate rollouts from this branching point
        for _ in range(num_rollouts):
            # Continue generation from position `pos`
            continuation = model.generate_from_position(prompt, pos)
            logprob = model.get_logprob(continuation, prompt)
            rollouts.append((continuation, logprob))

        rollout_counts[pos] = num_rollouts

    return rollouts, rollout_counts

Finally, implement off-policy correction to account for non-uniform sampling:

def compute_off_policy_loss(rollouts, rewards, proposal_logprobs,
                           uniform_baseline_logprobs, clip_ratio=0.2):
    """
    Off-policy PPO-style loss with importance weighting.

    Args:
        rollouts: List of (trajectory, proposal_logprob) tuples
        rewards: Corresponding rewards (from process supervision)
        proposal_logprobs: Log probabilities from branching policy
        uniform_baseline_logprobs: Log prob under uniform distribution (for weighting)
        clip_ratio: PPO clipping threshold

    Returns:
        loss: Scalar loss for backpropagation
    """
    # Importance weights: correct for non-uniform sampling
    importance_weights = torch.exp(proposal_logprobs - uniform_baseline_logprobs)
    importance_weights = torch.clamp(importance_weights, max=5.0)  # Prevent extreme weights

    # Normalize advantages
    advantages = rewards - rewards.mean()
    advantages = advantages / (advantages.std() + 1e-8)

    # PPO-style clipped loss with importance weighting
    ratio = importance_weights
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * advantages

    loss = -torch.min(surr1, surr2).mean()
    return loss

Practical Guidance

When to use AttnRL:

Reasoning tasks with clear intermediate steps (math, logic, code generation)
Tasks where you have process supervision (step-level correctness signals)
Computationally constrained settings where efficient exploration matters
Models showing clear attention patterns during reasoning

When NOT to use:

Tasks without clear intermediate steps (e.g., one-shot generation)
Outcome-supervised settings without process rewards
Models with poor attention pattern alignment (e.g., some distilled models)
Tasks where exploration must cover low-attention regions (creative tasks)

Hyperparameter tuning:

Common pitfalls:

Stale attention patterns: Attention distributions change during training. Re-extract branching positions every N steps rather than once at initialization.
Over-fitting to attention: Attention is a heuristic; it's sometimes wrong. Always maintain a small uniform random sampling baseline (5-10% of budget) to explore low-attention regions.
Ignoring off-policy bias: Without importance weighting, the policy diverges from the uniform branching distribution. Always apply off-policy correction.
Process supervision quality: Poor step-level rewards lead to bad gradients regardless of branch quality. Validate process rewards on simple examples first.

Integration checklist:

[ ] Verify model produces interpretable attention patterns on your task
[ ] Collect or generate process supervision labels (step correctness) for training set
[ ] Set attention_percentile based on average branches per problem (target 2-5 branches)
[ ] Validate importance weights stay in range [0.2, 5.0] to detect distribution mismatch
[ ] Compare efficiency (samples per correct output) vs. baseline uniform PSRL

Reference: https://arxiv.org/abs/2509.26628

Related Skills

ADu2021/flow-map-trajectory-tilting

testing

VerifiedTrustedCommunity

Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flow-map-trajectory-tilting

ADu2021/flexible-data-mixture-of-experts

testing

VerifiedTrustedCommunity

Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flexible-data-mixture-of-experts

ADu2021/flexibility-trap-diffusion-reasoning

data-ai

VerifiedTrustedCommunity

Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flexibility-trap-diffusion-reasoning

ADu2021/flex-continuous-agent-evolution

devops

VerifiedTrustedCommunity

Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flex-continuous-agent-evolution

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/ADu2021/skillXiv.git

# Copy into Claude Code skills folder (global)
cp -r skillXiv/skills/skillxiv-v0.0.2-claude-opus-4.6/attnrl-attention-compass-reasoning-exploration ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

ADu2021/skillXiv

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT