skills/skillxiv-v0.0.2-claude-opus-4.6/attnrl-attention-compass-reasoning-exploration/SKILL.md
Guide LLM exploration in reasoning tasks using attention patterns as navigation signals. This technique branches exploration from high-attention tokens (likely reasoning steps) and applies adaptive sampling to maintain effective gradients, significantly improving training efficiency for mathematical reasoning.
npx skillsauth add ADu2021/skillXiv attnrl-attention-compass-reasoning-explorationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Current reinforcement learning approaches to reasoning in LLMs sample from a broad action space—every possible continuation at every step—resulting in high variance and wasted computation. Blind exploration treats all tokens equally, but reasoning models naturally develop attention patterns that highlight critical decision points. AttnRL exploits this signal.
The core observation is that attention weights correlate with reasoning behavior. When an LLM attends strongly to certain tokens during reasoning, those positions are likely decision points worth exploring. By branching exploration from high-attention regions, we concentrate computational effort on promising paths rather than sampling uniformly.
AttnRL uses a three-component approach:
The result is a focused exploration strategy that reduces computational cost while improving convergence on math benchmarks like AIME and Olympiad-style problems.
First, extract and aggregate attention patterns from the model. High attention scores signal tokens worth exploring from:
def extract_attention_branches(model, prompt, attention_percentile=75):
"""
Identify high-attention tokens for exploration branching.
Args:
model: Language model with attention hooks
prompt: Input prompt to analyze
attention_percentile: Threshold for high-attention identification
Returns:
branch_positions: List of (step, token_id) tuples for exploration
"""
attention_weights = [] # Shape: (num_layers, num_heads, seq_len, seq_len)
# Forward pass with attention capture
with torch.no_grad():
_ = model(prompt)
attention_weights = model.get_attention_weights()
# Average across heads and layers
avg_attention = attention_weights.mean(dim=(0, 1)) # Shape: (seq_len, seq_len)
# For each token position, find max attention score
max_attention_per_token = avg_attention.max(dim=1).values
# Identify high-attention positions
threshold = torch.quantile(max_attention_per_token, attention_percentile / 100)
branch_positions = torch.where(max_attention_per_token > threshold)[0].tolist()
return branch_positions
Next, implement adaptive sampling that adjusts rollout counts based on problem difficulty:
def adaptive_sampler(prompt, model, branch_positions, base_rollouts=4, max_rollouts=16):
"""
Allocate rollout budget based on problem difficulty and history.
Args:
prompt: Input problem
model: Policy model
branch_positions: Tokens to branch from
base_rollouts: Minimum rollouts per prompt
max_rollouts: Maximum allowed rollouts
Returns:
rollouts: List of (trajectory, logprob) pairs
rollout_counts: Actual number per position
"""
rollouts = []
rollout_counts = {}
for pos in branch_positions:
# Estimate difficulty from prompt length and reasoning structure
difficulty_score = estimate_difficulty(prompt) # 0-1 scale
# Allocate more rollouts to harder problems
num_rollouts = int(base_rollouts + difficulty_score * (max_rollouts - base_rollouts))
# Generate rollouts from this branching point
for _ in range(num_rollouts):
# Continue generation from position `pos`
continuation = model.generate_from_position(prompt, pos)
logprob = model.get_logprob(continuation, prompt)
rollouts.append((continuation, logprob))
rollout_counts[pos] = num_rollouts
return rollouts, rollout_counts
Finally, implement off-policy correction to account for non-uniform sampling:
def compute_off_policy_loss(rollouts, rewards, proposal_logprobs,
uniform_baseline_logprobs, clip_ratio=0.2):
"""
Off-policy PPO-style loss with importance weighting.
Args:
rollouts: List of (trajectory, proposal_logprob) tuples
rewards: Corresponding rewards (from process supervision)
proposal_logprobs: Log probabilities from branching policy
uniform_baseline_logprobs: Log prob under uniform distribution (for weighting)
clip_ratio: PPO clipping threshold
Returns:
loss: Scalar loss for backpropagation
"""
# Importance weights: correct for non-uniform sampling
importance_weights = torch.exp(proposal_logprobs - uniform_baseline_logprobs)
importance_weights = torch.clamp(importance_weights, max=5.0) # Prevent extreme weights
# Normalize advantages
advantages = rewards - rewards.mean()
advantages = advantages / (advantages.std() + 1e-8)
# PPO-style clipped loss with importance weighting
ratio = importance_weights
surr1 = ratio * advantages
surr2 = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * advantages
loss = -torch.min(surr1, surr2).mean()
return loss
When to use AttnRL:
When NOT to use:
Hyperparameter tuning:
| Parameter | Default | Range | Impact | |-----------|---------|-------|--------| | attention_percentile | 75 | 50-90 | Higher = explore fewer branches | | base_rollouts | 4 | 2-8 | Minimum samples per branch | | max_rollouts | 16 | 8-32 | Cap for high-difficulty problems | | importance_weight_clip | 5.0 | 2.0-10.0 | Stability of off-policy correction | | clip_ratio | 0.2 | 0.1-0.3 | PPO trust region width |
Common pitfalls:
Integration checklist:
Reference: https://arxiv.org/abs/2509.26628
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.