skills/skillxiv-v0.0.2-claude-opus-4.6/ares-entropy-shaping/SKILL.md
Calibrate exploration effort in reasoning traces based on problem difficulty by detecting high-entropy windows and applying hierarchical entropy rewards. Reduces unnecessary reasoning on easy tasks while increasing exploration on hard tasks.
npx skillsauth add ADu2021/skillXiv ares-entropy-shapingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Models waste reasoning computation by overthinking simple problems while under-exploring hard ones. ARES detects the difficulty of a problem mid-reasoning through token-level entropy analysis and dynamically adjusts exploration effort to match difficulty.
Core insight: efficient reasoning allocates thinking proportionally to problem difficulty. Hard problems need more exploration; easy ones need less. By detecting when a model is uncertain and calibrating exploration accordingly, you achieve both faster reasoning and better accuracy on complex tasks.
High Window-Entropy (HWE) Detection: Instead of using noisy single-token entropy, compute entropy over a sliding window of recent tokens. This smoothing reliably identifies when the model hits a decision point or novel reasoning territory.
Adaptive Entropy Policy Optimization (AEPO): Two-stage pipeline where HWE tokens trigger exploration opportunities, paired with hierarchical entropy rewards that scale reward magnitude based on detected task difficulty.
Stage 1: Windowed Entropy Computation
Replace noisy single-token entropy with robust window-based estimates:
import torch
import torch.nn.functional as F
def compute_window_entropy(logits, window_size=5):
"""
Compute entropy over sliding windows of tokens.
Smooths out noise in single-token entropy measurements.
Args:
logits: shape [batch, seq_len, vocab_size]
window_size: tokens to include in entropy window
Returns:
window_entropy: shape [batch, seq_len]
"""
# Compute probabilities
probs = F.softmax(logits, dim=-1)
# Compute entropy for each token
entropy = -(probs * torch.log(probs + 1e-10)).sum(dim=-1)
# Apply sliding window average
window_entropy = []
for i in range(entropy.shape[1]):
start = max(0, i - window_size // 2)
end = min(entropy.shape[1], i + window_size // 2)
window_avg = entropy[:, start:end].mean(dim=-1)
window_entropy.append(window_avg)
return torch.stack(window_entropy, dim=1)
# Identify high-entropy windows
window_entropy = compute_window_entropy(logits)
hwe_threshold = window_entropy.quantile(0.75)
hwe_positions = (window_entropy > hwe_threshold).nonzero()
Stage 2: Difficulty-Aware Reward Shaping
Design rewards that scale exploration incentive based on problem difficulty:
def shape_entropy_reward(
window_entropy,
reasoning_trace_length,
target_trace_length,
difficulty_score
):
"""
Create hierarchical entropy rewards.
High difficulty → encourage more exploration.
Low difficulty → penalize excess exploration.
Args:
window_entropy: computed from Stage 1
reasoning_trace_length: current trace length
target_trace_length: baseline expected length
difficulty_score: inferred problem difficulty
"""
# Compute exploration excess
excess_steps = max(0, reasoning_trace_length - target_trace_length)
# Base entropy reward
exploration_reward = window_entropy.mean()
# Dynamic KL control based on difficulty
if difficulty_score > 0.7: # Hard problem
# Encourage more exploration
kl_coefficient = 0.1 # Low penalty on divergence
exploration_bonus = 0.5
else: # Easy problem
# Penalize excess exploration
kl_coefficient = 0.5 # High penalty on divergence
exploration_bonus = -0.3 * excess_steps
# Total reward
total_reward = (
exploration_reward * exploration_bonus +
-kl_coefficient * excess_steps
)
return total_reward
Stage 3: RL Training Loop
Integrate entropy shaping into policy optimization:
def train_with_entropy_shaping(model, dataloader, num_epochs=10):
"""
Train model with adaptive entropy-based exploration.
"""
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for epoch in range(num_epochs):
for problems, solutions in dataloader:
# Generate with temperature sampling for exploration
with torch.no_grad():
outputs = model.generate(
problems,
max_length=512,
temperature=1.2,
output_scores=True,
return_dict_in_generate=True
)
# Compute window entropy for generated sequences
window_entropy = compute_window_entropy(outputs.scores)
# Infer difficulty from entropy pattern
mean_entropy = window_entropy.mean(dim=-1)
difficulty = (mean_entropy / max_entropy).clamp(0, 1)
# Get baseline expected length
baseline_lengths = estimate_solution_length(problems)
# Shape rewards based on difficulty and entropy
rewards = []
for i, (output, baseline_len) in enumerate(
zip(outputs.sequences, baseline_lengths)
):
is_correct = verify_solution(output, solutions[i])
trace_length = (output != pad_token).sum()
entropy_reward = shape_entropy_reward(
window_entropy[i],
trace_length,
baseline_len,
difficulty[i]
)
total_reward = (
float(is_correct) * 10.0 + # Solution correctness
entropy_reward # Adaptive exploration
)
rewards.append(total_reward)
# Policy gradient update
loss = compute_policy_loss(model, outputs, rewards)
optimizer.zero_grad()
loss.backward()
optimizer.step()
When to Use ARES:
When NOT to Use:
Hyperparameter Tuning:
| Parameter | Typical Value | Guidance | |-----------|---------------|----------| | Window Size | 5-10 tokens | Larger = smoother but delayed detection | | HWE Threshold | 75th percentile | Targets top-25% highest entropy moments | | KL Coeff (Hard) | 0.05-0.2 | Lower = more exploration encouraged | | KL Coeff (Easy) | 0.3-0.8 | Higher = more aggressive pruning | | Temperature | 1.1-1.5 | Higher temp enables better exploration |
Common Pitfalls:
Based on the research at: https://arxiv.org/abs/2510.08457
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.