skills/skillxiv-v0.0.2-claude-opus-4.6/aria-intention-reward/SKILL.md
Reduce policy gradient variance in language agent training by aggregating rewards in a semantic intention space, yielding a 9.95% average performance gain across downstream tasks while avoiding the exponential growth of the raw action space.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add ADu2021/skillXiv aria-intention-reward
ARIA addresses the fundamental challenge of training language agents in open-ended environments where the action space grows exponentially. In traditional reinforcement learning for language agents, each unique token sequence represents a distinct action, creating extremely sparse reward signals that make gradient-based optimization inefficient.
ARIA's key insight is to project natural language actions into a lower-dimensional semantic space where similar actions are grouped together and share reward signals. Aggregating at the intention level densifies rewards: each action is credited with the average outcome of its group rather than its own single sparse sample, which reduces policy gradient variance and lets the agent be trained with standard optimization methods.
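To make the variance argument concrete, here is a minimal sketch in plain Python. It assumes the intention label of each action is already known (the labels `bargain`, `concede`, `stall` and their success probabilities are hypothetical, chosen only for illustration) and uses the variance of the reward signal as a simple proxy for policy gradient variance. It is not the paper's implementation.

```python
import random
from collections import defaultdict
from statistics import mean, pvariance


def aggregate_by_intention(intentions, rewards):
    """Replace each reward with the mean reward of its intention group."""
    groups = defaultdict(list)
    for intent, r in zip(intentions, rewards):
        groups[intent].append(r)
    group_mean = {intent: mean(rs) for intent, rs in groups.items()}
    return [group_mean[intent] for intent in intentions]


random.seed(0)

# Hypothetical setup: three intentions with different success probabilities.
success_prob = {"bargain": 0.7, "concede": 0.4, "stall": 0.1}
intentions = [random.choice(list(success_prob)) for _ in range(300)]

# Sparse binary rewards, one noisy sample per action.
rewards = [1.0 if random.random() < success_prob[i] else 0.0 for i in intentions]

aggregated = aggregate_by_intention(intentions, rewards)

raw_var = pvariance(rewards)
agg_var = pvariance(aggregated)
print(f"raw variance: {raw_var:.3f}, aggregated variance: {agg_var:.3f}")
```

Group averaging keeps the overall mean reward unchanged and removes the within-group noise, so the aggregated signal can never have higher variance than the raw one (law of total variance).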
The following code sketches how to implement intention-driven reward aggregation in a language agent training pipeline:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from sentence_transformers import SentenceTransformer


    class IntentionRewardAggregator(nn.Module):
        def __init__(self, encoder_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
            super().__init__()
            # SentenceTransformer provides .encode(); a bare transformers.AutoModel does not.
            self.encoder = SentenceTransformer(encoder_name)
            self.embedding_dim = self.encoder.get_sentence_embedding_dimension()

        def encode_actions(self, action_texts: list[str]) -> torch.Tensor:
            """Encode action sequences into intention space."""
            return self.encoder.encode(action_texts, convert_to_tensor=True)

        def aggregate_rewards(self, embeddings: torch.Tensor, rewards: torch.Tensor,
                              clustering_threshold: float = 0.85) -> torch.Tensor:
            """Aggregate rewards for semantically similar actions."""
            rewards = rewards.to(device=embeddings.device, dtype=embeddings.dtype)
            # Pairwise cosine similarity, shape (n, n).
            normed = F.normalize(embeddings, dim=1)
            similarity_matrix = normed @ normed.T
            # Row i marks the actions similar to action i (always including i itself).
            neighbours = (similarity_matrix > clustering_threshold).to(rewards.dtype)
            # Mean reward over each action's similarity neighbourhood.
            return (neighbours @ rewards) / neighbours.sum(dim=1)

        def forward(self, action_texts: list[str], rewards: torch.Tensor) -> torch.Tensor:
            """Encode actions and aggregate rewards in intention space."""
            embeddings = self.encode_actions(action_texts)
            return self.aggregate_rewards(embeddings, rewards)

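The aggregation step above can be checked by hand on a tiny example. The sketch below re-implements the threshold-based averaging in plain Python (no model download needed) on hypothetical 2-D unit vectors placed at chosen angles, so the cosine similarities are easy to verify. The function name `neighbourhood_mean` and the toy data are illustrative assumptions, not part of the skill.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def neighbourhood_mean(embeddings, rewards, threshold=0.85):
    """Average each reward over all actions whose similarity exceeds the threshold."""
    out = []
    for e_i in embeddings:
        idx = [j for j, e_j in enumerate(embeddings) if cosine(e_i, e_j) > threshold]
        out.append(sum(rewards[j] for j in idx) / len(idx))
    return out


def unit(degrees):
    """2-D unit vector at the given angle (toy stand-in for a sentence embedding)."""
    rad = math.radians(degrees)
    return (math.cos(rad), math.sin(rad))


# A and B are 25 degrees apart (cos ~ 0.91), B and C likewise,
# but A and C are 50 degrees apart (cos ~ 0.64); D is far from everything.
toy_embeddings = [unit(0), unit(25), unit(50), unit(120)]
toy_rewards = [1.0, 0.0, 1.0, 0.0]

toy_aggregated = neighbourhood_mean(toy_embeddings, toy_rewards)
print(toy_aggregated)
```

Note the design consequence: similarity neighbourhoods overlap and are not transitive. Here B is averaged with both A and C, while A and C are not averaged with each other, so the three actions do not form one shared-reward group. If a strict partition into intention clusters is wanted, a proper clustering step would have to replace the threshold test.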
Hyperparameters to tune:
When to use:
When NOT to use:
Common pitfalls:
The paper demonstrates consistent improvements in training efficiency and downstream task performance, with an average gain of 9.95% across four diverse language agent tasks. The method is model-agnostic and compatible with standard RL algorithms (PPO, A3C, etc.).
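Because the aggregation only rewrites the reward vector, it can sit in front of any policy-gradient update. As a rough illustration, the sketch below feeds aggregated rewards into a REINFORCE-style surrogate loss with a mean baseline, computed on plain floats. The function `reinforce_loss` and the numbers are hypothetical; a real pipeline would use differentiable log-probabilities from the policy model and the loss of whichever algorithm (PPO, A3C, ...) is in use.

```python
from statistics import mean


def reinforce_loss(log_probs, rewards):
    """Surrogate loss -mean(log_prob * advantage), using the mean reward as baseline."""
    baseline = mean(rewards)
    advantages = [r - baseline for r in rewards]
    return -mean([lp * adv for lp, adv in zip(log_probs, advantages)])


# Hypothetical log-probabilities of four sampled actions under the current policy.
log_probs = [-1.2, -0.7, -2.0, -0.9]

# Rewards after intention-level aggregation: the first two actions share one
# intention group, the last two share another (illustrative values).
aggregated_rewards = [0.75, 0.75, 0.25, 0.25]

loss = reinforce_loss(log_probs, aggregated_rewards)
print(f"surrogate loss: {loss:.4f}")
```

The only change relative to an unaggregated pipeline is which reward vector is passed in; the update rule itself is untouched.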
Original paper: "ARIA: Training Language Agents with Intention-Driven Reward Aggregation" (arxiv.org/abs/2506.00539)