Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

ADu2021/aria-intention-reward

Name: aria-intention-reward
Author: ADu2021

skills/skillxiv-v0.0.2-claude-opus-4.6/aria-intention-reward/SKILL.md

npx skillsauth add ADu2021/skillXiv aria-intention-reward

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

ARIA: Training Language Agents with Intention-Driven Reward Aggregation

Core Concept

ARIA addresses the fundamental challenge of training language agents in open-ended environments where the action space grows exponentially. In traditional reinforcement learning for language agents, each unique token sequence represents a distinct action, creating extremely sparse reward signals that make gradient-based optimization inefficient.

ARIA's key insight is to project natural language actions into a lower-dimensional semantic space where similar actions are grouped together and share reward signals. This intentional aggregation densifies rewards, dramatically reducing policy gradient variance and enabling effective agent training with standard optimization methods.

Architecture Overview

Intention Space Projection: Convert discrete token distributions from the policy into semantic clusters via dimensionality reduction
Reward Signal Aggregation: Assign shared rewards to semantically similar actions rather than treating each token sequence independently
Policy Gradient Optimization: Leverage densified reward signals to improve gradient estimation and reduce variance
End-to-End Training: Integrate intention space projection as a differentiable layer in the RL pipeline
Task-Agnostic Design: Apply the same aggregation mechanism across diverse downstream tasks

Implementation

The following steps outline how to implement intention-driven reward aggregation in a language agent training pipeline:

Define the intention space encoder - Use a semantic encoder (e.g., a frozen language model or contrastive encoder) to map action descriptions to fixed-size vectors
Aggregate reward signals - Group actions by semantic similarity in the intention space and assign shared rewards to clusters
Compute policy gradients - Calculate policy gradients using densified rewards to reduce variance
Update agent policy - Optimize the language agent using standard PPO or policy gradient methods with aggregated rewards
Monitor performance - Track downstream task metrics to validate improvement from reward densification

import torch
import torch.nn as nn
from transformers import AutoModel

class IntentionRewardAggregator(nn.Module):
    def __init__(self, encoder_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.embedding_dim = self.encoder.config.hidden_size

    def encode_actions(self, action_texts: list[str]) -> torch.Tensor:
        """Encode action sequences into intention space."""
        embeddings = self.encoder.encode(action_texts, convert_to_tensor=True)
        return embeddings

    def aggregate_rewards(self, embeddings: torch.Tensor, rewards: torch.Tensor,
                         clustering_threshold: float = 0.85) -> torch.Tensor:
        """Aggregate rewards for semantically similar actions."""
        similarity_matrix = torch.nn.functional.cosine_similarity(
            embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=2
        )
        clusters = (similarity_matrix > clustering_threshold).long()

        aggregated_rewards = torch.zeros_like(rewards)
        for i in range(len(rewards)):
            similar_indices = (clusters[i] == 1).nonzero(as_tuple=True)[0]
            aggregated_rewards[i] = rewards[similar_indices].mean()

        return aggregated_rewards

    def forward(self, action_texts: list[str], rewards: torch.Tensor) -> torch.Tensor:
        """Encode actions and aggregate rewards in intention space."""
        embeddings = self.encode_actions(action_texts)
        return self.aggregate_rewards(embeddings, rewards)

Practical Guidance

Hyperparameters to tune:

Clustering threshold (0.75-0.95): Controls how tightly actions must match semantically to share rewards. Lower values create broader clusters with more reward sharing; higher values create finer-grained distinctions.
Encoder model: Use a domain-specific semantic encoder if available (e.g., CodeBERT for code agents, specialized instruction encoders for task-specific agents)
Aggregation method: Experiment with mean pooling vs. weighted averaging (weight by confidence scores) for combining rewards within clusters

When to use:

Training agents in open-ended environments with large action spaces
When reward signals are sparse and gradient estimates have high variance
For multi-task learning where shared reward structure improves generalization
When downstream tasks benefit from semantic grouping of similar behaviors

When NOT to use:

Tasks requiring highly fine-grained action distinctions where grouping would lose important information
Environments with dense reward signals where variance reduction provides minimal benefit
Real-time systems where the encoding overhead of intention space projection is prohibitive

Common pitfalls:

Over-clustering: Using thresholds too low causes dissimilar actions to share rewards, degrading task performance
Encoder mismatch: Using generic encoders that don't capture domain-specific action semantics reduces aggregation quality
Ignoring context: Treating actions in isolation without considering task context can lead to inappropriate grouping
Insufficient exploration: Reward aggregation can reduce exploration if not balanced with entropy regularization

Reference

The paper demonstrates consistent improvements in training efficiency and downstream task performance, with an average gain of 9.95% across four diverse language agent tasks. The method is model-agnostic and compatible with standard RL algorithms (PPO, A3C, etc.).

Original paper: "ARIA: Training Language Agents with Intention-Driven Reward Aggregation" (arxiv.org/abs/2506.00539)

ADu2021/aria-intention-reward

skills/skillxiv-v0.0.2-claude-opus-4.6/aria-intention-reward/SKILL.md

Reduce policy gradient variance in language agent training by aggregating rewards in semantic intention space, enabling 9.95% average performance gains across downstream tasks without exponential action space explosion.

2 stars

data-ai

Updated Apr 16, 2026

$ install --global

skillsauth

npx skillsauth add ADu2021/skillXiv aria-intention-reward

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 24, 2026, 8:55 PM1.8s1 file scanned

SKILL.md

name:: aria-intention-reward
title:: ARIA: Training Language Agents with Intention-Driven Reward Aggregation
version:: 0.0.2
engine:: skillxiv-v0.0.2-claude-opus-4.6
license:: MIT
url:: https://arxiv.org/abs/2506.00539
keywords:: [reinforcement learning, language agents, reward aggregation, variance reduction, semantic space]
description:: Reduce policy gradient variance in language agent training by aggregating rewards in semantic intention space, enabling 9.95% average performance gains across downstream tasks without exponential action space explosion.

ARIA: Training Language Agents with Intention-Driven Reward Aggregation

Core Concept

Architecture Overview

Intention Space Projection: Convert discrete token distributions from the policy into semantic clusters via dimensionality reduction
Reward Signal Aggregation: Assign shared rewards to semantically similar actions rather than treating each token sequence independently
Policy Gradient Optimization: Leverage densified reward signals to improve gradient estimation and reduce variance
End-to-End Training: Integrate intention space projection as a differentiable layer in the RL pipeline
Task-Agnostic Design: Apply the same aggregation mechanism across diverse downstream tasks

Implementation

The following steps outline how to implement intention-driven reward aggregation in a language agent training pipeline:

Define the intention space encoder - Use a semantic encoder (e.g., a frozen language model or contrastive encoder) to map action descriptions to fixed-size vectors
Aggregate reward signals - Group actions by semantic similarity in the intention space and assign shared rewards to clusters
Compute policy gradients - Calculate policy gradients using densified rewards to reduce variance
Update agent policy - Optimize the language agent using standard PPO or policy gradient methods with aggregated rewards
Monitor performance - Track downstream task metrics to validate improvement from reward densification

import torch
import torch.nn as nn
from transformers import AutoModel

class IntentionRewardAggregator(nn.Module):
    def __init__(self, encoder_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.embedding_dim = self.encoder.config.hidden_size

    def encode_actions(self, action_texts: list[str]) -> torch.Tensor:
        """Encode action sequences into intention space."""
        embeddings = self.encoder.encode(action_texts, convert_to_tensor=True)
        return embeddings

    def aggregate_rewards(self, embeddings: torch.Tensor, rewards: torch.Tensor,
                         clustering_threshold: float = 0.85) -> torch.Tensor:
        """Aggregate rewards for semantically similar actions."""
        similarity_matrix = torch.nn.functional.cosine_similarity(
            embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=2
        )
        clusters = (similarity_matrix > clustering_threshold).long()

        aggregated_rewards = torch.zeros_like(rewards)
        for i in range(len(rewards)):
            similar_indices = (clusters[i] == 1).nonzero(as_tuple=True)[0]
            aggregated_rewards[i] = rewards[similar_indices].mean()

        return aggregated_rewards

    def forward(self, action_texts: list[str], rewards: torch.Tensor) -> torch.Tensor:
        """Encode actions and aggregate rewards in intention space."""
        embeddings = self.encode_actions(action_texts)
        return self.aggregate_rewards(embeddings, rewards)

Practical Guidance

Hyperparameters to tune:

Clustering threshold (0.75-0.95): Controls how tightly actions must match semantically to share rewards. Lower values create broader clusters with more reward sharing; higher values create finer-grained distinctions.
Encoder model: Use a domain-specific semantic encoder if available (e.g., CodeBERT for code agents, specialized instruction encoders for task-specific agents)
Aggregation method: Experiment with mean pooling vs. weighted averaging (weight by confidence scores) for combining rewards within clusters

When to use:

Training agents in open-ended environments with large action spaces
When reward signals are sparse and gradient estimates have high variance
For multi-task learning where shared reward structure improves generalization
When downstream tasks benefit from semantic grouping of similar behaviors

When NOT to use:

Tasks requiring highly fine-grained action distinctions where grouping would lose important information
Environments with dense reward signals where variance reduction provides minimal benefit
Real-time systems where the encoding overhead of intention space projection is prohibitive

Common pitfalls:

Over-clustering: Using thresholds too low causes dissimilar actions to share rewards, degrading task performance
Encoder mismatch: Using generic encoders that don't capture domain-specific action semantics reduces aggregation quality
Ignoring context: Treating actions in isolation without considering task context can lead to inappropriate grouping
Insufficient exploration: Reward aggregation can reduce exploration if not balanced with entropy regularization

Reference

Original paper: "ARIA: Training Language Agents with Intention-Driven Reward Aggregation" (arxiv.org/abs/2506.00539)

Related Skills

ADu2021/flow-map-trajectory-tilting

testing

VerifiedTrustedCommunity

Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flow-map-trajectory-tilting

ADu2021/flexible-data-mixture-of-experts

testing

VerifiedTrustedCommunity

Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flexible-data-mixture-of-experts

ADu2021/flexibility-trap-diffusion-reasoning

data-ai

VerifiedTrustedCommunity

Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flexibility-trap-diffusion-reasoning

ADu2021/flex-continuous-agent-evolution

devops

VerifiedTrustedCommunity

Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flex-continuous-agent-evolution

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/ADu2021/skillXiv.git

# Copy into Claude Code skills folder (global)
cp -r skillXiv/skills/skillxiv-v0.0.2-claude-opus-4.6/aria-intention-reward ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

ADu2021/skillXiv

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT