skills/skillxiv-v0.0.2-claude-opus-4.6/agent-reasoning-reward-model/SKILL.md
Build multi-faceted reward models for agent trajectories that provide structured feedback on intermediate reasoning quality. Implement explicit reasoning traces, focused critiques with refinement guidance, and overall process scores to train more effective agents without relying solely on sparse outcome rewards.
Agentic reinforcement learning systems typically rely on sparse outcome-based rewards that fail to differentiate intermediate reasoning quality, leading to suboptimal agent training. Agents need process-level feedback during their reasoning chains to improve beyond trial-and-error learning.
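Implement an Agent Reasoning Reward Model (Agent-RRM) that produces three types of structured feedback for agentic trajectories: an explicit reasoning trace, a focused critique with refinement guidance, and an overall process score.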
Implement a component that identifies and validates reasoning steps in agent trajectories.
def extract_reasoning_trace(trajectory):
    """
    Extract logical steps from agent trajectory.
    trajectory: list of (observation, action, thought) tuples
    Returns: structured trace with step IDs and a per-step validity flag
    """
    trace_steps = []
    for i, (obs, action, thought) in enumerate(trajectory):
        step = {
            "id": i,
            "observation": obs,
            "thought": thought,
            "action": action,
            # validate_step_logic is not defined in this skill; supply your own
            "is_valid": validate_step_logic(thought, action)
        }
        trace_steps.append(step)
    return {"steps": trace_steps, "total_length": len(trace_steps)}
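The trace extractor depends on `validate_step_logic`, which the skill leaves undefined. As a minimal sketch, the block below plugs in a toy heuristic (an assumption made purely for illustration, not the skill's validator): a step counts as valid when the thought is non-empty and mentions the tool named at the start of the action string. The sample trajectory is also made up.

```python
def validate_step_logic(thought, action):
    """Toy stand-in (assumption): valid when the thought is non-empty
    and mentions the tool name that opens the action string."""
    if not thought.strip():
        return False
    tool_name = action.split("(", 1)[0].strip().lower()
    return tool_name in thought.lower()


def extract_reasoning_trace(trajectory):
    """Turn (observation, action, thought) tuples into a flagged trace."""
    trace_steps = []
    for i, (obs, action, thought) in enumerate(trajectory):
        trace_steps.append({
            "id": i,
            "observation": obs,
            "thought": thought,
            "action": action,
            "is_valid": validate_step_logic(thought, action),
        })
    return {"steps": trace_steps, "total_length": len(trace_steps)}


# Hypothetical two-step trajectory: the second thought never mentions its tool.
trajectory = [
    ("user asks for a population figure", "search('France population')",
     "I should search for the figure."),
    ("search results returned", "calculator('68e6 / 2')",
     "Let me just guess."),
]
trace = extract_reasoning_trace(trajectory)
validity = [step["is_valid"] for step in trace["steps"]]
```

In practice the validator would more likely be a learned model or an LLM judge; the string match here only exists to make the data flow concrete.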
Create a module that identifies reasoning flaws and suggests improvements.
def generate_critique(trajectory, trace):
    """
    Identify reasoning flaws and provide refinement guidance.
    Returns: list of critiques with locations and suggestions
    """
    critiques = []
    for step in trace["steps"]:
        if not step["is_valid"]:
            critique = {
                "step_id": step["id"],
                "flaw_type": classify_reasoning_flaw(step),
                "description": describe_flaw(step),
                "suggestion": generate_refinement(step)
            }
            critiques.append(critique)
    return critiques
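To show the shape of the critique output, here is a small sketch with rule-based stand-ins for `classify_reasoning_flaw`, `describe_flaw`, and `generate_refinement`. All three helpers and the flaw labels are hypothetical; the skill does not say how they are implemented, and a real system would presumably generate them with a model.

```python
def classify_reasoning_flaw(step):
    # Hypothetical rule: an empty thought means no rationale was given;
    # anything else is treated as a thought/action mismatch.
    return "missing_rationale" if not step["thought"].strip() else "action_mismatch"


def describe_flaw(step):
    return f"Step {step['id']}: thought does not justify action {step['action']!r}."


def generate_refinement(step):
    return f"Explain why {step['action']!r} follows from the observation before acting."


def generate_critique(trajectory, trace):
    """Emit one critique per step that the trace flagged as invalid."""
    critiques = []
    for step in trace["steps"]:
        if not step["is_valid"]:
            critiques.append({
                "step_id": step["id"],
                "flaw_type": classify_reasoning_flaw(step),
                "description": describe_flaw(step),
                "suggestion": generate_refinement(step),
            })
    return critiques


# Hand-built trace with one valid step and two flagged steps.
trace = {
    "steps": [
        {"id": 0, "thought": "Search first.", "action": "search('x')", "is_valid": True},
        {"id": 1, "thought": "", "action": "click('buy')", "is_valid": False},
        {"id": 2, "thought": "Guessing.", "action": "submit('42')", "is_valid": False},
    ],
    "total_length": 3,
}
critiques = generate_critique([], trace)
flagged_ids = [c["step_id"] for c in critiques]
```

Valid steps produce no critique at all, so the critique list is sparse and its length can double as a flaw count.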
Score the overall reasoning process quality.
def score_process(trajectory, trace, critiques):
    """
    Evaluate process performance using multiple signals:
    - Reasoning clarity and coherence
    - Tool use efficiency
    - Goal alignment
    """
    scores = {
        "clarity": measure_clarity(trace),
        "efficiency": measure_efficiency(trajectory),
        "alignment": measure_goal_alignment(trace),
        "flaw_count": len(critiques)
    }
    # Weighted combination: penalize flaws, reward efficiency
    overall_score = (
        0.4 * scores["clarity"] +
        0.3 * scores["efficiency"] +
        0.3 * scores["alignment"] -
        0.1 * min(len(critiques), 10)
    )
    return {"scores": scores, "overall": max(0, min(1, overall_score))}
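The weighted combination is easier to judge with numbers plugged in. The sketch below isolates just the arithmetic into a helper named `combine_scores` (a hypothetical name) and feeds it made-up sub-scores, assuming each sub-score lies in [0, 1].

```python
def combine_scores(clarity, efficiency, alignment, flaw_count):
    """Same weights and flaw penalty as score_process, clamped to [0, 1]."""
    raw = (
        0.4 * clarity
        + 0.3 * efficiency
        + 0.3 * alignment
        - 0.1 * min(flaw_count, 10)  # penalty saturates at ten flaws
    )
    return max(0, min(1, raw))


# 0.32 + 0.18 + 0.27 - 0.20, roughly 0.57
moderate = combine_scores(0.8, 0.6, 0.9, flaw_count=2)

# 0.50 - 1.00 is negative, so the clamp floors it at 0
flooded = combine_scores(0.5, 0.5, 0.5, flaw_count=12)
```

Each critique costs 0.1, so two flaws erase about as much score as dropping clarity by 0.5. Whether that penalty is too steep for long trajectories is a tuning question the skill leaves open.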
Implement three training strategies using the reward signals:
Reagent-C (Text-Augmented Refinement): Augment trajectories with textual critiques during training.
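def reagent_c_augment(trajectory, critiques):
    """Add critique text to trajectory for language-model-guided learning"""
    # Steps arrive as (observation, action, thought) tuples, which cannot hold
    # extra keys, so convert each one to a dict before attaching hints.
    augmented = [
        {"observation": obs, "action": action, "thought": thought}
        for obs, action, thought in trajectory
    ]
    for critique in critiques:
        step_id = critique["step_id"]
        augmented[step_id]["refinement_hint"] = critique["suggestion"]
    return augmented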
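One way the augmented trajectory might reach a language model is as plain text with the hint appended to the flagged step. The sketch below assumes tuple-shaped steps as described for the trace extractor; `render_for_training` and its line format are hypothetical and not part of the skill.

```python
def reagent_c_augment(trajectory, critiques):
    """Return dict-based steps with critique suggestions attached as hints."""
    augmented = [
        {"observation": obs, "action": action, "thought": thought}
        for obs, action, thought in trajectory
    ]
    for critique in critiques:
        augmented[critique["step_id"]]["refinement_hint"] = critique["suggestion"]
    return augmented


def render_for_training(augmented):
    """Hypothetical serializer: one text line per step, hint appended if present."""
    lines = []
    for i, step in enumerate(augmented):
        line = f"[{i}] thought: {step['thought']} | action: {step['action']}"
        if "refinement_hint" in step:
            line += f" | hint: {step['refinement_hint']}"
        lines.append(line)
    return "\n".join(lines)


# Made-up trajectory and a single critique targeting step 1.
trajectory = [
    ("question received", "search('topic')", "Search for background."),
    ("results listed", "answer('first hit')", "The first hit is probably right."),
]
critiques = [{"step_id": 1, "suggestion": "Verify the source first."}]
text = render_for_training(reagent_c_augment(trajectory, critiques))
```

Building new dicts leaves the input trajectory untouched, which matters if the same rollout is also scored without hints.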
Reagent-R (Reward-Augmented Guidance): Use process scores to shape reward signal.
def reagent_r_reward(outcome_reward, process_score, weight=0.3):
    """Combine outcome and process rewards"""
    return (1 - weight) * outcome_reward + weight * process_score
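A quick worked example shows what the blend buys over an outcome-only reward. The reward and score values below are illustrative, assuming both inputs lie in [0, 1].

```python
def reagent_r_reward(outcome_reward, process_score, weight=0.3):
    """Convex blend of the sparse outcome reward and the dense process score."""
    return (1 - weight) * outcome_reward + weight * process_score


# Two failed rollouts (outcome 0) no longer collapse to the same reward.
failed_good_reasoning = reagent_r_reward(0.0, 0.8)  # 0.3 * 0.8, about 0.24
failed_poor_reasoning = reagent_r_reward(0.0, 0.2)  # 0.3 * 0.2, about 0.06

# A solved task with weak reasoning still outranks both failures.
solved_poor_reasoning = reagent_r_reward(1.0, 0.2)  # 0.7 + 0.06, about 0.76
```

With the default weight of 0.3 the process score can separate trajectories that share an outcome, but it cannot lift a failure above a success, since a failure is capped at 0.3 while a success is at least 0.7.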
Reagent-U (Unified Integration): Combine all signals—trace, critique, and scores—in a joint training objective.
def reagent_u_objective(trajectory, trace, critiques, process_score, outcome_reward):
    """Unified loss combining all reward signals"""
    # Every term is a penalty (lower is better). This assumes
    # evaluate_trace_quality returns a penalty rather than a quality score.
    trace_loss = 0.2 * evaluate_trace_quality(trace)
    critique_loss = 0.3 * len(critiques)
    score_loss = 0.3 * (1 - process_score)
    outcome_loss = 0.2 * (1 - outcome_reward)
    return trace_loss + critique_loss + score_loss + outcome_loss
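To see how the four terms trade off, the sketch below evaluates the unified objective on a hand-built trace. `evaluate_trace_quality` is not defined by the skill; here it is assumed to return a penalty, namely the fraction of steps flagged invalid, so that every term in the sum is lower-is-better.

```python
def evaluate_trace_quality(trace):
    """Hypothetical penalty: fraction of steps flagged invalid (0 is best)."""
    steps = trace["steps"]
    if not steps:
        return 0.0
    return sum(not s["is_valid"] for s in steps) / len(steps)


def reagent_u_objective(trajectory, trace, critiques, process_score, outcome_reward):
    """Sum of four weighted penalty terms, as in the skill's definition."""
    trace_loss = 0.2 * evaluate_trace_quality(trace)
    critique_loss = 0.3 * len(critiques)
    score_loss = 0.3 * (1 - process_score)
    outcome_loss = 0.2 * (1 - outcome_reward)
    return trace_loss + critique_loss + score_loss + outcome_loss


# Solved task, four steps, one of them flagged and critiqued.
trace = {"steps": [{"is_valid": True}, {"is_valid": True},
                   {"is_valid": False}, {"is_valid": True}]}
critiques = [{"step_id": 2}]
# 0.2 * 0.25 + 0.3 * 1 + 0.3 * 0.43 + 0.2 * 0, roughly 0.479
loss = reagent_u_objective([], trace, critiques, process_score=0.57, outcome_reward=1.0)

# A flawless, solved trajectory incurs no loss.
clean_trace = {"steps": [{"is_valid": True}, {"is_valid": True}]}
clean = reagent_u_objective([], clean_trace, [], process_score=1.0, outcome_reward=1.0)
```

Note that the critique term is unbounded: a single critique adds 0.3, which already exceeds the maximum outcome term of 0.2. Capping or normalizing the critique count, as `score_process` does with `min(..., 10)`, may be worth considering.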