skills/exploring-reasoning-reward-agents/SKILL.md
Apply Agent Reasoning Reward Model (Agent-RRM) structured critique to improve multi-step agent trajectories. Evaluates tool-use chains with explicit reasoning traces, focused critiques, and process scores. Use this skill when:
- "Critique this agent's reasoning trace"
- "Evaluate my tool-calling workflow and find flaws"
- "Score this multi-step agent trajectory"
- "Help me build a reward model for agent training"
- "Improve this agent's reasoning with structured feedback"
- "Debug why my agent pipeline produces wrong answers"
npx skillsauth add ndpvt-web/arxiv-claude-skills exploring-reasoning-reward-agents
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to apply the Agent-RRM framework from the Reagent paper to evaluate and improve multi-step agent trajectories. Instead of binary pass/fail judgments on agent outputs, you produce three structured feedback components for any tool-using reasoning chain: (1) a reasoning trace analyzing logical consistency and tool-use patterns, (2) a focused critique identifying specific flaws without revealing answers, and (3) a calibrated process score. These signals can be used to refine agent behavior at inference time (Reagent-C style), augment reward signals for RL training (Reagent-R style), or unify both approaches (Reagent-U style).
The core problem: Standard agent training uses sparse outcome-based rewards, under which the agent either gets the final answer right or wrong. This fails to distinguish an agent that reasoned well but made one tool error from one that stumbled into the right answer through flawed logic. Intermediate reasoning quality is invisible to the training signal.
Agent-RRM's solution: A reward model that produces structured, multi-faceted feedback for each trajectory. The model outputs three components in sequence: a <think> block containing step-by-step analysis of the agent's logical consistency, tool selection appropriateness, and iterative improvement behavior; a <critique> block that distills specific, actionable flaws (wrong tool arguments, hallucinated inputs, unverified assumptions, over-reliance on tools when direct reasoning sufficed); and a <score> between 0 and 1 representing overall process quality. Critically, the critique never reveals the correct answer -- it only highlights where reasoning went wrong, preserving the agent's ability to self-correct.
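The three-part output described above lends itself to simple parsing. The following is a minimal sketch, assuming the blocks arrive as literal `<think>`, `<critique>`, and `<score>` tags in one string; the `Feedback` class and `parse_feedback` function are hypothetical names introduced for illustration and are not part of the Reagent codebase.

```python
import re
from dataclasses import dataclass


@dataclass
class Feedback:
    think: str
    critique: str
    score: float


def parse_feedback(text: str) -> Feedback:
    """Extract the <think>, <critique>, and <score> blocks from model output."""

    def block(tag: str) -> str:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        if match is None:
            raise ValueError(f"missing <{tag}> block")
        return match.group(1).strip()

    score = float(block("score"))
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score {score} is outside [0, 1]")
    return Feedback(think=block("think"), critique=block("critique"), score=score)


# Hypothetical sample output used only to exercise the parser.
sample = """<think>
Step 2 assumed the capital without verification.
</think>
<critique>
1. INCONSISTENT VERIFICATION (Step 2): capital was assumed.
</critique>
<score>0.72</score>"""

feedback = parse_feedback(sample)
print(feedback.score)
```

Rejecting output with a missing block or an out-of-range score keeps malformed feedback from silently entering a refinement or training loop.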
Three integration strategies: The paper validates three ways to use these signals. Reagent-C is training-free: feed the critique back to the agent at inference time and let it produce a refined answer in a second pass. Reagent-R combines the model's score with rule-based outcome rewards (R = R_rule + lambda * R_model) for RL training, giving gradient signal on reasoning quality. Reagent-U unifies both: during training, the agent generates an initial trajectory, receives critique, generates a refined trajectory, and both are pooled for advantage-normalized optimization. At inference, Reagent-U operates as a standard agent without the critique step, having internalized the self-correction behavior. Reagent-U achieves the strongest results (43.7% on GAIA, 46.2% on WebWalkerQA).
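The Reagent-R reward combination and the pooled advantage normalization mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the function names, the default `lam = 0.5`, the sample rewards, and the mean/standard-deviation form of the normalization are all assumptions made here.

```python
from statistics import mean, pstdev


def combined_reward(rule_reward: float, model_score: float, lam: float = 0.5) -> float:
    """R = R_rule + lambda * R_model. The default lam is an illustrative assumption."""
    return rule_reward + lam * model_score


def normalized_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Subtract the pool mean and divide by the pool standard deviation
    (an assumed, group-relative form of advantage normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical pool: (outcome reward, Agent-RRM score) pairs for initial
# trajectories and their critique-refined counterparts.
initial = [(0.0, 0.35), (1.0, 0.72)]
refined = [(1.0, 0.80), (1.0, 0.90)]

# Reagent-U style pooling: both groups are normalized together.
pool = [combined_reward(r, s) for r, s in initial + refined]
advantages = normalized_advantages(pool)
print(advantages)
```

Because the model score is added to the outcome reward, two trajectories with the same final answer can still receive different advantages when their reasoning quality differs.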
Collect the full agent trajectory. Gather the complete chain of thought, tool calls, tool responses, and final answer. Include every intermediate step -- partial reasoning, failed tool calls, retries. The trajectory must be complete; truncated traces produce unreliable evaluations.
Evaluate tool decision appropriateness. For each step, assess whether the agent correctly decided to invoke a tool versus reason directly. Flag cases of tool over-reliance (calling a search engine for basic arithmetic) and tool under-use (hallucinating facts that should have been looked up).
Check tool argument correctness. Examine every tool invocation for hallucinated inputs (fabricated filenames, invented URLs, non-existent API parameters), formatting errors (malformed JSON, wrong parameter types), and irrelevant tool selection (using a calculator when a web search was needed).
Assess tool output handling. Determine whether the agent properly interpreted tool responses. Flag blind acceptance of noisy or incomplete outputs, failure to handle errors or empty results, and incorrect extraction of relevant information from tool responses.
Analyze iterative improvement behavior. Check whether the agent corrected earlier mistakes in subsequent steps, verified hypotheses against new evidence, and avoided repeating the same failed approach. Good trajectories show learning within the episode.
Detect fabrications. Identify any invented facts, file paths, object identifiers, or data points that the agent produced without tool verification. This is distinct from tool argument errors -- fabrication is presenting unverified information as established fact.
Produce the <think> reasoning trace. Write a structured analysis covering each dimension above. Be specific: reference exact steps by number, quote the problematic tool calls, and explain the logical gap. This trace must be transparent enough that a developer can locate the exact failure point.
Produce the <critique> block. Distill the analysis into 2-5 concise, actionable findings. Each finding should name the flaw, locate it in the trajectory, and suggest what the agent should have done differently. Do NOT reveal the correct final answer or provide hints toward it.
Assign the <score> (0-1). Score based on process quality, not outcome correctness. A trajectory that reasons well but reaches a wrong answer due to a single tool failure scores higher than one that stumbles into the right answer through flawed logic. Use the full range: 0.0-0.2 (fundamentally flawed reasoning), 0.2-0.5 (major logical gaps), 0.5-0.7 (sound reasoning with notable errors), 0.7-0.9 (strong reasoning with minor issues), 0.9-1.0 (excellent process throughout).
Apply the appropriate integration strategy. For inference-time improvement (Reagent-C): feed the critique back and generate a refined response. For RL training data: use the score as a process reward signal combined with outcome reward. For unified training (Reagent-U): generate both initial and critique-refined trajectories as paired training examples.
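Parts of the checklist above can be pre-screened mechanically before the critique is written. The sketch below assumes a trajectory is stored as a list of steps with a tool name, a raw argument string, and an observation. The `Step` class, the tool names, and the file-name pattern are hypothetical. It covers only two checks: file names passed to a tool without appearing in the question or an earlier observation (tool argument correctness and fabrication detection), and verbatim repeated calls (iterative improvement).

```python
import re
from dataclasses import dataclass


@dataclass
class Step:
    thought: str
    tool: str         # hypothetical tool name, e.g. "read_file"
    argument: str     # raw argument string passed to the tool
    observation: str  # tool response


# Assumed pattern for file-like tokens; extend the extensions as needed.
FILE_PATTERN = re.compile(r"[\w./-]+\.(?:csv|pdf|json|txt)\b")


def unverified_files(question: str, steps: list[Step]) -> list[tuple[int, str]]:
    """Flag file names used as tool arguments that never appeared in the
    question or in an earlier observation."""
    seen = set(FILE_PATTERN.findall(question))
    flagged = []
    for number, step in enumerate(steps, start=1):
        for name in FILE_PATTERN.findall(step.argument):
            if name not in seen:
                flagged.append((number, name))
        seen.update(FILE_PATTERN.findall(step.observation))
    return flagged


def repeated_calls(steps: list[Step]) -> list[int]:
    """Flag steps that repeat an earlier (tool, argument) pair verbatim."""
    seen = set()
    repeats = []
    for number, step in enumerate(steps, start=1):
        key = (step.tool, step.argument)
        if key in seen:
            repeats.append(number)
        seen.add(key)
    return repeats


steps = [
    Step("List the available files.", "list_files", ".", "data.csv notes.txt"),
    Step("Read the data.", "read_file", "data.csv", "5000 rows"),
    Step("Read the report.", "read_file", "report_2024.pdf", "error: not found"),
    Step("Try again.", "read_file", "report_2024.pdf", "error: not found"),
]
print(unverified_files("Summarize the data.", steps))  # [(3, 'report_2024.pdf'), (4, 'report_2024.pdf')]
print(repeated_calls(steps))                           # [4]
```

Heuristics like these only surface candidates. Whether a flagged step is a real flaw still has to be judged in the reasoning trace.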
Example 1: Evaluating a ReAct Agent Trajectory
User: "Evaluate this agent trajectory that tried to answer 'What is the population of the capital of the country that won the 2022 FIFA World Cup?'"
Agent Trajectory:
Step 1: Thought: I need to find who won the 2022 FIFA World Cup.
Action: search("2022 FIFA World Cup winner")
Observation: Argentina won the 2022 FIFA World Cup.
Step 2: Thought: The capital of Argentina is Buenos Aires. I know this.
Action: search("population of Buenos Aires")
Observation: The population of Buenos Aires is approximately 3.1 million (city proper).
Step 3: Answer: The population is 3.1 million.
Approach:
Output:
<think>
Step 1: Appropriate tool use. The agent correctly searched for the World Cup
winner rather than relying on potentially outdated training data.
Step 2: The agent assumed Buenos Aires is Argentina's capital without
verification. While correct, this represents an unverified assumption in a
chain where the previous step was tool-verified. Inconsistent verification
standards. The population search was appropriate.
Step 3: The agent reported "3.1 million" without specifying city proper vs.
metropolitan area. The search result likely contained both figures, but the
agent extracted only one without noting the ambiguity.
</think>
<critique>
1. AMBIGUOUS EXTRACTION (Step 3): The agent reported a single population
figure without clarifying whether it refers to city proper or metropolitan
area. Should have noted the distinction or asked for clarification.
2. INCONSISTENT VERIFICATION (Step 2): Capital identification was assumed
while other facts were tool-verified. Minor risk, but breaks the
verification chain.
</critique>
<score>0.72</score>
Example 2: Critiquing a Code-Execution Agent
User: "Score this agent that tried to find the most common word in a CSV file."
Agent Trajectory:
Step 1: Thought: I'll read the CSV and count words.
Action: execute_code("import pandas as pd; df = pd.read_csv('data.csv')")
Observation: DataFrame loaded, 5000 rows x 3 columns.
Step 2: Thought: I'll count words in all text columns.
Action: execute_code("from collections import Counter; words = ' '.join(df.values.flatten().astype(str)).split(); Counter(words).most_common(5)")
Observation: [('the', 12453), ('and', 8921), ('of', 7234), ('to', 6102), ('in', 5890)]
Step 3: Answer: The most common word is "the" with 12,453 occurrences.
Approach:
Output:
<think>
Step 1: Appropriate tool use. Loading with pandas is reasonable for CSV data.
Step 2: Critical flaw. The agent flattened ALL columns including numeric ones
via .values.flatten().astype(str), converting numbers to strings and counting
them as "words." Additionally, no stop word filtering was applied. The result
is dominated by English function words (the, and, of, to, in) which are almost
certainly not what the user intended by "most common word."
Step 3: The agent reported the raw result without questioning whether stop
words should be excluded or whether numeric columns should be included. No
iterative improvement -- the agent accepted the first result without critical
evaluation.
</think>
<critique>
1. NO STOP WORD FILTERING (Step 2): Results are dominated by function words.
The agent should have either filtered stop words or asked the user whether
they wanted content words only.
2. NUMERIC COLUMN INCLUSION (Step 2): All columns were flattened to strings,
meaning numeric values were tokenized as "words." Should have identified
text columns first using dtype checks.
3. NO SELF-CORRECTION (Step 3): The agent accepted obviously generic results
without questioning their utility. A strong agent would recognize that
"the" being the top word is uninformative and refine the approach.
</critique>
<score>0.35</score>
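For comparison, a counting step that addresses the first two critique points might look like the sketch below. It is deliberately pandas-free and swaps the dtype check for a simple "does every value parse as a number" test, which is a substitute technique chosen here to keep the example on the standard library. The stop-word list and sample data are illustrative assumptions.

```python
import csv
import io
from collections import Counter

# Illustrative stop-word list (an assumption); a real list would be much longer.
STOP_WORDS = {"the", "and", "of", "to", "in", "a", "is"}


def is_number(value: str) -> bool:
    try:
        float(value)
        return True
    except ValueError:
        return False


def most_common_words(csv_text: str, top_n: int = 3) -> list[tuple[str, int]]:
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return []
    # Keep only columns where at least one value is non-numeric text.
    text_columns = [
        column for column in rows[0]
        if any(not is_number(row[column] or "") for row in rows)
    ]
    counts = Counter()
    for row in rows:
        for column in text_columns:
            for word in (row[column] or "").lower().split():
                word = word.strip(".,;:!?\"'()")
                if word and word not in STOP_WORDS:
                    counts[word] += 1
    return counts.most_common(top_n)


# Hypothetical data: two numeric columns and one free-text column.
sample = """id,price,review
1,9.99,The battery is great
2,14.50,Great battery and great screen
3,9.99,The screen is dim
"""
print(most_common_words(sample))
```

With the numeric columns skipped and function words filtered, the top result reflects content words rather than "the".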
Example 3: Applying Reagent-C Refinement
User: "Use critique-based refinement to improve my search agent's answer."
Original agent output: "The Eiffel Tower is 300 meters tall."
(Agent used search but pulled an outdated figure -- actual height with
antenna is 330 meters since 2022.)
Approach:
Output:
--- Critique feedback (Reagent-C input) ---
<critique>
1. OUTDATED INFORMATION (Step 2): The agent reported 300m, which is the
original 1889 construction height, not the current height including the
radio antenna. The search result likely contained both figures, but the
agent extracted the historical one.
2. MISSING CONTEXT: Height figures for structures typically distinguish
between structural height and total height with antenna. Agent should
specify which measurement.
</critique>
--- Refined output (after critique injection) ---
"The Eiffel Tower's structural height is 300 meters (its original 1889
height). Including the radio antenna added in 2022, its current total
height is 330 meters."
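The two-pass flow in this example can be sketched as a small loop. The `Agent` and `Critic` callables below are hypothetical interfaces, the prompt wording is an assumption, and the stubs stand in for real model calls; none of this is taken from the Reagent code.

```python
from typing import Callable

Agent = Callable[[str], str]        # prompt -> answer (hypothetical interface)
Critic = Callable[[str, str], str]  # (question, answer) -> critique text


def refine_with_critique(question: str, agent: Agent, critic: Critic) -> tuple[str, str]:
    """Run the agent, obtain a critique, then run the agent once more with the
    critique appended to the prompt. Returns (initial_answer, refined_answer)."""
    initial = agent(question)
    critique = critic(question, initial)
    prompt = (
        f"{question}\n\n"
        f"Your previous answer:\n{initial}\n\n"
        f"Reviewer critique:\n{critique}\n\n"
        "Revise your answer to address the critique."
    )
    return initial, agent(prompt)


# Stubs standing in for real model calls.
def stub_agent(prompt: str) -> str:
    if "Reviewer critique" in prompt:
        return "Structural height 300 m; total height with antenna 330 m."
    return "The Eiffel Tower is 300 meters tall."


def stub_critic(question: str, answer: str) -> str:
    # The critique names the flaw without stating the corrected figure.
    return "MISSING CONTEXT: specify structural height versus total height with antenna."


initial, refined = refine_with_critique(
    "How tall is the Eiffel Tower?", stub_agent, stub_critic
)
print(initial)
print(refined)
```

No training is involved: the only change between the two passes is the critique text added to the prompt.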
report_2024.pdf that was not returned by any previous tool call" is actionable. "The agent made errors" is not.
<think> section and reduce the score proportionally. Do not hallucinate what the agent "would have done."
<think> block.
Exploring Reasoning Reward Model for Agents -- Fan et al., 2026. Focus on Section 3 (Agent-RRM architecture and the three-component feedback structure), Section 4 (Reagent-C/R/U integration strategies), and Appendix A.3 (annotation prompt template for the evaluation checklist). Code and models at github.com/kxfan2002/Reagent.