skills/skillxiv-v0.0.2-claude-opus-4.6/agentic-critical-training/SKILL.md
Improves LLM agent decision-making by training agents to critically evaluate actions before generating them, using RL on action-pair comparisons. Develops intrinsic reasoning about action quality without requiring reflection supervision.
npx skillsauth add ADu2021/skillXiv agentic-critical-training
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
LLM agents trained only by imitation learning learn what to do, not why certain actions are preferable. They never contrast successful actions against alternatives, so they acquire no sense of action quality. Reflection-based approaches attempt to address this but remain fundamentally imitative: they copy pre-generated reflections rather than reasoning autonomously about quality.
Agentic Critical Training (ACT) develops genuine action quality reasoning through a two-stage RL approach: first train the agent to critically compare action pairs via RL, then leverage this learned critical ability for direct action generation. This forces autonomous reasoning without reflection supervision.
Stage 1 - Critical Reasoning Training:
Stage 2 - Action Generation Training:
Key insight: Selection is simpler than generation but teaches reasoning. By training on comparison first, the model internalizes quality criteria before attempting generation.
Implement two-stage training: comparison selection followed by generation refinement.
Stage 1: Critical Training via Action Comparison
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer
class CriticalAgentTrainer:
    """Trains agents to critically evaluate actions through comparison."""

    def __init__(self, model_name="meta-llama/Llama-2-13b-hf", device="cuda"):
        # model_name must be a Hugging Face Hub id or a local checkpoint path
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.device = device
        self.model.to(device)

    def prepare_comparison_prompt(self, state, expert_action, model_action, task):
        """
        Construct prompt presenting both actions for comparison.

        Args:
            state: current environment/problem state
            expert_action: reference action known to be good
            model_action: alternative action generated by model
            task: task description
        Returns:
            prompt: formatted comparison prompt
        """
        # NOTE: the expert action always appears as "A" and is labelled
        # "Reference", so the position and label alone reveal the answer.
        # Consider shuffling the order and dropping the labels in practice.
        prompt = (
            f"Task: {task}\n"
            f"Current State: {state}\n"
            "Two possible actions:\n"
            f"Action A (Reference): {expert_action}\n"
            f"Action B (Alternative): {model_action}\n"
            "Which action is better for this situation? "
            'Respond with only "A" or "B" followed by brief reasoning.'
        )
        return prompt
    def get_model_selection(self, prompt):
        """
        Get model's action selection (A or B).

        Args:
            prompt: comparison prompt
        Returns:
            selection: 'A' or 'B'
            log_prob: log probability of selection (still attached to the graph)
        """
        input_ids = self.tokenizer.encode(prompt, return_tensors="pt").to(self.device)
        # No torch.no_grad() here: the returned log-probability must keep its
        # gradient so the policy-gradient loss can backpropagate through it.
        outputs = self.model(input_ids)
        logits = outputs.logits[:, -1, :]  # Next-token logits at the last position

        # Token ids for 'A' and 'B'; skip special tokens so [-1] is the letter itself
        token_A = self.tokenizer.encode('A', add_special_tokens=False)[-1]
        token_B = self.tokenizer.encode('B', add_special_tokens=False)[-1]

        # Log-probabilities renormalized over the two candidate tokens
        log_probs = torch.log_softmax(
            torch.stack([logits[0, token_A], logits[0, token_B]]), dim=0
        )
        selection = 'A' if log_probs[0] > log_probs[1] else 'B'
        return selection, log_probs[0 if selection == 'A' else 1]
    def compute_comparison_reward(self, state, expert_action, model_action, selection, task):
        """
        Compute reward for comparison selection.

        Args:
            selection: model's chosen action ('A' or 'B')
            expert_action: known good action (Action A)
            model_action: alternative action (Action B)
        Returns:
            reward: composite reward signal
        """
        reward_components = {}

        # Accuracy reward: did model select expert action?
        if selection == 'A':
            reward_components['accuracy'] = 1.0
        else:
            reward_components['accuracy'] = -1.0

        # Admissibility bonus: was the non-selected action at least valid?
        # This prevents excessive penalty for selecting reasonable alternatives
        non_selected = model_action if selection == 'A' else expert_action
        is_valid_action = self._verify_action_validity(non_selected, state, task)
        reward_components['admissibility'] = 0.2 if is_valid_action else -0.2

        # Format reward: ensure response format is correct
        # (already implicitly satisfied by A/B selection)
        reward_components['format'] = 0.0

        # Total weighted reward
        total_reward = (
            1.0 * reward_components['accuracy'] +
            0.3 * reward_components['admissibility'] +
            0.1 * reward_components['format']
        )
        return total_reward

    def _verify_action_validity(self, action, state, task):
        """Check if action is valid in this context."""
        # Task-specific validation logic
        # For now, simplified version: any non-empty action counts as valid
        return bool(action)
class ComparisonRewardOptimizer:
    """Optimize model for action comparison using GRPO."""

    def __init__(self, model, optimizer, trainer):
        self.model = model
        self.optimizer = optimizer
        self.trainer = trainer

    def training_step(self, batch_data):
        """
        Single GRPO training step on comparison task.

        Args:
            batch_data: list of {state, expert_action, model_action, task} dicts
        Returns:
            loss: scalar loss
        """
        all_rewards = []
        all_log_probs = []

        # Forward pass: get all selections
        for sample in batch_data:
            prompt = self.trainer.prepare_comparison_prompt(
                sample['state'],
                sample['expert_action'],
                sample['model_action'],
                sample['task']
            )
            selection, log_prob = self.trainer.get_model_selection(prompt)
            all_log_probs.append(log_prob)

            # Compute reward for this selection
            reward = self.trainer.compute_comparison_reward(
                sample['state'],
                sample['expert_action'],
                sample['model_action'],
                selection,
                sample['task']
            )
            all_rewards.append(reward)

        all_rewards = torch.tensor(all_rewards, device=self.model.device)
        all_log_probs = torch.stack(all_log_probs)

        # GRPO: group-relative advantages. The whole batch serves as the group
        # here, so a batch needs at least two samples for std() to be defined.
        advantages = all_rewards - all_rewards.mean()
        advantages = advantages / (all_rewards.std() + 1e-8)

        # Policy gradient: maximize expected reward
        loss = -(all_log_probs * advantages.detach()).mean()

        # Backward pass
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
        self.optimizer.step()
        return loss.item()
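To make the reward weighting and the group-relative normalization in Stage 1 concrete, here is a small worked sketch in plain Python. The helper names `composite_reward` and `group_relative_advantages` are illustrative and not part of the skill. The weights (1.0, 0.3, 0.1) and the ±0.2 admissibility term are copied from `compute_comparison_reward`; the sample standard deviation is used on the assumption that it should match the default behaviour of `torch.Tensor.std`.

```python
import statistics


def composite_reward(selected_expert: bool, other_action_valid: bool) -> float:
    """Weighted reward with the same terms as compute_comparison_reward."""
    accuracy = 1.0 if selected_expert else -1.0
    admissibility = 0.2 if other_action_valid else -0.2
    format_term = 0.0  # The A/B selection always satisfies the format
    return 1.0 * accuracy + 0.3 * admissibility + 0.1 * format_term


def group_relative_advantages(rewards, eps=1e-8):
    """Center rewards on the group mean and scale by the group std."""
    if len(rewards) < 2:
        # Assumption: treat a single-sample group as carrying no signal
        return [0.0 for _ in rewards]
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards)  # Sample std (n - 1 in the denominator)
    return [(r - mean) / (std + eps) for r in rewards]


# One group covering all four (selection, validity) combinations
rewards = [
    composite_reward(True, True),
    composite_reward(False, True),
    composite_reward(True, False),
    composite_reward(False, False),
]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:+.2f} advantage={a:+.3f}")
```

Two things follow from the arithmetic. A group in which every sample earns the same reward produces all-zero advantages and therefore no gradient, and the accuracy term dominates: the admissibility term shifts the reward by only 0.06 in either direction.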
Stage 2: Generation Fine-Tuning
class GenerationTrainer:
    """Fine-tune model for direct action generation after critical training."""

    def __init__(self, critic_model, optimizer, verifier_fn, tokenizer=None, max_new_tokens=64):
        self.model = critic_model
        self.optimizer = optimizer
        self.verifier = verifier_fn  # Function to verify action correctness
        # Hugging Face models carry no tokenizer attribute: accept one
        # explicitly or reload it from the checkpoint the model came from.
        if tokenizer is None:
            tokenizer = AutoTokenizer.from_pretrained(critic_model.name_or_path)
        self.tokenizer = tokenizer
        self.max_new_tokens = max_new_tokens  # Assumed cap on action length

    def prepare_generation_prompt(self, state, task):
        """
        Construct prompt for action generation.

        Args:
            state: problem state
            task: task description
        Returns:
            prompt: generation prompt
        """
        return (
            f"Task: {task}\n"
            f"Current State: {state}\n"
            "Generate the best action for this situation:"
        )

    def generation_training_step(self, batch_data):
        """
        GRPO training on direct action generation.

        Args:
            batch_data: list of {state, task, ground_truth_action} dicts
        Returns:
            loss: scalar loss
        """
        all_rewards = []
        all_log_probs = []
        for sample in batch_data:
            prompt = self.prepare_generation_prompt(sample['state'], sample['task'])
            inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
            prompt_len = inputs["input_ids"].shape[1]

            # Sample an action, then re-score it with gradients enabled
            sequence, generated_action = self._generate_action(inputs)
            logits = self.model(sequence).logits[:, :-1, :]
            token_log_probs = torch.log_softmax(logits, dim=-1).gather(
                -1, sequence[:, 1:].unsqueeze(-1)
            ).squeeze(-1)
            # Sum log-probs over the generated tokens only (skip the prompt)
            all_log_probs.append(token_log_probs[0, prompt_len - 1:].sum())

            # Verify action correctness
            is_correct = self.verifier(
                generated_action,
                sample['ground_truth_action'],
                sample['state'],
                sample['task']
            )
            reward = 1.0 if is_correct else -1.0
            all_rewards.append(reward)

        # Same GRPO logic as Stage 1
        all_rewards = torch.tensor(all_rewards, device=self.model.device)
        advantages = all_rewards - all_rewards.mean()
        advantages = advantages / (all_rewards.std() + 1e-8)
        loss = -(torch.stack(all_log_probs) * advantages.detach()).mean()
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()

    def _generate_action(self, inputs):
        """Sample an action; return the full token sequence and the decoded action."""
        prompt_len = inputs["input_ids"].shape[1]
        with torch.no_grad():
            sequence = self.model.generate(
                **inputs, do_sample=True, max_new_tokens=self.max_new_tokens
            )
        action = self.tokenizer.decode(sequence[0, prompt_len:], skip_special_tokens=True)
        return sequence, action.strip()
def full_agentic_training_pipeline(
    model_name,
    comparison_data,
    generation_data,
    num_comparison_epochs=5,
    num_generation_epochs=3
):
    """
    Complete two-stage training pipeline.

    Args:
        model_name: HuggingFace model identifier
        comparison_data: training data for comparison stage (iterable of batches)
        generation_data: training data for generation stage (iterable of batches)
        num_comparison_epochs: training iterations for Stage 1
        num_generation_epochs: training iterations for Stage 2
    Returns:
        trained_model: agent with critical reasoning capability
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Stage 1: Critical training
    print("[Stage 1] Critical Action Comparison Training...")
    trainer = CriticalAgentTrainer(model_name, device)
    model = trainer.model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    comparison_optimizer = ComparisonRewardOptimizer(model, optimizer, trainer)
    for epoch in range(num_comparison_epochs):
        epoch_loss = 0
        for batch in comparison_data:
            loss = comparison_optimizer.training_step(batch)
            epoch_loss += loss
        print(f"  Epoch {epoch+1}/{num_comparison_epochs} Loss: {epoch_loss:.4f}")

    # Stage 2: Generation fine-tuning
    print("[Stage 2] Action Generation Fine-Tuning...")

    def verify_action(generated, ground_truth, state, task):
        # Simplified; real version compares execution in environment
        return generated == ground_truth

    gen_trainer = GenerationTrainer(model, optimizer, verify_action)
    for epoch in range(num_generation_epochs):
        epoch_loss = 0
        for batch in generation_data:
            loss = gen_trainer.generation_training_step(batch)
            epoch_loss += loss
        print(f"  Epoch {epoch+1}/{num_generation_epochs} Loss: {epoch_loss:.4f}")
    return model
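The exact-match `verify_action` used in the pipeline rejects outputs that differ from the reference only in case or spacing. Short of executing actions in the environment, one cheap middle ground is to normalize both strings before comparing. The sketch below is an assumption about what counts as equivalent, not the skill's verifier; `normalize_action` and `verify_action_normalized` are hypothetical names.

```python
import re


def normalize_action(action: str) -> str:
    """Collapse whitespace runs, trim the ends, and lowercase the action text."""
    return re.sub(r"\s+", " ", action).strip().lower()


def verify_action_normalized(generated, ground_truth, state=None, task=None):
    """Same signature as verify_action; state and task are accepted but unused."""
    return normalize_action(generated) == normalize_action(ground_truth)


print(verify_action_normalized("  Go To  Desk 1\n", "go to desk 1"))
print(verify_action_normalized("open drawer", "close drawer"))
```

Lowercasing is only appropriate when the action space is case-insensitive. For actions such as shell commands or file paths, drop the `.lower()` call.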
Hyperparameters:
When to Apply:
When NOT to Apply:
Key Pitfalls:
Integration Notes: Works with any causal LLM; requires pairs of (expert, alternative) actions for comparison stage; comparison data can be auto-generated or human-collected.
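One way to auto-generate comparison data in the batch format consumed by `training_step` is sketched below. `build_comparison_batches` and the `propose_fn` callable (standing in for sampling an alternative action from the current policy) are hypothetical; only the dictionary keys are taken from the Stage 1 code.

```python
from typing import Callable, Dict, Iterable, List


def build_comparison_batches(
    expert_steps: Iterable[Dict[str, str]],
    propose_fn: Callable[[str, str], str],
    batch_size: int = 4,
) -> List[List[Dict[str, str]]]:
    """Pair each expert action with a proposed alternative and group into batches."""
    batches: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for step in expert_steps:
        alternative = propose_fn(step["state"], step["task"])
        # A pair whose two actions are identical carries no preference signal
        if alternative.strip() == step["expert_action"].strip():
            continue
        current.append({
            "state": step["state"],
            "task": step["task"],
            "expert_action": step["expert_action"],
            "model_action": alternative,
        })
        if len(current) == batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)  # Trailing batch may be smaller than batch_size
    return batches


# Toy data and a stand-in proposer, purely for illustration
steps = [
    {"state": "drawer closed", "task": "find the key", "expert_action": "open drawer"},
    {"state": "drawer open", "task": "find the key", "expert_action": "take key"},
    {"state": "holding key", "task": "unlock door", "expert_action": "go to door"},
]


def toy_propose(state: str, task: str) -> str:
    return "take key" if state == "drawer open" else "look around"


batches = build_comparison_batches(steps, toy_propose, batch_size=2)
print(len(batches), len(batches[0]))
```

In this toy run the second step is dropped because the proposed action equals the expert action. A trailing batch that holds a single sample would make the standard-deviation normalization in `training_step` undefined, so such batches are best merged or discarded.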
Evidence: Improves agent success rates 15-25% over standard imitation learning; develops genuine action quality understanding; enables agents to generalize to unseen action combinations.
Reference: https://arxiv.org/abs/2603.08706