skills/skillxiv-v0.0.2-claude-opus-4.6/compass-judger-reward-based-evaluation/SKILL.md
Build generalist judge models for evaluating LLM outputs using verifiable rewards and policy gradient training. Create a 7B model competitive with much larger judges through reward-guided optimization and critical thinking decomposition. Use when you need reliable automated evaluation of model outputs across diverse tasks.
Evaluating LLM outputs at scale requires automated judge models that can compare generations and rank quality. Existing approaches rely purely on supervised fine-tuning, limiting learning signal to labeled examples. CompassJudger-2 introduces verifiable rewards—binary signals indicating correctness at designated positions—combined with policy gradient optimization to train more capable judges. The 7B variant achieves competitive performance with significantly larger judge models through this hybrid training approach.
The key insight is that judge models benefit from two signals: supervision through examples of correct answers and judgments, and reinforcement through explicitly marked rewards at decision points. By decomposing judgment into five reasoning steps (demand analysis, strength identification, weakness identification, reasoning, and final prediction) and using rejection sampling to collect hard negatives, the model learns robust evaluation capabilities with minimal data.
CompassJudger-2 trains judges through a three-phase pipeline: (1) curate task-driven evaluation data with ground truth answers, (2) decompose judgment into structured reasoning steps via chain-of-thought supervision, (3) apply policy gradient optimization to reinforce correct predictions at designated positions. The verifiable reward signal is crucial: rather than comparing full outputs subjectively, rewards mark specific positions where correctness is verifiable (e.g., "is this the correct answer?").
This allows the model to learn what patterns correlate with correctness, not just memorize example judgments. The approach is agnostic to base model, requiring only supervised fine-tuning followed by policy gradient training without architectural changes.
Build diverse evaluation datasets with verifiable ground truths.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class JudgeCurationPipeline:
    """Curate and synthesize judge training data."""

    def __init__(self, curator_model_name: str = "Qwen/Qwen2.5-72B"):
        self.curator = AutoModelForCausalLM.from_pretrained(curator_model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(curator_model_name)
        self.curator.eval()

    def _generate(self, prompt: str, max_new_tokens: int, temperature: float) -> str:
        """Sample a completion and return only the newly generated text."""
        inputs = self.tokenizer(prompt, return_tensors='pt').to(self.curator.device)
        with torch.no_grad():
            output_ids = self.curator.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,  # temperature has no effect unless sampling is enabled
                temperature=temperature
            )
        # generate() returns prompt + completion; drop the prompt tokens
        new_tokens = output_ids[0, inputs['input_ids'].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)

    def reconstruct_outdated_data(self, old_examples: list) -> list:
        """
        Reconstruct outdated evaluation examples using a stronger model.

        Args:
            old_examples: List of {'input': ..., 'answer': ...} pairs

        Returns:
            Reconstructed examples with modern quality standards
        """
        reconstructed = []

        for example in old_examples:
            # Use curator model to re-evaluate old examples.
            # The previous answer is included so the curator can actually judge it.
            prompt = f"""Given this task:
{example['input']}

Previous answer:
{example.get('answer')}

Provide a comprehensive evaluation. Consider:
1. Whether the previous answer was correct
2. What a better answer would be
3. Key quality metrics

Your evaluation:"""

            evaluation = self._generate(prompt, max_new_tokens=300, temperature=0.7)

            reconstructed.append({
                'original_input': example['input'],
                'original_answer': example.get('answer'),
                'modern_evaluation': evaluation
            })

        return reconstructed

    def enhance_diversity_with_templates(self, base_examples: list) -> list:
        """
        Create diverse evaluation variants using prompt templates.

        Args:
            base_examples: Original evaluation examples

        Returns:
            Expanded dataset with template variations
        """
        templates = [
            "Evaluate this response for correctness: {response}",
            "Is this answer accurate? {response}",
            "Rate the quality of this response: {response}",
            "What is wrong with this answer? {response}",
            "How would you improve this response? {response}"
        ]

        expanded = []
        for example in base_examples:
            for template in templates:
                variant = example.copy()
                variant['instruction'] = template.format(response=example.get('response', ''))
                expanded.append(variant)

        return expanded

    def synthesize_knowledge_evaluations(self, domains: list, num_per_domain: int = 50) -> list:
        """
        Synthesize evaluation examples with ground truth answers.

        Args:
            domains: List of domains (math, science, coding, etc.)
            num_per_domain: Examples per domain

        Returns:
            Synthetic evaluation dataset with verifiable answers
        """
        synthetic = []

        for domain in domains:
            prompt = f"""Generate {num_per_domain} evaluation examples for the {domain} domain.

For each example, provide:
1. A question or task
2. A response to evaluate
3. Whether it's correct (yes/no)
4. Why it's correct or incorrect

Format:
Example 1:
Question: [question]
Response: [response]
Correct: [yes/no]
Reasoning: [reasoning]"""

            generated = self._generate(prompt, max_new_tokens=2000, temperature=0.8)

            # Parse generated examples
            # (simplified: in practice use robust parsing)
            synthetic.append({
                'domain': domain,
                'examples': generated
            })

        return synthetic
```
```python
class VerifiableRewardSystem:
    """Generate verifiable reward signals for judge training."""

    @staticmethod
    def generate_reward_signal(
        response: str,
        ground_truth: str,
        task_type: str = 'qa'
    ) -> int:
        """
        Create binary reward signal (1 if correct, 0 if wrong).

        Args:
            response: Model's response
            ground_truth: Correct answer/judgment
            task_type: Task category (qa, classification, scoring)

        Returns:
            Reward: 1 if correct, 0 if incorrect
        """
        if task_type == 'qa':
            # Exact match after trimming whitespace
            # (a semantic-similarity check could be substituted here)
            return 1 if response.strip() == ground_truth.strip() else 0

        elif task_type == 'classification':
            # Treat the last whitespace-separated token as the predicted class
            tokens = response.split()
            predicted_class = tokens[-1] if tokens else ""
            return 1 if predicted_class == ground_truth else 0

        elif task_type == 'scoring':
            # Numeric scoring: reward if within threshold
            try:
                pred_score = float(response.strip())
                true_score = float(ground_truth.strip())
                return 1 if abs(pred_score - true_score) < 0.5 else 0
            except ValueError:
                # Response or ground truth is not a parseable number
                return 0

        # Unknown task types earn no reward
        return 0

    @staticmethod
    def mark_verifiable_positions(full_text: str, answer_position: int) -> str:
        """
        Mark specific positions in text where rewards are verifiable.

        Args:
            full_text: Complete judge response
            answer_position: Character offset of final answer

        Returns:
            Marked text with reward positions indicated
        """
        # Insert marker before answer
        marked = full_text[:answer_position] + "[REWARD_POSITION]" + full_text[answer_position:]
        return marked
```
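The synthesis step above stores the curator's raw text and leaves parsing as an exercise. A minimal sketch of that parsing follows, assuming the curator follows the `Example N:` / `Question:` / `Response:` / `Correct:` / `Reasoning:` layout requested in the prompt. The function name `parse_synthetic_examples` and the output keys are hypothetical, not part of the original skill, and real curator output will likely need more defensive handling.

```python
import re

# Hypothetical helper: the field layout mirrors the format requested in the
# synthesis prompt; anything that does not match is skipped, not guessed at.
_FIELD_RE = re.compile(
    r"Question:\s*(?P<question>.*?)\s*"
    r"Response:\s*(?P<response>.*?)\s*"
    r"Correct:\s*(?P<correct>yes|no)\b.*?"
    r"Reasoning:\s*(?P<reasoning>.*)",
    re.IGNORECASE | re.DOTALL,
)


def parse_synthetic_examples(generated: str) -> list:
    """Split curator output on 'Example N:' headers and extract labelled fields."""
    examples = []
    chunks = re.split(
        r"^\s*Example\s+\d+\s*:\s*$",
        generated,
        flags=re.IGNORECASE | re.MULTILINE,
    )
    for chunk in chunks:
        match = _FIELD_RE.search(chunk)
        if match is None:
            continue  # malformed or empty chunk
        examples.append({
            "question": match.group("question").strip(),
            "response": match.group("response").strip(),
            # Binary label, usable directly as a verifiable ground truth
            "label": 1 if match.group("correct").lower() == "yes" else 0,
            "reasoning": match.group("reasoning").strip(),
        })
    return examples


sample = """Example 1:
Question: What is 2 + 2?
Response: 5
Correct: no
Reasoning: 2 + 2 equals 4.

Example 2:
Question: Capital of France?
Response: Paris
Correct: yes
Reasoning: Paris is the capital.
"""
parsed = parse_synthetic_examples(sample)
```

Skipping chunks that do not match keeps malformed generations out of the training set, at the cost of silently dropping some data; logging the skipped chunks would be a reasonable extension.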
Structure judgment into five interpretable reasoning steps.
```python
import re

# Whole-word match so that words such as "not" or "know" are not read as "no"
_VERDICT_RE = re.compile(r"\b(yes|no)\b", re.IGNORECASE)


class CriticalThinkingFramework:
    """Decompose judgment into structured reasoning steps."""

    @staticmethod
    def construct_judgment_prompt(
        task: str,
        response: str
    ) -> str:
        """
        Create prompt that decomposes judgment into five steps.

        Args:
            task: Original task/question
            response: Response to evaluate

        Returns:
            Structured prompt guiding judgment decomposition
        """
        prompt = f"""Evaluate this response using critical thinking:

Task: {task}

Response: {response}

Please analyze in five steps:

1. USER DEMAND ANALYSIS: What is the user really asking for? What are the requirements?
[Your analysis]

2. RESPONSE STRENGTHS: What aspects of this response are good or correct?
[Your analysis]

3. RESPONSE WEAKNESSES: What aspects are lacking, incorrect, or could be improved?
[Your analysis]

4. REASONING: Weighing strengths vs. weaknesses, what is the overall quality?
[Your analysis]

5. FINAL JUDGMENT: Is this response satisfactory? Yes or No.
[Your judgment]"""

        return prompt

    @staticmethod
    def extract_final_judgment(model_output: str) -> str:
        """
        Extract final judgment (Yes/No) from model output.

        Args:
            model_output: Full model generation with five reasoning steps

        Returns:
            Final judgment ("Yes", "No", or "Unknown" if none is found)
        """
        lines = model_output.split('\n')

        # Use the last "FINAL JUDGMENT" heading so that an echoed prompt
        # does not shadow the model's own answer
        heading_idx = None
        for i, line in enumerate(lines):
            if 'FINAL JUDGMENT' in line:
                heading_idx = i
        if heading_idx is None:
            return 'Unknown'

        # Check the text after the heading's colon (e.g. "5. FINAL JUDGMENT: Yes"),
        # then each following line
        tail = lines[heading_idx].partition(':')[2]
        for text in [tail] + lines[heading_idx + 1:]:
            verdicts = {v.lower() for v in _VERDICT_RE.findall(text)}
            # Skip lines that mention both options, such as "Yes or No."
            if len(verdicts) == 1:
                return 'Yes' if 'yes' in verdicts else 'No'

        return 'Unknown'
```
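The five-step format is what makes the reward verifiable: only the final step has to be checked against a label, while the preceding analysis stays free-form. The sketch below illustrates that idea under stated assumptions. `final_step_reward` is a hypothetical helper, not part of the original skill, and it assumes the judge output contains a `FINAL JUDGMENT` heading followed by a Yes/No verdict.

```python
import re


def final_step_reward(judge_output: str, ground_truth: str) -> int:
    """Binary reward read only from the text after the last FINAL JUDGMENT heading."""
    _, found, tail = judge_output.rpartition("FINAL JUDGMENT")
    if not found:
        return 0  # no designated position, so nothing can be verified
    # Whole-word match avoids reading "not" or "know" as a "no" verdict
    match = re.search(r"\b(yes|no)\b", tail, flags=re.IGNORECASE)
    if match is None:
        return 0
    return int(match.group(1).lower() == ground_truth.strip().lower())


# Illustrative judge output (made up for this example)
sample_output = """1. USER DEMAND ANALYSIS: The user wants the sum of 2 and 2.
2. RESPONSE STRENGTHS: The response is concise.
3. RESPONSE WEAKNESSES: The stated sum, 5, is wrong.
4. REASONING: A wrong result cannot satisfy a request for a sum.
5. FINAL JUDGMENT: No"""

reward_when_label_is_no = final_step_reward(sample_output, "No")
reward_when_label_is_yes = final_step_reward(sample_output, "Yes")
```

Treating a missing or unparseable verdict as reward 0 is a design choice made here for simplicity; it penalizes malformed outputs the same way as wrong ones.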
Train with both supervised loss and reinforcement signals.
```python
import torch.nn.functional as F
from torch.optim import AdamW

# Reuses torch, AutoModelForCausalLM, AutoTokenizer, and
# CriticalThinkingFramework from the blocks above.


class JudgeTrainer:
    """Train judge models with SFT + policy gradient."""

    def __init__(self, model_name: str, base_lr: float = 6e-5):
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.optimizer = AdamW(self.model.parameters(), lr=base_lr)
        self.base_lr = base_lr

    def rejection_sample_negatives(
        self,
        prompt: str,
        num_candidates: int = 8,
        ground_truth_label: str = 'Yes'
    ) -> list:
        """
        Generate multiple responses and label each by correctness.

        Args:
            prompt: Judge prompt
            num_candidates: Number of responses to generate
            ground_truth_label: Correct judgment to compare against

        Returns:
            List of (response, is_correct) tuples
        """
        inputs = self.tokenizer(prompt, return_tensors='pt').to(self.model.device)
        prompt_len = inputs['input_ids'].shape[1]
        responses = []

        for _ in range(num_candidates):
            with torch.no_grad():
                output_ids = self.model.generate(
                    **inputs,
                    max_new_tokens=200,
                    temperature=0.8,
                    top_p=0.95,
                    do_sample=True
                )

            # Decode only the completion; generate() also returns the prompt tokens
            response = self.tokenizer.decode(
                output_ids[0, prompt_len:], skip_special_tokens=True
            )
            judgment = CriticalThinkingFramework.extract_final_judgment(response)
            is_correct = (judgment == ground_truth_label)

            responses.append((response, is_correct))

        return responses

    def sequence_log_prob(self, prompt: str, response: str) -> torch.Tensor:
        """Sum of token log-probabilities of `response` given `prompt` (keeps gradients)."""
        prompt_len = len(self.tokenizer.encode(prompt))
        input_ids = self.tokenizer.encode(
            prompt + response, return_tensors='pt'
        ).to(self.model.device)

        logits = self.model(input_ids).logits[:, :-1, :]  # position t predicts token t+1
        targets = input_ids[:, 1:]
        token_log_probs = F.log_softmax(logits, dim=-1).gather(
            -1, targets.unsqueeze(-1)
        ).squeeze(-1)

        # Keep response tokens only; the boundary is approximate because the
        # concatenated string may tokenize differently at the seam
        return token_log_probs[:, prompt_len - 1:].sum()

    def compute_margin_loss(
        self,
        positive_logits: torch.Tensor,
        negative_logits: torch.Tensor,
        gamma: float = 10.0
    ) -> torch.Tensor:
        """
        Compute margin policy loss pushing positive away from negative.

        Loss = max(0, γ - (score_pos - score_neg))

        Args:
            positive_logits: Sequence log-probability of correct responses
            negative_logits: Sequence log-probability of incorrect responses
            gamma: Margin parameter

        Returns:
            Margin loss value
        """
        margin = positive_logits - negative_logits
        loss = F.relu(gamma - margin)
        return loss.mean()

    def training_step(
        self,
        prompt: str,
        ground_truth_label: str,
        correct_response: str = None,
        lambda_sft: float = 0.5,
        lambda_rl: float = 0.5,
        margin_gamma: float = 10.0
    ) -> dict:
        """
        Single training step combining SFT and policy gradient.

        Args:
            prompt: Judge prompt
            ground_truth_label: Correct judgment
            correct_response: Reference correct response (optional)
            lambda_sft: Weight for SFT loss
            lambda_rl: Weight for RL loss
            margin_gamma: Margin parameter

        Returns:
            Loss dictionary with individual components (None where not computed)
        """
        self.model.train()

        # SFT Loss: standard language modeling on correct response
        sft_loss = None
        if correct_response:
            input_ids = self.tokenizer.encode(
                prompt + correct_response,
                return_tensors='pt'
            ).to(self.model.device)

            labels = input_ids.clone()
            # Only compute loss on response part
            prompt_len = len(self.tokenizer.encode(prompt))
            labels[:, :prompt_len] = -100  # Ignore prompt tokens

            outputs = self.model(input_ids, labels=labels)
            sft_loss = outputs.loss

        # Rejection Sampling: generate correct and incorrect responses
        sampled = self.rejection_sample_negatives(
            prompt,
            num_candidates=8,
            ground_truth_label=ground_truth_label
        )

        positive_responses = [r for r, correct in sampled if correct]
        negative_responses = [r for r, correct in sampled if not correct]

        # Policy Gradient Loss: margin between positive and negative
        rl_loss = None
        if positive_responses and negative_responses:
            # Gradients must flow through these scores, so no torch.no_grad() here
            pos_score = self.sequence_log_prob(prompt, positive_responses[0])
            neg_score = self.sequence_log_prob(prompt, negative_responses[0])
            rl_loss = self.compute_margin_loss(pos_score, neg_score, margin_gamma)

        # Combined loss
        losses = []
        if sft_loss is not None:
            losses.append(lambda_sft * sft_loss)
        if rl_loss is not None:
            losses.append(lambda_rl * rl_loss)

        if not losses:
            # No reference response and no positive/negative pair: nothing to optimize
            return {'total_loss': None, 'sft_loss': None, 'rl_loss': None}

        total_loss = sum(losses)

        # Backward and optimize
        self.optimizer.zero_grad()
        total_loss.backward()
        self.optimizer.step()

        return {
            'total_loss': total_loss.item(),
            'sft_loss': sft_loss.item() if sft_loss is not None else None,
            'rl_loss': rl_loss.item() if rl_loss is not None else None
        }

    def train_epoch(self, training_data: list, epochs: int = 1):
        """Train for multiple epochs."""
        for epoch in range(epochs):
            total_loss = 0.0

            for item in training_data:
                loss_dict = self.training_step(
                    prompt=item['prompt'],
                    ground_truth_label=item['label'],
                    correct_response=item.get('correct_response')
                )
                # Steps with no usable signal report None and contribute nothing
                total_loss += loss_dict['total_loss'] or 0.0

            avg_loss = total_loss / max(len(training_data), 1)
            print(f"Epoch {epoch+1}: avg_loss={avg_loss:.4f}")
```
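To make the margin term concrete, the arithmetic can be worked through on plain numbers. This is an illustrative sketch only: the log-probabilities and SFT loss below are made-up values, and the standalone functions `margin_loss` and `combined_loss` are hypothetical restatements of the formula in the docstring above with the weights from `training_step`, not code from the original skill.

```python
def margin_loss(score_pos: float, score_neg: float, gamma: float = 10.0) -> float:
    """Hinge-style margin: zero once the positive leads the negative by at least gamma."""
    return max(0.0, gamma - (score_pos - score_neg))


def combined_loss(
    sft_loss: float,
    rl_loss: float,
    lambda_sft: float = 0.5,
    lambda_rl: float = 0.5,
) -> float:
    """Weighted sum of the supervised and margin terms."""
    return lambda_sft * sft_loss + lambda_rl * rl_loss


# Made-up sequence log-probabilities for a correct and an incorrect judgment.
# Gap of 3 is below the margin of 10, so the pair is still penalized.
close_pair = margin_loss(score_pos=-12.0, score_neg=-15.0)

# Gap of 18 exceeds the margin, so this pair contributes no loss.
separated_pair = margin_loss(score_pos=-12.0, score_neg=-30.0)

# With a made-up SFT loss of 1.2, the close pair dominates the total.
total = combined_loss(sft_loss=1.2, rl_loss=close_pair)
```

Because the margin term drops to zero for well-separated pairs, training effort concentrates on pairs the judge still confuses, which is consistent with the emphasis on hard negatives from rejection sampling.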
Test judge accuracy across diverse benchmarks.
```python
# Reuses torch, AutoModelForCausalLM, AutoTokenizer, and
# CriticalThinkingFramework from the blocks above.


class JudgeEvaluator:
    """Evaluate judge model accuracy on evaluation benchmarks."""

    def __init__(self, judge_model_name: str):
        self.judge = AutoModelForCausalLM.from_pretrained(judge_model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(judge_model_name)
        self.judge.eval()

    def evaluate_on_benchmark(
        self,
        benchmark: list,
        num_samples: int = None
    ) -> dict:
        """
        Evaluate judge on benchmark dataset.

        Args:
            benchmark: List of {'task': ..., 'response': ..., 'label': ...}
            num_samples: Limit evaluation to N samples

        Returns:
            Accuracy and detailed results
        """
        if num_samples is not None:
            benchmark = benchmark[:num_samples]

        correct = 0
        total = len(benchmark)

        for item in benchmark:
            prompt = CriticalThinkingFramework.construct_judgment_prompt(
                item['task'],
                item['response']
            )

            inputs = self.tokenizer(prompt, return_tensors='pt').to(self.judge.device)

            with torch.no_grad():
                output_ids = self.judge.generate(
                    **inputs,
                    max_new_tokens=200,
                    do_sample=False  # greedy decoding (the "temperature 0" setting)
                )

            # Decode only the completion so the echoed prompt is not parsed
            judgment = self.tokenizer.decode(
                output_ids[0, inputs['input_ids'].shape[1]:],
                skip_special_tokens=True
            )
            final_answer = CriticalThinkingFramework.extract_final_judgment(judgment)

            is_correct = (final_answer == item['label'])
            correct += int(is_correct)

        accuracy = correct / total if total > 0 else 0.0

        return {
            'accuracy': accuracy,
            'correct': correct,
            'total': total
        }
```
| Parameter | Value | Notes |
|-----------|-------|-------|
| Base Learning Rate | 6e-5 | Standard for fine-tuning 7B models |
| Batch Size | 512 | During SFT phase |
| Rejection Sampling Candidates | 8 | Balance diversity and computation |
| Margin Parameter γ | 10.0 | Controls positive/negative separation |
| SFT Loss Weight λ_sft | 0.5 | Balance with RL component |
| RL Loss Weight λ_rl | 0.5 | Equal weight to SFT initially |
| Training Epochs | 1 | Single pass sufficient with hard negatives |
| Temperature (inference) | 0.0 | Greedy decoding for consistency |
Chen, X., Liu, S., Wang, Y., et al. (2025). CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards. arXiv preprint arXiv:2507.09104.