skills/skillxiv-v0.0.2-claude-opus-4.6/cooper-co-optimized-policy-reward/SKILL.md
Joint optimization of policy and reward models in LLM reinforcement learning by leveraging rule-based reward precision and dynamically constructing training pairs to prevent reward hacking and improve performance.
npx skillsauth add ADu2021/skillXiv cooper-co-optimized-policy-reward
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Cooper addresses a fundamental challenge in RLHF (Reinforcement Learning from Human Feedback) for large language models: the tension between rule-based and model-based reward systems. Rule-based rewards lack robustness to distribution shift, while learned reward models are vulnerable to reward hacking—where the policy learns to exploit the reward model rather than optimize genuine task performance.
The core insight is that jointly optimizing both the policy and reward model, while dynamically updating the reward model during training, can mitigate reward hacking and improve end-to-end performance.
Set up your base language model (policy) and initialize a reward model architecture. The reward model should accept the concatenated inputs [prompt, response, reference_answer], which is the order the reward-model training code passes them in.
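# PyTorch/Hugging Face style pseudocode
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM

policy_model = AutoModelForCausalLM.from_pretrained("llm-base")  # "llm-base" is a placeholder checkpoint name
reward_model = RewardModel(hidden_size=768, num_labels=1)  # RewardModel is a user-defined scorer, not a library class
optimizer_policy = AdamW(policy_model.parameters(), lr=5e-6)
optimizer_reward = AdamW(reward_model.parameters(), lr=1e-5)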
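The concatenated reward-model input could be assembled as plain text before tokenization. The following is a minimal sketch, assuming a hypothetical helper `build_reward_input` and made-up field labels (`Prompt`, `Response`, `Reference`); the source does not specify the actual template or separator.

```python
def build_reward_input(prompt: str, response: str, reference: str) -> str:
    """Join the three fields into one string for the reward model.

    Hypothetical helper: the labels and the blank-line separator are
    assumptions, not a format taken from the source.
    """
    parts = [
        ("Prompt", prompt),
        ("Response", response),
        ("Reference", reference),
    ]
    # Strip stray whitespace so formatting noise does not reach the scorer
    return "\n\n".join(f"{label}: {text.strip()}" for label, text in parts)


example = build_reward_input("What is 2 + 2?", " 4 ", "4")
print(example)
```

Keeping the template in one function makes it harder for reward-model training and policy scoring to drift apart in how they format inputs.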
Generate policy rollouts and obtain gold-standard reference answers. Annotate with rule-based rewards where available (exact match, constraint satisfaction, etc.).
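# Generate rollouts from the current policy.
# Assumes `tokenizer` is the policy's tokenizer and `gold_standard_answers`
# is indexed by position in `prompt_batch`.
rollouts = []
for prompt_id, prompt in enumerate(prompt_batch):
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = policy_model.generate(**inputs, max_length=256)
    # Drop the prompt tokens so only the generated continuation is decoded
    prompt_len = inputs["input_ids"].shape[1]
    response = tokenizer.decode(output_ids[0, prompt_len:], skip_special_tokens=True)
    reference = gold_standard_answers[prompt_id]
    rule_reward = compute_rule_based_reward(response, reference)
    rollouts.append({
        'prompt': prompt,
        'response': response,
        'reference': reference,
        'rule_reward': rule_reward
    })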
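The rollout loop calls `compute_rule_based_reward` without defining it. One possible rule is a normalized exact match against the reference, sketched below. Both the normalization (lowercasing, whitespace collapsing) and the binary 1.0/0.0 scale are assumptions for illustration; real tasks may need answer extraction or constraint checks instead.

```python
import re


def normalize_answer(text: str) -> str:
    """Lowercase, trim, and collapse whitespace so trivial formatting
    differences do not count as errors."""
    return re.sub(r"\s+", " ", text.strip().lower())


def compute_rule_based_reward(response: str, reference: str) -> float:
    """Return 1.0 on a normalized exact match, else 0.0 (assumed rule)."""
    return 1.0 if normalize_answer(response) == normalize_answer(reference) else 0.0


print(compute_rule_based_reward("  Paris ", "paris"))
print(compute_rule_based_reward("Lyon", "Paris"))
```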
Dynamically select pairs where rule-based rewards provide high-confidence signals. Create pairs by matching high-reward responses (positive) with low-reward responses (negative) from the same or similar prompts.
# Construct training pairs using rule-based signal
pairs = []
for prompt_group in group_by_prompt(rollouts):
    high_reward_samples = [r for r in prompt_group if r['rule_reward'] > threshold]
    low_reward_samples = [r for r in prompt_group if r['rule_reward'] < threshold]
    for pos in high_reward_samples:
        for neg in low_reward_samples[:2]:  # Limit negatives per positive
            pairs.append((pos, neg))
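The pairing logic above relies on a `group_by_prompt` helper that is not shown. A self-contained sketch of both pieces follows, using plain dictionaries. The function signatures, the default `threshold=0.5`, and the `max_negatives` parameter are assumptions chosen for illustration, not values from the source.

```python
from collections import defaultdict


def group_by_prompt(rollouts):
    """Bucket rollout dicts by their 'prompt' field, in first-seen order."""
    groups = defaultdict(list)
    for rollout in rollouts:
        groups[rollout["prompt"]].append(rollout)
    return list(groups.values())


def construct_pairs(rollouts, threshold=0.5, max_negatives=2):
    """Pair each above-threshold rollout with up to max_negatives
    below-threshold rollouts generated for the same prompt."""
    pairs = []
    for group in group_by_prompt(rollouts):
        positives = [r for r in group if r["rule_reward"] > threshold]
        negatives = [r for r in group if r["rule_reward"] < threshold]
        for pos in positives:
            for neg in negatives[:max_negatives]:
                pairs.append((pos, neg))
    return pairs


rollouts = [
    {"prompt": "p1", "response": "a", "rule_reward": 1.0},
    {"prompt": "p1", "response": "b", "rule_reward": 0.0},
    {"prompt": "p1", "response": "c", "rule_reward": 0.0},
    {"prompt": "p1", "response": "d", "rule_reward": 0.0},
    {"prompt": "p2", "response": "e", "rule_reward": 1.0},
]
pairs = construct_pairs(rollouts)
print(len(pairs))
```

Prompts whose rollouts are all correct or all incorrect (such as `p2` here) contribute no pairs, because there is no contrast for the reward model to learn from.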
Train the VerifyRM reward model using the constructed pairs. The model should learn to prefer responses that match the reference answer and follow task constraints.
# Train reward model with reference answers as additional context
for pos_sample, neg_sample in pairs:
pos_input = tokenize([pos_sample['prompt'], pos_sample['response'],
pos_sample['reference']])
neg_input = tokenize([neg_sample['prompt'], neg_sample['response'],
neg_sample['reference']])
pos_score = reward_model(pos_input)
neg_score = reward_model(neg_input)
# Preference loss (margin-based or ranking)
loss = max(0, margin - (pos_score - neg_score))
optimizer_reward.zero_grad()
loss.backward()
optimizer_reward.step()
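The margin-based preference loss in the loop above can be written out explicitly. Here $s^{+}$ and $s^{-}$ denote the reward-model scores of the positive and negative sample and $m$ is the margin; these symbols are introduced only for illustration.

```latex
\mathcal{L}_{\text{pair}} = \max\bigl(0,\; m - (s^{+} - s^{-})\bigr)
```

For example, with $m = 1$, $s^{+} = 2.0$, and $s^{-} = 1.5$, the loss is $\max(0,\, 1 - 0.5) = 0.5$. Once $s^{+} - s^{-} \ge m$, the loss is zero and the pair no longer contributes a gradient.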
Run PPO or a similar RL algorithm, using the newly trained reward model to score policy rollouts. The key difference from standard RLHF is that the reward model itself keeps improving during training.
# Policy optimization with updated reward model
for epoch in range(num_ppo_epochs):
policy_rollouts = policy_model.generate(prompts, max_length=256)
# Score with updated reward model
with torch.no_grad():
policy_rollout_inputs = tokenize(prompts, policy_rollouts, references)
rewards = reward_model(policy_rollout_inputs)
# PPO update
advantages = compute_advantages(rewards, value_baseline)
policy_loss = -torch.mean(log_probs * advantages)
optimizer_policy.zero_grad()
policy_loss.backward()
optimizer_policy.step()
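The policy step calls `compute_advantages(rewards, value_baseline)` without showing it. A minimal sketch on plain Python lists follows, assuming a scalar baseline that defaults to the batch mean, followed by standard-deviation scaling. This is one common choice, not necessarily the estimator used by Cooper, and a tensor version would follow the same arithmetic.

```python
from statistics import mean, pstdev


def compute_advantages(rewards, baseline=None, eps=1e-8):
    """Subtract a baseline from each reward, then scale by the standard deviation.

    Assumption: when no baseline is given, the batch mean is used.
    eps avoids division by zero when all rewards are identical.
    """
    if baseline is None:
        baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    scale = pstdev(centered) + eps
    return [c / scale for c in centered]


advantages = compute_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
```

Responses scoring above the baseline get positive advantages and are reinforced; those below get negative advantages and are discouraged.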
Repeat rollout collection, pair construction, reward-model training, and policy optimization. The intent is that the reward model becomes more accurate with each round, so the policy learns to optimize genuine task performance rather than exploit artifacts of the reward signal.
# Main training loop
for iteration in range(num_iterations):
# Collect new rollouts with current policy
rollouts = generate_rollouts(policy_model, prompts, references)
# Construct pairs using rule-based signals
pairs = construct_pairs(rollouts, rule_reward_fn)
# Update reward model
train_reward_model(reward_model, pairs, num_epochs=3)
# Update policy with new reward model
train_policy(policy_model, reward_model, rollouts, num_ppo_steps=100)
# Evaluate on held-out test set
eval_score = evaluate(policy_model, test_prompts, gold_answers)
print(f"Iteration {iteration}: Eval score = {eval_score}")
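The `evaluate` call in the main loop is left undefined. One way to sketch it is exact-match accuracy on the held-out set. To keep the example runnable without a model, this version takes a hypothetical `generate_fn` callable (prompt string in, answer string out) in place of `policy_model`; that substitution and the exact-match metric are assumptions.

```python
def evaluate(generate_fn, test_prompts, gold_answers):
    """Exact-match accuracy of generate_fn over paired prompts and answers."""
    if len(test_prompts) != len(gold_answers):
        raise ValueError("test_prompts and gold_answers must have the same length")
    if not test_prompts:
        return 0.0
    correct = sum(
        generate_fn(prompt).strip() == gold.strip()
        for prompt, gold in zip(test_prompts, gold_answers)
    )
    return correct / len(test_prompts)


# Stand-in "policy" that answers from a lookup table (illustration only).
answers = {"2+2": "4", "3+3": "7"}
score = evaluate(lambda p: answers[p], ["2+2", "3+3"], ["4", "6"])
print(score)
```

Because this metric uses gold answers rather than the learned reward model, a rising reward-model score alongside a flat or falling evaluation score is a useful warning sign of reward hacking.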
Cooper (2508.05613): https://arxiv.org/abs/2508.05613
Joint optimization of policy and reward models prevents reward hacking and improves end-to-end RL performance for LLM alignment, achieving 0.54% accuracy gains on instruction-following benchmarks.