skills/skillxiv-v0.0.2-claude-opus-4.6/clipo-contrastive-policy-optimization/SKILL.md
Add contrastive learning to reinforcement learning with verifiable rewards (RLVR) to generate dense auxiliary rewards. Pull correct reasoning trajectories together in embedding space and push erroneous ones away, amplifying invariant reasoning patterns.
Sparse verifiable rewards (binary success/failure) provide limited training signal for complex reasoning tasks. CLIPO adds contrastive learning as an auxiliary objective: it embeds reasoning trajectories in latent space and applies InfoNCE loss to cluster correct responses together while repelling errors.
The insight is that successful reasoning paths share consistent underlying logic structures. By enforcing this structure in embedding space, contrastive learning acts as a denoising mechanism, amplifying invariant reasoning patterns while suppressing spurious shortcuts and hallucinations.
CLIPO extends RLVR policy optimization algorithms (GRPO, GSPO, DAPO, GMPO) by introducing a lightweight contrastive head and an auxiliary reward computed from its embeddings.
Combining the verifiable reward with this auxiliary reward keeps optimization from collapsing onto narrow heuristics while staying grounded in the task-specific verifiable reward.
Embed each reasoning trajectory by mean-pooling the model's token hidden states and passing the pooled vector through a small projection network.
```python
import torch
import torch.nn as nn


class TrajectoryContrastiveHead(nn.Module):
    def __init__(self, hidden_dim, embedding_dim=256, projection_dim=128):
        super().__init__()
        # Encode trajectory via mean pooling + projection
        self.projection = nn.Linear(hidden_dim, projection_dim)
        self.contrastive_head = nn.Sequential(
            nn.Linear(projection_dim, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, projection_dim),
        )

    def forward(self, hidden_states):
        """
        hidden_states: (batch_size, seq_len, hidden_dim)
        returns: (batch_size, projection_dim) embeddings
        """
        # Mean pooling over sequence dimension
        trajectory_repr = hidden_states.mean(dim=1)  # (batch, hidden_dim)
        # Project through network
        projected = self.projection(trajectory_repr)
        embedding = self.contrastive_head(projected)
        return embedding
```
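The head above averages over every position, so padding tokens in a batched input are pooled along with real ones. A minimal sketch of a padding-aware alternative follows, assuming an `attention_mask` tensor in the usual 1/0 convention is available. `masked_mean_pool` is a hypothetical helper introduced here for illustration and is not part of the original skill.

```python
import torch


def masked_mean_pool(hidden_states, attention_mask):
    """Mean over real tokens only (hypothetical helper, not from the source).

    hidden_states: (batch, seq_len, hidden_dim)
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # (batch, hidden_dim)
    counts = mask.sum(dim=1).clamp(min=1.0)                      # avoid dividing by zero
    return summed / counts


# The last position is padding and should not affect the pooled vector.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
plain = hidden.mean(dim=1)  # unmasked mean, pulled toward the padding vector
```

If this fits your setup, the pooled tensor could replace `hidden_states.mean(dim=1)` inside `forward`, at the cost of passing the mask through.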
Within each batch, group trajectories by correctness: pairs with the same label are positives and pairs with different labels are negatives.
```python
def infonce_loss(embeddings, labels, temperature=0.1):
    """
    InfoNCE loss for trajectory embeddings.

    embeddings: (batch_size, embedding_dim)
    labels: (batch_size,) binary correctness labels
    temperature: temperature parameter for softmax
    """
    # Normalize embeddings so the dot product is a cosine similarity
    embeddings = torch.nn.functional.normalize(embeddings, dim=1)
    # Compute temperature-scaled similarity matrix
    similarity_matrix = torch.mm(embeddings, embeddings.t()) / temperature
    # Positive pairs share a correctness label
    labels_expanded = labels.unsqueeze(1)
    positive_mask = (labels_expanded == labels_expanded.t()).float()
    # Set diagonal (self-similarity) to 0 for positive mask
    positive_mask.fill_diagonal_(0)
    # Negative mask is the complement, again excluding the diagonal
    negative_mask = 1.0 - positive_mask
    negative_mask.fill_diagonal_(0)
    # InfoNCE: -log(sum(exp(sim_pos)) / (sum(exp(sim_pos)) + sum(exp(sim_neg))))
    exp_sim = torch.exp(similarity_matrix)
    # Sum of positive similarities per row
    pos_sum = (exp_sim * positive_mask).sum(dim=1)
    # Sum of negative similarities per row
    neg_sum = (exp_sim * negative_mask).sum(dim=1)
    # Per-anchor loss (epsilon avoids division by zero and log(0))
    infonce = -torch.log((pos_sum + 1e-8) / (pos_sum + neg_sum + 1e-8))
    # Average only over anchors with at least one positive; an anchor with
    # no positive would otherwise contribute a large, uninformative loss
    has_positive = positive_mask.sum(dim=1) > 0
    if not has_positive.any():
        return infonce.sum() * 0.0  # zero loss that stays attached to the graph
    return infonce[has_positive].mean()
```
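In symbols, the per-anchor quantity computed by `infonce_loss` can be written as follows. This describes the code in this skill, not necessarily the exact formulation in the CLIPO paper. The notation ($z_i$ for the embedding, $y_i$ for the correctness label, $\tau$ for the temperature, $P(i)$ for the positive set) is introduced here for illustration.

```latex
s_{ij} = \frac{z_i^{\top} z_j}{\lVert z_i \rVert \, \lVert z_j \rVert},
\qquad
P(i) = \{\, p \neq i : y_p = y_i \,\},
\qquad
\mathcal{L}_i = -\log
  \frac{\sum_{p \in P(i)} \exp\!\left(s_{ip} / \tau\right)}
       {\sum_{k \neq i} \exp\!\left(s_{ik} / \tau\right)}
```

The denominator runs over every other trajectory in the batch, positives and negatives alike, and the batch loss is the average of $\mathcal{L}_i$ over anchors.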
Transform contrastive objective into a dense reward signal that complements verifiable reward.
```python
def contrastive_reward(
    embeddings,
    labels,
    verifiable_rewards,
    contrastive_weight=0.5,
    temperature=0.1,
):
    """
    Compute combined reward: verifiable + contrastive auxiliary.

    Returns (total_reward, contrastive_reward_signal), one value per trajectory.
    """
    # Normalize embeddings so the dot product is a cosine similarity
    embeddings_norm = torch.nn.functional.normalize(embeddings, dim=1)
    # Compute temperature-scaled similarity matrix
    similarity_matrix = torch.mm(embeddings_norm, embeddings_norm.t()) / temperature
    # Positive mask: other trajectories with the same correctness label
    labels_expanded = labels.unsqueeze(1)
    positive_mask = (labels_expanded == labels_expanded.t()).float()
    positive_mask.fill_diagonal_(0)
    # Average positive similarity per trajectory
    pos_similarities = (similarity_matrix * positive_mask).sum(dim=1) / (
        positive_mask.sum(dim=1) + 1e-8
    )
    # Negative mask: trajectories with the opposite label
    negative_mask = 1.0 - positive_mask
    negative_mask.fill_diagonal_(0)
    # Average negative similarity per trajectory
    neg_similarities = (similarity_matrix * negative_mask).sum(dim=1) / (
        negative_mask.sum(dim=1) + 1e-8
    )
    # Contrastive reward: pull positives, push negatives
    contrastive_reward_signal = pos_similarities - neg_similarities
    # Combine with verifiable reward
    total_reward = verifiable_rewards + contrastive_weight * contrastive_reward_signal
    return total_reward, contrastive_reward_signal
```
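One consequence of dividing by the temperature before averaging is worth checking. Each average lies in [-1/temperature, 1/temperature], so the contrastive signal ranges over [-2/temperature, 2/temperature]. With the defaults above (temperature 0.1, weight 0.5) the auxiliary term can reach ±10, far larger than a binary verifiable reward. The sketch below shows one possible way to bring the signal back to [-1, 1]. `contrastive_signal_bound` and `rescale_signal` are hypothetical helpers, not something the source or the paper prescribes.

```python
import torch


def contrastive_signal_bound(temperature):
    """Largest possible |pos_mean - neg_mean| when both are cosine / temperature."""
    return 2.0 / temperature


def rescale_signal(signal, temperature):
    """Hypothetical helper: map the raw contrastive signal into [-1, 1]."""
    return signal / contrastive_signal_bound(temperature)


# Extreme and moderate raw signals at temperature 0.1 (bound is 20).
raw = torch.tensor([20.0, -20.0, 5.0])
scaled = rescale_signal(raw, temperature=0.1)
```

After rescaling, `contrastive_weight` directly expresses how large the auxiliary term may get relative to the verifiable reward.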
Extend baseline policy optimization with contrastive auxiliary objective.
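```python
def train_step_with_clipo(
    model,
    input_ids,
    verifiable_rewards,
    contrastive_head,
    policy_optimizer,
    contrastive_weight=0.5,
    temperature=0.1,
):
    """
    Single training step combining RLVR with contrastive learning.

    Assumes a causal LM whose output exposes `logits` and `hidden_states`,
    and that `input_ids` holds the sampled trajectories.
    """
    policy_optimizer.zero_grad()
    # Forward pass through model
    outputs = model(input_ids, output_hidden_states=True)
    # Extract trajectory embeddings from the last layer
    hidden_states = outputs.hidden_states[-1]
    trajectory_embeddings = contrastive_head(hidden_states)
    # Compute correctness labels (binarize verifiable rewards)
    correctness_labels = (verifiable_rewards > 0).long()
    # Compute combined rewards
    total_rewards, contrastive_signal = contrastive_reward(
        trajectory_embeddings,
        correctness_labels,
        verifiable_rewards,
        contrastive_weight=contrastive_weight,
        temperature=temperature,
    )
    # Rewards are targets, not differentiable quantities, so the contrastive
    # head is not updated by this loss
    total_rewards = total_rewards.detach()
    # Policy gradient: GRPO style (placeholder). Score each sequence by the
    # log-probability of the tokens it actually contains; logits at position t
    # predict token t + 1. No prompt or padding mask is applied here.
    log_probs = outputs.logits[:, :-1].log_softmax(dim=-1)
    token_log_probs = log_probs.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    sequence_log_probs = token_log_probs.sum(dim=1)
    baseline = total_rewards.mean()
    advantages = total_rewards - baseline
    policy_loss = -(sequence_log_probs * advantages).mean()
    # Backward pass
    policy_loss.backward()
    policy_optimizer.step()
    return {
        'policy_loss': policy_loss.item(),
        'contrastive_signal_mean': contrastive_signal.mean().item(),
        'total_reward_mean': total_rewards.mean().item(),
    }
```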
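The training step uses a single batch-mean baseline as a placeholder. GRPO-style methods instead standardize rewards within the group of responses sampled for the same prompt. A minimal sketch of that normalization follows, assuming the samples for each prompt are stored contiguously and every prompt has the same number of samples. `group_relative_advantages` and its `group_size` and `eps` parameters are hypothetical names introduced here, and the use of the population standard deviation is a choice made for this sketch.

```python
import torch


def group_relative_advantages(rewards, group_size, eps=1e-6):
    """Standardize rewards within each prompt's group (hypothetical helper).

    rewards: (num_prompts * group_size,), samples for a prompt stored contiguously
    returns: advantages with the same shape as `rewards`
    """
    grouped = rewards.view(-1, group_size)                   # (num_prompts, group_size)
    mean = grouped.mean(dim=1, keepdim=True)
    std = grouped.std(dim=1, keepdim=True, unbiased=False)   # population std per group
    return ((grouped - mean) / (std + eps)).view(-1)


# Two prompts with four sampled responses each.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0])
adv = group_relative_advantages(rewards, group_size=4)
```

In the step above, the output of such a function could stand in for `total_rewards - baseline`, with `total_rewards` as the input.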
When to Use:
When NOT to Use:
Hyperparameter Tuning:
Common Pitfalls:
CLIPO paper on arXiv