ADu2021/clipo-contrastive-policy-optimization

Name: clipo-contrastive-policy-optimization
Author: ADu2021

skills/skillxiv-v0.0.2-claude-opus-4.6/clipo-contrastive-policy-optimization/SKILL.md

npx skillsauth add ADu2021/skillXiv clipo-contrastive-policy-optimization

Technique: Dense Contrastive Rewards for Reasoning Trajectory Clustering

Sparse verifiable rewards (binary success/failure) provide limited training signal for complex reasoning tasks. CLIPO adds contrastive learning as an auxiliary objective: it embeds reasoning trajectories in latent space and applies InfoNCE loss to cluster correct responses together while repelling errors.

The insight is that successful reasoning paths share consistent underlying logic structures. By enforcing this structure in embedding space, contrastive learning acts as a denoising mechanism, amplifying invariant reasoning patterns while suppressing spurious shortcuts and hallucinations.

Core Concept

CLIPO extends RLVR policy optimization algorithms (GRPO, GSPO, DAPO, GMPO) by introducing a lightweight contrastive head and auxiliary reward:

Contrastive Head: Projects reasoning trajectories into an embedding space
InfoNCE Loss: Treats correct responses as positives and incorrect responses as negatives
Dense Auxiliary Reward: Converts the contrastive objective into a per-trajectory reward signal
Combined Reward: Final reward = verifiable reward + contrastive_weight × contrastive reward

This dual signal prevents optimization collapse on narrow heuristics while maintaining grounding in task-specific verifiable rewards.
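
As a rough illustration of how the two signals combine, the sketch below adds a weighted contrastive term to a binary verifiable reward. The helper name `combine_rewards`, the weight of 0.5, and the toy scores are assumptions made for this example; they are not values taken from the CLIPO paper.

```python
def combine_rewards(verifiable, contrastive, contrastive_weight=0.5):
    """Per-trajectory total reward: verifiable + weight * contrastive (hypothetical helper)."""
    if len(verifiable) != len(contrastive):
        raise ValueError("need one contrastive score per trajectory")
    return [v + contrastive_weight * c for v, c in zip(verifiable, contrastive)]


# Toy batch: three correct trajectories (reward 1.0) and one incorrect (reward 0.0).
verifiable = [1.0, 1.0, 1.0, 0.0]
# Made-up contrastive scores: higher means closer to the other correct trajectories.
contrastive = [0.8, 0.2, 0.5, -0.6]

total = combine_rewards(verifiable, contrastive)
print(total)
```

The three correct trajectories share the same verifiable reward but end up with different totals, which is the sense in which the auxiliary term is "dense".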

Architecture Overview

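Trajectory encoder: A linear layer or small MLP that embeds pooled token hidden states
Contrastive head: Projects to a (typically) 256-512-dimensional embedding space; the Step 1 sketch defaults to a 256-wide hidden layer and a 128-dimensional output
InfoNCE comparator: Computes pairwise similarity matrices and the contrastive loss
Reward converter: Translates contrastive similarities into a per-trajectory auxiliary reward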
Main policy: Unchanged from baseline RLVR method

Implementation Steps

Step 1: Build Trajectory Encoder and Contrastive Head

Create embeddings for reasoning trajectories by processing token hidden states.

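import torch
import torch.nn as nn

class TrajectoryContrastiveHead(nn.Module):
    def __init__(self, hidden_dim, embedding_dim=256, projection_dim=128):
        super().__init__()
        # Linear encoder: maps the pooled hidden state to projection_dim
        self.projection = nn.Linear(hidden_dim, projection_dim)
        # Two-layer MLP head: projection_dim -> embedding_dim -> projection_dim
        self.contrastive_head = nn.Sequential(
            nn.Linear(projection_dim, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, projection_dim)
        )

    def forward(self, hidden_states, attention_mask=None):
        """
        hidden_states: (batch_size, seq_len, hidden_dim)
        attention_mask: optional (batch_size, seq_len), 1 for real tokens, 0 for padding
        returns: (batch_size, projection_dim) embeddings
        """
        if attention_mask is None:
            # Plain mean pooling over the sequence dimension
            trajectory_repr = hidden_states.mean(dim=1)  # (batch, hidden_dim)
        else:
            # Masked mean pooling so padding positions do not dilute the average
            mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
            trajectory_repr = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

        # Project through network
        projected = self.projection(trajectory_repr)
        embedding = self.contrastive_head(projected)

        return embedding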
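
Plain mean pooling averages over every position, including padding when trajectories in a batch have different lengths. The sketch below uses plain Python lists (no PyTorch) to show the difference a mask makes on one padded trajectory. The function name `masked_mean_pool` and the toy vectors are illustrative assumptions, not part of the skill.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors over real tokens only.

    hidden_states: list of seq_len vectors (each a list of floats) for one trajectory
    attention_mask: list of seq_len ints, 1 for a real token and 0 for padding
    """
    dim = len(hidden_states[0])
    n_real = sum(attention_mask)
    if n_real == 0:
        raise ValueError("trajectory has no real tokens")
    pooled = [0.0] * dim
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            for d in range(dim):
                pooled[d] += vec[d]
    return [x / n_real for x in pooled]


# One trajectory with two real tokens followed by two padding positions.
hidden = [[1.0, 3.0], [3.0, 5.0], [0.0, 0.0], [0.0, 0.0]]
mask = [1, 1, 0, 0]

masked = masked_mean_pool(hidden, mask)
unmasked = masked_mean_pool(hidden, [1, 1, 1, 1])
print(masked)    # [2.0, 4.0]
print(unmasked)  # [1.0, 2.0]
```

Without the mask, the two padding positions halve the pooled vector, so trajectories of different lengths would be embedded on different scales.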

Step 2: Compute InfoNCE Contrastive Loss

Within each batch, pair trajectories by correctness label: pairs that share a label are treated as positives, and pairs with different labels as negatives.

import torch

def infonce_loss(embeddings, labels, temperature=0.1):
    """
    InfoNCE-style loss for trajectory embeddings.

    embeddings: (batch_size, embedding_dim)
    labels: (batch_size,) binary correctness labels
    temperature: temperature parameter for softmax
    """
    # Normalize embeddings so dot products are cosine similarities
    embeddings = torch.nn.functional.normalize(embeddings, dim=1)

    # Compute similarity matrix, scaled by temperature
    similarity_matrix = torch.mm(embeddings, embeddings.t()) / temperature

    # Positive pairs share a correctness label; self-pairs on the diagonal are excluded
    labels_expanded = labels.unsqueeze(1)
    positive_mask = (labels_expanded == labels_expanded.t()).float()
    positive_mask.fill_diagonal_(0)

    # Negative pairs have different correctness labels
    negative_mask = (labels_expanded != labels_expanded.t()).float()

    exp_sim = torch.exp(similarity_matrix)

    # Sum of exponentiated positive similarities per row
    pos_sum = (exp_sim * positive_mask).sum(dim=1)

    # Sum of exponentiated negative similarities per row
    neg_sum = (exp_sim * negative_mask).sum(dim=1)

    # Per-row loss: -log(pos_sum / (pos_sum + neg_sum)), with eps against division by zero
    infonce = -torch.log(
        (pos_sum + 1e-8) / (pos_sum + neg_sum + 1e-8)
    )

    # Average only over rows with at least one positive; rows without one
    # would otherwise contribute a large constant from the eps term
    has_positive = positive_mask.sum(dim=1) > 0
    if not has_positive.any():
        return infonce.sum() * 0.0
    return infonce[has_positive].mean()
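
To make the behaviour of this loss concrete, here is a small pure-Python sketch (no PyTorch) that follows the same pos / (pos + neg) form and skips anchors with no positive. The function name `toy_infonce` and the 2-D toy embeddings are assumptions for illustration only.

```python
import math


def toy_infonce(embeddings, labels, temperature=0.1):
    """Mean of -log(pos / (pos + neg)) over anchors that have at least one positive."""
    # L2-normalize so dot products are cosine similarities.
    normed = []
    for e in embeddings:
        norm = math.sqrt(sum(x * x for x in e))
        normed.append([x / norm for x in e])

    losses = []
    n = len(normed)
    for i in range(n):
        pos = 0.0
        neg = 0.0
        for j in range(n):
            if i == j:
                continue
            sim = sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            if labels[i] == labels[j]:
                pos += math.exp(sim)
            else:
                neg += math.exp(sim)
        if pos > 0.0:
            losses.append(-math.log(pos / (pos + neg)))
    return sum(losses) / len(losses)


labels = [1, 1, 0, 0]
# Same-label trajectories sit close together.
clustered = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# Same-label trajectories are far apart and close to the other class.
mixed = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

print(toy_infonce(clustered, labels) < toy_infonce(mixed, labels))  # True
```

The loss is lower when trajectories with the same correctness label are close in embedding space, which is the clustering pressure the auxiliary objective is meant to provide.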

Step 3: Convert Contrastive Loss to Auxiliary Reward

Transform the contrastive objective into a dense, per-trajectory reward signal that complements the verifiable reward.

import torch

def contrastive_reward(
    embeddings,
    labels,
    verifiable_rewards,
    contrastive_weight=0.5,
    temperature=0.1
):
    """
    Compute combined reward: verifiable + contrastive auxiliary.

    Returns (total_reward, contrastive_reward_signal), each of shape (batch_size,).
    """
    # Normalize embeddings so dot products are cosine similarities
    embeddings_norm = torch.nn.functional.normalize(embeddings, dim=1)

    # Compute similarity matrix; dividing by temperature scales values to [-1/T, 1/T]
    similarity_matrix = torch.mm(embeddings_norm, embeddings_norm.t()) / temperature

    # Positive mask: other trajectories with the same correctness label.
    # Note: incorrect trajectories are therefore compared with other incorrect ones too.
    labels_expanded = labels.unsqueeze(1)
    positive_mask = (labels_expanded == labels_expanded.t()).float()
    positive_mask.fill_diagonal_(0)

    # Average positive similarity per trajectory (0 if there are no positives)
    pos_similarities = (similarity_matrix * positive_mask).sum(dim=1) / (
        positive_mask.sum(dim=1).clamp(min=1.0)
    )

    # Negative mask: trajectories with a different correctness label
    negative_mask = (labels_expanded != labels_expanded.t()).float()

    # Average negative similarity per trajectory (0 if there are no negatives)
    neg_similarities = (similarity_matrix * negative_mask).sum(dim=1) / (
        negative_mask.sum(dim=1).clamp(min=1.0)
    )

    # Contrastive reward: pull positives, push negatives; bounded by 2 / temperature
    contrastive_reward_signal = pos_similarities - neg_similarities

    # Combine with verifiable reward
    total_reward = (
        verifiable_rewards +
        contrastive_weight * contrastive_reward_signal
    )

    return total_reward, contrastive_reward_signal
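
Because the similarities are divided by the temperature before averaging, the scale of this signal depends on the temperature. The pure-Python sketch below mirrors the "mean same-label similarity minus mean different-label similarity" computation on a toy batch. The function name `toy_contrastive_signal` and the embeddings are assumptions chosen to hit the extreme case.

```python
import math


def toy_contrastive_signal(embeddings, labels, temperature=0.1):
    """Mean same-label similarity minus mean different-label similarity, per trajectory."""
    normed = [[x / math.sqrt(sum(v * v for v in e)) for x in e] for e in embeddings]
    signals = []
    for i, anchor in enumerate(normed):
        same, diff = [], []
        for j, other in enumerate(normed):
            if i == j:
                continue
            sim = sum(a * b for a, b in zip(anchor, other)) / temperature
            (same if labels[i] == labels[j] else diff).append(sim)
        pos = sum(same) / len(same) if same else 0.0
        neg = sum(diff) / len(diff) if diff else 0.0
        signals.append(pos - neg)
    return signals


# Two identical correct trajectories and one incorrect trajectory pointing the opposite way.
labels = [1, 1, 0]
embeddings = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]

signal = toy_contrastive_signal(embeddings, labels, temperature=0.1)
print(signal)
print(max(abs(s) for s in signal) <= 2 / 0.1 + 1e-9)
```

With unit-norm embeddings each averaged similarity lies in [-1/temperature, 1/temperature], so the signal is bounded by 2/temperature in magnitude. At temperature 0.1 that bound is 20, far larger than a binary reward of 1, which is worth keeping in mind when choosing `contrastive_weight`.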

Step 4: Integrate into RLVR Training Loop

Extend the baseline policy optimization step with the contrastive auxiliary objective.

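import torch

def train_step_with_clipo(
    model,
    input_ids,
    verifiable_rewards,
    contrastive_head,
    policy_optimizer,
    contrastive_weight=0.5,
    temperature=0.1
):
    """
    Single training step combining RLVR with contrastive learning.

    Assumes policy_optimizer also holds contrastive_head's parameters;
    otherwise the head is never updated.
    """
    policy_optimizer.zero_grad()

    # Forward pass through model
    outputs = model(
        input_ids,
        output_hidden_states=True
    )

    # Extract trajectory embeddings
    hidden_states = outputs.hidden_states[-1]
    trajectory_embeddings = contrastive_head(hidden_states)

    # Compute correctness labels (binarize verifiable rewards)
    correctness_labels = (verifiable_rewards > 0).long()

    # Compute combined rewards; rewards are constants for the policy gradient,
    # so no gradient should flow through them
    with torch.no_grad():
        total_rewards, contrastive_signal = contrastive_reward(
            trajectory_embeddings,
            correctness_labels,
            verifiable_rewards,
            contrastive_weight=contrastive_weight,
            temperature=temperature
        )

    # Policy gradient: simplified mean-baseline placeholder, not full GRPO
    # (no importance ratio, clipping, or KL term)
    log_probs = outputs.logits[:, :-1].log_softmax(dim=-1)
    # Log-probability of each token that actually appears in the sequence.
    # This averages over all positions; a real loop would mask to response tokens.
    token_log_probs = log_probs.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    sequence_log_probs = token_log_probs.mean(dim=1)  # (batch,)
    baseline = total_rewards.mean()
    advantages = total_rewards - baseline

    policy_loss = -(sequence_log_probs * advantages).mean()

    # Assumption: train the contrastive head with the Step 2 loss, added unweighted
    aux_loss = infonce_loss(trajectory_embeddings, correctness_labels, temperature)

    # Backward pass
    (policy_loss + aux_loss).backward()
    policy_optimizer.step()

    return {
        'policy_loss': policy_loss.item(),
        'infonce_loss': aux_loss.item(),
        'contrastive_signal_mean': contrastive_signal.mean().item(),
        'total_reward_mean': total_rewards.mean().item()
    }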
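
One way to see why the dense term matters for a mean-baseline update: when every sampled trajectory in a group receives the same verifiable reward, the advantages are all zero and the policy gets no gradient. The sketch below shows this with plain Python; `group_advantages` and the toy contrastive terms are hypothetical and only illustrate the arithmetic.

```python
def group_advantages(rewards):
    """Subtract the group mean reward (a simple baseline; full GRPO also divides by the std)."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# All four sampled trajectories are correct, so the binary reward is identical.
sparse = [1.0, 1.0, 1.0, 1.0]
# Hypothetical contrastive terms, already scaled by contrastive_weight.
dense = [1.0 + c for c in (0.25, -0.25, 0.5, -0.5)]

print(group_advantages(sparse))  # [0.0, 0.0, 0.0, 0.0]
print(group_advantages(dense))   # [0.25, -0.25, 0.5, -0.5]
```

With the sparse reward alone this group contributes nothing to the update; with the auxiliary term the trajectories are still ranked against each other.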

Practical Guidance

When to Use:

Complex reasoning tasks (MATH, GSM8K, competition-level problems)
Distributions with spurious correlations that mislead optimization
Scenarios where verifiable rewards alone lead to hallucinations
Tasks requiring robust generalization to perturbed inputs

When NOT to Use:

Simple classification with clean, unambiguous labels
When correct solutions form several disconnected clusters (pulling all correct trajectories together can then be harmful)
Extreme computational budget constraints (adds embedding computation)

Hyperparameter Tuning:

contrastive_weight: 0.3-1.0; balance with task difficulty
temperature: 0.05-0.2; lower sharpens contrastive distinctions
embedding_dim: 128-512 depending on trajectory complexity
batch size: Larger batches provide more negative samples
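
To build intuition for the temperature range above, the following sketch applies a temperature-scaled softmax to a fixed set of cosine similarities. The function name `softmax_with_temperature` and the three similarity values are assumptions for illustration.

```python
import math


def softmax_with_temperature(similarities, temperature):
    """Softmax over cosine similarities divided by the temperature."""
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


similarities = [0.9, 0.7, 0.1]  # toy cosine similarities to three other trajectories
for temperature in (0.05, 0.1, 0.2):
    weights = softmax_with_temperature(similarities, temperature)
    print(temperature, [round(w, 3) for w in weights])
```

Lower temperatures concentrate almost all of the weight on the most similar trajectory, which sharpens distinctions but also makes the loss more sensitive to small changes in the embeddings.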

Common Pitfalls:

Weight too high, overshadowing verifiable signals (verify with ablations)
Temperature too low causing training instability (increase if loss oscillates)
Insufficient positive pairs early in training (consider warm-up scheduling)
Embedding space not well-aligned with task semantics
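
For the warm-up suggestion above, one possible approach is to ramp `contrastive_weight` linearly from zero and to monitor how many positive pairs each batch actually contains. Both helpers below, their names, and the default of 1000 warm-up steps are assumptions for this sketch; the source does not prescribe a schedule.

```python
def contrastive_weight_schedule(step, warmup_steps=1000, target_weight=0.5):
    """Linear warm-up from 0 to target_weight over warmup_steps, then constant."""
    if warmup_steps <= 0:
        return target_weight
    fraction = min(max(step, 0) / warmup_steps, 1.0)
    return target_weight * fraction


def count_positive_pairs(labels):
    """Number of unordered pairs of correct trajectories in a batch."""
    n_correct = sum(1 for y in labels if y == 1)
    return n_correct * (n_correct - 1) // 2


for step in (0, 250, 500, 1000, 2000):
    print(step, contrastive_weight_schedule(step))

print(count_positive_pairs([1, 0, 0, 1, 1, 0]))  # 3
```

Early in training, when few rollouts are correct, `count_positive_pairs` will often be 0 or 1, and the small weight keeps a noisy contrastive term from dominating the verifiable reward.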

Reference

CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR (arXiv): https://arxiv.org/abs/2603.10101

Augment verifiable reward RL (RLVR) with contrastive learning to generate dense auxiliary rewards. Enforce proximity among correct reasoning trajectories in embedding space while suppressing errors, amplifying invariant reasoning patterns.

2 stars

tools

Updated Apr 16, 2026

$ install --global

skillsauth

npx skillsauth add ADu2021/skillXiv clipo-contrastive-policy-optimization

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Status: Clean, 95%
Semgrep: Static code analysis for vulnerabilities. Status: Clean, 95%
mcp-scan (Snyk): Model Context Protocol security validation. Status: Clean, 95%
Snyk (dep): Open source security scanning. Status: Skipped, 50%
Socket.dev: Supply chain security analysis. Status: Skipped, 50%
VirusTotal: Multi-engine malware detection. Status: Skipped, 50%
CrowdStrike: Advanced threat intelligence. Status: Skipped, 50%
OSV-Scanner: Open Source Vulnerability database check. Status: Skipped, 50%
OWASP Dep-Check. Status: Skipped, 50%

Last scanned: Apr 16, 2026, 3:08 PM · 21.8s · 1 file scanned

SKILL.md

name: clipo-contrastive-policy-optimization
title: CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR
version: 0.0.2
engine: skillxiv-v0.0.2-claude-opus-4.6
license: MIT
url: https://arxiv.org/abs/2603.10101
keywords: [RLVR, Contrastive Learning, Policy Optimization, Reasoning, Dense Rewards]
description: Augment verifiable reward RL (RLVR) with contrastive learning to generate dense auxiliary rewards. Enforce proximity among correct reasoning trajectories in embedding space while suppressing errors, amplifying invariant reasoning patterns.

Related Skills

ADu2021/flow-map-trajectory-tilting

testing

Verified · Trusted · Community

Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.

2 · SKILL.md · Updated Apr 17, 2026

ADu2021/flexible-data-mixture-of-experts

testing

Verified · Trusted · Community

Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.

2 · SKILL.md · Updated Apr 17, 2026

ADu2021/flexibility-trap-diffusion-reasoning

data-ai

Verified · Trusted · Community

Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.

2 · SKILL.md · Updated Apr 17, 2026

ADu2021/flex-continuous-agent-evolution

devops

Verified · Trusted · Community

Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures, achieving 23% improvement on reasoning without gradient-based parameter updates or external training.

2 · SKILL.md · Updated Apr 17, 2026

Install manually

# Clone the repo
git clone https://github.com/ADu2021/skillXiv.git

# Copy into Claude Code skills folder (global)
cp -r skillXiv/skills/skillxiv-v0.0.2-claude-opus-4.6/clipo-contrastive-policy-optimization ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

ADu2021/skillXiv

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT