Embed-RL: RL-Driven Reasoning for Multimodal Embeddings

Problem Context

Standard multimodal embeddings rely purely on contrastive learning, lacking reasoning about why items are similar. Methods that add reasoning often generate text-only explanations misaligned with retrieval. Joint training of embedder and reasoner produces conflicting gradients. Cross-modal retrieval remains brittle for complex multimodal reasoning. The challenge: improve embedding quality through reasoning that's explicitly grounded in retrievable evidence.

Core Concept

Embed-RL adds RL-optimized reasoning to multimodal embeddings through a two-component system: a frozen contrastive embedder (reward provider) and a reasoner (RL-optimized). The reasoner generates "Traceability CoT" (T-CoT) that explicitly grounds reasoning in multimodal evidence:

Text keywords: Extractable text entities
Spatial locations: Bounding boxes for visual grounding
Temporal markers: Key frames in videos

Rather than joint training, the frozen embedder acts as a stable reward signal. The reasoner optimizes to improve retrieval accuracy while explaining its reasoning in traceable, multi-modal terms.

Architecture Overview

Frozen Embedder: Pre-trained contrastive learner, provides reward signals
Reasoner: RL-optimized via PPO or similar
Evidential Traceability CoT: Grounds reasoning in multimodal evidence
Dual Reward Mechanism: Format compliance, process alignment, outcome effectiveness
Multi-Evidence Integration: Text keywords, bounding boxes, key frames
Decoupled Optimization: Prevents gradient conflicts
Task-Specific Training: Query-target reasoning pairs

Implementation

Two-component architecture:

class EmbedRLSystem(nn.Module):
    """
    Two-component system: frozen embedder (reward) + reasoner (RL-optimized).
    """

    def __init__(self, embedder_model, reasoner_model):
        super().__init__()
        # Frozen embedder for reward computation
        self.embedder = embedder_model
        self.embedder.eval()
        for param in self.embedder.parameters():
            param.requires_grad = False

        # RL-optimized reasoner
        self.reasoner = reasoner_model
        self.reasoner.train()

    def forward(self, query, target, modalities):
        """
        Compute embedding + reasoning jointly.
        Reasoner explains why embeddings are similar/different.
        """
        # Get embeddings (frozen)
        query_embedding = self.embedder.embed(query)
        target_embedding = self.embedder.embed(target)

        # Generate reasoning from reasoner
        reasoning = self.reasoner(query, target, modalities)

        return {
            'query_embedding': query_embedding,
            'target_embedding': target_embedding,
            'reasoning': reasoning
        }

Traceability CoT generation:

class TraceabilityCoTGenerator(nn.Module):
    """
    Generate reasoning grounded in retrievable multimodal evidence.
    Explicitly references text, visual locations, temporal moments.
    """

    def __init__(self, hidden_dim=512):
        super().__init__()
        self.hidden_dim = hidden_dim

        # Text extractor: identify keywords
        self.text_extractor = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 100)  # Top 100 keywords
        )

        # Visual extractor: localize bounding boxes
        self.bbox_generator = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 4)  # [x, y, w, h]
        )

        # Temporal extractor: identify key frames
        self.keyframe_extractor = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1)  # Frame index
        )

        # Reasoning composer
        self.reasoning_composer = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(hidden_dim, 8),
            num_layers=2
        )

    def extract_text_keywords(self, query_features, target_features):
        """
        Extract text-based keywords explaining similarity.
        Returns: list of relevant keywords
        """
        combined = torch.cat([query_features, target_features], dim=-1)
        keyword_logits = self.text_extractor(combined)

        # Get top keywords
        top_k_indices = torch.topk(keyword_logits, k=5)[1]
        keywords = [get_keyword(idx) for idx in top_k_indices]

        return keywords

    def extract_spatial_locations(self, query_image, target_image,
                                  query_features, target_features):
        """
        Generate bounding boxes localizing relevant image regions.
        """
        combined_features = torch.cat([
            query_features, target_features
        ], dim=-1)

        # Query image bounding box
        query_bbox = torch.sigmoid(
            self.bbox_generator(combined_features))  # Normalize to [0,1]

        # Target image bounding box
        target_bbox = torch.sigmoid(
            self.bbox_generator(combined_features))

        return {
            'query_bbox_2d': query_bbox,
            'target_bbox_2d': target_bbox,
            'query_image': query_image,
            'target_image': target_image
        }

    def extract_key_frames(self, query_video, target_video,
                          query_features, target_features):
        """
        Identify critical frames in videos explaining similarity.
        """
        combined = torch.cat([query_features, target_features], dim=-1)

        # Key frame index for query video
        query_keyframe_idx = torch.sigmoid(
            self.keyframe_extractor(combined))
        query_keyframe_idx = (query_keyframe_idx *
                             query_video.num_frames).int()

        # Key frame index for target video
        target_keyframe_idx = torch.sigmoid(
            self.keyframe_extractor(combined))
        target_keyframe_idx = (target_keyframe_idx *
                              target_video.num_frames).int()

        return {
            'query_key_frames': [
                query_video.get_frame(query_keyframe_idx)
            ],
            'target_key_frames': [
                target_video.get_frame(target_keyframe_idx)
            ],
            'query_frame_indices': [query_keyframe_idx],
            'target_frame_indices': [target_keyframe_idx]
        }

    def generate_traceability_cot(self, query, target, modalities,
                                 query_features, target_features):
        """
        Compose complete T-CoT with multimodal grounding.
        """
        t_cot = {
            'type': 'evidential_traceability_cot',
            'evidence': {}
        }

        # Text evidence
        if 'text' in modalities:
            keywords = self.extract_text_keywords(
                query_features, target_features)
            t_cot['evidence']['text_keywords'] = keywords

        # Visual evidence
        if 'image' in modalities:
            spatial = self.extract_spatial_locations(
                query.get('image'), target.get('image'),
                query_features, target_features)
            t_cot['evidence']['bbox_2d'] = spatial

        # Temporal evidence
        if 'video' in modalities:
            temporal = self.extract_key_frames(
                query.get('video'), target.get('video'),
                query_features, target_features)
            t_cot['evidence']['key_frames'] = temporal

        return t_cot

Dual-reward mechanism for RL:

class DualRewardComputation:
    """
    Three-component reward system for RL training:
    1. Format compliance: is T-CoT properly structured
    2. Process alignment: does reasoning match query-target relationship
    3. Outcome effectiveness: does reasoning improve retrieval
    """

    def __init__(self, embedder):
        self.embedder = embedder

    def compute_format_compliance_reward(self, t_cot):
        """
        Measure if T-CoT has proper structure.
        Reward well-formed, complete reasoning.
        """
        required_fields = ['text_keywords', 'bbox_2d', 'key_frames']
        present_fields = sum(
            1 for field in required_fields
            if field in t_cot['evidence'])

        completeness = present_fields / len(required_fields)

        # Also check field validity
        validity_score = 1.0
        if 'text_keywords' in t_cot['evidence']:
            keywords = t_cot['evidence']['text_keywords']
            if not (1 <= len(keywords) <= 10):
                validity_score *= 0.5

        if 'bbox_2d' in t_cot['evidence']:
            bbox = t_cot['evidence']['bbox_2d']
            if not (0 <= bbox <= 1).all():
                validity_score *= 0.5

        return completeness * validity_score

    def compute_process_alignment_reward(self, query, target, t_cot,
                                        query_emb, target_emb):
        """
        Measure if reasoning correctly aligns with query-target relationship.
        """
        # Extract reasoning components
        text_sim = compute_text_similarity(
            query, target, t_cot['evidence'].get('text_keywords', []))

        # Visual similarity from bboxes
        visual_sim = 1.0
        if 'bbox_2d' in t_cot['evidence']:
            query_bbox = t_cot['evidence']['bbox_2d']['query_bbox_2d']
            target_bbox = t_cot['evidence']['bbox_2d']['target_bbox_2d']
            # Overlap measure
            visual_sim = compute_bbox_overlap(query_bbox, target_bbox)

        # Alignment with embeddings
        embedding_sim = torch.cosine_similarity(query_emb, target_emb)

        # Combined alignment
        alignment = (text_sim + visual_sim + embedding_sim) / 3.0

        return float(alignment.clamp(0, 1))

    def compute_outcome_effectiveness_reward(self, query, target, t_cot,
                                            query_emb, target_emb,
                                            batch_targets):
        """
        Measure if reasoning actually improves retrieval accuracy.
        """
        # Baseline: embedding similarity without reasoning
        baseline_score = torch.cosine_similarity(query_emb, target_emb)

        # With reasoning: boost similarity if reasoning is strong
        reasoning_strength = len(t_cot['evidence']) / 3.0

        # Adjusted score: reasoning boosts relevant retrievals
        adjusted_score = baseline_score * (1.0 + 0.5 * reasoning_strength)

        # Compute retrieval rank with reasoning
        query_embedding = self.embedder.embed(query)
        scores = []
        for candidate in batch_targets:
            candidate_emb = self.embedder.embed(candidate)
            score = torch.cosine_similarity(query_embedding,
                                           candidate_emb)
            # Apply reasoning boost
            scores.append(score)

        # Rank of target among candidates
        scores = torch.tensor(scores)
        target_rank = (scores > adjusted_score).sum().item()

        # Reward: higher for better rank
        rank_reward = 1.0 / (1.0 + target_rank)

        return float(rank_reward)

    def compute_combined_reward(self, query, target, t_cot,
                               query_emb, target_emb, batch_targets):
        """
        Combine three reward components.
        """
        format_reward = self.compute_format_compliance_reward(t_cot)
        alignment_reward = self.compute_process_alignment_reward(
            query, target, t_cot, query_emb, target_emb)
        outcome_reward = self.compute_outcome_effectiveness_reward(
            query, target, t_cot, query_emb, target_emb, batch_targets)

        # Weighted combination
        combined = (
            0.2 * format_reward +
            0.3 * alignment_reward +
            0.5 * outcome_reward
        )

        return {
            'format_reward': format_reward,
            'alignment_reward': alignment_reward,
            'outcome_reward': outcome_reward,
            'combined_reward': combined
        }

RL training loop:

def train_with_rl(system, reward_computer, train_pairs, num_epochs=10):
    """
    Train reasoner with RL while keeping embedder frozen.
    """
    optimizer = torch.optim.Adam(system.reasoner.parameters(),
                                 lr=1e-4)
    ppo_optimizer = PPOOptimizer(system.reasoner)

    for epoch in range(num_epochs):
        epoch_loss = 0.0

        for batch_idx, (queries, targets, batch_targets) in enumerate(
            train_pairs):

            # Forward pass
            outputs = system(queries, targets, modalities=['text',
                                                          'image',
                                                          'video'])

            embeddings = {
                'query': outputs['query_embedding'],
                'target': outputs['target_embedding']
            }
            reasoning = outputs['reasoning']

            # Compute rewards
            rewards = []
            for query, target, t_cot in zip(queries, targets,
                                            reasoning):
                reward_dict = reward_computer.compute_combined_reward(
                    query, target, t_cot,
                    embeddings['query'], embeddings['target'],
                    batch_targets)
                rewards.append(reward_dict['combined_reward'])

            rewards = torch.tensor(rewards)

            # PPO update
            log_probs = system.reasoner.get_log_prob(reasoning)
            policy_loss = -(rewards * log_probs).mean()

            # Entropy regularization
            entropy = system.reasoner.compute_entropy(reasoning)
            total_loss = policy_loss - 0.01 * entropy

            # Update
            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()

            epoch_loss += total_loss.item()

            if (batch_idx + 1) % 100 == 0:
                print(f"Epoch {epoch}, Batch {batch_idx + 1}: "
                      f"Loss={total_loss.item():.4f}, "
                      f"Mean Reward={rewards.mean():.4f}")

        print(f"Epoch {epoch} completed. Average Loss={epoch_loss:.4f}")

Practical Guidance

When to use:

Building retrieval systems requiring reasoning
Have multimodal data (text, images, videos)
Need explainable embeddings
Want better cross-modal retrieval

Implementation steps:

Base Embedder: Train or use pre-trained contrastive model
Reasoner Model: Implement T-CoT generators for each modality
Freeze Embedder: Lock embedder weights during RL training
Reward Computer: Implement all three reward components
RL Training: Use PPO or similar for reasoner optimization

Modality-specific considerations:

Text: Extract keywords from documents, measure semantic similarity
Images: Generate bounding boxes, track spatial relationships
Videos: Identify key frames showing critical moments
Multimodal: Combine evidence across modalities coherently

Reward tuning:

Format compliance (0.2): Ensure structured reasoning
Alignment (0.3): Reasoning matches actual relationships
Outcome (0.5): Reasoning improves retrieval performance

Expected improvements:

MMEB-V2 benchmark: 5-10% improvement over baselines
UVRB benchmark: 8-15% improvement
Fine-grained retrieval: strong gains (+10-20%)
Out-of-domain transfer: maintains performance

Training configuration:

Batch size: 64-128
Learning rate: 1e-4 for reasoner, 0 for embedder
Epochs: 10-20
PPO clip ratio: 0.2
Entropy coefficient: 0.01
Max sequence length: 512

Inference:

Generate T-CoT for both query and candidates
Weight retrieval by reasoning strength
Cache embeddings for efficiency
Use reasoning for result explanation

Reference

Decoupling embedder and reasoner through RL enables grounding reasoning in retrievable evidence while maintaining stable reward signals. Traceability CoT provides interpretable, multimodal explanations that directly improve retrieval performance across modalities.

Embed-RL: RL-Driven Reasoning for Multimodal Embeddings

Problem Context

Core Concept

Text keywords: Extractable text entities
Spatial locations: Bounding boxes for visual grounding
Temporal markers: Key frames in videos

Rather than joint training, the frozen embedder acts as a stable reward signal. The reasoner optimizes to improve retrieval accuracy while explaining its reasoning in traceable, multi-modal terms.

Architecture Overview

Frozen Embedder: Pre-trained contrastive learner, provides reward signals
Reasoner: RL-optimized via PPO or similar
Evidential Traceability CoT: Grounds reasoning in multimodal evidence
Dual Reward Mechanism: Format compliance, process alignment, outcome effectiveness
Multi-Evidence Integration: Text keywords, bounding boxes, key frames
Decoupled Optimization: Prevents gradient conflicts
Task-Specific Training: Query-target reasoning pairs

Implementation

Two-component architecture:

class EmbedRLSystem(nn.Module):
    """
    Two-component system: frozen embedder (reward) + reasoner (RL-optimized).
    """

    def __init__(self, embedder_model, reasoner_model):
        super().__init__()
        # Frozen embedder for reward computation
        self.embedder = embedder_model
        self.embedder.eval()
        for param in self.embedder.parameters():
            param.requires_grad = False

        # RL-optimized reasoner
        self.reasoner = reasoner_model
        self.reasoner.train()

    def forward(self, query, target, modalities):
        """
        Compute embedding + reasoning jointly.
        Reasoner explains why embeddings are similar/different.
        """
        # Get embeddings (frozen)
        query_embedding = self.embedder.embed(query)
        target_embedding = self.embedder.embed(target)

        # Generate reasoning from reasoner
        reasoning = self.reasoner(query, target, modalities)

        return {
            'query_embedding': query_embedding,
            'target_embedding': target_embedding,
            'reasoning': reasoning
        }

Traceability CoT generation:

class TraceabilityCoTGenerator(nn.Module):
    """
    Generate reasoning grounded in retrievable multimodal evidence.
    Explicitly references text, visual locations, temporal moments.
    """

    def __init__(self, hidden_dim=512):
        super().__init__()
        self.hidden_dim = hidden_dim

        # Text extractor: identify keywords
        self.text_extractor = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 100)  # Top 100 keywords
        )

        # Visual extractor: localize bounding boxes
        self.bbox_generator = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 4)  # [x, y, w, h]
        )

        # Temporal extractor: identify key frames
        self.keyframe_extractor = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1)  # Frame index
        )

        # Reasoning composer
        self.reasoning_composer = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(hidden_dim, 8),
            num_layers=2
        )

    def extract_text_keywords(self, query_features, target_features):
        """
        Extract text-based keywords explaining similarity.
        Returns: list of relevant keywords
        """
        combined = torch.cat([query_features, target_features], dim=-1)
        keyword_logits = self.text_extractor(combined)

        # Get top keywords
        top_k_indices = torch.topk(keyword_logits, k=5)[1]
        keywords = [get_keyword(idx) for idx in top_k_indices]

        return keywords

    def extract_spatial_locations(self, query_image, target_image,
                                  query_features, target_features):
        """
        Generate bounding boxes localizing relevant image regions.
        """
        combined_features = torch.cat([
            query_features, target_features
        ], dim=-1)

        # Query image bounding box
        query_bbox = torch.sigmoid(
            self.bbox_generator(combined_features))  # Normalize to [0,1]

        # Target image bounding box
        target_bbox = torch.sigmoid(
            self.bbox_generator(combined_features))

        return {
            'query_bbox_2d': query_bbox,
            'target_bbox_2d': target_bbox,
            'query_image': query_image,
            'target_image': target_image
        }

    def extract_key_frames(self, query_video, target_video,
                          query_features, target_features):
        """
        Identify critical frames in videos explaining similarity.
        """
        combined = torch.cat([query_features, target_features], dim=-1)

        # Key frame index for query video
        query_keyframe_idx = torch.sigmoid(
            self.keyframe_extractor(combined))
        query_keyframe_idx = (query_keyframe_idx *
                             query_video.num_frames).int()

        # Key frame index for target video
        target_keyframe_idx = torch.sigmoid(
            self.keyframe_extractor(combined))
        target_keyframe_idx = (target_keyframe_idx *
                              target_video.num_frames).int()

        return {
            'query_key_frames': [
                query_video.get_frame(query_keyframe_idx)
            ],
            'target_key_frames': [
                target_video.get_frame(target_keyframe_idx)
            ],
            'query_frame_indices': [query_keyframe_idx],
            'target_frame_indices': [target_keyframe_idx]
        }

    def generate_traceability_cot(self, query, target, modalities,
                                 query_features, target_features):
        """
        Compose complete T-CoT with multimodal grounding.
        """
        t_cot = {
            'type': 'evidential_traceability_cot',
            'evidence': {}
        }

        # Text evidence
        if 'text' in modalities:
            keywords = self.extract_text_keywords(
                query_features, target_features)
            t_cot['evidence']['text_keywords'] = keywords

        # Visual evidence
        if 'image' in modalities:
            spatial = self.extract_spatial_locations(
                query.get('image'), target.get('image'),
                query_features, target_features)
            t_cot['evidence']['bbox_2d'] = spatial

        # Temporal evidence
        if 'video' in modalities:
            temporal = self.extract_key_frames(
                query.get('video'), target.get('video'),
                query_features, target_features)
            t_cot['evidence']['key_frames'] = temporal

        return t_cot

Dual-reward mechanism for RL:

class DualRewardComputation:
    """
    Three-component reward system for RL training:
    1. Format compliance: is T-CoT properly structured
    2. Process alignment: does reasoning match query-target relationship
    3. Outcome effectiveness: does reasoning improve retrieval
    """

    def __init__(self, embedder):
        self.embedder = embedder

    def compute_format_compliance_reward(self, t_cot):
        """
        Measure if T-CoT has proper structure.
        Reward well-formed, complete reasoning.
        """
        required_fields = ['text_keywords', 'bbox_2d', 'key_frames']
        present_fields = sum(
            1 for field in required_fields
            if field in t_cot['evidence'])

        completeness = present_fields / len(required_fields)

        # Also check field validity
        validity_score = 1.0
        if 'text_keywords' in t_cot['evidence']:
            keywords = t_cot['evidence']['text_keywords']
            if not (1 <= len(keywords) <= 10):
                validity_score *= 0.5

        if 'bbox_2d' in t_cot['evidence']:
            bbox = t_cot['evidence']['bbox_2d']
            if not (0 <= bbox <= 1).all():
                validity_score *= 0.5

        return completeness * validity_score

    def compute_process_alignment_reward(self, query, target, t_cot,
                                        query_emb, target_emb):
        """
        Measure if reasoning correctly aligns with query-target relationship.
        """
        # Extract reasoning components
        text_sim = compute_text_similarity(
            query, target, t_cot['evidence'].get('text_keywords', []))

        # Visual similarity from bboxes
        visual_sim = 1.0
        if 'bbox_2d' in t_cot['evidence']:
            query_bbox = t_cot['evidence']['bbox_2d']['query_bbox_2d']
            target_bbox = t_cot['evidence']['bbox_2d']['target_bbox_2d']
            # Overlap measure
            visual_sim = compute_bbox_overlap(query_bbox, target_bbox)

        # Alignment with embeddings
        embedding_sim = torch.cosine_similarity(query_emb, target_emb)

        # Combined alignment
        alignment = (text_sim + visual_sim + embedding_sim) / 3.0

        return float(alignment.clamp(0, 1))

    def compute_outcome_effectiveness_reward(self, query, target, t_cot,
                                            query_emb, target_emb,
                                            batch_targets):
        """
        Measure if reasoning actually improves retrieval accuracy.
        """
        # Baseline: embedding similarity without reasoning
        baseline_score = torch.cosine_similarity(query_emb, target_emb)

        # With reasoning: boost similarity if reasoning is strong
        reasoning_strength = len(t_cot['evidence']) / 3.0

        # Adjusted score: reasoning boosts relevant retrievals
        adjusted_score = baseline_score * (1.0 + 0.5 * reasoning_strength)

        # Compute retrieval rank with reasoning
        query_embedding = self.embedder.embed(query)
        scores = []
        for candidate in batch_targets:
            candidate_emb = self.embedder.embed(candidate)
            score = torch.cosine_similarity(query_embedding,
                                           candidate_emb)
            # Apply reasoning boost
            scores.append(score)

        # Rank of target among candidates
        scores = torch.tensor(scores)
        target_rank = (scores > adjusted_score).sum().item()

        # Reward: higher for better rank
        rank_reward = 1.0 / (1.0 + target_rank)

        return float(rank_reward)

    def compute_combined_reward(self, query, target, t_cot,
                               query_emb, target_emb, batch_targets):
        """
        Combine three reward components.
        """
        format_reward = self.compute_format_compliance_reward(t_cot)
        alignment_reward = self.compute_process_alignment_reward(
            query, target, t_cot, query_emb, target_emb)
        outcome_reward = self.compute_outcome_effectiveness_reward(
            query, target, t_cot, query_emb, target_emb, batch_targets)

        # Weighted combination
        combined = (
            0.2 * format_reward +
            0.3 * alignment_reward +
            0.5 * outcome_reward
        )

        return {
            'format_reward': format_reward,
            'alignment_reward': alignment_reward,
            'outcome_reward': outcome_reward,
            'combined_reward': combined
        }

RL training loop:

def train_with_rl(system, reward_computer, train_pairs, num_epochs=10):
    """
    Train reasoner with RL while keeping embedder frozen.
    """
    optimizer = torch.optim.Adam(system.reasoner.parameters(),
                                 lr=1e-4)
    ppo_optimizer = PPOOptimizer(system.reasoner)

    for epoch in range(num_epochs):
        epoch_loss = 0.0

        for batch_idx, (queries, targets, batch_targets) in enumerate(
            train_pairs):

            # Forward pass
            outputs = system(queries, targets, modalities=['text',
                                                          'image',
                                                          'video'])

            embeddings = {
                'query': outputs['query_embedding'],
                'target': outputs['target_embedding']
            }
            reasoning = outputs['reasoning']

            # Compute rewards
            rewards = []
            for query, target, t_cot in zip(queries, targets,
                                            reasoning):
                reward_dict = reward_computer.compute_combined_reward(
                    query, target, t_cot,
                    embeddings['query'], embeddings['target'],
                    batch_targets)
                rewards.append(reward_dict['combined_reward'])

            rewards = torch.tensor(rewards)

            # PPO update
            log_probs = system.reasoner.get_log_prob(reasoning)
            policy_loss = -(rewards * log_probs).mean()

            # Entropy regularization
            entropy = system.reasoner.compute_entropy(reasoning)
            total_loss = policy_loss - 0.01 * entropy

            # Update
            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()

            epoch_loss += total_loss.item()

            if (batch_idx + 1) % 100 == 0:
                print(f"Epoch {epoch}, Batch {batch_idx + 1}: "
                      f"Loss={total_loss.item():.4f}, "
                      f"Mean Reward={rewards.mean():.4f}")

        print(f"Epoch {epoch} completed. Average Loss={epoch_loss:.4f}")

Practical Guidance

When to use:

Building retrieval systems requiring reasoning
Have multimodal data (text, images, videos)
Need explainable embeddings
Want better cross-modal retrieval

Implementation steps:

Base Embedder: Train or use pre-trained contrastive model
Reasoner Model: Implement T-CoT generators for each modality
Freeze Embedder: Lock embedder weights during RL training
Reward Computer: Implement all three reward components
RL Training: Use PPO or similar for reasoner optimization

Modality-specific considerations:

Text: Extract keywords from documents, measure semantic similarity
Images: Generate bounding boxes, track spatial relationships
Videos: Identify key frames showing critical moments
Multimodal: Combine evidence across modalities coherently

Reward tuning:

Format compliance (0.2): Ensure structured reasoning
Alignment (0.3): Reasoning matches actual relationships
Outcome (0.5): Reasoning improves retrieval performance

Expected improvements:

MMEB-V2 benchmark: 5-10% improvement over baselines
UVRB benchmark: 8-15% improvement
Fine-grained retrieval: strong gains (+10-20%)
Out-of-domain transfer: maintains performance

Training configuration:

Batch size: 64-128
Learning rate: 1e-4 for reasoner, 0 for embedder
Epochs: 10-20
PPO clip ratio: 0.2
Entropy coefficient: 0.01
Max sequence length: 512

Inference:

Generate T-CoT for both query and candidates
Weight retrieval by reasoning strength
Cache embeddings for efficiency
Use reasoning for result explanation

Adoption

ADu2021/embed-rl

$ install --global

Security Scan Results

SKILL.md

Embed-RL: RL-Driven Reasoning for Multimodal Embeddings

Problem Context

Core Concept

Architecture Overview

Implementation

Practical Guidance

Reference

Related Skills

ADu2021/flow-map-trajectory-tilting

ADu2021/flexible-data-mixture-of-experts

ADu2021/flexibility-trap-diffusion-reasoning

ADu2021/flex-continuous-agent-evolution

ADu2021/embed-rl

$ install --global

Security Scan Results

SKILL.md

Embed-RL: RL-Driven Reasoning for Multimodal Embeddings

Problem Context

Core Concept

Architecture Overview

Implementation

Practical Guidance

Reference

Related Skills

ADu2021/flow-map-trajectory-tilting

ADu2021/flexible-data-mixture-of-experts

ADu2021/flexibility-trap-diffusion-reasoning

ADu2021/flex-continuous-agent-evolution