skills/skillxiv-v0.0.2-claude-opus-4.6/embed-rl/SKILL.md
Improve multimodal embeddings through RL-optimized reasoning that grounds evidence in retrievable visual cues. Frozen embedder provides stable rewards while reasoner generates evidential traceability CoT with text keywords, bounding boxes, and key frames.
Standard multimodal embeddings rely purely on contrastive learning and do not reason about why items are similar. Methods that add reasoning often generate text-only explanations that are misaligned with retrieval. Jointly training the embedder and the reasoner produces conflicting gradients, and cross-modal retrieval remains brittle on queries that need complex multimodal reasoning. The challenge is to improve embedding quality through reasoning that is explicitly grounded in retrievable evidence.
Embed-RL adds RL-optimized reasoning to multimodal embeddings through a two-component system: a frozen contrastive embedder (the reward provider) and a reasoner (the RL-optimized policy). The reasoner generates a "Traceability CoT" (T-CoT) that explicitly grounds its reasoning in multimodal evidence: text keywords, bounding boxes for image regions, and key frames for video.
Rather than training both components jointly, the frozen embedder acts as a stable reward signal. The reasoner is optimized to improve retrieval accuracy while explaining its reasoning in traceable, multimodal terms.
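As a rough illustration of what such a trace might contain, the sketch below builds one T-CoT record as a plain dictionary. The key names follow the evidence keys used in this skill's code; the keyword strings, box coordinates, and frame indices are made-up values, and `evidence_modalities` is a hypothetical helper added only for this example.

```python
# Hypothetical T-CoT record; all values are illustrative, not real outputs.
example_t_cot = {
    "type": "evidential_traceability_cot",
    "evidence": {
        "text_keywords": ["red umbrella", "rainy street"],
        "bbox_2d": {
            "query_bbox_2d": [0.10, 0.20, 0.30, 0.40],   # [x, y, w, h], normalized
            "target_bbox_2d": [0.15, 0.25, 0.30, 0.35],
        },
        "key_frames": {
            "query_frame_indices": [12],
            "target_frame_indices": [48],
        },
    },
}


def evidence_modalities(t_cot):
    """Return which evidence types a T-CoT record carries, in a fixed order."""
    order = ["text_keywords", "bbox_2d", "key_frames"]
    return [key for key in order if key in t_cot.get("evidence", {})]


print(evidence_modalities(example_t_cot))
```

A record with only image evidence would list just `bbox_2d`, which is the kind of partial trace a format check can penalize.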
Two-component architecture:
import torch
import torch.nn as nn


class EmbedRLSystem(nn.Module):
    """
    Two-component system: frozen embedder (reward) + reasoner (RL-optimized).
    """

    def __init__(self, embedder_model, reasoner_model):
        super().__init__()
        # Frozen embedder for reward computation
        self.embedder = embedder_model
        self.embedder.eval()
        for param in self.embedder.parameters():
            param.requires_grad = False
        # RL-optimized reasoner
        self.reasoner = reasoner_model
        self.reasoner.train()

    def train(self, mode=True):
        # Keep the embedder in eval mode even when system.train() is called
        super().train(mode)
        self.embedder.eval()
        return self

    def forward(self, query, target, modalities):
        """
        Compute embeddings and reasoning for a query-target pair.
        The reasoner explains why the pair is similar or different.
        """
        # Get embeddings from the frozen embedder; no gradients are needed
        with torch.no_grad():
            query_embedding = self.embedder.embed(query)
            target_embedding = self.embedder.embed(target)
        # Generate reasoning (T-CoT) from the trainable reasoner
        reasoning = self.reasoner(query, target, modalities)
        return {
            'query_embedding': query_embedding,
            'target_embedding': target_embedding,
            'reasoning': reasoning
        }
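To make the reward-provider role of the frozen embedder more concrete, here is a minimal sketch in plain Python, assuming embeddings are already available as lists of floats. It scores a target against a candidate pool by cosine similarity and turns the target's rank into a reward, mirroring the `1 / (1 + rank)` form used in the outcome reward later in this skill. The function names and the toy vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def retrieval_reward(query_emb, target_emb, candidate_embs):
    """Reward 1 / (1 + k), where k candidates score strictly above the target."""
    target_score = cosine_similarity(query_emb, target_emb)
    better = sum(
        1 for cand in candidate_embs
        if cosine_similarity(query_emb, cand) > target_score
    )
    return 1.0 / (1.0 + better)


# Toy 2-D embeddings (illustrative values only)
query = [1.0, 0.0]
target = [0.8, 0.6]
candidates = [[1.0, 0.1], [0.0, 1.0], [0.8, 0.6]]
print(retrieval_reward(query, target, candidates))
```

In this toy pool one candidate outranks the target, so the reward is 0.5. Because the embedder never changes, the same inputs always yield the same reward, which is what keeps the signal stable while the reasoner trains.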
Traceability CoT generation:
class TraceabilityCoTGenerator(nn.Module):
    """
    Generate reasoning grounded in retrievable multimodal evidence.
    Explicitly references text, visual locations, temporal moments.
    Assumes unbatched feature vectors of shape (hidden_dim,).
    """

    def __init__(self, hidden_dim=512):
        super().__init__()
        self.hidden_dim = hidden_dim
        # Each head reads concatenated [query, target] features
        pair_dim = 2 * hidden_dim
        # Text extractor: identify keywords
        self.text_extractor = nn.Sequential(
            nn.Linear(pair_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 100)  # Logits over 100 candidate keywords
        )
        # Visual extractor: localize bounding boxes
        self.bbox_generator = nn.Sequential(
            nn.Linear(pair_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 4)  # [x, y, w, h]
        )
        # Temporal extractor: identify key frames
        self.keyframe_extractor = nn.Sequential(
            nn.Linear(pair_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1)  # Relative frame position before sigmoid
        )
        # Reasoning composer (not used by the extractors below)
        self.reasoning_composer = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(hidden_dim, 8),
            num_layers=2
        )

    def extract_text_keywords(self, query_features, target_features):
        """
        Extract text-based keywords explaining similarity.
        Returns: list of relevant keywords
        """
        combined = torch.cat([query_features, target_features], dim=-1)
        keyword_logits = self.text_extractor(combined)
        # Get the indices of the top 5 keywords
        top_k_indices = torch.topk(keyword_logits, k=5).indices
        # get_keyword is an external helper (index -> string), not defined here
        keywords = [get_keyword(idx.item()) for idx in top_k_indices]
        return keywords

    def extract_spatial_locations(self, query_image, target_image,
                                  query_features, target_features):
        """
        Generate bounding boxes localizing relevant image regions.
        """
        # Swap the feature order per side so the two boxes are not identical
        query_pair = torch.cat([query_features, target_features], dim=-1)
        target_pair = torch.cat([target_features, query_features], dim=-1)
        # Query image bounding box, normalized to [0, 1]
        query_bbox = torch.sigmoid(self.bbox_generator(query_pair))
        # Target image bounding box, normalized to [0, 1]
        target_bbox = torch.sigmoid(self.bbox_generator(target_pair))
        return {
            'query_bbox_2d': query_bbox,
            'target_bbox_2d': target_bbox,
            'query_image': query_image,
            'target_image': target_image
        }

    def extract_key_frames(self, query_video, target_video,
                           query_features, target_features):
        """
        Identify critical frames in videos explaining similarity.
        """
        query_pair = torch.cat([query_features, target_features], dim=-1)
        target_pair = torch.cat([target_features, query_features], dim=-1)
        # Key frame index for query video, clamped to the last valid frame
        query_pos = torch.sigmoid(self.keyframe_extractor(query_pair))
        query_keyframe_idx = min(
            int(query_pos * query_video.num_frames),
            query_video.num_frames - 1)
        # Key frame index for target video, clamped the same way
        target_pos = torch.sigmoid(self.keyframe_extractor(target_pair))
        target_keyframe_idx = min(
            int(target_pos * target_video.num_frames),
            target_video.num_frames - 1)
        return {
            'query_key_frames': [
                query_video.get_frame(query_keyframe_idx)
            ],
            'target_key_frames': [
                target_video.get_frame(target_keyframe_idx)
            ],
            'query_frame_indices': [query_keyframe_idx],
            'target_frame_indices': [target_keyframe_idx]
        }

    def generate_traceability_cot(self, query, target, modalities,
                                  query_features, target_features):
        """
        Compose complete T-CoT with multimodal grounding.
        """
        t_cot = {
            'type': 'evidential_traceability_cot',
            'evidence': {}
        }
        # Text evidence
        if 'text' in modalities:
            keywords = self.extract_text_keywords(
                query_features, target_features)
            t_cot['evidence']['text_keywords'] = keywords
        # Visual evidence
        if 'image' in modalities:
            spatial = self.extract_spatial_locations(
                query.get('image'), target.get('image'),
                query_features, target_features)
            t_cot['evidence']['bbox_2d'] = spatial
        # Temporal evidence
        if 'video' in modalities:
            temporal = self.extract_key_frames(
                query.get('video'), target.get('video'),
                query_features, target_features)
            t_cot['evidence']['key_frames'] = temporal
        return t_cot
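The generator above calls `get_keyword`, which this skill does not define, and it converts a sigmoid output into a frame index by scaling with `num_frames`. The sketch below shows one plausible shape for both pieces in plain Python. The vocabulary contents, the `KEYWORD_VOCAB` name, and the `to_frame_index` helper are assumptions for illustration; a real vocabulary would need 100 entries to match the 100-way keyword head.

```python
# Hypothetical keyword vocabulary (shortened for the example).
KEYWORD_VOCAB = ["dog", "beach", "sunset", "bicycle", "umbrella"]


def get_keyword(index, vocab=KEYWORD_VOCAB):
    """Map a keyword index to its string; raise on out-of-range indices."""
    if not 0 <= index < len(vocab):
        raise IndexError(
            f"keyword index {index} outside vocabulary of size {len(vocab)}")
    return vocab[index]


def to_frame_index(fraction, num_frames):
    """Convert a relative position in [0, 1] to a valid frame index."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    index = int(fraction * num_frames)
    # Clamp so that fraction == 1.0 maps to the last frame, not one past it
    return min(max(index, 0), num_frames - 1)


print(get_keyword(2))
print(to_frame_index(0.5, 30))
print(to_frame_index(1.0, 30))
```

The clamp matters at the upper edge: a position of exactly 1.0 on a 30-frame clip would otherwise yield index 30, which is one past the final frame.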
Dual-reward mechanism for RL:
class DualRewardComputation:
    """
    Reward for RL training, combining three components:
    1. Format compliance: is the T-CoT properly structured?
    2. Process alignment: does the reasoning match the query-target relationship?
    3. Outcome effectiveness: does the reasoning improve retrieval?
    Embeddings passed in are expected to be per-item vectors, not batches.
    """

    def __init__(self, embedder):
        self.embedder = embedder

    def compute_format_compliance_reward(self, t_cot):
        """
        Measure if T-CoT has proper structure.
        Reward well-formed, complete reasoning.
        """
        required_fields = ['text_keywords', 'bbox_2d', 'key_frames']
        evidence = t_cot['evidence']
        present_fields = sum(
            1 for field in required_fields if field in evidence)
        completeness = present_fields / len(required_fields)
        # Also check field validity
        validity_score = 1.0
        if 'text_keywords' in evidence:
            keywords = evidence['text_keywords']
            if not (1 <= len(keywords) <= 10):
                validity_score *= 0.5
        if 'bbox_2d' in evidence:
            # 'bbox_2d' is a dict; check that both boxes lie in [0, 1]
            boxes = [evidence['bbox_2d'][key]
                     for key in ('query_bbox_2d', 'target_bbox_2d')]
            if not all(bool(((b >= 0) & (b <= 1)).all()) for b in boxes):
                validity_score *= 0.5
        return completeness * validity_score

    def compute_process_alignment_reward(self, query, target, t_cot,
                                         query_emb, target_emb):
        """
        Measure if reasoning correctly aligns with query-target relationship.
        """
        # Text similarity from keywords (compute_text_similarity is an
        # external helper, not defined in this skill)
        text_sim = compute_text_similarity(
            query, target, t_cot['evidence'].get('text_keywords', []))
        # Visual similarity from bboxes
        visual_sim = 1.0
        if 'bbox_2d' in t_cot['evidence']:
            query_bbox = t_cot['evidence']['bbox_2d']['query_bbox_2d']
            target_bbox = t_cot['evidence']['bbox_2d']['target_bbox_2d']
            # Overlap measure (compute_bbox_overlap is also external)
            visual_sim = compute_bbox_overlap(query_bbox, target_bbox)
        # Alignment with embeddings
        embedding_sim = torch.cosine_similarity(query_emb, target_emb, dim=-1)
        # Combined alignment, clamped to [0, 1]
        alignment = (text_sim + visual_sim + embedding_sim) / 3.0
        return float(torch.as_tensor(alignment).clamp(0, 1))

    def compute_outcome_effectiveness_reward(self, query, target, t_cot,
                                             query_emb, target_emb,
                                             batch_targets):
        """
        Measure if reasoning actually improves retrieval accuracy.
        """
        # Baseline: embedding similarity without reasoning
        baseline_score = torch.cosine_similarity(
            query_emb, target_emb, dim=-1)
        # Heuristic: more evidence types give the target a larger boost
        reasoning_strength = len(t_cot['evidence']) / 3.0
        adjusted_score = baseline_score * (1.0 + 0.5 * reasoning_strength)
        # Score every candidate with the frozen embedder (no gradients)
        with torch.no_grad():
            scores = torch.stack([
                torch.cosine_similarity(
                    query_emb, self.embedder.embed(candidate), dim=-1)
                for candidate in batch_targets])
        # Rank of target: number of candidates above the adjusted score
        target_rank = (scores > adjusted_score).sum().item()
        # Reward: higher for better rank
        rank_reward = 1.0 / (1.0 + target_rank)
        return float(rank_reward)

    def compute_combined_reward(self, query, target, t_cot,
                                query_emb, target_emb, batch_targets):
        """
        Combine three reward components.
        """
        format_reward = self.compute_format_compliance_reward(t_cot)
        alignment_reward = self.compute_process_alignment_reward(
            query, target, t_cot, query_emb, target_emb)
        outcome_reward = self.compute_outcome_effectiveness_reward(
            query, target, t_cot, query_emb, target_emb, batch_targets)
        # Weighted combination
        combined = (
            0.2 * format_reward +
            0.3 * alignment_reward +
            0.5 * outcome_reward
        )
        return {
            'format_reward': format_reward,
            'alignment_reward': alignment_reward,
            'outcome_reward': outcome_reward,
            'combined_reward': combined
        }
RL training loop:
def train_with_rl(system, reward_computer, train_pairs, num_epochs=10):
    """
    Train the reasoner with RL while keeping the embedder frozen.
    Uses a REINFORCE-style policy-gradient loss with an entropy bonus.
    """
    optimizer = torch.optim.Adam(system.reasoner.parameters(), lr=1e-4)
    modalities = ['text', 'image', 'video']
    for epoch in range(num_epochs):
        epoch_loss = 0.0
        num_batches = 0
        for batch_idx, (queries, targets, batch_targets) in enumerate(
                train_pairs):
            # Forward pass
            outputs = system(queries, targets, modalities=modalities)
            reasoning = outputs['reasoning']
            # Compute one reward per item, using that item's own embeddings
            rewards = []
            for query, target, t_cot, query_emb, target_emb in zip(
                    queries, targets, reasoning,
                    outputs['query_embedding'],
                    outputs['target_embedding']):
                reward_dict = reward_computer.compute_combined_reward(
                    query, target, t_cot, query_emb, target_emb,
                    batch_targets)
                rewards.append(reward_dict['combined_reward'])
            # Policy-gradient update (REINFORCE, not clipped PPO)
            log_probs = system.reasoner.get_log_prob(reasoning)
            rewards = torch.tensor(rewards, device=log_probs.device)
            policy_loss = -(rewards * log_probs).mean()
            # Entropy regularization
            entropy = system.reasoner.compute_entropy(reasoning)
            total_loss = policy_loss - 0.01 * entropy
            # Update
            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()
            epoch_loss += total_loss.item()
            num_batches += 1
            if (batch_idx + 1) % 100 == 0:
                print(f"Epoch {epoch}, Batch {batch_idx + 1}: "
                      f"Loss={total_loss.item():.4f}, "
                      f"Mean Reward={rewards.mean():.4f}")
        print(f"Epoch {epoch} completed. "
              f"Average Loss={epoch_loss / max(num_batches, 1):.4f}")
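The loss in the loop above, `-(rewards * log_probs).mean()`, is the REINFORCE policy-gradient estimator. As a toy illustration of why minimizing it shifts probability toward higher-reward outputs, the sketch below runs the same update on a two-armed bandit in plain Python. The bandit, its reward values, and the function names are assumptions chosen for this example and have nothing to do with the actual reasoner.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards, steps=2000, lr=0.1, seed=0):
    """Run REINFORCE on a bandit with fixed arm rewards; return the final policy."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        # d log pi(action) / d logit_i = 1[i == action] - probs[i].
        # Gradient descent on -(reward * log_prob) is ascent on this term.
        for i in range(len(logits)):
            indicator = 1.0 if i == action else 0.0
            logits[i] += lr * reward * (indicator - probs[i])
    return softmax(logits)


final_policy = reinforce_bandit(rewards=[0.2, 1.0])
print(final_policy)
```

After training, the policy puts more probability on the arm that pays 1.0. Clipped PPO would add an old-policy probability ratio and a clipping term on top of this basic estimator.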
When to use:
Implementation steps:
Modality-specific considerations:
Reward tuning:
Expected improvements:
Training configuration:
Inference:
Decoupling embedder and reasoner through RL enables grounding reasoning in retrievable evidence while maintaining stable reward signals. Traceability CoT provides interpretable, multimodal explanations that directly improve retrieval performance across modalities.