skills/skillxiv-v0.0.2-claude-opus-4.6/concerto-joint-learning/SKILL.md
Learns richer spatial representations by training on both 2D and 3D data simultaneously. Combines 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding, achieving 14.2% and 4.8% improvements over single-modality baselines in scene understanding and geometric consistency.
npx skillsauth add ADu2021/skillXiv concerto-joint-learningInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Human spatial understanding emerges from multiple sensory modalities simultaneously. Concerto mirrors this by learning from paired 2D images and 3D point clouds, discovering representations richer than either modality alone.
The joint learning approach creates spatial features with superior geometric and semantic consistency, improving downstream scene understanding tasks.
Key insight: simultaneous 2D-3D training creates more coherent spatial concepts than training separately. Concerto uses:
Create dual-stream encoders for 2D and 3D data. Each stream learns modality-specific representations while staying aligned through contrastive loss:
class ConcertoEncoder(nn.Module):
def __init__(self, feature_dim=256):
super().__init__()
# 2D encoder (e.g., Vision Transformer)
self.image_encoder = ViT(
patch_size=16,
num_layers=12,
hidden_dim=768
)
# 3D encoder (e.g., PointNet++)
self.pointcloud_encoder = PointNet(
num_layers=4,
feature_dim=feature_dim
)
# Projection heads for alignment
self.image_proj = nn.Sequential(
nn.Linear(768, 512),
nn.ReLU(),
nn.Linear(512, feature_dim)
)
self.pc_proj = nn.Sequential(
nn.Linear(feature_dim, 512),
nn.ReLU(),
nn.Linear(512, feature_dim)
)
def forward(self, images, point_clouds):
# Encode both modalities
image_feat = self.image_encoder(images)
pc_feat = self.pointcloud_encoder(point_clouds)
# Project to common space
image_proj = self.image_proj(image_feat)
pc_proj = self.pc_proj(pc_feat)
return image_proj, pc_proj
Implement cross-modal contrastive learning that aligns 2D and 3D representations. Instances from the same scene should have similar features:
class CrossModalContrastiveLoss(nn.Module):
def __init__(self, temperature=0.07):
super().__init__()
self.temperature = temperature
def forward(self, image_proj, pc_proj):
"""Contrastive loss between 2D and 3D projections."""
# Normalize features
image_proj = torch.nn.functional.normalize(image_proj, dim=-1)
pc_proj = torch.nn.functional.normalize(pc_proj, dim=-1)
# Compute similarity matrix
batch_size = image_proj.shape[0]
logits = torch.mm(image_proj, pc_proj.t()) / self.temperature
# Labels: diagonal elements are positives (same scene)
labels = torch.arange(batch_size, device=image_proj.device)
# Symmetric cross-entropy loss
loss_i2p = torch.nn.functional.cross_entropy(logits, labels)
loss_p2i = torch.nn.functional.cross_entropy(logits.t(), labels)
return (loss_i2p + loss_p2i) / 2
Implement 3D intra-modal self-distillation using a momentum encoder for point clouds. This captures 3D geometry better than cross-modal learning alone:
class PointCloud3DDistillation(nn.Module):
def __init__(self, encoder, momentum=0.999):
super().__init__()
self.encoder = encoder
self.momentum_encoder = copy.deepcopy(encoder)
self.momentum = momentum
# Freeze momentum encoder
for param in self.momentum_encoder.parameters():
param.requires_grad = False
def forward(self, point_clouds, aug_point_clouds):
"""Self-distillation within 3D modality."""
# Online encoder
feat = self.encoder(point_clouds)
# Momentum encoder (no gradient)
with torch.no_grad():
feat_momentum = self.momentum_encoder(aug_point_clouds)
# Knowledge distillation loss
loss = torch.nn.functional.mse_loss(feat, feat_momentum.detach())
# Update momentum encoder
self._update_momentum_encoder()
return loss
def _update_momentum_encoder(self):
"""Update momentum encoder with EMA."""
for param, momentum_param in zip(
self.encoder.parameters(),
self.momentum_encoder.parameters()
):
momentum_param.data = (
self.momentum * momentum_param.data +
(1 - self.momentum) * param.data
)
| Parameter | Recommendation | |-----------|-----------------| | Feature dimension | 256 (balance capacity and efficiency) | | Temperature | 0.07-0.1 (standard for contrastive learning) | | Momentum coefficient | 0.999 (slow teacher updates) | | Cross-modal weight | 1.0, 3D distillation weight | 0.5 | | Batch size | 256-512 (contrastive learning needs diverse negatives) |
When to use:
When NOT to use:
Common pitfalls:
Reference: Concerto on arXiv
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.