Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

ADu2021/concerto-joint-learning

Name: concerto-joint-learning
Author: ADu2021

skills/skillxiv-v0.0.2-claude-opus-4.6/concerto-joint-learning/SKILL.md

npx skillsauth add ADu2021/skillXiv concerto-joint-learning

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Concerto: Joint 2D-3D Self-Supervised Representation Learning

Human spatial understanding emerges from multiple sensory modalities simultaneously. Concerto mirrors this by learning from paired 2D images and 3D point clouds, discovering representations richer than either modality alone.

The joint learning approach creates spatial features with superior geometric and semantic consistency, improving downstream scene understanding tasks.

Core Concept

Key insight: simultaneous 2D-3D training creates more coherent spatial concepts than training separately. Concerto uses:

3D intra-modal self-distillation (learning within 3D point clouds)
2D-3D cross-modal joint embedding (aligning image and 3D representations)
Multi-view consistency to reinforce learned concepts

Architecture Overview

Separate encoders for 2D and 3D modalities
Cross-modal contrastive learning between image and 3D features
Intra-modal self-distillation within 3D representations
Multi-view consistency constraints across modalities

Implementation Steps

Create dual-stream encoders for 2D and 3D data. Each stream learns modality-specific representations while staying aligned through contrastive loss:

class ConcertoEncoder(nn.Module):
    def __init__(self, feature_dim=256):
        super().__init__()
        # 2D encoder (e.g., Vision Transformer)
        self.image_encoder = ViT(
            patch_size=16,
            num_layers=12,
            hidden_dim=768
        )
        # 3D encoder (e.g., PointNet++)
        self.pointcloud_encoder = PointNet(
            num_layers=4,
            feature_dim=feature_dim
        )

        # Projection heads for alignment
        self.image_proj = nn.Sequential(
            nn.Linear(768, 512),
            nn.ReLU(),
            nn.Linear(512, feature_dim)
        )
        self.pc_proj = nn.Sequential(
            nn.Linear(feature_dim, 512),
            nn.ReLU(),
            nn.Linear(512, feature_dim)
        )

    def forward(self, images, point_clouds):
        # Encode both modalities
        image_feat = self.image_encoder(images)
        pc_feat = self.pointcloud_encoder(point_clouds)

        # Project to common space
        image_proj = self.image_proj(image_feat)
        pc_proj = self.pc_proj(pc_feat)

        return image_proj, pc_proj

Implement cross-modal contrastive learning that aligns 2D and 3D representations. Instances from the same scene should have similar features:

class CrossModalContrastiveLoss(nn.Module):
    def __init__(self, temperature=0.07):
        super().__init__()
        self.temperature = temperature

    def forward(self, image_proj, pc_proj):
        """Contrastive loss between 2D and 3D projections."""
        # Normalize features
        image_proj = torch.nn.functional.normalize(image_proj, dim=-1)
        pc_proj = torch.nn.functional.normalize(pc_proj, dim=-1)

        # Compute similarity matrix
        batch_size = image_proj.shape[0]
        logits = torch.mm(image_proj, pc_proj.t()) / self.temperature

        # Labels: diagonal elements are positives (same scene)
        labels = torch.arange(batch_size, device=image_proj.device)

        # Symmetric cross-entropy loss
        loss_i2p = torch.nn.functional.cross_entropy(logits, labels)
        loss_p2i = torch.nn.functional.cross_entropy(logits.t(), labels)

        return (loss_i2p + loss_p2i) / 2

Implement 3D intra-modal self-distillation using a momentum encoder for point clouds. This captures 3D geometry better than cross-modal learning alone:

class PointCloud3DDistillation(nn.Module):
    def __init__(self, encoder, momentum=0.999):
        super().__init__()
        self.encoder = encoder
        self.momentum_encoder = copy.deepcopy(encoder)
        self.momentum = momentum

        # Freeze momentum encoder
        for param in self.momentum_encoder.parameters():
            param.requires_grad = False

    def forward(self, point_clouds, aug_point_clouds):
        """Self-distillation within 3D modality."""
        # Online encoder
        feat = self.encoder(point_clouds)

        # Momentum encoder (no gradient)
        with torch.no_grad():
            feat_momentum = self.momentum_encoder(aug_point_clouds)

        # Knowledge distillation loss
        loss = torch.nn.functional.mse_loss(feat, feat_momentum.detach())

        # Update momentum encoder
        self._update_momentum_encoder()

        return loss

    def _update_momentum_encoder(self):
        """Update momentum encoder with EMA."""
        for param, momentum_param in zip(
            self.encoder.parameters(),
            self.momentum_encoder.parameters()
        ):
            momentum_param.data = (
                self.momentum * momentum_param.data +
                (1 - self.momentum) * param.data
            )

Practical Guidance

| Parameter | Recommendation | |-----------|-----------------| | Feature dimension | 256 (balance capacity and efficiency) | | Temperature | 0.07-0.1 (standard for contrastive learning) | | Momentum coefficient | 0.999 (slow teacher updates) | | Cross-modal weight | 1.0, 3D distillation weight | 0.5 | | Batch size | 256-512 (contrastive learning needs diverse negatives) |

When to use:

Scene understanding tasks with both RGB and 3D data
Indoor/outdoor robotics requiring spatial reasoning
3D reconstruction or depth estimation
Multi-view geometry applications

When NOT to use:

Single modality data only (more efficient methods exist)
Tasks with abundant labeled data (supervised learning better)
Real-time inference with latency constraints (dual encoders add cost)

Common pitfalls:

Imbalanced modality contributions (one dominates)
Insufficient momentum encoder updates (stale teacher)
Cross-modal loss weight too low (poor alignment)
Not aligning 2D views with 3D geometrically (learns spurious correlations)

Reference: Concerto on arXiv

ADu2021/concerto-joint-learning

skills/skillxiv-v0.0.2-claude-opus-4.6/concerto-joint-learning/SKILL.md

Learns richer spatial representations by training on both 2D and 3D data simultaneously. Combines 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding, achieving 14.2% and 4.8% improvements over single-modality baselines in scene understanding and geometric consistency.

2 stars

data-ai

Updated Apr 17, 2026

$ install --global

skillsauth

npx skillsauth add ADu2021/skillXiv concerto-joint-learning

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 17, 2026, 5:30 AM30.8s1 file scanned

SKILL.md

name:: concerto-joint-learning
title:: Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
version:: 0.0.2
engine:: skillxiv-v0.0.2-claude-opus-4.6
license:: MIT
url:: https://arxiv.org/abs/2510.23607
keywords:: [Self-Supervised Learning, Multimodal, Spatial Understanding, 3D Vision]
description:: Learns richer spatial representations by training on both 2D and 3D data simultaneously. Combines 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding, achieving 14.2% and 4.8% improvements over single-modality baselines in scene understanding and geometric consistency.

Concerto: Joint 2D-3D Self-Supervised Representation Learning

The joint learning approach creates spatial features with superior geometric and semantic consistency, improving downstream scene understanding tasks.

Core Concept

Key insight: simultaneous 2D-3D training creates more coherent spatial concepts than training separately. Concerto uses:

3D intra-modal self-distillation (learning within 3D point clouds)
2D-3D cross-modal joint embedding (aligning image and 3D representations)
Multi-view consistency to reinforce learned concepts

Architecture Overview

Separate encoders for 2D and 3D modalities
Cross-modal contrastive learning between image and 3D features
Intra-modal self-distillation within 3D representations
Multi-view consistency constraints across modalities

Implementation Steps

Create dual-stream encoders for 2D and 3D data. Each stream learns modality-specific representations while staying aligned through contrastive loss:

class ConcertoEncoder(nn.Module):
    def __init__(self, feature_dim=256):
        super().__init__()
        # 2D encoder (e.g., Vision Transformer)
        self.image_encoder = ViT(
            patch_size=16,
            num_layers=12,
            hidden_dim=768
        )
        # 3D encoder (e.g., PointNet++)
        self.pointcloud_encoder = PointNet(
            num_layers=4,
            feature_dim=feature_dim
        )

        # Projection heads for alignment
        self.image_proj = nn.Sequential(
            nn.Linear(768, 512),
            nn.ReLU(),
            nn.Linear(512, feature_dim)
        )
        self.pc_proj = nn.Sequential(
            nn.Linear(feature_dim, 512),
            nn.ReLU(),
            nn.Linear(512, feature_dim)
        )

    def forward(self, images, point_clouds):
        # Encode both modalities
        image_feat = self.image_encoder(images)
        pc_feat = self.pointcloud_encoder(point_clouds)

        # Project to common space
        image_proj = self.image_proj(image_feat)
        pc_proj = self.pc_proj(pc_feat)

        return image_proj, pc_proj

Implement cross-modal contrastive learning that aligns 2D and 3D representations. Instances from the same scene should have similar features:

class CrossModalContrastiveLoss(nn.Module):
    def __init__(self, temperature=0.07):
        super().__init__()
        self.temperature = temperature

    def forward(self, image_proj, pc_proj):
        """Contrastive loss between 2D and 3D projections."""
        # Normalize features
        image_proj = torch.nn.functional.normalize(image_proj, dim=-1)
        pc_proj = torch.nn.functional.normalize(pc_proj, dim=-1)

        # Compute similarity matrix
        batch_size = image_proj.shape[0]
        logits = torch.mm(image_proj, pc_proj.t()) / self.temperature

        # Labels: diagonal elements are positives (same scene)
        labels = torch.arange(batch_size, device=image_proj.device)

        # Symmetric cross-entropy loss
        loss_i2p = torch.nn.functional.cross_entropy(logits, labels)
        loss_p2i = torch.nn.functional.cross_entropy(logits.t(), labels)

        return (loss_i2p + loss_p2i) / 2

Implement 3D intra-modal self-distillation using a momentum encoder for point clouds. This captures 3D geometry better than cross-modal learning alone:

class PointCloud3DDistillation(nn.Module):
    def __init__(self, encoder, momentum=0.999):
        super().__init__()
        self.encoder = encoder
        self.momentum_encoder = copy.deepcopy(encoder)
        self.momentum = momentum

        # Freeze momentum encoder
        for param in self.momentum_encoder.parameters():
            param.requires_grad = False

    def forward(self, point_clouds, aug_point_clouds):
        """Self-distillation within 3D modality."""
        # Online encoder
        feat = self.encoder(point_clouds)

        # Momentum encoder (no gradient)
        with torch.no_grad():
            feat_momentum = self.momentum_encoder(aug_point_clouds)

        # Knowledge distillation loss
        loss = torch.nn.functional.mse_loss(feat, feat_momentum.detach())

        # Update momentum encoder
        self._update_momentum_encoder()

        return loss

    def _update_momentum_encoder(self):
        """Update momentum encoder with EMA."""
        for param, momentum_param in zip(
            self.encoder.parameters(),
            self.momentum_encoder.parameters()
        ):
            momentum_param.data = (
                self.momentum * momentum_param.data +
                (1 - self.momentum) * param.data
            )

Practical Guidance

When to use:

Scene understanding tasks with both RGB and 3D data
Indoor/outdoor robotics requiring spatial reasoning
3D reconstruction or depth estimation
Multi-view geometry applications

When NOT to use:

Single modality data only (more efficient methods exist)
Tasks with abundant labeled data (supervised learning better)
Real-time inference with latency constraints (dual encoders add cost)

Common pitfalls:

Imbalanced modality contributions (one dominates)
Insufficient momentum encoder updates (stale teacher)
Cross-modal loss weight too low (poor alignment)
Not aligning 2D views with 3D geometrically (learns spurious correlations)

Reference: Concerto on arXiv

Related Skills

ADu2021/flow-map-trajectory-tilting

testing

VerifiedTrustedCommunity

Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flow-map-trajectory-tilting

ADu2021/flexible-data-mixture-of-experts

testing

VerifiedTrustedCommunity

Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flexible-data-mixture-of-experts

ADu2021/flexibility-trap-diffusion-reasoning

data-ai

VerifiedTrustedCommunity

Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flexibility-trap-diffusion-reasoning

ADu2021/flex-continuous-agent-evolution

devops

VerifiedTrustedCommunity

Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flex-continuous-agent-evolution

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/ADu2021/skillXiv.git

# Copy into Claude Code skills folder (global)
cp -r skillXiv/skills/skillxiv-v0.0.2-claude-opus-4.6/concerto-joint-learning ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

ADu2021/skillXiv

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT