skills/bridging-modality-gap-roadside/SKILL.md
Build training-free pipelines that convert sparse 3D LiDAR point clouds into depth-encoded 2D images for classification by Vision-Language Models (CLIP, etc.). Covers the full workflow: point cloud denoising, temporal frame fusion, canonical orientation, orthographic projection, morphological cleanup, bilateral smoothing, and few-shot VLM prompting with semantic anchoring. Use when: 'classify vehicles from roadside LiDAR', 'convert point clouds to images for VLM', 'few-shot 3D object classification without training', 'bridge LiDAR to vision-language model', 'bootstrap labeled dataset from unlabeled LiDAR', 'cold start vehicle classifier from point clouds'.
npx skillsauth add ndpvt-web/arxiv-claude-skills bridging-modality-gap-roadside
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to design and implement pipelines that transform sparse 3D LiDAR point cloud data into clean, depth-encoded 2D images suitable for classification by off-the-shelf Vision-Language Models (e.g., CLIP ViT-L/14) — entirely without model fine-tuning. The core technique, from Li et al. (2026), achieves competitive F1 scores (~0.62) on 20-class vehicle classification using only 16–30 examples per class, and can bootstrap fully supervised models (PointNet, F1 ~0.71) via a Cold Start strategy.
The Modality Bridge. Vision-Language Models like CLIP excel at few-shot image classification but cannot ingest raw 3D point clouds. This framework bridges that gap with a six-stage deterministic pipeline: (1) statistical outlier removal on each LiDAR frame, (2) temporal fusion of multiple frames via probabilistic registration (FilterReg), (3) canonical orientation alignment so every vehicle faces the same direction, (4) orthographic projection onto the YZ plane with minimum-depth encoding, (5) morphological opening to remove projection artifacts, and (6) bilateral filtering for edge-preserving smoothing. The output is a single-channel grayscale image where pixel intensity encodes lateral depth — visually resembling a side-view silhouette with depth shading.
Few-Shot VLM Prompting with Semantic Anchoring. The generated depth images are fed to CLIP (ViT-L/14 recommended) using few-shot in-context learning. For each class, k support examples (k=1..30) are encoded into visual prototypes. An optional text embedding (class description) can be fused with visual prototypes: p_fused = (1-w) * p_visual + w * e_text. A critical finding is the Semantic Anchor effect: at ultra-low shot counts (k < 4), text guidance at weight w=0.2 regularizes predictions and prevents collapse; but beyond k=8, text embeddings degrade accuracy due to semantic mismatch between natural-language vehicle descriptions and depth-image appearance. Set w=0 for k >= 8.
Cold Start Bootstrapping. The training-free VLM pipeline generates pseudo-labels for unlabeled LiDAR data. These labels (even at F1 ~0.62) are sufficient to train a lightweight supervised 3D classifier (e.g., PointNet) that achieves F1=0.71 with 4ms inference, fast enough for real-time deployment. This removes the expensive manual annotation phase, apart from the small few-shot support set.
Ingest and parse raw LiDAR frames. Load point cloud data (.pcd, .ply, .bin, or .npy format) for each tracked object. Each frame is a set of (x, y, z) coordinates from sensors like Velodyne VLP-32c. Group frames by object/vehicle track ID.
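Apply statistical outlier removal (SOR) per frame. For each frame: voxel-downsample at 0.05m resolution, compute the mean distance μ_i from each point to its k nearest neighbors, then remove points where μ_i > μ_global + α·σ_global (α=1.0), where μ_global and σ_global are the mean and standard deviation of μ_i over the whole frame. Use Open3D's remove_statistical_outlier (its std_ratio argument plays the role of α) or implement it manually.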
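The manual variant of the SOR rule can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: the function name `statistical_outlier_mask` and its parameters are assumptions, the voxel downsampling step is omitted, and the brute-force distance matrix only suits the small per-vehicle clouds discussed here.

```python
import numpy as np

def statistical_outlier_mask(points, nb_neighbors=20, alpha=1.0):
    """Return a boolean inlier mask for an (N, 3) point array.

    Hypothetical helper: keeps points whose mean k-NN distance mu_i
    satisfies mu_i <= mu_global + alpha * sigma_global.
    """
    n = len(points)
    k = min(nb_neighbors, n - 1)
    if k < 1:
        return np.ones(n, dtype=bool)
    # Brute-force O(N^2) pairwise distances; use a KD-tree for large clouds
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # After sorting each row, column 0 is the zero self-distance, so skip it
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    mu = knn.mean(axis=1)  # mean neighbor distance per point
    threshold = mu.mean() + alpha * mu.std()
    return mu <= threshold

# Usage: a tight cluster plus one far-away stray return
rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 0.1, size=(200, 3))
frame = np.vstack([cluster, [[10.0, 10.0, 10.0]]])
inliers = frame[statistical_outlier_mask(frame, nb_neighbors=10)]
```

Lowering `alpha` tightens the threshold and removes more points, which is the same lever as Open3D's `std_ratio`.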
Fuse temporal frames via probabilistic registration. Align and merge all filtered frames for the same tracked vehicle into a single dense point cloud. Use FilterReg (Gao & Tedrake, 2019) or ICP as a fallback. This compensates for the sparsity of individual LiDAR sweeps by accumulating geometry across time.
Rectify orientation to a canonical coordinate system. Compute the ground-plane normal (Z-axis), negate the normalized motion vector for the Y-axis (longitudinal), and derive X-axis via cross product. Apply the rotation matrix so every vehicle is oriented consistently — front facing the same direction, side profile visible from the YZ plane.
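The axis construction in this step can be sketched as follows. It is a sketch under stated assumptions: the names `canonical_rotation` and `rectify` are hypothetical, the motion vector is first projected onto the ground plane so the axes are exactly orthogonal, and the cloud is centered on its centroid before rotating. None of these details are confirmed by the source.

```python
import numpy as np

def canonical_rotation(ground_normal, motion_vector):
    """Build a 3x3 rotation whose rows are the canonical X, Y, Z axes."""
    z = np.asarray(ground_normal, dtype=float)
    z = z / np.linalg.norm(z)
    m = np.asarray(motion_vector, dtype=float)
    # Assumption: drop the vertical component of the motion so Y is perpendicular to Z
    m = m - np.dot(m, z) * z
    norm = np.linalg.norm(m)
    if norm < 1e-9:
        raise ValueError("motion vector is parallel to the ground normal")
    y = -m / norm        # negated, normalized motion direction (longitudinal axis)
    x = np.cross(y, z)   # right-handed frame: x = y cross z (lateral axis)
    return np.stack([x, y, z])

def rectify(points, ground_normal, motion_vector):
    """Express (N, 3) points in the canonical frame, centered on the centroid."""
    R = canonical_rotation(ground_normal, motion_vector)
    centered = points - points.mean(axis=0)  # assumption: centroid as origin
    return centered @ R.T

# Usage: a vehicle moving along world +x on flat ground
R = canonical_rotation([0.0, 0.0, 1.0], [1.0, 0.0, 0.0])
```

After rectification, column 0 of the result is the lateral depth that the projection step encodes, and columns 1 and 2 span the side-view image plane.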
Project orthographically onto the YZ plane with minimum-depth encoding. Create a 2D image grid over the YZ extent of the point cloud. For each pixel (y, z), assign intensity = min(x) across all points within spatial tolerance δ of that grid cell. This produces a single-channel grayscale depth image — a side-view silhouette where brightness encodes lateral distance.
Apply morphological opening. Perform erosion then dilation using an elliptical structuring element of size s×s (tune s based on image resolution, typically 3–5 pixels). This removes thin projection artifacts and isolated noise pixels while preserving vehicle shape. Use OpenCV's cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel).
Apply bilateral filtering for edge-preserving smoothing. Smooth the image using bilateral filtering: cv2.bilateralFilter(img, d, sigma_color, sigma_space). Set sigma_space to control spatial smoothing extent and sigma_color to preserve depth edges (typical starting points: sigma_space=10, sigma_color=50). This produces the final clean depth proxy image.
Encode images and classify with CLIP. Load a pretrained CLIP model (ViT-L/14 recommended). Encode k support images per class into visual prototype embeddings (mean pooling). For k < 4, fuse with text embeddings at w=0.2: p_fused = 0.8 * p_visual + 0.2 * encode_text(class_description). For k >= 8, use pure visual prototypes (w=0). Classify query images by nearest prototype in embedding space (cosine similarity).
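The prototype construction and nearest-prototype rule can be illustrated independently of CLIP. The sketch below assumes the image and text embeddings have already been computed and are passed in as NumPy arrays; `build_prototypes`, `classify`, and the toy 2-D vectors are hypothetical stand-ins, not part of any library.

```python
import numpy as np

def l2_normalize(v):
    """Scale vectors to unit length along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def build_prototypes(support_embeds, text_embeds=None, w=0.0):
    """Mean-pool k support embeddings per class, optionally fusing text.

    support_embeds: {class: (k, D) array}; text_embeds: {class: (D,) array}.
    Fusion follows p_fused = (1 - w) * p_visual + w * e_text.
    """
    prototypes = {}
    for cls, embeds in support_embeds.items():
        p = l2_normalize(l2_normalize(embeds).mean(axis=0))
        if w > 0 and text_embeds is not None and cls in text_embeds:
            p = l2_normalize((1 - w) * p + w * l2_normalize(text_embeds[cls]))
        prototypes[cls] = p
    return prototypes

def classify(query_embed, prototypes):
    """Return the class with the highest cosine similarity, plus all scores."""
    q = l2_normalize(query_embed)
    sims = {cls: float(q @ p) for cls, p in prototypes.items()}
    return max(sims, key=sims.get), sims

# Usage with toy 2-D "embeddings" (real CLIP ViT-L/14 embeddings are 768-D)
support = {
    "a": np.array([[1.0, 0.0], [0.9, 0.1]]),
    "b": np.array([[0.0, 1.0], [0.1, 0.9]]),
}
protos = build_prototypes(support)
label, sims = classify(np.array([0.8, 0.2]), protos)
```

Re-normalizing after fusion keeps the dot product a true cosine similarity, so scores stay comparable between fused and purely visual prototypes.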
Evaluate with F1-score over multiple runs. Run classification 10 times with different random support-set draws. Report mean F1 and standard deviation across runs. Inspect confusion matrices for systematic misclassifications between visually similar classes.
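The repeated-run protocol can be sketched like this. The source does not say which F1 averaging is used, so macro-averaging here is an assumption, and `run_fn` is a hypothetical callback that draws one random support set and returns the labels and predictions for that run.

```python
import numpy as np

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 (assumption: macro averaging)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples scores 0
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

def evaluate_runs(run_fn, n_runs=10, seed=0):
    """Call run_fn(rng) -> (y_true, y_pred, classes) n_runs times.

    Returns (mean F1, standard deviation of F1) across runs.
    """
    rng = np.random.default_rng(seed)
    f1s = [macro_f1(*run_fn(rng)) for _ in range(n_runs)]
    return float(np.mean(f1s)), float(np.std(f1s))

# Usage: one misclassified "a" sample out of four
score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"], ["a", "b"])
```

Passing a single seeded generator into every run keeps the support-set draws different from each other but reproducible as a whole.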
Optionally bootstrap a supervised model (Cold Start). Use VLM-generated labels as pseudo-ground-truth to train a lightweight 3D classifier (PointNet, DGCNN, or a simple ViT fine-tune) on the original point clouds. This yields faster inference (~4ms vs. VLM latency) and higher accuracy (F1 ~0.71) while requiring zero manual annotation.
Example 1: Classify roadside LiDAR vehicle tracks with CLIP
User: "I have Velodyne LiDAR point cloud sequences for vehicles passing a highway checkpoint. I want to classify them into truck types (container sizes, tankers, platforms, etc.) without training a model."
Approach:
Load .pcd frame sequences grouped by vehicle track ID using Open3D.

Output:

```python
import open3d as o3d
import numpy as np
import cv2
import clip
import torch

# Statistical outlier removal: denoise a single frame
def denoise_frame(pcd, voxel_size=0.05, alpha=1.0):
    pcd_down = pcd.voxel_down_sample(voxel_size)
    # std_ratio plays the role of alpha in the SOR threshold
    _, ind = pcd_down.remove_statistical_outlier(nb_neighbors=20, std_ratio=alpha)
    return pcd_down.select_by_index(ind)

# Orthographic depth projection onto the YZ plane
def project_to_depth_image(points, resolution=128, tolerance=0.05):
    """Project canonically oriented (N, 3) points to a uint8 depth image.

    Each pixel stores the minimum x (lateral depth) of the points within
    `tolerance` metres of its grid cell; empty pixels are 0.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    # resolution + 1 edges give exactly `resolution` cells per axis
    y_bins = np.linspace(y.min(), y.max(), resolution + 1)
    z_bins = np.linspace(z.min(), z.max(), resolution + 1)
    img = np.full((resolution, resolution), np.nan)
    for i in range(resolution):
        in_col = (y >= y_bins[i] - tolerance) & (y <= y_bins[i + 1] + tolerance)
        if not in_col.any():
            continue
        for j in range(resolution):
            mask = in_col & (z >= z_bins[j] - tolerance) & (z <= z_bins[j + 1] + tolerance)
            if mask.any():
                # Flip rows so larger z (higher up) lands nearer the image top
                img[resolution - 1 - j, i] = x[mask].min()
    # Track occupancy explicitly: x can be negative after rectification
    valid = ~np.isnan(img)
    out = np.zeros_like(img)
    if valid.any():
        d_min, d_max = img[valid].min(), img[valid].max()
        scale = (img[valid] - d_min) / (d_max - d_min) if d_max > d_min else 0.0
        # Map depth to 1..255 so occupied pixels stay distinct from empty ones
        out[valid] = 1 + 254 * scale
    return out.astype(np.uint8)

# Morphological opening + bilateral cleanup
def clean_depth_image(img, kernel_size=3, sigma_sp=10, sigma_c=50):
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    opened = cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)
    smoothed = cv2.bilateralFilter(opened, d=9, sigmaColor=sigma_c, sigmaSpace=sigma_sp)
    return smoothed

# CLIP few-shot classification
def classify_with_clip(query_img, support_sets, model, preprocess, device, w=0.0, text_embeds=None):
    """Return (best_class, cosine_similarity) for one query image.

    query_img is a PIL image (e.g. PIL.Image.fromarray(depth_img));
    support_sets maps class name -> visual prototype tensor;
    text_embeds optionally maps class name -> text embedding tensor.
    """
    query_tensor = preprocess(query_img).unsqueeze(0).to(device)
    with torch.no_grad():
        query_emb = model.encode_image(query_tensor)
    query_emb = query_emb.float().flatten()
    query_emb = query_emb / query_emb.norm()
    best_sim, best_class = -float("inf"), None
    for cls_name, prototype in support_sets.items():
        p = prototype.float().flatten().to(device)
        if w > 0 and text_embeds is not None and cls_name in text_embeds:
            t = text_embeds[cls_name].float().flatten().to(device)
            p = (1 - w) * p + w * t
        p = p / p.norm()
        sim = torch.dot(query_emb, p).item()
        if sim > best_sim:
            best_sim, best_class = sim, cls_name
    return best_class, best_sim
```
Example 2: Cold Start — bootstrap a PointNet model from VLM labels
User: "I classified 500 vehicles using the CLIP pipeline above. Now I want to train a fast model for real-time deployment."
Approach:
Output:
```python
import numpy as np

# Pseudo-label export format
pseudo_labels = {
    "track_001": {"points": np.array(...), "label": "40ft Container", "confidence": 0.87},
    "track_002": {"points": np.array(...), "label": "Bobtail", "confidence": 0.73},
    # ...
}

# Filter high-confidence labels for cleaner training data
train_data = {k: v for k, v in pseudo_labels.items() if v["confidence"] > 0.6}

# Train PointNet (using standard PointNet implementation)
# Expected result: F1 ~0.71, inference ~4ms per vehicle
```
Example 3: Tuning the Semantic Anchor weight
User: "I only have 2 examples per class. How should I set up the text fusion?"
Approach:
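p_fused = 0.8 * visual_prototype + 0.2 * text_embedding

Output:

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

# Text descriptions for semantic anchoring (k < 4)
class_descriptions = {
    "53ft Container": "A semi-trailer hauling a 53-foot intermodal shipping container with corrugated rectangular profile",
    "Bobtail": "A truck tractor driving without any attached trailer, showing exposed fifth wheel coupling",
    "Tank (Semi)": "A semi-trailer with a cylindrical tanker body for transporting liquids or gases",
    # ... one description per class
}

# Encode text and L2-normalize
text_tokens = clip.tokenize(list(class_descriptions.values())).to(device)
with torch.no_grad():
    text_embeds = model.encode_text(text_tokens)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
# Key the embeddings by class name for per-class lookup
text_embeds = dict(zip(class_descriptions, text_embeds))

k = 2  # support examples per class in this scenario
# w=0.2 for k < 4; this snippet falls back to w=0 from k=4 onward (w=0 is required for k >= 8)
w = 0.2 if k < 4 else 0.0
```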
| Problem | Cause | Fix |
|---------|-------|-----|
| All-black depth image | Point cloud extent doesn't match image grid bounds | Compute YZ bounding box from actual point extents; add 10% padding |
| Noisy silhouettes with scattered dots | Insufficient SOR filtering | Decrease α from 1.0 to 0.5 or increase k-nearest-neighbors count |
| All vehicles classified as same class | Too few support examples or bad projection | Verify depth images visually; increase k to at least 16; check orientation rectification |
| F1 drops when adding text (k > 4) | Semantic Anchor degradation | Set w=0 for k >= 4; text descriptions in natural language don't match depth-image visual features |
| Temporal registration diverges | Large frame-to-frame motion or occlusion | Use motion-compensated ICP with initial transform from tracker; discard frames with < 50 points |
| Classes with similar profiles confused | Ambiguous side-view geometry (e.g., enclosed van SU vs. semi) | Add a second view (top-down XY projection) as a separate CLIP input channel; ensemble two views |
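The fix for the all-black image (bounds taken from the actual point extents, plus padding) might look like the sketch below. The helper names `padded_yz_bounds` and `to_pixel_indices` are hypothetical, and reading "10% padding" as 10% of the extent added on each side is an assumption.

```python
import numpy as np

def padded_yz_bounds(points, pad_frac=0.10):
    """Return (lo, hi) YZ bounds of an (N, 3) cloud, padded on every side."""
    yz = points[:, 1:3]
    lo, hi = yz.min(axis=0), yz.max(axis=0)
    # Guard against a zero extent (e.g. all points sharing one y value)
    pad = pad_frac * np.maximum(hi - lo, 1e-6)
    return lo - pad, hi + pad

def to_pixel_indices(points, lo, hi, resolution=128):
    """Map each point to (row, col) pixel indices inside the padded grid."""
    frac = (points[:, 1:3] - lo) / (hi - lo)
    idx = np.clip((frac * resolution).astype(int), 0, resolution - 1)
    cols = idx[:, 0]
    rows = resolution - 1 - idx[:, 1]  # larger z maps nearer the image top
    return rows, cols

# Usage: a 10 m long, 2 m tall vehicle outline
pts = np.array([[0.0, 0.0, 0.0], [0.5, 10.0, 2.0], [0.2, 5.0, 1.0]])
lo, hi = padded_yz_bounds(pts)
rows, cols = to_pixel_indices(pts, lo, hi)
```

Because the grid is derived from the cloud itself, every point is guaranteed to land inside the image, with a margin that keeps the later morphological opening away from the border.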
Li, Y., Shang, B., & Wei, J. (2026). Bridging the Modality Gap in Roadside LiDAR: A Training-Free Vision-Language Model Framework for Vehicle Classification. arXiv:2602.09425v1. https://arxiv.org/abs/2602.09425v1
Key sections to consult: Algorithm 1 (full pipeline pseudocode), Table I (20-class taxonomy with descriptions), Table II (F1 comparison across shot counts and models), Figure 5 (Semantic Anchor weight ablation), Section IV-D (Cold Start strategy).