skills/skillxiv-v0.0.2-claude-opus-4.6/bitvla-robot-control/SKILL.md
Build fully ternary quantized vision-language-action models for robotic manipulation, achieving 11x memory reduction and 4.4x speedup while maintaining task performance on edge devices.
npx skillsauth add ADu2021/skillXiv bitvla-robot-controlInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
BitVLA is the first fully native 1-bit quantized vision-language-action model where every parameter is ternary ({−1, 0, 1}), enabling efficient robotic policies on edge devices. Through a novel Quantize-then-Distill strategy, it compresses vision encoders to 1.58-bit weights while maintaining alignment with language understanding. The 2B parameter model achieves 11x memory reduction and 4.4x speedup compared to full-precision baselines, matching larger models on manipulation benchmarks while fitting on resource-constrained robot hardware.
Create the core quantization operators:
import torch
import torch.nn as nn
import torch.nn.functional as F
class TernaryQuantizer(nn.Module):
"""Quantize weights and activations to ternary values"""
def __init__(self):
super().__init__()
@staticmethod
def quantize_weights_ternary(weight, scale=None):
"""
Quantize weights to {-1, 0, 1}.
Uses absmean scaling: scale = mean(|weight|)
"""
# Compute scale as mean absolute value
if scale is None:
scale = weight.abs().mean()
# Threshold at scale/2
threshold = scale / 2.0
# Quantize: round to nearest ternary value
weight_normalized = weight / scale
weight_ternary = torch.sign(weight_normalized)
weight_ternary[weight_normalized.abs() < 0.5] = 0
return weight_ternary, scale
@staticmethod
def quantize_activations_int8(activation):
"""
Quantize activations to INT8.
Per-token symmetric quantization using absmax scaling.
"""
batch_size = activation.shape[0]
# Compute per-token absmax scale
activation_reshaped = activation.reshape(batch_size, -1)
scales = activation_reshaped.abs().max(dim=1)[0] # (batch_size,)
# Clip to [-127, 127] range (INT8)
activation_normalized = (activation / scales.unsqueeze(-1)).clamp(-1, 1)
activation_int8 = (activation_normalized * 127).round().to(torch.int8)
return activation_int8, scales
@staticmethod
def dequantize(quantized, scale):
"""Dequantize back to float"""
if isinstance(quantized, torch.Tensor) and quantized.dtype == torch.int8:
return quantized.float() / 127.0 * scale.unsqueeze(-1)
else:
return quantized * scale
Compress vision encoder using Quantize-then-Distill:
class QuantizedVisionEncoder(nn.Module):
"""
SigLIP vision encoder compressed to 1.58-bit weights.
Uses knowledge distillation from full-precision teacher.
"""
def __init__(self, teacher_encoder, target_bits=1.58):
super().__init__()
self.teacher = teacher_encoder
self.target_bits = target_bits
self.quantizer = TernaryQuantizer()
# Create student encoder with same architecture
self.student = self.create_student_encoder()
def create_student_encoder(self):
"""Create ternary student encoder"""
# Copy architecture from teacher
config = self.teacher.config
student = self.teacher.__class__(config)
# Quantize initial weights
for name, param in student.named_parameters():
if 'weight' in name and len(param.shape) > 1:
ternary_weight, _ = self.quantizer.quantize_weights_ternary(param)
param.data = ternary_weight
return student
def forward(self, images):
"""Forward pass with quantized vision encoder"""
features = self.student.encoder(images) # (batch_size, num_patches, hidden_dim)
return features
def distillation_loss(self, images, temperature=4.0):
"""
KL divergence loss between student and teacher.
Encourages student representations to match teacher while learning ternary weights.
"""
with torch.no_grad():
teacher_features = self.teacher.encoder(images)
teacher_hidden = self.teacher.encoder.pool(teacher_features)
student_features = self.student.encoder(images)
student_hidden = self.student.encoder.pool(student_features)
# Normalize representations
teacher_norm = F.normalize(teacher_hidden, dim=-1)
student_norm = F.normalize(student_hidden, dim=-1)
# MSE loss between representations
mse_loss = F.mse_loss(student_norm, teacher_norm)
return mse_loss
def update_ternary_weights(self, learning_rate=1e-4):
"""
Update ternary weights using straight-through estimator.
Gradients flow through quantization for learning.
"""
for name, param in self.student.named_parameters():
if 'weight' in name and len(param.shape) > 1:
# Compute gradient
if param.grad is not None:
# Straight-through estimator: gradient flows unchanged through quantization
param.grad.data = param.grad.data / (param.abs().mean() + 1e-8)
# Update with quantization constraint
with torch.no_grad():
ternary_weight, scale = self.quantizer.quantize_weights_ternary(param)
param.data = ternary_weight
Integrate quantized vision and language components:
class BitVLA(nn.Module):
"""Fully ternary vision-language-action model for robotics"""
def __init__(self, llm_checkpoint="OpenBitNet/bitnet-b1.58-2B",
vision_checkpoint="google/siglip-base-patch16-512"):
super().__init__()
# Load BitNet backbone (pre-quantized 1-bit LLM)
from transformers import AutoModelForCausalLM, AutoTokenizer
self.llm = AutoModelForCausalLM.from_pretrained(llm_checkpoint)
self.tokenizer = AutoTokenizer.from_pretrained(llm_checkpoint)
self.llm_dim = self.llm.config.hidden_size
# Quantized vision encoder
teacher_vision = self.load_vision_encoder(vision_checkpoint)
self.vision_encoder = QuantizedVisionEncoder(teacher_vision, target_bits=1.58)
self.vision_dim = 768
# Lightweight connector (full precision OK, small size)
self.connector = nn.Linear(self.vision_dim, self.llm_dim)
# Action prediction head (ternary)
self.action_head = TernaryActionHead(self.llm_dim)
def load_vision_encoder(self, checkpoint):
"""Load SigLIP vision encoder"""
from transformers import AutoModel
return AutoModel.from_pretrained(checkpoint)
def forward(self, images, action_prompt="What action?"):
"""
Process images and generate robot actions.
Args:
images: (batch_size, 3, 512, 512)
action_prompt: instruction text for the robot
Returns:
actions: (batch_size, action_dim) - robot action tokens
"""
# Vision encoding
vision_features = self.vision_encoder(images) # (B, P, D_v)
vision_features = vision_features.mean(dim=1) # Pool to (B, D_v)
# Connect to LLM space
aligned_features = self.connector(vision_features) # (B, D_llm)
# Tokenize prompt
prompt_ids = self.tokenizer.encode(action_prompt, return_tensors='pt')
prompt_embeddings = self.llm.get_input_embeddings()(prompt_ids)
# Concatenate: [prompt_tokens, vision_features]
combined_embeddings = torch.cat([
prompt_embeddings,
aligned_features.unsqueeze(1)
], dim=1)
# Forward through LLM with causal attention
# (Important: maintain causal masking for action generation)
outputs = self.llm(
inputs_embeds=combined_embeddings,
attention_mask=torch.ones(combined_embeddings.shape[:2]),
use_cache=False
)
# Action prediction from last hidden state
logits = self.action_head(outputs.hidden_states[-1][:, -1, :])
actions = torch.argmax(logits, dim=-1)
return actions
class TernaryActionHead(nn.Module):
"""Ternary action prediction head"""
def __init__(self, hidden_dim, num_actions=8):
super().__init__()
self.hidden_dim = hidden_dim
self.num_actions = num_actions
# Linear layer for action logits
self.action_projection = nn.Linear(hidden_dim, num_actions)
def forward(self, hidden_state):
"""Generate action logits"""
logits = self.action_projection(hidden_state)
return logits
Create the complete training procedure:
class BitVLATrainer:
"""Three-stage training: multimodal alignment -> quantization -> RL fine-tuning"""
def __init__(self, model, device='cuda'):
self.model = model.to(device)
self.device = device
def stage_1_multimodal_alignment(self, image_text_dataset, epochs=5):
"""
Stage 1: Align vision-language features (vision encoder frozen).
Training objective: make vision features compatible with LLM space.
"""
print("Stage 1: Multimodal alignment...")
optimizer = torch.optim.AdamW(
[p for n, p in self.model.named_parameters() if 'connector' in n or 'action_head' in n],
lr=1e-4
)
for epoch in range(epochs):
total_loss = 0
for batch in image_text_dataset:
images = batch['images'].to(self.device)
text_ids = batch['text_ids'].to(self.device)
optimizer.zero_grad()
# Forward pass
vision_features = self.model.vision_encoder(images)
vision_features = vision_features.mean(dim=1)
aligned = self.model.connector(vision_features)
# Text features
text_embeddings = self.model.llm.get_input_embeddings()(text_ids)
text_features = self.model.llm(input_ids=text_ids).hidden_states[-1][:, 0, :]
# Contrastive loss (align representations)
loss = self.contrastive_loss(aligned, text_features)
loss.backward()
torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
optimizer.step()
total_loss += loss.item()
print(f"Epoch {epoch+1}: Loss = {total_loss / len(image_text_dataset):.4f}")
def stage_2_quantize_then_distill(self, image_dataset, epochs=5):
"""
Stage 2: Compress vision encoder via knowledge distillation.
Learns ternary weights while maintaining LLaVA-style alignment.
"""
print("Stage 2: Quantize-then-Distill...")
optimizer = torch.optim.AdamW(
self.model.vision_encoder.student.parameters(),
lr=5e-5
)
for epoch in range(epochs):
total_loss = 0
for batch in image_dataset:
images = batch['images'].to(self.device)
optimizer.zero_grad()
# Distillation loss (KL divergence)
loss = self.model.vision_encoder.distillation_loss(images)
loss.backward()
torch.nn.utils.clip_grad_norm_(self.model.vision_encoder.parameters(), 1.0)
# Update ternary weights
self.model.vision_encoder.update_ternary_weights()
optimizer.step()
total_loss += loss.item()
print(f"Epoch {epoch+1}: Distillation Loss = {total_loss / len(image_dataset):.4f}")
def stage_3_rl_finetuning(self, robot_trajectory_dataset, epochs=10):
"""
Stage 3: Fine-tune on robot manipulation tasks via RL.
Objective: maximize success on robotic manipulation tasks.
"""
print("Stage 3: RL Fine-tuning on robot tasks...")
# Optimizer for action head only (vision/language frozen)
optimizer = torch.optim.AdamW(
self.model.action_head.parameters(),
lr=1e-5
)
for epoch in range(epochs):
total_reward = 0
for batch in robot_trajectory_dataset:
images = batch['images'].to(self.device)
actions = batch['actions'].to(self.device)
rewards = batch['rewards'].to(self.device)
optimizer.zero_grad()
# Forward pass
predicted_actions = self.model(images)
# Action prediction loss
action_loss = F.cross_entropy(predicted_actions, actions)
# Reward-weighted loss
loss = (action_loss * rewards).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
optimizer.step()
total_reward += rewards.mean().item()
print(f"Epoch {epoch+1}: Avg Reward = {total_reward / len(robot_trajectory_dataset):.4f}")
def contrastive_loss(self, v1, v2, temperature=0.07):
"""SimCLR-style contrastive loss"""
# Normalize
v1 = F.normalize(v1, dim=-1)
v2 = F.normalize(v2, dim=-1)
# Similarity
sim = torch.mm(v1, v2.t()) / temperature
# Cross-entropy loss
labels = torch.arange(v1.shape[0]).to(self.device)
loss = F.cross_entropy(sim, labels)
return loss
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.