skills/skillxiv-v0.0.2-claude-opus-4.6/dino-r1-vision-reasoning/SKILL.md
Apply reasoning-focused RL to vision foundation models using GRPO to develop deep visual understanding and abstract reasoning beyond visual recognition.
npx skillsauth add ADu2021/skillXiv dino-r1-vision-reasoningInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Vision foundation models like DINO excel at representation learning but lack reasoning capabilities. DINO-R1 extends reasoning-focused RL techniques (like GRPO, which powers DeepSeek-R1) to vision models, enabling them to develop step-by-step reasoning about visual scenes, not just extract features. This unlocks new applications: visual reasoning, scene understanding, abstract spatial reasoning.
The key contribution is adapting RL training methodologies from language models to vision, demonstrating that verifiable rewards can guide vision models toward deeper reasoning just as they do for LLMs. This opens the frontier of reasoning-capable vision models.
DINO-R1 applies Group Relative Policy Optimization (GRPO) to vision foundation models:
Unlike visual classification (simple prediction), DINO-R1 develops genuine reasoning: decomposing problems, explaining evidence, drawing conclusions.
Build a vision-language reasoning model with GRPO training:
# DINO-R1: Reasoning capability for vision foundation models
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
class DINOReasoner(nn.Module):
"""
DINO vision model augmented with reasoning capability.
"""
def __init__(self, dino_model_name="facebook/dino-vitb14", vocab_size=50257):
super().__init__()
# Load DINO vision encoder
self.vision_encoder = AutoModel.from_pretrained(dino_model_name)
self.vision_dim = self.vision_encoder.config.hidden_size
# Reasoning decoder: generates thinking steps
decoder_config = {
'hidden_size': self.vision_dim,
'num_hidden_layers': 6,
'num_attention_heads': 8,
'intermediate_size': 2048,
'vocab_size': vocab_size
}
from transformers import GPT2Config, GPT2LMHeadModel
config = GPT2Config(**decoder_config)
self.reasoning_decoder = GPT2LMHeadModel(config)
# Reward model: predict quality of reasoning
self.reward_model = nn.Sequential(
nn.Linear(self.vision_dim, 512),
nn.ReLU(),
nn.Linear(512, 256),
nn.ReLU(),
nn.Linear(256, 1) # Scalar reward
)
self.tokenizer = AutoTokenizer.from_pretrained("gpt2")
def encode_image(self, images):
"""
Extract vision features from images.
images: (batch, 3, height, width)
"""
# DINO forward pass
outputs = self.vision_encoder(images)
# Use CLS token as image representation
image_features = outputs.last_hidden_state[:, 0] # (batch, vision_dim)
return image_features
def generate_reasoning(self, image_features, task_prompt="", max_length=256):
"""
Generate reasoning chain given image features.
"""
batch_size = image_features.shape[0]
device = image_features.device
# Project vision features to decoder embedding space
encoder_hidden = image_features.unsqueeze(1) # (batch, 1, vision_dim)
# Initialize with task prompt tokens
if task_prompt:
prompt_tokens = self.tokenizer.encode(task_prompt)
prompt_ids = torch.tensor(prompt_tokens).to(device).unsqueeze(0)
prompt_ids = prompt_ids.expand(batch_size, -1)
else:
prompt_ids = None
# Autoregressive generation with vision conditioning
input_ids = prompt_ids if prompt_ids is not None else \
torch.full((batch_size, 1), self.tokenizer.bos_token_id, device=device)
all_logits = []
for step in range(max_length):
# Decoder forward pass with vision context
outputs = self.reasoning_decoder(
input_ids=input_ids,
past_key_values=None # Could cache for efficiency
)
logits = outputs.logits[:, -1, :] # Last token logits
# Sample next token
probs = F.softmax(logits, dim=-1)
next_tokens = torch.multinomial(probs, num_samples=1)
all_logits.append(logits.unsqueeze(1))
input_ids = torch.cat([input_ids, next_tokens], dim=1)
# Check for EOS
if (next_tokens == self.tokenizer.eos_token_id).all():
break
reasoning_ids = input_ids[:, (prompt_ids.shape[1] if prompt_ids is not None else 1):]
reasoning_text = self.tokenizer.batch_decode(reasoning_ids, skip_special_tokens=True)
return {
'reasoning_text': reasoning_text,
'generated_ids': reasoning_ids,
'logits': torch.cat(all_logits, dim=1)
}
def compute_reward(self, image_features, reasoning_ids, ground_truth):
"""
Compute reward for reasoning quality.
In practice: verify against ground truth.
"""
# Reward model predicts quality of reasoning
base_reward = self.reward_model(image_features)
# Verification reward: check if reasoning leads to correct answer
# Simplified: use task accuracy
verification_reward = torch.tensor(
[1.0 if check_correctness(r_text, gt) else 0.0
for r_text, gt in zip(reasoning_ids, ground_truth)]
).to(image_features.device)
# Combine rewards
final_reward = 0.5 * base_reward.squeeze() + 0.5 * verification_reward
return final_reward
def check_correctness(reasoning_text, ground_truth):
"""
Simple correctness check (in practice, would be task-specific).
"""
return reasoning_text.strip().lower() == ground_truth.strip().lower()
Implement GRPO (Group Relative Policy Optimization) training for vision models:
def train_dino_with_grpo(model, train_dataloader, num_epochs=5, grpo_beta=1.0):
"""
Train DINO-R1 using GRPO: Group Relative Policy Optimization.
"""
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
for epoch in range(num_epochs):
for batch_idx, batch in enumerate(train_dataloader):
images = batch['images'] # (batch, 3, H, W)
task_prompts = batch['prompts']
ground_truths = batch['ground_truths']
# Encode images
image_features = model.encode_image(images)
# Generate multiple reasoning trajectories per example
num_trajectories = 4
all_rewards = []
all_log_probs = []
for traj_idx in range(num_trajectories):
# Generate reasoning (stochastic due to sampling)
reasoning_output = model.generate_reasoning(
image_features,
task_prompt=task_prompts[0], # Simplified: same prompt
max_length=256
)
# Compute rewards for this trajectory
trajectory_rewards = model.compute_reward(
image_features,
reasoning_output['generated_ids'],
ground_truths
)
all_rewards.append(trajectory_rewards)
# Compute log probabilities of generated tokens
logits = reasoning_output['logits'] # (batch, seq_len, vocab_size)
log_probs = F.log_softmax(logits, dim=-1)
# Sum log probs for generated sequence
token_log_probs = torch.gather(log_probs, -1,
reasoning_output['generated_ids'].unsqueeze(-1))
seq_log_prob = token_log_probs.sum(dim=1)
all_log_probs.append(seq_log_prob)
# Stack trajectories: (num_trajectories, batch)
rewards_tensor = torch.stack(all_rewards, dim=0)
log_probs_tensor = torch.stack(all_log_probs, dim=0)
# GRPO: compute relative rewards within groups
# Group = set of trajectories for same example
mean_reward = rewards_tensor.mean(dim=0, keepdim=True)
relative_rewards = rewards_tensor - mean_reward
# Policy loss: maximize relative rewards
policy_loss = -(log_probs_tensor * relative_rewards).mean()
# KL penalty to prevent distribution shift (optional)
kl_penalty = 0 # Set if using reference policy
total_loss = policy_loss + grpo_beta * kl_penalty
optimizer.zero_grad()
total_loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
if batch_idx % 10 == 0:
mean_rewards = rewards_tensor.mean().item()
print(f"Epoch {epoch}, Step {batch_idx}: Loss={total_loss.item():.4f}, "
f"Avg Reward={mean_rewards:.3f}")
# Validation
validate_reasoning_capability(model, val_dataloader)
Implement evaluation of reasoning capability:
def evaluate_reasoning_capability(model, test_data):
"""
Measure if vision model has developed reasoning capability.
Metrics: task accuracy, reasoning quality, answer correctness.
"""
model.eval()
correct = 0
total = 0
reasoning_qualities = []
with torch.no_grad():
for batch in test_data:
images = batch['images']
ground_truths = batch['ground_truths']
# Generate reasoning
image_features = model.encode_image(images)
reasoning_output = model.generate_reasoning(
image_features,
task_prompt=batch.get('prompt', '')
)
# Evaluate correctness
for i, reasoning_text in enumerate(reasoning_output['reasoning_text']):
is_correct = check_correctness(reasoning_text, ground_truths[i])
correct += is_correct
total += 1
# Measure reasoning quality (length + specificity)
reasoning_quality = measure_reasoning_quality(reasoning_text)
reasoning_qualities.append(reasoning_quality)
accuracy = correct / total if total > 0 else 0
avg_reasoning_quality = sum(reasoning_qualities) / len(reasoning_qualities)
print(f"Task Accuracy: {accuracy:.2%}")
print(f"Avg Reasoning Quality: {avg_reasoning_quality:.3f}")
return {'accuracy': accuracy, 'reasoning_quality': avg_reasoning_quality}
def measure_reasoning_quality(text):
"""Score reasoning quality (proxy metric)"""
# Count reasoning indicators
indicators = ['because', 'therefore', 'so', 'thus', 'this means']
count = sum(1 for ind in indicators if ind in text.lower())
# Normalize by length
return min(count / max(1, len(text.split()) / 10), 1.0)
| Parameter | Typical Range | Notes | |-----------|---------------|-------| | GRPO beta | 0.5 - 2.0 | Strength of group-relative rewards | | Num trajectories per example | 2 - 8 | More = better gradient estimates, more compute | | Learning rate | 1e-5 to 5e-5 | Vision fine-tuning uses smaller LR | | Reasoning max length | 128 - 512 tokens | Longer = more detailed but costly | | Task diversity | ≥3 domains | Prevents overfitting to single reasoning pattern |
When to use DINO-R1 approach:
When NOT to use:
Common pitfalls:
DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models https://arxiv.org/abs/2505.24025
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.