skills/skillxiv-v0.0.2-claude-opus-4.6/cov-chain-of-view-spatial-reasoning/SKILL.md
Enable vision-language models to perform embodied question answering in 3D environments through active camera exploration. CoV uses training-free test-time reasoning to iteratively select relevant viewpoints and adjust camera angles until sufficient context is gathered, achieving 11-13% accuracy improvements across spatial reasoning benchmarks.
npx skillsauth add ADu2021/skillXiv cov-chain-of-view-spatial-reasoning
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Vision-language models excel at understanding single images or text, but struggle with embodied question answering in 3D environments. The core constraint: models are limited to finite input views, preventing them from exploring the scene to gather spatially distributed information. Traditional approaches either use fixed viewpoint sets or require expensive retraining to adapt viewing strategies. This creates a fundamental capability gap for spatial reasoning tasks requiring dynamic perspective selection.
Use test-time prompting to enable iterative camera control and view selection without model modification or retraining.
import numpy as np


class ChainOfViewReasoner:
    def __init__(self, vlm_model, scene_3d):
        self.vlm = vlm_model
        self.scene = scene_3d
        self.selected_views = []
        # Identity is a placeholder (assumption); in practice, start from an anchor view's pose
        self.camera_pose = np.eye(4)

    def run_chain_of_view(self, question, max_iterations=5):
        """Iteratively select views and adjust the camera until the question can be answered."""
        # Stage 1: Coarse-grained view selection
        anchor_views = self.select_anchor_views(question, num_anchors=4)
        self.selected_views.extend(anchor_views)

        # Stage 2: Fine-grained camera adjustment
        for _ in range(max_iterations):
            # Get current visual context from selected views
            visual_context = self.render_selected_views(self.selected_views)

            # Prompt VLM to reason about current observations + question
            reasoning = self.vlm.generate(
                visual_context=visual_context,
                question=question,
                prompt_template="""
                Given these views of a 3D scene and the question: {question}
                Current observations:
                [IMAGES]
                What spatial relationships do you observe?
                What camera action would help answer the question?
                Options: forward, backward, left, right, up, down, rotate_yaw, rotate_pitch, switch_view
                """
            )

            # Extract action from VLM output ("none" means no action was chosen)
            action = self.parse_camera_action(reasoning)

            # Execute camera transformation
            if action != "none":
                self.apply_camera_action(action)
                new_view = self.render_current_view()
                self.selected_views.append(new_view)

            # Check if sufficient information is gathered
            if self.should_terminate(reasoning):
                break

        # Re-render so that views added in the last iteration are included
        visual_context = self.render_selected_views(self.selected_views)

        # Final answer generation with all gathered context
        final_answer = self.vlm.generate(
            visual_context=visual_context,
            question=question,
            prompt_template="Using all observed views, answer: {question}"
        )
        return final_answer

    def select_anchor_views(self, question, num_anchors):
        """Coarse-grained selection: identify question-aligned anchor views"""
        # Extract keywords from question
        keywords = extract_keywords(question)

        # Score all available views by relevance to keywords
        view_scores = []
        for view in self.scene.all_views:
            relevance_score = compute_visual_relevance(view, keywords)
            view_scores.append((view, relevance_score))

        # Select top-K views with diversity filtering
        anchor_views = select_diverse_top_k(view_scores, k=num_anchors)
        return anchor_views

    def apply_camera_action(self, action):
        """Update camera pose via SE(3) transformation"""
        # Translation steps along the camera's x, y, z axes
        translation_map = {
            "forward": [0, 0, -0.5],
            "backward": [0, 0, 0.5],
            "left": [-0.5, 0, 0],
            "right": [0.5, 0, 0],
            "up": [0, 0.5, 0],
            "down": [0, -0.5, 0]
        }
        # Rotation steps; the tuple order is whatever se3_rotate expects
        rotation_map = {
            "rotate_yaw": (0, 0, 0.1),
            "rotate_pitch": (0.1, 0, 0),
            "rotate_roll": (0, 0.1, 0)
        }

        # Right-multiplying applies the step in the camera's own frame
        if action in translation_map:
            translation = translation_map[action]
            self.camera_pose = self.camera_pose @ se3_translate(translation)
        elif action in rotation_map:
            rotation = rotation_map[action]
            self.camera_pose = self.camera_pose @ se3_rotate(rotation)
        # "switch_view" and unrecognized actions leave the pose unchanged here
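The `apply_camera_action` method calls `se3_translate` and `se3_rotate`, which the listing does not define. A minimal sketch of what they might look like follows, assuming 4x4 homogeneous pose matrices in NumPy and assuming the rotation tuples are ordered `(pitch, roll, yaw)` in radians, with pitch about x, yaw about y (the "up" axis in the translation map), and roll about z. That ordering is an inference from the values in `rotation_map`, not something the source states, and `_axis_rotation` is a hypothetical helper.

```python
import numpy as np


def se3_translate(translation):
    """Return a 4x4 homogeneous transform that translates by (tx, ty, tz)."""
    T = np.eye(4)
    T[:3, 3] = translation
    return T


def _axis_rotation(axis, angle):
    """3x3 rotation by `angle` radians about coordinate axis 0 (x), 1 (y), or 2 (z)."""
    c, s = np.cos(angle), np.sin(angle)
    # (i, j) is the plane that rotates when turning about `axis`
    i, j = [(1, 2), (2, 0), (0, 1)][axis]
    R = np.eye(3)
    R[i, i], R[i, j] = c, -s
    R[j, i], R[j, j] = s, c
    return R


def se3_rotate(rotation):
    """Return a 4x4 transform for a (pitch, roll, yaw) tuple in radians.

    Assumed convention: pitch about x, yaw about y, roll about z,
    composed as yaw @ pitch @ roll. Each action in the listing sets
    only one angle, so the composition order does not matter there.
    """
    pitch, roll, yaw = rotation
    R = _axis_rotation(1, yaw) @ _axis_rotation(0, pitch) @ _axis_rotation(2, roll)
    T = np.eye(4)
    T[:3, :3] = R
    return T
```

Because the listing right-multiplies (`pose @ step`), each step is expressed in the camera's current frame: after a yaw, "forward" moves along the rotated viewing direction rather than the world z axis.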
Two-Stage Architecture:
Stage 1: Coarse View Selection
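Stage 1 in the listing relies on `select_diverse_top_k`, which is not defined there. One plausible reading is a greedy filter: take views in descending relevance order and skip any view whose camera sits too close to one already chosen. The sketch below assumes each view exposes a 3D camera position; the `min_distance` threshold and the `position_of` accessor are hypothetical parameters introduced for illustration, not part of the source.

```python
import math


def select_diverse_top_k(view_scores, k, min_distance=1.0,
                         position_of=lambda view: view["position"]):
    """Greedily pick up to k high-scoring views that are spatially spread out.

    view_scores: list of (view, relevance_score) pairs.
    min_distance: assumed minimum camera-to-camera distance between picks.
    position_of: assumed accessor returning a view's (x, y, z) position.
    """
    # Highest relevance first
    ranked = sorted(view_scores, key=lambda pair: pair[1], reverse=True)
    selected = []
    for view, _score in ranked:
        if len(selected) >= k:
            break
        pos = position_of(view)
        # Keep the view only if it is far enough from every earlier pick
        if all(math.dist(pos, position_of(s)) >= min_distance for s in selected):
            selected.append(view)
    return selected
```

With the defaults, the call in the listing (`select_diverse_top_k(view_scores, k=num_anchors)`) works unchanged, provided views are dict-like with a `"position"` entry. The function may return fewer than `k` views when the scene has few well-separated candidates.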
Stage 2: Fine-Grained Camera Adjustment
Input Representation:
Action Space (Discrete):
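The action names below are taken from the `Options:` line of the Stage 2 prompt in the listing. How the VLM output is mapped back to one of them is not shown in the source, so the following is only a sketch of one simple approach: scan the free-form response for known action names and take the last one mentioned, falling back to `"none"`. It is written as a standalone function rather than the `parse_camera_action` method the listing calls, and it assumes `reasoning` is a plain string.

```python
import re

# Action vocabulary from the Stage 2 prompt's "Options:" line
CAMERA_ACTIONS = (
    "forward", "backward", "left", "right", "up", "down",
    "rotate_yaw", "rotate_pitch", "switch_view",
)

# Whole-word, case-insensitive match on any action name
_ACTION_PATTERN = re.compile(
    r"\b(" + "|".join(CAMERA_ACTIONS) + r")\b", re.IGNORECASE
)


def parse_camera_action(reasoning):
    """Return the last action name mentioned in the text, or "none"."""
    matches = _ACTION_PATTERN.findall(reasoning)
    return matches[-1].lower() if matches else "none"
```

Taking the last mention is a heuristic: words such as "left" or "right" also appear in ordinary spatial descriptions, so a stricter output format (for example, a dedicated final `Action:` line) would be more reliable if the prompt can be changed.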
Termination Criteria:
Accuracy Improvements:
Benchmark Coverage:
Test-Time Scaling:
Tested on:
All model-specific implementations use the same prompting strategy, with no model-specific tuning.