skills/skillxiv-v0.0.2-claude-opus-4.6/dream2flow-robotic-manipulation/SKILL.md
Convert video generation model outputs into executable robotic manipulation by extracting 3D object flow trajectories as an intermediate representation. Enables zero-shot manipulation of diverse object types (rigid, articulated, deformable, granular) without task-specific training. Use when pre-trained video models capture plausible manipulation patterns but need grounding in low-level robot control.
npx skillsauth add ADu2021/skillXiv dream2flow-robotic-manipulation
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Video generation models excel at predicting plausible human-like manipulation. But they generate sequences of images, not robot commands:
Video Model Output:
Frame 0: Hand above object
Frame 1: Hand grasps object
Frame 2: Hand moves object right
Frame 3: Object at goal location
Needed for Robot:
Joint angles α, β, γ → Joint velocity dα/dt, dβ/dt, dγ/dt
The gap: How do we get from visual frames to motor commands?
Dream2Flow bridges this gap using 3D object flow as an intermediate representation.
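As a rough illustration of what this intermediate representation holds, the sketch below stores 3D object flow as tracked points per frame and reduces them to one reference position per frame. The `ObjectFlow` name, its layout, and the coordinate-wise median are assumptions made for illustration, not the Dream2Flow implementation.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class ObjectFlow:
    """Hypothetical container: frames[t][n] is the 3D position of tracked point n at frame t."""
    frames: List[List[Point3D]]

    def reference_trajectory(self) -> List[Point3D]:
        """One position per frame: the coordinate-wise median of the tracked points."""
        return [
            (
                median(p[0] for p in points),
                median(p[1] for p in points),
                median(p[2] for p in points),
            )
            for points in self.frames
        ]

    def displacements(self) -> List[Point3D]:
        """Frame-to-frame motion of the reference position."""
        traj = self.reference_trajectory()
        return [
            (b[0] - a[0], b[1] - a[1], b[2] - a[2])
            for a, b in zip(traj, traj[1:])
        ]


# Three tracked points on an object that shifts 0.1 along x between two frames.
flow = ObjectFlow(frames=[
    [(0.0, 0.0, 1.0), (0.2, 0.0, 1.0), (0.1, 0.1, 1.0)],
    [(0.1, 0.0, 1.0), (0.3, 0.0, 1.0), (0.2, 0.1, 1.0)],
])
print(flow.reference_trajectory())
print(flow.displacements())
```

A median is used here rather than a mean so that a few badly tracked points do not drag the reference position; that choice is a design assumption of this sketch.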
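Instead of trying to reverse-engineer gripper positions and object poses from video frames, Dream2Flow:
1. Generates a video of the manipulation from the current image to the goal image.
2. Extracts 3D object flow trajectories from the generated frames.
3. Aggregates the flow into a reference trajectory for the object.
4. Hands that trajectory to a controller (trajectory optimization or RL) for execution.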
This intermediate representation is:
# Dream2Flow pipeline for robotic manipulation
class Dream2FlowManipulator:
    def __init__(self, video_model, flow_extractor, controller, robot, camera_K):
        self.video_model = video_model        # Pre-trained video generation
        self.flow_extractor = flow_extractor  # Extract 3D flow from frames
        self.controller = controller          # Execute trajectories (RL or optimization)
        self.robot = robot                    # Assumed interface: get_state(), execute(action)
        self.camera_K = camera_K              # Camera intrinsics used to lift 2D flow to 3D

    def plan_manipulation(self, current_image, goal_image, object_mask):
        """Plan manipulation from visual goal without task-specific training"""
        # Step 1: Generate video of manipulation from current to goal
        video_sequence = self.video_model.generate(
            start_frame=current_image,
            goal_frame=goal_image,
            num_frames=32  # Typical video length
        )

        # Step 2: Extract 3D object motion trajectories
        # Track points inside the object mask across frames
        flow_trajectory = self.flow_extractor.extract_3d_flow(
            video_sequence,
            object_mask=object_mask,
            num_trajectories=200  # Dense flow: many points tracked
        )

        # Step 3: Aggregate into reference trajectory for object center/pose
        object_trajectory = self.flow_extractor.aggregate_trajectory(flow_trajectory)
        # object_trajectory shape: (T, 3) where T=32, 3=x,y,z position

        # Step 4: Turn the trajectory into robot actions
        # Option A: Trajectory optimization (returns the planned actions)
        if self.controller.mode == 'optimization':
            return self.controller.optimize_trajectory(
                current_state=self.robot.get_state(),
                target_trajectory=object_trajectory,
                constraints=['collision_free', 'gripper_constraints']
            )

        # Option B: Reinforcement learning (pre-trained policy), executed step by step
        if self.controller.mode == 'rl':
            state = self.robot.get_state()
            executed_actions = []
            for target_position in object_trajectory:
                action = self.controller.get_action(
                    current_state=state,
                    target_position=target_position
                )
                state = self.robot.execute(action)  # Assumed to return the new state
                executed_actions.append(action)
            return executed_actions

        raise ValueError(f"Unknown controller mode: {self.controller.mode!r}")

    def extract_and_filter_flow(self, video_frames, object_mask):
        """Robust 3D flow extraction for various object types"""
        # The helper calls below are assumed to be provided by the flow extractor.
        # object_mask is not applied in this sketch; restrict the flow to the
        # object downstream.
        flow_3d = []
        for t in range(len(video_frames) - 1):
            # Compute optical flow between consecutive frames
            frame_t = video_frames[t]
            frame_t1 = video_frames[t + 1]
            optical_flow_2d = self.flow_extractor.compute_optical_flow(frame_t, frame_t1)

            # Lift to 3D using depth (monocular or from reconstruction)
            depth = self.flow_extractor.estimate_depth(frame_t)
            flow_3d_frame = self.flow_extractor.lift_to_3d(
                optical_flow_2d,
                depth,
                camera_intrinsics=self.camera_K
            )
            flow_3d.append(flow_3d_frame)
        return flow_3d
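The 2D-to-3D lifting step can be sketched with a standard pinhole camera model: back-project a pixel at frame t and its flow-displaced location at frame t+1 using their depths, then subtract. The function names and the `(fx, fy, cx, cy)` intrinsics tuple are illustrative assumptions, and unlike the method above this sketch assumes depth is available at both frames. It is a simplified single-pixel illustration, not the extraction method used by Dream2Flow.

```python
from typing import Tuple

Intrinsics = Tuple[float, float, float, float]  # (fx, fy, cx, cy), assumed pinhole model
Point3D = Tuple[float, float, float]


def backproject(u: float, v: float, depth: float, K: Intrinsics) -> Point3D:
    """Pixel (u, v) with metric depth -> 3D point in the camera frame."""
    fx, fy, cx, cy = K
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def lift_flow_to_3d(u: float, v: float, flow_uv: Tuple[float, float],
                    depth_t: float, depth_t1: float, K: Intrinsics) -> Point3D:
    """3D displacement of one pixel given its 2D flow and the depth at both frames."""
    du, dv = flow_uv
    p0 = backproject(u, v, depth_t, K)
    p1 = backproject(u + du, v + dv, depth_t1, K)
    return (p1[0] - p0[0], p1[1] - p0[1], p1[2] - p0[2])


K = (500.0, 500.0, 320.0, 240.0)
# A pixel at the image centre moves 50 px to the right while staying 1 m from the camera.
motion = lift_flow_to_3d(320.0, 240.0, (50.0, 0.0), 1.0, 1.0, K)
print(motion)
```

The same pixel motion maps to a larger metric displacement at greater depth, which is why depth errors feed directly into the 3D flow.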
A key advantage: the same approach works across different object types without retraining:
| Object Type | How It Works |
|---|---|
| Rigid (mug, box) | Flow defines how center and orientation change |
| Articulated (scissors, drawer) | Flow captures relative motion of linked parts |
| Deformable (cloth, rope) | Dense flow tracks surface deformation |
| Granular (rice, sand) | Flow approximates bulk motion dynamics |
No task-specific training is needed because the pre-trained video model already captures plausible motion for these object types.
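For the rigid row, "how center and orientation change" can be made concrete with a least-squares rigid fit between the tracked points at two frames. The sketch below is restricted to planar (2D) motion so it stays dependency-free; a full 3D version would typically use an SVD-based fit. `fit_planar_rigid_motion` and `rotate_about` are hypothetical helpers for illustration, not part of Dream2Flow.

```python
import math
from typing import List, Tuple

Point2D = Tuple[float, float]


def fit_planar_rigid_motion(src: List[Point2D], dst: List[Point2D]) -> Tuple[float, Point2D]:
    """Least-squares rotation angle (radians) and centroid translation mapping src onto dst."""
    n = len(src)
    src_c = (sum(p[0] for p in src) / n, sum(p[1] for p in src) / n)
    dst_c = (sum(p[0] for p in dst) / n, sum(p[1] for p in dst) / n)
    dot = cross = 0.0
    for (sx, sy), (dx, dy) in zip(src, dst):
        ax, ay = sx - src_c[0], sy - src_c[1]  # source point relative to its centroid
        bx, by = dx - dst_c[0], dy - dst_c[1]  # target point relative to its centroid
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    angle = math.atan2(cross, dot)
    return angle, (dst_c[0] - src_c[0], dst_c[1] - src_c[1])


def rotate_about(p: Point2D, centre: Point2D, theta: float) -> Point2D:
    """Rotate p around centre by theta radians (used only to build example data)."""
    x, y = p[0] - centre[0], p[1] - centre[1]
    c, s = math.cos(theta), math.sin(theta)
    return (centre[0] + c * x - s * y, centre[1] + s * x + c * y)


# Corners of a box, rotated 30 degrees about its centre and then shifted.
corners = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
moved = [
    (q[0] + 0.5, q[1] + 0.2)
    for q in (rotate_about(p, (0.5, 0.5), math.radians(30)) for p in corners)
]
angle, shift = fit_planar_rigid_motion(corners, moved)
print(round(math.degrees(angle), 3), [round(s, 3) for s in shift])
```

A single rotation and translation is only meaningful for rigid objects; for the articulated, deformable, and granular rows the per-point flow has to be kept or summarized differently.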
Strategy A: Trajectory Optimization
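import numpy as np
from scipy.optimize import minimize

def trajectory_optimize(start_state, target_trajectory, init_actions,
                        simulate_forward, get_object_position, collision_penalty):
    """Optimize actions so the simulated object follows target_trajectory.
    The three callables are simulator-specific and supplied by the caller."""
    init_actions = np.asarray(init_actions, dtype=float)  # shape (T, action_dim)
    def loss(flat_actions):
        actions = flat_actions.reshape(init_actions.shape)
        state, total_loss = start_state, 0.0
        for t, target_pos in enumerate(target_trajectory):
            state = simulate_forward(state, actions[t])
            actual_object_pos = np.asarray(get_object_position(state))
            # Mean squared tracking error at this step
            total_loss += np.mean((actual_object_pos - np.asarray(target_pos)) ** 2)
        return total_loss + collision_penalty(state)
    result = minimize(loss, init_actions.ravel())  # SciPy optimizes a flat vector
    return result.x.reshape(init_actions.shape)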
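To see the tracking loss at work without a simulator, the toy below replaces the physics with an assumed one-axis model in which each action simply shifts the object, and minimizes the loss with plain gradient descent on finite-difference gradients. The dynamics, step size, and iteration count are illustrative assumptions; a real setup would use a physics simulator and a stronger optimizer.

```python
from typing import List


def rollout(start: float, actions: List[float]) -> List[float]:
    """Toy dynamics (assumed): each action shifts the object along one axis."""
    positions, x = [], start
    for a in actions:
        x += a
        positions.append(x)
    return positions


def tracking_loss(start: float, actions: List[float], targets: List[float]) -> float:
    """Sum of squared errors between simulated and target object positions."""
    return sum((x - g) ** 2 for x, g in zip(rollout(start, actions), targets))


def optimize_actions(start: float, targets: List[float],
                     steps: int = 500, lr: float = 0.05, eps: float = 1e-6) -> List[float]:
    """Gradient descent using central finite-difference gradients of the loss."""
    actions = [0.0] * len(targets)
    for _ in range(steps):
        grad = []
        for i in range(len(actions)):
            up = actions[:i] + [actions[i] + eps] + actions[i + 1:]
            down = actions[:i] + [actions[i] - eps] + actions[i + 1:]
            grad.append(
                (tracking_loss(start, up, targets) - tracking_loss(start, down, targets))
                / (2 * eps)
            )
        actions = [a - lr * g for a, g in zip(actions, grad)]
    return actions


targets = [0.1, 0.2, 0.3, 0.4]
actions = optimize_actions(start=0.0, targets=targets)
print([round(x, 3) for x in rollout(0.0, actions)])
```

Finite differences treat the rollout as a black box, which mirrors how a non-differentiable simulator would have to be handled, at the cost of two rollouts per action dimension per step.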
Strategy B: Reinforcement Learning
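As a hedged sketch of how an RL policy could be tied to the extracted object trajectory, a tracking reward can penalize the distance between the object and the current waypoint. The reward shape, the `tolerance` bonus, and all names below are assumptions for illustration, not Dream2Flow's reward.

```python
import math
from typing import Sequence


def tracking_reward(object_pos: Sequence[float],
                    target_pos: Sequence[float],
                    tolerance: float = 0.02) -> float:
    """Negative Euclidean distance to the waypoint, plus a bonus inside `tolerance`.

    Units are assumed to be metres; `tolerance` is a hypothetical tuning knob.
    """
    distance = math.dist(object_pos, target_pos)
    bonus = 1.0 if distance <= tolerance else 0.0
    return bonus - distance


print(tracking_reward((0.0, 0.0, 0.0), (0.0, 0.0, 0.0)))
print(tracking_reward((0.0, 0.0, 0.0), (0.3, 0.4, 0.0)))
```

At each control step the policy would be rewarded against the waypoint for that step, so the same pre-trained policy can follow different object trajectories without retraining.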
| Approach | Pros | Cons |
|---|---|---|
| Pure zero-shot (no training) | Fast to deploy, generalizes widely | Flow extraction errors compound |
| Flow-level fine-tuning | Adapt to specific robot | Still avoids task-specific training |
| Controller fine-tuning | Better execution quality | Requires target task data |
Dream2Flow relies on accurate flow extraction. It struggles when:
For these cases: