skills/skillxiv-v0.0.2-claude-opus-4.6/first-frame-video-customization/SKILL.md
Customize video generation through first-frame reuse: treat the initial frame as a visual entity buffer that stores objects for reuse during generation. Requires only 20-50 examples and minimal architectural change.
npx skillsauth add ADu2021/skillXiv first-frame-video-customization
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Video generation models are usually assumed to treat the first frame as a conditioning seed that establishes spatial layout and visual context. This paper reveals that the first frame plays a deeper role: it functions as a visual entity buffer that video models reference and reuse throughout generation. By explicitly leveraging this mechanism with minimal training data (20-50 examples), users can customize video content, including object appearances, poses, and interactions, with minimal architectural modification.
The insight is that existing diffusion video models already have this capability embedded; we just need to exploit it with light few-shot fine-tuning.
Video diffusion models condition on a first frame, then iteratively generate subsequent frames. Conventional wisdom treats the first frame as simply a conditioning seed that establishes spatial layout and visual context.
This paper demonstrates the first frame is actually more: a semantic entity repository. The model internally references the first frame's visual entities (objects, characters, textures) throughout the generation process, reusing them as building blocks for coherent video generation.
Exploiting this for customization is straightforward: provide a custom first frame with desired entity variations, and the model naturally propagates those variations through the generated video with minimal retraining.
Step 1: Extract Visual Entities from First Frame.
import torch
import torch.nn as nn


class FirstFrameEntityExtractor(nn.Module):
    """
    Extract and encode visual entities from first frame.
    Entities: distinct objects, characters, textures, etc.
    """

    def __init__(self, vision_model_name='dino-v2', embedding_dim=768, feat_dim=2048):
        super().__init__()
        # load_vision_model and load_segmentation_model are placeholder helpers
        # that are not defined in this document; substitute your own loaders.
        # Vision encoder (e.g., DINO-v2 for semantic understanding)
        self.vision_encoder = load_vision_model(vision_model_name)
        self.embedding_dim = embedding_dim
        # Entity segmentation (detect where entities are)
        self.segmenter = load_segmentation_model('sam')  # Segment Anything
        # Entity encoder (compress each entity to embedding).
        # feat_dim must equal the vision encoder's output width
        # (for example, 768 for DINOv2 ViT-B/14).
        self.entity_encoder = nn.Sequential(
            nn.Linear(feat_dim, 1024),  # From vision encoder
            nn.ReLU(),
            nn.Linear(1024, embedding_dim)
        )

    def forward(self, first_frame):
        """
        Extract entities from first frame.
        first_frame: (3, H, W) image tensor
        Returns: dict with embeddings, masks, locations, frame_features
        """
        # Full frame encoding
        frame_features = self.vision_encoder(first_frame.unsqueeze(0))  # (1, feat_dim)
        # Segment entities (bounding boxes or masks)
        entity_masks = self.segmenter(first_frame)  # List of (H, W) binary masks

        # Encode each entity
        entity_embeddings = []
        entity_locations = []
        for mask in entity_masks:
            # Extract entity region (zero out pixels outside the mask)
            entity_region = first_frame * mask.unsqueeze(0)
            # Encode entity
            entity_feat = self.vision_encoder(entity_region.unsqueeze(0))
            entity_emb = self.entity_encoder(entity_feat)
            entity_embeddings.append(entity_emb)
            # Record location (centroid as mean row and column index)
            entity_loc = torch.where(mask > 0)
            centroid = (entity_loc[0].float().mean(), entity_loc[1].float().mean())
            entity_locations.append(centroid)

        entity_embeddings = torch.cat(entity_embeddings, dim=0)  # (num_entities, embedding_dim)
        return {
            'embeddings': entity_embeddings,
            'masks': entity_masks,
            'locations': entity_locations,
            'frame_features': frame_features
        }
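The masking and centroid logic used by the extractor can be checked in isolation on a toy input. This is a minimal sketch: the 4x4 frame and the hand-written mask are made up for illustration and stand in for real segmenter output.

```python
import torch

# Toy 3-channel 4x4 frame and a hand-written binary mask (both hypothetical).
frame = torch.ones(3, 4, 4)
mask = torch.zeros(4, 4)
mask[1:3, 2:4] = 1.0  # the entity occupies rows 1-2, columns 2-3

# Zero out everything outside the entity, as the extractor does.
entity_region = frame * mask.unsqueeze(0)

# Centroid = mean row index and mean column index of the mask's nonzero pixels.
rows, cols = torch.where(mask > 0)
centroid = (rows.float().mean().item(), cols.float().mean().item())
print(centroid)
```

Note that the centroid is in (row, column) order, which is (y, x) in image coordinates.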
Step 2: Integrate Entity Embeddings into Diffusion Model.
class FirstFrameConditionedVideoDiffusion(nn.Module):
    """
    Video diffusion model enhanced with first-frame entity conditioning.
    """

    def __init__(self, base_video_model, entity_embedding_dim=768):
        super().__init__()
        self.base_model = base_video_model
        self.entity_embedding_dim = entity_embedding_dim
        # Entity memory: store extracted entities for reuse
        self.entity_memory = {}
        # Cross-attention between generated frames and entity embeddings.
        # kdim/vdim let keys and values keep the entity width even when it
        # differs from the base model's hidden width.
        self.entity_crossattn = nn.MultiheadAttention(
            embed_dim=base_video_model.hidden_dim,
            num_heads=8,
            kdim=entity_embedding_dim,
            vdim=entity_embedding_dim,
            batch_first=True
        )

    def forward_with_entity_conditioning(self, noisy_video, timestep, entity_embeddings):
        """
        Generate video frames with first-frame entity conditioning.
        noisy_video: (batch, frames, 3, H, W) noisy video tensor
        timestep: current diffusion timestep
        entity_embeddings: (num_entities, embedding_dim) from first frame
        """
        # Encode via base model
        hidden = self.base_model.encoder(noisy_video)  # (batch, frames, hidden_dim)
        batch, frames = hidden.shape[:2]
        # Share the same entity set across the batch: (batch, num_entities, embedding_dim)
        entity_kv = entity_embeddings.unsqueeze(0).expand(batch, -1, -1)

        # Apply entity cross-attention at each frame
        refined = []
        for frame_idx in range(frames):
            frame_hidden = hidden[:, frame_idx, :]  # (batch, hidden_dim)
            # Cross-attend to entity embeddings
            # Query: current frame, Key/Value: entities from first frame
            entity_ctx, _ = self.entity_crossattn(
                frame_hidden.unsqueeze(1),  # (batch, 1, hidden_dim)
                entity_kv,
                entity_kv
            )
            # Blend original and entity-informed features (residual add)
            refined_frame = frame_hidden + entity_ctx.squeeze(1)
            refined.append(refined_frame)

        refined_hidden = torch.stack(refined, dim=1)  # (batch, frames, hidden_dim)
        # Decode back to pixel space
        output = self.base_model.decoder(refined_hidden, timestep)
        return output

    def store_entity_memory(self, entities_dict):
        """Store extracted entities for later reference."""
        self.entity_memory = entities_dict
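One detail worth checking in the cross-attention step is that the entity embeddings need not have the same width as the base model's hidden states. `nn.MultiheadAttention` accepts `kdim` and `vdim` for that case. The sketch below only verifies shapes; the sizes (320 and 768) and the random tensors are assumptions chosen for illustration, not values from the paper.

```python
import torch
import torch.nn as nn

hidden_dim, entity_dim = 320, 768  # hypothetical widths
batch, num_entities = 2, 5

# Queries live in hidden_dim; keys and values live in entity_dim.
attn = nn.MultiheadAttention(
    embed_dim=hidden_dim, num_heads=8,
    kdim=entity_dim, vdim=entity_dim, batch_first=True,
)

frame_hidden = torch.randn(batch, 1, hidden_dim)   # one query token per frame
entities = torch.randn(num_entities, entity_dim)   # stand-in for extractor output
entity_kv = entities.unsqueeze(0).expand(batch, -1, -1)

ctx, weights = attn(frame_hidden, entity_kv, entity_kv)
print(ctx.shape, weights.shape)
```

The returned attention weights have one row per query and one column per entity, so they can be inspected to see which first-frame entity a given frame is drawing on.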
Step 3: Few-Shot Fine-Tuning on Custom Entities.
def finetune_for_custom_entities(
    model, entity_extractor,
    custom_first_frames, target_videos,
    num_steps=100, lr=5e-4
):
    """
    Fine-tune model on small dataset of custom first frames.
    custom_first_frames: list of (3, H, W) images with desired entities
    target_videos: list of (frames, 3, H, W) target video outputs
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    for step in range(num_steps):
        # Sample a batch (up to 4 examples)
        batch_idx = torch.randperm(len(custom_first_frames))[:4].tolist()
        batch_loss = 0

        for idx in batch_idx:
            first_frame = custom_first_frames[idx]
            target_video = target_videos[idx].unsqueeze(0)  # (1, frames, 3, H, W)

            # Extract entities
            entity_info = entity_extractor(first_frame)
            entity_embeddings = entity_info['embeddings']

            # Noisy video (simulate diffusion forward process).
            # add_noise_to_video is a helper that is not defined in this document.
            noise = torch.randn_like(target_video)
            timestep = torch.randint(0, 1000, (1,))
            noisy_video = add_noise_to_video(target_video, timestep, noise)

            # Forward pass: predict with entity conditioning
            generated = model.forward_with_entity_conditioning(
                noisy_video, timestep, entity_embeddings
            )

            # Loss: match the injected noise, because inference (Step 4)
            # treats the model output as a noise prediction
            loss = torch.nn.functional.mse_loss(generated, noise)
            batch_loss += loss

        avg_loss = batch_loss / len(batch_idx)
        optimizer.zero_grad()
        avg_loss.backward()
        optimizer.step()

        if step % 20 == 0:
            print(f"Step {step}: loss={avg_loss.item():.4f}")
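The fine-tuning loop calls `add_noise_to_video` without defining it. A minimal sketch of such a helper follows, implementing the standard forward-diffusion sample. The linear beta schedule and the optional `noise` argument are assumptions; the source does not specify either.

```python
import torch

NUM_TRAIN_TIMESTEPS = 1000  # assumed; matches torch.randint(0, 1000, ...) in the loop
# Assumed linear beta schedule (the common DDPM default).
betas = torch.linspace(1e-4, 0.02, NUM_TRAIN_TIMESTEPS)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)


def add_noise_to_video(video, timestep, noise=None):
    """Sample x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * noise."""
    if noise is None:
        noise = torch.randn_like(video)
    a_bar = alphas_cumprod[timestep].to(video.dtype)
    # Reshape so the per-sample scalar broadcasts over all remaining video dims.
    a_bar = a_bar.reshape(-1, *([1] * (video.dim() - 1)))
    return a_bar.sqrt() * video + (1.0 - a_bar).sqrt() * noise


# Usage: an all-zero video plus all-one noise isolates the noise coefficient.
video = torch.zeros(1, 8, 3, 16, 16)
noise = torch.ones_like(video)
noisy = add_noise_to_video(video, torch.tensor([999]), noise)
print(noisy.shape, noisy[0, 0, 0, 0, 0].item())
```

Passing the noise in explicitly lets the training loop regress against the same noise tensor that was injected; calling the helper without it simply draws fresh Gaussian noise.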
Step 4: Inference with Custom First Frame.
@torch.no_grad()
def generate_customized_video(
    model, entity_extractor,
    custom_first_frame, num_frames=24, num_inference_steps=50
):
    """
    Generate video with custom first frame (customized entity buffer).
    """
    # Extract entities from custom first frame
    entity_info = entity_extractor(custom_first_frame)
    entity_embeddings = entity_info['embeddings']

    # Start from noise
    video_shape = (1, num_frames, 3, 512, 512)
    x_t = torch.randn(video_shape)

    # Spread the inference steps over the 1000 timesteps used in fine-tuning
    timesteps = torch.linspace(999, 0, num_inference_steps).long()

    # Reverse diffusion process with entity conditioning
    for i, t in enumerate(timesteps):
        timestep = t.reshape(1)
        # Model predicts noise with entity conditioning
        noise_pred = model.forward_with_entity_conditioning(
            x_t, timestep, entity_embeddings
        )
        # Estimate the clean video. get_alpha_t is a helper that is not defined
        # in this document; it is assumed to return the cumulative alpha at step t.
        alpha_t = get_alpha_t(int(t))
        x0_pred = (x_t - (1 - alpha_t) ** 0.5 * noise_pred) / (alpha_t ** 0.5)
        if i < len(timesteps) - 1:
            # Re-noise the estimate to the next (lower) noise level
            alpha_prev = get_alpha_t(int(timesteps[i + 1]))
            noise = torch.randn_like(x_t)
            x_t = alpha_prev ** 0.5 * x0_pred + (1 - alpha_prev) ** 0.5 * noise
        else:
            x_t = x0_pred

    return x_t.clamp(-1, 1)  # (1, num_frames, 3, H, W)
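The reverse step divides by the square root of `alpha_t`, which is the clean-sample recovery formula when `get_alpha_t` returns the cumulative product of the per-step alphas. The sketch below shows one possible `get_alpha_t` under that reading and checks the inversion on a single scalar. The linear beta schedule and the 1000-step horizon are assumptions, not values given in the source.

```python
import math
import operator
from itertools import accumulate

NUM_TRAIN_TIMESTEPS = 1000         # assumed training horizon
BETA_START, BETA_END = 1e-4, 0.02  # assumed linear schedule (common DDPM default)

_betas = [
    BETA_START + (BETA_END - BETA_START) * i / (NUM_TRAIN_TIMESTEPS - 1)
    for i in range(NUM_TRAIN_TIMESTEPS)
]
# Running product of (1 - beta): the fraction of signal variance left at step t.
_alphas_cumprod = list(accumulate((1.0 - b for b in _betas), operator.mul))


def get_alpha_t(t):
    """Return the cumulative alpha for integer timestep t (hypothetical helper)."""
    return _alphas_cumprod[int(t)]


# Round trip on one scalar "pixel": noise it, then invert using the true noise.
x0, eps, t = 0.5, -1.2, 500
a = get_alpha_t(t)
x_t = math.sqrt(a) * x0 + math.sqrt(1 - a) * eps
x0_recovered = (x_t - math.sqrt(1 - a) * eps) / math.sqrt(a)
print(x0_recovered)
```

With a perfect noise prediction the clean value is recovered exactly (up to floating-point error); during sampling the recovery is only as good as the model's noise estimate.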
When to Use: Video generation tasks requiring content customization (changing actors, objects, backgrounds) with limited retraining data. Few-shot customization (20-50 examples) is the key advantage.
Architecture Choices:
Pitfalls:
When NOT to Use: Tasks requiring dramatic scene changes across frames; full scene customization (not just entity variation).
Integration: Works with any diffusion-based video model (Stable Video Diffusion, Imagen Video, etc.); the only architectural change is the added entity cross-attention layer.
Reference: https://arxiv.org/abs/2511.15700