skills/skillxiv-v0.0.2-claude-opus-4.6/bifrost-patch-clip-diffusion/SKILL.md
Connects multimodal language models with diffusion models using patch-level CLIP embeddings as shared latent variables, enabling controllable image generation with minimal training overhead.
npx skillsauth add ADu2021/skillXiv bifrost-patch-clip-diffusionInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Bifrost-1 creates a bridge between multimodal LLMs and diffusion models using patch-level CLIP embeddings as a shared latent space. This approach leverages the fact that MLLMs already understand patch-level representations from their CLIP visual encoder during pretraining, enabling efficient integration without expensive retraining of either component.
Step 1: Extract and Analyze Patch-level CLIP Embeddings
Understand patch-level representations:
# Pseudocode for patch-level CLIP analysis
class PatchCLIPAnalyzer:
def __init__(self, clip_model, patch_size=16):
super().__init__()
self.clip_model = clip_model
self.patch_size = patch_size
def extract_patch_embeddings(self, image):
"""
Extract patch-level CLIP embeddings from image.
Args:
image: Input image (batch, 3, height, width)
Returns:
patch_embeddings: (batch, num_patches, embed_dim)
"""
# Process through CLIP vision encoder
with torch.no_grad():
# CLIP vision encoder outputs patch tokens
patch_tokens = self.clip_model.visual.transformer(image)
return patch_tokens
def compute_patch_statistics(self, images):
"""
Analyze patch embedding distribution.
"""
all_patches = []
for image in images:
patches = self.extract_patch_embeddings(image.unsqueeze(0))
all_patches.append(patches)
all_patches = torch.cat(all_patches, dim=0)
return {
'mean': all_patches.mean(dim=(0, 1)),
'std': all_patches.std(dim=(0, 1)),
'norm': all_patches.norm(dim=-1).mean()
}
def reconstruct_from_patches(self, patch_embeddings, original_image_size):
"""
Reconstruct image from patch embeddings (for visualization).
"""
batch_size, num_patches, embed_dim = patch_embeddings.shape
# Calculate grid dimensions
h = w = int(np.sqrt(num_patches))
# Reshape to spatial grid
patches_grid = patch_embeddings.reshape(batch_size, h, w, embed_dim)
# Upsample patches to image
patches_upsampled = patches_grid.permute(0, 3, 1, 2)
image = F.interpolate(
patches_upsampled,
size=original_image_size,
mode='nearest'
)
return image
Step 2: Design MLLM Visual Generation Branch
Extend MLLM to generate patch embeddings:
# Pseudocode for MLLM visual generation branch
class MLLMWithVisualGeneration(nn.Module):
def __init__(self, base_mllm, clip_embed_dim, num_patches):
super().__init__()
self.base_mllm = base_mllm
self.clip_embed_dim = clip_embed_dim
self.num_patches = num_patches
# Visual generation head
self.visual_generation_head = nn.Sequential(
nn.Linear(base_mllm.hidden_dim, 2048),
nn.GELU(),
nn.Linear(2048, num_patches * clip_embed_dim)
)
# Initialize from original MLLM parameters
self._initialize_from_base()
def _initialize_from_base(self):
"""
Initialize visual head from base MLLM parameters.
"""
# Copy similar weight structures
with torch.no_grad():
for param in self.visual_generation_head.parameters():
if len(param.shape) > 1:
nn.init.normal_(param, std=0.02)
else:
nn.init.zeros_(param)
def forward(self, input_ids, pixel_values=None, generate_image=False):
"""
Forward pass with optional image generation.
Args:
input_ids: Token IDs
pixel_values: Input image pixels (optional)
generate_image: Whether to generate patch embeddings
Returns:
outputs: Standard MLLM outputs
patch_embeddings: Generated patches if generate_image=True
"""
# Standard MLLM forward
outputs = self.base_mllm(
input_ids=input_ids,
pixel_values=pixel_values,
output_hidden_states=True
)
# Generate patch embeddings if requested
if generate_image:
hidden_states = outputs.hidden_states[-1]
# Take final hidden state for generation
final_hidden = hidden_states[:, -1, :]
# Generate patches
patch_embeddings = self.visual_generation_head(final_hidden)
patch_embeddings = patch_embeddings.reshape(
patch_embeddings.shape[0],
self.num_patches,
self.clip_embed_dim
)
return outputs, patch_embeddings
return outputs
def preserve_reasoning_capability(self):
"""
Ensure multimodal reasoning not degraded.
"""
# Verify base MLLM weights unchanged
for param in self.base_mllm.parameters():
param.requires_grad = False
# Only train visual generation head
for param in self.visual_generation_head.parameters():
param.requires_grad = True
Step 3: Implement Lightweight ControlNet Adaptation
Adapt diffusion model with minimal parameters:
# Pseudocode for ControlNet adaptation
class PatchCLIPControlNet(nn.Module):
def __init__(self, unet_in_channels, patch_embed_dim, num_patches):
super().__init__()
self.patch_embed_dim = patch_embed_dim
# Lightweight projection from patches to control signal
self.patch_projection = nn.Sequential(
nn.Linear(patch_embed_dim, 256),
nn.GELU(),
nn.Linear(256, unet_in_channels)
)
# Spatial adapter (minimal parameters)
self.spatial_adapter = nn.Conv2d(
unet_in_channels,
unet_in_channels,
kernel_size=3,
padding=1,
groups=unet_in_channels
)
def forward(self, patch_embeddings, timestep):
"""
Generate control signal from patch embeddings.
Args:
patch_embeddings: (batch, num_patches, embed_dim)
timestep: Diffusion timestep
Returns:
control_signal: Control signal for diffusion model
"""
batch_size, num_patches, embed_dim = patch_embeddings.shape
# Project patches
projected = self.patch_projection(patch_embeddings)
# Reshape to spatial format
h = w = int(np.sqrt(num_patches))
spatial = projected.reshape(batch_size, h, w, -1)
spatial = spatial.permute(0, 3, 1, 2)
# Apply spatial adaptation
control_signal = self.spatial_adapter(spatial)
return control_signal
Step 4: Integrate and Train the Full System
Combine MLLM and diffusion model:
# Pseudocode for full system integration
class BifrostSystem(nn.Module):
def __init__(self, mllm_with_vision, diffusion_model, controlnet):
super().__init__()
self.mllm = mllm_with_vision
self.diffusion = diffusion_model
self.controlnet = controlnet
def generate_image_from_text(self, text, num_steps=50):
"""
Generate image conditioned on text using Bifrost.
"""
# Tokenize text
input_ids = self.mllm.tokenizer(text, return_tensors='pt')['input_ids']
# Generate patch embeddings from text
_, patch_embeddings = self.mllm(
input_ids=input_ids,
generate_image=True
)
# Generate image using diffusion with control
images = self._diffusion_generation(patch_embeddings, num_steps)
return images
def _diffusion_generation(self, patch_embeddings, num_steps):
"""
Run diffusion process with patch control.
"""
batch_size = patch_embeddings.shape[0]
# Initialize noise
x_t = torch.randn(batch_size, 3, 512, 512, device=patch_embeddings.device)
# Reverse diffusion
for t in range(num_steps - 1, -1, -1):
timestep = torch.tensor([t], device=x_t.device)
# Get control signal
control = self.controlnet(patch_embeddings, timestep)
# Diffusion step
with torch.no_grad():
noise_pred = self.diffusion.unet(
x_t,
timestep.expand(batch_size),
control=control
)
# Update x_t
x_t = self._diffusion_step(x_t, noise_pred, t)
return x_t
def _diffusion_step(self, x_t, noise_pred, t):
"""
Single diffusion step update.
"""
# DDIM or similar update rule
alpha_t = self.diffusion.alphas[t]
alpha_prev = self.diffusion.alphas[t - 1] if t > 0 else torch.tensor(1.0)
sigma_t = torch.sqrt((1 - alpha_prev) / (1 - alpha_t))
x_prev = (x_t - torch.sqrt(1 - alpha_t) * noise_pred) / torch.sqrt(alpha_t)
x_prev = x_prev + sigma_t * torch.randn_like(x_prev)
return x_prev
def train_system(self, training_data, num_epochs=5):
"""
Train visual generation branch (minimal training).
"""
optimizer = AdamW(self.mllm.visual_generation_head.parameters(), lr=1e-4)
for epoch in range(num_epochs):
for batch in training_data:
text = batch['text']
target_image = batch['image']
# Forward pass
input_ids = self.mllm.tokenizer(text, return_tensors='pt')['input_ids']
_, patch_embeddings = self.mllm(input_ids=input_ids, generate_image=True)
# Reconstruction loss
loss = F.mse_loss(patch_embeddings, batch['target_patches'])
optimizer.zero_grad()
loss.backward()
optimizer.step()
return self
Hyperparameters and Configuration:
When to Use Bifrost-1:
When NOT to Use:
Implementation Notes:
Paper: Bifrost-1: Bridging MLLMs and Diffusion with Patch-level CLIP ArXiv: 2508.05954 Performance: High-fidelity controllable image generation with significant training efficiency vs approaches requiring MLLM retraining
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.