skills/vllm-omni-add-diffusion-model/SKILL.md
Add a new diffusion model (text-to-image, text-to-video, image-to-video, text-to-audio, image editing) to vLLM-Omni, including Cache-DiT acceleration and parallelism support (TP, SP/USP, CFG-Parallel, HSDP). Use when integrating a new diffusion model, porting a diffusers pipeline or a custom model repo to vllm-omni, creating a new DiT transformer adapter, adding diffusion model support, or enabling multi-GPU parallelism and cache acceleration for an existing model.
npx skillsauth add hsliuustc0106/vllm-omni-skills add-diffusion-modelInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
This skill guides you through adding a new diffusion model to vLLM-Omni. The model may come from HuggingFace Diffusers (structured pipeline) or from a private/custom repo. The workflow differs significantly depending on the source.
Before starting, determine:
Check the model's HF repo for model_index.json. This determines your path:
| Scenario | How to identify | Migration path |
|----------|----------------|----------------|
| Already supported | _class_name in model_index.json matches a key in _DIFFUSION_MODELS in registry.py | Skip to Step 5 (test) and Step 7 (docs) |
| Diffusers-based | Has standard model_index.json with _diffusers_version, subfolders for transformer/, vae/, etc. | Follow Path A below |
| Custom/private repo | No diffusers model_index.json, weights in non-standard format, custom model code in a separate git repo | Follow Path B below |
| Hybrid | Has some diffusers components (VAE) but custom transformer/fusion | Mix of Path A and Path B |
For models with a standard diffusers layout. See references/transformer-adaptation.md for detailed code patterns.
model_index.jsonIdentify components: transformer, scheduler, vae, text_encoder, tokenizer.
vllm_omni/diffusion/models/your_model_name/
├── __init__.py
├── pipeline_your_model.py
└── your_model_transformer.py
ModelMixin, ConfigMixin, AttentionModuleMixin).vllm_omni.diffusion.attention.layer.Attention (QKV shape: [B, seq, heads, head_dim]).od_config: OmniDiffusionConfig | None = None to __init__.load_weights() method mapping diffusers weight names to vllm-omni names._repeated_blocks, _layerwise_offload_blocks_attr.Inherit from nn.Module. The key contract:
class YourPipeline(nn.Module):
def __init__(self, *, od_config: OmniDiffusionConfig, prefix: str = ""):
# Load VAE, text encoder, tokenizer via from_pretrained()
# Instantiate transformer (weights loaded later via weights_sources)
self.weights_sources = [
DiffusersPipelineLoader.ComponentSource(
model_or_path=od_config.model, subfolder="transformer",
prefix="transformer.", fall_back_to_pt=True)]
def forward(self, req: OmniDiffusionRequest) -> DiffusionOutput:
# Encode prompt → prepare latents → denoise loop → VAE decode
return DiffusionOutput(output=output)
def load_weights(self, weights):
return AutoWeightsLoader(self).load_weights(weights)
Add post/pre-process functions in the same pipeline file. Register them in registry.py.
For models without a diffusers pipeline — weights in custom format, model code in a private repo. Real examples: DreamID-Omni, BAGEL, HunyuanImage3.
Study the original model's code to identify:
.pth, custom checkpoint structure)This is the key design decision for custom models. Follow these placement rules:
| Code type | Where to place | Example |
|-----------|---------------|---------|
| Pipeline orchestration (init, forward, denoise loop) | vllm_omni/diffusion/models/<name>/pipeline_<name>.py | Always required |
| Custom transformer/backbone (ported and adapted to vllm-omni) | vllm_omni/diffusion/models/<name>/<name>_transformer.py or similar | wan2_2.py, fusion.py, bagel_transformer.py |
| Custom sub-models (VAE, fusion, autoencoder) | vllm_omni/diffusion/models/<name>/ as separate files | autoencoder.py, fusion.py |
| External dependency code (original repo utilities) | External repo, installed via download script or pip | dreamid_omni package via git clone |
| Hardcoded model configs | Module-level dicts in pipeline file | VIDEO_CONFIG, AUDIO_CONFIG dicts |
| Download/setup script | examples/offline_inference/<name>/download_<name>.py | download_dreamid_omni.py |
| Custom model_index.json | Generated by download script, placed at model root | Minimal: {"_class_name": "YourPipeline", ...} |
If the model's code lives in a separate git repo:
Option 1: Import with graceful fallback (recommended for models with external utils)
try:
from external_model.utils import init_vae, load_checkpoint
except ImportError:
raise ImportError(
"Failed to import from dependency 'external_model'. "
"Please run the download script first."
)
Option 2: Port the code directly (preferred when feasible)
Copy the essential model files into vllm_omni/diffusion/models/<name>/ and adapt them. This avoids external dependencies. BAGEL does this — autoencoder.py and bagel_transformer.py are ported directly.
Decision criteria: Port if the code is self-contained and won't diverge. Use external deps if the model repo is actively maintained and the code is complex.
Custom models have two patterns for weight loading:
Pattern 1: Bypass standard loader (DreamID-Omni style)
When the original model has complex custom init functions that load weights in __init__:
class CustomPipeline(nn.Module):
def __init__(self, *, od_config, prefix=""):
super().__init__()
model = od_config.model
# Load everything eagerly in __init__ using custom helpers
self.vae = custom_init_vae(model, device=self.device)
self.text_encoder = custom_init_text_encoder(model, device=self.device)
self.transformer = CustomFusionModel(CONFIG)
load_custom_checkpoint(self.transformer,
checkpoint_path=os.path.join(model, "model.safetensors"))
# NO weights_sources defined — bypasses standard loader
def load_weights(self, weights):
pass # No-op — all weights loaded in __init__
Pattern 2: Use standard loader with custom load_weights (BAGEL style)
When weights are in safetensors format but need name remapping:
class CustomPipeline(nn.Module):
def __init__(self, *, od_config, prefix=""):
super().__init__()
# Instantiate model architecture without weights
self.bagel = BagelModel(config)
self.vae = AutoEncoder(ae_params)
# Point loader at the safetensors in the model root
self.weights_sources = [
DiffusersPipelineLoader.ComponentSource(
model_or_path=od_config.model,
subfolder=None, # weights at root, not in subfolder
prefix="",
fall_back_to_pt=False,
)
]
def load_weights(self, weights):
# Custom name remapping for non-diffusers weight names
params = dict(self.named_parameters())
loaded = set()
for name, tensor in weights:
# Remap original weight names to vllm-omni module names
name = self._remap_weight_name(name)
if name in params:
default_weight_loader(params[name], tensor)
loaded.add(name)
return loaded
model_index.jsonCustom models need a model_index.json at the model root for vllm-omni to discover them. For custom models, this is minimal:
{
"_class_name": "YourModelPipeline",
"custom_key": "path/to/custom_weights.safetensors"
}
The _class_name must match a key in _DIFFUSION_MODELS in registry.py. Additional keys are model-specific (accessed via od_config.model_config).
If the model's weights come from multiple HF repos, write a download script that:
model_index.json.pth file)Place at: examples/offline_inference/<name>/download_<name>.py
If the model accepts images, audio, or other multi-modal inputs, implement the protocol classes from vllm_omni/diffusion/models/interface.py:
from vllm_omni.diffusion.models.interface import SupportImageInput, SupportAudioInput
class MyPipeline(nn.Module, SupportImageInput, SupportAudioInput):
# Protocol markers — the engine uses these to enable proper input routing
pass
Preprocessing for custom models is typically done inside forward() rather than via registered pre-process functions, since the logic is often tightly coupled to the model.
Edit vllm_omni/diffusion/registry.py:
_DIFFUSION_MODELS = {
"YourModelPipeline": ("your_model_name", "pipeline_your_model", "YourModelPipeline"),
}
_DIFFUSION_POST_PROCESS_FUNCS = {
"YourModelPipeline": "get_your_model_post_process_func", # if applicable
}
_DIFFUSION_PRE_PROCESS_FUNCS = {
"YourModelPipeline": "get_your_model_pre_process_func", # if applicable
}
The registry key is the _class_name from model_index.json. The tuple is (folder_name, module_file, class_name).
Create __init__.py exporting the pipeline class and any factory functions.
Use the appropriate existing example script:
| Category | Script |
|----------|--------|
| Text-to-Image | examples/offline_inference/text_to_image/text_to_image.py |
| Text-to-Video | examples/offline_inference/text_to_video/text_to_video.py |
| Image-to-Video | examples/offline_inference/image_to_video/image_to_video.py |
| Image-to-Image | examples/offline_inference/image_to_image/image_edit.py |
| Text-to-Audio | examples/offline_inference/text_to_audio/text_to_audio.py |
For custom/Omni models that don't fit these categories, create a dedicated example script.
Validation: No errors, output is meaningful, quality matches reference implementation.
See references/troubleshooting.md for common errors.
For Omni or custom models, create:
examples/offline_inference/your_model_name/ — offline script + READMEexamples/online_serving/your_model_name/ — server script + clientRequired updates:
docs/user_guide/diffusion/parallelism_acceleration.md — parallelism support tabledocs/user_guide/diffusion/teacache.md — if TeaCache supporteddocs/user_guide/diffusion/cache_dit_acceleration.md — if Cache-DiT supportedexamples/offline_inference/xxx/README.md — offline example docsexamples/online_serve/xxx/README.md — online serve docsCreate tests/e2e/online_serving/test_your_model_expansion.py.
Cache-DiT accelerates inference by caching intermediate computation results across denoising steps. After your model is working correctly on a single GPU, add cache-dit support.
See references/cache-dit-patterns.md for detailed code patterns.
| Model Type | Description | Action |
|------------|-------------|--------|
| Standard single-transformer | One transformer with one ModuleList of blocks | No code needed — CacheDiTBackend auto-detects via enable_cache_for_dit() |
| Multi-block-list | One transformer with multiple block lists (e.g., transformer_blocks + single_transformer_blocks) | Write custom enabler with BlockAdapter |
| Dual-transformer | Two transformers (e.g., high-noise + low-noise) | Write custom enabler with BlockAdapter wrapping both |
For standard single-transformer models, test directly:
omni = Omni(
model="your-model-name",
cache_backend="cache_dit",
cache_config={
"Fn_compute_blocks": 1,
"Bn_compute_blocks": 0,
"max_warmup_steps": 4,
}
)
Check logs for "Cache-dit enabled successfully on xxx". If it works, skip to Step 9e.
For multi-block-list or dual-transformer models, write a custom enabler function:
from cache_dit import BlockAdapter, ForwardPattern, ParamsModifier, DBCacheConfig
def enable_cache_for_your_model(pipeline, cache_config):
db_cache_config = DBCacheConfig(
num_inference_steps=None,
Fn_compute_blocks=cache_config.Fn_compute_blocks,
Bn_compute_blocks=cache_config.Bn_compute_blocks,
max_warmup_steps=cache_config.max_warmup_steps,
max_cached_steps=cache_config.max_cached_steps,
max_continuous_cached_steps=cache_config.max_continuous_cached_steps,
residual_diff_threshold=cache_config.residual_diff_threshold,
)
cache_dit.enable_cache(
BlockAdapter(
transformer=pipeline.transformer,
blocks=[
pipeline.transformer.transformer_blocks,
pipeline.transformer.single_transformer_blocks,
],
forward_pattern=[ForwardPattern.Pattern_1, ForwardPattern.Pattern_1],
params_modifiers=[ParamsModifier(...)],
),
cache_config=db_cache_config,
)
def refresh_cache_context(pipeline, num_inference_steps, verbose=True):
cache_dit.refresh_context(
pipeline.transformer, num_inference_steps=num_inference_steps, verbose=verbose
)
return refresh_cache_context
Add your enabler to CUSTOM_DIT_ENABLERS in vllm_omni/diffusion/cache/cache_dit_backend.py:
CUSTOM_DIT_ENABLERS = {
"Wan22Pipeline": enable_cache_for_wan22,
"LongCatImagePipeline": enable_cache_for_longcat_image,
"YourModelPipeline": enable_cache_for_your_model, # Add here
}
omni = Omni(
model="your-model-name",
cache_backend="cache_dit",
cache_config={
"Fn_compute_blocks": 1, "Bn_compute_blocks": 0,
"max_warmup_steps": 4, "residual_diff_threshold": 0.24,
}
)
images = omni.generate("a beautiful landscape",
OmniDiffusionSamplingParams(num_inference_steps=50))
Verify: 1) logs show cache enabled, 2) 1.5-2x speedup, 3) output quality acceptable vs baseline.
If quality degrades, lower residual_diff_threshold (try 0.12-0.18) or increase max_warmup_steps (try 6-8).
After the model works on a single GPU, add multi-GPU parallelism. Add each type incrementally, testing after each addition.
See references/parallelism-patterns.md for detailed code patterns and API reference.
Recommended order: TP → SP/USP → CFG Parallel → HSDP
Shards DiT linear layers across GPUs. Requires code changes in the transformer.
What to change in the transformer:
nn.Linear with ColumnParallelLinear / RowParallelLinear / QKVParallelLinearload_weights() to handle QKV fusion with stacked_params_mappingself.to_qkv.num_heads (local heads) instead of total heads for split sizesfrom vllm.model_executor.layers.linear import (
QKVParallelLinear, RowParallelLinear, ColumnParallelLinear,
)
# Attention: QKV → RowParallel output
self.to_qkv = QKVParallelLinear(dim, head_dim, num_heads, num_kv_heads)
self.to_out = RowParallelLinear(dim, dim, input_is_parallel=True)
# FFN: ColumnParallel → RowParallel
self.w1 = ColumnParallelLinear(dim, ffn_dim)
self.w2 = RowParallelLinear(ffn_dim, dim, input_is_parallel=True)
Constraints: num_heads % tp_size == 0 and num_kv_heads % tp_size == 0.
Test: --tensor-parallel-size 2
Splits sequence tokens across GPUs. Non-intrusive via _sp_plan on the transformer class — no changes to forward().
What to change in the transformer:
Add _sp_plan class attribute:
from vllm_omni.diffusion.distributed.sp_plan import (
SequenceParallelInput, SequenceParallelOutput,
)
class YourTransformer(nn.Module):
_sp_plan = {
"blocks.0": {
"hidden_states": SequenceParallelInput(split_dim=1, expected_dims=3),
},
"proj_out": SequenceParallelOutput(gather_dim=1, expected_dims=3),
}
If inline tensor ops (e.g., torch.cat) exist between shard/gather points, extract them into nn.Module submodules so hooks can intercept them.
For RoPE that needs splitting, add an entry for the RoPE module with split_output=True.
Test: --ulysses-degree 2 (offline) or --usp 2 (online serving)
Distributes positive/negative CFG branches across 2 GPUs. Requires the pipeline to inherit CFGParallelMixin.
What to change in the pipeline:
from vllm_omni.diffusion.distributed.cfg_parallel import CFGParallelMixin
class YourPipeline(nn.Module, CFGParallelMixin):
def diffuse(self, ...) -> torch.Tensor:
for i, t in enumerate(timesteps):
positive_kwargs = {...}
negative_kwargs = {...} if do_true_cfg else None
noise_pred = self.predict_noise_maybe_with_cfg(
do_true_cfg=do_true_cfg, true_cfg_scale=cfg_scale,
positive_kwargs=positive_kwargs, negative_kwargs=negative_kwargs,
)
latents = self.scheduler_step_maybe_with_cfg(
noise_pred, t, latents, do_true_cfg
)
return latents
Override predict_noise() if your transformer call is non-standard. Override combine_cfg_noise() for multi-output models (e.g., video + audio).
Constraint: Exactly 2 GPUs. Only for models using classifier-free guidance.
Test: --cfg-parallel-size 2
Shards transformer weights via PyTorch FSDP2 to reduce per-GPU VRAM. No code changes to the forward pass — just add a class attribute.
What to change in the transformer:
class YourTransformer(nn.Module):
@staticmethod
def _is_transformer_block(name: str, module) -> bool:
return "blocks" in name and name.split(".")[-1].isdigit()
_hsdp_shard_conditions = [_is_transformer_block]
Constraint: Cannot combine with TP. For standalone HSDP, set hsdp_shard_size explicitly.
Test: --use-hsdp or DiffusionParallelConfig(use_hsdp=True)
After adding parallelism support, update:
docs/user_guide/diffusion/parallelism_acceleration.md — add your model to the support table--enforce-eager: Disable torch.compile during debuggingdevelopment
Use before submitting a PR to vllm-project/vllm-omni — self-check the branch against project conventions, catch dead code, verify accuracy/performance claims, and confirm merge readiness. Use when the user says "pre-check", "self review", "pre-submit check", or "check my PR before I open it."
development
--- name: vllm-omni-test-report description: Two report kinds; **default output is always HTML** unless the user explicitly asks for Markdown (.md). **Release** — `scripts/compose_full_report.py` (**测试结论**, Buildkite metrics, **Test Result** = Common stack + optional `--log-dir-h*` nightly-style summaries + H100/CI block, **Issue tracking** = GitHub `ci-failure` + *local test* in:title, Open bugs); use `--format markdown` only when the user wants .md or `patch_report_*.py`. **Nightly** — `script
testing
Review PRs on vllm-project/vllm-omni by routing to the right domain skills, checking critical evidence, and focusing comments on blocking issues. Use when reviewing pull requests or local branches, triaging review depth, running detailed or default review, or checking tests, benchmarks, and breaking changes in vllm-omni.
data-ai
Generate videos with vLLM-Omni using Wan2.2 and other video generation models. Use when generating videos from text, creating videos from images, configuring video generation parameters, or working with text-to-video or image-to-video models.