skills/vllm-omni-diffusion-perf-optim/SKILL.md
Guide for achieving optimal inference performance with vLLM-Omni diffusion models. Covers all lossless and lossy optimization methods (parallelism, torch.compile, CPU offload, quantization, cache acceleration), per-model support tables, and ready-to-use recipes. Use when asked to speed up diffusion inference, reduce latency, lower VRAM usage, or tune a diffusion pipeline.
Use this guide when a user asks how to speed up diffusion inference, reduce latency, lower VRAM, or tune a diffusion pipeline in vLLM-Omni.
This skill is designed to stay up to date. Instead of hardcoding model support tables, it tells you where to look in the codebase to discover current capabilities. See Discovering Current Capabilities and Extending This Skill at the end.
Before optimizing, establish a baseline:
- Identify the pipeline class (model_index.json → _class_name).
- Run with --enforce-eager (disables torch.compile) and no parallelism.

Online serving (preferred; measures real deployment latency):
```bash
# Start server
vllm serve <MODEL> --omni --port 8098 --enforce-eager

# Send request and measure e2e time
time curl -sS -X POST http://localhost:8098/v1/videos \
  -F "prompt=..." -F "width=768" -F "height=480" \
  -F "num_frames=41" -F "num_inference_steps=20" -F "seed=42"

# Poll until completed, record inference_time_s from status response
curl -sS http://localhost:8098/v1/videos/<VIDEO_ID> | jq '.inference_time_s'
```
Offline inference (useful for quick iteration):
```bash
python examples/offline_inference/text_to_video/text_to_video.py \
  --model <MODEL> --enforce-eager --prompt "..." --output baseline.mp4
```
Important: Always report online serving numbers for deployment decisions. Offline benchmarks may differ due to process startup, torch.compile warmup, and measurement methodology.
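To turn the recorded timings into a comparable number, one option is to discard the first (cold) request and compare medians of the remaining warm requests. The sketch below assumes you have collected per-request latencies by hand; the helper names and the sample numbers are hypothetical and are not part of vLLM-Omni.

```python
from statistics import median


def warm_median(latencies_s, warmup=1):
    """Median latency after discarding the first `warmup` (cold) requests."""
    warm = latencies_s[warmup:]
    if not warm:
        raise ValueError("need at least one request after warmup")
    return median(warm)


def speedup(baseline_s, optimized_s, warmup=1):
    """Ratio of warm baseline latency to warm optimized latency (>1 means faster)."""
    return warm_median(baseline_s, warmup) / warm_median(optimized_s, warmup)


# Hypothetical measurements in seconds; the first entry of each list is the cold request.
baseline = [31.0, 20.0, 20.4, 19.6]
optimized = [45.0, 16.0, 16.2, 15.8]
print(f"warm speedup: {speedup(baseline, optimized):.2f}x")
```

Keeping the cold request out of the median separates steady-state latency from startup and compilation warmup, which are reported separately above.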
The lossless optimizations in this step (Step 1) do not affect output quality. Apply them in order of impact.
What: Compiles repeated DiT transformer blocks via torch.compile(dynamic=True). Fuses ops, reduces kernel launch overhead.
How: Enabled by default. Use --enforce-eager to disable.
Speedup: Model- and GPU-dependent. May provide 1.1–1.5× on the denoising loop, but on some GPU architectures (e.g., H800) and models, warm-request latency may match eager.
Requirements: Model transformer must define _repeated_blocks attribute. First request pays compilation overhead (~5–15s extra).
Online serving note: The first request after server start incurs compilation warmup. Subsequent requests run at compiled speed. For latency-sensitive deployments, consider --enforce-eager to avoid first-request penalty, especially if compile does not measurably improve warm latency for your model/GPU.
Config: OmniDiffusionConfig.enforce_eager (default False = compile enabled).
Source: vllm_omni/diffusion/compile.py, vllm_omni/diffusion/worker/diffusion_model_runner.py
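Whether the first-request compilation penalty is worth paying depends on how many warm requests follow it. A minimal sketch of that break-even arithmetic, assuming you have measured eager and compiled warm latencies yourself (the function name and the numbers are hypothetical):

```python
import math


def break_even_requests(compile_overhead_s, eager_s, compiled_s):
    """Requests needed before the one-time compile cost is repaid.

    Returns None when compiled warm latency is not better than eager,
    in which case the overhead is never recovered.
    """
    saved_per_request = eager_s - compiled_s
    if saved_per_request <= 0:
        return None
    return math.ceil(compile_overhead_s / saved_per_request)


# Hypothetical: 10 s compile overhead, 20 s eager, 16 s compiled per request.
print(break_even_requests(10.0, 20.0, 16.0))
# Hypothetical: compile does not help on this model/GPU.
print(break_even_requests(10.0, 20.0, 20.0))
```

If the result is None or larger than the number of requests a server instance typically handles, --enforce-eager is the simpler choice.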
All parallelism methods below are configured via DiffusionParallelConfig. Check docs/user_guide/diffusion/parallelism_acceleration.md for the per-model support table before enabling any of them.
What: Splits sequence tokens across GPUs using all-to-all communication (DeepSpeed Ulysses).
How: --ulysses-degree N (offline) or --usp N (online serving)
Speedup: Near-linear scaling. Best for long-sequence models (video, high-res image).
```python
from vllm_omni.diffusion.data import DiffusionParallelConfig

parallel_config = DiffusionParallelConfig(ulysses_degree=2)
omni = Omni(model="...", parallel_config=parallel_config)
```
What: Ring-based P2P communication for attention across GPUs.
How: --ring-degree N (offline) or --ring N (online serving)
Note: Can combine with Ulysses: ulysses_degree × ring_degree = total SP GPUs.
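Since ulysses_degree × ring_degree must equal the number of sequence-parallel GPUs, the candidate layouts for a given GPU count are just its factor pairs. A small sketch that enumerates them (the helper name is hypothetical; which layout is fastest still has to be measured):

```python
def sp_layouts(total_sp_gpus):
    """All (ulysses_degree, ring_degree) pairs whose product is total_sp_gpus."""
    if total_sp_gpus < 1:
        raise ValueError("total_sp_gpus must be >= 1")
    return [
        (u, total_sp_gpus // u)
        for u in range(1, total_sp_gpus + 1)
        if total_sp_gpus % u == 0
    ]


for ulysses, ring in sp_layouts(8):
    # Online serving flags as listed above: --usp for Ulysses, --ring for Ring.
    print(f"--usp {ulysses} --ring {ring}")
```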
What: Runs positive/negative CFG branches on separate GPUs. Only rank 0 computes scheduler step.
How: --cfg-parallel-size 2
Speedup: ~2× on models using classifier-free guidance.
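Constraint: Uses exactly 2 CFG ranks, one per branch. When combined with other parallelism, the total GPU count is a multiple of 2, as in the 4-GPU example that follows. Only for models that use CFG.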
```bash
# 4-GPU: CFG parallel (2) × Ulysses (2)
python text_to_image.py --model Qwen/Qwen-Image \
  --cfg-parallel-size 2 --ulysses-degree 2
```
What: Shards DiT linear layers across GPUs using ColumnParallelLinear, RowParallelLinear, QKVParallelLinear.
How: --tensor-parallel-size N
Note: Only DiT blocks are sharded — text encoder is replicated on all ranks (extra VRAM per GPU). See Issue #771.
What: Shards VAE decode spatially across ranks using tiling.
How: --vae-patch-parallel-size N
Constraint: Auto-enables --vae-use-tiling.
What: Shards model weights across GPUs using PyTorch FSDP2. Reduces per-GPU VRAM.
How: Via DiffusionParallelConfig(use_hsdp=True). Requires multi-GPU.
What: Shards MoE experts across devices with all-to-all token routing.
How: --enable-expert-parallel
Constraint: Only for MoE models (e.g., HunyuanImage3.0).
CPU offload comes in two mutually exclusive strategies. Both are single-GPU only.
What: Swaps DiT ↔ encoders on GPU. Only one group is on GPU at a time.
How: --enable-cpu-offload or Omni(enable_cpu_offload=True)
Tradeoff: Adds H2D transfer latency between encoder and denoising phases.
What: Keeps only 1 transformer block on GPU at a time. Async prefetch via separate CUDA stream.
How: --enable-layerwise-offload or Omni(enable_layerwise_offload=True)
Best for: Large video models (Wan A14B) where per-block compute >> H2D transfer → nearly zero-cost offload.
Requirement: Model DiT must define _layerwise_offload_blocks_attr.
VRAM savings: Dramatic (e.g., 40+ GB → ~11 GB for Wan A14B).
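The "nearly zero-cost" claim for layerwise offload follows from overlapping the next block's H2D copy with the current block's compute. The sketch below is a simplified timing model, not vLLM-Omni code: it assumes identical blocks, one block prefetched ahead, and hypothetical per-block times in milliseconds.

```python
def forward_ms_resident(num_blocks, compute_ms):
    """All blocks already on GPU: compute only."""
    return num_blocks * compute_ms


def forward_ms_serial_offload(num_blocks, compute_ms, transfer_ms):
    """No overlap: copy each block to GPU, then compute it."""
    return num_blocks * (transfer_ms + compute_ms)


def forward_ms_prefetch_offload(num_blocks, compute_ms, transfer_ms):
    """Prefetch block i+1 on a side stream while block i computes.

    The first copy is exposed; after that each stage costs the slower of
    compute and transfer, and the last block only needs its compute.
    """
    return transfer_ms + (num_blocks - 1) * max(compute_ms, transfer_ms) + compute_ms


# Hypothetical: 40 blocks, 10 ms compute per block, 4 ms transfer per block.
print(forward_ms_resident(40, 10))             # 400
print(forward_ms_serial_offload(40, 10, 4))    # 560
print(forward_ms_prefetch_offload(40, 10, 4))  # 404
# If transfer is slower than compute, the copy becomes the bottleneck.
print(forward_ms_prefetch_offload(40, 10, 15))  # 610
```

Under this model the overhead is a single exposed transfer when per-block compute exceeds per-block transfer, which is why the method suits large video models.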
- --vae-use-slicing: Process VAE in slices (saves VRAM).
- --vae-use-tiling: Process VAE in tiles (saves VRAM, enables patch parallel).

Both are boolean flags. Use them when VAE decode runs out of memory.
What: Online quantization of DiT linear layers to FP8.
How: --quantization fp8
Requirements: Ada/Hopper GPU (SM89+). Native hardware FP8.
VRAM: ~50% reduction on DiT weights. Speedup: 1.3–1.5×.
```bash
python text_to_image.py --model Qwen/Qwen-Image --quantization fp8
```
Layer skipping: --ignored-layers 'add_kv_proj,to_add_out' to exclude specific layers from quantization.
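The ~50% figure is what you would expect from going from 2 bytes per weight to 1. A back-of-the-envelope sketch, assuming the baseline weights are BF16 and ignoring scale tensors, ignored layers, and activations (the parameter count is hypothetical):

```python
def weight_gib(num_params, bytes_per_param):
    """Approximate weight memory in GiB for a given parameter count."""
    return num_params * bytes_per_param / 2**30


# Hypothetical DiT with 20 billion parameters in quantized linear layers.
dit_params = 20_000_000_000
bf16_gib = weight_gib(dit_params, 2)  # assumed baseline: 2 bytes per weight
fp8_gib = weight_gib(dit_params, 1)   # FP8: 1 byte per weight
print(f"BF16: {bf16_gib:.1f} GiB, FP8: {fp8_gib:.1f} GiB, "
      f"saved: {1 - fp8_gib / bf16_gib:.0%}")
```

Layers excluded via --ignored-layers stay at the original precision, so the real saving is somewhat below this estimate.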
What: Loads pre-quantized GGUF weights for transformer.
How: --quantization gguf --gguf-model <path-or-hf-id>
Source: docs/user_guide/diffusion/quantization/gguf.md
The lossy optimizations in this step (Step 2) trade quality for speed. Always compare output quality against the baseline.
What: Caches transformer computations when consecutive timesteps are similar. Skips redundant forward passes.
Speedup: 1.5–2.0× depending on rel_l1_thresh.
How:
```python
omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="tea_cache",
    cache_config={"rel_l1_thresh": 0.2},
)
```
CLI: --cache-backend tea_cache
Online: vllm serve <MODEL> --omni --cache-backend tea_cache --cache-config '{"rel_l1_thresh": 0.2}'
Quality tuning:
- 0.1–0.2: minimal quality loss (~1.5× speedup)
- 0.4: slight quality loss (~1.8× speedup)
- 0.6–0.8: noticeable quality loss (~2.0–2.25× speedup)

Supported models: Qwen-Image family, BAGEL. See docs/user_guide/diffusion/teacache.md.
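To make the role of rel_l1_thresh concrete, here is a simplified sketch of a threshold-based skip decision: accumulate the relative L1 change between consecutive timestep inputs and reuse the cached result while the accumulated change stays below the threshold. This is an illustration under assumptions, not the vLLM-Omni implementation; it uses plain lists instead of tensors and omits details such as any rescaling of the distance. All names are hypothetical.

```python
def rel_l1(prev, curr):
    """Relative L1 distance between two equal-length vectors."""
    num = sum(abs(c - p) for p, c in zip(prev, curr))
    den = sum(abs(p) for p in prev)
    return num / den if den else float("inf")


def plan_steps(inputs, rel_l1_thresh):
    """Return 'compute' or 'reuse' for each timestep input."""
    decisions = []
    accumulated = 0.0
    prev = None
    for x in inputs:
        if prev is None:
            decisions.append("compute")  # nothing cached yet
        else:
            accumulated += rel_l1(prev, x)
            if accumulated < rel_l1_thresh:
                decisions.append("reuse")  # change is small: skip the forward pass
            else:
                decisions.append("compute")
                accumulated = 0.0  # fresh result, restart accumulation
        prev = x
    return decisions


# Hypothetical per-timestep inputs that drift slowly.
steps = [[10, 10], [11, 10], [12, 10], [12, 11]]
print(plan_steps(steps, rel_l1_thresh=0.12))
print(plan_steps(steps, rel_l1_thresh=0.01))
```

A larger threshold lets more consecutive steps reuse the cache, which is where both the extra speedup and the quality loss in the list above come from.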
What: Hybrid caching with three sub-methods: DBCache, TaylorSeer, and SCM. Each has its own keys in cache_config.
Speedup: 1.5–2.5× depending on configuration.
How:
```python
omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="cache_dit",
    cache_config={
        # DBCache
        "Fn_compute_blocks": 1,
        "Bn_compute_blocks": 0,
        "max_warmup_steps": 4,
        "residual_diff_threshold": 0.24,
        "max_continuous_cached_steps": 3,
        # TaylorSeer (optional)
        "enable_taylorseer": False,
        "taylorseer_order": 1,
        # SCM (optional)
        "scm_steps_mask_policy": None,  # "slow"/"medium"/"fast"/"ultra"
        "scm_steps_policy": "dynamic",
    },
)
```
CLI: --cache-backend cache_dit
Excluded models: NextStep11Pipeline, StableDiffusionPipeline (see _NO_CACHE_ACCELERATION in registry.py).
Source: docs/user_guide/diffusion/cache_dit_acceleration.md
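The three DBCache knobs in the config above interact, and a toy simulation can help when tuning them. The sketch below encodes one plausible reading of the parameter names (warmup steps always run in full, a step is cached when its residual difference is below the threshold, and cached steps are capped per streak). That reading is an assumption, not a description of the backend's actual logic, and the residual values are hypothetical.

```python
def dbcache_plan(
    residual_diffs,
    max_warmup_steps=4,
    residual_diff_threshold=0.24,
    max_continuous_cached_steps=3,
):
    """Label each step 'full' or 'cached' under the assumed rules."""
    plan = []
    streak = 0  # consecutive cached steps so far
    for step, diff in enumerate(residual_diffs):
        can_cache = (
            step >= max_warmup_steps
            and diff < residual_diff_threshold
            and streak < max_continuous_cached_steps
        )
        if can_cache:
            plan.append("cached")
            streak += 1
        else:
            plan.append("full")
            streak = 0
    return plan


# Hypothetical residual differences for a 10-step run.
diffs = [0.9, 0.5, 0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
plan = dbcache_plan(diffs)
print(plan)
print(f"cached {plan.count('cached')} of {len(plan)} steps")
```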
Reducing --num-inference-steps gives linear speedup but affects quality. Typical ranges:
The tables below may become stale as new models and methods are added. Always verify against the live codebase using these source-of-truth files:
Read the canonical table in docs/user_guide/diffusion/parallelism_acceleration.md.
It lists every model with ✅/❌ for each parallelism method (Ulysses-SP, Ring, CFG, TP, VAE-Patch, Expert, HSDP).
To check programmatically whether a specific model supports a method:
| Check | How |
|-------|-----|
| Ulysses / Ring SP | Transformer class defines _sp_plan. Search: grep -r '_sp_plan' vllm_omni/diffusion/models/ |
| CFG Parallel | Pipeline or transformer inherits CFGParallelMixin. Search: grep -r 'CFGParallelMixin' vllm_omni/diffusion/models/ |
| TP | Transformer uses ColumnParallelLinear / RowParallelLinear / QKVParallelLinear. Search: grep -r 'ParallelLinear\|QKVParallel' vllm_omni/diffusion/models/<model>/ |
| Layerwise offload | Pipeline defines _layerwise_offload_blocks_attr. Search: grep -r '_layerwise_offload_blocks_attr' vllm_omni/diffusion/models/ |
| torch.compile | Transformer defines _repeated_blocks. Search: grep -r '_repeated_blocks' vllm_omni/diffusion/models/ |
| HSDP | Check DiffusionParallelConfig usage in docs and tests. |
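The grep checks in the table can also be run in one pass from Python. The sketch below simply searches .py files for the marker strings from the table; it is a text match, so a hit in a comment counts as a hit. The function name is hypothetical, and the directory you pass in is whatever models path exists in your checkout.

```python
from pathlib import Path

# Marker strings taken from the table above.
MARKERS = {
    "Ulysses / Ring SP": "_sp_plan",
    "CFG Parallel": "CFGParallelMixin",
    "Layerwise offload": "_layerwise_offload_blocks_attr",
    "torch.compile": "_repeated_blocks",
}


def find_markers(models_dir):
    """Map each capability to the sorted list of .py files mentioning its marker."""
    hits = {name: [] for name in MARKERS}
    for path in sorted(Path(models_dir).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for name, marker in MARKERS.items():
            if marker in text:
                hits[name].append(str(path))
    return hits
```

For example, find_markers("vllm_omni/diffusion/models") run from the repository root returns the files per capability, which you can then compare against the canonical table in the docs.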
- Cache exclusion list: _NO_CACHE_ACCELERATION in vllm_omni/diffusion/registry.py. Any pipeline class in that set does not support tea_cache or cache_dit.
- tea_cache: see docs/user_guide/diffusion/teacache.md for the current list.
- cache_dit: applies to pipelines not in _NO_CACHE_ACCELERATION. See docs/user_guide/diffusion/cache_dit_acceleration.md.
- Quantization methods: vllm_omni/diffusion/quantization/. Each .py file is a method (e.g., fp8.py, gguf.py).
- Quantization config: OmniDiffusionConfig.quantization field in vllm_omni/diffusion/data.py.
- Quantization docs: docs/user_guide/diffusion/quantization/

Run vllm serve --help and look for --omni-related flags. Key flags:
--usp, --ring, --cfg-parallel-size, --tensor-parallel-size, --vae-patch-parallel-size,
--cache-backend, --quantization, --enforce-eager, --enable-cpu-offload,
--enable-layerwise-offload, --vae-use-slicing, --vae-use-tiling, --use-hsdp,
--enable-expert-parallel, --flow-shift, --boundary-ratio.
Recipes show both online serving (preferred for deployment) and offline variants.
```bash
# Server
vllm serve Qwen/Qwen-Image --omni --port 8098 --quantization fp8
# Client
curl -X POST http://localhost:8098/v1/images/generations \
  -F "prompt=A futuristic city at sunset" -F "seed=42"
```

```bash
# CFG parallel (2) × Ulysses (2) + FP8
vllm serve Qwen/Qwen-Image --omni --port 8098 \
  --cfg-parallel-size 2 --usp 2 --quantization fp8
```

```bash
# Layerwise offload + VAE slicing and tiling
vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni --port 8098 \
  --enable-layerwise-offload --vae-use-slicing --vae-use-tiling
```

```bash
# Ulysses (4) × Ring (2) + VAE patch parallel (8) + FP8
vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni --port 8098 \
  --usp 4 --ring 2 --vae-patch-parallel-size 8 --quantization fp8
```

```bash
# Eager mode + cache_dit (lossy)
vllm serve Qwen/Qwen-Image --omni --port 8098 \
  --enforce-eager --cache-backend cache_dit
```

```bash
# LTX-2, eager mode
vllm serve Lightricks/LTX-2 --omni --port 8098 \
  --enforce-eager --flow-shift 1.0 --boundary-ratio 1.0
```

```bash
# LTX-2, eager mode + cache_dit (lossy)
vllm serve Lightricks/LTX-2 --omni --port 8098 \
  --enforce-eager --flow-shift 1.0 --boundary-ratio 1.0 \
  --cache-backend cache_dit
```
For quick local testing, replace vllm serve ... --omni with the offline scripts:
```bash
# Image
python examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt "..." --quantization fp8

# Video
python examples/offline_inference/text_to_video/text_to_video.py \
  --model Lightricks/LTX-2 --prompt "..." --enforce-eager
```
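When sweeping several recipe variants, it can help to generate the vllm serve command line instead of editing it by hand. The helper below is a hypothetical convenience wrapper, not part of vLLM-Omni; it only emits online-serving flags that appear in this guide and does not check whether a model supports them.

```python
import shlex


def build_serve_command(
    model,
    port=8098,
    usp=1,
    ring=1,
    cfg_parallel_size=1,
    quantization=None,
    cache_backend=None,
    enforce_eager=False,
):
    """Assemble the argv for an online-serving recipe from a few options."""
    argv = ["vllm", "serve", model, "--omni", "--port", str(port)]
    if enforce_eager:
        argv.append("--enforce-eager")
    if cfg_parallel_size > 1:
        argv += ["--cfg-parallel-size", str(cfg_parallel_size)]
    if usp > 1:
        argv += ["--usp", str(usp)]
    if ring > 1:
        argv += ["--ring", str(ring)]
    if quantization:
        argv += ["--quantization", quantization]
    if cache_backend:
        argv += ["--cache-backend", cache_backend]
    return argv


cmd = build_serve_command(
    "Qwen/Qwen-Image", cfg_parallel_size=2, usp=2, quantization="fp8"
)
print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run, which avoids shell-quoting problems with values such as a JSON --cache-config string.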
```
Is output quality paramount?
├── YES → Use only Step 1 (lossless)
│   ├── Single GPU? → torch.compile (default) + FP8 quantization
│   ├── Multi-GPU? → Add SP/TP/CFG parallel (check support table)
│   └── OOM? → Enable CPU offload or VAE slicing/tiling
└── NO → Also apply Step 2 (lossy)
    ├── TeaCache supported? → Use tea_cache with rel_l1_thresh=0.2
    └── DiT model? → Use cache_dit with defaults
```
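The same decision tree can be written as a small function, which is handy when scripting recommendations for several models. This is one reading of the tree, with two assumptions stated in the comments: the lossy branch prefers tea_cache over cache_dit when both apply, and the lossless options are accumulated rather than chosen exclusively. The function and argument names are hypothetical.

```python
def recommend(
    quality_paramount,
    num_gpus=1,
    oom=False,
    teacache_supported=False,
    is_dit=True,
):
    """Return an ordered list of optimizations following the decision tree."""
    # Step 1 (lossless) applies on both branches of the tree.
    plan = ["torch.compile (default)", "FP8 quantization"]
    if num_gpus > 1:
        plan.append("SP/TP/CFG parallel (check support table)")
    if oom:
        plan.append("CPU offload or VAE slicing/tiling")
    # Step 2 (lossy) only when quality is not paramount.
    if not quality_paramount:
        if teacache_supported:  # assumption: tea_cache is tried first
            plan.append("tea_cache with rel_l1_thresh=0.2")
        elif is_dit:
            plan.append("cache_dit with defaults")
    return plan


print(recommend(quality_paramount=True, num_gpus=4))
print(recommend(quality_paramount=False, teacache_supported=True))
```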
- Use --enforce-eager unless torch.compile measurably improves warm-request latency for your model/GPU. This avoids first-request compilation overhead.
- Check _NO_CACHE_ACCELERATION in registry.py before enabling cache backends; UNet-based and some specialized models don't support them.

| File | What |
|------|------|
| vllm_omni/diffusion/data.py | OmniDiffusionConfig, DiffusionParallelConfig, DiffusionCacheConfig |
| vllm_omni/diffusion/compile.py | Regional torch.compile logic |
| vllm_omni/diffusion/registry.py | _NO_CACHE_ACCELERATION, model registry |
| vllm_omni/diffusion/distributed/cfg_parallel.py | CFGParallelMixin |
| vllm_omni/diffusion/cache/ | TeaCache and CacheDiT backends |
| vllm_omni/diffusion/offloader/ | CPU offload backends |
| vllm_omni/diffusion/quantization/ | Quantization backends (fp8, gguf, ...) |
| docs/user_guide/diffusion/ | All user-facing docs |
| docs/user_guide/diffusion/parallelism_acceleration.md | Canonical parallelism support table |
When a new optimization method is added to vLLM-Omni, update this skill as follows:
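- Note where the method is registered or excluded (for example, in registry.py).
- For a new quantization method, point to its file under vllm_omni/diffusion/quantization/.
- If support is signaled by a model attribute, add its grep check (for example, "_new_attr").
- Point to docs/user_guide/diffusion/parallelism_acceleration.md with the new column in the support table.
- Document the flag for both offline (--flag-name) and online serving (--flag-name via vllm serve). Online serving flags sometimes differ (e.g., --ulysses-degree offline vs --usp online).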