skills/vllm-omni-distributed/SKILL.md
Scale vLLM-Omni across multiple GPUs and nodes using tensor parallelism, pipeline parallelism, OmniConnector disaggregation, connector backends, and Ray. Use when setting up multi-GPU inference, distributing model execution across machines, deploying disaggregated execution, developing OmniConnector backends, or scaling inference horizontally.
vLLM-Omni supports distributed execution through multiple strategies: tensor parallelism (TP), pipeline parallelism (PP), expert parallelism (EP), and fully disaggregated execution via OmniConnector. These can be combined for optimal throughput and latency.
| Strategy | Splits | Best For | Trade-off |
|----------|--------|----------|-----------|
| Tensor Parallel (TP) | Model layers across GPUs | Latency reduction | Requires fast GPU interconnect |
| Pipeline Parallel (PP) | Model stages across GPU groups | Throughput increase | Adds latency per stage |
| Expert Parallel (EP) | MoE experts across GPUs | MoE models | Requires MoE architecture |
| Disaggregation | Entire pipeline stages | Independent scaling | Network overhead between stages |
Split model weights across GPUs on a single node:
# 2-GPU tensor parallelism
vllm serve Qwen/Qwen2.5-Omni-7B --omni --tensor-parallel-size 2
# 4-GPU tensor parallelism
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --tensor-parallel-size 4
Requirements:
Split model stages across sequential GPU groups:
vllm serve <model> --omni \
--tensor-parallel-size 2 \
--pipeline-parallel-size 2
This uses 4 GPUs total: 2 groups of 2 GPUs each. Each group handles a portion of the model layers.
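The GPU arithmetic above can be sketched in a few lines of Python. This is only an illustration: `gpus_required` and `split_layers` are hypothetical helpers, the 28-layer model is a made-up figure, and the even contiguous split is an assumption rather than a description of how vLLM-Omni actually assigns layers to pipeline stages.

```python
from typing import List


def gpus_required(tensor_parallel_size: int, pipeline_parallel_size: int) -> int:
    """Total GPUs = GPUs per pipeline stage (TP) times number of stages (PP)."""
    return tensor_parallel_size * pipeline_parallel_size


def split_layers(num_layers: int, pipeline_parallel_size: int) -> List[range]:
    """Assumed policy: contiguous, near-even layer ranges, one per pipeline stage."""
    base, extra = divmod(num_layers, pipeline_parallel_size)
    stages, start = [], 0
    for stage in range(pipeline_parallel_size):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


if __name__ == "__main__":
    print("GPUs needed:", gpus_required(2, 2))
    for stage, layers in enumerate(split_layers(28, 2)):
        print(f"stage {stage}: layers {layers.start}..{layers.stop - 1}")
```

A check like this is handy before launching, since the product of the two flags must not exceed the GPUs actually available.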
vLLM-Omni's OmniConnector enables fully disaggregated serving, where different pipeline stages (Encode, Prefill, Decode, Generate) run on separate GPU pools:
Request → [E] Encode → [P] Prefill → [D] Decode → [G] Generate → Response
Each stage can be scaled independently:
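As an illustration of independent scaling, the sketch below models each stage as a pool with its own replica count and routes requests through the four stages in order. It is a toy model, not OmniConnector code: `StagePool`, `route`, the replica counts, and the round-robin policy are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Dict, List

STAGES = ("encode", "prefill", "decode", "generate")


@dataclass
class StagePool:
    """Hypothetical pool of identical workers for one pipeline stage."""
    name: str
    replicas: int

    def __post_init__(self) -> None:
        if self.replicas < 1:
            raise ValueError("each stage needs at least one replica")
        # Round-robin iterator over replica indices (assumed policy).
        self._next = cycle(range(self.replicas))

    def pick(self) -> str:
        """Return the worker that handles the next request for this stage."""
        return f"{self.name}-{next(self._next)}"


def route(pools: Dict[str, StagePool]) -> List[str]:
    """Pick one worker per stage, in E -> P -> D -> G order."""
    return [pools[stage].pick() for stage in STAGES]


if __name__ == "__main__":
    # Made-up sizing: decode gets the most replicas, encode the fewest.
    sizes = {"encode": 1, "prefill": 2, "decode": 4, "generate": 1}
    pools = {name: StagePool(name, n) for name, n in sizes.items()}
    for request_id in range(3):
        print(request_id, " -> ".join(route(pools)))
```

Changing one entry in `sizes` changes the capacity of that stage without touching the others, which is the property disaggregation is meant to provide.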
Use this skill for connector implementation work as well as connector usage.
- Use SharedMemoryConnector, MooncakeStoreConnector, YuanrongConnector, and MooncakeTransferEngineConnector as implementation references before adding a new backend.
- Validate in order: put/get, config loading, stage flow, then KV cache flow.

# 1. Verify basic put/get contract
python -c "
from vllm_omni.omni_connector import create_connector
conn = create_connector('shared_memory', config={})
conn.put('test_key', b'test_data')
assert conn.get('test_key') == b'test_data', 'Basic put/get failed'
print('OK: put/get contract passes')
"
# 2. Verify config loading from stage YAML
python -c "
import yaml
with open('stage_config.yaml') as f:
    cfg = yaml.safe_load(f)
assert 'connector' in cfg, 'Missing connector config'
print(f'OK: connector type = {cfg[\"connector\"][\"type\"]}')
"
# 3. Test stage flow end-to-end (start stages, send one request, verify output)
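When prototyping a new backend, the put/get check above can be generalized into a reusable contract test. The sketch below does not use the `vllm_omni` package. `InMemoryConnector` and `check_put_get_contract` are hypothetical names, and the overwrite and missing-key behaviors it checks are assumptions about a reasonable contract, so confirm them against the reference connectors before relying on them.

```python
from typing import Dict, Optional


class InMemoryConnector:
    """Hypothetical dict-backed backend used only to exercise the contract."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        # Copy into an immutable bytes object so later caller mutations
        # (e.g. of a bytearray) do not change the stored payload.
        self._store[key] = bytes(value)

    def get(self, key: str) -> Optional[bytes]:
        # Assumption: a missing key yields None rather than raising.
        return self._store.get(key)


def check_put_get_contract(conn) -> None:
    """Run the basic round-trip checks against any object with put/get."""
    conn.put("test_key", b"test_data")
    assert conn.get("test_key") == b"test_data", "round trip failed"

    conn.put("test_key", b"new_data")
    assert conn.get("test_key") == b"new_data", "overwrite failed"

    assert conn.get("missing_key") is None, "missing key should return None"


if __name__ == "__main__":
    check_put_get_contract(InMemoryConnector())
    print("OK: put/get contract passes")
```

Because `check_put_get_contract` only depends on the two methods, the same function can be pointed at a real connector instance once one is available.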
For models that exceed single-node GPU capacity:
# Head node
ray start --head --port=6379
# Worker nodes (run on each additional machine)
ray start --address=<head-node-ip>:6379
import ray
ray.init(address="auto")
resources = ray.cluster_resources()
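num_gpus = int(resources.get("GPU", 0))
assert num_gpus >= 8, f"Need 8 GPUs, found {num_gpus}"
# Count live nodes directly; the head-node resource is always 1 and does not reflect cluster size.
num_nodes = sum(1 for node in ray.nodes() if node["Alive"])
print(f"OK: cluster has {num_gpus} GPUs across {num_nodes} nodes")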
vllm serve <model> --omni \
--tensor-parallel-size 8 \
--port 8091
For DiT models, sequence parallelism splits the denoising sequence across GPUs:
vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni \
--tensor-parallel-size 4
This accelerates video/image generation by parallelizing the diffusion computation.
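The idea of splitting a sequence across GPUs can be pictured with plain Python lists. This is a conceptual sketch only: `shard` and `gather` are hypothetical helpers, the near-even contiguous split is an assumption, and the per-token doubling stands in for work that needs no cross-shard communication. Real attention layers under sequence parallelism do need to exchange data between shards.

```python
from typing import List


def shard(tokens: List[int], world_size: int) -> List[List[int]]:
    """Split a sequence into contiguous, near-even shards, one per rank."""
    base, extra = divmod(len(tokens), world_size)
    shards, start = [], 0
    for rank in range(world_size):
        # The first `extra` ranks take one additional token each.
        size = base + (1 if rank < extra else 0)
        shards.append(tokens[start:start + size])
        start += size
    return shards


def gather(shards: List[List[int]]) -> List[int]:
    """Concatenate shards back into the original sequence order."""
    return [token for part in shards for token in part]


if __name__ == "__main__":
    tokens = list(range(10))          # stand-in for a latent token sequence
    shards = shard(tokens, 4)         # one shard per GPU (4 is illustrative)
    processed = [[t * 2 for t in part] for part in shards]  # per-rank work
    assert gather(processed) == [t * 2 for t in tokens]
    print([len(part) for part in shards])
```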
vllm serve Tongyi-MAI/Z-Image-Turbo --omni
vllm serve Qwen/Qwen2.5-Omni-7B --omni --tensor-parallel-size 2
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --tensor-parallel-size 4
ray start --head
vllm serve <model> --omni --tensor-parallel-size 8
NCCL timeout: GPU-to-GPU communication is timing out. Check NVLink/InfiniBand connectivity. Increase timeout with NCCL_TIMEOUT=1800.
Uneven GPU utilization: Common with pipeline parallelism. Adjust stage placement to balance load.
Ray worker disconnected: Check network connectivity between nodes and ensure Ray dashboard shows all workers.
Multimodal cache miss across AR replicas (distributed): Fixed in #3605. When multiple AR replicas serve the same image, multimodal UUIDs are now scoped per replica to prevent tensor transfer being skipped.
HunyuanImage3 KV reuse broken under sequence parallel: Fixed in #3546. ar_kv_reuse_len is now correctly propagated through the DiT forward pass and SP seq_len calculations.
SHM connector test_chunk_transfer_adapter failures: Fixed in #3650. Updated test assertions for connector transfer adapter protocol changes.