hsliuustc0106/vllm-omni-distributed

Name: vllm-omni-distributed
Author: hsliuustc0106

skills/vllm-omni-distributed/SKILL.md

npx skillsauth add hsliuustc0106/vllm-omni-skills vllm-omni-distributed

vLLM-Omni Distributed Inference

Overview

vLLM-Omni supports distributed execution through multiple strategies: tensor parallelism (TP), pipeline parallelism (PP), expert parallelism (EP), and fully disaggregated execution via OmniConnector. These can be combined for optimal throughput and latency.

Parallelism Strategies

| Strategy | Splits | Best For | Trade-off |
|----------|--------|----------|-----------|
| Tensor Parallel (TP) | Model layers across GPUs | Latency reduction | Requires fast GPU interconnect |
| Pipeline Parallel (PP) | Model stages across GPU groups | Throughput increase | Adds latency per stage |
| Expert Parallel (EP) | MoE experts across GPUs | MoE models | Requires MoE architecture |
| Disaggregation | Entire pipeline stages | Independent scaling | Network overhead between stages |

Tensor Parallelism

Split model weights across GPUs on a single node:

# 2-GPU tensor parallelism
vllm serve Qwen/Qwen2.5-Omni-7B --omni --tensor-parallel-size 2

# 4-GPU tensor parallelism
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --tensor-parallel-size 4

Requirements:

GPUs must be on the same node
NVLink/NVSwitch preferred for NVIDIA GPUs
TP size must evenly divide attention heads
Total GPUs = TP size x PP size
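
The requirements above can be checked before launching a server. The following is a minimal sketch, not part of vLLM-Omni: the function name `check_parallel_layout` is hypothetical, and the head count in the example is illustrative rather than read from a real model config.

```python
def check_parallel_layout(num_attention_heads: int, tp_size: int,
                          pp_size: int, available_gpus: int) -> int:
    """Return the number of GPUs the layout needs, or raise ValueError."""
    if tp_size < 1 or pp_size < 1:
        raise ValueError("tp_size and pp_size must be positive")
    # TP size must evenly divide the attention heads.
    if num_attention_heads % tp_size != 0:
        raise ValueError(
            f"TP size {tp_size} does not evenly divide "
            f"{num_attention_heads} attention heads"
        )
    # Total GPUs = TP size x PP size.
    required = tp_size * pp_size
    if required > available_gpus:
        raise ValueError(
            f"Layout needs {required} GPUs but only {available_gpus} are available"
        )
    return required


# Illustrative head count, not taken from a specific model config.
print(check_parallel_layout(num_attention_heads=28, tp_size=4,
                            pp_size=1, available_gpus=4))
```

In practice the head count would come from the model's own configuration file, and the check would run before `vllm serve` is invoked.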

Pipeline Parallelism

Split model stages across sequential GPU groups:

vllm serve <model> --omni \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2

This uses 4 GPUs total: 2 groups of 2 GPUs each. Each group handles a portion of the model layers.
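
To make the "2 groups of 2 GPUs" layout concrete, here is a small sketch of how GPU ranks and layers could be assigned to pipeline stages. It assumes contiguous rank numbering and an even, contiguous layer split; the helper `pipeline_layout` and the layer count are hypothetical, and the actual assignment inside vLLM-Omni may differ.

```python
def pipeline_layout(tp_size: int, pp_size: int, num_layers: int) -> list[dict]:
    """Map GPU ranks to pipeline stages and split layers contiguously."""
    base, extra = divmod(num_layers, pp_size)
    stages = []
    start = 0
    for stage in range(pp_size):
        # Earlier stages absorb the remainder when layers do not divide evenly.
        count = base + (1 if stage < extra else 0)
        ranks = list(range(stage * tp_size, (stage + 1) * tp_size))
        stages.append({
            "stage": stage,
            "gpu_ranks": ranks,                # one TP group per stage
            "layers": (start, start + count),  # half-open range [start, end)
        })
        start += count
    return stages


# TP=2, PP=2 as in the command above; 28 layers is an illustrative value.
for entry in pipeline_layout(tp_size=2, pp_size=2, num_layers=28):
    print(entry)
```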

Disaggregated Execution (OmniConnector)

vLLM-Omni's OmniConnector enables fully disaggregated serving, where different pipeline stages (Encode, Prefill, Decode, Generate) run on separate GPU pools:

Request → [E] Encode → [P] Prefill → [D] Decode → [G] Generate → Response

Each stage can be scaled independently:

Encode (E): Processes multi-modal inputs (images, audio, video)
Prefill (P): Runs initial forward pass to populate KV cache
Decode (D): Autoregressive token generation
Generate (G): Diffusion or audio generation

Benefits

Scale each stage based on its bottleneck independently
Mix GPU types (e.g., cheaper GPUs for encoding, premium GPUs for generation)
Better GPU utilization by matching capacity to demand per stage
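
As a rough illustration of scaling each stage to its own bottleneck, the sketch below sizes each GPU pool from a target request rate. The per-replica throughput numbers are hypothetical placeholders (real values would come from benchmarking each stage), and `replicas_per_stage` is not a vLLM-Omni API.

```python
import math


def replicas_per_stage(target_rps: float,
                       per_replica_rps: dict[str, float]) -> dict[str, int]:
    """Return how many replicas each stage needs to sustain target_rps."""
    plan = {}
    for stage, rps in per_replica_rps.items():
        if rps <= 0:
            raise ValueError(f"Throughput for stage '{stage}' must be positive")
        # Round up: a partially loaded replica is still a whole replica.
        plan[stage] = math.ceil(target_rps / rps)
    return plan


# Hypothetical measurements in requests per second per replica.
measured = {"encode": 40.0, "prefill": 12.0, "decode": 5.0, "generate": 2.0}
print(replicas_per_stage(target_rps=10.0, per_replica_rps=measured))
```

With these placeholder numbers the generate stage needs the most replicas, which is the situation where disaggregation pays off: only that pool has to grow.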

OmniConnector Development

Use this skill for connector implementation work as well as connector usage.

Refer to SharedMemoryConnector, MooncakeStoreConnector, YuanrongConnector, and MooncakeTransferEngineConnector as implementation references before adding a new backend
Keep connector edge config role-neutral in YAML; let runtime inject sender or receiver details
Validate connector changes from the smallest contract outward: basic put/get, config loading, stage flow, then KV cache flow
Support both metadata-driven and key-only retrieval paths when designing connector behavior
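
The guidance above on supporting both retrieval paths can be sketched with a toy backend. Everything here is hypothetical: `InMemoryConnector` and `TransferMetadata` are illustrative names, not classes from vLLM-Omni, and the real connector base class may define a different signature. The point is only that `get` works with a key alone and can also validate against metadata when the caller has it.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TransferMetadata:
    """Hypothetical descriptor returned by put() and passed along to the receiver."""
    key: str
    size: int


class InMemoryConnector:
    """Toy backend that keeps payloads in a dict (illustration only)."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> TransferMetadata:
        self._store[key] = data
        return TransferMetadata(key=key, size=len(data))

    def get(self, key: str,
            metadata: Optional[TransferMetadata] = None) -> Optional[bytes]:
        data = self._store.get(key)
        if data is None:
            return None  # key-only path: a miss is reported as None
        # Metadata-driven path: verify the payload matches what the sender described.
        if metadata is not None and (metadata.key != key or metadata.size != len(data)):
            raise ValueError("metadata does not match stored payload")
        return data


conn = InMemoryConnector()
meta = conn.put("test_key", b"test_data")
print(conn.get("test_key"))        # key-only retrieval
print(conn.get("test_key", meta))  # metadata-driven retrieval
```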

OmniConnector Validation Workflow

# 1. Verify basic put/get contract
python -c "
from vllm_omni.omni_connector import create_connector
conn = create_connector('shared_memory', config={})
conn.put('test_key', b'test_data')
assert conn.get('test_key') == b'test_data', 'Basic put/get failed'
print('OK: put/get contract passes')
"

# 2. Verify config loading from stage YAML
python -c "
import yaml
with open('stage_config.yaml') as f:
    cfg = yaml.safe_load(f)
assert 'connector' in cfg, 'Missing connector config'
print(f'OK: connector type = {cfg[\"connector\"][\"type\"]}')
"

# 3. Test stage flow end-to-end (start stages, send one request, verify output)
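
The config check in step 2 can be extended to cover the role-neutral rule from the development guidance. This is a sketch under assumptions: the `connector` and `type` keys follow the snippet above, while the `role` key and the `sender`/`receiver` values are hypothetical stand-ins for whatever the runtime actually injects.

```python
import copy

ALLOWED_ROLES = ("sender", "receiver")


def validate_edge_config(cfg: dict) -> str:
    """Check a role-neutral edge config and return its connector type."""
    connector = cfg.get("connector")
    if not isinstance(connector, dict):
        raise ValueError("Missing connector config")
    if "type" not in connector:
        raise ValueError("Connector config has no 'type'")
    if "role" in connector:
        # Hypothetical key: the role should come from the runtime, not the YAML.
        raise ValueError("Edge config must stay role-neutral")
    return connector["type"]


def inject_role(cfg: dict, role: str) -> dict:
    """Return a copy of cfg with the runtime role added; cfg itself is untouched."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"Unknown role: {role}")
    validate_edge_config(cfg)
    resolved = copy.deepcopy(cfg)
    resolved["connector"]["role"] = role
    return resolved


# Stand-in for the dict produced by yaml.safe_load on stage_config.yaml.
edge_cfg = {"connector": {"type": "shared_memory"}}
sender_cfg = inject_role(edge_cfg, "sender")
receiver_cfg = inject_role(edge_cfg, "receiver")
print(sender_cfg["connector"], receiver_cfg["connector"], edge_cfg["connector"])
```

Copying the config before injecting the role keeps one YAML file usable for both ends of the edge.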

Multi-Node with Ray

For models that exceed single-node GPU capacity:

Step 1: Start Ray Cluster

# Head node
ray start --head --port=6379

# Worker nodes (run on each additional machine)
ray start --address=<head-node-ip>:6379

Step 2: Verify Cluster Before Launching

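import ray

# Connect to the cluster started in Step 1.
ray.init(address="auto")
resources = ray.cluster_resources()
num_gpus = int(resources.get("GPU", 0))
# Count live nodes directly; node:__internal_head__ only marks the head node.
num_nodes = sum(1 for node in ray.nodes() if node["Alive"])
assert num_gpus >= 8, f"Need 8 GPUs, found {num_gpus}"
print(f"OK: cluster has {num_gpus} GPUs across {num_nodes} nodes")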

Step 3: Launch Server

vllm serve <model> --omni \
  --tensor-parallel-size 8 \
  --port 8091

Sequence Parallelism for Diffusion

For DiT (diffusion transformer) models, sequence parallelism splits the latent token sequence processed at each denoising step across GPUs:

vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni \
  --tensor-parallel-size 4

This accelerates video/image generation by parallelizing the diffusion computation.
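
The idea of splitting a sequence across GPUs can be illustrated independently of any framework. The sketch below only computes which token range each rank would own; `shard_sequence` is a hypothetical helper, and it says nothing about how vLLM-Omni implements the split or the communication that attention needs across shards.

```python
def shard_sequence(seq_len: int, world_size: int) -> list[tuple[int, int]]:
    """Return half-open [start, end) token ranges, one per GPU rank."""
    if world_size < 1:
        raise ValueError("world_size must be positive")
    base, extra = divmod(seq_len, world_size)
    bounds = []
    start = 0
    for rank in range(world_size):
        # The first `extra` ranks take one additional token each.
        length = base + (1 if rank < extra else 0)
        bounds.append((start, start + length))
        start += length
    return bounds


print(shard_sequence(seq_len=10, world_size=4))
# -> [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Real implementations often pad the sequence so every rank holds the same number of tokens; the uneven split here is kept only to show the bookkeeping.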

Configuration Examples

Small model, single GPU

vllm serve Tongyi-MAI/Z-Image-Turbo --omni

Medium model, dual GPU

vllm serve Qwen/Qwen2.5-Omni-7B --omni --tensor-parallel-size 2

Large MoE model, quad GPU

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --tensor-parallel-size 4

Very large model, multi-node

# Head node
ray start --head --port=6379
# Each worker node: ray start --address=<head-node-ip>:6379
vllm serve <model> --omni --tensor-parallel-size 8

Troubleshooting

NCCL timeout: GPU-to-GPU communication is timing out. Check NVLink/InfiniBand connectivity. Increase timeout with NCCL_TIMEOUT=1800.

Uneven GPU utilization: Common with pipeline parallelism. Adjust stage placement to balance load.

Ray worker disconnected: Check network connectivity between nodes and confirm that the Ray dashboard (or `ray status`) lists all workers.

Multimodal cache miss across AR replicas (distributed deployments): Fixed in #3605. When multiple AR replicas serve the same image, multimodal UUIDs are now scoped per replica so that tensor transfer is not skipped.

HunyuanImage3 KV reuse broken under sequence parallel: Fixed in #3546. ar_kv_reuse_len is now correctly propagated through the DiT forward pass and SP seq_len calculations.

SHM connector test_chunk_transfer_adapter failures: Fixed in #3650. Updated test assertions for connector transfer adapter protocol changes.
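
For the multimodal cache miss entry above, the effect of scoping identifiers per replica can be shown with a toy model. This is not the code from #3605 and makes no claim about how vLLM-Omni tracks transfers; `scoped_mm_key` and `needs_transfer` are hypothetical helpers that only illustrate why a content-only key shared by several replicas can cause a transfer to be skipped.

```python
import hashlib


def scoped_mm_key(replica_id: int, content: bytes) -> str:
    """Build a cache key that is unique per (replica, content) pair."""
    digest = hashlib.sha256(content).hexdigest()[:16]
    return f"replica{replica_id}:{digest}"


def needs_transfer(seen: set, key: str) -> bool:
    """Return True the first time a key is seen, False afterwards."""
    if key in seen:
        return False
    seen.add(key)
    return True


image = b"same image bytes"

# Content-only key: the second replica looks like a cache hit and is skipped.
seen = set()
unscoped = hashlib.sha256(image).hexdigest()
print(needs_transfer(seen, unscoped), needs_transfer(seen, unscoped))

# Replica-scoped key: each replica triggers its own transfer.
seen = set()
print(needs_transfer(seen, scoped_mm_key(0, image)),
      needs_transfer(seen, scoped_mm_key(1, image)))
```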

References

For disaggregation architecture details, see references/disaggregation.md
For OmniConnector backend contract, config wiring, and validation, see references/connector-development.md
For Ray execution setup, see references/ray-execution.md

Scale vLLM-Omni across multiple GPUs and nodes using tensor parallelism, pipeline parallelism, OmniConnector disaggregation, connector backends, and Ray. Use when setting up multi-GPU inference, distributing model execution across machines, deploying disaggregated execution, developing OmniConnector backends, or scaling inference horizontally.

67 stars

development

Updated May 25, 2026

npx skillsauth add hsliuustc0106/vllm-omni-skills vllm-omni-distributed

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Reported % |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: May 25, 2026, 2:17 AM (11.5 s, 4 files scanned)

Install manually

# Clone the repo
git clone https://github.com/hsliuustc0106/vllm-omni-skills.git

# Copy into Claude Code skills folder (global)
cp -r vllm-omni-skills/skills/vllm-omni-distributed ~/.claude/skills/

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT