Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

hsliuustc0106/vllm-omni-serving

Name: vllm-omni-serving
Author: hsliuustc0106

skills/vllm-omni-serving/SKILL.md

npx skillsauth add hsliuustc0106/vllm-omni-skills vllm-omni-serving

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

vLLM-Omni Model Serving

Overview

vLLM-Omni serves models via an OpenAI-compatible HTTP server. It supports autoregressive models (text, omni), diffusion models (image, video), and TTS models (audio) through a unified vllm serve command with the --omni flag.

Quick Start

vllm serve <model-name> --omni --port 8091

Examples by modality:

# Image generation
vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8091

# Omni-modality (text + image + audio)
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091

# TTS
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --omni --port 8091

# Video generation
vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni --port 8091

Server Configuration

Key CLI Arguments

| Argument | Description | Example | |----------|-------------|---------| | --omni | Enable omni-modality pipeline (required) | --omni | | --port | HTTP port | --port 8091 | | --host | Bind address | --host 0.0.0.0 | | --gpu-memory-utilization | Fraction of GPU memory to use | --gpu-memory-utilization 0.85 | | --tensor-parallel-size | Number of GPUs for tensor parallelism | --tensor-parallel-size 2 | | --pipeline-parallel-size | Pipeline parallelism stages | --pipeline-parallel-size 2 | | --max-model-len | Maximum sequence length | --max-model-len 4096 | | --dtype | Model dtype | --dtype float16 |

Stage Configuration

vLLM-Omni uses stage configs to define multi-stage pipelines. Each model has default stage configs, but you can customize them:

vllm serve Qwen/Qwen2.5-Omni-7B --omni \
  --stage-configs-path ./my-stage-config.yaml

Stage config structure:

stages:
  - name: "encoder"
    stage_type: "ar"
    stage_args:
      runtime:
        max_batch_size: 4
  - name: "diffusion"
    stage_type: "diffusion"
    stage_args:
      runtime:
        max_batch_size: 1

The max_batch_size for diffusion stages defaults to 1. Increase it only for models that support batched diffusion.

GPU Memory Configuration

Calculate memory needs based on model size and desired throughput:

# Conservative (80% GPU memory)
vllm serve <model> --omni --gpu-memory-utilization 0.8

# Aggressive (95% for maximum throughput)
vllm serve <model> --omni --gpu-memory-utilization 0.95

Multi-GPU Serving

Tensor Parallelism

Split model across multiple GPUs:

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
  --tensor-parallel-size 4 --port 8091

Pipeline Parallelism

For very large models:

vllm serve <model> --omni \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2

Production Deployment Checklist

[ ] Set --host 0.0.0.0 for external access
[ ] Configure --gpu-memory-utilization based on model size
[ ] Set appropriate --max-model-len
[ ] Enable --disable-log-requests for reduced I/O overhead
[ ] Place behind a reverse proxy (nginx/caddy) for TLS
[ ] Configure health check endpoint at /health
[ ] Set up log rotation for server logs
[ ] Monitor GPU utilization with nvidia-smi dmon

Running Multiple Models

Run separate server instances on different ports:

# Terminal 1: Image generation
vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8091

# Terminal 2: Text/Omni
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8092

Use a reverse proxy to route by path or model name.

Troubleshooting

Server fails to start: Check GPU memory availability with nvidia-smi. Reduce --gpu-memory-utilization or choose a smaller model.

Slow first request: Model weights are loaded lazily. The first request triggers full model initialization. Subsequent requests are fast.

Connection refused: Verify --host and --port settings. Default host is 127.0.0.1 (localhost only).

--dtype ignored with default stage configs: When using default stage configs (no --stage-configs-path), the --dtype arg was silently dropped from diffusion stage engine args. Fixed in #2530 — dtype now correctly propagates from CLI.

--stage-init-timeout not respected: User-configured stage init timeout was being overridden. Default is now 300s (server-side). Pass --stage-init-timeout <seconds> to customize. Fixed in #2519.

OOM errors produce no response: Diffusion pipeline OOM and execution errors now return structured HTTP error responses (e.g., 507) with request_id, stage_id, and error_type fields instead of hanging. Uses OmniRequestError dataclass for end-to-end propagation. Fixed in #2638.

DiffusionEngine.close() hangs or leaks resources: Fixed in #3494. Close now properly waits for worker thread and completes pending futures with errors.

HunyuanImage3 deploy config fails at startup: Fixed in #3537. Pipeline name changed from hunyuan_image3 to hunyuan_image_3_moe; inter-stage connectors default to rdma_connector.

Multimodal cache miss across AR replicas: Fixed in #3605. Multimodal UUIDs are now scoped per stage-0 replica to prevent the sender from skipping tensor transfers.

References

For model-specific configurations, see references/model-configs.md
For scaling and load balancing, see references/scaling-guide.md

hsliuustc0106/vllm-omni-serving

skills/vllm-omni-serving/SKILL.md

Launch and configure vLLM-Omni API servers for production model serving. Use when starting a model server, configuring stage pipelines, setting up GPU memory, enabling optimizations, or deploying models behind a load balancer.

67 stars

development

Updated May 25, 2026

$ install --global

skillsauth

npx skillsauth add hsliuustc0106/vllm-omni-skills vllm-omni-serving

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: May 25, 2026, 2:17 AM11.8s4 files scanned

SKILL.md

name:: vllm-omni-serving
description:: Launch and configure vLLM-Omni API servers for production model serving. Use when starting a model server, configuring stage pipelines, setting up GPU memory, enabling optimizations, or deploying models behind a load balancer.

vLLM-Omni Model Serving

Overview

Quick Start

vllm serve <model-name> --omni --port 8091

Examples by modality:

# Image generation
vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8091

# Omni-modality (text + image + audio)
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091

# TTS
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --omni --port 8091

# Video generation
vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni --port 8091

Server Configuration

Key CLI Arguments

Stage Configuration

vLLM-Omni uses stage configs to define multi-stage pipelines. Each model has default stage configs, but you can customize them:

vllm serve Qwen/Qwen2.5-Omni-7B --omni \
  --stage-configs-path ./my-stage-config.yaml

Stage config structure:

stages:
  - name: "encoder"
    stage_type: "ar"
    stage_args:
      runtime:
        max_batch_size: 4
  - name: "diffusion"
    stage_type: "diffusion"
    stage_args:
      runtime:
        max_batch_size: 1

The max_batch_size for diffusion stages defaults to 1. Increase it only for models that support batched diffusion.

GPU Memory Configuration

Calculate memory needs based on model size and desired throughput:

# Conservative (80% GPU memory)
vllm serve <model> --omni --gpu-memory-utilization 0.8

# Aggressive (95% for maximum throughput)
vllm serve <model> --omni --gpu-memory-utilization 0.95

Multi-GPU Serving

Tensor Parallelism

Split model across multiple GPUs:

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
  --tensor-parallel-size 4 --port 8091

Pipeline Parallelism

For very large models:

vllm serve <model> --omni \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2

Production Deployment Checklist

[ ] Set --host 0.0.0.0 for external access
[ ] Configure --gpu-memory-utilization based on model size
[ ] Set appropriate --max-model-len
[ ] Enable --disable-log-requests for reduced I/O overhead
[ ] Place behind a reverse proxy (nginx/caddy) for TLS
[ ] Configure health check endpoint at /health
[ ] Set up log rotation for server logs
[ ] Monitor GPU utilization with nvidia-smi dmon

Running Multiple Models

Run separate server instances on different ports:

# Terminal 1: Image generation
vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8091

# Terminal 2: Text/Omni
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8092

Use a reverse proxy to route by path or model name.

Troubleshooting

Server fails to start: Check GPU memory availability with nvidia-smi. Reduce --gpu-memory-utilization or choose a smaller model.

Slow first request: Model weights are loaded lazily. The first request triggers full model initialization. Subsequent requests are fast.

Connection refused: Verify --host and --port settings. Default host is 127.0.0.1 (localhost only).

DiffusionEngine.close() hangs or leaks resources: Fixed in #3494. Close now properly waits for worker thread and completes pending futures with errors.

HunyuanImage3 deploy config fails at startup: Fixed in #3537. Pipeline name changed from hunyuan_image3 to hunyuan_image_3_moe; inter-stage connectors default to rdma_connector.

Multimodal cache miss across AR replicas: Fixed in #3605. Multimodal UUIDs are now scoped per stage-0 replica to prevent the sender from skipping tensor transfers.

References

For model-specific configurations, see references/model-configs.md
For scaling and load balancing, see references/scaling-guide.md

Related Skills

hsliuustc0106/vllm-omni-pre-check

development

VerifiedTrustedCommunity

Use before submitting a PR to vllm-project/vllm-omni — self-check the branch against project conventions, catch dead code, verify accuracy/performance claims, and confirm merge readiness. Use when the user says "pre-check", "self review", "pre-submit check", or "check my PR before I open it."

69SKILL.mdUpdated May 29, 2026

hsliuustc0106/vllm-omni-pre-check

hsliuustc0106/skills/vllm-omni-test-report

development

VerifiedTrustedCommunity

--- name: vllm-omni-test-report description: Two report kinds; **default output is always HTML** unless the user explicitly asks for Markdown (.md). **Release** — `scripts/compose_full_report.py` (**测试结论**, Buildkite metrics, **Test Result** = Common stack + optional `--log-dir-h*` nightly-style summaries + H100/CI block, **Issue tracking** = GitHub `ci-failure` + *local test* in:title, Open bugs); use `--format markdown` only when the user wants .md or `patch_report_*.py`. **Nightly** — `script

69SKILL.mdUpdated May 3, 2026

hsliuustc0106/skills/vllm-omni-test-report

hsliuustc0106/vllm-omni-review

testing

VerifiedTrustedCommunity

Review PRs on vllm-project/vllm-omni by routing to the right domain skills, checking critical evidence, and focusing comments on blocking issues. Use when reviewing pull requests or local branches, triaging review depth, running detailed or default review, or checking tests, benchmarks, and breaking changes in vllm-omni.

69SKILL.mdUpdated May 3, 2026

hsliuustc0106/vllm-omni-review

hsliuustc0106/vllm-omni-video-gen

data-ai

VerifiedTrustedCommunity

Generate videos with vLLM-Omni using Wan2.2 and other video generation models. Use when generating videos from text, creating videos from images, configuring video generation parameters, or working with text-to-video or image-to-video models.

67SKILL.mdUpdated May 3, 2026

hsliuustc0106/vllm-omni-video-gen

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/hsliuustc0106/vllm-omni-skills.git

# Copy into Claude Code skills folder (global)
cp -r vllm-omni-skills/skills/vllm-omni-serving ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

hsliuustc0106/vllm-omni-skills

67 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT