skills/vllm-omni-multimodal/SKILL.md
Transcribe speech, analyze images and video, and synthesize speech using omni-modality models such as Qwen2.5-Omni and Qwen3-Omni. Use when working with multimodal models for speech recognition, image understanding, video understanding, voice synthesis, or any task that combines text, image, audio, and video inputs with text or audio outputs.
npx skillsauth add hsliuustc0106/vllm-omni-skills vllm-omni-multimodal
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Omni-modality models accept multiple input types (text, image, audio, video) and produce multiple output types (text, audio) in a single model. vLLM-Omni currently supports the Qwen-Omni family for this capability.
| Model | HF ID | Inputs | Outputs | Min VRAM |
|-------|-------|--------|---------|----------|
| Qwen2.5-Omni-7B | Qwen/Qwen2.5-Omni-7B | Text, image, audio, video | Text, audio | 24 GB |
| Qwen2.5-Omni-3B | Qwen/Qwen2.5-Omni-3B | Text, image, audio, video | Text, audio | 12 GB |
| Qwen3-Omni-30B-A3B | Qwen/Qwen3-Omni-30B-A3B-Instruct | Text, image, audio, video | Text, audio | 48 GB |
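As a rough aid for reading the table, the following sketch picks the listed model with the highest minimum VRAM that still fits a given memory budget. `MODELS` and `pick_model` are hypothetical names introduced for illustration, and the thresholds are simply the Min VRAM column above. Actual memory use also depends on context length and media size, so treat the result as a starting point only.

```python
from typing import Optional

# (HF ID, minimum VRAM in GB), copied from the table above.
MODELS = [
    ("Qwen/Qwen2.5-Omni-3B", 12),
    ("Qwen/Qwen2.5-Omni-7B", 24),
    ("Qwen/Qwen3-Omni-30B-A3B-Instruct", 48),
]


def pick_model(available_vram_gb: float) -> Optional[str]:
    """Return the listed model with the highest Min VRAM that still fits."""
    fitting = [(vram, hf_id) for hf_id, vram in MODELS if vram <= available_vram_gb]
    if not fitting:
        return None  # Nothing in the table fits this budget.
    return max(fitting)[1]


print(pick_model(24))
```

Returning `None` rather than raising leaves it to the caller to decide how to handle a budget smaller than every entry in the table.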
```python
from vllm_omni.entrypoints.omni import Omni

# Load the model in-process and run a single text prompt.
omni = Omni(model="Qwen/Qwen2.5-Omni-7B")
outputs = omni.generate("What is the capital of France?")
print(outputs[0].request_output[0].text)
```
```bash
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091
```
Validate media inputs before sending to avoid OOM errors and processing failures:
```python
import os
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8091/v1", api_key="unused")

# Per-modality size limits (MB) and accepted file extensions.
MAX_IMAGE_MB = 20
MAX_AUDIO_MB = 50
MAX_VIDEO_MB = 100
SUPPORTED_IMAGE = {".jpg", ".jpeg", ".png", ".webp"}
SUPPORTED_AUDIO = {".wav", ".mp3", ".flac"}
SUPPORTED_VIDEO = {".mp4", ".webm"}


def validate_and_encode(path: str, max_mb: float, supported_exts: set) -> str:
    """Check the extension and file size, then return the file as base64 text."""
    ext = os.path.splitext(path)[1].lower()
    # Raise instead of assert: assertions are skipped under `python -O`.
    if ext not in supported_exts:
        raise ValueError(f"Unsupported format: {ext}")
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb > max_mb:
        raise ValueError(f"File too large: {size_mb:.1f}MB > {max_mb}MB")
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()
```
```python
# Image input: send the image as a base64 data URL alongside a text prompt.
img_b64 = validate_and_encode("photo.jpg", MAX_IMAGE_MB, SUPPORTED_IMAGE)
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
            {"type": "text", "text": "What do you see in this image?"},
        ],
    }],
)
print(response.choices[0].message.content)
```
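The request above hard-codes `image/jpeg` in the data URL, which only matches `.jpg` and `.jpeg` files. A minimal sketch of deriving the MIME type from the file name instead, using the standard library's `mimetypes` module, is shown below. `to_data_url` is a hypothetical helper name, not part of vLLM-Omni or the OpenAI client.

```python
import base64
import mimetypes


def to_data_url(path: str) -> str:
    """Build a base64 data URL whose MIME type is guessed from the file name."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        raise ValueError(f"Cannot determine MIME type for: {path}")
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode()
    return f"data:{mime};base64,{payload}"
```

The result can be passed as the `url` value of an `image_url` content part. For audio files the guessed type may differ from the literal `audio/wav` used in this document (for example `audio/x-wav`, depending on the Python version), so hard-coding the type may be safer if the server turns out to be strict about it.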
```python
# Audio input: transcribe a WAV recording.
audio_b64 = validate_and_encode("recording.wav", MAX_AUDIO_MB, SUPPORTED_AUDIO)
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}},
            {"type": "text", "text": "Transcribe this audio."},
        ],
    }],
)
```
```python
# Video input: describe an MP4 clip.
video_b64 = validate_and_encode("clip.mp4", MAX_VIDEO_MB, SUPPORTED_VIDEO)
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
            {"type": "text", "text": "Describe what happens in this video."},
        ],
    }],
)
```
```python
# Mixed input: image and audio in one request (reuses img_b64 and audio_b64).
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
            {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}},
            {"type": "text", "text": "Does the audio describe what's in the image?"},
        ],
    }],
)
```
```python
# Audio output: request both text and speech in the response.
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B",
    messages=[{"role": "user", "content": "Say hello in English and Chinese."}],
    extra_body={"output_modalities": ["text", "audio"]},
)
```
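Once audio output is requested, the client has to turn the returned payload back into a playable file. The sketch below assumes the audio arrives as a base64-encoded WAV string, as the troubleshooting notes in this document state. Where that string sits in the response object is not specified here, so the helper takes the string directly. `save_wav_from_base64` is a hypothetical name introduced for illustration.

```python
import base64


def save_wav_from_base64(audio_b64: str, out_path: str) -> int:
    """Decode a base64 WAV payload, write it to out_path, return bytes written."""
    raw = base64.b64decode(audio_b64)
    # A WAV file starts with a RIFF header tagged "WAVE"; fail early otherwise
    # so a wrong field or a double-encoded payload is caught before playback.
    if raw[:4] != b"RIFF" or raw[8:12] != b"WAVE":
        raise ValueError("Payload is not a WAV file")
    with open(out_path, "wb") as f:
        f.write(raw)
    return len(raw)
```

Checking only the container header keeps the helper independent of the sample format, which matters if the server emits floating-point samples rather than 16-bit PCM.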
```python
# Multi-turn conversation: keep the history and append each new turn.
messages = [
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
        {"type": "text", "text": "What's in this image?"},
    ]},
]
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B", messages=messages
)
messages.append({"role": "assistant", "content": response.choices[0].message.content})
messages.append({"role": "user", "content": "What colors are dominant?"})
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B", messages=messages
)
```
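The append-then-call pattern above repeats for every follow-up question, so it can be wrapped in a small helper. This is a sketch only: `ask` is a hypothetical function name, and it assumes a client object that exposes the same `chat.completions.create` call used throughout this document.

```python
from typing import Any, Dict, List


def ask(client: Any, model: str, messages: List[Dict[str, Any]], text: str) -> str:
    """Append a user turn, call the chat endpoint, then record and return the reply."""
    messages.append({"role": "user", "content": text})
    response = client.chat.completions.create(model=model, messages=messages)
    reply = response.choices[0].message.content
    # Store the assistant turn so the next call sees the full history.
    messages.append({"role": "assistant", "content": reply})
    return reply
```

With the `messages` list seeded by an image turn and its reply, a follow-up becomes `ask(client, "Qwen/Qwen2.5-Omni-7B", messages, "What colors are dominant?")`. The helper mutates the list in place, so the caller keeps a single history object across turns.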
Qwen3-Omni uses a Mixture-of-Experts architecture (30B total parameters, 3B active) and requires multiple GPUs:
```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
  --tensor-parallel-size 2 --port 8091
```
Qwen3-Omni is compatible with the v2 model runner (vLLM 0.19). It uses the native `launch_core_engines` instead of custom spawning. The `add_streaming_update` API has been removed, and audio output tensors are explicitly converted to float. CUDAGraph supports the tuple-returning thinker model. Fixed in #2522.
Slow with video input: Video processing requires extracting and encoding frames. Shorter clips process faster.
Audio output garbled: Ensure the client correctly handles the audio response format (base64-encoded WAV).
Out of memory with multimodal input: Large images and videos consume significant memory. Use the validation workflow above to check file sizes before sending.
Qwen3-Omni performance: The multi-stage pipeline minimizes hidden-state copies to CPU by copying only when downstream stages need the payloads. Text-only inference (without `--omni`) is supported for benchmarking via `use_omni: false`. Fixed in #3203.
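For the out-of-memory case above, note that the size limits in the validation workflow are checked against the file on disk, while the request carries the base64 text, which is about one third larger. The sketch below estimates the encoded size of a request from the raw file sizes. `base64_size_bytes` and `request_payload_mb` are hypothetical helper names, and the estimate covers only the encoded payload, not the memory the server needs after decoding.

```python
from typing import Iterable


def base64_size_bytes(raw_bytes: int) -> int:
    """Length of padded base64 text for raw_bytes of input: 4 chars per 3 bytes."""
    return 4 * ((raw_bytes + 2) // 3)


def request_payload_mb(file_sizes_bytes: Iterable[int]) -> float:
    """Total base64 payload, in MB, for a request carrying several files."""
    total = sum(base64_size_bytes(n) for n in file_sizes_bytes)
    return total / (1024 * 1024)
```

Summing over all attachments is useful for mixed-input requests, where each file may pass its own limit while the combined payload is still large.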