hsliuustc0106/vllm-omni-multimodal

Name: vllm-omni-multimodal
Author: hsliuustc0106

skills/vllm-omni-multimodal/SKILL.md

npx skillsauth add hsliuustc0106/vllm-omni-skills vllm-omni-multimodal

vLLM-Omni Multimodal (Omni-Modality Models)

Overview

Omni-modality models accept multiple input types (text, image, audio, video) and produce multiple output types (text, audio) in a single model. vLLM-Omni currently supports the Qwen-Omni family for this capability.

Supported Omni Models

| Model | HF ID | Inputs | Outputs | Min VRAM |
|-------|-------|--------|---------|----------|
| Qwen2.5-Omni-7B | Qwen/Qwen2.5-Omni-7B | Text, image, audio, video | Text, audio | 24 GB |
| Qwen2.5-Omni-3B | Qwen/Qwen2.5-Omni-3B | Text, image, audio, video | Text, audio | 12 GB |
| Qwen3-Omni-30B-A3B | Qwen/Qwen3-Omni-30B-A3B-Instruct | Text, image, audio, video | Text, audio | 48 GB |
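The table can be turned into a small lookup for picking a model that fits a memory budget. This is a minimal sketch, not part of vLLM-Omni: `OMNI_MODELS` and `largest_model_that_fits` are hypothetical names, the VRAM figures are copied from the table rather than measured, and the budget is treated as a single total number.

```python
from typing import Optional

# Minimum VRAM figures (GB) copied from the table above; not measured here.
OMNI_MODELS = [
    ("Qwen/Qwen2.5-Omni-3B", 12),
    ("Qwen/Qwen2.5-Omni-7B", 24),
    ("Qwen/Qwen3-Omni-30B-A3B-Instruct", 48),
]


def largest_model_that_fits(available_vram_gb: float) -> Optional[str]:
    """Return the HF ID with the highest listed minimum VRAM that still fits."""
    candidates = [
        (min_vram, hf_id)
        for hf_id, min_vram in OMNI_MODELS
        if min_vram <= available_vram_gb
    ]
    if not candidates:
        return None  # Nothing in the table fits this budget.
    return max(candidates)[1]


print(largest_model_that_fits(24))  # Qwen/Qwen2.5-Omni-7B
```

Real memory use also depends on context length and media size, so the listed minimum is best read as a lower bound.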

Quick Start

Offline: Text Conversation

from vllm_omni.entrypoints.omni import Omni

omni = Omni(model="Qwen/Qwen2.5-Omni-7B")
outputs = omni.generate("What is the capital of France?")
print(outputs[0].request_output[0].text)

Online: Start Server

vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091
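Client requests sent before the server is listening will fail, so a script that launches the server may want to wait for the port first. The sketch below uses only the Python standard library. `wait_for_port` is a hypothetical helper, and port 8091 is taken from the command above. An open port only shows that the process is accepting connections, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0,
                  interval_s: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: wait and retry.
            time.sleep(interval_s)
    return False


# Example usage (not executed here):
# if not wait_for_port("localhost", 8091, timeout_s=300):
#     raise RuntimeError("Server did not start listening in time")
```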

Input Validation Workflow

Validate media inputs before sending to avoid OOM errors and processing failures:

import os
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8091/v1", api_key="unused")

MAX_IMAGE_MB = 20
MAX_AUDIO_MB = 50
MAX_VIDEO_MB = 100
SUPPORTED_IMAGE = {".jpg", ".jpeg", ".png", ".webp"}
SUPPORTED_AUDIO = {".wav", ".mp3", ".flac"}
SUPPORTED_VIDEO = {".mp4", ".webm"}

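def validate_and_encode(path: str, max_mb: float, supported_exts: set) -> str:
    """Check extension and file size, then return the file contents as base64 text."""
    ext = os.path.splitext(path)[1].lower()
    # Raise instead of assert: assertions are skipped when Python runs with -O.
    if ext not in supported_exts:
        raise ValueError(f"Unsupported format: {ext}")
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb > max_mb:
        raise ValueError(f"File too large: {size_mb:.1f}MB > {max_mb}MB")
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()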

Multi-Modal Input Patterns

Image Understanding

img_b64 = validate_and_encode("photo.jpg", MAX_IMAGE_MB, SUPPORTED_IMAGE)
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
            {"type": "text", "text": "What do you see in this image?"},
        ],
    }],
)
print(response.choices[0].message.content)
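The example above hard-codes `image/jpeg` in the data URL, which is only correct for JPEG files even though the validation step also accepts PNG and WebP. One way to keep the MIME type in step with the file is to guess it from the extension. This is a sketch using the standard `mimetypes` module; `to_data_url` is a hypothetical helper, and the guessed type can vary by platform for some formats.

```python
import mimetypes


def to_data_url(path: str, b64_payload: str) -> str:
    """Build a data URL whose MIME type is guessed from the file extension."""
    mime, _encoding = mimetypes.guess_type(path)
    if mime is None:
        raise ValueError(f"Cannot determine MIME type for: {path}")
    return f"data:{mime};base64,{b64_payload}"


# "QUJD" is the base64 encoding of b"ABC", used here as a stand-in payload.
print(to_data_url("photo.png", "QUJD"))  # data:image/png;base64,QUJD
```

The result can be passed as the `url` value in place of the f-string in the request.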

Audio Understanding (Speech-to-Text)

audio_b64 = validate_and_encode("recording.wav", MAX_AUDIO_MB, SUPPORTED_AUDIO)
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}},
            {"type": "text", "text": "Transcribe this audio."},
        ],
    }],
)

Video Understanding

video_b64 = validate_and_encode("clip.mp4", MAX_VIDEO_MB, SUPPORTED_VIDEO)
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
            {"type": "text", "text": "Describe what happens in this video."},
        ],
    }],
)

Combined Inputs

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
            {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}},
            {"type": "text", "text": "Does the audio describe what's in the image?"},
        ],
    }],
)

Audio Output (Voice Synthesis)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B",
    messages=[{"role": "user", "content": "Say hello in English and Chinese."}],
    extra_body={"output_modalities": ["text", "audio"]},
)
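The request above asks for audio but does not show how to handle it. The troubleshooting notes describe the audio as base64-encoded WAV, so a client has to decode it before playback. The sketch below assumes you have already pulled the base64 string out of the response; where that field lives is not shown in this document, so the function takes the string as an argument. `save_base64_wav` is a hypothetical helper. The standard `wave` module reads only PCM WAV, so other encodings would need a different reader.

```python
import base64
import io
import wave


def save_base64_wav(audio_b64: str, out_path: str) -> float:
    """Decode base64 WAV data, write it to out_path, and return its duration in seconds."""
    raw = base64.b64decode(audio_b64)
    # Parsing the header first raises wave.Error early if the payload is not valid PCM WAV.
    with wave.open(io.BytesIO(raw), "rb") as wav:
        duration_s = wav.getnframes() / wav.getframerate()
    with open(out_path, "wb") as f:
        f.write(raw)
    return duration_s
```

Checking the header before writing helps separate a decoding problem on the client from a malformed payload.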

Multi-Turn Conversations

messages = [
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
        {"type": "text", "text": "What's in this image?"},
    ]},
]
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B", messages=messages
)
messages.append({"role": "assistant", "content": response.choices[0].message.content})
messages.append({"role": "user", "content": "What colors are dominant?"})
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B", messages=messages
)
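The append-then-call pattern above repeats on every turn, so it can be wrapped in a small function. This is a sketch rather than a vLLM-Omni API: `send_turn` is a hypothetical helper that expects an OpenAI-style client like the one created earlier.

```python
def send_turn(client, model: str, messages: list, user_content) -> str:
    """Append a user turn, call the chat API, record the reply, and return its text."""
    messages.append({"role": "user", "content": user_content})
    response = client.chat.completions.create(model=model, messages=messages)
    reply = response.choices[0].message.content
    # Store the assistant reply so the next turn sees the full history.
    messages.append({"role": "assistant", "content": reply})
    return reply


# Example usage (requires a running server, not executed here):
# history = []
# send_turn(client, "Qwen/Qwen2.5-Omni-7B", history, "What colors are dominant?")
```

Because the history list is mutated in place, any media sent in an earlier turn is sent again with every later request.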

Qwen3-Omni (MoE)

Qwen3-Omni uses a Mixture-of-Experts architecture (30B total parameters, 3B active). It requires multiple GPUs:

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
  --tensor-parallel-size 2 --port 8091

Qwen3-Omni is compatible with the v2 model runner (vLLM 0.19). It uses the native launch_core_engines path instead of custom process spawning. The add_streaming_update API has been removed, and audio output tensors are explicitly converted to float. CUDAGraph supports a thinker model that returns a tuple. These fixes landed in #2522.

Troubleshooting

Slow with video input: Video processing requires extracting and encoding frames. Shorter clips process faster.

Audio output garbled: Ensure the client correctly handles the audio response format (base64 encoded WAV).

Out of memory with multi-modal input: Large images/videos consume significant memory. Use the validation workflow above to check file sizes before sending.

Qwen3-Omni performance: The multi-stage pipeline copies hidden states to the CPU only when a downstream stage needs the payload, which avoids unnecessary copies. For benchmarking, text-only inference (without --omni) is supported via use_omni: false. Fixed in #3203.
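On the out-of-memory item above: the validation workflow checks the file size on disk, but the request carries the base64 text, which is about one third larger. The sketch below works through that arithmetic. `base64_encoded_bytes` is a hypothetical helper, and the 100 MB figure is the `MAX_VIDEO_MB` limit from the validation workflow.

```python
def base64_encoded_bytes(raw_bytes: int) -> int:
    """Length of standard padded base64 output for raw_bytes of input."""
    # Every 3 input bytes become 4 output characters; a partial group is padded to 4.
    return 4 * ((raw_bytes + 2) // 3)


max_video_bytes = 100 * 1024 * 1024  # MAX_VIDEO_MB from the validation workflow
encoded = base64_encoded_bytes(max_video_bytes)
print(encoded)  # 139810136
print(round(encoded / (1024 * 1024), 1))  # 133.3
```

A video at the 100 MB limit therefore becomes a request body of roughly 133 MB before any decoding or frame extraction happens on the server.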

References

For Qwen-Omni architecture and advanced config, see references/qwen-omni.md

Transcribe speech, generate images from prompts, analyze video content, and convert between modalities using multimodal omni-modality models like Qwen2.5-Omni and Qwen3-Omni. Use when working with multimodal models for speech recognition, image generation, video understanding, voice synthesis, or any task combining text, image, audio, and video inputs and outputs simultaneously.

59 stars

data-ai

Updated May 3, 2026

npx skillsauth add hsliuustc0106/vllm-omni-skills vllm-omni-multimodal

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: May 3, 2026, 2:54 AM (121.3s, 2 files scanned)

Install manually

# Clone the repo
git clone https://github.com/hsliuustc0106/vllm-omni-skills.git

# Copy into Claude Code skills folder (global)
cp -r vllm-omni-skills/skills/vllm-omni-multimodal ~/.claude/skills/

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT