skills/vllm-omni-audio-tts/SKILL.md
Generate audio and speech with vLLM-Omni using Qwen3-TTS, Fish Speech S2 Pro, CosyVoice3, MiMo-Audio, and Stable-Audio models. Use when synthesizing speech from text, generating audio effects or music, configuring TTS parameters, cloning voices, adding new TTS models, or working with text-to-speech models.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

    npx skillsauth add hsliuustc0106/vllm-omni-skills vllm-omni-audio-tts
vLLM-Omni supports text-to-speech (TTS), text-to-audio (sound effects, music), and audio understanding through multiple model families. TTS models use a two-stage autoregressive pipeline (Code Predictor + Code2Wav decoder), while audio generation uses diffusion.
| Model | HF ID | Type | Min VRAM |
|-------|-------|------|----------|
| Qwen3-TTS 1.7B CustomVoice | Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice | TTS + voice cloning | 8 GB |
| Qwen3-TTS 1.7B VoiceDesign | Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign | TTS + voice design | 8 GB |
| Qwen3-TTS 1.7B Base | Qwen/Qwen3-TTS-12Hz-1.7B-Base | Basic TTS | 8 GB |
| Qwen3-TTS 0.6B CustomVoice | Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice | TTS + voice cloning | 4 GB |
| Qwen3-TTS 0.6B Base | Qwen/Qwen3-TTS-12Hz-0.6B-Base | Basic TTS | 4 GB |
| Fish Speech S2 Pro | fishaudio/s2-pro | TTS + voice cloning (dual-AR + DAC) | 16 GB |
| CosyVoice3 0.5B | FunAudioLLM/Fun-CosyVoice3-0.5B-2512 | TTS (AR + flow matching) | 4 GB |
| MiMo-Audio-7B | XiaomiMiMo/MiMo-Audio-7B-Instruct | Audio understanding + TTS | 24 GB |
| MiMo-V2.5-ASR | XiaomiMiMo/MiMo-V2.5-ASR | ASR (speech-to-text) | 24 GB |
| OmniVoice | nvidia/OmniVoice | TTS + voice cloning (HiggsAudioV2) | 8 GB |
| VoxCPM2 | openbmb/VoxCPM2 | TTS (native AR, 30+ languages) | 8 GB |
| Stable-Audio-Open | stabilityai/stable-audio-open-1.0 | Text-to-audio (music/effects) | 8 GB |
OmniVoice supports voice cloning via ref_audio + ref_text (requires transformers>=5.3). VoxCPM2 is a 2B tokenizer-free native AR TTS model producing 48kHz audio in 30+ languages (requires pip install voxcpm).
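As a rough aid for choosing a model, the table above can be turned into a lookup. This is only a sketch: the VRAM figures are copied from the table, and the helper name `models_that_fit` is hypothetical, not part of vLLM-Omni.

```python
# Minimum VRAM (GB) per model, copied from the table above.
MIN_VRAM_GB = {
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice": 8,
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign": 8,
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base": 8,
    "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice": 4,
    "Qwen/Qwen3-TTS-12Hz-0.6B-Base": 4,
    "fishaudio/s2-pro": 16,
    "FunAudioLLM/Fun-CosyVoice3-0.5B-2512": 4,
    "XiaomiMiMo/MiMo-Audio-7B-Instruct": 24,
    "XiaomiMiMo/MiMo-V2.5-ASR": 24,
    "nvidia/OmniVoice": 8,
    "openbmb/VoxCPM2": 8,
    "stabilityai/stable-audio-open-1.0": 8,
}


def models_that_fit(available_gb: float) -> list[str]:
    """Return HF IDs whose listed minimum VRAM is <= available_gb, sorted by name."""
    return sorted(m for m, gb in MIN_VRAM_GB.items() if gb <= available_gb)
```

For example, `models_that_fit(4)` returns only the three 4 GB entries (the two Qwen3-TTS 0.6B variants and CosyVoice3).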
Both Qwen3-TTS and CosyVoice3 use a two-stage autoregressive pipeline. Architecture details, key files, and model variants are covered in the reference docs.

Generate speech offline with the Python API:

    from vllm_omni.entrypoints.omni import Omni

    omni = Omni(model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice")
    outputs = omni.generate("Hello, welcome to vLLM-Omni!")

    # The synthesized audio is attached to the first request output.
    audio = outputs[0].request_output[0].audio
    audio.save("greeting.wav")
Serve the model and request speech over HTTP:

    vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --omni --port 8091

    curl -s http://localhost:8091/v1/audio/speech \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
        "input": "Hello, welcome to vLLM-Omni!",
        "voice": "default"
      }' --output greeting.wav
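The same request can be issued from Python with only the standard library. This is a minimal sketch that mirrors the curl call above (same path, headers, and JSON fields); the helper names `build_speech_request` and `synthesize_to_file` are hypothetical, and the response is assumed to be raw audio bytes, as the `--output greeting.wav` flag suggests.

```python
import json
import urllib.request


def build_speech_request(text, model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
                         voice="default", base_url="http://localhost:8091"):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, path, **kwargs):
    """Send the request and write the response body to `path`.

    Assumes the server replies with raw audio bytes; requires a running server.
    """
    with urllib.request.urlopen(build_speech_request(text, **kwargs)) as resp:
        audio_bytes = resp.read()
    with open(path, "wb") as f:
        f.write(audio_bytes)
    return len(audio_bytes)
```

With the server from the previous command running, `synthesize_to_file("Hello, welcome to vLLM-Omni!", "greeting.wav")` should produce the same file as the curl example.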
Clone a voice from a reference audio sample:

    from vllm_omni.entrypoints.omni import Omni

    omni = Omni(model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice")
    outputs = omni.generate(
        prompt="This is a test of voice cloning with vLLM-Omni.",
        audio_references=["reference_voice.wav"],  # clip of the voice to clone
    )
    outputs[0].request_output[0].audio.save("cloned_speech.wav")
Design a voice by describing its characteristics:

    from vllm_omni.entrypoints.omni import Omni

    omni = Omni(model="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign")
    outputs = omni.generate(
        prompt="Welcome to our product launch event!",
        voice_description="A warm, professional female voice with a calm tone",
    )
    outputs[0].request_output[0].audio.save("designed_voice.wav")
Generate music or sound effects with Stable-Audio-Open. Start the server:

    vllm serve stabilityai/stable-audio-open-1.0 --omni --port 8091

Then send the text prompt through the OpenAI Python client:

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8091/v1", api_key="unused")
    response = client.chat.completions.create(
        model="stabilityai/stable-audio-open-1.0",
        messages=[{"role": "user", "content": "Relaxing piano music with rain sounds"}],
    )
MiMo-Audio can both understand audio input and generate speech. The example below shows audio understanding:

    from vllm_omni.entrypoints.omni import Omni

    omni = Omni(model="XiaomiMiMo/MiMo-Audio-7B-Instruct")

    # Transcribe/understand audio
    outputs = omni.generate(
        prompt="What is being said in this audio?",
        audio_inputs=["recording.wav"],
    )
    print(outputs[0].request_output[0].text)
async_scheduling is enabled by default for Qwen3-TTS models, improving first-packet latency and throughput.
Default stage config uses async_chunk streaming (qwen3_tts.yaml). Key knobs:
| Config | Description | Default |
|--------|-------------|---------|
| async_chunk | Enable inter-stage streaming | true |
| runtime.max_batch_size | Max requests batched per stage | 1 |
| enforce_eager | Disable CUDA Graph (Stage 0: false, Stage 1: true) | varies |
| codec_chunk_frames | AR frames per async chunk (inter-stage streaming only) | 25 |
| codec_left_context_frames | Sliding context window for smooth boundaries | 25 |
| initial_codec_chunk_frames | Frames for first emitted codec chunk only (lowers TTFA) | 0 |
| decode_chunk_frames | Code2Wav internal decode chunk size (independent of codec streaming) | 300 |
| decode_left_context_frames | Code2Wav internal left context for decode | 25 |
Connector streaming chunking (codec_chunk_frames / codec_left_context_frames) is decoupled from Code2Wav internal decode chunking (decode_chunk_frames / decode_left_context_frames). The connector controls inter-stage streaming windows only, while Code2Wav keeps its own independent decode parameters. Use initial_codec_chunk_frames to emit a small first chunk for low TTFA, then subsequent chunks return to the normal codec_chunk_frames window.
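To make the interaction of the three connector knobs concrete, the sketch below plans the frame windows for one utterance. The windowing semantics are an assumption inferred from the descriptions above (a short first chunk when `initial_codec_chunk_frames` is positive, then fixed-size chunks that each carry up to `codec_left_context_frames` earlier frames as context); it is not taken from the vLLM-Omni source, and `plan_codec_chunks` is a hypothetical helper.

```python
def plan_codec_chunks(total_frames, codec_chunk_frames=25,
                      codec_left_context_frames=25,
                      initial_codec_chunk_frames=0):
    """Return (context_start, start, end) frame windows, with `end` exclusive.

    Assumed semantics: frames [start, end) are new in this chunk, and frames
    [context_start, start) are re-sent as left context for smooth boundaries.
    """
    if codec_chunk_frames <= 0:
        raise ValueError("codec_chunk_frames must be positive")
    windows = []
    start = 0
    while start < total_frames:
        size = codec_chunk_frames
        # Only the very first chunk may use the smaller initial size.
        if start == 0 and initial_codec_chunk_frames > 0:
            size = initial_codec_chunk_frames
        end = min(start + size, total_frames)
        context_start = max(0, start - codec_left_context_frames)
        windows.append((context_start, start, end))
        start = end
    return windows
```

Under these assumptions, 60 frames with the defaults give three windows of 25, 25, and 10 new frames, while `initial_codec_chunk_frames=1` emits a single-frame first chunk and then returns to 25-frame chunks.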
The uniproc Code2Wav stage default max_num_seqs is now 10 (was 1). Avoid reducing below 10 for latency-sensitive deployments.
CUDA Graph warmup for Qwen3-TTS now accounts for custom decode_chunk_frames / decode_left_context_frames overrides.
For high-concurrency TTS serving (voice cloning, c=64+), use qwen3_tts_high_concurrency.yaml:

    vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --omni \
      --stage-configs-path vllm_omni/deploy/qwen3_tts_high_concurrency.yaml
This profile enables a batched CUDA graph decoder, prefix CUDA graphs for the code predictor, bounded reference-code context, and first-chunk fast emit (initial_codec_chunk_frames: 1). It is tuned for 2-GPU serving with a Seed-TTS voice-clone workload. Median TTFP is higher than with the default profile, so use it when optimizing throughput or end-to-end (E2E) latency rather than first-packet latency.
Additional high-concurrency knobs available in the deploy config:
- decode_cudagraph_batch_sizes: Multi-batch-size CUDA graph capture for Code2Wav
- decode_batch_bucket_frames / decode_batch_max_size: Variable-length chunk batching
- ref_code_context_frames: Limits reference-audio code frames per chunk for stable stage-1 shapes
- decode_enable_tf32: true: Opt-in TF32 for Code2Wav
- code_predictor_prefix_graphs: true: Prefix CUDA graph warmup for Stage0 code predictor

For batch mode (no streaming), use qwen3_tts_batch.yaml.
Fish Speech uses fish_speech_s2_pro.yaml with similar knobs. Its DAC codec outputs at 44.1 kHz (vs Qwen3-TTS's 24 kHz).
Note: CosyVoice3 does not support async_chunk streaming yet; use cosyvoice3.yaml (batch mode only).
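For real-time TTS streaming:

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8091/v1", api_key="unused")
    response = client.chat.completions.create(
        model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
        messages=[{"role": "user", "content": "A long paragraph of text to stream..."}],
        stream=True,  # chunks arrive as they are generated
    )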
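When tuning streaming, it helps to measure time to first packet on the client side. The sketch below does that for any iterable of byte chunks. How audio bytes are extracted from each streamed response chunk is not shown in this document, so the helper deliberately takes already-decoded `bytes`; `consume_stream` is a hypothetical name, not a vLLM-Omni API.

```python
import time


def consume_stream(chunks, clock=time.perf_counter):
    """Drain an iterable of byte chunks; return (ttfp_seconds, total_bytes).

    ttfp_seconds is None when the stream yields no non-empty chunk.
    """
    started = clock()
    ttfp = None
    total = 0
    for chunk in chunks:
        if not chunk:
            continue  # ignore empty chunks when timing the first packet
        if ttfp is None:
            ttfp = clock() - started
        total += len(chunk)
    return ttfp, total
```

The `clock` parameter exists only so the timing can be replaced with a deterministic source in tests.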
For a step-by-step guide on integrating a new TTS model into vLLM-Omni, see the TTS model developer guide. Offline examples are consolidated under examples/offline_inference/text_to_speech/<model>/end2end.py, and online serving examples under examples/online_serving/text_to_speech/<model>/.
Audio quality issues: Ensure reference audio for voice cloning is clean (no background noise), 10-20 seconds, single speaker.
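Of the three criteria above, only duration is easy to check mechanically. A minimal sketch using the standard-library `wave` module, assuming the reference clip is an uncompressed PCM WAV file (the helper name `reference_duration_ok` is hypothetical; noise and speaker count still need a listen):

```python
import wave


def reference_duration_ok(path, min_seconds=10.0, max_seconds=20.0):
    """Return (duration_seconds, within_range) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return duration, min_seconds <= duration <= max_seconds
```

Calling `reference_duration_ok("reference_voice.wav")` before a cloning request gives a quick signal that the clip falls in the recommended 10-20 second range.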
Qwen3-TTS code predictor crash: Fixed in #1619. If you encounter a crash in the code predictor stage, update to the latest vllm-omni.
Qwen3-TTS NaN on fp16-only GPUs: The code predictor auto-upcasts to float32 for numerical stability on GPUs without bf16 support (Turing, Volta). No manual override needed. Fixed in #3253.
Qwen3-TTS speaker_embedding dimension error: Speaker embedding dimensions must match the model's talker hidden_size (2048 for 1.7B, 1024 for 0.6B). Mismatched dimensions return HTTP 400. Fixed in #3191.
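A client can catch this mismatch before sending the request. The sketch below checks the embedding length against the hidden sizes quoted above; `validate_speaker_embedding` is a hypothetical client-side helper, not the server's actual validation code.

```python
# Talker hidden_size per Qwen3-TTS model size, as stated above.
TALKER_HIDDEN_SIZE = {"1.7B": 2048, "0.6B": 1024}


def validate_speaker_embedding(embedding, model_size):
    """Raise ValueError if the embedding length does not match the model."""
    expected = TALKER_HIDDEN_SIZE[model_size]
    if len(embedding) != expected:
        raise ValueError(
            f"speaker embedding has {len(embedding)} dims, "
            f"expected {expected} for the {model_size} model"
        )
    return embedding
```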
Qwen3-TTS load_format: dummy: speaker_encoder is always constructed at init time. Voice cloning works under load_format: dummy without extra configuration. Fixed in #3117.
Slow generation: TTS models are autoregressive, so generation time scales with output duration. Enable async_chunk for lower first-packet latency. For throughput, increase max_batch_size.
Fish Speech voice cloning latency: Uploaded voices via /v1/audio/voice/upload now auto-cache DAC-encoded reference audio. First request encodes the reference; subsequent requests reuse the cached codes for faster TTFP. Fixed in #2609.
Event loop blocking under concurrent TTS: Blocking tokenizer operations (_build_voxtral_prompt, _build_fish_speech_prompt) now run in a shared ThreadPoolExecutor(max_workers=1). This prevents /health latency spikes under concurrent load. Fixed in #2511.
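The general pattern behind this fix can be sketched with `asyncio` and a single-worker executor. This illustrates the technique only: `build_prompt_blocking` is a hypothetical stand-in for the real prompt builders, and the actual vLLM-Omni code may be structured differently.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# One shared worker, mirroring ThreadPoolExecutor(max_workers=1) from the note above.
_TOKENIZER_POOL = ThreadPoolExecutor(max_workers=1)


def build_prompt_blocking(text):
    """Hypothetical stand-in for a slow, blocking tokenizer call."""
    time.sleep(0.01)
    return f"<prompt>{text}</prompt>"


async def build_prompt(text):
    """Run the blocking call off the event loop so other handlers stay responsive."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_TOKENIZER_POOL, build_prompt_blocking, text)


async def health():
    """Cheap handler that should never wait on tokenization."""
    return "ok"


async def demo():
    # health() completes without waiting for the two queued prompt builds.
    return await asyncio.gather(health(), build_prompt("a"), build_prompt("b"))
```

With a single worker the prompt builds are serialized, but they no longer occupy the event loop thread, which is what keeps a health endpoint fast under load.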