skills/vllm-omni-tts-integration/SKILL.md
Integrate a new text-to-speech model into vLLM-Omni from HuggingFace reference implementation through production-ready serving with streaming and CUDA graph acceleration. Use when adding a new TTS model, wiring stage separation for speech synthesis, enabling online voice generation serving, debugging TTS integration behavior, or building audio output pipelines.
HF Reference -> Stage Separation -> Online Serving -> Async Chunk -> CUDA Graph
(Phase 1) (Phase 2) (Phase 3) (Phase 4) (Phase 5)
Goal: Understand the reference implementation and verify it produces correct audio.
- `config.json` fields, `model_type`, sub-model configs
- Special tokens (`<|voice|>`, `<|audio_start|>`, `<|im_end|>`, etc.)

Goal: Split the model into vLLM-Omni stages and get offline inference working.
- `vllm_omni/model_executor/models/registry.py`
- Config (`configuration_<model>.py`) with `model_type` registration
- `forward()` for autoregressive token generation
- `forward()`: codec codes -> audio waveform
- `OmniOutput` with `multimodal_outputs`

| Parameter | Impact if Wrong |
|-----------|-----------------|
| Hop length | Audio duration wrong, streaming noise |
| Token ID mapping | Garbage codes -> noise output |
| Codebook count/size | Shape mismatch crashes |
| Stop token | Generation never stops or stops too early |
| dtype / autocast | Numerical issues, silent quality degradation |
| Repetition penalty | Must match reference (often 1.0 for TTS) |
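Some of the parameters in the table can be sanity-checked numerically before listening to any audio. The sketch below assumes that one codec frame decodes to exactly `hop_length` samples. `expected_num_samples` and `check_duration` are hypothetical helper names, and the numbers in the usage lines are illustrative rather than taken from any particular model.

```python
def expected_num_samples(num_frames: int, hop_length: int) -> int:
    """Samples the decoder should emit, assuming one frame -> hop_length samples."""
    return num_frames * hop_length


def check_duration(num_frames: int, hop_length: int, actual_samples: int,
                   tolerance_frames: int = 1) -> bool:
    """Return True if the waveform length is within a few frames of the expected length."""
    expected = expected_num_samples(num_frames, hop_length)
    return abs(actual_samples - expected) <= tolerance_frames * hop_length


# Illustrative values only: 100 frames with a hop length of 512 samples.
print(expected_num_samples(100, 512))   # 51200
print(check_duration(100, 512, 51200))  # True
print(check_duration(100, 512, 25600))  # False: length is off by a factor of two
```

A length that is off by a clean ratio usually points at the hop length or the frame count rather than at the token mapping.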
When audio output is wrong, check in this order:
- `vllm_omni/model_executor/models/<model_name>/`
- `end2end.py` with correct audio output

Goal: Expose the model via `/v1/audio/speech` API endpoint.
serving_speech.py:
- `_TTS_MODEL_STAGES` set
- Model detection flag (e.g., `_is_fish_speech`)
- Prompt builder (e.g., `_build_fish_speech_prompt()`)
- `ref_audio` encoding and prompt injection
- `max_new_tokens` override in sampling params
- `tokens_input()` for prompt construction: All TTS prompt builders must use `vllm.inputs.tokens_input(prompt_token_ids=...)` instead of raw dicts. This avoids deprecated `preprocess()` fallback warnings. The returned dict has `"type": "token"` automatically. Add model-specific keys (e.g., `additional_information`) to the returned dict.
- `speech_client.py`, `run_server.sh`

```python
from pathlib import Path

from vllm.inputs import tokens_input


def build_voice_clone_prompt(ref_audio_path: str, text: str, codec, tokenizer) -> dict:
    """Build prompt with reference audio for voice cloning in serving_speech.py."""
    audio_bytes = Path(ref_audio_path).read_bytes()
    codes = codec.encode(audio_bytes)  # Encode on CPU using model's codec (e.g., DAC)
    # Shift codec codes into the LM vocabulary range
    voice_token_ids = [code + codec.vocab_offset for code in codes.flatten().tolist()]

    # Splice the voice token IDs between the tokenized marker and the text.
    # Converting IDs to characters and re-tokenizing would not round-trip them.
    # add_special_tokens=False assumes a HuggingFace-style tokenizer.
    prompt_token_ids = (
        tokenizer.encode("<|voice|>", add_special_tokens=False)
        + voice_token_ids
        + tokenizer.encode(text, add_special_tokens=False)
    )

    # Use tokens_input() to avoid deprecated preprocess() fallback
    return tokens_input(prompt_token_ids=prompt_token_ids)
```
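On the client side, a request body for `/v1/audio/speech` might be assembled as in the sketch below. This is an illustration under assumptions, not the server's documented schema: the field names `model`, `input` and `response_format` follow the OpenAI-style speech API, sending `ref_audio` as a base64 string is a guess at the wire format, and `build_speech_request` and the model name are hypothetical.

```python
import base64
import json
from typing import Optional


def build_speech_request(model: str, text: str, ref_audio: Optional[bytes] = None,
                         stream: bool = False) -> dict:
    """Assemble a JSON-serializable body for POST /v1/audio/speech (field names assumed)."""
    payload = {
        "model": model,
        "input": text,
        "response_format": "pcm",
        "stream": stream,
    }
    if ref_audio is not None:
        # Raw bytes are not JSON-serializable, so the reference clip is sent as base64 text.
        payload["ref_audio"] = base64.b64encode(ref_audio).decode("ascii")
    return payload


# Placeholder model name and a two-byte stand-in for real audio data.
body = build_speech_request("my-tts-model", "Hello there.", ref_audio=b"\x00\x01", stream=True)
print(json.dumps(body, indent=2))
```

Keeping payload construction separate from the HTTP call makes it easy to unit-test the request shape without a running server.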
- `serving_speech.py` with model-specific prompt builder

Goal: Enable inter-stage streaming so audio chunks are produced while AR generation continues.
```yaml
async_chunk: true
codec_chunk_frames: 25          # inter-stage streaming window
codec_left_context_frames: 25   # overlap for smooth boundaries
initial_codec_chunk_frames: 10  # first chunk only (optional, lowers TTFA)
```
- `OmniOutput`
- `codec_chunk_frames` / `codec_left_context_frames` control inter-stage streaming windows
- `decode_chunk_frames` / `decode_left_context_frames` control the decoder's internal processing (independent)
- `initial_codec_chunk_frames` for a smaller first chunk to lower TTFA, then return to normal `codec_chunk_frames` for subsequent chunks
- `stream=true` with PCM output

```
Stage 0 (AR)                      Stage 1 (Decoder)
    |                                   |
    |-- chunk 0 (25 frames) ------> decode -> audio chunk 0 -> client
    |-- chunk 1 (25 frames) ------> decode -> audio chunk 1 -> client
    |-- chunk 2 (25 frames) ------> decode -> audio chunk 2 -> client
    ...
```
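The windowing described above can be sketched in plain Python. This is a simplified model of the idea, not vLLM-Omni's scheduler: `plan_chunks` and `trim_context` are hypothetical helpers, and the sketch assumes each chunk is decoded together with up to `left_context_frames` of preceding frames, whose audio is then discarded before the chunk is sent to the client.

```python
from typing import List, Optional, Tuple


def plan_chunks(total_frames: int, chunk_frames: int, left_context_frames: int,
                initial_chunk_frames: Optional[int] = None) -> List[Tuple[int, int, int]]:
    """Return (context_start, start, end) frame windows covering total_frames."""
    windows = []
    start = 0
    # Only the first chunk may use the smaller initial size.
    size = initial_chunk_frames or chunk_frames
    while start < total_frames:
        end = min(start + size, total_frames)
        # Early chunks have less history available than left_context_frames.
        context_start = max(0, start - left_context_frames)
        windows.append((context_start, start, end))
        start = end
        size = chunk_frames
    return windows


def trim_context(samples: list, context_frames: int, hop_length: int) -> list:
    """Drop the re-decoded left-context audio so only new samples remain."""
    return samples[context_frames * hop_length:]


for context_start, start, end in plan_chunks(60, 25, 25, initial_chunk_frames=10):
    print(context_start, start, end)
# 0 0 10
# 0 10 35
# 10 35 60
```

The number of context frames actually decoded for a window is `start - context_start`, which can be smaller than the configured overlap near the beginning of the utterance, so that value is the one to pass to `trim_context`.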
`context_audio_samples = context_frames * hop_length`

Goal: Capture the AR loop as a CUDA graph for significant speedup.
- `torch.argmax` instead of `torch.multinomial` (graph-safe)

```python
import torch


class CodePredictorGraph:
    """Captures the 16-step code predictor AR loop as a single CUDA graph."""

    def setup_graph(self, device: torch.device, kv_heads: int = 4, head_dim: int = 64):
        self.num_steps = 16
        # Static buffers: a captured graph replays against fixed memory addresses
        self.kv_cache = torch.zeros(1, kv_heads, self.num_steps, head_dim, device=device)
        self.positions = torch.arange(self.num_steps, device=device)
        self.causal_mask = torch.tril(torch.ones(self.num_steps, self.num_steps, device=device))
        self.input_buf = torch.zeros(1, 1, kv_heads * head_dim, device=device)
        self.output_buf = torch.zeros(1, self.num_steps, device=device, dtype=torch.long)
        # Warm up on a side stream, then capture:
        #   self.graph = torch.cuda.CUDAGraph()
        #   with torch.cuda.graph(self.graph):
        #       ...  # run the AR loop, writing results into self.output_buf

    def run_graph(self, initial_input: torch.Tensor) -> torch.Tensor:
        self.input_buf.copy_(initial_input)  # refresh the static input in place
        self.graph.replay()
        return self.output_buf.clone()  # clone so the next replay cannot overwrite it
```
Based on Qwen3-TTS code predictor experience:
Use this checklist when integrating a new TTS model:
- `registry.py`
- `model_type` registration
- `end2end.py` produces audio matching reference quality
- `serving_speech.py`
- `tokens_input()` from `vllm.inputs` (not raw dicts)
- `async_chunk: true`
- Streaming (`stream=true`) works