
Use before submitting a PR to vllm-project/vllm-omni to self-check the branch against project conventions, catch dead code, verify accuracy and performance claims, and confirm merge readiness. Use when the user says "pre-check", "self review", "pre-submit check", or "check my PR before I open it."
Review PRs on vllm-project/vllm-omni by routing to the right domain skills, checking critical evidence, and focusing comments on blocking issues. Use when reviewing pull requests or local branches, triaging review depth, running detailed or default review, or checking tests, benchmarks, and breaking changes in vllm-omni.
vllm-omni-test-report: Two report kinds; **default output is always HTML** unless the user explicitly asks for Markdown (.md). **Release**: `scripts/compose_full_report.py` (**测试结论** (test conclusion), Buildkite metrics, **Test Result** = Common stack + optional `--log-dir-h*` nightly-style summaries + H100/CI block, **Issue tracking** = GitHub `ci-failure` + *local test* in:title, Open bugs); use `--format markdown` only when the user wants .md or `patch_report_*.py`. **Nightly**: `script
Configure vLLM-Omni for different hardware backends including NVIDIA CUDA, AMD ROCm, Huawei NPU, and Intel XPU. Use when selecting a hardware backend, troubleshooting GPU issues, configuring device placement, or optimizing for specific accelerators.
Scale vLLM-Omni across multiple GPUs and nodes using tensor parallelism, pipeline parallelism, OmniConnector disaggregation, connector backends, and Ray. Use when setting up multi-GPU inference, distributing model execution across machines, deploying disaggregated execution, developing OmniConnector backends, or scaling inference horizontally.
Generate audio and speech with vLLM-Omni using Qwen3-TTS, Fish Speech S2 Pro, CosyVoice3, MiMo-Audio, and Stable-Audio models. Use when synthesizing speech from text, generating audio effects or music, configuring TTS parameters, cloning voices, adding new TTS models, or working with text-to-speech models.
Optimize vLLM-Omni performance through benchmarking, TeaCache, Cache-DiT, quantization, CPU offloading, and parallelism tuning. Use when improving inference speed, reducing latency, lowering memory usage, running benchmarks, or enabling diffusion acceleration.
Generate and edit images with vLLM-Omni using models like FLUX, Stable Diffusion 3, Qwen-Image, GLM-Image, BAGEL, and Z-Image. Use when generating images from text, editing images, configuring diffusion parameters, or working with image generation models.
Use when working on vLLM-Omni quantization for autoregressive, diffusion, or multi-stage omni models, choosing methods such as `awq`, `gptq`, `fp8`, `int8`, `gguf`, or ModelOpt checkpoints, adding quantized model support, or debugging memory, loader, quality, or performance issues.
Launch and configure vLLM-Omni API servers for production model serving. Use when starting a model server, configuring stage pipelines, setting up GPU memory, enabling optimizations, or deploying models behind a load balancer.
Generate videos with vLLM-Omni using Wan2.2 and other video generation models. Use when generating videos from text, creating videos from images, configuring video generation parameters, or working with text-to-video or image-to-video models.
Install and configure vLLM-Omni for omni-modality model inference. Use when setting up vllm-omni, configuring the environment, installing dependencies, resolving GPU driver issues, or preparing a machine for model serving.
Contribute to vLLM-Omni by adding new model support, fixing bugs, or improving features. Use when integrating a new model into vllm-omni, setting up a development environment, writing tests, or submitting pull requests to the vllm-omni project.
Integrate a new text-to-speech model into vLLM-Omni from HuggingFace reference implementation through production-ready serving with streaming and CUDA graph acceleration. Use when adding a new TTS model, wiring stage separation for speech synthesis, enabling online voice generation serving, debugging TTS integration behavior, or building audio output pipelines.
vllm-omni-nightly-local: On HK - SSH, Slurm, non-interactive docker exec (bash -lc): run **`source /rebase/.venv/bin/activate`** inside the container before repo commands, then run `tools/nightly/run_nightly_jobs.sh` and write logs under `logs/nightly_jobs`. Sync the logs and the optional `logs/nightly_perf_manual.xlsx` to your laptop, then use vllm-omni-test-report (report kind: nightly) with `scripts/nightly_local_log_report.py`; **default output HTML** (`--html-report`) unless the user explici
Add a new diffusion model (text-to-image, text-to-video, image-to-video, text-to-audio, image editing) to vLLM-Omni, including Cache-DiT acceleration and parallelism support (TP, SP/USP, CFG-Parallel, HSDP). Use when integrating a new diffusion model, porting a diffusers pipeline or a custom model repo to vllm-omni, creating a new DiT transformer adapter, adding diffusion model support, or enabling multi-GPU parallelism and cache acceleration for an existing model.
Transcribe speech, generate images from prompts, analyze video content, and convert between modalities using omni-modality models like Qwen2.5-Omni and Qwen3-Omni. Use when working with multimodal models for speech recognition, image generation, video understanding, voice synthesis, or any task that combines text, image, audio, and video inputs and outputs.
Use when adding a recipe for omnimodal models (text-to-image, text-to-video, text-to-audio, image-to-video, any-to-any, diffusion transformers) to the vLLM recipes repository, or when documenting vLLM-Omni deployment
Scan pull requests of a specific type (e.g., new model support) from vLLM and vLLM-Omni repos, extract code review patterns and suggestions, then generate a specialized review automation skill. Use when learning from historical pull request reviews, building domain-specific code review expertise, or automating review pattern extraction.
Use when drafting or editing release notes for vllm-project/vllm-omni, especially when summarizing changes between tags, organizing highlights, and matching the style of recent vLLM-Omni releases.
Integrate with vLLM-Omni using the OpenAI-compatible API for text, image, video, and audio generation. Use when building client applications, calling vllm-omni endpoints, sending requests to the API server, or integrating vllm-omni into an application.
Benchmark, profile, and tune vLLM-Omni diffusion models, especially Wan/Qwen/Helios-style pipelines on GPU or NPU. Use when measuring denoise latency, collecting torch or torch_npu profiler traces, reading ASCEND_PROFILER_OUTPUT artifacts, comparing before/after performance, or diagnosing bottlenecks such as SP communication, VAE convs, data transforms, offload overlap, and RoPE overhead.
Guide for achieving optimal inference performance with vLLM-Omni diffusion models. Covers all lossless and lossy optimization methods (parallelism, torch.compile, CPU offload, quantization, cache acceleration), per-model support tables, and ready-to-use recipes. Use when asked to speed up diffusion inference, reduce latency, lower VRAM usage, or tune a diffusion pipeline.
Set up CI/CD pipelines for vLLM-Omni model deployments including Docker builds, automated testing, rolling updates, and deployment validation. Use when creating deployment pipelines, automating model serving updates, setting up Docker workflows, or configuring GitHub Actions for vllm-omni.