skills/model-optimization/vllm/vllm-qwen-vlm-omni-asr-optimization/SKILL.md
PR-backed optimization manual for Qwen2.5-VL / Qwen3-VL / Qwen3-Omni / Qwen3-ASR in vLLM. Use when an engineer needs to audit, debug, extend, or document the multimodal Qwen runtime in vLLM, especially Qwen2.5-VL attention hot paths, Qwen3-VL video and interleaved MRoPE handling, Qwen3-Omni thinker audio-in-video logic, and Qwen3-ASR / realtime speech support.
npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS vllm-qwen-vlm-omni-asr-optimization

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This family is already supported on the checked mainline commit, but the high-risk logic lives in processors and multimodal position handling more than in the text decoder itself. The important milestones are:
- #13155: Qwen2.5-VL hot-path optimization
- #24727: Qwen3-VL and Qwen3-VL-MoE bring-up
- #25055: interleaved MRoPE support for Qwen3-VL
- #25550: Qwen3-Omni thinker path
- #33312: Qwen3-ASR base transcription path
- #34613: Qwen3-ASR realtime streaming path

Evidence snapshot:

- 0f7be0f2f76814f80f9091220a5fbbb53912ad00
- references/pr-history.md
- model-pr-optimization-history/vllm/qwen-vlm-omni-asr/README.zh.md and README.en.md

Use skills/model-optimization/model-pr-diff-dossier/SKILL.md as the bar.
When touching this family, name the exact processor/model file because regressions
usually come from placeholder expansion, timestamps, audio lengths, or MRoPE
layout, not from generic "multimodal" code.
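As an illustration of the placeholder-expansion failure mode, the following is a minimal sketch of a count check, not vLLM code. It assumes the Qwen2-VL-style convention in which each block of `spatial_merge_size × spatial_merge_size` vision patches becomes one placeholder token; `GridTHW`, both helper functions, and the placeholder id `99` are hypothetical names and values introduced only for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class GridTHW:
    """Hypothetical container: patch grid of one image or video (time, height, width)."""
    t: int
    h: int
    w: int


def expected_placeholder_tokens(grid: GridTHW, spatial_merge_size: int = 2) -> int:
    """Tokens one item should expand to, assuming merge x merge patches form one token."""
    m = spatial_merge_size
    if grid.h % m or grid.w % m:
        raise ValueError(f"grid {grid} is not divisible by merge size {m}")
    return grid.t * (grid.h // m) * (grid.w // m)


def check_placeholder_expansion(
    token_ids: Sequence[int],
    placeholder_id: int,
    grids: Iterable[GridTHW],
    spatial_merge_size: int = 2,
) -> int:
    """Compare placeholder tokens in the prompt with the count implied by the grids."""
    expected = sum(expected_placeholder_tokens(g, spatial_merge_size) for g in grids)
    actual = sum(1 for tok in token_ids if tok == placeholder_id)
    if actual != expected:
        raise AssertionError(
            f"placeholder mismatch: prompt has {actual}, grids imply {expected}"
        )
    return actual


# Example: one image with a 1 x 4 x 6 patch grid should expand to 1 * 2 * 3 = 6 tokens.
PLACEHOLDER_ID = 99  # made-up id for illustration only
prompt_ids = [1] + [PLACEHOLDER_ID] * 6 + [2]
matched = check_placeholder_expansion(prompt_ids, PLACEHOLDER_ID, [GridTHW(1, 4, 6)])
```

A check of this shape is cheap enough to run in a processor unit test, and a mismatch points at the processor file rather than at the decoder.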
- vllm/vllm/model_executor/models/qwen2_5_vl.py
- vllm/vllm/model_executor/models/qwen3_vl.py
- vllm/vllm/model_executor/models/qwen3_vl_moe.py
- vllm/vllm/model_executor/models/qwen3_omni_moe_thinker.py
- vllm/vllm/model_executor/models/qwen3_asr.py
- vllm/vllm/model_executor/models/qwen3_asr_realtime.py
- vllm/vllm/model_executor/layers/rotary_embedding/mrope.py
- vllm/vllm/transformers_utils/processors/qwen3_asr.py

use_audio_in_video.

- Qwen2.5-VL Optimization
- Support Qwen3-VL Model Series
- Add Triton kernel for Qwen3-VL interleaved MRoPE
- Add Qwen3-Omni moe thinker
- Qwen3-ASR
- Add Qwen3-ASR realtime streaming support

use_audio_in_video.

- references/pr-history.md: diff-reviewed Qwen multimodal cards.

Use skills/model-optimization/model-pr-diff-dossier/SKILL.md as the production bar.
Every PR cited for this family must be based on diff reading, not only PR titles.