skills/model-optimization/vllm/vllm-qwen3-core-optimization/SKILL.md
PR-backed optimization manual for Qwen3 Core in vLLM. Use when an engineer needs to audit, debug, extend, or document Qwen3 dense, Qwen3 MoE, embeddings/rerankers, GGUF/GPTQ/ModelOpt quant paths, and Eagle3 speculative decoding in vLLM.
npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS vllm-qwen3-core-optimization
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill covers Qwen3 dense, Qwen3 MoE, embeddings/rerankers, GGUF/GPTQ/ModelOpt quant paths, and Eagle3 speculative decoding in vLLM.
Evidence snapshot:
- 0f7be0f2f76814f80f9091220a5fbbb53912ad00
- references/pr-history.md
- model-pr-optimization-history/vllm/qwen3-core/README.zh.md and README.en.md

Use skills/model-optimization/model-pr-diff-dossier/SKILL.md as the production bar.
Every PR cited for this family must be based on diff reading, not only PR titles.
Source files:
- vllm/vllm/model_executor/models/qwen3.py
- vllm/vllm/model_executor/models/qwen3_moe.py
- vllm/vllm/model_executor/models/voyage.py

PRs covered:
- Add Qwen3 and Qwen3MoE: Initial Qwen3 dense and MoE support landed here.
- Support Qwen3 Embedding & Reranker: Extended the family to bidirectional embedding / reranker models.
- Skip loading extra parameters for modelopt Qwen3 MoE model: Fixed a concrete ModelOpt launch failure on Qwen3 MoE.
- KeyError for Qwen3-MoE with GPTQ on ROCm: Closed a GPTQ loading failure in the Qwen3 MoE path.
- Fix GGUF loader for Qwen3 MoE: Made the Qwen3 MoE loader accept GGUF weights again.
- Fix Qwen3 MoE GPTQ inference: Patched runtime correctness after GPTQ startup succeeded.
- Add EAGLE-3 Speculative Decoding Support for Qwen3 MoE: Enabled the draft-model path on top of the base Qwen3 MoE runtime.
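The ModelOpt and GPTQ entries above both concern checkpoints that carry tensor names the model does not define. As a rough illustration of the general "skip extra parameters" pattern, here is a minimal sketch that uses plain Python dicts in place of real tensors. It is not taken from the PR diffs: the function name `load_weights_skipping_extras` and all parameter names in the sample data are hypothetical, and the actual vLLM loader should be read at the evidence snapshot commit before drawing conclusions.

```python
from typing import Dict, Iterable, List, Tuple


def load_weights_skipping_extras(
    params: Dict[str, list],
    checkpoint: Iterable[Tuple[str, list]],
) -> Tuple[List[str], List[str]]:
    """Copy checkpoint entries into params, skipping names the model lacks.

    Hypothetical helper: returns (loaded_names, skipped_names) so that
    skipped entries can be audited instead of raising KeyError.
    """
    loaded: List[str] = []
    skipped: List[str] = []
    for name, tensor in checkpoint:
        if name not in params:
            # Extra checkpoint entry with no matching model parameter:
            # record it and move on rather than failing the whole load.
            skipped.append(name)
            continue
        params[name] = tensor
        loaded.append(name)
    return loaded, skipped


# Hypothetical parameter names, for illustration only.
model_params = {
    "layers.0.mlp.gate.weight": [0.0],
    "layers.0.mlp.experts.w13_weight": [0.0],
}
checkpoint = [
    ("layers.0.mlp.gate.weight", [1.0]),
    ("layers.0.mlp.experts.w13_weight", [2.0]),
    ("layers.0.mlp.experts.extra_scale", [3.0]),  # not defined by the model
]

loaded, skipped = load_weights_skipping_extras(model_params, checkpoint)
print("loaded:", loaded)
print("skipped:", skipped)
```

Returning the skipped names, rather than silently dropping them, keeps the audit trail that this skill asks for: a reviewer can check whether each skipped entry is genuinely unused or a sign of a name-mapping bug.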