skills/model-optimization/sglang/sglang-qwen3-next-optimization/SKILL.md
PR-diff-backed optimization manual for Qwen3-Next, Qwen3-Next MTP, and Qwen3-Coder-Next shared hybrid paths in SGLang. Use when an engineer needs to audit, extend, or debug Qwen3-Next GDN/Mamba/RadixLinearAttention, MTP/EAGLE/NEXTN, FP8/NVFP4/ModelOpt loading, CPU offload, FlashInfer/CuTe/Gluon GDN kernels, AMD/NPU/Blackwell paths, mixed-chunk extra_buffer behavior, or Qwen3-Next cookbook deployment flags.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS sglang-qwen3-next-optimization
Before writing or changing Qwen3-Next docs/code, read:
- references/pr-history.md
- references/playbook.md

This skill follows the repository-level requirement captured in skills/model-optimization/model-pr-diff-dossier: every PR entry must be based on code diff inspection and must include motivation, implementation idea, and the key code fragment or pseudo-fragment that explains the change.
Treat Qwen3-Next as an independent model family. Do not collapse it into generic Qwen3 MoE unless the change is genuinely shared and involves no Qwen3-Next-specific runtime behavior.
Covered surfaces:
- Qwen3NextForCausalLM
- Qwen3NextForCausalLMMTP
- extra_buffer, mixed chunk, and prefix reuse

Capture these facts before debugging or documenting:
- SGLANG_ENABLE_SPEC_V2
- --mamba-scheduler-strategy: no_buffer or extra_buffer
- --enable-mixed-chunk
- --mamba-ssm-dtype, --page-size, radix/prefix cache flags
- --cpu-offload-gb

Initial architecture and state:
- #10233: initial Qwen3-Next support
- #10322: Gemma RMSNorm normalization fix
- #10379, #11969, #16164: Ascend NPU bring-up and W8A8 follow-ups
- #10912: PD disaggregation with Mamba extra pool transfer

GDN/Mamba performance:
- #12508: fused GDN gating
- #13081, #17613, #19220: PCG evolution
- #15631, #17981, #17983, #23273: CuTe/Gluon/FlashInfer GDN kernels
- #18917, #19321, #19434, #21019: fused projection and fused norm/gate path

MTP/speculative:
- #10392: MTP + DP fixes
- #14607: EAGLE3
- #19767: MTP + EPLB
- #22458: TP-rank broadcast to avoid MTP NCCL hang
- #12892, #14502, #16488: open state-copy/PCG/TBO optimization radar

Quantization/offload:
- #10466, #10622, #17627, #18224, #21313, #21496, #21662, #21698
- #23474: CPU offload for hybrid linear-attention tied params and cached tensor views

Scheduler/cache correctness:
- #21684: allocator clone for memory leak
- #22876: guard mixed chunk + extra_buffer
- #23075: root-cause metadata slicing fix for mixed chunk + extra_buffer

- Treat #21313 and #21496 only as intermediate loader history; current behavior comes from #21662.
- #22073 is Qwen3-ASR and only touches shared Qwen-family surfaces; do not treat it as a GDN optimization.
- #19812 is an example; merged Qwen3-Next MTP/EPLB behavior comes from #19767.

Run the narrowest relevant tests first:
python -m pytest test/registered/4-gpu-models/test_qwen3_next_models.py
python -m pytest test/registered/4-gpu-models/test_qwen3_next_models_mtp.py
python -m pytest test/registered/models/test_qwen3_next_models_fp4.py
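The three commands above can be wrapped so that the run stops at the first failing file, which keeps the narrowest-first order intact. The following is only a sketch: the test paths are copied from the list above, while `build_command`, `run_in_order`, and the injectable `runner` argument are hypothetical helper names, not part of SGLang.

```python
import subprocess
import sys
from typing import Callable, List, Optional, Sequence

# Test files copied from the commands above, in the same order.
QWEN3_NEXT_TESTS = [
    "test/registered/4-gpu-models/test_qwen3_next_models.py",
    "test/registered/4-gpu-models/test_qwen3_next_models_mtp.py",
    "test/registered/models/test_qwen3_next_models_fp4.py",
]


def build_command(test_path: str) -> List[str]:
    """Build the `python -m pytest <file>` invocation as an argument list."""
    return [sys.executable, "-m", "pytest", test_path]


def _run(cmd: Sequence[str]) -> int:
    """Default runner (assumption): execute the command, return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def run_in_order(
    tests: Sequence[str] = QWEN3_NEXT_TESTS,
    runner: Callable[[Sequence[str]], int] = _run,
) -> Optional[str]:
    """Run each test file in order; return the first failing path, or None."""
    for path in tests:
        if runner(build_command(path)) != 0:
            return path  # stop here instead of widening the search
    return None
```

Calling `run_in_order()` from the repository root returns `None` when all three files pass, or the path of the first file that failed. Passing a custom `runner` makes it possible to dry-run the order without launching any GPU test.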
Then add lane-specific checks:
- extra_buffer, no_buffer, concurrency, prefix reuse

Each PR card should include:
Use skills/model-optimization/model-pr-diff-dossier/SKILL.md as the production bar.
Every PR cited for this family must be based on diff reading, not only PR titles.
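The diff-reading requirement can be checked mechanically before a PR card is accepted. The sketch below assumes a card carries the parts this skill names (motivation, implementation idea, key code fragment or pseudo-fragment) plus a flag recording that the diff was actually read. `PRCard`, its field names, and `missing_parts` are hypothetical and do not reflect the layout used in references/pr-history.md.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PRCard:
    """Hypothetical container for one PR entry; field names are illustrative."""

    number: int
    motivation: str = ""
    implementation_idea: str = ""
    key_fragment: str = ""  # key code fragment or pseudo-fragment from the diff
    diff_inspected: bool = False  # set to True only after reading the diff


def missing_parts(card: PRCard) -> List[str]:
    """Return the names of required parts that are still empty or unset."""
    missing = [
        name
        for name in ("motivation", "implementation_idea", "key_fragment")
        if not getattr(card, name).strip()
    ]
    if not card.diff_inspected:
        missing.append("diff_inspected")
    return missing
```

An empty list from `missing_parts` means the card has every required part. A card built from a PR title alone would still report `diff_inspected` as missing.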