skills/model-optimization/sglang/sglang-qwen3-core-optimization/SKILL.md
PR-diff-backed optimization manual for Qwen3 dense and Qwen3 MoE in SGLang. Use when an engineer needs to recover, extend, audit, or write documentation for Qwen3/Qwen3-30B/Qwen3-235B-A22B, FP8/NVFP4/MXFP4/W4A4, fused QK-norm/RoPE/KV-store paths, FlashInfer TRTLLM-GEN-MoE, DeepEP/EPLB/TBO/context parallel, EAGLE3, LoRA, PP/tied embeddings, Ascend NPU/XPU/MLX support, or Qwen3 reasoning/tool-parser behavior.
This skill covers the non-hybrid Qwen3 text path: dense Qwen3, Qwen3 MoE, Qwen3-30B-A3B, Qwen3-235B-A22B, Qwen3 Instruct/Thinking variants, and the shared Qwen3 infrastructure reused by Qwen3.5, Qwen3-Next, Qwen3.6, Qwen3 Omni thinker-only, and Qwen-family quantization loaders.
When writing or revising Qwen3 Core optimization docs, use ../../model-pr-diff-dossier/SKILL.md as the production standard.
Do not fill PR history from a script, generated table, or title-level summary. For every PR you cite, read the PR diff or the final merge commit and record:
If a PR is docs-only, quote the exact launch/config line that changed and explain why that line matters for serving or validation.
- model-pr-optimization-history/sglang/qwen3-core/README.zh.md: Chinese PR optimization history.
- model-pr-optimization-history/sglang/qwen3-core/README.en.md: English PR optimization history.

Current evidence snapshot:
- 2026-04-22: b3e6cf60a
- 2026-04-21: 816bad5

Start from these files before making a Qwen3 Core change:
- python/sglang/srt/models/qwen3.py
- python/sglang/srt/models/qwen3_moe.py
- python/sglang/srt/models/qwen2.py
- python/sglang/srt/layers/moe/
- python/sglang/srt/layers/layernorm.py
- python/sglang/srt/layers/quantization/
- python/sglang/srt/layers/attention/
- python/sglang/srt/distributed/
- python/sglang/srt/function_call/qwen25_detector.py
- test/registered/models/test_qwen_models.py
- test/registered/4-gpu-models/test_qwen3_30b.py
- test/registered/stress/test_stress_qwen3_235b.py
- test/srt/models/test_lora_qwen3.py
- test/registered/backends/test_qwen3_fp4_trtllm_gen_moe.py
- test/registered/npu/
- docs/basic_usage/qwen3.md
- docs_new/cookbook/autoregressive/Qwen/Qwen3.mdx

Treat Qwen3 Core as the compatibility baseline:
- Qwen3ForCausalLM, Qwen3MoeForCausalLM, pooled output, embedding, or a downstream Qwen family reusing Qwen3 logic.
- rope_parameters, top-level rope_theta, rope_scaling, layer_types, tie_word_embeddings, and quantization config.
- lm_head.weight.
- <tool_call> before and after </think>.

Use skills/model-optimization/model-pr-diff-dossier/SKILL.md as the production bar.
Every PR cited for this family must be based on diff reading, not only PR titles.
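
Two of the compatibility items above (the RoPE config fields and the position of <tool_call> relative to </think>) can be made concrete with a small checker. This is a minimal sketch, not SGLang code: the function names are hypothetical, the nested layout of rope_parameters (theta and scaling fields stored together in one dict) and the quantization_config key are assumptions about Hugging Face style configs, and the regex scan is a stand-in for illustration rather than the logic in qwen25_detector.py.

```python
import re
from typing import Any, Dict, List

# Hypothetical helpers for auditing a config dict and a raw model output.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def normalize_rope_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return rope_theta and rope_scaling from either config layout."""
    rope_parameters = config.get("rope_parameters")
    if isinstance(rope_parameters, dict):
        # Assumed newer layout: theta and scaling fields nested in one dict.
        theta = rope_parameters.get("rope_theta", config.get("rope_theta"))
        scaling = {k: v for k, v in rope_parameters.items() if k != "rope_theta"}
        return {"rope_theta": theta, "rope_scaling": scaling or None}
    # Older layout: top-level rope_theta plus an optional rope_scaling dict.
    return {
        "rope_theta": config.get("rope_theta"),
        "rope_scaling": config.get("rope_scaling"),
    }


def summarize_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the fields a compatibility review would compare across models."""
    summary = normalize_rope_config(config)
    summary["layer_types"] = config.get("layer_types")
    summary["tie_word_embeddings"] = bool(config.get("tie_word_embeddings", False))
    # Assumed key name for the quantization section of the config.
    summary["quantized"] = config.get("quantization_config") is not None
    return summary


def tool_calls_around_think(text: str) -> Dict[str, List[str]]:
    """Collect tool-call payloads before and after the first </think>."""
    before, sep, after = text.partition("</think>")
    if not sep:
        # Design choice for this sketch: with no closing think tag,
        # treat the whole text as post-reasoning output.
        before, after = "", text
    return {
        "before_think_end": [m.strip() for m in TOOL_CALL_RE.findall(before)],
        "after_think_end": [m.strip() for m in TOOL_CALL_RE.findall(after)],
    }
```

Running summarize_config on a model's config.json contents and tool_calls_around_think on a captured completion gives a compact record to compare before and after a change. The real loaders and detector should still be read directly, since they may handle cases this sketch ignores.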