skills/model-optimization/sglang/sglang-qwen36-optimization/SKILL.md
PR-backed and current-main optimization manual for Qwen3.6 in SGLang. Use when an engineer needs to recover, extend, or audit Qwen3.6-35B-A3B/27B dense deployment, hybrid Gated Delta Network behavior, multimodal inputs, thinking preservation, Qwen3 reasoning plus Qwen3-Coder tool parser, MTP, Mamba scheduler strategy, FP8/BF16 commands, CPU offload, or cookbook parity.
npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS sglang-qwen36-optimization

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Qwen3.6 is split out because the current SGLang support is mostly a deployment/cookbook layer over hybrid Qwen infrastructure, with unique user-facing behavior: thinking preservation, multimodal input, Qwen3 reasoning parser, Qwen3-Coder tool parser, MTP, and Mamba scheduler strategy.
Current evidence snapshot:
- origin/main: b3e6cf60a on 2026-04-22
- origin/main: 816bad5 on 2026-04-21
- docs_new/cookbook/autoregressive/Qwen/Qwen3.6.mdx, docs_new/src/snippets/autoregressive/qwen36-deployment.jsx
- qwen3_next.py, qwen3_5.py, Qwen VLM processors, and Qwen3-Coder parser depending on the checkpoint path

Use skills/model-optimization/model-pr-diff-dossier/SKILL.md as the production bar.
Every PR cited for this family must be based on diff reading, not only PR titles.
Capture:
- qwen3
- qwen3_coder
- --mamba-scheduler-strategy: default no_buffer or extra_buffer
- SGLANG_ENABLE_SPEC_V2=1 is set
- --attention-backend trtllm_mha

Qwen3.6 work should preserve deployment semantics while reusing lower-level Qwen3-Next/Qwen3.5 machinery.
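One way to keep the captured settings auditable is to record them in a small structure and render the launch command from it. The following is a minimal sketch, not the cookbook's actual tooling: the `Qwen36LaunchRecord` class and `render` helper are hypothetical names, the model path in the usage note is an assumed checkpoint id, and the two scheduler values are read from the capture list above as the only accepted options.

```python
import shlex
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Assumption: these are the two scheduler strategies named in the capture list.
ALLOWED_STRATEGIES = ("no_buffer", "extra_buffer")


@dataclass
class Qwen36LaunchRecord:
    """Hypothetical record of the deployment settings worth capturing."""

    model_path: str
    reasoning_parser: str = "qwen3"
    tool_call_parser: str = "qwen3_coder"
    mamba_scheduler_strategy: str = "no_buffer"
    spec_v2: bool = False  # whether SGLANG_ENABLE_SPEC_V2=1 is set
    attention_backend: Optional[str] = None  # e.g. "trtllm_mha"

    def to_command(self) -> Tuple[Dict[str, str], List[str]]:
        """Return (environment overrides, argv) for the server launch."""
        if self.mamba_scheduler_strategy not in ALLOWED_STRATEGIES:
            raise ValueError(
                f"unknown scheduler strategy: {self.mamba_scheduler_strategy!r}"
            )
        env = {"SGLANG_ENABLE_SPEC_V2": "1"} if self.spec_v2 else {}
        argv = [
            "python", "-m", "sglang.launch_server",
            "--model-path", self.model_path,
            "--reasoning-parser", self.reasoning_parser,
            "--tool-call-parser", self.tool_call_parser,
            "--mamba-scheduler-strategy", self.mamba_scheduler_strategy,
        ]
        if self.attention_backend:
            argv += ["--attention-backend", self.attention_backend]
        return env, argv


def render(record: Qwen36LaunchRecord) -> str:
    """Render the record as a single shell line for logs or review notes."""
    env, argv = record.to_command()
    prefix = [f"{key}={value}" for key, value in env.items()]
    return " ".join(prefix + [shlex.quote(arg) for arg in argv])
```

For example, `render(Qwen36LaunchRecord(model_path="Qwen/Qwen3.6-35B-A3B", spec_v2=True))` yields a line that can be pasted next to a benchmark result, so the parser, scheduler, and backend choices travel with the numbers. Verify the flag spellings against the cookbook snippet before relying on the output.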
- docs_new/cookbook/autoregressive/Qwen/Qwen3.6.mdx
- docs_new/src/snippets/autoregressive/qwen36-deployment.jsx
- docs_new/docs.json
- python/sglang/srt/models/qwen3_next.py
- python/sglang/srt/models/qwen3_5.py
- python/sglang/srt/function_call/qwen3_coder_detector.py
- python/sglang/srt/multimodal/processors/qwen_vl.py

Before adding Qwen3.6 PR evidence, open the PR diff/source and write a full card in references/pr-history.md with motivation, key implementation, code excerpt, reviewed files, and validation implications.
Do not add title-only open PR lists. If an open PR matters for Qwen3.6, keep it in a clearly marked open optimization card only after reviewing the diff, as done for #23474.
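The card requirement above can be checked mechanically. The sketch below is illustrative only: it assumes each card in references/pr-history.md starts with a `### ` heading and names each required field somewhere in its body, which is a hypothetical layout and not one confirmed by the skill. The function names are likewise made up for this example.

```python
from typing import Dict, List

# The five fields a full card must cover, as listed in the text above.
REQUIRED_FIELDS = [
    "motivation",
    "key implementation",
    "code excerpt",
    "reviewed files",
    "validation implications",
]


def split_cards(text: str) -> Dict[str, str]:
    """Split the history file into {card title: card body}.

    Assumption: every card begins with a '### ' heading line.
    """
    cards: Dict[str, str] = {}
    title = None
    body: List[str] = []
    for line in text.splitlines():
        if line.startswith("### "):
            if title is not None:
                cards[title] = "\n".join(body)
            title = line[4:].strip()
            body = []
        elif title is not None:
            body.append(line)
    if title is not None:
        cards[title] = "\n".join(body)
    return cards


def missing_fields(card_body: str) -> List[str]:
    """Return the required fields that the card body never mentions."""
    lowered = card_body.lower()
    return [field for field in REQUIRED_FIELDS if field not in lowered]


def audit(text: str) -> Dict[str, List[str]]:
    """Map each incomplete card title to the fields it lacks."""
    report: Dict[str, List[str]] = {}
    for title, body in split_cards(text).items():
        missing = missing_fields(body)
        if missing:
            report[title] = missing
    return report
```

Running `audit` on the contents of references/pr-history.md returns an empty dict when every card covers all five fields. A title-only entry shows up with all five listed as missing, which is the case the rule above is meant to prevent. A substring match is a weak check, so it complements diff reading and does not replace it.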
- qwen36-deployment.jsx.
- Qwen3.6.mdx.
- extra_buffer and no-MTP with default scheduler.