skills/model-optimization/sglang/sglang-glm46-glm47-optimization/SKILL.md
PR-backed and current-main optimization manual for GLM-4.6, GLM-4.6V-adjacent text paths, GLM-4.7, and GLM-4.7-Flash in SGLang. Use when an engineer needs to recover, extend, or audit GLM shared-expert fusion, dual-stream MoE GEMM overlap, GLM-4.7 tool parser, NVFP4/MTP, GLM4-MoE-Lite/Flash loading, AMD/NPU validation, or GLM-4.6/4.7 cookbook recipes.
npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS sglang-glm46-glm47-optimization
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
GLM-4.6 and GLM-4.7 share enough GLM4 MoE machinery to live in one skill, but this lane is still separate from GLM-4.5 and GLM-5 because it has its own shared-expert fusion, GLM-4.7 tool parser, GLM-4.7-Flash/lite loading, NVFP4/MTP, and AMD/NPU enablement.
Current evidence snapshot:
- origin/main: b3e6cf60a on 2026-04-22
- origin/main: 816bad5 on 2026-04-21
- 2026-04-23
- glm4_moe.py, glm4_moe_lite.py, glm4_moe_nextn.py
- glm47_moe_detector.py, glm4_moe_detector.py

When producing or updating GLM-4.6/4.7 optimization documentation, every cited PR must be read from its GitHub diff before writing. Do not fill PR motivation, implementation, or code snippets from title search, a script-generated table, or a shallow summary.
For each cited PR, include:
Use model-pr-diff-dossier when producing new model PR-history docs.
Capture:
GLM-4.6/4.7 optimization is MoE fusion plus parser/loading correctness, with MTP and hardware backend work layered on top.
- python/sglang/srt/models/glm4_moe.py
- python/sglang/srt/models/glm4_moe_lite.py
- python/sglang/srt/models/glm4_moe_nextn.py
- python/sglang/srt/function_call/glm47_moe_detector.py
- python/sglang/srt/function_call/glm4_moe_detector.py
- docs_new/cookbook/autoregressive/GLM/GLM-4.6.mdx
- docs_new/cookbook/autoregressive/GLM/GLM-4.7.mdx
- docs_new/cookbook/autoregressive/GLM/GLM-4.7-Flash.mdx
- test/registered/8-gpu-models/test_glm_46.py

Read references/pr-history.md before making a code or documentation claim. Important trails:
- #12456, #13989, #15333, #15520, #15753, #15754, #20543, open #11951, open #23067.
- #13786, #13873, #14668, #21660, #21851.
- #17166, #19246, open #22315, merged #22823.
- #17247, #21851, #22509, #22720, open #19040, open #19106.
- #21403, #21534, #19246, #22509, open #17869, open #18930, open #22801.
- Glm4MoeLiteConfig and A2A MoE for GLM-4.7-Flash.
- continue_final_message kwargs for glm45 reasoning parser.

- references/pr-history.md: manually diff-reviewed PR cards.
- references/playbook.md: symptom map, validation lanes, and production toggle checklist.

Use skills/model-optimization/model-pr-diff-dossier/SKILL.md as the production bar.
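Before auditing, it can help to confirm that the source files listed above are present in the local checkout. The following is a minimal sketch, not part of the skill itself: `missing_paths` is a hypothetical helper, and it assumes all listed paths are resolved against a single repository root (the cookbook `.mdx` files may live in a different checkout than the SGLang sources).

```python
from pathlib import Path

# Paths copied verbatim from the file list in this skill.
GLM_PATHS = [
    "python/sglang/srt/models/glm4_moe.py",
    "python/sglang/srt/models/glm4_moe_lite.py",
    "python/sglang/srt/models/glm4_moe_nextn.py",
    "python/sglang/srt/function_call/glm47_moe_detector.py",
    "python/sglang/srt/function_call/glm4_moe_detector.py",
    "docs_new/cookbook/autoregressive/GLM/GLM-4.6.mdx",
    "docs_new/cookbook/autoregressive/GLM/GLM-4.7.mdx",
    "docs_new/cookbook/autoregressive/GLM/GLM-4.7-Flash.mdx",
    "test/registered/8-gpu-models/test_glm_46.py",
]


def missing_paths(repo_root, paths=GLM_PATHS):
    """Return the listed paths that are not regular files under repo_root.

    Hypothetical helper for illustration; it only checks existence and
    says nothing about whether a file matches the evidence snapshot.
    """
    root = Path(repo_root)
    return [p for p in paths if not (root / p).is_file()]
```

Calling `missing_paths("/path/to/checkout")` returns an empty list when every file is present; any returned path is a hint that the checkout is older than the snapshot or that the file lives in another repository.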
Every PR cited for this family must be based on diff reading, not only PR titles.
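One way to keep track of the diff-reading rule is to parse a trail line into PR numbers and compare them against the set already reviewed. This is a minimal sketch under stated assumptions: `parse_trail` and `unreviewed` are hypothetical helpers, the trail format is inferred from the lines above (an optional `open` or `merged` word before `#<number>`), and entries without a status word are reported as `"unlabeled"` rather than guessed to be merged.

```python
import re

# One entry: optional status word ("open" or "merged"), then "#<number>".
_ENTRY = re.compile(r"(?:(open|merged)\s+)?#(\d+)")


def parse_trail(trail):
    """Parse a trail string into a list of (pr_number, status) tuples."""
    return [
        (int(number), status or "unlabeled")
        for status, number in _ENTRY.findall(trail)
    ]


def unreviewed(trail, reviewed):
    """Return PR numbers in the trail that are not in the reviewed set."""
    return [number for number, _ in parse_trail(trail) if number not in reviewed]


trail = "#17166, #19246, open #22315, merged #22823."
print(parse_trail(trail))
print(unreviewed(trail, reviewed={17166, 22823}))
```

The sketch only does bookkeeping; it does not fetch or read any diff, which still has to happen on GitHub for each number it reports.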