skills/model-optimization/vllm/vllm-glm5-glm51-optimization/SKILL.md
PR-backed optimization manual for GLM-5 / GLM-5.1 in vLLM. Use when an engineer needs to audit, debug, extend, or document the current partial GLM-5 bring-up in vLLM, especially the `GlmMoeDsaForCausalLM` aliasing into the DeepSeek-V2/V3 runtime, rope interleave handling, and GLM-5 MTP correctness.
Install this skill globally with one command: `npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS vllm-glm5-glm51-optimization`. Works with Claude Code, Cursor, and Windsurf.
GLM-5 support in vLLM is not a standalone `glm5.py` runtime. The current landed
path adapts GLM-5 into the existing DeepSeek-V2/V3 MLA/MoE implementation and
then fixes the GLM-5 MTP draft-model correctness bug separately.
Evidence snapshot:
- `0f7be0f2f76814f80f9091220a5fbbb53912ad00`
- `references/pr-history.md`
- `model-pr-optimization-history/vllm/glm5-glm51/README.zh.md` and `README.en.md`

Use `skills/model-optimization/model-pr-diff-dossier/SKILL.md` as the bar.
For GLM-5, do not pretend support comes from the older `glm4*` modules; the
actual landed path is the DeepSeek-based alias plus MTP follow-up.
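A minimal sketch of what "aliasing into the DeepSeek runtime" means, using toy stand-ins rather than vLLM code. The registry dictionary, the `resolve` helper, and the class bodies below are hypothetical and only illustrate the pattern: the GLM-5 architecture name resolves to a class that lives in the DeepSeek-V2 module and inherits its behavior, so no separate GLM-5 module is involved.

```python
class DeepseekV2ForCausalLM:
    """Toy stand-in for the shared DeepSeek-V2/V3 runtime class."""

    def describe(self) -> str:
        return f"{type(self).__name__} runs on the DeepSeek-V2/V3 MLA/MoE path"


class GlmMoeDsaForCausalLM(DeepseekV2ForCausalLM):
    """Toy alias: inherits everything, adds no GLM-5-specific module."""


# Hypothetical registry: architecture name -> (module name, class).
# The real vLLM registry has its own format; this is only an illustration.
TOY_REGISTRY = {
    "DeepseekV2ForCausalLM": ("deepseek_v2", DeepseekV2ForCausalLM),
    "GlmMoeDsaForCausalLM": ("deepseek_v2", GlmMoeDsaForCausalLM),
}


def resolve(arch: str):
    """Return (module name, class) for an architecture string."""
    return TOY_REGISTRY[arch]


module_name, model_cls = resolve("GlmMoeDsaForCausalLM")
print(module_name, "->", model_cls().describe())
```

When auditing, the practical consequence is that a GLM-5 bug is usually traced through the DeepSeek-V2 code path rather than through any `glm4*` file.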
- `vllm/vllm/model_executor/models/deepseek_v2.py`
- `vllm/vllm/model_executor/models/registry.py`
- `vllm/vllm/config/speculative.py`
- `vllm/vllm/transformers_utils/model_arch_config_convertor.py`
- `vllm/vllm/v1/spec_decode/eagle.py`
- `vllm/tests/models/registry.py`
- `vllm/tests/models/test_initialization.py`

- #34124 adds `GlmMoeDsaForCausalLM` as a DeepSeek-V2-derived runtime rather than introducing a dedicated GLM-5 file. It also handles `indexer_rope_interleave` and teaches speculative config conversion to treat `glm_moe_dsa` like a DeepSeek MTP family.
- #34385 fixes GLM-5 MTP accuracy by explicitly sharing the target `lm_head` into every MTP layer's `shared_head.head`; without that, logits could become zero or NaN.

- GLM adaptation
- Fix MTP accuracy for GLM-5

- `GlmMoeDsaForCausalLM` alias.
- #34385.
- `is_neox_style` handling in `deepseek_v2.py`.

- `references/pr-history.md`: diff-reviewed GLM-5 / 5.1 cards.

Use `skills/model-optimization/model-pr-diff-dossier/SKILL.md` as the production bar.
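The `lm_head` sharing described for #34385 can be illustrated with a small pure-Python sketch. This is not the vLLM implementation: `Linear`, `SharedHead`, `MTPLayer`, and `share_lm_head` are hypothetical names, and modeling the unshared head as an all-zero weight matrix is an assumption chosen only to reproduce the "logits become zero" symptom.

```python
from dataclasses import dataclass


@dataclass
class Linear:
    weight: list  # one row per vocabulary entry, one column per hidden dim

    def __call__(self, hidden):
        # Plain matrix-vector product: one logit per vocabulary row.
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.weight]


@dataclass
class SharedHead:
    head: Linear


@dataclass
class MTPLayer:
    shared_head: SharedHead


def share_lm_head(target_lm_head, mtp_layers):
    """Point every MTP layer's shared_head.head at the target model's lm_head."""
    for layer in mtp_layers:
        layer.shared_head.head = target_lm_head


# Assumption: an unshared draft head behaves like an all-zero projection.
target_lm_head = Linear([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
mtp_layers = [MTPLayer(SharedHead(Linear([[0.0, 0.0]] * 3))) for _ in range(2)]

hidden = [0.5, -2.0]
logits_before = mtp_layers[0].shared_head.head(hidden)
share_lm_head(target_lm_head, mtp_layers)
logits_after = mtp_layers[0].shared_head.head(hidden)
print(logits_before, logits_after)
```

A useful review check derived from this sketch is identity rather than equality: after sharing, each draft layer's head should be the same object as the target head, not a separately initialized copy.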
Every PR cited for this family must be based on diff reading, not only PR titles.
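For the rope interleave and `is_neox_style` handling noted earlier, the sketch below shows why the flag matters. It assumes the common convention in which NeoX style rotates dimension `i` with dimension `i + d/2`, while the interleaved style rotates adjacent pairs `(2i, 2i+1)`. The function names are hypothetical, and how `deepseek_v2.py` actually consumes `indexer_rope_interleave` should be confirmed from the diff.

```python
import math


def _angle(i, d, pos, base=10000.0):
    # Standard rotary frequency for pair index i in a head of size d.
    return pos * base ** (-2 * i / d)


def rope_neox(x, pos):
    """Rotate pairs (i, i + d/2): the 'rotate half' layout."""
    d, half = len(x), len(x) // 2
    out = [0.0] * d
    for i in range(half):
        c, s = math.cos(_angle(i, d, pos)), math.sin(_angle(i, d, pos))
        a, b = x[i], x[i + half]
        out[i], out[i + half] = a * c - b * s, a * s + b * c
    return out


def rope_interleaved(x, pos):
    """Rotate adjacent pairs (2i, 2i + 1): the interleaved layout."""
    d = len(x)
    out = [0.0] * d
    for i in range(d // 2):
        c, s = math.cos(_angle(i, d, pos)), math.sin(_angle(i, d, pos))
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i], out[2 * i + 1] = a * c - b * s, a * s + b * c
    return out


def interleaved_to_neox(x):
    """Permute an interleaved vector into the NeoX layout."""
    return x[0::2] + x[1::2]


x = [1.0, 2.0, 3.0, 4.0]
# Same input, different pairing convention: the outputs disagree.
print(rope_neox(x, 3) == rope_interleaved(x, 3))
# With the matching permutation applied, the two conventions agree.
print(rope_neox(interleaved_to_neox(x), 3) == interleaved_to_neox(rope_interleaved(x, 3)))
```

The two conventions differ only by a fixed permutation of dimensions, so a wrong flag does not crash; it silently degrades accuracy, which is why it belongs on the review checklist.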