skills/model-optimization/sglang/sglang-glm45-optimization/SKILL.md
PR-backed and current-main optimization manual for GLM-4.5 and GLM-4.5 Air/MoE in SGLang. Use when an engineer needs to recover, extend, or audit GLM-4.5 MoE loading, A2A/DeepEP, reduce-scatter behavior, NVFP4 padding, tool parser behavior, AMD/NPU/Blackwell validation, or GLM-4.5 cookbook recipes.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

`npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS sglang-glm45-optimization`
GLM-4.5 is the first GLM lane because it introduced GLM MoE support; later GLM-4.6/4.7/5.x work inherits many of its model, parser, quantization, and platform decisions.
Current evidence snapshot:
- origin/main: b3e6cf60a on 2026-04-22
- origin/main: 816bad5 on 2026-04-21
- 2026-04-23
- `references/pr-history.md`
- `python/sglang/srt/models/glm4.py`, `python/sglang/srt/models/glm4_moe.py`
- `python/sglang/srt/function_call/glm4_moe_detector.py`
- `docs/basic_usage/glm45.md`, `docs_new/cookbook/autoregressive/GLM/GLM-4.5.mdx`, `glm-45-deployment.jsx`

Use `skills/model-optimization/model-pr-diff-dossier/SKILL.md` as the production bar.
Every PR cited for this family must be based on diff reading, not only PR titles.
Do not summarize a GLM-4.5 optimization PR from title or search results. For every cited PR, read the actual diff first and record:
Use references/pr-history.md as the source of truth for already reviewed PRs. If a new PR is added, append a card with the same level of detail before using it in a recommendation.
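The "card before recommendation" rule above can be sketched as a small pre-check. This is a minimal sketch under stated assumptions: the card format of `references/pr-history.md` is not shown here, so the code assumes each card mentions its PR as `#<number>`. The helper names `has_pr_card` and `uncarded_prs` are hypothetical, and the PR numbers in the usage block are placeholders, not real GLM-4.5 PRs.

```python
import re
from pathlib import Path


def has_pr_card(history_text: str, pr_number: int) -> bool:
    """Return True if the history text mentions the PR as '#<number>'.

    Assumption: cards reference PRs as '#1234'. Adjust the pattern to the
    real card format used in references/pr-history.md.
    """
    # (?!\d) stops '#123' from matching inside '#1234'.
    pattern = re.compile(rf"#{pr_number}(?!\d)")
    return pattern.search(history_text) is not None


def uncarded_prs(history_text: str, cited: list[int]) -> list[int]:
    """List cited PR numbers that have no card yet, in the order given."""
    return [pr for pr in cited if not has_pr_card(history_text, pr)]


if __name__ == "__main__":
    history_path = Path("references/pr-history.md")
    text = history_path.read_text(encoding="utf-8") if history_path.exists() else ""
    # Placeholder PR numbers, for illustration only.
    missing = uncarded_prs(text, [1234, 5678])
    if missing:
        print("Read the diff and append a card first for:", missing)
```

A text match only shows that a card exists. It does not show that the card was written from the diff, so the check complements the manual review rather than replacing it.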
Capture:
- `--enable-deepep-moe`, reduce-scatter settings, and MoE backend

Treat GLM-4.5 as the GLM MoE baseline.
sglang-glm-vlm-ocr-optimization.

- `python/sglang/srt/models/glm4.py`
- `python/sglang/srt/models/glm4_moe.py`
- `python/sglang/srt/models/glm4_moe_lite.py`
- `python/sglang/srt/function_call/glm4_moe_detector.py`
- `docs/basic_usage/glm45.md`
- `docs_new/cookbook/autoregressive/GLM/GLM-4.5.mdx`
- `docs_new/src/snippets/autoregressive/glm-45-deployment.jsx`

- `references/pr-history.md`: manually diff-reviewed PR cards for GLM-4.5/GLM4-MoE/model parser work, including merged and open PRs.
- `references/playbook.md`: symptom map, investigation order, validation checklist, and change rules.
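Before auditing against the source and docs paths listed above, it can help to confirm they still exist in the checkout being reviewed, since files move between snapshots. The following is a minimal sketch, not part of the skill itself: `missing_paths` is a hypothetical helper, and it assumes all listed paths live under a single checkout root, which may not hold if the docs are kept in a separate repository.

```python
from pathlib import Path

# Paths copied from the list above; nothing here is tied to a specific commit.
GLM45_PATHS = [
    "python/sglang/srt/models/glm4.py",
    "python/sglang/srt/models/glm4_moe.py",
    "python/sglang/srt/models/glm4_moe_lite.py",
    "python/sglang/srt/function_call/glm4_moe_detector.py",
    "docs/basic_usage/glm45.md",
    "docs_new/cookbook/autoregressive/GLM/GLM-4.5.mdx",
    "docs_new/src/snippets/autoregressive/glm-45-deployment.jsx",
]


def missing_paths(repo_root, paths=GLM45_PATHS):
    """Return the listed paths that are not regular files under repo_root."""
    root = Path(repo_root)
    return [p for p in paths if not (root / p).is_file()]


if __name__ == "__main__":
    # Assumption: the script is run from the root of the checkout.
    for path in missing_paths("."):
        print("missing:", path)
```

A non-empty result suggests the snapshot is stale or the checkout is on a different branch, and the affected entries should be re-located before they are used as evidence.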