skills/model-optimization/sglang/sglang-glm5-glm51-optimization/SKILL.md
PR-backed and current-main optimization manual for GLM-5 and GLM-5.1 in SGLang. Use when an engineer needs to recover, extend, or audit GLM-5 DSA/NSA/NSA indexer paths, GLM-5.1 FP8/MXFP4/NVFP4, NextN/MTP, dense-attention threshold, NSA TileLang/AITER, tool templates, EAGLE, PCG, AMD/Blackwell/NPU validation, or GLM-5 cookbook recipes.
Install this skill globally with one command: `npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS sglang-glm5-glm51-optimization`. Works with Claude Code, Cursor, and Windsurf.
GLM-5/5.1 is a separate lane because it moves from GLM4 MoE into GLM MoE DSA/NSA-adjacent behavior, NextN/MTP, FP8/MXFP4/NVFP4, NSA indexer work, and GLM-5.1 tool-template details.
Current evidence snapshot:
- origin/main: bca3dd958 on 2026-04-24
- origin/main: 816bad5 on 2026-04-21
- glm4_moe.py, glm4_moe_nextn.py, deepseek_nextn.py when shared MTP infrastructure is touched
- GLM-5.mdx, GLM-5.1.mdx, glm-5-deployment.jsx, glm-51-deployment.jsx
- test/registered/8-gpu-models/test_glm_51_fp8.py, test/registered/gb300/test_glm5_fp8.py, test/registered/gb300/test_glm5_nvfp4.py

Use skills/model-optimization/model-pr-diff-dossier/SKILL.md as the production bar.
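Before relying on the snapshot, it can help to confirm that the listed test files still exist in the checkout being audited. The following is a minimal sketch, assuming a local SGLang checkout; the helper name `missing_paths` and the constant `SNAPSHOT_TESTS` are hypothetical and not part of SGLang.

```python
from pathlib import Path

# Test files named in the evidence snapshot above.
SNAPSHOT_TESTS = [
    "test/registered/8-gpu-models/test_glm_51_fp8.py",
    "test/registered/gb300/test_glm5_fp8.py",
    "test/registered/gb300/test_glm5_nvfp4.py",
]


def missing_paths(repo_root, relative_paths):
    """Return the relative paths that do not exist under repo_root.

    Hypothetical helper for illustration only.
    """
    root = Path(repo_root)
    return [p for p in relative_paths if not (root / p).exists()]


if __name__ == "__main__":
    import sys

    # Default to the current directory when no checkout path is given.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path in missing_paths(root, SNAPSHOT_TESTS):
        print(f"missing: {path}")
```

A non-empty result suggests the snapshot is stale for that checkout and should be refreshed before it is cited.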
Every PR cited for this family must be based on diff reading, not only PR titles.
Capture:
GLM-5/5.1 should be debugged like a GLM-specific DSA/MTP system, not like plain GLM4.
deepseek_nextn.py even when the user-facing model is GLM.

- python/sglang/srt/models/glm4_moe.py
- python/sglang/srt/models/glm4_moe_nextn.py
- python/sglang/srt/models/deepseek_nextn.py
- python/sglang/srt/layers/attention/nsa/
- docs_new/cookbook/autoregressive/GLM/GLM-5.mdx
- docs_new/cookbook/autoregressive/GLM/GLM-5.1.mdx
- docs_new/src/snippets/autoregressive/glm-5-deployment.jsx
- docs_new/src/snippets/autoregressive/glm-51-deployment.jsx

Before adding GLM-5/5.1 PR evidence, open the PR diff/source and write a full card in references/pr-history.md with motivation, key implementation, code excerpt, reviewed files, and validation implications.
Do not add one-line open PR lists. Open PRs can be recorded only after their diffs are reviewed and they are clearly separated from merged history.
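The card requirement above can be turned into a quick completeness check. This is a minimal sketch, assuming each card in references/pr-history.md names its five sections literally in the text; the actual card layout is not specified here, and `missing_card_fields` is a hypothetical helper.

```python
# The five sections a full PR card must contain, per the rule above.
REQUIRED_FIELDS = (
    "motivation",
    "key implementation",
    "code excerpt",
    "reviewed files",
    "validation implications",
)


def missing_card_fields(card_text):
    """Return required section names that never appear in the card text.

    Assumption: section names appear literally (case-insensitive) in the card.
    """
    lowered = card_text.lower()
    return [field for field in REQUIRED_FIELDS if field not in lowered]


if __name__ == "__main__":
    # A one-line entry of the kind the rule forbids.
    one_liner = "Motivation: speed up the NSA indexer"
    print(missing_card_fields(one_liner))
```

A card that returns an empty list has every section present, but this checks only the headings; it does not show that the diff was actually read.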