skills/model-optimization/vllm/vllm-glm46-glm47-optimization/SKILL.md
PR-backed optimization manual for GLM-4.6 / 4.7 in vLLM. Use when an engineer needs to audit, debug, extend, or document GLM-4.6, GLM-4.6V, GLM-4.7, GLM-4.7-Flash, GLM-Lite, and the parser / quant / fused-MoE deltas after the 4.5 generation.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS vllm-glm46-glm47-optimization
This skill covers GLM-4.6, GLM-4.6V, GLM-4.7, GLM-4.7-Flash, GLM-Lite, and the parser / quant / fused-MoE deltas after the 4.5 generation.
Evidence snapshot:
- 0f7be0f2f76814f80f9091220a5fbbb53912ad00
- references/pr-history.md
- model-pr-optimization-history/vllm/glm46-glm47/README.zh.md and README.en.md

Use skills/model-optimization/model-pr-diff-dossier/SKILL.md as the production bar.
Every PR cited for this family must be based on diff reading, not only PR titles.
- vllm/vllm/model_executor/models/glm4.py
- vllm/vllm/model_executor/models/glm4_moe.py
- vllm/vllm/model_executor/models/glm4v.py

- Add MoE tunings for GLM 4.6-FP8 and GLM 4.5 Air on B200: Added fused-MoE tuning configs for the new Blackwell deployment lane.
- Fix glm46 awq marlin moe compatibility: Closed an incompatibility between GLM-4.6 AWQ checkpoints and Marlin MoE assumptions.
- GLM-4.7 Tool Parser and Doc Update: Brought parser behavior and docs up to date for 4.7 / 4.7-Flash.
- GLM Model support for GLM-Lite: Extended the same runtime family to the Lite checkpoint line.
- Improve tool call parsing and content normalization for glm47: Fixed concrete parsing errors that surfaced in newer GLM-4.7 outputs.
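When auditing the tool-parser PRs listed above, it can help to have a small reference parser to compare outputs against. The sketch below is not vLLM's implementation: the `<tool_call>`, `<arg_key>`, and `<arg_value>` tag layout is an assumption about the model's output format, and `parse_tool_calls` and `_coerce` are hypothetical helper names. Confirm the real format against the parser source in the PR diff before relying on it.

```python
import json
import re

# ASSUMPTION: tool calls look like
#   <tool_call>name<arg_key>k</arg_key><arg_value>v</arg_value></tool_call>
# This layout is illustrative; confirm it against the actual parser diff.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ARG_RE = re.compile(
    r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL
)


def _coerce(raw):
    """Decode a value as JSON when possible, else keep the stripped string."""
    text = raw.strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text


def parse_tool_calls(output):
    """Split model output into (content, tool_calls).

    content is the text left after removing tool-call blocks, normalized
    to None when nothing but whitespace remains.
    """
    calls = []
    for body in TOOL_CALL_RE.findall(output):
        # The function name is whatever precedes the first <arg_key> tag.
        name = body.split("<arg_key>", 1)[0].strip()
        args = {key.strip(): _coerce(value) for key, value in ARG_RE.findall(body)}
        calls.append({"name": name, "arguments": args})
    content = TOOL_CALL_RE.sub("", output).strip()
    return content or None, calls
```

Running the same sample strings through this sketch and through the parser under review is one way to surface differences in whitespace handling, argument typing, and empty-content normalization.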