skills/model-optimization/sglang/sglang-minimax-m2-series-optimization/SKILL.md
PR-backed and current-main optimization manual for the `MiniMaxAI/MiniMax-M2` series, including M2, M2.1, M2.5, M2.7, and M2.7-highspeed. Use when an engineer needs to recover, extend, or audit MiniMax-specific optimizations, TP QK norm/all-reduce behavior, parser contracts, distributed runtime behavior, quantized loading, or backend-specific validation.
The skill covers the full MiniMax optimization ladder: mainline history, the remaining still-open upstream PR track, and current-main validation lanes. Use it to recover, extend, or audit MiniMax-specific optimizations, or to reuse the patterns on a structurally similar MoE model.
As of 2026-04-21, refreshed against SGLang `origin/main` commit `c122d343a`, the MiniMax story is split across three sources of truth: mainline history already merged to `main`, upstream PR work that is not on `main` yet, and current-main validation lanes.
This skill tracks all three, but it labels them clearly. Do not assume an optimization from a PR page is already in your local tree, and do not assume MiniMax-M2.7 or M2.7-highspeed is covered by MiniMax-M2.5 validation just because the same model file is used.
The historical evidence for every stage lives in:
Use skills/model-optimization/model-pr-diff-dossier/SKILL.md as the production bar.
Every PR cited for this family must be based on diff reading, not only PR titles.
Record the exact serving shape first:
M2, M2.1, M2.5, or M2.7
instruct or reasoning-style launch
native, AWQ, FP8, FP4, ModelSlim, or other quant format
TP / DP / EP / PP topology
DP attention enabled or not
DeepEP, FlashInfer, Triton, or other MoE / attention backend
piecewise CUDA graph enabled or not
speculative decoding or Eagle3 enabled or not
NVIDIA, AMD, NPU, or other backend
launch parser pair: --tool-call-parser minimax-m2 and --reasoning-parser minimax-append-think when tool/thinking behavior matters
exact registered suite, workflow job, or hardware lane used for validation
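The checklist above can be captured as a small record so that two runs are only compared when their shapes match. This is a minimal sketch, not part of SGLang: the `ServingShape` class, its field names, and the `"unspecified"` sentinel are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class ServingShape:
    """Hypothetical record of one MiniMax serving configuration."""
    model: str                       # "M2", "M2.1", "M2.5", or "M2.7"
    launch_style: str                # "instruct" or "reasoning"
    quant: str                       # "native", "AWQ", "FP8", "FP4", "ModelSlim", ...
    tp: int = 1
    dp: int = 1
    ep: int = 1
    pp: int = 1
    dp_attention: bool = False
    moe_backend: str = "unspecified"
    attention_backend: str = "unspecified"
    piecewise_cuda_graph: bool = False
    speculative: str = "none"        # "none" means speculative decoding is off
    hardware: str = "NVIDIA"
    tool_call_parser: Optional[str] = None
    reasoning_parser: Optional[str] = None
    validation_lane: Optional[str] = None

    def missing_fields(self):
        """Names of fields that were never filled in, in declaration order."""
        return [k for k, v in asdict(self).items() if v in (None, "unspecified")]


shape = ServingShape(
    model="M2.5", launch_style="reasoning", quant="FP8", tp=8, ep=8,
    tool_call_parser="minimax-m2", reasoning_parser="minimax-append-think",
)
print(shape.missing_fields())
```

Freezing the dataclass makes the record hashable, so it can key a table of benchmark results without accidental mutation.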
QK normalization depends on how heads are partitioned or replicated
M2.5 scale-out performance depends on communication strategy, not only kernels
quantized checkpoints depend on exact loader conventions
Do not treat MiniMax as a generic DeepSeek-like MoE.
`SGLANG_USE_FUSED_PARALLEL_QKNORM`.
The optimization order matters:
Reuse this skill on a non-MiniMax model when it shares one or more of these traits:
`num_key_value_heads < tp_size`
Reuse the order of investigation and validation discipline, not the MiniMax-specific constants.
Use this path when the target is MiniMaxAI/MiniMax-M2* and the problem is mostly inside the core model path already on main.
The model can launch, but the earliest support path is not yet optimized and may still miss MiniMax-specific surfaces.
basic model registration and weight loading
MiniMax-specific MoE, QK norm, and tool-call integration exist
do not confuse "supported" with "optimized"
#12129
python/sglang/srt/models/minimax_m2.py exists and is the active runtime path
later performance or correctness work has a stable MiniMax-specific home
MiniMax QK normalization is numerically sensitive. Before deeper optimization, the norm path must accumulate safely.
prefer fp32 accumulation in the norm path
treat QK norm correctness as a prerequisite for later TP work
#12186
the norm path no longer relies on lower-precision accumulation where MiniMax accuracy is sensitive
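To see why the accumulation dtype matters, the sketch below sums squares for an RMS-style norm twice: once rounding the running sum to half precision after every add, and once in a wider accumulator. It is an illustration only, not the SGLang kernel. A Python float (fp64) stands in for the fp32 accumulator, and half-precision rounding is emulated with the `struct` module's `e` format.

```python
import math
import struct


def round_to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sum_squares_fp16_accum(values):
    """Sum of squares with the running total rounded to fp16 after each add."""
    acc = 0.0
    for v in values:
        acc = round_to_fp16(acc + round_to_fp16(v * v))
    return acc


def sum_squares_wide_accum(values):
    """Sum of squares in a wide accumulator (stand-in for fp32)."""
    acc = 0.0
    for v in values:
        acc += v * v
    return acc


def rms_scale(sum_sq: float, n: int, eps: float = 1e-6) -> float:
    """The 1/rms factor an RMS norm multiplies each element by."""
    return 1.0 / math.sqrt(sum_sq / n + eps)


values = [1.0] * 4096
narrow = sum_squares_fp16_accum(values)   # stalls once the fp16 spacing exceeds 1
wide = sum_squares_wide_accum(values)
print(narrow, wide, rms_scale(narrow, len(values)), rms_scale(wide, len(values)))
```

With 4096 ones, the half-precision running sum stops growing at 2048 because adjacent fp16 values are 2 apart from there on, so the resulting scale is off by roughly a factor of 1.41.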
MiniMax needs to expose the same capture surfaces as other spec-decoding-capable models. Without them, speculative or auxiliary-hidden-state features fail even if base generation works.
capture intermediate hidden states for selected layers
expose get_embed_and_head
keep the speculative-decoding surface area on the MiniMax model, not on ad hoc wrappers
#12798
#13297
set_eagle3_layers_to_capture(...) works
get_embed_and_head() exists and downstream speculative code can call it
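A quick way to audit this stage is to check that the model object actually exposes both surfaces before enabling speculative decoding. The sketch below is hypothetical: only the two method names come from the text above, while the signatures, the stand-in classes, and the `missing_spec_surfaces` helper are assumptions made for illustration.

```python
REQUIRED_SPEC_SURFACES = ("set_eagle3_layers_to_capture", "get_embed_and_head")


def missing_spec_surfaces(model) -> list:
    """Return the required method names that the model does not provide."""
    return [
        name for name in REQUIRED_SPEC_SURFACES
        if not callable(getattr(model, name, None))
    ]


class BaseOnlyModel:
    """Stand-in for a model that can generate but has no capture surfaces."""

    def forward(self, x):
        return x


class SpecReadyModel(BaseOnlyModel):
    """Stand-in that keeps the speculative surfaces on the model itself."""

    def __init__(self):
        self.layers_to_capture = []
        self.embed = "embed-weights"      # placeholder for the embedding weights
        self.head = "lm-head-weights"     # placeholder for the LM head weights

    def set_eagle3_layers_to_capture(self, layer_ids=None):  # assumed signature
        self.layers_to_capture = list(layer_ids or [])

    def get_embed_and_head(self):
        return self.embed, self.head


print(missing_spec_surfaces(BaseOnlyModel()))
print(missing_spec_surfaces(SpecReadyModel()))
```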
Before tuning kernels, MiniMax needs the right MoE contract. This includes correct DeepEP forward usage and removing unnecessary router-side work.
keep the DeepEP forward path aligned with MiniMax's expert layout
do not add shared-expert logic that MiniMax does not use
remove unnecessary router work by specializing the top-k sigmoid path
#13892
#14047
the DeepEP MiniMax MoE path is functionally correct
the router no longer spends time on generic work MiniMax does not need
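As one illustration of trimming generic router work, the sketch below compares a generic top-k sigmoid router with a specialized one that ranks raw logits and applies the sigmoid only to the selected experts. This is not a description of what #13892 or #14047 changed. It assumes no per-expert bias is added to the scores before selection and that the selected weights are renormalized to sum to one; both are assumptions, and all function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def generic_topk_sigmoid(logits, k):
    """Score every expert with sigmoid, then pick and renormalize the top k."""
    scores = [sigmoid(x) for x in logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return [(i, scores[i] / total) for i in top]


def specialized_topk_sigmoid(logits, k):
    """Rank raw logits (sigmoid is monotonic), then score only the k winners."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    picked = [sigmoid(logits[i]) for i in top]
    total = sum(picked)
    return [(i, s / total) for i, s in zip(top, picked)]


logits = [0.1, 2.0, -1.0, 0.7]
print(generic_topk_sigmoid(logits, 2))
print(specialized_topk_sigmoid(logits, 2))
```

The two routers agree because sigmoid preserves ordering; the specialized version evaluates it k times instead of once per expert.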
For MiniMax, QK normalization is a real decode hotspot. Once correctness is solid, the next gains come from fusing the TP-aware norm path instead of doing separate generic operations.
compute Q and K norm together
keep TP-aware reduction in the same specialized path
preserve the custom all-reduce fast path by keeping reduction buffers aligned
prefer the fused JIT TP QK norm path on supported CUDA launches instead of stopping at the older in-model RMSNormTP path
check SGLANG_USE_FUSED_PARALLEL_QKNORM, fused_parallel_qknorm(...), and CustomAllReduceV2 before claiming a missing all-reduce optimization
#14416
#16483
#20673
MiniMaxM2RMSNormTP is the active per-layer QK norm implementation
the reduction path consistently selects the fast aligned all-reduce path
MiniMaxM2QKRMSNorm can use the fused TP QK norm custom op when the JIT path is enabled and supported
focused validation exists in python/sglang/jit_kernel/tests/test_tp_qknorm.py and python/sglang/jit_kernel/benchmark/bench_tp_qknorm.py
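The idea of computing the Q and K norms together under TP can be sketched without any GPU code: each rank packs its partial sums of squares for Q and K into one buffer, a single reduction combines them, and each rank then scales its own shard. The code below simulates ranks with plain lists. It is not the `fused_parallel_qknorm(...)` kernel or `CustomAllReduceV2`; the learned norm weights are omitted and every function name is hypothetical.

```python
import math


def all_reduce_sum(per_rank_buffers):
    """Simulated all-reduce: every rank receives the elementwise sum."""
    total = [sum(col) for col in zip(*per_rank_buffers)]
    return [list(total) for _ in per_rank_buffers]


def fused_tp_qk_rmsnorm(q_shards, k_shards, q_dim, k_dim, eps=1e-6):
    """Normalize sharded Q and K using ONE reduction for both tensors."""
    # Each rank packs [sum(q^2), sum(k^2)] into a single two-element buffer.
    local = [
        [sum(x * x for x in q), sum(x * x for x in k)]
        for q, k in zip(q_shards, k_shards)
    ]
    reduced = all_reduce_sum(local)  # one collective instead of two
    out = []
    for (q, k), (q_sq, k_sq) in zip(zip(q_shards, k_shards), reduced):
        q_scale = 1.0 / math.sqrt(q_sq / q_dim + eps)
        k_scale = 1.0 / math.sqrt(k_sq / k_dim + eps)
        out.append(([x * q_scale for x in q], [x * k_scale for x in k]))
    return out


def reference_rmsnorm(x, eps=1e-6):
    """Unsharded RMS norm used to check the sharded result."""
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale for v in x]


q_full, k_full = [1.0, 2.0, 3.0, 4.0], [0.5, -1.5]
out = fused_tp_qk_rmsnorm([[1.0, 2.0], [3.0, 4.0]], [[0.5], [-1.5]], q_dim=4, k_dim=2)
q_cat = out[0][0] + out[1][0]
k_cat = out[0][1] + out[1][1]
print(q_cat, k_cat)
```

Packing both partial sums into one small, fixed-size buffer is also what makes it plausible to keep the buffer aligned for a fast all-reduce path, as the checklist above asks.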
Once the core hot paths are in place, MiniMax needs to remain usable under graph capture and PP partitioning.
keep piecewise CUDA graph contexts correct around MoE expert-distribution recording
propagate pp_proxy_tensors
make weight loading layer-range aware under PP
#18217
#19577
Family-adjacent caveat:
#18310 is for MiniMax-M2.1 and focuses on a torch.compile plus CUDA-graph crash. It is not the core M2 mainline optimization ladder, but it is worth borrowing if graph tracing regresses on a MiniMax-family branch.
MiniMax can run under PP without wrapper gaps
piecewise CUDA graph support does not regress the MiniMax-specific path
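Layer-range-aware loading under PP can be sketched as a filter over checkpoint weight names. The example below is a hypothetical illustration: the even contiguous split, the `.layers.<idx>.` naming pattern, and both helper names are assumptions, and the real partitioning in SGLang may differ.

```python
import re

_LAYER_RE = re.compile(r"\.layers\.(\d+)\.")


def pp_layer_range(num_layers: int, pp_size: int, pp_rank: int):
    """Contiguous split; any remainder layers go to the earliest ranks."""
    base, extra = divmod(num_layers, pp_size)
    start = pp_rank * base + min(pp_rank, extra)
    end = start + base + (1 if pp_rank < extra else 0)
    return start, end


def filter_layer_weights(names, start: int, end: int):
    """Keep layer weights in [start, end); pass non-layer names through.

    Non-layer weights (embeddings, final norm, LM head) are left for the
    caller to place on the first or last stage.
    """
    kept = []
    for name in names:
        m = _LAYER_RE.search(name)
        if m is None or start <= int(m.group(1)) < end:
            kept.append(name)
    return kept


names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.qkv_proj.weight",
    "model.layers.3.self_attn.qkv_proj.weight",
    "model.layers.9.self_attn.qkv_proj.weight",
]
start, end = pp_layer_range(num_layers=10, pp_size=4, pp_rank=1)
print(start, end, filter_layer_weights(names, start, end))
```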
Use this path when the target is MiniMaxAI/MiniMax-M2.5 or another later MiniMax-family checkpoint. Start from the M2 core stages above, then continue here.
M2.5 stresses loading and quantized checkpoint conventions much harder than the early M2 path.
preserve packed_modules_mapping
preserve KV-cache scale remapping
keep ModelSlim-specific layer assumptions consistent with MiniMax layout
#19995
#20870
#20905
packed qkv and gate-up modules load correctly
KV cache scales are not silently dropped
ModelSlim quant layers do not assume a different MoE layout
Status:
Tracked upstream PR work; not fully present in origin/main commit c122d343a as of 2026-04-21.
Some M2.5 quantized checkpoints use fused expert naming that the current mainline loader still does not fully cover.
support fused expert mappings such as w13
prefer explicit fused mapping before falling back to older w1/w2/w3 logic
add a focused weight-loading test when you port this work
#20031
AWQ or similar M2.5 checkpoints with fused expert weights load without local remapping hacks
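The "explicit fused mapping first, older names second" rule can be sketched as a name resolver. This is an illustration under assumptions, not the loader from #20031: the checkpoint naming pattern `experts.<id>.<proj>.<suffix>`, the target parameter names `w13_<suffix>` and `w2_<suffix>`, and the function name are all hypothetical.

```python
import re

_EXPERT_RE = re.compile(
    r"^(?P<prefix>.*experts)\.(?P<expert>\d+)\.(?P<proj>w13|w1|w2|w3)\.(?P<suffix>\w+)$"
)

# Checked first: the checkpoint already stores gate and up fused as w13.
FUSED_MAPPING = {"w13": ("w13", "w13")}
# Fallback: the older per-projection names, stacked into the fused parameters.
UNFUSED_MAPPING = {"w1": ("w13", "w1"), "w3": ("w13", "w3"), "w2": ("w2", "w2")}


def resolve_expert_weight(name: str):
    """Map a checkpoint name to (fused_param_name, expert_id, shard_id)."""
    m = _EXPERT_RE.match(name)
    if m is None:
        return None  # not an expert weight; handled by the generic loader
    proj = m["proj"]
    target, shard = FUSED_MAPPING.get(proj) or UNFUSED_MAPPING[proj]
    param = f"{m['prefix']}.{target}_{m['suffix']}"
    return param, int(m["expert"]), shard


print(resolve_expert_weight("model.layers.0.block_sparse_moe.experts.3.w13.qweight"))
print(resolve_expert_weight("model.layers.0.block_sparse_moe.experts.3.w1.weight"))
```

A focused weight-loading test for this stage can start from exactly these name pairs.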
Status:
Partly on main, partly still tracked from upstream PR work as of origin/main commit c122d343a on 2026-04-21.
For M2.5, the next bottleneck is often not a single kernel. It is the distributed contract across PP, EP, DP, and DeepEP.
keep PP support from the mainline path
make DeepEP runtime requirements explicit, especially hidden-size and dtype expectations
treat DP support and DP-attention support as separate stages
#19577
#17826
#19468
PP launches correctly
DeepEP no longer fails due to unsupported hidden size or dtype mismatch
the runtime contract is written down for the exact TP / DP / EP / PP shape you care about
Status:
Mixed: #20067 is part of main; #20489 and #20975 remain tracked as upstream PR work not fully present in origin/main commit c122d343a as of 2026-04-21.
This is the biggest M2.5 scale-out gap. Performance and correctness both depend on using the attention-TP group rather than blindly reusing the model-TP group.
use attention TP group and rank instead of global TP group in MiniMax attention
allow reduce-scatter after MoE when padding or DEP makes it profitable
support FP4 all-gather when the communication path can quantize before transport
allow all-reduce fusion between the MoE output and the next attention preparation
guard zero-token and empty-batch paths
#20067
#20489
#20975
DP-attention MiniMax uses attention-TP metadata consistently
DEP no longer performs an unnecessary all-reduce plus slice
empty-batch or high-rank edge cases no longer crash
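The difference between the model-TP group and the attention-TP group can be made concrete with a small rank calculation. The sketch below assumes that, with DP attention enabled, the TP ranks are divided evenly into `dp_size` attention groups; that convention, the helper names, and the returned keys are assumptions for illustration rather than the SGLang implementation.

```python
def attn_tp_info(tp_size: int, tp_rank: int, dp_size: int, enable_dp_attention: bool):
    """Derive attention-TP size and rank from the global TP rank."""
    if not enable_dp_attention:
        return {"attn_tp_size": tp_size, "attn_tp_rank": tp_rank, "attn_dp_rank": 0}
    if tp_size % dp_size != 0:
        raise ValueError("tp_size must be divisible by dp_size")
    attn_tp_size = tp_size // dp_size
    return {
        "attn_tp_size": attn_tp_size,
        "attn_tp_rank": tp_rank % attn_tp_size,
        "attn_dp_rank": tp_rank // attn_tp_size,
    }


def guarded_attention(tokens, attend):
    """Skip the attention call entirely when this rank received no tokens."""
    if len(tokens) == 0:
        return []
    return attend(tokens)


info = attn_tp_info(tp_size=8, tp_rank=5, dp_size=4, enable_dp_attention=True)
print(info)
print(guarded_attention([], attend=lambda t: [x * 2 for x in t]))
```

Head sharding and norm reductions inside attention would then use `attn_tp_size` and `attn_tp_rank`, not the global values.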
Status:
Mainline as #20673 by origin/main commit c122d343a on 2026-04-21.
The older QK norm path was already specialized; the newer mainline path pushes it further by moving to a fused JIT kernel that reuses custom all-reduce v2 more efficiently.
fuse TP Q and K norm into one custom op
keep a fallback path for unsupported environments
add a dedicated benchmark and unit test with the PR
#20673
the MiniMax path can use the fused TP QK norm custom op
the fallback path is still available when the JIT kernel cannot run
Status:
Mainline as #20967 by origin/main commit c122d343a on 2026-04-21.
When num_key_value_heads < tp_size, multiple TP ranks can share the same KV head. That means the K norm weights and reductions must follow the replica layout, not a naive full-TP assumption.
shard norm weights by logical head replica
reduce only across ranks that share the same head
do not assume the full TP group is the correct reduction group
#20967
high-TP MiniMax-M2.5 runs do not produce repeated or garbled output caused by incorrect K norm sharding
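The replica layout can be worked out with integer arithmetic. The sketch below assumes KV heads are replicated contiguously, so that consecutive TP ranks share a head; that layout and the `kv_replica_layout` helper are assumptions for illustration, and the actual layout in #20967 should be confirmed from the diff.

```python
def kv_replica_layout(num_kv_heads: int, tp_size: int, tp_rank: int):
    """Return the KV heads this rank holds and the ranks sharing them."""
    if num_kv_heads >= tp_size:
        if num_kv_heads % tp_size != 0:
            raise ValueError("num_kv_heads must be divisible by tp_size")
        per_rank = num_kv_heads // tp_size
        heads = list(range(tp_rank * per_rank, (tp_rank + 1) * per_rank))
        return {"heads": heads, "replica_group": [tp_rank]}
    if tp_size % num_kv_heads != 0:
        raise ValueError("tp_size must be divisible by num_kv_heads")
    replicas = tp_size // num_kv_heads       # ranks per KV head
    head = tp_rank // replicas               # the single head this rank holds
    group = list(range(head * replicas, (head + 1) * replicas))
    return {"heads": [head], "replica_group": group}


# 4 KV heads on TP8: each head is replicated on 2 ranks.
print(kv_replica_layout(num_kv_heads=4, tp_size=8, tp_rank=5))
# 8 KV heads on TP4: no replication, 2 heads per rank.
print(kv_replica_layout(num_kv_heads=8, tp_size=4, tp_rank=1))
```

The `heads` entry is what the K norm weight shard should follow, and `replica_group` is the set of ranks that share a head rather than the full TP group.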
Status:
Mainline as #19652 by origin/main commit c122d343a on 2026-04-21; not MiniMax-specific, but directly relevant to some MiniMax-M2.5 deployments.
If the target checkpoint is an NVFP4 MiniMax variant on A100, H100, A40, or another non-Blackwell GPU, the real blocker may be the generic FP4 Marlin fallback rather than MiniMax model code.
keep weights compressed in FP4
route unsupported native FP4 cases to Marlin fallback
preserve both linear and MoE fallback paths
#19652
NVFP4 MiniMax-family checkpoints can run coherently on non-Blackwell GPUs without decompression hacks
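The routing decision above can be sketched as a per-layer-type choice keyed on the GPU's compute capability. This is a hypothetical illustration: treating a major version of 10 or higher as "has native FP4" is an assumption of the sketch, and the function names and returned labels are not SGLang identifiers.

```python
BLACKWELL_MIN_MAJOR = 10  # assumption: native FP4 starts at compute capability 10.x


def choose_fp4_path(cc_major: int, marlin_available: bool = True) -> str:
    """Pick native FP4 where supported, otherwise the Marlin fallback."""
    if cc_major >= BLACKWELL_MIN_MAJOR:
        return "native_fp4"
    if marlin_available:
        return "marlin_fallback"   # weights stay compressed in FP4
    raise RuntimeError("no FP4 execution path available on this GPU")


def plan_fp4_layers(cc_major: int, marlin_available: bool = True) -> dict:
    """Both linear and MoE layers need a path; neither may be dropped."""
    path = choose_fp4_path(cc_major, marlin_available)
    return {"linear": path, "moe": path}


print(plan_fp4_layers(9))    # e.g. H100
print(plan_fp4_layers(10))   # e.g. a Blackwell part
```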
Use this path when the target is MiniMaxAI/MiniMax-M2.7, MiniMaxAI/MiniMax-M2.7-highspeed, or when an AMD MiniMax change might affect the currently registered M2.7 lanes. Current main has explicit AMD accuracy and performance coverage for M2.7, while first-class M2.7 and M2.7-highspeed docs are tracked by upstream PR #20873.
M2.7 currently reuses the MiniMax-M2-family runtime code, but the active registered tests are not just copies of M2.5. They launch MiniMaxAI/MiniMax-M2.7 on AMD with TP8+EP8 and the aiter attention backend.
keep the model-file assumptions from the M2/M2.5 ladder unless current code proves M2.7 has a new runtime path
validate M2.7 separately from M2.5 on AMD when changing attention, MoE communication, loader, or aiter-related behavior
use the registered M2.7 model path override MINIMAX_M27_MODEL_PATH for local mirrors
preserve launch details from the current tests: --tp 8, --ep-size 8, --attention-backend aiter, SGLANG_USE_AITER=1, --mem-fraction-static 0.85, multithread loading, and a long watchdog timeout
inspect both MI30x/MI325 and MI35x lanes because they use distinct registered suites
test/registered/amd/accuracy/mi30x/test_minimax_m27_eval_amd.py
test/registered/amd/perf/mi30x/test_minimax_m27_perf_amd.py
test/registered/amd/accuracy/mi35x/test_minimax_m27_eval_mi35x.py
test/registered/amd/perf/mi35x/test_minimax_m27_perf_mi35x.py
.github/workflows/nightly-test-amd.yml
.github/workflows/nightly-test-amd-rocm720.yml
M2.7 accuracy and performance suites pass on the target AMD lane
M2.5 and M2.7 failures are triaged independently
docs do not become the only source of truth for M2.7 until a first-class M2.7 usage doc exists
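For local reproduction, the launch details listed above can be assembled into an argument list and environment. This is a minimal sketch that only builds the command and does not start a server. The `build_m27_launch` helper is hypothetical, and the flags for multithread loading and the watchdog timeout are deliberately left out because their exact spelling is not given here; take them from the registered tests.

```python
import os


def build_m27_launch(environ=None):
    """Return (argv, env) mirroring the M2.7 AMD launch details listed above."""
    environ = dict(os.environ if environ is None else environ)
    # MINIMAX_M27_MODEL_PATH lets a local mirror replace the hub model id.
    model_path = environ.get("MINIMAX_M27_MODEL_PATH", "MiniMaxAI/MiniMax-M2.7")
    env = {**environ, "SGLANG_USE_AITER": "1"}
    argv = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", "8",
        "--ep-size", "8",
        "--attention-backend", "aiter",
        "--mem-fraction-static", "0.85",
        # Multithread loading and watchdog flags intentionally omitted (see lead-in).
    ]
    return argv, env


argv, env = build_m27_launch(environ={})
print(" ".join(argv))
```

The returned pair can be handed to `subprocess.Popen(argv, env=env)` once the omitted flags are filled in from the test files.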
Check these active upstream tracks before designing a new MiniMax skill or declaring a gap:
`routed_experts_weights_of_layer` on `MiniMaxM2ForCausalLM`.
`--enable-tf32-matmul` support aimed at reducing FP32 gate GEMM cost for MiniMax-M2.5 decode.
When debugging a MiniMax issue, prefer this order:
For the supporting evidence and commands, use:
`packed_modules_mapping` or KV-scale remapping just to make one checkpoint load.
`main` yet.
`--tool-call-parser minimax-m2` or `--reasoning-parser minimax-append-think` when validating tool or reasoning behavior from the serving docs.
`SGLANG_USE_FUSED_PARALLEL_QKNORM` when benchmarking or diagnosing the current TP QK norm path.