skills/model-optimization/sglang/sglang-mimo-v2-flash-optimization/SKILL.md
PR-backed optimization manual for MiMo-V2 / MiMo-V2-Flash / MiMo-V2.5 in SGLang. Use when an engineer needs to audit, debug, extend, or document MiMo-V2 inference-centric MoE runtime, flashinfer/TRT-LLM fused all-reduce, overlap, MTP/EAGLE, multimodal/pro variants, and reasoning parser behavior.
npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS sglang-mimo-v2-flash-optimization
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill covers the MiMo-V2 family in SGLang: historical MiMo-V2-Flash,
current MiMoV2ForCausalLM compatibility, MiMo-V2.5/Pro deployment recipes,
MTP/EAGLE paths, flashinfer/TRT-LLM fused all-reduce, overlap, multimodal
variants, and reasoning parser behavior.
Evidence snapshot:
- origin/main: 6fbad22fe on 2026-04-28
- e88b0fd8ac5b1caa6eb42766035029220053369b
- #23808 renamed the runtime to mimo_v2.py / mimo_v2_nextn.py while keeping the old Flash architecture alias loadable; #23851 added the MiMo-V2.5 cookbook and command generator.
- references/pr-history.md
- model-pr-optimization-history/sglang/mimo-v2-flash/README.zh.md and README.en.md

Use skills/model-optimization/model-pr-diff-dossier/SKILL.md as the production bar.
Every PR cited for this family must be based on diff reading, not only PR titles.

- sglang/python/sglang/srt/models/mimo_v2.py
- sglang/python/sglang/srt/models/mimo_v2_nextn.py
- sglang/python/sglang/srt/models/mimo_v2_flash.py, sglang/python/sglang/srt/models/mimo_v2_flash_nextn.py
- sglang/python/sglang/srt/configs/model_config.py, sglang/python/sglang/srt/server_args.py
- docs_new/cookbook/autoregressive/Xiaomi/MiMo-V2.5.mdx, docs_new/src/snippets/autoregressive/mimo-v25-deployment.jsx

- MiMoV2ForCausalLM and the old MiMoV2FlashForCausalLM alias through the renamed mimo_v2.py module.
- qkv_proj constraints.

- MiMo-V2-Flash day0 support: Initial MiMo-V2-Flash landing.
- Optimize MiMo-V2-Flash by flashinfer fused allreduce: Targeted decode-side communication cost.
- Respect --swa-full-tokens-ratio: Fixed a concrete runtime flag integration bug.
- Support two batch overlap: Added overlap / throughput optimization.
- Add mimo reasoning parser: Completed the parser path for thinking outputs.
- Xiaomi MiMo-V2.5-Pro day0 support: Renamed runtime files to mimo_v2.py / mimo_v2_nextn.py, added MiMoV2ForCausalLM, and kept the Flash alias.
- Add MiMo-V2.5 docs: Added current MiMo-V2.5/Pro deployment cookbook and command generator.

- qkv_proj loading, hybrid SWA/full attention, and registered/manual tests after touching loader or processor code.

- references/pr-history.md: diff-reviewed MiMo-V2 family PR cards; it includes both historical Flash-file cards and current mimo_v2.py / MiMo-V2.5 updates.
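
The alias behavior noted for #23808 (the old Flash architecture name stays loadable after the rename) can be pictured with a minimal sketch. This is not SGLang's registry code: the stand-in class, `ARCHITECTURE_REGISTRY`, and `resolve_architecture` are hypothetical names used only to illustrate two architecture strings resolving to one implementation.

```python
from typing import Dict, Type


class MiMoV2ForCausalLM:
    """Hypothetical stand-in for the runtime class that lives in mimo_v2.py."""

    def __init__(self, config: dict) -> None:
        self.config = config


# Hypothetical registry keyed by the architecture string found in a checkpoint config.
ARCHITECTURE_REGISTRY: Dict[str, Type[MiMoV2ForCausalLM]] = {
    "MiMoV2ForCausalLM": MiMoV2ForCausalLM,
    # Old Flash name kept as an alias so existing checkpoints still resolve.
    "MiMoV2FlashForCausalLM": MiMoV2ForCausalLM,
}


def resolve_architecture(name: str) -> Type[MiMoV2ForCausalLM]:
    """Return the class registered for `name`, or raise a descriptive error."""
    try:
        return ARCHITECTURE_REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(ARCHITECTURE_REGISTRY))
        raise ValueError(f"Unknown architecture {name!r}; known: {known}") from None
```

When auditing a rename like this, a useful check is that both names resolve to the same class object, so loader or processor changes cannot diverge between the old and new entry points.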