skills/model-optimization/sglang/sglang-deepseek-v3-r1-optimization/SKILL.md
PR-backed and current-main optimization manual for DeepSeek V3 and DeepSeek R1 in SGLang. Use when an engineer needs to recover, extend, or audit DeepSeek V3/R1 MLA, MoE, shared experts, FP8/FP4/W4AFP8/MXFP4/NVFP4 loading, MTP, DeepEP, DP attention, LoRA, backend selection, or validation lanes.
This skill covers the DeepSeek V3/R1 optimization ladder that is active in SGLang main. It intentionally excludes the V3.1 parser delta and the V3.2 DSA/NSA sparse-attention stack, which have separate skills.
Current-main snapshot:
- origin/main: 929e00eea on 2026-04-21
- origin/main: 8ec4d03 on 2026-04-21
- python/sglang/srt/models/deepseek_v2.py
- DeepseekV3ForCausalLM
- DeepseekV3ForCausalLMNextN
The historical evidence lives in:
Use skills/model-optimization/model-pr-diff-dossier/SKILL.md as the production bar.
Every PR cited for this family must be based on diff reading, not only PR titles.
Record the exact serving shape first:
- --reasoning-parser deepseek-v3 for V3 thinking-style output, --reasoning-parser deepseek-r1 for R1, and --tool-call-parser deepseekv3 for V3/R1 tool calls
Do not treat DeepSeek V3/R1 as only one optimization.
- --enforce-shared-experts-fusion is set.
- trtllm_mla and flashinfer_trtllm; on ROCm, AITER and TileLang paths are separate validation surfaces.
The optimization order matters:
Start from these files before changing behavior:
- python/sglang/srt/models/deepseek_v2.py
- python/sglang/srt/models/deepseek_nextn.py
- python/sglang/srt/models/deepseek_common/deepseek_weight_loader.py
- python/sglang/srt/models/deepseek_common/attention_backend_handler.py
- python/sglang/srt/models/deepseek_common/attention_forward_methods/
- python/sglang/srt/layers/attention/flashattention_backend.py
- python/sglang/srt/layers/radix_attention.py
- python/sglang/srt/mem_cache/memory_pool.py
- python/sglang/srt/layers/quantization/fp8_kernel.py
- python/sglang/srt/layers/quantization/deep_gemm.py
- python/sglang/compile_deep_gemm.py
- python/sglang/srt/layers/moe/topk.py
- python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py
- python/sglang/srt/layers/moe/cutlass_w4a8_moe.py
- python/sglang/srt/layers/moe/ep_moe/layer.py
- python/sglang/srt/layers/moe/token_dispatcher/deepep.py
- python/sglang/srt/layers/quantization/w4afp8.py
- python/sglang/srt/layers/rotary_embedding.py
- python/sglang/srt/server_args.py
- python/sglang/srt/managers/schedule_batch.py
- python/sglang/srt/mem_cache/common.py
- python/sglang/srt/parser/reasoning_parser.py
- python/sglang/srt/function_call/deepseekv3_detector.py
- sgl-kernel/csrc/moe/moe_fused_gate.cu
- sgl-kernel/csrc/moe/moe_align_kernel.cu
- sgl-kernel/csrc/attention/merge_attn_states.cu
- sgl-kernel/include/sgl_kernel_ops.h
Check these before declaring a V3/R1 gap:
- deepseek_v2, open.
- dsv3_router_gemm from AOT sgl-kernel to JIT, open.
- prepare_qkv_latent, open.
- 2026-04-24T01:59:51Z.
- .weight access in DeepseekV2AttentionMLA for AWQ/compressed-tensors, open.
- DeepseekV2MoE with CuteDSL EP plus DP attention, open.
Known reverted track:
Known exploratory or closed tracks:
- group_gemm_masked BMM path for MLA FP8 quantization. Treat it as an explored path, not as the default production H200 speed path.
Check these additional current-main runtime tracks alongside the original H200 ladder:
- speculative_num_steps for EAGLE top-k=1. For V3/R1 MTP, inspect server_args.py, speculative runtime params, and EAGLE worker state before assuming the number of draft steps is static.
- model_runner.py, piecewise_cuda_graph_runner.py, and the server flag gate instead of treating the combination as unsupported.
- schedule_batch.py, mem_cache/common.py, and server_args.py; it matters for DeepSeek V3/R1 reasoning-parser cache reuse, especially when thinking tokens should not become a reusable prefix.
The single-node H200 optimization notes explicitly name a March-May 2025 H200 optimization ladder. A complete V3/R1 audit must include those PRs because many later abstractions hide the original performance reason.
Required H200 PR coverage:
When updating this skill, explicitly mark whether an H200 optimization is still current-main default, current-main optional, hardware-specific, or only an explored/closed path.
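One way to make that marking mechanical is to keep each ladder entry in a small record with an explicit status field. The sketch below is illustrative only: `LadderStatus`, `LadderEntry`, and `unmarked` are hypothetical names, and the PR field is a placeholder because no PR numbers are given here. The four statuses are the ones named above. The first example entry is the group_gemm_masked path that this skill already treats as explored; the second is a made-up unclassified entry.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class LadderStatus(Enum):
    """The four states an H200 optimization can be in (taken from the text)."""
    CURRENT_MAIN_DEFAULT = "current-main default"
    CURRENT_MAIN_OPTIONAL = "current-main optional"
    HARDWARE_SPECIFIC = "hardware-specific"
    EXPLORED_OR_CLOSED = "explored/closed path"


@dataclass(frozen=True)
class LadderEntry:
    pr: str                                 # placeholder text, not a real PR number
    summary: str
    status: Optional[LadderStatus] = None   # None means "not yet classified"


def unmarked(entries: List[LadderEntry]) -> List[LadderEntry]:
    """Return the entries that still lack an explicit status."""
    return [entry for entry in entries if entry.status is None]


entries = [
    LadderEntry(
        pr="<PR number not given>",
        summary="group_gemm_masked BMM path for MLA FP8 quantization",
        status=LadderStatus.EXPLORED_OR_CLOSED,
    ),
    # Hypothetical entry that has not been classified yet.
    LadderEntry(pr="<PR number not given>", summary="example unclassified entry"),
]

for entry in unmarked(entries):
    print(f"missing status: {entry.summary}")
```

A check like this could run before a skill update is accepted, so that no ladder entry is left without one of the four labels.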
The H200 ladder is the missing context behind many later V3/R1 defaults.
- fp8_kernel.py, deep_gemm.py, and compile_deep_gemm.py; Hopper/Blackwell defaults should be checked with SGLANG_ENABLE_JIT_DEEPGEMM.
- fused_topk_deepseek abstraction. moe_fused_gate.cu, moe_align_kernel.cu, per_token_group_quant_8bit, routed scaling fusion, and shared-expert fusion all belong to this stage.
- server_args.py, attention_backend_handler.py, and docs/basic_usage/deepseek_v3.md.
- forward_absorb copies, fusing q_a_proj with kv_a_proj_with_mqa, fusing MLA KV-cache writes, overlapping q/k norm, and removing scalar/allocator overhead all live on the hot path.
Success check:
Early DeepSeek support can launch, but the optimized path needs hardware-specific MLA, MoE, and quant handling.
- docs/basic_usage/deepseek_v3.md
Success check:
- DeepseekV3ForCausalLM is selected
DeepSeek V3/R1 performance depends on MLA path selection, weight absorption, KV-cache dtype, DeepGEMM, and backend fallback.
- server_args.py selects trtllm_mla on SM100 when no attention backend is set
- deepseek_weight_loader.py requantizes or dequantizes kv_b_proj according to quant format
Key PRs:
The main model has 256 routed experts plus one shared expert. Current main can remap mlp.shared_experts into expert slot 256 when shared-expert fusion is active.
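The slot remapping can be pictured as a rename applied to checkpoint weight names at load time. This is a minimal sketch, not the SGLang loader: `remap_shared_expert_name` is a hypothetical helper, the example weight names are illustrative, and the `mlp.experts.<slot>.` target naming is an assumption that should be checked against deepseek_weight_loader.py.

```python
NUM_ROUTED_EXPERTS = 256  # from the text: 256 routed experts plus one shared expert


def remap_shared_expert_name(name: str, num_routed: int = NUM_ROUTED_EXPERTS) -> str:
    """Point a shared-expert weight at the first slot after the routed experts.

    Routed experts occupy slots 0..num_routed-1, so the shared expert lands in
    slot num_routed (256 for the main model). Other names pass through unchanged.
    """
    marker = "mlp.shared_experts."
    if marker not in name:
        return name
    return name.replace(marker, f"mlp.experts.{num_routed}.", 1)


# Illustrative names only; real checkpoints may use different suffixes.
print(remap_shared_expert_name("model.layers.3.mlp.shared_experts.gate_proj.weight"))
print(remap_shared_expert_name("model.layers.3.mlp.experts.17.gate_proj.weight"))
```

Because the rename is only valid while fusion is active, a real loader would guard it with the fusion decision rather than apply it unconditionally.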
- DeepseekV2MoE computes num_fused_shared_experts
- TopK is configured with grouped top-k, correction bias, routed scaling, and optional fused shared expert slots
- determine_num_fused_shared_experts() disables fusion for incompatible shapes, W4AFP8 shared/routed mismatch, TBO/SBO, or DeepEP unless explicitly enforced
- 256 routed experts to 256 + EP_size slots
Key PRs:
DeepSeek V3/R1 MTP uses the NextN model as an EAGLE draft path.
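When recording the serving shape for an MTP run, it can help to build the launch arguments in one place so the parser and draft settings are explicit. The sketch below only assembles an argument list; it does not start a server. The parser flags are the ones named in this skill. The `--speculative-*` flag names are written from memory of SGLang's server arguments and are not stated in this document, so verify them against server_args.py. Setting draft tokens to steps + 1 for top-k=1 is an assumption, and `build_launch_args` is a hypothetical helper.

```python
from typing import List, Optional

# Reasoning parser per model family, as named in this skill.
REASONING_PARSERS = {"v3": "deepseek-v3", "r1": "deepseek-r1"}


def build_launch_args(
    model_path: str,
    family: str,
    mtp_steps: Optional[int] = None,
) -> List[str]:
    """Assemble an argv list for a V3/R1 launch; MTP flags are added only on request."""
    args = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--reasoning-parser", REASONING_PARSERS[family],
        "--tool-call-parser", "deepseekv3",
    ]
    if mtp_steps is not None:
        # Assumed flag names; a top-k=1 draft chain is assumed to verify steps + 1 tokens.
        args += [
            "--speculative-algorithm", "EAGLE",
            "--speculative-num-steps", str(mtp_steps),
            "--speculative-eagle-topk", "1",
            "--speculative-num-draft-tokens", str(mtp_steps + 1),
        ]
    return args


print(" ".join(build_launch_args("deepseek-ai/DeepSeek-R1", "r1", mtp_steps=3)))
```

Keeping the draft-step count as an explicit input mirrors the caution elsewhere in this skill: the number of draft steps should be read from the runtime rather than assumed to be static.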
- DeepseekV3ForCausalLM
- DeepseekV3ForCausalLMNextN
- deepseek_nextn.py handles the single NextN layer, shared head/embed reuse, quant override, and AMD R1 MXFP4 naming
Key PRs:
Validation:
- test/registered/8-gpu-models/test_deepseek_v3_mtp.py
- test/registered/amd/test_deepseek_v3_mtp.py
- test/registered/spec/eagle/test_deepseek_v3_fp4_mtp_small.py
R1 W4AFP8 is a separate ladder from native FP8.
- W4AFp8Config detects mixed precision and maps linear layers to FP8 while MoE experts use W4A8
- cutlass_w4a8_moe.py handles packed int4 expert weights and FP8 activations
- apply_deepep_normal
Key PRs:
Treat each quantization format as a separate loader and backend contract.
- w4afp8.py, cutlass_w4a8_moe.py, DeepEP and TP variants
Key PRs:
The late-stage failures are usually topology bugs rather than model-architecture bugs.
Key PRs:
Use the narrowest lane that matches the change:
- test/registered/8-gpu-models/test_deepseek_v3_basic.py
- test/registered/8-gpu-models/test_deepseek_v3_mtp.py
- test/registered/amd/test_deepseek_v3_basic.py
- test/registered/amd/test_deepseek_v3_basic_kv_fp8.py
- test/registered/amd/test_deepseek_r1_mxfp4_8gpu.py
- test/registered/backends/test_deepseek_r1_fp8_trtllm_backend.py
- test/registered/quant/test_deepseek_v3_fp4_4gpu.py
- test/registered/quant/test_w4a8_deepseek_v3.py
- test/registered/mla/test_mla_deepseek_v3.py
- test/registered/mla/test_mla_int8_deepseek_v3.py
- test/registered/lora/test_lora_deepseek_v3_base_logprob_diff.py
- test/registered/kernels/test_fused_topk_deepseek.py
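Choosing the narrowest lane can be approximated by filtering the lanes above on keywords in the file name. This is a rough sketch under that assumption: `select_lanes` is a hypothetical helper, keyword matching on file names is a heuristic rather than how the SGLang CI assigns tests, and the result should still be reviewed by hand.

```python
from typing import Iterable, List

# The validation lanes listed above.
LANES = [
    "test/registered/8-gpu-models/test_deepseek_v3_basic.py",
    "test/registered/8-gpu-models/test_deepseek_v3_mtp.py",
    "test/registered/amd/test_deepseek_v3_basic.py",
    "test/registered/amd/test_deepseek_v3_basic_kv_fp8.py",
    "test/registered/amd/test_deepseek_r1_mxfp4_8gpu.py",
    "test/registered/backends/test_deepseek_r1_fp8_trtllm_backend.py",
    "test/registered/quant/test_deepseek_v3_fp4_4gpu.py",
    "test/registered/quant/test_w4a8_deepseek_v3.py",
    "test/registered/mla/test_mla_deepseek_v3.py",
    "test/registered/mla/test_mla_int8_deepseek_v3.py",
    "test/registered/lora/test_lora_deepseek_v3_base_logprob_diff.py",
    "test/registered/kernels/test_fused_topk_deepseek.py",
]


def select_lanes(keywords: Iterable[str], lanes: List[str] = LANES) -> List[str]:
    """Return the lanes whose file name contains every keyword."""
    wanted = list(keywords)
    selected = []
    for lane in lanes:
        filename = lane.rsplit("/", 1)[-1]
        if all(keyword in filename for keyword in wanted):
            selected.append(lane)
    return selected


# A change to the W4A8 MoE path narrows to a single quant lane.
print(select_lanes(["w4a8"]))
# Adding a second keyword narrows the MLA lanes to the int8 variant.
print(select_lanes(["mla", "int8"]))
```

If a keyword matches nothing, that is a signal to widen the search or add a lane, not to fall back to the full 8-GPU suite by default.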