skills/model-optimization/sglang/sglang-deepseek-v32-optimization/SKILL.md
PR-backed and current-main optimization manual for DeepSeek V3.2, V3.2-Exp, V3.2-Speciale, NVFP4, and MXFP4 in SGLang. Use when an engineer needs to recover, extend, or audit DSA/NSA sparse attention, NSA indexer, FP8/BF16/FP4 KV cache, context parallel, MTP, IndexCache, DSML tool calling, V3.2 docs/tests, AMD/NPU/Blackwell backends, or open NSA/DSA PRs.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS sglang-deepseek-v32-optimization
This skill covers the DeepSeek V3.2 support and optimization ladder active in SGLang main. V3.2 shares the DeepSeek V3/R1 model backbone, but it is a separate optimization problem because it activates DeepSeek Sparse Attention, called DSA in docs and NSA in SGLang code.
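To make the distinction concrete, here is a minimal sketch of how a launcher could gate on a sparse-attention config and pick the defaults this manual lists (attention backend `nsa`; KV cache dtype `fp8_e4m3` on SM100 and `bfloat16` otherwise). `ToyConfig`, `looks_like_nsa`, and `pick_defaults` are hypothetical names, and using an `index_topk` field as the marker is an assumption. This is not the body of SGLang's `is_deepseek_nsa(config)`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a model config object."""
    architectures: List[str]
    index_topk: Optional[int] = None  # assumed marker of a sparse indexer


def looks_like_nsa(config) -> bool:
    # Assumption: a DSA/NSA model exposes an indexer top-k setting.
    return getattr(config, "index_topk", None) is not None


def pick_defaults(config, is_sm100: bool) -> dict:
    """Return illustrative defaults following the rules stated in this manual."""
    if not looks_like_nsa(config):
        # Placeholder for the ordinary DeepSeek V3 path, out of scope here.
        return {"attention_backend": None, "kv_cache_dtype": "auto"}
    return {
        "attention_backend": "nsa",
        "kv_cache_dtype": "fp8_e4m3" if is_sm100 else "bfloat16",
    }


v32 = ToyConfig(["DeepseekV32ForCausalLM"], index_topk=2048)
v3 = ToyConfig(["DeepseekV3ForCausalLM"])
print(pick_defaults(v32, is_sm100=True))
print(pick_defaults(v32, is_sm100=False))
print(pick_defaults(v3, is_sm100=True))
```

The point of the sketch is only that the V3.2 branch changes backend and cache defaults together, so a serving shape recorded for plain V3 does not carry over.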
Current-main snapshot:
- origin/main: `929e00eea` on 2026-04-21
- origin/main: `8ec4d03` on 2026-04-21
- `DeepseekV32ForCausalLM` in `python/sglang/srt/models/deepseek_v2.py`
- `python/sglang/srt/layers/attention/nsa_backend.py`
- `python/sglang/srt/layers/attention/nsa/nsa_indexer.py`
- `python/sglang/srt/function_call/deepseekv32_detector.py`

The historical evidence lives in:
Use skills/model-optimization/model-pr-diff-dossier/SKILL.md as the production bar.
Every PR cited for this family must be based on diff reading, not only PR titles.
Record the exact serving shape first:
- `is_deepseek_nsa(config)` is true
- `--attention-backend`, `--nsa-prefill-backend`, `--nsa-decode-backend`
- `auto`, `bfloat16`, `fp8_e4m3`, or experimental FP4 tracks
- `--enable-dp-attention`
- `--enable-nsa-prefill-context-parallel`
- `--nsa-prefill-cp-mode`: `round-robin-split` or `in-seq-split`
- `index_topk_freq`, `index_topk_pattern`
- `deepseekv31` in the cookbook path, standard V3.2 uses `deepseekv32`
- `--reasoning-parser deepseek-v3`

Do not treat V3.2 as ordinary DeepSeek V3.
- `is_deepseek_nsa(config)`.
- `nsa`, KV cache dtype defaults differ by architecture, and NSA prefill/decode backends are auto-selected.

The optimization order matters:
Start from these files before changing behavior:
- `python/sglang/srt/models/deepseek_v2.py`
- `python/sglang/srt/models/deepseek_nextn.py`
- `python/sglang/srt/configs/model_config.py`
- `python/sglang/srt/server_args.py`
- `python/sglang/srt/managers/schedule_batch.py`
- `python/sglang/srt/managers/scheduler_output_processor_mixin.py`
- `python/sglang/srt/mem_cache/common.py`
- `python/sglang/srt/speculative/eagle_worker_v2.py`
- `python/sglang/srt/speculative/multi_layer_eagle_worker_v2.py`
- `python/sglang/srt/layers/attention/nsa_backend.py`
- `python/sglang/srt/layers/attention/nsa/nsa_indexer.py`
- `python/sglang/srt/layers/attention/nsa/utils.py`
- `python/sglang/srt/layers/attention/nsa/transform_index.py`
- `python/sglang/srt/layers/attention/nsa/quant_k_cache.py`
- `python/sglang/srt/layers/attention/nsa/dequant_k_cache.py`
- `python/sglang/srt/layers/communicator_nsa_cp.py`
- `python/sglang/srt/function_call/deepseekv32_detector.py`
- `examples/chat_template/tool_chat_template_deepseekv32.jinja`

Check these before declaring a V3.2 gap:
- `o_proj` linear in context-parallel NSA, open.
- `DeepseekV32Mixin`, open.
- `DeepseekV32ForCausalLM` to MTP draft mapping, open.
- `PPMissingLayer` fix in DeepSeek AITER gfx95 path, open.
- `prepare_qkv_latent`, open.
- `2026-04-24T01:59:51Z`.
- `--nsa-topk-backend` and FlashInfer/PyTorch top-k, open.
- `o_proj` TP support, open.
- `encoding_dsv32.py`, open.
- `indexer_k_quant_and_cache`, open.
- `.weight` access in DeepSeek MLA for AWQ/compressed-tensors, open.
- `DeepseekV2MoE`, open.

Additional all-state PR coverage includes V3.2 bugfixes, closed experiments, tool-parser updates, and platform-specific backend work:
`moe_dp_size == 1` with different `attention_cp_size` values, #21599 adds adaptive EAGLE top-k=1 draft steps, #22128 allows PCG with speculative decoding, #23219 touches shared DSA/NextN infrastructure through `deepseek_nextn.py`, #22950 is the closed predecessor for reasoning radix-cache stripping, #23315 is the merged opt-in thinking-token strip from radix cache, and #23336 is the open spec-v2 adaptive-spec follow-up.

Key PR:
Success check:
- `DeepseekV32ForCausalLM` exists
- `is_deepseek_nsa(config)` is true
- `server_args.py` selects `attention_backend = "nsa"`
- `NativeSparseAttnBackend` and `Indexer` are active

V3.2 has model-specific defaults:
- `fp8_e4m3` on SM100 and `bfloat16` otherwise
- `bfloat16` and `fp8_e4m3` are mainline DSA KV cache dtypes
- `flashmla_sparse`, `flashmla_kv`, or `fa3`

Key PRs:
The NSA indexer computes sparse indices through q/k projection, weights projection, top-k, transforms, and optional KV-cache store.
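The pipeline above can be illustrated with a toy, pure-Python index scorer. It assumes the commonly described DSA scoring form, where each indexer head contributes a weighted ReLU of its query-key dot product and the top-k keys by total score are kept. The function name `indexer_topk` and the tensor layout are hypothetical. SGLang's `Indexer` runs fused kernels, applies transforms, and may store quantized keys, none of which is modeled here.

```python
from typing import List, Sequence


def indexer_topk(
    q_heads: Sequence[Sequence[float]],  # [num_index_heads][dim], one query token
    head_weights: Sequence[float],       # [num_index_heads], from the weights projection
    keys: Sequence[Sequence[float]],     # [num_keys][dim], indexer keys visible to the token
    topk: int,
) -> List[int]:
    """Return the sorted positions of the top-k keys for one query token."""
    scores = []
    for k in keys:
        total = 0.0
        for q, w in zip(q_heads, head_weights):
            dot = sum(a * b for a, b in zip(q, k))
            total += w * max(dot, 0.0)  # ReLU on the per-head logit
        scores.append(total)
    # Stable tie-break on position so the result is deterministic.
    order = sorted(range(len(keys)), key=lambda i: (-scores[i], i))
    return sorted(order[:topk])


q = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, 0.5]
ks = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [2.0, 2.0]]
print(indexer_topk(q, w, ks, topk=2))
```

Because the head weights scale every score, their precision directly affects which keys survive top-k, which is why the `weights_proj` dtype shows up in the success checks.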
Key PRs:
Success check:
- `weights_proj` avoids FP32 precision loss

Context parallel for NSA is powerful but constrained.
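As a rough illustration of why the token split method matters, the sketch below compares two ways of assigning prefill tokens to CP ranks and checks the constraint this manual states for `in-seq-split` (DeepEP and `ep_size == tp_size`). Mapping `round-robin-split` to `token_index % cp_size` and using plain contiguous chunks as the comparison are assumptions made for illustration. The real SGLang split logic may differ, and all function names here are hypothetical.

```python
from typing import List


def round_robin_split(num_tokens: int, cp_size: int) -> List[List[int]]:
    """Assign token i to rank i % cp_size (assumed meaning of round-robin)."""
    return [list(range(r, num_tokens, cp_size)) for r in range(cp_size)]


def contiguous_split(num_tokens: int, cp_size: int) -> List[List[int]]:
    """Assign each rank one contiguous chunk (a simplified comparison point)."""
    base, extra = divmod(num_tokens, cp_size)
    out, start = [], 0
    for r in range(cp_size):
        n = base + (1 if r < extra else 0)
        out.append(list(range(start, start + n)))
        start += n
    return out


def causal_cost(tokens: List[int]) -> int:
    """Under a causal mask, token i attends to i + 1 positions."""
    return sum(i + 1 for i in tokens)


def check_cp_mode(mode: str, deepep_enabled: bool, ep_size: int, tp_size: int) -> bool:
    """Return whether the mode satisfies the constraint quoted in this manual."""
    if mode == "round-robin-split":
        return True
    if mode == "in-seq-split":
        return deepep_enabled and ep_size == tp_size
    raise ValueError(f"unknown CP mode: {mode}")


rr = round_robin_split(7, 2)
ct = contiguous_split(7, 2)
print(rr, [causal_cost(t) for t in rr])
print(ct, [causal_cost(t) for t in ct])
print(check_cp_mode("in-seq-split", deepep_enabled=False, ep_size=8, tp_size=8))
```

In this toy case the interleaved assignment spreads causal attention work more evenly across ranks than plain contiguous chunks do.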
Key PRs:
Success check:
- `round-robin-split` is the current default CP token split method
- `in-seq-split` requires DeepEP and `ep_size == tp_size`

V3.2 MTP must cooperate with NSA metadata, target verify, draft extend, and context parallel.
Key PRs:
Separate the backend tracks:
IndexCache reuses NSA top-k indices across layers.
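A minimal sketch of the reuse idea follows. It assumes `index_topk_freq = N` means "recompute top-k on every N-th layer and reuse the previous indices in between". That reading is an assumption, `index_topk_pattern` is not modeled, and `plan_indexcache` and `run_layers` are hypothetical helpers. The field names `skip_topk`, `next_skip_topk`, and `prev_topk_indices` are borrowed from this manual's success checks.

```python
from typing import Callable, Dict, List, Optional


def plan_indexcache(num_layers: int, freq: int) -> List[Dict]:
    """Mark which layers skip top-k and whether the following layer will skip."""
    plan = []
    for i in range(num_layers):
        skip = (i % freq) != 0
        next_skip = (i + 1 < num_layers) and ((i + 1) % freq != 0)
        plan.append({"layer": i, "skip_topk": skip, "next_skip_topk": next_skip})
    return plan


def run_layers(plan: List[Dict], compute_topk: Callable[[int], List[int]]) -> List[List[int]]:
    """Walk the layers, carrying prev_topk_indices only while it is still needed."""
    prev_topk_indices: Optional[List[int]] = None
    used = []
    for step in plan:
        if step["skip_topk"]:
            indices = prev_topk_indices  # reuse the indices from an earlier layer
        else:
            indices = compute_topk(step["layer"])
        used.append(indices)
        # Keep the indices alive only if the next layer will reuse them.
        prev_topk_indices = indices if step["next_skip_topk"] else None
    return used


plan = plan_indexcache(num_layers=4, freq=2)
print(plan)
print(run_layers(plan, compute_topk=lambda layer: [layer]))
```

With `freq=2`, layers 0 and 2 compute indices and layers 1 and 3 reuse them, so the indexer runs on half of the layers in this toy setup.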
Key PR:
Success check:
- `skip_topk` and `next_skip_topk` are set per layer
- `index_topk_freq` and `index_topk_pattern` override behavior correctly
- `prev_topk_indices` is carried through layers
- `test/registered/8-gpu-models/test_deepseek_v32_indexcache.py` remains accurate

Standard V3.2 uses DSML:
<|DSML|function_calls><|DSML|invoke name="tool">...</|DSML|invoke></|DSML|function_calls>
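A sample shaped like the one above can be picked apart with a small regex-based sketch. It uses the delimiters exactly as written in this manual and only handles an invoke body that is direct JSON. Bodies that use parameter tags are returned as raw text, because their exact tag layout is not shown here. `parse_dsml` is a hypothetical helper and not the logic in `deepseekv32_detector.py`.

```python
import json
import re

BLOCK_RE = re.compile(
    r"<\|DSML\|function_calls>(.*?)</\|DSML\|function_calls>", re.DOTALL
)
INVOKE_RE = re.compile(
    r'<\|DSML\|invoke name="([^"]+)">(.*?)</\|DSML\|invoke>', re.DOTALL
)


def parse_dsml(text: str) -> list:
    """Extract tool calls; arguments is None when the body is not valid JSON."""
    calls = []
    for block in BLOCK_RE.findall(text):
        for name, body in INVOKE_RE.findall(block):
            body = body.strip()
            try:
                arguments = json.loads(body)
            except json.JSONDecodeError:
                arguments = None  # e.g. a parameter-tag body, left unparsed
            calls.append({"name": name, "arguments": arguments, "raw": body})
    return calls


sample = (
    '<|DSML|function_calls><|DSML|invoke name="get_weather">'
    '{"city": "Paris"}'
    "</|DSML|invoke></|DSML|function_calls>"
)
print(parse_dsml(sample))
```

Keeping the raw body next to the parsed arguments makes it easier to reproduce parser bugs, since a failing case can be logged verbatim.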
The detector supports XML parameter tags and direct JSON. Track open parser bugs:
Use the narrowest lane that matches the change:
- `test/registered/8-gpu-models/test_deepseek_v32.py`
- `test_deepseek_v32_nsa_backends` inside that file
- `test/registered/8-gpu-models/test_deepseek_v32_indexcache.py`
- `test/manual/test_deepseek_chat_templates.py`