SGLang DeepSeek V3.2 Optimization

Overview

This skill covers the DeepSeek V3.2 support and optimization ladder active in SGLang main. V3.2 shares the DeepSeek V3/R1 model backbone, but it is a separate optimization problem because it activates DeepSeek Sparse Attention, called DSA in docs and NSA in SGLang code.

Current-main snapshot:

SGLang origin/main: 929e00eea on 2026-04-21
sgl-cookbook origin/main: 8ec4d03 on 2026-04-21
V3.2 runtime entry: DeepseekV32ForCausalLM in python/sglang/srt/models/deepseek_v2.py
NSA backend: python/sglang/srt/layers/attention/nsa_backend.py
NSA indexer: python/sglang/srt/layers/attention/nsa/nsa_indexer.py
V3.2 tool parser: python/sglang/srt/function_call/deepseekv32_detector.py

The historical evidence lives in:

references/pr-history.md: chronological PR evidence and code-level notes
references/playbook.md: investigation order, symptom mapping, validation commands

Non-Negotiable Evidence Rule

Use skills/model-optimization/model-pr-diff-dossier/SKILL.md as the production bar. Every PR cited for this family must be backed by a reading of its diff, not just its title.

Before You Change Anything

Record the exact serving shape first:

model: V3.2-Exp, V3.2, V3.2-Speciale, V3.2-NVFP4, or V3.2-MXFP4
whether is_deepseek_nsa(config) is true
--attention-backend, --nsa-prefill-backend, --nsa-decode-backend
KV cache dtype: auto, bfloat16, fp8_e4m3, or experimental FP4 tracks
TP / DP / EP / PP / PD topology
--enable-dp-attention
--enable-nsa-prefill-context-parallel
--nsa-prefill-cp-mode: round-robin-split or in-seq-split
MTP enabled or not
IndexCache knobs: index_topk_freq, index_topk_pattern
tool parser: V3.2-Exp may use deepseekv31 in the cookbook path; standard V3.2 uses deepseekv32
reasoning parser: --reasoning-parser deepseek-v3
hardware: H200, B200/GB200/GB300, AMD MI300/MI355, NPU, or another backend
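
One way to make this record concrete is a small structure that is filled in before any experiment and stored next to every result. The sketch below is illustrative only: the ServingShape class and its field names are hypothetical and are not part of SGLang, and the example values are placeholders rather than recommended settings.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ServingShape:
    """Hypothetical record of one V3.2 serving configuration (not an SGLang API)."""

    model: str                       # V3.2-Exp, V3.2, V3.2-Speciale, V3.2-NVFP4, V3.2-MXFP4
    is_deepseek_nsa: bool            # observed result of is_deepseek_nsa(config)
    attention_backend: str           # --attention-backend
    nsa_prefill_backend: str         # --nsa-prefill-backend
    nsa_decode_backend: str          # --nsa-decode-backend
    kv_cache_dtype: str              # auto, bfloat16, fp8_e4m3, or an experimental FP4 track
    tp: int = 1
    dp: int = 1
    ep: int = 1
    pp: int = 1
    pd_disaggregated: bool = False
    dp_attention: bool = False       # --enable-dp-attention
    nsa_prefill_cp: bool = False     # --enable-nsa-prefill-context-parallel
    nsa_prefill_cp_mode: Optional[str] = None  # round-robin-split or in-seq-split
    mtp: bool = False
    index_topk_freq: Optional[int] = None
    index_topk_pattern: Optional[str] = None
    tool_parser: str = "deepseekv32"
    reasoning_parser: str = "deepseek-v3"
    hardware: str = "unknown"

    def to_json(self) -> str:
        """Serialize with sorted keys so two records can be diffed line by line."""
        return json.dumps(asdict(self), sort_keys=True, indent=2)


# Placeholder values for illustration; record what the server actually reports.
shape = ServingShape(
    model="V3.2",
    is_deepseek_nsa=True,
    attention_backend="nsa",
    nsa_prefill_backend="flashmla_sparse",
    nsa_decode_backend="fa3",
    kv_cache_dtype="bfloat16",
    tp=8,
    hardware="H200",
)
print(shape.to_json())
```

Keeping the record frozen and serialized with sorted keys makes it easy to diff two runs and see which knob changed between them.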

Core Principle

Do not treat V3.2 as ordinary DeepSeek V3.

V3.2 turns on DSA/NSA through is_deepseek_nsa(config).
The attention hot path is split between the indexer, top-k transform, sparse MLA backend, and KV-cache quant/dequant.
Server defaults are model-specific: attention backend becomes nsa, KV cache dtype defaults differ by architecture, and NSA prefill/decode backends are auto-selected.
Context parallel is experimental and has strict mode-specific constraints.
MTP spans the NextN layer, NSA metadata, target_verify, draft_extend, CP positions, and speculative overlap.
V3.2 parser behavior is DSML for standard V3.2, while V3.2-Exp docs still point at the V3.1-style parser path.

The optimization order matters:

confirm DSA detection and server defaults
confirm KV cache dtype and NSA backend pair
validate indexer top-k generation and transform
validate MTP, CP, PP, or DP attention only after base DSA is correct
then tune backend-specific kernels for Blackwell, Hopper, AMD, or NPU
add model-backed tests for any IndexCache, MTP, CP, or backend change

Main Runtime Surfaces

Start from these files before changing behavior:

python/sglang/srt/models/deepseek_v2.py
python/sglang/srt/models/deepseek_nextn.py
python/sglang/srt/configs/model_config.py
python/sglang/srt/server_args.py
python/sglang/srt/managers/schedule_batch.py
python/sglang/srt/managers/scheduler_output_processor_mixin.py
python/sglang/srt/mem_cache/common.py
python/sglang/srt/speculative/eagle_worker_v2.py
python/sglang/srt/speculative/multi_layer_eagle_worker_v2.py
python/sglang/srt/layers/attention/nsa_backend.py
python/sglang/srt/layers/attention/nsa/nsa_indexer.py
python/sglang/srt/layers/attention/nsa/utils.py
python/sglang/srt/layers/attention/nsa/transform_index.py
python/sglang/srt/layers/attention/nsa/quant_k_cache.py
python/sglang/srt/layers/attention/nsa/dequant_k_cache.py
python/sglang/srt/layers/communicator_nsa_cp.py
python/sglang/srt/function_call/deepseekv32_detector.py
examples/chat_template/tool_chat_template_deepseekv32.jinja

Open PRs to Track

Check these before declaring a V3.2 gap. Most are still open; entries marked as merged landed after the snapshot date above:

#11191: sparse attention and CPU/GPU KV scheduling for GQA/DSA, open.
#12820: TP-SP for Qwen and DeepSeek V2/V3/V3.2, open.
#16148: V3.2 W4AFP8 MTP with FP8 draft model, open.
#17185: TP o_proj linear in context-parallel NSA, open.
#17761: missing Assistant token after V3.1/V3.2 tool output, open.
#18167: DCP support for V3.2, open.
#18275: NPU all-gather after qlora for V3.2, open.
#18733: V3.2 PD disaggregation test, open.
#19211: extract V3.2/NSA logic into DeepseekV32Mixin, open.
#19299: O(1) expert weight matching in DeepSeek weight loader, open.
#19609: TP indexer weight in NSA attention, open.
#19975: AMD context parallel for V3.2, open.
#20360: AMD CP round-robin split garbage output, open.
#20531: NSA indexer ragged gather mismatch in CP round-robin split, open.
#20809: add DeepseekV32ForCausalLM to MTP draft mapping, open.
#20880: reject HiCache L3 for NSA models, open.
#21179: preserve V3.2 tool-call markers in reasoning parsing, open.
#21194: AMD PPMissingLayer fix in DeepSeek AITER gfx95 path, open.
#21506: V3.2 NPU torch compile, open.
#21529: ROCm MXFP4 / Quark W4A4 support for DeepSeek architecture, open.
#21530: ROCm fused MLA decode RoPE fix for DeepSeek-variant models, open.
#21546: catch malformed JSON in V3.2 partial function-call parsing, open.
#21889: AMD FP4 KV cache quantization for NSA TileLang, open.
#22268: DeepSeek MLA LoRA adapter bypass in prepare_qkv_latent, open.
#22473: dense MLA decode fallback for short sequences, open.
#22774: MUSA backend support for DeepSeek V2/V3/R1-class layers, merged on 2026-04-24 (01:59:51 UTC).
#22851: --nsa-topk-backend and FlashInfer/PyTorch top-k, open.
#22865: sparsity framework extension for non-NSA sparse algorithms, open.
#14332: V3.2 tool-call parsing without DSML tag, open.
#14524: NSA backend test suite, open.
#15322: V3.2 o_proj TP support, open.
#18094: V3.2 piecewise CUDA graph, open and related to #23351.
#18542: EAGLE3 plus NSA CP aux-hidden-state index bug, open.
#19987: AMD FP8 KV cache for TileLang NSA backend, open.
#20534: transfer FP8 K/K-scale for CP indexer prefill gather, open.
#21623: unit tests for encoding_dsv32.py, open.
#22792: AITER indexer_k_quant_and_cache, open.
#23268: NPU fix for NSA CP plus prefix cache, merged on 2026-04-28.
#22938: restore DeepSeek MLA MI300X paths after the MLA refactor, open.
#23195: guard .weight access in DeepSeek MLA for AWQ/compressed-tensors, open.
#23241: 3FS backend for DSA/mamba, open.
#23257: CuteDSL EP plus DP-attention double-reduce fix in DeepseekV2MoE, open.
#23336: adaptive speculative-num-steps support for spec v2 EAGLE workers, open.
#23351: piecewise CUDA graph with NSA, open.

Additional PR Coverage

Additional coverage across all PR states (merged, open, and closed) includes V3.2 bugfixes, closed experiments, tool-parser updates, and platform-specific backend work:

Early bring-up polish: #11063, #11194, #11308, #11309, #11450, #11557, #11565, #11682, #11815, and #11835.
Short-sequence MHA / Indexer fixes: #11892, #12094, #12582, #12583, #12645, #12788, #12816, #12964, #13022, #13459, and #13544.
DSML/tool/parser path: #14304, #14307, #14353, #14573, #14750, #15064, #15278, #16091, #18126, #18174, and #17951.
NSA backend / metadata / sparse-cache work: #14781, #14901, #15040, #15086, #15242, #15429, #16520, #16758, #16841, #17205, #17554, and #18319.
HiSparse/HiCache and platform fixes: #14741, #17409, #17518, #17523, #17633, #18297, #18526, #20343, #21932, and #22238.
Closed or superseded experiments to cite as history, not current support: #11109, #11596, #11761, #12017, #12052, #13531, #13546, #14619, #14904, #15051, #15217, #15310, #15807, #16079, #16881, #17024, #17199, #17310, and #17647.
Round-2 runtime additions: #21249 adds all-reduce fusion with context parallel, #22003 relaxes moe_dp_size == 1 with different attention_cp_size values, #21599 adds adaptive EAGLE top-k=1 draft steps, #22128 allows PCG with speculative decoding, #23219 touches shared DSA/NextN infrastructure through deepseek_nextn.py, #22950 is the closed predecessor for reasoning radix-cache stripping, #23315 is the merged opt-in thinking-token strip from radix cache, and #23336 is the open spec-v2 adaptive-spec follow-up.

Evolution Path

Stage V32-0: Bring up DSA/NSA as a separate DeepSeek class

Key PR:

#11061

Success check:

DeepseekV32ForCausalLM exists
is_deepseek_nsa(config) is true
server_args.py selects attention_backend = "nsa"
NativeSparseAttnBackend and Indexer are active

Stage V32-1: Server defaults, KV cache dtype, and backend pair

V3.2 has model-specific defaults:

DSA KV cache defaults to fp8_e4m3 on SM100 and bfloat16 otherwise
only bfloat16 and fp8_e4m3 are mainline DSA KV cache dtypes
ROCm defaults to TileLang NSA backends
Blackwell defaults now prefer TRTLLM NSA kernels
Hopper often uses flashmla_sparse, flashmla_kv, or fa3
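
The KV cache default described above can be written down as a small decision function. This is a minimal sketch of the stated rule only, assuming that SM100 corresponds to CUDA compute capability major version 10. The function names are hypothetical, the real selection lives in server_args.py and may consider more inputs, and rejecting non-mainline dtypes is a choice made for this sketch.

```python
MAINLINE_DSA_KV_DTYPES = ("bfloat16", "fp8_e4m3")


def default_dsa_kv_cache_dtype(sm_major: int) -> str:
    """Stated default: fp8_e4m3 on SM100, bfloat16 otherwise.

    sm_major is the compute capability major version (assumed 10 for SM100).
    """
    return "fp8_e4m3" if sm_major == 10 else "bfloat16"


def resolve_kv_cache_dtype(requested: str, sm_major: int) -> str:
    """Resolve 'auto' to the default; accept only the mainline dtypes otherwise."""
    if requested == "auto":
        return default_dsa_kv_cache_dtype(sm_major)
    if requested not in MAINLINE_DSA_KV_DTYPES:
        # Sketch-only behavior: experimental tracks (for example FP4) are rejected here.
        raise ValueError(f"{requested!r} is outside the mainline DSA KV cache dtypes")
    return requested


print(resolve_kv_cache_dtype("auto", sm_major=10))
print(resolve_kv_cache_dtype("auto", sm_major=9))
```

Comparing the resolved value with what the running server logs is a quick way to confirm that the default matches the hardware before tuning anything else.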

Key PRs:

#11936
#18389
#18931
#21783
#21914

Stage V32-2: Indexer correctness and performance

The NSA indexer computes sparse indices through q/k projection, weights projection, top-k, transforms, and optional KV-cache store.
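
A toy reference for the scoring and top-k steps can help when checking kernel output by hand. The sketch below assumes the index score for a key is a weighted sum over indexer heads of ReLU(q · k). That formula is an assumption used for illustration and was not read from nsa_indexer.py. All function names are hypothetical, and the projections, transforms, and KV-cache store are left out.

```python
from typing import List, Sequence


def index_scores(
    q_heads: Sequence[Sequence[float]],
    head_weights: Sequence[float],
    keys: Sequence[Sequence[float]],
) -> List[float]:
    """Score every key for one query token.

    Assumed form: score(s) = sum_h head_weights[h] * relu(dot(q_heads[h], keys[s])).
    """
    scores = []
    for key in keys:
        total = 0.0
        for q, w in zip(q_heads, head_weights):
            dot = sum(a * b for a, b in zip(q, key))
            total += w * max(dot, 0.0)  # ReLU before the head weight is applied
        scores.append(total)
    return scores


def topk_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the positions of the k largest scores, in ascending position order.

    Ties are broken toward the lower position; sorting the result is a choice
    made for this sketch so outputs are easy to compare.
    """
    k = min(k, len(scores))
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])


q_heads = [[1.0, 0.0], [0.0, 1.0]]          # two indexer heads, head dim 2
head_weights = [0.5, 2.0]                   # stand-in for the weights projection output
keys = [[1.0, 1.0], [-1.0, 0.0], [0.0, 3.0], [2.0, -1.0]]

scores = index_scores(q_heads, head_weights, keys)
selected = topk_indices(scores, k=2)
print(scores, selected)
```

A pure-Python reference like this is slow, but it gives a known-good answer for tiny shapes when comparing against fused or quantized indexer paths.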

Key PRs:

#12044
#13812
#16637
#17688
#19041
#19148
#19319
#22232
#22424
#22850

Success check:

weights_proj avoids FP32 precision loss
K/S buffers use fused kernels where available
FP8 KV cache store is fused or padded correctly for the selected backend
AMD and NPU have separate indexer paths where needed

Stage V32-3: Context parallel, PP, and DP attention

Context parallel for NSA is experimental, and each split mode carries its own constraints.

Key PRs:

#12065
#13959
#16119
#16156
#16305
#16380
#18613
#20438
#21192
#22914

Success check:

round-robin-split is the current default CP token split method
in-seq-split requires DeepEP and ep_size == tp_size
CP is rejected by an assertion in PD decode mode
CP positions match EAGLE NextN
key all-gather can overlap query computation
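
The mode-specific constraints in this checklist can be captured as a pre-flight check that runs before a server is launched. This is a sketch under stated assumptions: CpConfig and cp_violations are hypothetical names, the fields are a simplified stand-in for the real server arguments, and only the constraints listed above are encoded.

```python
from dataclasses import dataclass
from typing import List

CP_MODES = ("round-robin-split", "in-seq-split")


@dataclass
class CpConfig:
    """Hypothetical, simplified view of the CP-related settings."""

    enable_nsa_prefill_cp: bool
    cp_mode: str = "round-robin-split"   # documented as the current default
    tp_size: int = 1
    ep_size: int = 1
    uses_deepep: bool = False
    pd_decode: bool = False


def cp_violations(cfg: CpConfig) -> List[str]:
    """Return human-readable constraint violations; an empty list means OK."""
    problems: List[str] = []
    if not cfg.enable_nsa_prefill_cp:
        return problems  # nothing to check when CP is off
    if cfg.cp_mode not in CP_MODES:
        problems.append(f"unknown CP mode {cfg.cp_mode!r}")
    if cfg.pd_decode:
        problems.append("CP is not allowed in PD decode mode")
    if cfg.cp_mode == "in-seq-split":
        if not cfg.uses_deepep:
            problems.append("in-seq-split requires DeepEP")
        if cfg.ep_size != cfg.tp_size:
            problems.append("in-seq-split requires ep_size == tp_size")
    return problems


bad = CpConfig(enable_nsa_prefill_cp=True, cp_mode="in-seq-split", tp_size=8, ep_size=4)
print(cp_violations(bad))
```

Returning a list instead of raising on the first problem shows every violated constraint at once, which is convenient when a launch script sets several flags together.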

Stage V32-4: MTP and speculative decoding

V3.2 MTP must cooperate with NSA metadata, target verify, draft extend, and context parallel.

Key PRs:

#11652
#15088
#15307
#16961
#17662
#19016
#19062
#19367
#19536
#20492

Stage V32-5: Quantized checkpoints and platform lanes

Separate the backend tracks:

NVFP4 Blackwell: #17657, #18389, #20086
AMD MXFP4/TileLang/FP8 KV: #17783, #19945, #20840, #21511, #22258, #22850
NPU: #14541, #14572, #15381, #16990, #17007, #21468
HiSparse/HiCache: #21259, #22065, #22425

Stage V32-6: IndexCache

IndexCache reuses NSA top-k indices across layers.

Key PR:

#21405

Success check:

skip_topk and next_skip_topk are set per layer
index_topk_freq and index_topk_pattern override behavior correctly
prev_topk_indices is carried through layers
test/registered/8-gpu-models/test_deepseek_v32_indexcache.py remains accurate
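
The per-layer flags can be pictured as a schedule derived from the two knobs. The sketch below rests on assumptions that should be checked against the PR diff: index_topk_freq = f is taken to mean that every f-th layer computes top-k and the layers in between reuse prev_topk_indices, and index_topk_pattern is taken to be a per-layer string of "F" (compute) and "S" (reuse) that overrides the frequency. The function name indexcache_schedule is hypothetical.

```python
from typing import List, Optional, Tuple


def indexcache_schedule(
    num_layers: int,
    index_topk_freq: int = 1,
    index_topk_pattern: Optional[str] = None,
) -> Tuple[List[bool], List[bool]]:
    """Return (skip_topk, next_skip_topk), one flag per layer.

    skip_topk[i] is True when layer i reuses the previous layer's indices.
    next_skip_topk[i] is True when layer i + 1 will reuse layer i's indices.
    The knob semantics here are assumptions made for illustration.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if index_topk_pattern is not None:
        if len(index_topk_pattern) != num_layers or set(index_topk_pattern) - {"F", "S"}:
            raise ValueError("pattern must be one 'F' or 'S' per layer")
        skip = [c == "S" for c in index_topk_pattern]
    else:
        if index_topk_freq < 1:
            raise ValueError("index_topk_freq must be at least 1")
        skip = [i % index_topk_freq != 0 for i in range(num_layers)]
    if skip[0]:
        raise ValueError("layer 0 has no previous indices to reuse")
    next_skip = skip[1:] + [False]  # the last layer has no successor
    return skip, next_skip


skip_topk, next_skip_topk = indexcache_schedule(num_layers=6, index_topk_freq=3)
print(skip_topk)
print(next_skip_topk)
```

Printing the schedule for the real layer count is a cheap sanity check that an override does what was intended before running the model-backed test.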

Stage V32-7: DSML tool calling and reasoning interaction

Standard V3.2 uses DSML:

<｜DSML｜function_calls><｜DSML｜invoke name="tool">...</｜DSML｜invoke></｜DSML｜function_calls>

The detector supports XML parameter tags and direct JSON. Track open parser bugs:

#21179: reasoning parser should preserve V3.2 tool-call markers.
#21546: catch malformed JSON while parsing partial function calls.
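
To reason about the DSML envelope and the malformed-JSON case tracked in #21546, a standalone extraction sketch can be useful. It is not the code in deepseekv32_detector.py: the function name and return shape are hypothetical, only the direct-JSON body form is handled, and XML parameter tags are left unparsed because their exact layout is not shown here.

```python
import json
import re
from typing import Any, Dict, List

# Tag spellings are copied from the DSML example above (fullwidth bar characters).
BLOCK_RE = re.compile(
    r"<｜DSML｜function_calls>(.*?)</｜DSML｜function_calls>", re.DOTALL
)
INVOKE_RE = re.compile(
    r'<｜DSML｜invoke name="([^"]+)">(.*?)</｜DSML｜invoke>', re.DOTALL
)


def extract_dsml_calls(text: str) -> List[Dict[str, Any]]:
    """Extract tool calls from complete DSML blocks.

    A body that is not valid JSON yields arguments=None instead of raising,
    so one bad call does not discard the others.
    """
    calls: List[Dict[str, Any]] = []
    for block in BLOCK_RE.findall(text):
        for name, body in INVOKE_RE.findall(block):
            try:
                arguments = json.loads(body)
            except json.JSONDecodeError:
                arguments = None  # malformed JSON or a non-JSON body form
            calls.append({"name": name, "arguments": arguments, "raw": body})
    return calls


sample = (
    "Checking the forecast."
    "<｜DSML｜function_calls>"
    '<｜DSML｜invoke name="get_weather">{"city": "Paris"}</｜DSML｜invoke>'
    '<｜DSML｜invoke name="broken">{"city": </｜DSML｜invoke>'
    "</｜DSML｜function_calls>"
)
for call in extract_dsml_calls(sample):
    print(call["name"], call["arguments"])
```

Keeping the raw body next to the parsed arguments makes it possible to log exactly what the model emitted when parsing fails.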

Validation Surface

Use the narrowest lane that matches the change:

V3.2 base/MTP/DP/TP/tool-calling: test/registered/8-gpu-models/test_deepseek_v32.py
NSA backend pair: test_deepseek_v32_nsa_backends inside that file
IndexCache: test/registered/8-gpu-models/test_deepseek_v32_indexcache.py
chat template argument types: test/manual/test_deepseek_chat_templates.py
CP and DeepEP-specific changes: use the dedicated CP/DeepEP suites referenced by the PR
AMD changes: MI300/MI355 registered lanes
NPU changes: Ascend/NPU model deployment and backend tests

