skills/model-optimization/sglang/sglang-deepseek-v31-optimization/SKILL.md
PR-backed and current-main optimization manual for DeepSeek V3.1 and DeepSeek-V3.1-Terminus in SGLang. Use when an engineer needs to recover, extend, or audit DeepSeek V3.1 tool calling, thinking mode, chat templates, streaming parser behavior, loading fixes, MTP validation, fused MoE configs, or backend-specific tests.
npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS sglang-deepseek-v31-optimizationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
This skill covers the DeepSeek V3.1 support and optimization ladder that is active in SGLang main. V3.1 shares the DeepSeek V3/R1 model backbone, but its tool-call format, chat template, thinking flag, streaming parser, and validation lanes are separate enough to require an independent skill.
Current-main snapshot:
origin/main: 929e00eea on 2026-04-21origin/main: 8ec4d03 on 2026-04-21python/sglang/srt/models/deepseek_v2.pypython/sglang/srt/function_call/deepseekv31_detector.pyexamples/chat_template/tool_chat_template_deepseekv31.jinjaThe historical evidence lives in:
Use skills/model-optimization/model-pr-diff-dossier/SKILL.md as the production bar.
Every PR cited for this family must be based on diff reading, not only PR titles.
Record the exact serving shape first:
deepseek-ai/DeepSeek-V3.1, DeepSeek-V3.1-Terminus, or DeepSeek-V3.1-Specialechat_template_kwargs.thinking--tool-call-parser deepseekv31examples/chat_template/tool_chat_template_deepseekv31.jinja--reasoning-parser deepseek-v3Do not debug V3.1 as ordinary V3.
sglang-deepseek-v3-r1-optimization.function marker or fenced JSON block.chat_template_kwargs: {"thinking": true} with --reasoning-parser deepseek-v3, not R1's parser.The optimization order matters:
</think> template behaviorStart from these files before changing behavior:
python/sglang/srt/function_call/deepseekv31_detector.pypython/sglang/srt/function_call/function_call_parser.pyexamples/chat_template/tool_chat_template_deepseekv31.jinjapython/sglang/srt/entrypoints/openai/serving_chat.pypython/sglang/srt/parser/reasoning_parser.pypython/sglang/srt/managers/schedule_batch.pypython/sglang/srt/mem_cache/common.pypython/sglang/srt/models/deepseek_v2.pypython/sglang/srt/models/deepseek_common/deepseek_weight_loader.pytest/manual/test_deepseek_v31.pytest/manual/nightly/test_deepseek_v31_perf.pytest/manual/test_deepseek_chat_templates.pyCheck these before declaring a V3.1 gap:
DeepSeekV31Detector, open.Additional all-state PR coverage includes parser/template PRs that are relevant to V3.1 even though they are not the core bring-up PRs:
tool_chat_template_deepseekv31.jinja.DeepSeek V3.1 has a distinct tool-call wire format:
<|tool▁calls▁begin|><|tool▁call▁begin|>{name}<|tool▁sep|>{json_args}<|tool▁call▁end|><|tool▁calls▁end|>
It does not use V3's function literal or fenced json block.
Key PR:
Success check:
--tool-call-parser deepseekv31 resolves to DeepSeekV31DetectorV3.1 thinking mode is toggled through the chat template, not through the R1 parser.
Key PR:
Rules:
--reasoning-parser deepseek-v3extra_body: {"chat_template_kwargs": {"thinking": true}}<think> when thinking is enabled and </think> when non-thinking is desiredV3.1 multi-turn tool calls can pass tool["function"]["arguments"] as either a dict or an already serialized JSON string. The template must not double-escape JSON strings.
Key PR:
Success check:
tojsonThe structural trigger should be the generic per-call begin token, not the full name-specific begin string. Streaming must also preserve arguments that arrive in the first chunk and normal text before the tool marker.
Key PRs:
Success check:
structure_info().trigger is <|tool▁call▁begin|>V3.1 still depends on DeepSeek V3/R1 loader, MLA, MoE, and MTP surfaces.
Key PRs:
Success check:
test/manual/test_deepseek_v31.py covers TP8 and TP8+MTPenable_dp_attention flagsV3.1 shares the DeepSeek MoE shape with V3/V3.2-style fused MoE config work.
Key PR:
Success check:
E=257,N=256,fp8_w8a8Use the narrowest lane that matches the change:
DeepSeekV31Detector CPU unit teststest/manual/test_deepseek_chat_templates.pytest/manual/test_deepseek_v31.pytest/manual/nightly/test_deepseek_v31_perf.pysglang-deepseek-v3-r1-optimizationdevelopment
Run an autonomous Humanize-governed vLLM SOTA performance loop for one LLM model: first perform the fixed fair vLLM/SGLang/TensorRT-LLM deployment search and benchmark, then start one RLCR loop that repeatedly decides the gap, profiles the current bottleneck, runs layer/kernel pipeline analysis, patches vLLM code, optionally uses ncu-report-skill for kernel evidence, and revalidates until vLLM matches or beats the best observed framework under the same workload and SLA.
devops
Inspect LLM torch profiler traces at forward-pass, layer, and kernel level. Use when you need layer timings, anchor-kernel boundaries, representative kernel flows, or Perfetto time ranges.
development
Run an autonomous Humanize-governed SGLang SOTA performance loop for one LLM model: first perform the fixed fair SGLang/vLLM/TensorRT-LLM deployment search and benchmark, then start one RLCR loop that repeatedly decides the gap, profiles the current bottleneck, runs layer/kernel pipeline analysis, patches SGLang code, optionally uses ncu-report-skill for kernel evidence, and revalidates until SGLang matches or beats the best observed framework under the same workload and SLA.
documentation
Use when an SGLang, vLLM, or TensorRT-LLM serving/model optimization task needs prior model-family PR evidence. Query and read the PR-driven history docs under model-pr-optimization-history before choosing source paths, fast paths, kernel/fusion ideas, regression risks, or validation lanes.