skills/model-optimization/vllm/vllm-deepseek-v4-optimization/SKILL.md
PR-backed optimization manual for DeepSeek V4 in vLLM. Use when an engineer needs to audit, debug, extend, or document DeepSeek V4 current-main support in vLLM, including the model module, MTP path, tokenizer/renderer, DSML tool parser, expert-dtype handling, and BF16 persistent-topk follow-up.
npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS vllm-deepseek-v4-optimization
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill tracks DeepSeek V4 as a current-main vLLM model family. The old open bring-up PR was superseded by the rebased mainline landing, and the runtime now includes the registry aliases, model/MTP files, tokenizer/renderer, tool parser, and follow-up handling for FP4-vs-FP8 expert checkpoints.
Evidence snapshot:
- origin/main: fd74c90d9 on 2026-04-27
- #40860
- #40760, #40860, #40806, #41006, #40811
- references/pr-history.md
- model-pr-optimization-history/vllm/deepseek-v4/README.zh.md and README.en.md

Use skills/model-optimization/model-pr-diff-dossier/SKILL.md as the bar. DeepSeek V4 support claims must be tied to current-main files and diff-reviewed PR cards, not only PR titles.
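One way to keep a support claim tied to current-main files is to confirm that the files this skill names actually exist in the checkout being audited. The sketch below assumes the root directory is the one that contains the `vllm/` checkout, so the paths resolve exactly as written in this skill. The names `EVIDENCE_FILES` and `missing_evidence` are hypothetical helpers for illustration and are not part of vLLM.

```python
from pathlib import Path

# Paths copied from this skill's current-main file list.
EVIDENCE_FILES = [
    "vllm/vllm/model_executor/models/registry.py",
    "vllm/vllm/model_executor/models/deepseek_v4.py",
    "vllm/vllm/model_executor/models/deepseek_v4_mtp.py",
    "vllm/vllm/model_executor/layers/deepseek_v4_attention.py",
    "vllm/vllm/model_executor/layers/deepseek_compressor.py",
    "vllm/vllm/tokenizers/deepseek_v4.py",
    "vllm/vllm/renderers/deepseek_v4.py",
    "vllm/vllm/tool_parsers/deepseekv4_tool_parser.py",
    "vllm/csrc/persistent_topk.cuh",
    "vllm/csrc/topk.cu",
]


def missing_evidence(checkout_root, files=EVIDENCE_FILES):
    """Return the listed paths that are not regular files under checkout_root.

    checkout_root is assumed to be the directory that contains the vllm/ checkout.
    """
    root = Path(checkout_root)
    return [rel for rel in files if not (root / rel).is_file()]


if __name__ == "__main__":
    import sys

    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for rel in missing_evidence(root):
        print(f"missing: {rel}")
```

An empty result only shows that the files are present at the audited commit. It does not replace reading the diff-reviewed PR cards.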
Current-main files:
- vllm/vllm/model_executor/models/registry.py
- vllm/vllm/model_executor/models/deepseek_v4.py
- vllm/vllm/model_executor/models/deepseek_v4_mtp.py
- vllm/vllm/model_executor/layers/deepseek_v4_attention.py, vllm/vllm/model_executor/layers/deepseek_compressor.py
- vllm/vllm/tokenizers/deepseek_v4.py, vllm/vllm/renderers/deepseek_v4.py
- vllm/vllm/tool_parsers/deepseekv4_tool_parser.py
- vllm/csrc/persistent_topk.cuh, vllm/csrc/topk.cu

PR notes:
- #40860 merged the rebased DeepSeek V4 bring-up: model class, MTP model, MLA attention, tokenizer/renderer, registry aliases, and DSML parser.
- #40760 is now a closed predecessor, useful only as historical context for the initial bring-up shape.
- #41006 added expert-dtype-aware dispatch so FP4 expert checkpoints keep the MXFP4 path while Flash-Base / FP8 expert checkpoints use the FP8 MoE path and suffix mapping.
- #40806 fixed DSML marker leakage for DSV3.2/DSV4 streaming parser paths.
- #40811 extends persistent top-k from FP32-only assumptions to BF16 input support, which matters for the DeepSeek V4 sparse indexer path.

PR titles:
- [Feat] DeepSeek V4 Rebased: mainline support landing.
- Support DSV4 base: FP4/FP8 expert-dtype handling and mapper split.
- [Bugfix] Fix the DSML token leakage in DSV4/3.2: merged parser leak fix.
- [Perf][Kernel] BF16 input support for persistent topK - DeepSeekV4: still open follow-up.
- [New Model] Support DeepseekV4: closed predecessor superseded by #40860.

Validation:
- Check registry.py before claiming a new architecture alias.
- Confirm that expert_dtype="fp8" selects FP8 MoE scales and does not route through MXFP4.
- After #40811 merges, rerun kernel tests in tests/kernels/test_top_k_per_row.py, especially BF16 decode and long-context cases.

References:
- references/pr-history.md: diff-reviewed DeepSeek V4 cards.
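The expert-dtype split attributed to #41006 can be pictured as a small dispatch table: FP4 expert checkpoints stay on the MXFP4 path, FP8 expert checkpoints go to the FP8 MoE path. The following is a minimal sketch of that idea only, not vLLM's code. `MoEPath` and `select_moe_path` are hypothetical names, and the `"fp4"` string is an assumed value; only `expert_dtype="fp8"` appears in this skill.

```python
from enum import Enum


class MoEPath(Enum):
    """Hypothetical labels for the two expert execution paths."""
    MXFP4 = "mxfp4"
    FP8 = "fp8"


def select_moe_path(expert_dtype: str) -> MoEPath:
    """Map a checkpoint's expert dtype to an MoE path.

    "fp8" comes from this skill; "fp4" is an assumed spelling for illustration.
    """
    dtype = expert_dtype.lower()
    if dtype == "fp4":
        return MoEPath.MXFP4
    if dtype == "fp8":
        return MoEPath.FP8
    # Fail loudly instead of silently falling back to one path.
    raise ValueError(f"unsupported expert_dtype: {expert_dtype!r}")


if __name__ == "__main__":
    for dtype in ("fp4", "fp8"):
        print(dtype, "->", select_moe_path(dtype).value)
```

A validation run can assert the same property against the real model: an FP8 expert checkpoint must never reach the MXFP4 branch, and an unknown dtype should raise rather than default.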
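Marker leakage in a streaming parser typically happens when a marker arrives split across two deltas and the first half is emitted as ordinary text. A common generic remedy is to hold back any trailing text that could still grow into the marker. The sketch below illustrates that technique under stated assumptions: it is not the fix from #40806, `MarkerSafeStream` is a hypothetical class, and `"<DSML>"` is a placeholder string rather than the real DSML marker.

```python
def held_back_len(text: str, marker: str) -> int:
    """Length of the longest suffix of text that is a proper prefix of marker."""
    for n in range(min(len(text), len(marker) - 1), 0, -1):
        if text.endswith(marker[:n]):
            return n
    return 0


class MarkerSafeStream:
    """Emit only text that can no longer turn into the marker."""

    def __init__(self, marker: str):
        self.marker = marker
        self.buffer = ""      # unclassified tail that may be a partial marker
        self.tool_text = ""   # everything from the marker onward
        self.seen_marker = False

    def feed(self, delta: str) -> str:
        """Consume one streamed delta and return the text safe to show."""
        if self.seen_marker:
            self.tool_text += delta
            return ""
        self.buffer += delta
        idx = self.buffer.find(self.marker)
        if idx != -1:
            visible = self.buffer[:idx]
            self.tool_text = self.buffer[idx:]
            self.buffer = ""
            self.seen_marker = True
            return visible
        cut = len(self.buffer) - held_back_len(self.buffer, self.marker)
        visible, self.buffer = self.buffer[:cut], self.buffer[cut:]
        return visible

    def flush(self) -> str:
        """At end of stream, a held-back partial prefix was ordinary text."""
        visible, self.buffer = self.buffer, ""
        return visible


if __name__ == "__main__":
    stream = MarkerSafeStream("<DSML>")  # placeholder marker
    shown = "".join(stream.feed(d) for d in ["Hello <D", "SML>call", "()"])
    print(repr(shown + stream.flush()), repr(stream.tool_text))
```

A regression test for a real parser can follow the same shape: split the marker at every possible offset across two deltas and assert that no fragment of it reaches the visible output.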