skills/model-optimization/sglang/sglang-deepseek-v4-optimization/SKILL.md
PR-backed and current-main optimization manual for DeepSeek-V4 in SGLang. Use when an engineer needs to audit or extend DeepSeek-V4 Flash/Pro serving recipes, FP4-vs-FP8 checkpoint selection, H200/B200/GB300 launch commands, DeepEP dispatch-token budgets, context-parallel and PD-disaggregation recipes, MTP/EAGLE settings, or DeepSeek-V4 parser flags.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

    npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS sglang-deepseek-v4-optimization
DeepSeek-V4 is now both a current-main runtime lane and a cookbook/command-generator lane in SGLang. The latest PRs include the original deployment matrix, AMD/DeepSeek-V4 runtime integration, CUDA-graph support, DeepGEMM warmup, benchmarking scripts, parser/tool-call support, and model-level fixes.
Current evidence snapshot:
- origin/main: `6fbad22fe` on 2026-04-28
- `python/sglang/srt/models/deepseek_v4.py`
- `python/sglang/srt/models/deepseek_v4_nextn.py`
- `python/sglang/srt/layers/attention/deepseek_v4_backend.py`
- `docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V4.mdx`
- `docs_new/src/snippets/autoregressive/deepseek-v4-deployment.jsx`

Use `skills/model-optimization/model-pr-diff-dossier/SKILL.md` as the production bar.
Every PR cited for this family must be based on diff reading, not only PR titles.
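Diff reading is only meaningful against the files the snapshot names, so it can help to confirm they exist in the local checkout first. The following is a minimal sketch, assuming a local SGLang clone; the helper name `missing_snapshot_files` and the `repo_root` argument are hypothetical, and the path list is copied from the evidence snapshot above.

```python
from pathlib import Path

# Paths copied from the evidence snapshot; nothing here is queried from git.
SNAPSHOT_FILES = [
    "python/sglang/srt/models/deepseek_v4.py",
    "python/sglang/srt/models/deepseek_v4_nextn.py",
    "python/sglang/srt/layers/attention/deepseek_v4_backend.py",
    "docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V4.mdx",
    "docs_new/src/snippets/autoregressive/deepseek-v4-deployment.jsx",
]


def missing_snapshot_files(repo_root):
    """Return the snapshot paths that are not regular files under repo_root."""
    root = Path(repo_root)
    return [rel for rel in SNAPSHOT_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # "." is a placeholder; point it at the checkout being audited.
    for rel in missing_snapshot_files("."):
        print(f"missing: {rel}")
```

An empty result only says the files are present; it does not confirm that the checkout is at the commit recorded in the snapshot.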
Capture:
- `sgl-project/*-FP8`
- `--reasoning-parser deepseek-v4`, `--tool-call-parser deepseekv4`
- `SGLANG_ENABLE_SPEC_V2`
- `--max-running-requests`

Treat the DeepSeek-V4 docs as an executable deployment matrix, not ordinary prose.
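One way to treat the captured items as an executable matrix is to assemble them into a launch line programmatically and reject checkpoints that do not match the FP8 naming. The sketch below is illustrative only: `build_launch` and `render` are hypothetical helpers, the tensor-parallel size and request cap are placeholder values, and setting `SGLANG_ENABLE_SPEC_V2` to `1` is an assumption about how the variable is enabled. The parser values and the checkpoint pattern are the ones named in this skill.

```python
import fnmatch
import shlex

# Checkpoint naming pattern quoted from this skill.
FP8_PATTERN = "sgl-project/DeepSeek-V4-*-FP8"


def build_launch(model_path, tp_size, max_running_requests, spec_v2=False):
    """Return (env, argv) for a DeepSeek-V4 launch line (illustrative)."""
    if not fnmatch.fnmatchcase(model_path, FP8_PATTERN):
        raise ValueError(f"{model_path!r} does not match {FP8_PATTERN!r}")
    # Assumption: the variable is switched on with the value "1".
    env = {"SGLANG_ENABLE_SPEC_V2": "1"} if spec_v2 else {}
    argv = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", str(tp_size),
        "--reasoning-parser", "deepseek-v4",
        "--tool-call-parser", "deepseekv4",
        "--max-running-requests", str(max_running_requests),
    ]
    return env, argv


def render(env, argv):
    """Render env and argv as one shell-safe command string."""
    prefix = [f"{key}={shlex.quote(value)}" for key, value in env.items()]
    return " ".join(prefix + [shlex.quote(arg) for arg in argv])


if __name__ == "__main__":
    # tp_size=8 and max_running_requests=64 are placeholders, not recommendations.
    env, argv = build_launch("sgl-project/DeepSeek-V4-Flash-FP8", 8, 64, spec_v2=True)
    print(render(env, argv))
```

Keeping the command as data makes it easy to diff a generated line against the exact command quoted in a cookbook PR, rather than comparing prose descriptions.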
- `sgl-project/DeepSeek-V4-*-FP8`, not the default DeepSeek FP4-mixed repos.
- `swiglu_limit` clamp in `DeepseekV2MLP` for V4 checkpoints; keep that model-level fix in mind before debugging meaningless-number output.

Before adding DeepSeek-V4 evidence, open the PR diff/source and update `references/pr-history.md` with motivation, key implementation, short code/config excerpts, reviewed files, and validation implications. Docs-only PRs still need exact command/config lines.
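The fields listed above can be enforced with a small card structure before an entry is written to `references/pr-history.md`. This is a sketch under stated assumptions: the `PRCard` class, its field names, and the rendered Markdown layout are hypothetical and not the format the repository actually uses; only the list of required fields comes from this skill.

```python
from dataclasses import dataclass, field


@dataclass
class PRCard:
    """One diff-reviewed PR entry (hypothetical layout)."""
    pr: str                      # PR identifier exactly as it appears upstream
    title: str
    motivation: str
    key_implementation: str
    excerpts: list = field(default_factory=list)        # short code/config lines
    reviewed_files: list = field(default_factory=list)  # files read in the diff
    validation: str = ""

    def problems(self):
        """List reasons this card does not yet count as diff-backed."""
        issues = []
        if not self.reviewed_files:
            issues.append("no reviewed files: card is not diff-backed")
        if not self.excerpts:
            issues.append("no code/config excerpt")
        for name in ("motivation", "key_implementation", "validation"):
            if not getattr(self, name).strip():
                issues.append(f"empty field: {name}")
        return issues

    def to_markdown(self):
        """Render the card; excerpts become indented code blocks."""
        lines = [
            f"### {self.pr}: {self.title}",
            "",
            f"- Motivation: {self.motivation}",
            f"- Key implementation: {self.key_implementation}",
            f"- Reviewed files: {', '.join(self.reviewed_files)}",
            f"- Validation implications: {self.validation}",
            "",
            "Excerpts:",
            "",
        ]
        for excerpt in self.excerpts:
            lines += ["    " + row for row in excerpt.splitlines()]
            lines.append("")
        return "\n".join(lines).rstrip() + "\n"
```

A docs-only PR passes the check only when `excerpts` holds the exact command or config lines, which matches the rule stated above.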
- `sgl-project/DeepSeek-V4-Flash-FP8`.
- `sgl-project/DeepSeek-V4-Pro-FP8` and TP=16 multinode note.
- `DeepseekV4ForCausalLM`, MTP/nextn, DSML parser, compressed attention, and CUDA-graph replay after changes to `deepseek_v4.py` or attention/indexer code.
- `references/pr-history.md`: diff-reviewed DeepSeek-V4 PR cards.