skills/sglang-sota-performance/SKILL.md
End-to-end SGLang SOTA performance workflow. Use when a user names an LLM model and wants SGLang to match or beat the best observed vLLM and TensorRT-LLM serving performance by searching each framework's best deployment command, benchmarking them fairly, profiling SGLang if it is slower, identifying kernel/overlap/fusion bottlenecks, patching SGLang code, and revalidating with real model runs.
Use this skill as the top-level optimization loop for one model at a time. It composes two lower-level skills:
- llm-serving-auto-benchmark: search and compare best deployment commands across SGLang, vLLM, and TensorRT-LLM.
- llm-torch-profiler-analysis: capture or analyze torch-profiler traces and produce kernel, overlap-opportunity, and fuse-pattern tables.

This skill's goal is not "run one benchmark." Its goal is a reproducible SGLang improvement loop: tune every framework fairly, prove whether SGLang is behind, explain the gap with profiler evidence, patch SGLang, and re-run the same model workload until the result is SOTA for the target environment.
Treat "SOTA" as "best observed, reproducible performance under the recorded model, workload, hardware, framework commits, precision, and SLA." Do not claim global SOTA without enough external evidence.
Before a real run, read only the needed sections from:
- ../llm-serving-auto-benchmark/SKILL.md
- ../llm-torch-profiler-analysis/SKILL.md

If the run uses a remote GPU host, also read the matching host skill such as
h100, b200, rtx5090, or another operator-side skill that gives SSH,
container, workspace, and artifact-path conventions.
Collect or infer these before starting a long search:
If the user only provides a model, choose a reasonable first workload and state
it explicitly. Prefer the closest cookbook config from
llm-serving-auto-benchmark/configs/cookbook-llm/ when available.
Use one run directory per model and date, for example:

    runs/YYYYMMDD_<model_slug>_sota_loop/
      manifest.txt
      help/
      benchmark/
      profiles/
      analysis/
      patches/
      final_report.md

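The layout above can be created up front so every later step has a fixed place to write. This is a minimal sketch, not part of the skill itself: the `model_slug` rule (lowercase, non-alphanumerics collapsed to `_`) and the `create_run_dir` helper are assumptions made for illustration.

```python
import re
from datetime import date
from pathlib import Path

# Subdirectories taken from the run-directory layout in this document.
SUBDIRS = ("help", "benchmark", "profiles", "analysis", "patches")


def model_slug(model: str) -> str:
    """Hypothetical slug rule: lowercase, runs of non-alphanumerics become '_'."""
    return re.sub(r"[^a-z0-9]+", "_", model.lower()).strip("_")


def create_run_dir(root: Path, model: str, day: date) -> Path:
    """Create runs/YYYYMMDD_<model_slug>_sota_loop/ with its fixed children."""
    run_dir = root / "runs" / f"{day:%Y%m%d}_{model_slug(model)}_sota_loop"
    for name in SUBDIRS:
        (run_dir / name).mkdir(parents=True, exist_ok=True)
    # Empty placeholders; later steps fill them in.
    (run_dir / "manifest.txt").touch()
    (run_dir / "final_report.md").touch()
    return run_dir
```

Calling `create_run_dir(Path("."), "Qwen/Qwen2.5-7B-Instruct", date.today())` is safe to repeat, because existing directories and files are left in place.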
Record exact framework versions, git commits, container names/images, CUDA/NCCL versions, GPU ids, launch commands, benchmark commands, and environment knobs. Never write Hugging Face tokens or other secrets into artifacts.
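One way to keep secrets out of the manifest is to redact environment values by variable name before writing them. The sketch below assumes a plain `key: value` manifest format and a name-based pattern. Both are illustrative choices, and `redact_env` and `manifest_lines` are hypothetical helpers.

```python
import re

# Assumption: variables whose NAME contains one of these words hold secrets.
SECRET_NAME = re.compile(r"TOKEN|SECRET|PASSWORD|API_KEY", re.IGNORECASE)


def redact_env(env: dict[str, str]) -> dict[str, str]:
    """Replace the value of every secret-looking variable with a placeholder."""
    return {k: ("<redacted>" if SECRET_NAME.search(k) else v) for k, v in env.items()}


def manifest_lines(fields: dict[str, str], env: dict[str, str]) -> list[str]:
    """Render recorded fields, then redacted environment knobs, as sorted lines."""
    lines = [f"{k}: {v}" for k, v in sorted(fields.items())]
    lines += [f"env.{k}: {v}" for k, v in sorted(redact_env(env).items())]
    return lines
```

Name-based redaction does not catch a token pasted directly into a launch command, so recorded commands still need a manual check.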
Verify the model can be loaded by each framework before launching a sweep.
Capture each framework's current --help output and version. Remove candidate
flags that are not accepted by that exact environment.
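The flag-pruning step can be sketched as a comparison between each candidate's flags and the long options that appear in the captured `--help` text. This is an assumption-laden illustration: the helper names are hypothetical, and a real help page may list options in ways this simple regex misses.

```python
import re
from typing import Optional


def accepted_flags(help_text: str) -> set[str]:
    """Collect every long option (--like-this) mentioned in the help text."""
    return set(re.findall(r"(?<![\w-])--[a-zA-Z][\w-]*", help_text))


def split_candidate(
    candidate: dict[str, Optional[str]], help_text: str
) -> tuple[dict[str, Optional[str]], list[str]]:
    """Return (flags this environment accepts, flags to drop).

    `candidate` maps a flag to its value, or to None for a boolean switch.
    """
    known = accepted_flags(help_text)
    kept = {flag: value for flag, value in candidate.items() if flag in known}
    dropped = sorted(flag for flag in candidate if flag not in known)
    return kept, dropped
```

Saving the dropped list under `help/` next to the raw help output makes it clear later why a candidate command differs from the one originally proposed.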
For TensorRT-LLM, keep the server backend within the scope of
llm-serving-auto-benchmark: trtllm-serve serve --backend pytorch.
If that backend is unavailable, mark TensorRT-LLM unsupported for the run
instead of silently switching to a different serving stack.
Use llm-serving-auto-benchmark as the source of truth for benchmark fairness,
candidate generation, result schema, and comparison tables.
Run a bounded search for every available framework. Do not compare SGLang's tuned command against competitor defaults. Each framework must get a real chance to find its best deployment command under the same:
Keep failed candidates and their failure reasons. The fastest SLA-failing candidate is not the winner.
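The winner rule can be made explicit: only candidates that ran and passed the SLA compete on throughput. The `Candidate` record and `pick_winner` function below are hypothetical and are not the result schema from llm-serving-auto-benchmark.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    req_per_s: float                      # 0.0 if the run produced no numbers
    sla_passed: bool
    failure_reason: Optional[str] = None  # kept for the report, never discarded


def pick_winner(candidates: list[Candidate]) -> Optional[Candidate]:
    """Highest-throughput candidate among those that ran and passed the SLA."""
    passing = [c for c in candidates if c.sla_passed and c.failure_reason is None]
    return max(passing, key=lambda c: c.req_per_s) if passing else None
```

Failed candidates stay in the input list, so the report can still show them with their failure reasons.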
Normalize the benchmark output with
llm-serving-auto-benchmark/scripts/compare_benchmark_results.py.
The comparison must include:
If SGLang is within benchmark noise of the best framework, rerun enough samples to decide whether the difference is real. Use a default regression threshold of 3-5% unless the user specifies a tighter target.
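The threshold check is simple arithmetic on mean throughput. The sketch below assumes throughput in req/s and uses 3% (the low end of the stated range) as the default. `relative_gap` and `classify_gap` are illustrative names.

```python
from statistics import mean


def relative_gap(sglang_samples: list[float], best_samples: list[float]) -> float:
    """Fraction by which SGLang's mean throughput trails the best framework.

    Negative when SGLang is ahead.
    """
    sglang, best = mean(sglang_samples), mean(best_samples)
    return (best - sglang) / best


def classify_gap(gap: float, threshold: float = 0.03) -> str:
    """Label a gap against the regression threshold (assumed default: 3%)."""
    if gap <= 0:
        return "sglang_ahead_or_tied"
    return "within_threshold" if gap < threshold else "regression"
```

As a worked example, take the single-sample req/s figures from the validation tables later in this document. For Qwen2.5-32B, 18.47 against 18.78 is a gap of about 1.7%, which is within the threshold. For Qwen3-8B, 21.64 against 22.88 is about 5.4%, which counts as a regression even at a 5% threshold. Real decisions should use several samples per framework, as the text above requires.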
If SGLang is meaningfully slower, fails SLA while another framework passes, or uses much more memory for the same workload, run profiler triage before patching.
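The three triage triggers can be written as one predicate that also records why profiling is needed. This is a sketch under stated assumptions: the source does not quantify "much more memory", so the `mem_ratio` default of 1.2 is a placeholder, and the `FrameworkResult` fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FrameworkResult:
    req_per_s: float
    sla_passed: bool
    peak_mem_gib: float


def triage_reasons(
    sglang: FrameworkResult,
    best: FrameworkResult,
    gap_threshold: float = 0.03,  # assumed: low end of the 3-5% range
    mem_ratio: float = 1.2,       # assumed stand-in for "much more memory"
) -> list[str]:
    """Return the reasons profiler triage is required; empty means none."""
    reasons = []
    if best.req_per_s > 0:
        gap = (best.req_per_s - sglang.req_per_s) / best.req_per_s
        if gap >= gap_threshold:
            reasons.append("slower")
    if best.sla_passed and not sglang.sla_passed:
        reasons.append("sla")
    if sglang.peak_mem_gib > mem_ratio * best.peak_mem_gib:
        reasons.append("memory")
    return reasons
```

Returning the list of reasons rather than a bare boolean lets the report state which symptom the profiler run is meant to explain.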
Use llm-torch-profiler-analysis against the SGLang best command first:
- Pass --profile-workload both; the profiler skill labels prefill/ and decode/ by workload directory for this mode.
- Capture separate extend/prefill and decode traces; do not use one mixed request as the default profiler workload.
- Prefill uses the slow input length with output 1, and decode uses input 1 with the slow output length.

Profile the winning competitor too when the SGLang table alone cannot explain why the other framework is faster. Compare stage by stage, not just total QPS.
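The stage-separated workloads follow mechanically from the slow benchmark workload: prefill keeps the input length and generates one token, and decode starts from one input token and keeps the output length. A minimal sketch, with `Workload` and `stage_workloads` as hypothetical names:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    input_len: int
    output_len: int


def stage_workloads(slow: Workload) -> dict[str, Workload]:
    """Derive the prefill-only and decode-only profiler workloads."""
    return {
        "prefill": Workload(slow.input_len, 1),  # long prompt, single output token
        "decode": Workload(1, slow.output_len),  # single input token, long generation
    }
```

For the 512/64 workload used in the validation runs in this document, this gives prefill 512->1 and decode 1->64.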
Use the profiler tables to identify the narrowest plausible bottleneck.
Typical signals:
Do not patch from vibes. State the table row, stage, source location, and benchmark symptom that justify the code change.
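One way to enforce this rule is to refuse a patch until all four pieces of evidence are written down. The `PatchEvidence` record below is an illustrative structure, not something the skill defines.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PatchEvidence:
    table_row: str          # which profiler table row shows the problem
    stage: str              # e.g. prefill or decode
    source_location: str    # file and function the patch will touch
    benchmark_symptom: str  # the measured gap this is meant to close


def missing_evidence(evidence: PatchEvidence) -> list[str]:
    """Names of evidence fields that are still blank."""
    return [f.name for f in fields(evidence) if not getattr(evidence, f.name).strip()]
```

A patch is only justified once `missing_evidence` returns an empty list. A blank field means the change is still a guess.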
Patch SGLang only after the benchmark gap and profiler evidence agree.
Good patch candidates:
Avoid changes that merely make the benchmark easier:
Keep patches minimal and local. Add focused tests when behavior changes, and add microbenchmarks or profiler evidence when performance is the only intended change.
After patching, rerun:
If the patch changes SGLang's available knobs, re-search SGLang's best command. If competitor versions or commands changed during the work, rerun their best commands too. Preserve before/after artifacts.
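The conditional reruns described above can be sketched as a small planning function. The step wording and the `rerun_plan` name are assumptions. The function encodes only the conditions stated in this paragraph, plus the same-workload rerun that the loop always requires.

```python
def rerun_plan(knobs_changed: bool, competitors_changed: bool) -> list[str]:
    """List the revalidation steps implied by what changed during the work."""
    steps = ["rerun the same SGLang benchmark workload with the patch applied"]
    if knobs_changed:
        # New or altered flags can move SGLang's optimum, so search again.
        steps.append("re-search SGLang's best command")
    if competitors_changed:
        # Stale competitor numbers would make the comparison unfair.
        steps.append("rerun each competitor's best command")
    steps.append("preserve before/after artifacts")
    return steps
```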
On 2026-05-01, this workflow was smoke-validated on h100_sglang with two
real model runs and two competitor checks per run. Artifacts were saved
under
/data/bbuf/validate/sglang_sota_performance_skill/runs/20260501_two_model_validation.
| Model | GPUs | Workload | SGLang result | vLLM check | TensorRT-LLM check |
| --- | --- | --- | --- | --- | --- |
| Qwen/Qwen2.5-7B-Instruct | 2x H100, TP=2 | random, input 512/output 64, 24 prompts, 10 warmup requests | 52.09 req/s, mean TTFT 144.85 ms, mean ITL 4.91 ms | 51.06 req/s, mean TTFT 159.19 ms, mean ITL 4.85 ms | 49.71 req/s, mean TTFT 177.54 ms, mean ITL 4.77 ms |
| Qwen/Qwen2.5-32B-Instruct | 4x H100, TP=4 | random, input 512/output 64, 16 prompts, 10 warmup requests | 18.47 req/s, mean TTFT 247.06 ms, mean ITL 9.66 ms | 18.78 req/s, mean TTFT 218.68 ms, mean ITL 9.98 ms | 15.48 req/s, mean TTFT 445.62 ms, mean ITL 9.27 ms |
Use this only as a workflow health check, not as a universal performance
claim. The TensorRT-LLM checks used trtllm-serve serve --backend pytorch and
the same OpenAI-compatible random workload.
Additional 2-card validation on 2026-05-01 exercised the full handoff from
bounded cross-framework search into SGLang stage-separated profiling. The
benchmark workload was random input 512, output 64, 8 prompts, and the
profiler used the same slow-workload lengths: prefill 512->1 and decode
1->64, with warmup 10 and capture 5.
| Model | GPUs | Best SGLang | Best vLLM | Profiler result | Artifact root |
| --- | --- | --- | --- | --- | --- |
| Qwen/Qwen3-8B | 2x H100, TP=2 | sglang_mem086, 21.64 req/s | vllm_mem080, 22.88 req/s | kernel, overlap, and fuse tables rendered with separate extend/prefill and decode sections | /data/bbuf/validate/core_skill_validation_20260501/qwen3_8b/sota |
| mistralai/Mistral-7B-Instruct-v0.3 | 2x H100, TP=2 | sglang_mem080, 24.09 req/s | vllm_mem090, 24.76 req/s | kernel, overlap, and fuse tables rendered with separate extend/prefill and decode sections | /data/bbuf/validate/core_skill_validation_20260501/mistral_7b_instruct_v03/sota |
Stop with a clear report when any of these is true:
Return a compact report with:
If no code patch was needed, say why and include the benchmark evidence. If a patch was attempted but not enough, be explicit about the remaining gap.