skills/llm-serving-capacity-planner/SKILL.md
Parse SGLang/vLLM startup logs to explain GPU memory use and request capacity. Use for KV cache budget, mem-fraction-static comparisons, OOM triage, and max-concurrency estimates.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS llm-serving-capacity-planner
Use this when a serving log has enough memory lines to explain where GPU HBM went. The analyzer reads SGLang/vLLM startup logs, extracts weight load, KV pool, CUDA graph, framework overhead, and token-capacity lines, then estimates concurrent requests for common token lengths.
Before running analysis, collect or verify these inputs:
| Item | Why it matters | How to obtain | Default if user skips |
|---|---|---|---|
| Log file path | Primary input; all memory data comes from here | Ask user for the serving startup log | — (required) |
| GPU type | Determines total HBM for decomposition validation | Ask user or infer from log | Auto-detected from log if possible |
| nvidia-smi output | Provides per-rank actual memory for cross-validation | Capture with nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv,noheader > smi.txt | — (optional, but recommended) |
| Model config.json | Enables theoretical KV cache byte calculation and replication factor analysis | Ask user for the model's config.json path | — (optional, log data used instead) |
| Request token length | Determines concurrency estimate denominator | Ask user | 4096, 6144, 8192 |
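The role of the request token length as the denominator can be illustrated with a minimal sketch. It assumes the estimate is simply the KV pool's token capacity divided by the tokens each request holds; the function name and the pool size below are hypothetical, and the bundled analyzer may apply further caps (for example `max_running_requests`).

```python
def estimate_concurrency(max_total_num_tokens: int, tokens_per_request: int) -> int:
    """Whole requests that fit in the KV pool at a given per-request length."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    return max_total_num_tokens // tokens_per_request


# Hypothetical pool size; in practice this comes from max_total_num_tokens in the log.
pool_tokens = 500_000

# Default request lengths from the table above.
estimates = {n: estimate_concurrency(pool_tokens, n) for n in (4096, 6144, 8192)}
for length, count in estimates.items():
    print(f"{length} tokens/request -> about {count} concurrent requests")
```

Floor division is used because a request that only partly fits in the pool cannot run.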
The user should provide the startup log from an SGLang or vLLM serving instance. Key log lines that the analyzer needs:
- `Load weight begin. avail mem=XX GB`
- `Memory profiling: available_gpu_memory=XX GB, ...` (newer sglang)
- `SW KV memory calculation: bytes_per_full_token=XX, available_bytes=XX GB, full_token=XX` (SWA models like DeepSeek-V4)
- `Memory pool end. avail mem=XX GB`
- `Capture cuda graph end. ... mem usage=XX GB. avail mem=XX GB.`
- `max_total_num_tokens=XX, ... max_running_requests=XX, ... available_gpu_mem=XX GB`
- `server_args=ServerArgs(...)` (for serving parameters)

If the log is from a running instance, capture it by redirecting stdout/stderr to a file at launch time.
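The key log lines listed above can be pulled out with a few regular expressions. The following is a rough sketch, not the analyzer's actual implementation: the sample log text, the field names, and `parse_startup_log` are all hypothetical, and the patterns are built only from the line shapes shown above.

```python
import re

# Hypothetical excerpt shaped like the key lines listed above. Real logs carry
# timestamps, rank prefixes, and extra fields, so the patterns stay loose.
SAMPLE_LOG = """\
Load weight begin. avail mem=139.20 GB
Memory pool end. avail mem=14.50 GB
Capture cuda graph end. mem usage=2.30 GB. avail mem=12.20 GB.
max_total_num_tokens=500000, max_running_requests=256, available_gpu_mem=12.20 GB
"""

# Field name -> (pattern with one capture group, converter for the captured text).
PATTERNS = {
    "avail_before_weights_gb": (re.compile(r"Load weight begin\..*?avail mem=([\d.]+) GB"), float),
    "avail_after_pool_gb": (re.compile(r"Memory pool end\..*?avail mem=([\d.]+) GB"), float),
    "cuda_graph_usage_gb": (re.compile(r"Capture cuda graph end\..*?mem usage=([\d.]+) GB"), float),
    "max_total_num_tokens": (re.compile(r"max_total_num_tokens=(\d+)"), int),
    "max_running_requests": (re.compile(r"max_running_requests=(\d+)"), int),
}


def parse_startup_log(text: str) -> dict:
    """Return the first value found for each known field."""
    values = {}
    for line in text.splitlines():
        for name, (pattern, convert) in PATTERNS.items():
            match = pattern.search(line)
            if match and name not in values:  # keep the first hit only
                values[name] = convert(match.group(1))
    return values


parsed = parse_startup_log(SAMPLE_LOG)
print(parsed)
```

Keeping only the first match per field is a simplification: multi-rank logs repeat these lines once per rank, so a fuller parser would group values by rank.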
For per-rank memory comparison:
docker exec <container> nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv,noheader > smi.txt
python3 skills/llm-serving-capacity-planner/scripts/capacity_analyzer.py \
--log-file /path/to/sglang.log \
--nvidia-smi-file /path/to/smi.txt \
--gpu h200 \
--config-json /path/to/config.json
For JSON output (automation):
python3 skills/llm-serving-capacity-planner/scripts/capacity_analyzer.py \
--log-file /path/to/sglang.log \
--format json
The analyzer prints:
- `--mem-fraction-static` values and their impact on KV pool capacity

`--mem-fraction-static` controls what fraction of available GPU memory after weight loading is reserved for the KV cache pool. Higher values give more KV capacity but less headroom for CUDA graph and other runtime buffers.
- 0.88 (default): aggressive; 88% of post-weight memory goes to the KV pool
- 0.60: conservative; more free memory is left for runtime, but KV capacity is significantly lower

When `num_key_value_heads < tp_size`, the KV cache is replicated across all TP ranks rather than split. For example, with kv_heads=1 and tp=8, each of the 8 cards stores a full copy of the KV cache, which is 8x the per-card KV memory of a split layout.
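The replication rule above can be made concrete with a small sketch. It assumes a standard MHA/GQA layout in which each rank stores K and V for its share of KV heads (at least one head per rank); it does not cover MLA or compressed-attention models. The function names and the model dimensions in the example are hypothetical.

```python
def kv_heads_per_rank(num_key_value_heads: int, tp_size: int) -> int:
    """KV heads stored on one TP rank; a rank always holds at least one head."""
    return max(1, num_key_value_heads // tp_size)


def kv_replication_factor(num_key_value_heads: int, tp_size: int) -> float:
    """Per-rank KV memory relative to an ideal even split across ranks."""
    return kv_heads_per_rank(num_key_value_heads, tp_size) * tp_size / num_key_value_heads


def kv_bytes_per_token_per_rank(num_layers: int, num_key_value_heads: int,
                                head_dim: int, dtype_bytes: int, tp_size: int) -> int:
    """Theoretical K+V bytes one token occupies on one rank (plain MHA/GQA only)."""
    heads = kv_heads_per_rank(num_key_value_heads, tp_size)
    return 2 * num_layers * heads * head_dim * dtype_bytes  # 2 = one K plus one V


# The kv_heads=1, tp=8 case from the text, next to a case that splits evenly.
replicated = kv_replication_factor(num_key_value_heads=1, tp_size=8)
split = kv_replication_factor(num_key_value_heads=8, tp_size=8)
print(replicated, split)

# Hypothetical model: 32 layers, head_dim 128, 2-byte KV dtype.
per_token = kv_bytes_per_token_per_rank(32, 1, 128, 2, tp_size=8)
print(per_token)
```

In both example cases each rank holds exactly one KV head, so the per-rank bytes per token are identical; the replication factor only shows how much of that storage is duplicated across ranks.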
Models like DeepSeek-V4 use CSA (Compressed Sliding Attention) and HCA
(Hierarchical Context Attention) with sliding windows. This reduces per-token
KV cache bytes compared to the theoretical full-attention calculation. The
bytes_per_full_token reported in the log already accounts for this
compression.
Include:
| Limitation | Detail | Workaround |
|---|---|---|
| SGLang-specific patterns | Currently only SGLang log patterns are fully supported | vLLM patterns to be added as encountered |
| SWA compression models | Per-token KV bytes cannot be independently calculated from model config for CSA/HCA attention — the framework's internal SWA window parameters are needed | Use bytes_per_full_token from the log directly |
| DeepGEMM JIT memory | The analyzer categorizes DeepGEMM JIT compilation memory as "other" because it is not explicitly reported in the log | Compare with nvidia-smi total for accurate accounting |
| PP (Pipeline Parallelism) | Memory decomposition is per-rank; PP configurations may have uneven memory across stages | Specify --target-rank for each PP stage |
| MoE expert buffer | Some frameworks allocate additional buffers for expert routing that are not separately reported | Included in "model weights" or "other" depending on when allocated |
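The nvidia-smi cross-check suggested in the table can be sketched as follows. This is an illustration only: it assumes the `index, memory.used, memory.free` CSV rows produced by the query shown earlier (values suffixed with `MiB`), assumes the log's "GB" figures are GiB, and treats "other" as whatever remains after the logged components are subtracted. The sample rows, component sizes, and function names are hypothetical.

```python
def parse_smi_csv(text: str) -> dict:
    """Parse 'index, memory.used, memory.free' rows written with --format=csv,noheader."""
    ranks = {}
    for line in text.strip().splitlines():
        index, used, free = [field.strip() for field in line.split(",")]
        ranks[int(index)] = {
            "used_mib": int(used.split()[0]),  # drop the trailing "MiB" unit
            "free_mib": int(free.split()[0]),
        }
    return ranks


def unexplained_gb(used_mib: int, weights_gb: float, kv_pool_gb: float,
                   cuda_graph_gb: float) -> float:
    """Memory nvidia-smi reports as used that no logged component accounts for."""
    used_gb = used_mib / 1024  # assumption: the log's GB values are GiB
    return used_gb - weights_gb - kv_pool_gb - cuda_graph_gb


# Hypothetical smi.txt content for two ranks.
SAMPLE_SMI = """\
0, 131072 MiB, 12288 MiB
1, 130048 MiB, 13312 MiB
"""

ranks = parse_smi_csv(SAMPLE_SMI)
# Hypothetical per-rank component sizes taken from the startup log.
other = {
    idx: unexplained_gb(mem["used_mib"], weights_gb=80.0, kv_pool_gb=40.0, cuda_graph_gb=2.5)
    for idx, mem in ranks.items()
}
print(other)
```

A large residual on one rank is a hint to look for allocations the log does not report, such as the DeepGEMM JIT or expert-routing buffers mentioned above.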
- `references/log-patterns.md`: log line patterns and their semantics for memory analysis.
- `references/gpu-specs.json`: GPU HBM specifications for h20, h100, h200, and b200 aliases.
- `scripts/capacity_analyzer.py`: the core analysis script.