Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

BBuf/llm-serving-capacity-planner

Name: llm-serving-capacity-planner
Author: BBuf

skills/llm-serving-capacity-planner/SKILL.md

npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS llm-serving-capacity-planner

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

LLM Serving Capacity Planner

Overview

Use this when a serving log has enough memory lines to explain where GPU HBM went. The analyzer reads SGLang/vLLM startup logs, extracts weight load, KV pool, CUDA graph, framework overhead, and token-capacity lines, then estimates concurrent requests for common token lengths.

Confirmation Required

Before running analysis, collect or verify these inputs:

| Item | Why it matters | How to obtain | Default if user skips | |---|---|---|---| | Log file path | Primary input; all memory data comes from here | Ask user for the serving startup log | — (required) | | GPU type | Determines total HBM for decomposition validation | Ask user or infer from log | Auto-detected from log if possible | | nvidia-smi output | Provides per-rank actual memory for cross-validation | Capture with nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv,noheader > smi.txt | — (optional, but recommended) | | Model config.json | Enables theoretical KV cache byte calculation and replication factor analysis | Ask user for the model's config.json path | — (optional, log data used instead) | | Request token length | Determines concurrency estimate denominator | Ask user | 4096, 6144, 8192 |

Workflow

Step 1: Collect the serving log

The user should provide the startup log from an SGLang or vLLM serving instance. Key log lines that the analyzer needs:

Load weight begin. avail mem=XX GB
Memory profiling: available_gpu_memory=XX GB, ... (newer sglang)
SW KV memory calculation: bytes_per_full_token=XX, available_bytes=XX GB, full_token=XX (SWA models like DeepSeek-V4)
Memory pool end. avail mem=XX GB
Capture cuda graph end. ... mem usage=XX GB. avail mem=XX GB.
max_total_num_tokens=XX, ... max_running_requests=XX, ... available_gpu_mem=XX GB
server_args=ServerArgs(...) (for serving parameters)

If the log is from a running instance, capture it by redirecting stdout/stderr to a file at launch time.

Step 2: Optionally capture nvidia-smi data

For per-rank memory comparison:

docker exec <container> nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv,noheader > smi.txt

Step 3: Run the analyzer

python3 skills/llm-serving-capacity-planner/scripts/capacity_analyzer.py \
  --log-file /path/to/sglang.log \
  --nvidia-smi-file /path/to/smi.txt \
  --gpu h200 \
  --config-json /path/to/config.json

For JSON output (automation):

python3 skills/llm-serving-capacity-planner/scripts/capacity_analyzer.py \
  --log-file /path/to/sglang.log \
  --format json

Step 4: Review and interpret results

The analyzer prints:

Memory breakdown table: each category (weights, KV pool, CUDA graph, framework, other) with GiB, MiB, percentage, and derivation
Per-rank comparison: nvidia-smi data across all TP ranks
KV pool detail: pool configuration, KV dtype, replication factor, per-token byte calculation
Concurrency estimate: max concurrent requests for different token lengths
Tuning notes: configuration changes that may increase capacity

When To Use It

After launching an LLM serving instance, to understand how GPU memory is distributed
When comparing different --mem-fraction-static values and their impact on KV pool capacity
When planning deployment capacity: how many concurrent requests can a given GPU configuration support
When investigating OOM issues: identifying which memory category is consuming the most
When evaluating whether fp8 KV cache or EP can improve concurrency

Key Concepts

mem-fraction-static

Controls what fraction of available GPU memory after weight loading is reserved for the KV cache pool. Higher values give more KV capacity but less headroom for CUDA graph and other runtime buffers.

0.88 (default): aggressive — 88% of post-weight memory goes to KV pool
0.60: conservative — more free memory left for runtime, but significantly less KV capacity

KV Head Replication

When num_key_value_heads < tp_size, KV cache is replicated across all TP ranks rather than split. For example, models with kv_heads=1, tp=8 means each of the 8 cards stores a full copy of the KV cache — 8x the per-card KV memory compared to a split scenario.

SWA (Sliding Window Attention) Compression

Models like DeepSeek-V4 use CSA (Compressed Sliding Attention) and HCA (Hierarchical Context Attention) with sliding windows. This reduces per-token KV cache bytes compared to the theoretical full-attention calculation. The bytes_per_full_token reported in the log already accounts for this compression.

Reporting Checklist

Include:

Serving configuration: model, GPU, TP/PP/EP, mem-fraction-static, kv-cache-dtype
Memory breakdown table: category / GiB / MiB / percentage / derivation source
Per-rank nvidia-smi comparison: used and free memory per TP rank
KV pool detail: pool size, bytes_per_full_token, KV dtype, replication factor, theoretical per-token KV calculation (when config.json provided)
Concurrency estimate table: request token length / token-limit / request-limit / max concurrent
Tuning notes based on free memory and configuration

Known Limitations

| Limitation | Detail | Workaround | |---|---|---| | SGLang-specific patterns | Currently only SGLang log patterns are fully supported | vLLM patterns to be added as encountered | | SWA compression models | Per-token KV bytes cannot be independently calculated from model config for CSA/HCA attention — the framework's internal SWA window parameters are needed | Use bytes_per_full_token from the log directly | | DeepGEMM JIT memory | The analyzer categorizes DeepGEMM JIT compilation memory as "other" because it is not explicitly reported in the log | Compare with nvidia-smi total for accurate accounting | | PP (Pipeline Parallelism) | Memory decomposition is per-rank; PP configurations may have uneven memory across stages | Specify --target-rank for each PP stage | | MoE expert buffer | Some frameworks allocate additional buffers for expert routing that are not separately reported | Included in "model weights" or "other" depending on when allocated |

References

references/log-patterns.md: log line patterns and their semantics for memory analysis.
references/gpu-specs.json: GPU HBM specifications for h20, h100, h200, and b200 aliases.
scripts/capacity_analyzer.py: the core analysis script.

BBuf/llm-serving-capacity-planner

skills/llm-serving-capacity-planner/SKILL.md

Parse SGLang/vLLM startup logs to explain GPU memory use and request capacity. Use for KV cache budget, mem-fraction-static comparisons, OOM triage, and max-concurrency estimates.

313 stars

testing

Updated May 21, 2026

$ install --global

skillsauth

npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS llm-serving-capacity-planner

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: May 21, 2026, 2:21 AM187.1s4 files scanned

SKILL.md

name:: llm-serving-capacity-planner
description:: Parse SGLang/vLLM startup logs to explain GPU memory use and request capacity. Use for KV cache budget, mem-fraction-static comparisons, OOM triage, and max-concurrency estimates.

LLM Serving Capacity Planner

Overview

Confirmation Required

Before running analysis, collect or verify these inputs:

Workflow

Step 1: Collect the serving log

The user should provide the startup log from an SGLang or vLLM serving instance. Key log lines that the analyzer needs:

Load weight begin. avail mem=XX GB
Memory profiling: available_gpu_memory=XX GB, ... (newer sglang)
SW KV memory calculation: bytes_per_full_token=XX, available_bytes=XX GB, full_token=XX (SWA models like DeepSeek-V4)
Memory pool end. avail mem=XX GB
Capture cuda graph end. ... mem usage=XX GB. avail mem=XX GB.
max_total_num_tokens=XX, ... max_running_requests=XX, ... available_gpu_mem=XX GB
server_args=ServerArgs(...) (for serving parameters)

If the log is from a running instance, capture it by redirecting stdout/stderr to a file at launch time.

Step 2: Optionally capture nvidia-smi data

For per-rank memory comparison:

docker exec <container> nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv,noheader > smi.txt

Step 3: Run the analyzer

python3 skills/llm-serving-capacity-planner/scripts/capacity_analyzer.py \
  --log-file /path/to/sglang.log \
  --nvidia-smi-file /path/to/smi.txt \
  --gpu h200 \
  --config-json /path/to/config.json

For JSON output (automation):

python3 skills/llm-serving-capacity-planner/scripts/capacity_analyzer.py \
  --log-file /path/to/sglang.log \
  --format json

Step 4: Review and interpret results

The analyzer prints:

Memory breakdown table: each category (weights, KV pool, CUDA graph, framework, other) with GiB, MiB, percentage, and derivation
Per-rank comparison: nvidia-smi data across all TP ranks
KV pool detail: pool configuration, KV dtype, replication factor, per-token byte calculation
Concurrency estimate: max concurrent requests for different token lengths
Tuning notes: configuration changes that may increase capacity

When To Use It

After launching an LLM serving instance, to understand how GPU memory is distributed
When comparing different --mem-fraction-static values and their impact on KV pool capacity
When planning deployment capacity: how many concurrent requests can a given GPU configuration support
When investigating OOM issues: identifying which memory category is consuming the most
When evaluating whether fp8 KV cache or EP can improve concurrency

Key Concepts

mem-fraction-static

0.88 (default): aggressive — 88% of post-weight memory goes to KV pool
0.60: conservative — more free memory left for runtime, but significantly less KV capacity

KV Head Replication

SWA (Sliding Window Attention) Compression

Reporting Checklist

Include:

Serving configuration: model, GPU, TP/PP/EP, mem-fraction-static, kv-cache-dtype
Memory breakdown table: category / GiB / MiB / percentage / derivation source
Per-rank nvidia-smi comparison: used and free memory per TP rank
KV pool detail: pool size, bytes_per_full_token, KV dtype, replication factor, theoretical per-token KV calculation (when config.json provided)
Concurrency estimate table: request token length / token-limit / request-limit / max concurrent
Tuning notes based on free memory and configuration

Known Limitations

References

references/log-patterns.md: log line patterns and their semantics for memory analysis.
references/gpu-specs.json: GPU HBM specifications for h20, h100, h200, and b200 aliases.
scripts/capacity_analyzer.py: the core analysis script.

Related Skills

BBuf/sglang-humanize-review

development

VerifiedTrustedCommunity

Perform SGLang code review in the style of human maintainers by consulting the full non-agent PR review episode corpus from project start through the latest refresh (June 2026), including inline review threads, top-level PR comments, review submissions, original multilingual text, and multi-round discussions. Use when reviewing SGLang PRs, diffs, patches, or local changes for correctness, tests, performance, GPU/runtime risks, API compatibility, and maintainability.

531SKILL.mdUpdated May 21, 2026

BBuf/sglang-humanize-review

BBuf/model-pr-history-knowledge

documentation

VerifiedTrustedCommunity

Use when an SGLang, vLLM, or TensorRT-LLM serving/model optimization task needs prior model-family PR evidence. Query and read the PR-driven history docs under model-pr-optimization-history before choosing source paths, fast paths, kernel/fusion ideas, regression risks, or validation lanes.

531SKILL.mdUpdated May 17, 2026

BBuf/model-pr-history-knowledge

BBuf/vllm-sota-humanize-loop

development

VerifiedTrustedCommunity

Run an autonomous Humanize-governed vLLM SOTA performance loop for one LLM model: first perform the fixed fair vLLM/SGLang/TensorRT-LLM deployment search and benchmark, then start one RLCR loop that repeatedly decides the gap, profiles the current bottleneck, runs layer/kernel pipeline analysis, patches vLLM code, optionally uses ncu-report-skill for kernel evidence, and revalidates until vLLM matches or beats the best observed framework under the same workload and SLA.

423SKILL.mdUpdated May 27, 2026

BBuf/vllm-sota-humanize-loop

BBuf/llm-pipeline-analysis

devops

VerifiedTrustedCommunity

Inspect LLM torch profiler traces at forward-pass, layer, and kernel level. Use when you need layer timings, anchor-kernel boundaries, representative kernel flows, or Perfetto time ranges.

423SKILL.mdUpdated May 21, 2026

BBuf/llm-pipeline-analysis

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/BBuf/AI-Infra-Auto-Driven-SKILLS.git

# Copy into Claude Code skills folder (global)
cp -r AI-Infra-Auto-Driven-SKILLS/skills/llm-serving-capacity-planner ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

BBuf/AI-Infra-Auto-Driven-SKILLS

313 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT