BBuf/sglang-torch-profiler-analysis

Name: sglang-torch-profiler-analysis
Author: BBuf

skills/sglang-torch-profiler-analysis/SKILL.md

npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS sglang-torch-profiler-analysis

SGLang Torch Profiler Analysis

Overview

Use this skill for SGLang torch.profiler analysis.

There is only one public workflow:

triage

Use the unified entrypoint:

scripts/analyze_sglang_torch_profile.py

triage always prints the same three tables:

kernel table
overlap-opportunity table
fuse-pattern table

By default, all three tables only render rows at or above 1.0% cumulative GPU-time share. Treat anything below that as noise unless the user explicitly asks for a lower cutoff.
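The default cutoff can be pictured with a minimal sketch. It assumes a row's share is its GPU time divided by the total GPU time in the trace; `filter_rows` and the sample kernel names are hypothetical and are not taken from the script.

```python
def filter_rows(gpu_time_us, min_share_pct=1.0):
    """Return (name, share_pct) rows at or above the cutoff, largest first."""
    total = sum(gpu_time_us.values())
    if total <= 0:
        return []
    rows = [(name, 100.0 * t / total) for name, t in gpu_time_us.items()]
    kept = [row for row in rows if row[1] >= min_share_pct]
    return sorted(kept, key=lambda row: row[1], reverse=True)


# Hypothetical per-kernel GPU times in microseconds (not real trace data).
sample = {"attention_kernel": 600.0, "gemm_kernel": 395.0, "tiny_copy": 5.0}
for name, share in filter_rows(sample):
    print(f"{name}: {share:.1f}%")
```

Passing a smaller `min_share_pct` corresponds to the case where the user explicitly asks for a lower cutoff.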

The script-level fuse-pattern table should stay source-backed and deterministic. Do not build a fuzzy string-matching engine into the script for typo-tolerance.

If exact/source-backed matching is weak but the agent judges that a cluster of kernels still looks semantically close to a known pattern, add a short AI note after the table with one of these labels:

high: very likely the same pattern family; naming drift or minor implementation reshaping is the main uncertainty
medium: several signals line up, but one important piece is still ambiguous
low: weak resemblance only; mention it only if it is still worth a human follow-up
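One way to keep the note restricted to these three labels is a small formatter like the sketch below. The function name, the note wording, and the example pattern are all hypothetical; the skill only fixes the label set, not the note's text.

```python
ALLOWED_LABELS = ("high", "medium", "low")


def format_similarity_note(label, pattern, reason):
    """Build a one-line note; reject any label outside high/medium/low."""
    if label not in ALLOWED_LABELS:
        raise ValueError(f"label must be one of {ALLOWED_LABELS}, got {label!r}")
    return f"AI similarity judgment ({label}): resembles {pattern}; {reason}"


# Hypothetical example; the pattern name is illustrative only.
print(format_similarity_note(
    "medium",
    "a known fused-norm pattern",
    "producer-consumer chain matches, but one kernel name is ambiguous",
))
```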

When To Use It

inspect an SGLang torch profiler trace or profile directory
profile a live SGLang server and immediately analyze the output
summarize which kernel families dominate prefill or decode
map kernels back to Python code paths
judge whether a code path still has overlap headroom
check whether an already-known fusion or overlap path should have applied

Diffusion Backend Gate

For diffusion benchmark or profiling work, only analyze traces produced by the native SGLang diffusion backend.

If the run that generated the trace logs any of:

Falling back to diffusers backend
Using diffusers backend
Loaded diffusers pipeline

stop the workflow instead of analyzing the trace. Treat it as a backend-selection issue, not as valid SGLang diffusion profiler evidence.
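The gate can be sketched as a plain substring check over the run log. The three marker strings come from the list above; the function names and the assumption that the log is available as a single string are illustrative.

```python
FALLBACK_MARKERS = (
    "Falling back to diffusers backend",
    "Using diffusers backend",
    "Loaded diffusers pipeline",
)


def find_fallback_markers(log_text):
    """Return the fallback markers that appear anywhere in the run log."""
    return [marker for marker in FALLBACK_MARKERS if marker in log_text]


def diffusion_trace_is_valid(log_text):
    """True only when none of the fallback markers were logged."""
    return not find_fallback_markers(log_text)


# Hypothetical log excerpt for illustration.
log = "startup ok\nFalling back to diffusers backend\nwarmup done\n"
if not diffusion_trace_is_valid(log):
    print("Stop: backend-selection issue:", find_fallback_markers(log))
```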

Main Flows

1. Single-trace triage from an existing profile dir or trace

python3 scripts/analyze_sglang_torch_profile.py \
  --input /path/to/profile_dir_or_trace.json.gz

Use this when you want the fastest read on kernel share and likely fused-kernel pattern matches. The overlap table stays conservative in single-trace mode and will tell you when a mapping/formal pair is needed.

2. Single-trace triage from a running server

python3 scripts/analyze_sglang_torch_profile.py \
  --url http://127.0.0.1:30000 \
  --num-steps 5 \
  --profile-by-stage

3. Two-trace triage from existing profile dirs or traces

python3 scripts/analyze_sglang_torch_profile.py triage \
  --mapping-input /path/to/graph_off_profile_dir \
  --formal-input /path/to/graph_on_profile_dir

Use this when you need stronger overlap conclusions and cleaner kernel-to-source attribution.

4. Two-trace triage from running servers

python3 scripts/analyze_sglang_torch_profile.py triage \
  --mapping-url http://127.0.0.1:31025 \
  --formal-url http://127.0.0.1:31026 \
  --num-steps 5 \
  --profile-by-stage
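The four invocations above differ only in their arguments, which the following sketch makes explicit. It builds argument lists without running anything, and it uses only the flags shown above; the helper names are hypothetical, and the default of 5 steps simply mirrors the examples.

```python
import shlex

SCRIPT = "scripts/analyze_sglang_torch_profile.py"


def single_trace_command(source, live=False, num_steps=5):
    """Flows 1 and 2: one existing trace/profile dir, or one running server URL."""
    cmd = ["python3", SCRIPT]
    if live:
        cmd += ["--url", source, "--num-steps", str(num_steps), "--profile-by-stage"]
    else:
        cmd += ["--input", source]
    return cmd


def two_trace_command(mapping, formal, live=False, num_steps=5):
    """Flows 3 and 4: a mapping/formal pair of traces or of running servers."""
    cmd = ["python3", SCRIPT, "triage"]
    if live:
        cmd += ["--mapping-url", mapping, "--formal-url", formal,
                "--num-steps", str(num_steps), "--profile-by-stage"]
    else:
        cmd += ["--mapping-input", mapping, "--formal-input", formal]
    return cmd


print(shlex.join(single_trace_command("/path/to/profile_dir_or_trace.json.gz")))
print(shlex.join(two_trace_command("http://127.0.0.1:31025",
                                   "http://127.0.0.1:31026", live=True)))
```

The resulting list can be handed to `subprocess.run` from the skill's root directory once the target paths or servers exist.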

`profile_by_stage`

profile_by_stage is not only for PD disaggregation.

On ordinary non-PD serving, it is still useful because prefill and decode usually have very different bottlenecks.
On the current profile-v2 path inside SGLang, stage-based profiling is effectively the normal path.
PD-disaggregated serving adds one extra rule: prefill workers and decode workers must be profiled separately. That is stricter than ordinary profile_by_stage.

How To Choose The Triage Shape

Single-trace triage

Use when you want the lowest-friction report:

one trace is already available
you mainly want kernel share and fusion clues
you are comparing two runs side by side by running triage once per trace

This is the recommended default.

Two-trace triage

Use when you need:

a stronger answer about overlap headroom
graph-off source mapping plus graph-on final behavior
more trustworthy overlap recommendations in the middle table

Capture the two traces as follows:

mapping trace: captured with --disable-cuda-graph --disable-piecewise-cuda-graph
formal trace: captured with the real serving optimizations enabled

Do not call the mapping pass a "fast profile". It exists to recover kernel -> cpu_op -> python scope.
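The choice between the two shapes, and the flag difference between the mapping and formal passes, can be summarized in a short sketch. It assumes the mapping server otherwise reuses the formal server's launch arguments, which the text above does not state; the function names and the sample arguments are hypothetical.

```python
MAPPING_ONLY_FLAGS = ("--disable-cuda-graph", "--disable-piecewise-cuda-graph")


def choose_triage_shape(needs_strong_overlap_answer=False,
                        needs_source_mapping=False):
    """Default to single-trace; escalate only when the stronger answer is needed."""
    if needs_strong_overlap_answer or needs_source_mapping:
        return "two-trace"
    return "single-trace"


def mapping_server_args(formal_args):
    """Assumption: the mapping pass is the formal launch plus the graph-off flags."""
    args = list(formal_args)
    for flag in MAPPING_ONLY_FLAGS:
        if flag not in args:
            args.append(flag)
    return args


# Illustrative launch arguments only.
formal = ["--model-path", "some/model"]
print(choose_triage_shape(needs_strong_overlap_answer=True))
print(mapping_server_args(formal))
```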

Workflow

Single-trace workflow

If the user only wants a quick diagnosis, one trace is enough.
Prefer rank-local TP-0 traces over merged traces.
For a live server, this skill can call sglang.profiler and automatically send a small probe request.
Prefer --profile-by-stage even on standard serving unless the user explicitly wants an all-stage mixed trace.

Two-trace workflow

1. Produce a mapping trace first with CUDA graphs disabled.
2. Produce a formal trace second with CUDA graphs enabled and the real serving flags kept on.
3. Run triage for the compact three-table report.
4. Read the results in this order:
   - kernel table
   - overlap-opportunity table
   - fuse-pattern table
5. Before calling something a "new" optimization idea, compare the top rows against both references/fuse-overlap-catalog.md and references/overlap-catalog.md. Check mainline rows first, then the PR-backed / in-flight sections for still-moving upstream work. Prefer reporting:
   - an existing fused or overlap path that should already apply here
   - an existing path that appears disabled, unsupported, or regressed in this trace
   - an upstream pattern that is mainline elsewhere but missing locally, or still open upstream
   - a truly new opportunity only when no catalog entry fits
6. If no exact pattern fully matches but the trace still looks semantically close to a known family, add one flat AI similarity judgment note after the tables. Use high, medium, or low only. Base that note on the full pattern shape, not on one kernel name alone. Prefer semantic cues such as producer-consumer chain, source locations, CPU op names, TP context, and model-specific structure. Do not rewrite the script table itself to include these heuristic judgments.

References

Load these only when needed:

references/source-map.md
- upstream SGLang profiler entrypoints and trace-writing source paths
references/heuristics.md
- overlap labels, dependency-risk interpretation, and limits
references/fuse-overlap-catalog.md
- mixed source-backed catalog of existing fuse and overlap patterns, including mainline rows plus PR-backed / in-flight rows
references/overlap-catalog.md
- overlap-only lookup table across LLM, VLM, diffusion, disaggregation, HiSparse, and speculative scheduling

Output Contract

Return:

trace path or generated profile path
model/server args when available
kernel table
overlap-opportunity table
fuse-pattern table
optional AI similarity judgment note with high / medium / low when exact matching is inconclusive
one short conclusion about what dominates the run
whether the overlap conclusion came from single-trace triage or mapping/formal two-trace triage
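The contract above can be sketched as a small container that renders its fields in the listed order. The class, field names, and line labels are hypothetical; the tables are treated as already-rendered strings taken from the script output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TriageReport:
    trace_path: str
    kernel_table: str
    overlap_table: str
    fuse_pattern_table: str
    conclusion: str
    triage_shape: str  # "single-trace" or "two-trace"
    server_args: Optional[str] = None      # only when available
    similarity_note: Optional[str] = None  # only when exact matching is inconclusive

    def render(self):
        """Join the pieces in the order given by the output contract."""
        parts = [f"Trace: {self.trace_path}"]
        if self.server_args:
            parts.append(f"Server args: {self.server_args}")
        parts += [self.kernel_table, self.overlap_table, self.fuse_pattern_table]
        if self.similarity_note:
            parts.append(self.similarity_note)
        parts.append(f"Conclusion: {self.conclusion}")
        parts.append(f"Overlap conclusion source: {self.triage_shape} triage")
        return "\n\n".join(parts)


# Placeholder values for illustration only.
report = TriageReport(
    trace_path="/path/to/profile_dir",
    kernel_table="<kernel table>",
    overlap_table="<overlap-opportunity table>",
    fuse_pattern_table="<fuse-pattern table>",
    conclusion="<what dominates the run>",
    triage_shape="single-trace",
)
print(report.render())
```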

Compact SGLang torch-profiler triage skill. Use when Codex should inspect an existing `trace.json(.gz)` or profile directory, trigger `sglang.profiler` against a live server, and return one compact report with kernel, overlap-opportunity, and fuse-pattern tables. Single-trace triage is enough for quick diagnosis; mapping+formal two-trace triage gives stronger overlap conclusions.

83 stars

development

Updated Apr 23, 2026

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 23, 2026, 1:16 PM (8.7s, 9 files scanned)

Install manually

# Clone the repo
git clone https://github.com/BBuf/AI-Infra-Auto-Driven-SKILLS.git

# Copy into Claude Code skills folder (global)
cp -r AI-Infra-Auto-Driven-SKILLS/skills/sglang-torch-profiler-analysis ~/.claude/skills/

Repository: BBuf/AI-Infra-Auto-Driven-SKILLS (83 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT