skills/sglang-torch-profiler-analysis/SKILL.md
Compact SGLang torch-profiler triage skill. Use when Codex should inspect an existing `trace.json(.gz)` or profile directory, trigger `sglang.profiler` against a live server, and return one compact report with kernel, overlap-opportunity, and fuse-pattern tables. Single-trace triage is enough for quick diagnosis; mapping+formal two-trace triage gives stronger overlap conclusions.
npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS sglang-torch-profiler-analysis
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Use this skill for SGLang torch.profiler analysis.
There is only one public workflow:
- `triage`
Use the unified entrypoint:
`triage` always prints the same three tables: kernel, overlap-opportunity, and fuse-pattern.
By default, all three tables only render rows at or above 1.0% cumulative GPU-time share.
Treat anything below that as noise unless the user explicitly asks for a lower cutoff.
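The cutoff rule above can be sketched as a small filter. This is a minimal illustration, assuming per-kernel GPU time has already been summed into a dictionary; the function name, the kernel names, and the timings are hypothetical and are not taken from the skill's script.

```python
from typing import Dict, List, Tuple

DEFAULT_MIN_SHARE_PCT = 1.0  # the 1.0% default cutoff described above


def rows_above_cutoff(
    gpu_time_us: Dict[str, float],
    min_share_pct: float = DEFAULT_MIN_SHARE_PCT,
) -> List[Tuple[str, float, float]]:
    """Return (name, total_us, share_pct) rows at or above the cutoff, largest first."""
    total = sum(gpu_time_us.values())
    if total <= 0:
        return []
    rows = [(name, us, 100.0 * us / total) for name, us in gpu_time_us.items()]
    kept = [row for row in rows if row[2] >= min_share_pct]
    return sorted(kept, key=lambda row: row[1], reverse=True)


# Hypothetical kernel names and timings, for illustration only.
example = {"gemm_kernel": 900.0, "attention_kernel": 95.0, "tiny_copy_kernel": 5.0}
for name, us, pct in rows_above_cutoff(example):
    print(f"{name:20s} {us:8.1f} us {pct:5.1f}%")
```

Passing a smaller `min_share_pct` corresponds to the case where the user explicitly asks for a lower cutoff.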
The script-level fuse-pattern table should stay source-backed and deterministic. Do not build a fuzzy string-matching engine into the script for typo-tolerance.
If exact/source-backed matching is weak but the agent judges that a cluster of kernels still looks semantically close to a known pattern, add a short AI note after the table with one of these labels:
- `high`: very likely the same pattern family; naming drift or minor implementation reshaping is the main uncertainty
- `medium`: several signals line up, but one important piece is still ambiguous
- `low`: weak resemblance only; mention it only if it is still worth a human follow-up
For diffusion benchmark or profiling work, only analyze traces produced by the native SGLang diffusion backend.
If the run that generated the trace logs any of:
- `Falling back to diffusers backend`
- `Using diffusers backend`
- `Loaded diffusers pipeline`
stop the workflow instead of analyzing the trace. Treat it as a backend-selection issue, not as valid SGLang diffusion profiler evidence.
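One way to apply this guard before any trace analysis is a plain substring scan over the run log. The sketch below assumes the log is available as an iterable of text lines; the helper names are hypothetical, and only the three marker strings come from the rule above.

```python
from typing import Iterable, List

# The three log markers listed above.
DIFFUSERS_FALLBACK_MARKERS = (
    "Falling back to diffusers backend",
    "Using diffusers backend",
    "Loaded diffusers pipeline",
)


def find_fallback_lines(log_lines: Iterable[str]) -> List[str]:
    """Return every log line that contains one of the fallback markers."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if any(marker in line for marker in DIFFUSERS_FALLBACK_MARKERS)
    ]


def assert_native_backend(log_lines: Iterable[str]) -> None:
    """Raise instead of letting the trace be analyzed when a fallback was logged."""
    hits = find_fallback_lines(log_lines)
    if hits:
        raise RuntimeError(
            "backend-selection issue, trace is not valid SGLang diffusion "
            f"profiler evidence: {hits[0]!r}"
        )
```

Calling `assert_native_backend(open("server.log"))` (a hypothetical log path) before the analyzer keeps a diffusers-backed run from being reported as SGLang evidence.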
python3 scripts/analyze_sglang_torch_profile.py \
--input /path/to/profile_dir_or_trace.json.gz
Use this when you want the fastest read on kernel share and likely fused-kernel pattern matches. The overlap table stays conservative in single-trace mode and will tell you when a mapping/formal pair is needed.
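For orientation, the kernel-share read boils down to summing GPU kernel durations per kernel name. The sketch below is not the skill's script. It assumes the trace is a Chrome-trace JSON as exported by torch.profiler, where complete events carry `"ph": "X"`, GPU kernels carry `"cat": "kernel"`, and `dur` is in microseconds. The function names are hypothetical.

```python
import gzip
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def load_trace_events(path: str) -> List[dict]:
    """Load events from a trace.json or trace.json.gz file."""
    p = Path(path)
    opener = gzip.open if p.suffix == ".gz" else open
    with opener(p, "rt", encoding="utf-8") as f:
        data = json.load(f)
    # Chrome traces are either {"traceEvents": [...]} or a bare event list.
    return data.get("traceEvents", []) if isinstance(data, dict) else data


def kernel_time_us(events: List[dict]) -> Dict[str, float]:
    """Sum the duration of complete GPU kernel events per kernel name."""
    totals: Dict[str, float] = defaultdict(float)
    for ev in events:
        if ev.get("ph") == "X" and ev.get("cat") == "kernel":
            totals[ev["name"]] += float(ev.get("dur", 0.0))
    return dict(totals)
```

Dividing each total by the sum of all totals gives the GPU-time share that the kernel table is ranked by.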
python3 scripts/analyze_sglang_torch_profile.py \
--url http://127.0.0.1:30000 \
--num-steps 5 \
--profile-by-stage
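As a rough picture of what triggering a profile against a live server involves, the sketch below only builds the HTTP request and does not send it. The `/start_profile` path and the `num_steps`, `profile_by_stage`, and `output_dir` field names are assumptions about the server's profiling endpoint and should be checked against the installed SGLang version. The helper name is hypothetical.

```python
import json
import urllib.request
from typing import Optional


def build_start_profile_request(
    base_url: str,
    num_steps: int = 5,
    profile_by_stage: bool = True,
    output_dir: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request that asks the server to profile."""
    # Field names are assumed, not confirmed by this document.
    payload = {"num_steps": num_steps, "profile_by_stage": profile_by_stage}
    if output_dir is not None:
        payload["output_dir"] = output_dir
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/start_profile",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_start_profile_request("http://127.0.0.1:30000")
print(req.get_method(), req.full_url)
```

Sending it would be `urllib.request.urlopen(req)`. In practice the analyzer script and `sglang.profiler` handle this step, so the sketch is only meant to show what `--url`, `--num-steps`, and `--profile-by-stage` plausibly map to.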
python3 scripts/analyze_sglang_torch_profile.py triage \
--mapping-input /path/to/graph_off_profile_dir \
--formal-input /path/to/graph_on_profile_dir
Use this when you need stronger overlap conclusions and cleaner kernel-to-source attribution.
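The idea behind the two-trace mode can be sketched as a join: the mapping (graph-off) pass supplies a kernel-to-CPU-op lookup, and the formal (graph-on) pass supplies the timings that get attributed through it. This is a simplified illustration, assuming both inputs are already reduced to dictionaries. The function name, the `<unmapped>` bucket, and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def attribute_formal_kernels(
    formal_kernel_us: Dict[str, float],
    kernel_to_cpu_op: Dict[str, str],
) -> Dict[str, float]:
    """Roll formal-pass kernel time up to the CPU op recovered by the mapping pass."""
    per_op: Dict[str, float] = defaultdict(float)
    for kernel, us in formal_kernel_us.items():
        # Kernels the mapping pass never saw are kept visible, not dropped.
        per_op[kernel_to_cpu_op.get(kernel, "<unmapped>")] += us
    return dict(per_op)


# Hypothetical data: timings from the graph-on trace, names from the graph-off trace.
formal = {"gemm_kernel": 800.0, "rmsnorm_kernel": 120.0, "mystery_kernel": 30.0}
mapping = {"gemm_kernel": "aten::mm", "rmsnorm_kernel": "sgl_kernel::rmsnorm"}
print(attribute_formal_kernels(formal, mapping))
```

Keeping an explicit unmapped bucket makes it obvious when the mapping pass did not cover a kernel that matters in the formal trace.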
python3 scripts/analyze_sglang_torch_profile.py triage \
--mapping-url http://127.0.0.1:31025 \
--formal-url http://127.0.0.1:31026 \
--num-steps 5 \
--profile-by-stage
profile_by_stage
`profile_by_stage` is not only for PD disaggregation.
profile_by_stage.
Use when you want the lowest-friction report:
This is the recommended default.
Use when you need:
`--disable-cuda-graph --disable-piecewise-cuda-graph`
Do not call the mapping pass a "fast profile". It exists to recover kernel -> cpu_op -> python scope.
- TP-0 traces over merged traces.
- `sglang.profiler` and automatically send a small probe request.
- `--profile-by-stage` even on standard serving unless the user explicitly wants an all-stage mixed trace.
- `triage` for the compact three-table report.
- PR-backed / in-flight sections for still-moving upstream work. Prefer reporting:
AI similarity judgment note after the tables.
Use high, medium, or low only.
Base that note on the full pattern shape, not on one kernel name alone.
Prefer semantic cues such as producer-consumer chain, source locations, CPU op names, TP context, and model-specific structure.
Do not rewrite the script table itself to include these heuristic judgments.
Load these only when needed:
Return:
AI similarity judgment note with `high` / `medium` / `low` when exact matching is inconclusive