skills/sglang-prod-incident-triage/SKILL.md
Replay-first debug flow for SGLang serving problems. Use when a live or recent server shows health-check failures, latency or throughput regressions, queue growth, timeouts, distributed stalls, crash dumps, wrong outputs after deploys, or PD/EP/HiCache issues, and the job is to turn the problem into a replay plus the right next debug tool.
Use this skill to turn a live serving problem into a debug path you can replay.
Use one loop:
Do not start with profiling.
This skill should work with more focused skills instead of re-implementing them:
- debug-cuda-crash when replay plus coredump points to a CUDA crash path
- debug-distributed-hang when the problem is clearly a TP/PP/DP/EP hang
- llm-torch-profiler-analysis when the issue is already narrowed to a compute-side path

Three examples are included:
Return:
- /health or /health_generate is unhealthy

If a live server is reachable, collect a read-only bundle before anything more intrusive:
python3 scripts/incident_artifact_tool.py collect-bundle \
--base-url http://127.0.0.1:30000 \
--outdir /tmp/incident_bundle
python3 scripts/incident_artifact_tool.py summarize-bundle \
/tmp/incident_bundle
If the server is protected:
python3 scripts/incident_artifact_tool.py collect-bundle \
--base-url http://127.0.0.1:30000 \
--token "$SGLANG_BEARER_TOKEN" \
--outdir /tmp/incident_bundle
The bundle script collects:
- /health
- /health_generate
- /model_info
- /server_info
- /v1/loads?include=all
- /v1/loads?include=core,queues,disagg,spec
- /metrics
- /hicache/storage-backend on a best-effort basis

Use the summary for a quick read on:
If the summary says the bundle was captured while the server was idle, recollect it during traffic or move quickly to dump plus replay.
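To make the read-only nature of the bundle concrete, here is a minimal sketch of polling the endpoints listed above. It is not the implementation of scripts/incident_artifact_tool.py: the function names are hypothetical, and it assumes every endpoint answers a plain GET and that the token is sent as an `Authorization: Bearer` header.

```python
import urllib.error
import urllib.request

# Endpoint paths copied from the bundle list above.
ENDPOINTS = [
    "/health",
    "/health_generate",
    "/model_info",
    "/server_info",
    "/v1/loads?include=all",
    "/v1/loads?include=core,queues,disagg,spec",
    "/metrics",
    "/hicache/storage-backend",
]


def build_requests(base_url, token=None):
    """Return one GET request per endpoint (hypothetical helper).

    Assumption: a protected server accepts the token as a bearer header.
    """
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    base = base_url.rstrip("/")
    return [
        urllib.request.Request(base + path, headers=headers, method="GET")
        for path in ENDPOINTS
    ]


def collect(base_url, token=None, timeout=5.0):
    """Fetch every endpoint on a best-effort basis.

    A failing endpoint is recorded as an error entry instead of aborting
    the whole collection.
    """
    results = {}
    for req in build_requests(base_url, token):
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                body = resp.read().decode("utf-8", "replace")
                results[req.full_url] = {"status": resp.status, "body": body}
        except (urllib.error.URLError, OSError) as exc:
            results[req.full_url] = {"error": str(exc)}
    return results
```

Calling `collect("http://127.0.0.1:30000")` during traffic returns one entry per endpoint. Entries that contain only an `error` key are themselves a useful signal about which surface is unhealthy.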
If no live server is reachable, start from the best dump or log already available:
Read references/decision-tree.md only if the problem class is still unclear:
Then preserve the request payload that actually triggers the problem:
- --crash-dump-folder

Do not jump straight from a live symptom to low-level debugging without first saving something you can replay.
Read references/endpoints-and-signals.md when you need help reading the baseline bundle or the replay target.
Read references/replay-trace-profile.md when you need the replay, trace, profile, or bisect paths.
Standard order:
Use replay when:
If a crash dump exists, summarize it first:
python3 scripts/incident_artifact_tool.py summarize-dump \
--input-file /path/to/crash_dump.pkl
Then replay:
python3 /path/to/sglang/scripts/playground/replay_request_dump.py \
--input-file /path/to/crash_dump.pkl \
--host 127.0.0.1 \
--port 30000 \
--parallel 128
If safe_pickle_load blocks a locally captured trusted dump, use:
python3 scripts/replay_trusted_request_dump.py \
--input-file /path/to/request_dump.pkl \
--host 127.0.0.1 \
--port 30000 \
--parallel 1
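As background on why a safe loader can reject a dump, the sketch below shows the general restricted-unpickler pattern from the Python standard library. It is an illustration only: the allow-list is hypothetical, and SGLang's actual safe_pickle_load may apply different rules.

```python
import io
import pickle

# Hypothetical allow-list of (module, name) globals. This is NOT SGLang's list.
ALLOWED_GLOBALS = {
    ("builtins", "dict"),
    ("builtins", "list"),
    ("collections", "OrderedDict"),
}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses any global outside the allow-list."""

    def find_class(self, module, name):
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")


def restricted_loads(data: bytes):
    """Load pickle bytes, raising UnpicklingError on disallowed globals."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain containers of strings and numbers load without touching `find_class`, while a dump that references any other class or function is rejected. Bypassing such a check is only reasonable for a dump you captured yourself on a trusted host.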
If replay indicates a CUDA crash path, restart the same build with coredumps enabled before reproducing again:
SGLANG_CUDA_COREDUMP=1 \
SGLANG_CUDA_COREDUMP_DIR=/tmp/sglang_cuda_coredumps \
python -m sglang.launch_server \
--model-path ... \
--crash-dump-folder /tmp/sglang_crash_dump \
...
Then inspect the generated coredump:
cuda-gdb "$(which python3)" \
-ex "target cudacore /tmp/sglang_cuda_coredumps/cuda_coredump_<host>.<pid>.<ts>"
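When several reproductions have written coredumps, a small helper can pick the newest one and build the cuda-gdb invocation shown above. This is a sketch under assumptions: the helper names are hypothetical, and it assumes coredump files start with `cuda_coredump_`, as in the path template above.

```python
from pathlib import Path


def newest_coredump(coredump_dir):
    """Return the most recently modified cuda_coredump_* file, or None."""
    candidates = [
        p for p in Path(coredump_dir).glob("cuda_coredump_*") if p.is_file()
    ]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def cuda_gdb_command(python_path, coredump_path):
    """Build the argv list for opening a coredump in cuda-gdb."""
    return ["cuda-gdb", str(python_path), "-ex", f"target cudacore {coredump_path}"]
```

The resulting list can be passed to `subprocess.run` once `newest_coredump("/tmp/sglang_cuda_coredumps")` returns a path.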
For a replay-first crash example, read references/case-studies.md.
Use tracing when:
If tracing was enabled at startup, you can change the level without restart:
curl "http://127.0.0.1:30000/set_trace_level?level=1"
curl "http://127.0.0.1:30000/set_trace_level?level=2"
Use profiling when:
At that point, switch to llm-torch-profiler-analysis. Do not duplicate
its profiling workflow here.
For a low-noise latency example, read references/case-studies.md.
If this looks like a collective stall, save the failing request, replay it on a
clean target, collect the replay-time bundle and stacks, then switch to
debug-distributed-hang.
For an example of that flow, read references/case-studies.md.
If one commit is known-good and another is known-bad, build a deterministic harness before doing deeper manual debugging:
- 0 on good behavior and non-zero on bad behavior
- git bisect start <bad> <good>
- git bisect run <harness>

Prefer replay-backed bisect when the regression depends on request shape or long-running serving state.
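The exit-code contract for the harness can be sketched as below. The function and its inputs are hypothetical, standing in for whatever build, replay, and measurement steps the harness runs. The exit codes follow `git bisect run`: 0 marks a commit good, 125 skips a commit that cannot be tested, and other codes from 1 to 127 mark it bad.

```python
GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def bisect_exit_code(build_ok, replay_ok, p99_latency_s, max_p99_latency_s):
    """Map one harness run to a `git bisect run` exit code.

    All four inputs are hypothetical measurements produced by the harness:
    - build_ok: the commit built and the server started
    - replay_ok: the replayed requests completed without errors
    - p99_latency_s / max_p99_latency_s: observed latency and its budget
    """
    if not build_ok:
        return SKIP  # untestable commit; let bisect pick a neighbour
    if not replay_ok:
        return BAD
    return GOOD if p99_latency_s <= max_p99_latency_s else BAD
```

A harness script would end with `sys.exit(bisect_exit_code(...))`. Returning 125 for commits that fail to build keeps an unrelated build break from being blamed for the regression.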
Switch tools once the fault class is clear:
- llm-torch-profiler-analysis for kernel and overlap attribution
- debug-distributed-hang for collective or rank-divergence hangs
- debug-cuda-crash for CUDA crash reproduction and kernel API logging

Do not switch tools before collecting the first bundle unless the user already has decisive logs or dumps.
Load only what the current step needs:
- safe_pickle_load blocks stock replay

If a live bundle was collected, include its path.
If replay, trace, or profiling was chosen, say why bundle plus dump were not enough.