skills/h100-sglang-diffusion/SKILL.md
SSH into host `h100_sglang`, enter Docker container `sglang_bbuf`, work in `/data/bbuf/repos/sglang`, and use the ready H100 remote environment for SGLang **diffusion** development and validation. Use when a task needs diffusion model smoke tests, Triton/CUDA kernel validation, torch.compile diffusion checks, or a safe remote copy for diffusion-specific SGLang changes.
Use this skill to do SGLang diffusion development on the H100 box through h100_sglang.
The default container is sglang_bbuf and the repo lives at /data/bbuf/repos/sglang.
Prefer this skill when:
- diffusion model smoke tests (DiffGenerator, flux, etc.)
- torch.compile diffusion performance
- python[diffusion] editable install changes

This environment is already prepared:

- sglang_bbuf is running on lmsysorg/sglang:dev
- the repo lives at /data/bbuf/repos/sglang
- editable installs of python[all] and python[diffusion] are already done
- /data/.cache is mounted to /root/.cache
- /sys/class/infiniband, /dev/infiniband, and /usr/sbin/show_gids

ssh h100_sglang 'hostname && whoami'
ssh h100_sglang 'docker ps --format "table {{.Names}}\t{{.Status}}" | sed -n "1,20p"'
ssh h100_sglang 'nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits'
ssh -t h100_sglang 'docker exec -it sglang_bbuf /bin/zsh'
cd /data/bbuf/repos/sglang
echo ${HF_TOKEN:+set}
If HF_TOKEN is missing, export it before any Hub-backed diffusion run:
export HF_TOKEN=<your-hf-token>
export HUGGINGFACE_HUB_TOKEN="$HF_TOKEN"
For non-interactive docker exec ... zsh -lc "<cmd>" runs, pass both variables
inline instead of relying on shell startup files:
ssh h100_sglang 'docker exec sglang_bbuf env HF_TOKEN=<your-hf-token> HUGGINGFACE_HUB_TOKEN=<your-hf-token> zsh -lc "..."'
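The nested quoting in these one-liners is easy to get wrong when a command is assembled by a script. The following is a minimal sketch, not part of the skill itself, of building the same `ssh ... docker exec ... env ... zsh -lc` invocation with `shlex`. The helper name `remote_exec` and its parameters are hypothetical; the host and container names are taken from this document.

```python
import shlex


def remote_exec(cmd, env=None, workdir=None,
                host="h100_sglang", container="sglang_bbuf"):
    """Return an argv list for subprocess.run (hypothetical helper).

    The remote side receives a single shell string, so the docker
    command is quoted once with shlex.join instead of by hand.
    """
    inner = f"cd {shlex.quote(workdir)} && {cmd}" if workdir else cmd
    docker = ["docker", "exec", container, "env"]
    # Inline KEY=VALUE pairs, mirroring the `env HF_TOKEN=...` pattern above.
    docker += [f"{key}={value}" for key, value in (env or {}).items()]
    docker += ["zsh", "-lc", inner]
    return ["ssh", host, shlex.join(docker)]


if __name__ == "__main__":
    argv = remote_exec(
        "git status --short",
        env={"CUDA_VISIBLE_DEVICES": "0"},
        workdir="/data/bbuf/repos/sglang",
    )
    print(argv)
```

The returned list can be passed to `subprocess.run(argv, check=True)`. Tokens passed this way are still visible in remote process listings, exactly as with the hand-written command.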
Use a GPU with 0 utilization and only a few MiB allocated.
Always set CUDA_VISIBLE_DEVICES=<gpu_id> for diffusion validation commands.
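The GPU choice can be scripted from the `nvidia-smi` query shown earlier. This is a minimal sketch that assumes the five-column `csv,noheader,nounits` layout of that query (index, name, utilization, memory used, memory total). The function name `pick_idle_gpu` and the 100 MiB cutoff for "a few MiB" are assumptions, and the sample text is made up for illustration.

```python
from typing import Optional


def pick_idle_gpu(csv_text: str, max_util: int = 0,
                  max_mem_mib: int = 100) -> Optional[int]:
    """Return the index of the first idle GPU, or None if none qualifies."""
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 5:
            continue  # not a row from the expected query
        index, _name, util, mem_used, _mem_total = fields
        try:
            if int(util) <= max_util and int(mem_used) <= max_mem_mib:
                return int(index)
        except ValueError:
            continue  # e.g. a field reported as "[N/A]"
    return None


# Made-up sample in the shape of the query output, not a real reading.
SAMPLE = """\
0, NVIDIA H100 80GB HBM3, 97, 61234, 81559
1, NVIDIA H100 80GB HBM3, 0, 4, 81559
"""

if __name__ == "__main__":
    print(pick_idle_gpu(SAMPLE))
```

The returned index is what would go into `CUDA_VISIBLE_DEVICES=<gpu_id>`; a `None` result means no GPU met the thresholds and the run should wait.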
ssh h100_sglang 'docker start sglang_bbuf'
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /data/bbuf/repos/sglang && git branch --show-current && git status --short"'
Update main before creating a validation worktree.
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /data/bbuf/repos/sglang && git fetch origin && git checkout main && git pull --ff-only origin main"'
Never write directly into /data/bbuf/repos/sglang when it is dirty.
Use one of these isolation strategies.
Create a detached worktree for remote-only experiments:
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /data/bbuf/repos/sglang && git worktree add --detach /tmp/sglang_validate_h100 HEAD"'
Stream the local working tree into the container (validates exactly what is local right now):
COPYFILE_DISABLE=1 tar --exclude=.git -cf - . | \
ssh h100_sglang 'docker exec -i sglang_bbuf sh -lc "rm -rf /tmp/sglang_local_validate && mkdir -p /tmp/sglang_local_validate && tar -xf - -C /tmp/sglang_local_validate"'
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "find /tmp/sglang_local_validate -name '\''._*'\'' -delete"'
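As an alternative to the `tar` CLI plus the follow-up `find ... -delete`, the archive can be built with Python's `tarfile` so that `.git` and AppleDouble `._*` files are dropped at pack time. This is a sketch of a swapped-in technique, not the skill's own method; `pack_tree` and `_keep` are hypothetical names.

```python
import io
import tarfile
from pathlib import PurePosixPath


def _keep(info: tarfile.TarInfo):
    """tarfile filter: returning None drops the member (and its subtree)."""
    parts = PurePosixPath(info.name).parts
    if ".git" in parts:
        return None  # same intent as `tar --exclude=.git`
    if parts and parts[-1].startswith("._"):
        return None  # macOS AppleDouble sidecar files
    return info


def pack_tree(root) -> bytes:
    """Pack the directory `root` into an uncompressed in-memory tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        tar.add(str(root), arcname=".", filter=_keep)
    return buf.getvalue()
```

The resulting bytes could be fed to the same remote `tar -xf - -C /tmp/sglang_local_validate` command, for example through `subprocess.run(..., input=data)`.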
For patch-oriented validation:
- create the worktree from main
- git apply only the focused local diff into the worktree

This keeps /data/bbuf/repos/sglang clean while still validating the exact local delta.
Always start here before running any GPU kernel or model test.
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /tmp/sglang_local_validate && python -m compileall python/sglang/jit_kernel/diffusion/triton python/sglang/multimodal_gen/runtime/layers"'
For broader coverage:
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /tmp/sglang_local_validate && python -m compileall python/sglang"'
Run a targeted smoke script covering the changed primitives before any model-level test.
Cover at least these when relevant:
- rms_norm_fn
- RMSNorm under torch.compile
- norm_infer
- apply_rotary_embedding

Pipe the smoke script through docker exec -i:
ssh h100_sglang 'docker exec -i sglang_bbuf env CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/tmp/sglang_local_validate/python python' < /path/to/local_smoke.py
Run this after any change to jit_kernel/diffusion/triton:
ssh h100_sglang 'docker exec sglang_bbuf env CUDA_VISIBLE_DEVICES=0 PYTHONPATH=python zsh -lc "cd /tmp/sglang_local_validate && pytest -q python/sglang/jit_kernel/tests/diffusion/test_qwen_image_modulation.py -q"'
ssh h100_sglang 'docker exec sglang_bbuf env CUDA_VISIBLE_DEVICES=0 PYTHONPATH=python zsh -lc "cd /tmp/sglang_local_validate && pytest -q path/to/diffusion_test.py -q"'
Model-level smoke (DiffGenerator): only after steps 1–4 pass.
Use a real .py file with an if __name__ == "__main__": guard. The multiprocessing
spawn start method fails if the entry point is stdin or unguarded top-level code.
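A guarded entry point can be sketched as follows. This is only a skeleton under stated assumptions: `run_smoke` is a hypothetical placeholder for the real model call (the exact DiffGenerator invocation is not shown in this document, so it is left out), and `SMOKE_PROMPT` is an invented environment variable.

```python
import os
import sys


def run_smoke(prompt: str) -> dict:
    """Placeholder for the real model call (hypothetical).

    A real smoke script would construct the generator and run a single
    generation here, then report whether an output was produced.
    """
    return {"prompt": prompt, "ok": True}


def main() -> int:
    # SMOKE_PROMPT is an assumed variable name, not something sglang reads.
    prompt = os.environ.get("SMOKE_PROMPT", "a photo of a cat")
    result = run_smoke(prompt)
    print(f"smoke ok={result['ok']} prompt={result['prompt']!r}")
    return 0 if result["ok"] else 1


if __name__ == "__main__":
    # Processes started with the spawn method re-import this file under a
    # different module name, so the guard stops them from re-running main().
    exit_code = main()
    if exit_code:
        sys.exit(exit_code)
```

Keeping all work inside `main()` means that importing the file has no side effects, which is what spawned workers rely on.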
# copy the script to the host, then into the container (scp alone only reaches the host filesystem)
scp /path/to/local_smoke_model.py h100_sglang:/tmp/smoke_model.py
ssh h100_sglang 'docker cp /tmp/smoke_model.py sglang_bbuf:/tmp/smoke_model.py'
ssh h100_sglang 'docker exec sglang_bbuf env CUDA_VISIBLE_DEVICES=0 HF_TOKEN=<your-hf-token> HUGGINGFACE_HUB_TOKEN=<your-hf-token> PYTHONPATH=/tmp/sglang_local_validate/python zsh -lc "python /tmp/smoke_model.py"'
Treat checkpoint, dependency, and environment failures separately from code regressions.
Only attempt after model-level smoke passes.
ssh h100_sglang 'docker exec sglang_bbuf env CUDA_VISIBLE_DEVICES=0 PYTHONPATH=python zsh -lc "cd /tmp/sglang_local_validate && python -m sglang.launch_server --model-path <model> --port 30000 &"'
When a benchmark compares eager vs torch.compile, do not stop at the speedup number.
Capture matching eager and compile perf dumps or profile dirs. Compare structured
perf dumps with the in-repo comparator:
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /tmp/sglang_local_validate && python python/sglang/multimodal_gen/benchmarks/compare_perf.py eager.json compile.json"'
For trace-level attribution, use llm-torch-profiler-analysis on the matching
profile dirs and explain whether the gain came from fewer launches, fewer copies,
or fused kernels replacing eager ATen ops.
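To illustrate why the headline speedup is not enough, here is a minimal per-stage comparison sketch. It assumes each perf dump has already been reduced to a flat mapping of stage name to milliseconds. That schema and the stage names are hypothetical and are not the format used by compare_perf.py.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, Optional[float], Optional[float], Optional[float]]


def compare_stages(eager: Dict[str, float],
                   compiled: Dict[str, float]) -> List[Row]:
    """Return (stage, eager_ms, compiled_ms, speedup) rows sorted by stage.

    Speedup is eager/compiled; it is None when a stage is missing on
    either side or a timing is zero, which is itself worth investigating.
    """
    rows: List[Row] = []
    for stage in sorted(set(eager) | set(compiled)):
        e = eager.get(stage)
        c = compiled.get(stage)
        speedup = e / c if e and c else None
        rows.append((stage, e, c, speedup))
    return rows


if __name__ == "__main__":
    # Made-up numbers, only to show the shape of the result.
    eager = {"denoise": 100.0, "vae_decode": 20.0}
    compiled = {"denoise": 50.0, "vae_decode": 20.0}
    for row in compare_stages(eager, compiled):
        print(row)
```

A per-stage table like this shows which stage carries the gain, which is the starting point for the trace-level question of fewer launches, fewer copies, or fused kernels.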
ssh h100_sglang 'docker exec sglang_bbuf rm -rf /tmp/sglang_local_validate /tmp/sglang_validate_h100 /tmp/smoke_model.py'
ssh h100_sglang 'docker exec sglang_bbuf git -C /data/bbuf/repos/sglang worktree prune'