skills/h100/SKILL.md
SSH into host `h100_sglang`, enter Docker container `sglang_bbuf`, work in `/sgl-workspace/sglang`, and use the ready H100 remote environment for SGLang development and validation. Use when a task needs remote CUDA work, GPU-backed smoke tests, diffusion checks, or a safe remote copy instead of local-only execution.
Use this skill to do SGLang development on the H100 box through h100_sglang.
The default container is sglang_bbuf and the repo lives at /sgl-workspace/sglang.
Prefer it whenever local validation is insufficient for CUDA, Triton, diffusion pipelines, or other GPU-backed SGLang behavior.
This environment is already prepared:
- `sglang_bbuf` is running on `lmsysorg/sglang:dev`
- The repo is at `/sgl-workspace/sglang`
- The `python[all]` and `python[diffusion]` installs are already done
- `/root/.cache` is mounted as the cache path
- InfiniBand-related paths: `/sys/class/infiniband`, `/dev/infiniband`, and `/usr/sbin/show_gids`
- Hugging Face cache is already mounted, but do not assume `HF_TOKEN` is visible in every `docker exec` context. Interactive shells and non-interactive `docker exec ... bash -lc "<cmd>"` can behave differently. Always verify with `echo ${HF_TOKEN:+set}` before gated-model or Hub-backed runs.
ssh h100_sglang 'hostname && whoami'
ssh h100_sglang 'docker ps --format "table {{.Names}}\t{{.Status}}" | sed -n "1,20p"'
ssh h100_sglang 'nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits'
ssh -t h100_sglang 'docker exec -it sglang_bbuf /bin/zsh'
cd /sgl-workspace/sglang
echo ${HF_TOKEN:+set}
If HF_TOKEN is unexpectedly missing in the current shell, export it manually before Hub-backed workflows:
export HF_TOKEN=<your-hf-token>
export HUGGINGFACE_HUB_TOKEN="$HF_TOKEN"
For non-interactive docker exec ... bash -lc "<cmd>" runs, prefer exporting both
variables inside the command itself instead of assuming the shell startup path
will populate them.
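The token check above can be wrapped in a small fail-fast guard. This is only a minimal sketch: `require_hf_token` is a hypothetical helper name, not something shipped with SGLang or this skill, and it assumes a bash- or zsh-compatible shell inside the container.

```shell
# Hypothetical helper (an assumption, not part of the skill): fail fast when
# HF_TOKEN is absent, otherwise export it and mirror it into
# HUGGINGFACE_HUB_TOKEN so Hub-backed tools see both variables.
require_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set in this shell" >&2
    return 1
  fi
  export HF_TOKEN
  export HUGGINGFACE_HUB_TOKEN="$HF_TOKEN"
  echo "set"
}
```

Call it directly (not inside `$(...)`, which would confine the exports to a subshell) before any gated-model or Hub-backed command, for example `require_hf_token && pytest -q path/to/test.py`.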
Use a GPU with 0 utilization and only a few MiB allocated.
Set CUDA_VISIBLE_DEVICES=<gpu_id> for every GPU-backed validation command.
Do not rely on a kill-idle helper, and do not assume you can reclaim other users' idle allocations automatically.
If the free GPU list is tight, re-check nvidia-smi, choose another GPU, or coordinate before proceeding.
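The GPU selection rule above can be sketched as a filter over the `nvidia-smi` CSV output. `pick_idle_gpu` is a hypothetical helper, and the 100 MiB default is an assumed reading of "only a few MiB"; adjust it to local conventions.

```shell
# Hypothetical helper (an assumption, not part of the skill). Reads the output of
#   nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits
# on stdin and prints the index of the first GPU with 0% utilization and at most
# $1 MiB allocated (default 100, an assumed threshold). Returns 1 if none match.
pick_idle_gpu() {
  local max_used_mib="${1:-100}"
  awk -F', *' -v max="$max_used_mib" '
    # Only consider rows whose utilization and memory fields are plain numbers.
    $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $3 + 0 == 0 && $4 + 0 <= max {
      print $1; found = 1; exit
    }
    END { if (!found) exit 1 }
  '
}

# Example usage (commented out because it needs the remote host):
# gpu_id=$(ssh h100_sglang 'nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits' | pick_idle_gpu)
# echo "CUDA_VISIBLE_DEVICES=$gpu_id"
```

A non-zero exit status means no GPU met the criteria, which is the cue to re-check `nvidia-smi` or coordinate rather than proceed.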
ssh h100_sglang 'docker start sglang_bbuf'
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /sgl-workspace/sglang && git branch --show-current && git status --short"'
Update `/sgl-workspace/sglang` to the latest clean `main` before creating any validation worktree:
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /sgl-workspace/sglang && git fetch origin && git checkout main && git pull --ff-only origin main"'
Avoid writing directly into /sgl-workspace/sglang when it is dirty or when the local snapshot differs from the remote HEAD.
Prefer one of these isolation strategies.
Create a detached worktree for remote-only experiments:
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /sgl-workspace/sglang && git worktree add --detach /tmp/sglang_validate_h100 HEAD"'
Stream the exact local working tree into the container when validating the current local snapshot:
COPYFILE_DISABLE=1 tar --exclude=.git -cf - . | \
ssh h100_sglang 'docker exec -i sglang_bbuf sh -lc "rm -rf /tmp/sglang_local_validate && mkdir -p /tmp/sglang_local_validate && tar -xf - -C /tmp/sglang_local_validate"'
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "find /tmp/sglang_local_validate -name '\''._*'\'' -delete"'
Use the streamed copy when the goal is "validate exactly what is in the local repo right now". For patch-oriented remote validation, another good option is to apply the local patch on top of a clean `main` checkout outside `/sgl-workspace/sglang`. That keeps `/sgl-workspace/sglang` clean while still validating the exact local delta.
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /tmp/sglang_local_validate && python -m compileall python/sglang"'
For diffusion-specific edits, prefer a narrower first pass:
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /tmp/sglang_local_validate && python -m compileall python/sglang/jit_kernel/diffusion/triton python/sglang/multimodal_gen/runtime/layers"'
ssh h100_sglang 'docker exec sglang_bbuf env PYTHONPATH=python zsh -lc "cd /tmp/sglang_local_validate && pytest -q path/to/test.py -q"'
For diffusion changes, start with the fused modulation regression:
ssh h100_sglang 'docker exec sglang_bbuf env CUDA_VISIBLE_DEVICES=0 PYTHONPATH=python zsh -lc "cd /tmp/sglang_local_validate && pytest -q python/sglang/jit_kernel/tests/diffusion/test_qwen_image_modulation.py -q"'
ssh h100_sglang 'docker exec sglang_bbuf env CUDA_VISIBLE_DEVICES=0 PYTHONPATH=python zsh -lc "cd /tmp/sglang_local_validate && pytest -q path/to/gpu_test.py -q"'
Cover at least these when relevant:
- `rms_norm_fn`
- `RMSNorm` under `torch.compile`
- `norm_infer`
- `apply_rotary_embedding`

Pipe the script through `docker exec -i ... python` for pure kernel smoke.
Use a `.py` file with `if __name__ == "__main__":` when calling `DiffGenerator.from_pretrained(..., local_mode=True)` or any flow that relies on `multiprocessing.spawn`.
`multiprocessing.spawn` will fail if the script is executed from stdin or from unguarded top-level code.
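The `if __name__ == "__main__":` requirement can be illustrated with a standard-library-only script written to a real file. This is a sketch under stated assumptions: `write_spawn_smoke` and the default path `/tmp/spawn_smoke.py` are hypothetical, and the Python body deliberately uses plain `multiprocessing` instead of `DiffGenerator`, whose arguments depend on the model under test.

```shell
# Hypothetical helper (an assumption, not part of the skill): write a guarded
# smoke script to a real .py file and print the path that was written.
write_spawn_smoke() {
  local path="${1:-/tmp/spawn_smoke.py}"
  cat > "$path" <<'PY'
import multiprocessing as mp


def worker(x):
    # Defined at module level so spawned children can import it.
    return x * 2


if __name__ == "__main__":
    # Spawned children re-import this file; the guard keeps them from
    # re-running the pool setup below.
    mp.set_start_method("spawn", force=True)
    with mp.Pool(2) as pool:
        print(pool.map(worker, [1, 2, 3]))
PY
  echo "$path"
}
```

Run the generated file by path inside the container (for example `python /tmp/spawn_smoke.py`) rather than piping it through `docker exec -i ... python`, so the spawned workers have a real module to re-import.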
Treat checkpoint, dependency, and environment failures separately from code regressions.
If a workflow reads from Hugging Face Hub, verify HF_TOKEN first and re-export it
explicitly in the current shell or command when needed.
When a benchmark compares eager vs torch.compile, do not stop at the speedup number.
Capture matching eager and compile perf dumps or profile dirs. Compare structured
perf dumps with python python/sglang/multimodal_gen/benchmarks/compare_perf.py eager.json compile.json, then use llm-torch-profiler-analysis on the matching
profile dirs to explain whether the gain came from fewer launches, fewer copies,
or fused kernels replacing eager ATen ops.
Remove temporary validation directories when finished.
ssh h100_sglang 'docker exec sglang_bbuf rm -rf /tmp/sglang_local_validate /tmp/sglang_validate_h100'