amd-ci-test-bisect/SKILL.md
Add or update AMD/ROCm SGLang regression tests, choose the right MI325/MI35x CI suite, reproduce AMD CI locally with the upstream amd_ci container scripts, and bisect AMD CI regressions on main/nightly. Use when an agent-generated fix needs register_amd_ci coverage, when selecting MI325 vs MI35x runners, when debugging pr-test-amd or nightly AMD failures, or when documenting the register_amd_ci flow for ROCm-specific fixes.
This skill is the AMD/ROCm adaptation of SGLang's write-sglang-test,
ci-workflow-guide, and sglang-bisect-ci-regression skills.
Use it for four things:
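- Add or update AMD/ROCm regression tests with register_amd_ci(...)
- Choose the right MI325/MI35x CI suite
- Reproduce AMD CI locally with the scripts/ci/amd/* flow
- Bisect AMD CI regressions on main/nightly

Read references/suite-matrix.md when you need exact suite/runner mappings or local reproduction commands.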
Read references/bisect-playbook.md when you are debugging a failing AMD CI run on main or nightly.
register_amd_ci is AST-parsed.
Keep est_time, suite, nightly, and disabled as module-level literal constants.
Do not hide them behind helper functions, variables, or computed expressions.
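To see why literals matter, here is a minimal sketch of how an AST-based collector could read a registration without importing the test file. It is not SGLang's actual collector: `extract_registration` and the two sample sources are hypothetical, and the only assumption carried over from the rule above is that the call sits at module level with literal keyword arguments.

```python
import ast

# A registration written the way the rule asks: literal keyword arguments.
SOURCE_OK = '''
from sglang.test.ci.ci_register import register_amd_ci

register_amd_ci(est_time=120, suite="stage-b-test-1-gpu-small-amd")
'''

# The same intent hidden behind a variable and a computed expression.
SOURCE_BAD = '''
from sglang.test.ci.ci_register import register_amd_ci

SUITE = "stage-b-test-1-gpu-small-amd"
register_amd_ci(est_time=60 * 2, suite=SUITE)
'''


def extract_registration(source: str) -> dict:
    """Return the literal keyword arguments of a module-level register_amd_ci call.

    Hypothetical helper: the source is parsed, never executed or imported.
    """
    tree = ast.parse(source)
    for node in tree.body:  # module-level statements only
        if not (isinstance(node, ast.Expr) and isinstance(node.value, ast.Call)):
            continue
        call = node.value
        if not (isinstance(call.func, ast.Name) and call.func.id == "register_amd_ci"):
            continue
        # ast.literal_eval accepts only literal nodes; a variable name or
        # arithmetic such as 60 * 2 raises ValueError.
        return {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    raise LookupError("no module-level register_amd_ci(...) call found")


print(extract_registration(SOURCE_OK))
try:
    extract_registration(SOURCE_BAD)
except ValueError as exc:
    print("not statically readable:", exc)
```

A parser of this kind never runs the module, so anything that needs evaluation is invisible to it.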
Only add AMD registration when AMD coverage is the point.
Use register_amd_ci(...) for ROCm-only kernels, HIP/aiter paths, MI35x/gfx950 behavior,
RCCL/distributed AMD paths, or an AMD regression fix.
Do not duplicate backend-independent tests onto AMD just because AMD exists.
Choose the lightest AMD suite that proves the fix. Prefer 1-GPU MI325 first. Escalate to MI35x only when the failure depends on gfx950 / MI355X / ROCm 7.2 image behavior. Escalate to 2/4/8 GPU only when the bug requires distributed state, DP/TP, or model scale.
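The escalation order above can be sketched as a small decision function. `pick_amd_suite` and its parameters are hypothetical, and the mapping from requirements to suite names is an assumption inferred from the suite names used in this skill; references/suite-matrix.md remains the authority.

```python
def pick_amd_suite(num_gpus: int = 1, needs_gfx950: bool = False,
                   large_model: bool = False) -> str:
    """Return the lightest suite name that covers the stated requirements.

    Hypothetical helper; the mapping is inferred from suite names only.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if needs_gfx950:
        # Escalate to MI35x only when the failure depends on gfx950 behavior.
        if num_gpus == 1:
            return "stage-b-test-1-gpu-small-amd-mi35x"
        if num_gpus <= 8:
            return "stage-c-test-large-8-gpu-amd-mi35x"
        raise ValueError("no MI35x suite in this sketch covers more than 8 GPUs")
    if num_gpus == 1:
        # Prefer 1-GPU MI325 first.
        if large_model:
            return "stage-b-test-1-gpu-large-amd"
        return "stage-b-test-1-gpu-small-amd"
    if num_gpus == 2:
        return "stage-b-test-2-gpu-large-amd"
    if num_gpus <= 4:
        return "stage-c-test-4-gpu-amd"
    raise ValueError("consult references/suite-matrix.md for larger MI325 suites")


print(pick_amd_suite())                                # default: 1-GPU MI325
print(pick_amd_suite(num_gpus=2))                      # distributed state needed
print(pick_amd_suite(needs_gfx950=True))               # gfx950-dependent failure
```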
MI325 and MI35x are different validation targets.
In upstream AMD CI, MI325-class runners are generally normalized onto mi30x images.
MI35x runners use MI35x images and amd_ci_exec.sh injects GPU_ARCHS=gfx950.
Treat them as separate contracts, not interchangeable hardware.
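As a rough model of the two contracts, the sketch below returns the image family and the environment variables this skill says amd_ci_exec.sh adds. `amd_ci_contract` and the "mi325" / "mi35x" labels are hypothetical; the real scripts derive this from the runner, not from a Python argument.

```python
def amd_ci_contract(gpu_family: str) -> dict:
    """Describe the image family and injected environment for a GPU family.

    Hypothetical helper; gpu_family is "mi325" or "mi35x" in this sketch.
    """
    if gpu_family not in ("mi325", "mi35x"):
        raise ValueError(f"unknown GPU family: {gpu_family!r}")
    # Variables this skill says amd_ci_exec.sh adds on every AMD runner.
    env = {"SGLANG_IS_IN_CI_AMD": "1", "SGLANG_USE_AITER": "1"}
    if gpu_family == "mi35x":
        env["GPU_ARCHS"] = "gfx950"  # added on MI35x only
        image = "mi35x"
    else:
        image = "mi30x"  # MI325-class runners are normalized onto mi30x images
    return {"image_family": image, "env": env}


for family in ("mi325", "mi35x"):
    print(family, amd_ci_contract(family))
```

The two results differ in both image and environment, which is why a pass on one family says little about the other.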
Reproduce with the AMD CI scripts, not ad-hoc shell state.
Use ensure_vram_clear.sh, amd_ci_start_container.sh, amd_ci_install_dependency.sh,
and amd_ci_exec.sh. That is the closest match to what GitHub Actions actually runs.
For regressions, separate code regressions from runner/image drift. Always record runner label, GPU family, ROCm version, and container image tag before blaming a commit.
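One way to make that record concrete is a small structure compared between the last passing and first failing run. `RunContext`, `environment_drift`, and every value in the example are hypothetical placeholders, not real runner labels or image tags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunContext:
    """Facts to record for each CI run before blaming a commit (hypothetical)."""
    commit: str
    runner_label: str
    gpu_family: str
    rocm_version: str
    image_tag: str


ENV_FIELDS = ("runner_label", "gpu_family", "rocm_version", "image_tag")


def environment_drift(good: RunContext, bad: RunContext) -> dict:
    """Return the environment fields that differ, as {field: (good, bad)}."""
    return {
        name: (getattr(good, name), getattr(bad, name))
        for name in ENV_FIELDS
        if getattr(good, name) != getattr(bad, name)
    }


# Placeholder values for illustration only.
good = RunContext("aaa111", "runner-a", "mi325", "rocm720", "image-tag-1")
bad = RunContext("bbb222", "runner-a", "mi325", "rocm720", "image-tag-2")

drift = environment_drift(good, bad)
if drift:
    print("environment changed, rule out drift first:", drift)
else:
    print("same environment, a code regression is plausible")
```

If the drift dictionary is non-empty, rerunning the known-good commit in the failing environment is usually cheaper than starting a bisect.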
Ask:
- Does the behavior depend on is_in_amd_ci(), HIP kernels, aiter, RCCL, or ROCm image contents?

If the answer is "common logic, no AMD-specific path", keep the test on CPU/CUDA and do not add AMD registration just for symmetry.
Default SGLang authoring rules still apply:
- Use CustomTestCase
- Keep tearDownClass defensive
- Place tests under test/registered/**

For AMD-specific behavior, the common pattern is:
from sglang.test.ci.ci_register import register_amd_ci
register_amd_ci(est_time=120, suite="stage-b-test-1-gpu-small-amd")
When the same file should run on both CUDA and AMD, register both explicitly and gate behavior inside the test body with is_in_amd_ci().
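The gating half of that pattern might look like the sketch below. To keep it runnable without SGLang, the registration calls are omitted and `is_in_amd_ci` is a local stand-in; it assumes the real helper reflects the SGLANG_IS_IN_CI_AMD variable, which should be checked against the SGLang source. A real test would also derive from CustomTestCase rather than unittest.TestCase.

```python
import os
import unittest


def is_in_amd_ci() -> bool:
    # Stand-in for illustration only; a real test imports SGLang's helper.
    # Assumption: the flag mirrors the SGLANG_IS_IN_CI_AMD variable.
    return os.environ.get("SGLANG_IS_IN_CI_AMD", "0").lower() in ("1", "true")


class TestSharedAndAmd(unittest.TestCase):
    def test_shared_logic(self):
        # Backend-independent check: runs on every registered platform.
        self.assertEqual(sorted([3, 1, 2]), [1, 2, 3])

    def test_amd_only_path(self):
        # AMD-specific assertions are gated inside the test body.
        if not is_in_amd_ci():
            self.skipTest("AMD-only assertion")
        self.assertEqual(os.environ.get("SGLANG_USE_AITER"), "1")
```

Outside AMD CI the second test reports as skipped rather than silently passing, which keeps the coverage gap visible in the test report.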
Use disabled="reason" instead of deleting coverage when a suite is temporarily too expensive or unstable.
Use references/suite-matrix.md.
Practical default:
- stage-a-test-1-gpu-small-amd or stage-b-test-1-gpu-small-amd
- stage-b-test-1-gpu-large-amd
- stage-b-test-1-gpu-small-amd-mi35x or stage-c-test-large-8-gpu-amd-mi35x
- stage-b-test-2-gpu-large-amd or stage-c-test-4-gpu-amd

From a SGLang checkout:
bash scripts/ci/amd/ensure_vram_clear.sh
bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
bash scripts/ci/amd/amd_ci_install_dependency.sh
bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" \
python3 run_suite.py --hw amd --suite stage-b-test-1-gpu-small-amd
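If the sequence needs to be scripted, for example to rerun it per commit, a dry-run builder along these lines keeps the four steps in one place. `build_repro_commands` is a hypothetical helper; it only reassembles the commands and flags shown above and prints them instead of executing anything.

```python
import shlex


def build_repro_commands(suite: str, rocm_version: str = "rocm720",
                         workdir: str = "/sglang-checkout/test") -> list:
    """Return the local reproduction steps as argv lists, in execution order."""
    scripts = "scripts/ci/amd"
    return [
        ["bash", f"{scripts}/ensure_vram_clear.sh"],
        ["bash", f"{scripts}/amd_ci_start_container.sh", "--rocm-version", rocm_version],
        ["bash", f"{scripts}/amd_ci_install_dependency.sh"],
        ["bash", f"{scripts}/amd_ci_exec.sh", "-w", workdir,
         "python3", "run_suite.py", "--hw", "amd", "--suite", suite],
    ]


# Dry run: print each step; pass the argv lists to subprocess.run to execute them.
for argv in build_repro_commands("stage-b-test-1-gpu-small-amd"):
    print(shlex.join(argv))
```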
Notes:
- amd_ci_start_container.sh auto-detects MI325/MI35x from the runner hostname.
- MI325 is normalized onto mi30x images; MI35x keeps mi35x.
- amd_ci_exec.sh auto-adds SGLANG_IS_IN_CI_AMD=1, SGLANG_USE_AITER=1, and on MI35x also GPU_ARCHS=gfx950.
- For disaggregation runs, see amd_ci_start_container_disagg.sh.

Use references/bisect-playbook.md.
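Once a known-good and a known-bad commit are pinned in the same environment, the search itself is a plain binary search. The sketch below is generic rather than the playbook's procedure: `first_bad_commit` and `is_bad` are hypothetical, and it assumes a linear commit list in which every commit after the first bad one also fails.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first failing commit.

    commits is ordered oldest to newest; commits[0] is known good and
    commits[-1] is known bad. is_bad should rerun the failing suite in the
    same container image for the commit it is given.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Toy history: placeholder names c0..c7, with the regression introduced at c5.
history = [f"c{i}" for i in range(8)]
print(first_bad_commit(history, lambda c: int(c[1:]) >= 5))
```

Each probe costs a full suite run, so halving the range matters: eight commits need three probes here instead of up to six sequential ones.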
The AMD twist is important:
- pr-test-amd.yml does push / PR / rerun-stage coverage, not scheduled cron
- nightly-test-amd-rocm720.yml is the scheduled source of truth for long AMD coverage

So:

- Use pr-test-amd.yml runs on main for per-commit AMD regressions

When you finish an AMD test / CI / bisect task, report:

- The chosen register_amd_ci suite and why
- Whether the register_amd_ci(...) literals are valid for AST parsing