skills/gpu-kernel-ako4all/SKILL.md
Use when developing, optimizing, debugging, or porting AI-infra GPU kernels through an AKO4ALL-centered loop, including Triton, CUDA C++/PTX, CUTLASS/CuTe C++, and CuTe DSL kernels; also use when setting up a sibling AKO4ALL repo, creating microbench harnesses, profiling with nsys/ncu, and validating kernel changes against real operator or model benchmarks. Do not trigger on simple Triton or CUDA API lookups; this skill is for full optimization or rewrite tasks where AKO discipline pays off.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS gpu-kernel-ako4all
Use this skill to run a disciplined GPU-kernel optimization loop with AKO4ALL as the outer framework and stack-specific bundled references for Triton, CUDA C++/PTX, CUTLASS/CuTe C++, and CuTe DSL.
This is a derivative synthesis, not original material. Every upstream skill or document referenced is copied into this skill under references/ and templates/. Do not go to .claude/skills, .copilot/skills, temporary clones, or source repositories to read the upstream skills. Read the bundled materials. Preserve the attribution in references/source-attribution.md when copying, publishing, or adapting this skill.
- nsys or ncu results before changing tiling, memory movement, pipeline, or epilogue structure

Do not start here from a vague "make it faster" request. First establish the target kernel, shape family, dtype/layout contract, hardware, and baseline runtime.
Read the AKO loop reference before any implementation:
Then read every implementation reference that matches the task. The top-level *-reference.md files are compact routing guides; the stack-specific subdirectories under references/ contain the deeper bundled material.
| Kernel work | Required bundled references |
| --- | --- |
| Triton kernel or launcher | references/triton-kernel-reference.md, references/triton/triton-overview.md, and the matching references/triton/triton-*.md file(s) for the specific pattern (FlashAttention, persistent matmul, fused norm, quantized GEMM, etc.) |
| CUDA C++ or PTX kernel | references/cuda-cpp-kernel-reference.md, references/cuda-cpp/cuda-cpp-overview.md, and (only on demand) narrow targets inside references/cuda-cpp/vendored-docs/ |
| CUTLASS or CuTe C++ kernel | references/cutlass-cpp-kernel-reference.md and references/cutlass-cpp/cutlass-cpp-overview.md |
| CuTe DSL Python kernel | references/cute-dsl-kernel-reference.md, references/cute-dsl/cute-dsl-overview.md, and the matching references/cute-dsl/cute*.md / intro.md / pipeline.md / utils*.md API snapshots |
| Profiling, debugging, correctness failure, or perf claim | references/profiling-debugging-reference.md; only descend into references/cuda-cpp/vendored-docs/ncu-docs/ or nsys-docs/ when narrowing to a specific section / metric / counter |
| Architecture-specific tuning (sm89 / sm90) | references/nvidia-architecture-reference.md plus references/architectures/sm89-optimization-guide.md or sm90-optimization-guide.md |
| Architecture-specific tuning (sm100 / sm103 / sm120) | references/nvidia-architecture-reference.md plus the stack-specific references/<stack>/sm{100,103,120}-optimization-guide.md matching your lane (CUDA / CUTLASS / CuTe DSL) |
| Generic kernel template families | references/kernel-templates.md |
| Compute-sanitizer / cuda-gdb / build flag troubleshooting | references/troubleshooting.md |
For mixed implementations, read all applicable references. A CUDA wrapper around a CUTLASS kernel needs the CUDA, CUTLASS, profiling, and architecture references.
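The routing table above behaves like a lookup from kind of work to required references, with mixed implementations taking the union. A minimal Python sketch, assuming hypothetical lane keys (`triton`, `cuda`, `cutlass`, `cute_dsl`, `profiling`, `architecture`) and listing only the top-level routing guides and overviews, not the pattern-specific or architecture-specific files:

```python
# Hypothetical lane keys; the paths are the routing guides and overviews
# named in the table. Pattern-specific files are deliberately left out.
REQUIRED_REFERENCES = {
    "triton": [
        "references/triton-kernel-reference.md",
        "references/triton/triton-overview.md",
    ],
    "cuda": [
        "references/cuda-cpp-kernel-reference.md",
        "references/cuda-cpp/cuda-cpp-overview.md",
    ],
    "cutlass": [
        "references/cutlass-cpp-kernel-reference.md",
        "references/cutlass-cpp/cutlass-cpp-overview.md",
    ],
    "cute_dsl": [
        "references/cute-dsl-kernel-reference.md",
        "references/cute-dsl/cute-dsl-overview.md",
    ],
    "profiling": ["references/profiling-debugging-reference.md"],
    "architecture": ["references/nvidia-architecture-reference.md"],
}


def references_for(lanes):
    """Return the de-duplicated, order-preserving union of references."""
    seen = []
    for lane in lanes:
        for path in REQUIRED_REFERENCES[lane]:
            if path not in seen:
                seen.append(path)
    return seen


# A CUDA wrapper around a CUTLASS kernel needs all four groups.
mixed = references_for(["cuda", "cutlass", "profiling", "architecture"])
```

Here `mixed` holds six paths: two each for CUDA and CUTLASS, plus the profiling and architecture guides.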
This skill assumes a sibling layout:
<base-dir>/
├── <target-repo>/
└── AKO4ALL/
Before any AKO work:
- scripts/ensure_ako4all_clean.sh [base-dir].
- AKO4ALL/ when missing (override the upstream URL with AKO4ALL_UPSTREAM_URL if your environment uses a fork or mirror).

Bootstrap a new AKO task by copying templates/ into AKO4ALL/<task>/:
- templates/ITERATIONS.md → <task>/ITERATIONS.md
- templates/kernel_notes.md → <task>/context/<kernel>_notes.md
- templates/bench_kernel.py → <task>/bench/bench_<kernel>.py (Triton or CuTe DSL)
- templates/bench_kernel.cu → <task>/bench/bench_<kernel>.cu (CUDA C++ or CUTLASS/CuTe C++)

These are scaffolds, not auto-discovered tools; fill in the kernel-specific bits before the first run.
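The template-to-task mapping above can be scripted. The following is a minimal Python sketch, not part of the bundled skill: `bootstrap_task`, its `lane` switch, the skip-if-present behavior, and the `rmsnorm` names in the demo are all hypothetical choices made for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def bootstrap_task(templates_dir, ako_dir, task, kernel, lane="python"):
    """Copy the template scaffolds into AKO4ALL/<task>/.

    `lane` is a hypothetical switch: "python" for Triton or CuTe DSL,
    "cuda" for CUDA C++ or CUTLASS/CuTe C++.
    """
    templates = Path(templates_dir)
    task_dir = Path(ako_dir) / task
    ext = {"python": "py", "cuda": "cu"}[lane]
    mapping = {
        templates / "ITERATIONS.md": task_dir / "ITERATIONS.md",
        templates / "kernel_notes.md": task_dir / "context" / f"{kernel}_notes.md",
        templates / f"bench_kernel.{ext}": task_dir / "bench" / f"bench_{kernel}.{ext}",
    }
    created = []
    for src, dst in mapping.items():
        if dst.exists():
            continue  # assumption: never overwrite an existing AKO record
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)
        created.append(dst)
    return created


# Demo against a throwaway directory with empty stand-in templates.
with tempfile.TemporaryDirectory() as base:
    tpl = Path(base) / "templates"
    tpl.mkdir()
    for name in ("ITERATIONS.md", "kernel_notes.md", "bench_kernel.py", "bench_kernel.cu"):
        (tpl / name).write_text("")
    created = bootstrap_task(tpl, Path(base) / "AKO4ALL", "rmsnorm_task", "rmsnorm")
    created_names = sorted(p.relative_to(base).as_posix() for p in created)
```

Skipping files that already exist keeps a rerun from clobbering notes or iteration history; the copied files still need their kernel-specific parts filled in by hand.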
- <task>/context/<kernel>_notes.md).
- tl.dot.

If the first lane fails to transfer to the real workload, keep the AKO record and switch lanes intentionally instead of layering more speculative changes.
Inside the clean AKO4ALL repo, create a custom harness instead of relying on stock benchmark tasks. Copy templates/ (see above) and prefer this layout:
- input/reference.*: trusted reference
- input/<kernel>.*: current baseline (copied from target repo)
- solution/<kernel>.*: candidate(s)
- bench/bench_<kernel>.*: per-shape timing + correctness
- context/<kernel>_notes.md: contract, profiler findings, failed attempts
- ITERATIONS.md: hypothesis / change / result / decision per iteration

The benchmark must cover representative production shapes, tail shapes, dtype variants, and the fastest known baseline.
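The per-shape timing + correctness harness described above might look roughly like the following Python sketch. It is not the content of templates/bench_kernel.py: `percentile`, `bench_case`, the case names, and the tolerance are hypothetical, dtype variants are omitted, and plain Python callables stand in for kernels. For real GPU kernels the timed region must include device synchronization (or use device events), otherwise only launch time is measured.

```python
import math
import statistics
import time


def percentile(samples, pct):
    """Nearest-rank percentile (pct in (0, 100]) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def bench_case(candidate, reference, args, warmup=3, iters=20, rtol=1e-6):
    """Check the candidate against the reference, then time the candidate."""
    got, want = candidate(*args), reference(*args)
    correct = len(got) == len(want) and all(
        math.isclose(g, w, rel_tol=rtol) for g, w in zip(got, want)
    )
    for _ in range(warmup):  # warmup runs are not recorded
        candidate(*args)
    times_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        candidate(*args)
        times_ms.append((time.perf_counter() - start) * 1e3)
    return {
        "correct": correct,
        "p50_ms": statistics.median(times_ms),
        "p95_ms": percentile(times_ms, 95),
    }


# Hypothetical case matrix: two production-like sizes plus one tail size.
cases = {"prod_4096": 4096, "prod_8192": 8192, "tail_4097": 4097}


def reference(xs):
    return [2.0 * x for x in xs]


def candidate(xs):
    return [x + x for x in xs]


results = {
    name: bench_case(candidate, reference, ([float(i) for i in range(n)],))
    for name, n in cases.items()
}
```

Checking correctness before timing means a fast but wrong candidate never produces a number worth recording, and reporting both p50 and p95 per case keeps tail-shape regressions visible.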
- ncu report for the hot shape.
- nsys when launch overhead, CPU/GPU gaps, stream overlap, memory copies, or many small kernels may matter.
- ITERATIONS.md with device, shapes, dtype, warmup, iterations, median (p50), and p95 when latency matters.
- ITERATIONS.md with the hypothesis, exact result, and decision.
- ncu, reread the iteration log, and change direction (or switch lane).

Minimum validation:
- ncu or nsys evidence when claiming a performance root cause

For model-serving or diffusion paths, also run the smallest real model-level benchmark that proves the kernel win transfers beyond the microbench. Hand off as follows:
- llm-serving-auto-benchmark (per-framework deployment search) and sglang-sota-performance (model-level SOTA loop).
- llm-torch-profiler-analysis to convert a torch-profiler trace into a kernel / overlap / fusion table.
- sglang-diffusion-ako4all-kernel for diffusion-specific validation (denoise step accuracy, scheduler interaction). This skill stays the home for cross-stack AKO discipline; that skill carries the diffusion-specific gates.

When finishing work under this skill, report:
- ITERATIONS.md
- ITERATIONS.md so the iteration history is reviewable
- ncu summary delta and/or nsys summary delta)
- ITERATIONS.md.
- references/cuda-cpp/vendored-docs/ are opt-in: open one file at a time when narrowing to a specific instruction, API, or metric; never load the whole mirror.