kernel-builder/skills/cuda-kernels/SKILL.md
Provides guidance for writing and benchmarking optimized CUDA kernels for NVIDIA GPUs (H100, A100, T4) targeting HuggingFace diffusers and transformers libraries. Supports models like LTX-Video, Stable Diffusion, LLaMA, Mistral, and Qwen. Includes integration with HuggingFace Kernels Hub (get_kernel) for loading pre-compiled kernels. Includes benchmarking scripts to compare kernel performance against baseline implementations.
npx skillsauth add huggingface/kernels cuda-kernelsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
This skill provides patterns and guidance for developing optimized CUDA kernels targeting NVIDIA GPUs (H100, A100, T4) for use with HuggingFace diffusers and transformers libraries.
For benchmarking kernel performance:
# Benchmark with optimized kernels (6% end-to-end speedup)
python generate_video.py --use-optimized-kernels
# Benchmark baseline with torch.compile (34% speedup)
python generate_video.py --no-optimized-kernels --compile
# Compare configurations (note: --compile and --use-optimized-kernels are mutually exclusive)
python generate_video.py --use-optimized-kernels && \
python generate_video.py --no-optimized-kernels --compile
For a minimal diffusers integration example (~150 lines):
python scripts/ltx_kernel_injection_example.py
For a minimal transformers integration example (~120 lines):
python scripts/transformers_injection_example.py
Load pre-compiled kernels from HuggingFace Hub (no local compilation):
from kernels import get_kernel
# Load optimized activation kernels
activation = get_kernel("kernels-community/activation", version=1)
# Use the kernel
y = torch.empty_like(x)
activation.gelu_fast(y, x)
For a complete HuggingFace Kernels example:
python scripts/huggingface_kernels_example.py
python benchmark_rmsnorm.py
| Library | Supported Models | Key Kernels | |---------|------------------|-------------| | diffusers | LTX-Video, Stable Diffusion, FLUX, DiT | RMSNorm, GEGLU, RoPE, AdaLN | | transformers | LLaMA, Mistral, Qwen, Falcon | RMSNorm, Attention |
| GPU | Compute Capability | Guide | |-----|-------------------|-------| | H100 | sm_90 | h100-optimization-guide.md | | A100 | sm_80 | a100-optimization-guide.md | | T4 | sm_75 | t4-optimization-guide.md |
Use this skill when:
A complete working example is available at examples/ltx_video/. This demonstrates:
Use the benchmark script to measure kernel performance:
# Full benchmark with all options
python scripts/benchmark_example.py \
--use-optimized-kernels \
--compile \
--batch-size 1 \
--num-frames 161 \
--height 512 \
--width 768 \
--steps 50 \
--warmup-iterations 2
| Option | Default | Description |
|--------|---------|-------------|
| --use-optimized-kernels | auto | Use custom H100 CUDA kernels |
| --no-optimized-kernels | - | Use baseline implementation |
| --compile | false | Enable torch.compile on transformer |
| --batch-size | 1 | Number of videos per prompt |
| --num-frames | 161 | Number of frames to generate |
| --height | 512 | Video height in pixels |
| --width | 768 | Video width in pixels |
| --steps | 50 | Denoising steps |
| --warmup-iterations | 2 | Warmup runs before benchmark |
End-to-End Video Generation (49 frames, 30 steps, H100 80GB):
| Configuration | Time (s) | it/s | Speedup | Notes | |:---|:---:|:---:|:---:|:---| | Baseline (no compile) | 2.87 | 12.58 | 1.00x | Reference | | Optimized Kernels | 2.70 | 13.52 | 1.06x | 6% faster | | Baseline + torch.compile | 2.14 | 19.05 | 1.34x | 34% faster |
Important: --use-optimized-kernels and --compile are currently mutually exclusive. Custom kernels require PyTorch custom op registration to work with torch.compile.
Key metrics to capture:
The vectorized RMSNorm kernel achieves 2.67x average speedup over PyTorch baseline:
| Shape | Custom (ms) | PyTorch (ms) | Speedup | |:---|:---:|:---:|:---:| | [1×1024×2048] | 0.019 | 0.065 | 3.37x | | [2×1024×2048] | 0.024 | 0.073 | 3.04x | | [4×1024×2048] | 0.036 | 0.093 | 2.58x | | [2×4096×3072] | 0.087 | 0.208 | 2.41x | | [4×4096×3072] | 0.157 | 0.392 | 2.49x |
Bandwidth efficiency: 38% of H100's theoretical 3.35 TB/s
Why end-to-end speedup is smaller: RMSNorm accounts for ~5% of total compute in LTX-Video. The remaining time is spent in attention (Flash Attention/SDPA), linear projections, and VAE decode.
.claude/skills/cuda-kernels/
├── scripts/
│ ├── benchmark_example.py # End-to-end video generation benchmark
│ ├── benchmark_rmsnorm.py # Isolated RMSNorm micro-benchmark
│ ├── ltx_kernel_injection_example.py # Minimal diffusers integration (~150 lines)
│ ├── transformers_injection_example.py # Minimal transformers integration (~120 lines)
│ └── huggingface_kernels_example.py # HuggingFace Kernels Hub integration
├── references/
│ ├── diffusers-integration.md # Complete diffusers integration guide
│ ├── transformers-integration.md # Complete transformers integration guide
│ ├── huggingface-kernels-integration.md # HuggingFace Kernels Hub (get_kernel) guide
│ ├── troubleshooting.md # Common issues and solutions
│ ├── kernel-templates.md # CUDA kernel templates (includes vectorized)
│ ├── h100-optimization-guide.md # H100 (Hopper) optimization deep dive
│ ├── a100-optimization-guide.md # A100 (Ampere) optimization deep dive
│ └── t4-optimization-guide.md # T4 (Turing) optimization deep dive
└── SKILL.md # This file
examples/ltx_video/ # Complete working example
├── kernel_src/
│ └── rmsnorm.cu # Vectorized RMSNorm kernel (2.67x faster)
├── torch-ext/ # PyTorch bindings
├── generate_video.py # Full benchmark script
├── benchmark_rmsnorm.py # Isolated kernel benchmark
└── setup.py # pip install -e .
| Spec | Value | Optimization Impact | |------|-------|---------------------| | SMs | 132 | Grid sizing: aim for multiples of 132 | | Threads/SM | 2048 | Max 16 blocks of 128 threads per SM | | Shared Memory | 192 KB/SM | Large tiles possible | | L2 Cache | 50 MB | Reuse across blocks | | Memory BW | 3.35 TB/s | Coalesced access critical | | Warp Size | 32 | All reductions use warp shuffles |
| Spec | H100 | A100 | T4 | |------|------|------|-----| | SMs | 132 | 108 | 40 | | Memory BW | 3.35 TB/s | 2.0 TB/s | 320 GB/s | | Shared Mem/SM | 192 KB | 164 KB | 64 KB | | BF16 Support | Yes | Yes | No (FP16 only) | | Compute Cap | sm_90 | sm_80 | sm_75 |
See detailed guides: H100 | A100 | T4
BFloat16 vectorization using __nv_bfloat162:
// Load 2 bfloat16 elements at once (32-bit load)
const __nv_bfloat162* vec_input = reinterpret_cast<const __nv_bfloat162*>(row_input);
#pragma unroll 4
for (int i = tid; i < vec_hidden; i += stride) {
__nv_bfloat162 v = vec_input[i];
float v0 = __bfloat162float(v.x);
float v1 = __bfloat162float(v.y);
sum_sq += v0 * v0 + v1 * v1;
}
FP16 vectorization using __half2:
const __half2* vec_input = reinterpret_cast<const __half2*>(row_input);
__half2 v = vec_input[i];
float v0 = __half2float(v.x);
float v1 = __half2float(v.y);
FP32 vectorization using float4:
const float4* vec_input = reinterpret_cast<const float4*>(row_input);
float4 v = vec_input[i];
sum_sq += v.x * v.x + v.y * v.y + v.z * v.z + v.w * v.w;
template <typename T>
__device__ __forceinline__ T warp_reduce_sum(T val) {
#pragma unroll
for (int offset = 16; offset > 0; offset >>= 1) {
val += __shfl_xor_sync(0xffffffff, val, offset);
}
return val;
}
BLOCK_SIZE_M = 128, BLOCK_SIZE_N = 64, BLOCK_SIZE_K = 64NUM_WARPS = 8For element-wise ops (RoPE, GEGLU):
constexpr int BLOCK_SIZE = 256;
int num_blocks = (total_elements + BLOCK_SIZE - 1) / BLOCK_SIZE;
For reduction ops (LayerNorm, RMSNorm) with vectorization:
// Divide by 2 for bf16/fp16 vectorized access
int threads = min(hidden_size / 2, MAX_THREADS);
threads = max(threads, WARP_SIZE);
threads = (threads + 32 - 1) / 32 * 32; // Round to warp boundary
All kernels support three precision modes:
__half (FP16) - Default for inference__nv_bfloat16 (BF16) - Preferred for trainingfloat (FP32) - Reference/debuggingnix run .#build-and-copy --max-jobs 2 --cores 8 -L
uv pip install -e .
[general]
name = "ltx_kernels"
backends = ["cuda"]
[kernel.your_kernel]
backend = "cuda"
src = ["kernel_src/your_kernel.cu"]
cuda-capabilities = ["9.0"]
See huggingface-kernels-integration.md for the complete guide.
Load pre-compiled, optimized kernels directly from HuggingFace Hub without local compilation:
from kernels import get_kernel, has_kernel
# Check availability and load
if has_kernel("kernels-community/activation"):
activation = get_kernel("kernels-community/activation", version=1)
# Use the kernel
x = torch.randn((4, 4), dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)
Key functions:
get_kernel(repo_id, version=None) - Download and load kernel from Hubhas_kernel(repo_id) - Check if compatible build existsget_local_kernel(path) - Load from local directory (development)Popular community kernels:
kernels-community/activation - GELU, SiLU, etc.kernels-community/flash-attn - Flash Attention 2kernels-community/triton-layer-norm - LayerNorm, RMSNormSee diffusers-integration.md for the complete guide.
See transformers-integration.md for the complete guide.
Key differences from diffusers:
elementwise_affine=False)'RMSNorm' in class_name to match LlamaRMSNorm, MistralRMSNorm, etc.variance_epsilon (LLaMA) or eps (others) for epsilonset_processor() pattern - use Flash Attention 2 insteadMinimal transformers pattern:
from transformers import AutoModelForCausalLM
from ltx_kernels import rmsnorm
def patch_rmsnorm(model):
for name, module in model.named_modules():
if 'RMSNorm' in type(module).__name__:
eps = getattr(module, 'variance_epsilon', None) or getattr(module, 'eps', 1e-6)
def make_forward(mod, epsilon):
def forward(x):
return rmsnorm(x, mod.weight, eps=epsilon)
return forward
module.forward = make_forward(module, eps)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16)
patch_rmsnorm(model)
LTX-Video uses elementwise_affine=False for some RMSNorm modules:
# Transformer blocks: NO WEIGHT
self.norm1 = RMSNorm(dim, elementwise_affine=False)
# Attention modules: HAS WEIGHT
self.norm_q = torch.nn.RMSNorm(..., elementwise_affine=True)
Solution: Handle both cases:
has_weight = hasattr(module, 'weight') and module.weight is not None
if has_weight:
output = rmsnorm(x, module.weight, eps=eps)
else:
weight = torch.ones(x.shape[-1], device=x.device, dtype=x.dtype)
output = rmsnorm(x, weight, eps=eps)
# WRONG - misses diffusers RMSNorm
if isinstance(module, torch.nn.RMSNorm):
# CORRECT - catches all RMSNorm variants
if type(module).__name__ == 'RMSNorm':
LTX-Video uses activation_fn="gelu-approximate". Don't patch GEGLU for LTX-Video.
pipe = LTXPipeline.from_pretrained(...)
pipe.to("cuda")
inject_optimized_kernels(pipe) # BEFORE offloading
pipe.enable_model_cpu_offload() # Now safe
from diffusers import LTXPipeline
from ltx_kernels import rmsnorm
def patch_rmsnorm_modules(model):
"""Patch all RMSNorm modules to use custom kernel."""
for name, module in model.named_modules():
if type(module).__name__ == 'RMSNorm':
eps = getattr(module, 'eps', 1e-6)
has_weight = hasattr(module, 'weight') and module.weight is not None
if has_weight:
def make_forward(mod, epsilon):
def forward(x):
return rmsnorm(x, mod.weight, eps=epsilon)
return forward
module.forward = make_forward(module, eps)
else:
def make_forward(epsilon):
def forward(x):
w = torch.ones(x.shape[-1], device=x.device, dtype=x.dtype)
return rmsnorm(x, w, eps=epsilon)
return forward
module.forward = make_forward(eps)
# Usage
pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.to("cuda")
patch_rmsnorm_modules(pipe.transformer)
pipe.enable_model_cpu_offload()
[..., hidden_size]elementwise_affine=False__nv_bfloat162 for BF16, __half2 for FP16, float4 for FP32[batch, seq, heads, head_dim] - for text[batch, t*h*w, heads, head_dim] - for videoLTXVideoRotaryPosEmbed[batch, seq, 2*hidden] -> Output [batch, seq, hidden]norm(x) * weight * (1 + scale) + shift# NVIDIA Nsight Systems
nsys profile -o profile python your_script.py
# NVIDIA Nsight Compute
ncu --set full -o metrics python your_script.py
See troubleshooting.md for all common issues and solutions.
Quick fixes:
type(module).__name__ insteadenable_model_cpu_offload()Custom CUDA kernels and torch.compile are mutually exclusive unless you register the kernel as a PyTorch custom op.
Error message:
torch._dynamo.exc.Unsupported: Attempted to call function marked as skipped
Workaround options:
--use-optimized-kernels without --compile (6% speedup)--compile without custom kernels (34% speedup)torch.library)To register as custom op (for torch.compile compatibility):
import torch
@torch.library.custom_op("ltx_kernels::rmsnorm", mutates_args={"out"})
def rmsnorm(out: torch.Tensor, input: torch.Tensor, weight: torch.Tensor, eps: float) -> None:
ops.rmsnorm_forward(out, input.contiguous(), weight.contiguous(), eps)
@rmsnorm.register_fake
def _(out, input, weight, eps):
pass # No shape changes
development
Provides guidance for writing, optimizing, and benchmarking Triton kernels for Intel XPU GPUs (Battlemage/Arc Pro B50) using the Xe-Forge optimization framework. Includes an LLM-driven trial-loop workflow (analyze, validate, benchmark, profile, finalize), XPU-specific patterns (tensor descriptors, GRF mode, tile swizzling), KernelBench fused kernels, and Flash Attention.
devops
Provides guidance for writing and benchmarking optimized Triton kernels for AMD GPUs (MI355X, R9700) on ROCm, targeting HuggingFace diffusers (LTX-Video, SD3, FLUX) and transformers. Core kernels: RMSNorm, RoPE 3D, GEGLU, AdaLN. Includes XCD swizzle, autotune, diffusers integration patterns, and LTX-Video pipeline injection.
development
Maintainer-only workflow for handling GitHub Secret Scanning alerts on OpenClaw. Use when Codex needs to triage, redact, clean up, and resolve secret leakage found in issue comments, issue bodies, PR comments, or other GitHub content.
development
Maintainer workflow for OpenClaw releases, prereleases, changelog release notes, and publish validation. Use when Codex needs to prepare or verify stable or beta release steps, align version naming, assemble release notes, check release auth requirements, or validate publish-time commands and artifacts.