claude/skills/cuda-kernel-optimization/SKILL.md
CUDA kernel development and GPU optimization patterns — memory hierarchy, occupancy tuning, coalescing, shared memory tiling, warp-level ops, and profiling with Nsight Compute. Use when writing or optimizing CUDA C++ kernels.
npx skillsauth add aleonsa/claude-config cuda-kernel-optimizationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Deep reference for writing high-performance CUDA kernels in C++. Covers the GPU execution model, memory hierarchy, occupancy, access patterns, and systematic profiling.
.cu/.cuh CUDA kernels from scratchGrid
└── Blocks (up to 3D)
└── Warps (32 threads, scheduled together)
└── Threads
| Unit | Max size | Notes | |------|----------|-------| | Thread block | 1024 threads | All dimensions combined | | Grid | 2³¹-1 per dim (x), 65535 (y/z) | | | Warp | 32 threads | Lockstep execution unit |
The GPU hides latency through massive multithreading — always have enough warps in flight to cover memory latency (~300–700 cycles for global memory).
// Minimal kernel shape: choose blockDim first, then gridDim
constexpr int BLOCK = 256;
int grid = (N + BLOCK - 1) / BLOCK;
kernel<<<grid, BLOCK>>>(args...);
| Memory | Scope | Latency | Size | Bandwidth | |--------|-------|---------|------|-----------| | Registers | Thread | 1 cycle | ~255 per thread | — | | Shared memory | Block | ~20–30 cycles | 48–228 KB/SM (arch-dep) | ~20 TB/s | | L1 cache | SM | ~30 cycles | 32–128 KB (shared with smem) | — | | L2 cache | Device | ~100–200 cycles | 4–50 MB (arch-dep) | ~4 TB/s | | Global memory (DRAM) | Device | ~300–700 cycles | GBs | ~900 GB/s (H100) | | Constant memory | Device (cached) | 1 cycle (hit) | 64 KB | — | | Texture memory | Device (cached) | Low (2D locality) | — | — |
__global__ void tiledMatMul(const float* A, const float* B, float* C,
int N) {
constexpr int TILE = 32;
__shared__ float tileA[TILE][TILE];
__shared__ float tileB[TILE][TILE];
int row = blockIdx.y * TILE + threadIdx.y;
int col = blockIdx.x * TILE + threadIdx.x;
float acc = 0.f;
for (int t = 0; t < N / TILE; ++t) {
tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
tileB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
__syncthreads();
#pragma unroll
for (int k = 0; k < TILE; ++k)
acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
__syncthreads(); // prevent overwrite before all threads finish
}
C[row * N + col] = acc;
}
Threads in a warp should access consecutive 128-byte aligned addresses in a single transaction.
// GOOD — coalesced: thread i accesses element i
float val = data[blockIdx.x * blockDim.x + threadIdx.x];
// BAD — strided: each thread jumps by `stride`
float val = data[threadIdx.x * stride]; // multiple transactions
Shared memory has 32 banks (4-byte width). Threads in a warp accessing the same bank (different addresses) serialize.
// BAD — column-major access causes bank conflicts
float val = tile[threadIdx.x][threadIdx.y]; // all threads hit bank = threadIdx.x % 32
// FIX — pad the shared array by 1
__shared__ float tile[32][33]; // +1 padding breaks conflict pattern
float* data;
cudaMallocManaged(&data, N * sizeof(float));
// Accessible from host and device; managed by runtime
// Prefetch to avoid page faults during kernel:
cudaMemPrefetchAsync(data, N * sizeof(float), device_id);
Occupancy = active warps / max warps per SM. Higher occupancy enables better latency hiding.
Limiters (whichever binds first):
int minGridSize, blockSize;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
int gridSize = (N + blockSize - 1) / blockSize;
myKernel<<<gridSize, blockSize>>>(args);
// Force compiler to limit registers (may increase spilling — profile first)
__launch_bounds__(256 /*maxThreadsPerBlock*/, 4 /*minBlocksPerSM*/)
__global__ void myKernel(...) { ... }
// BAD — threads in the same warp take different paths → serialized
if (threadIdx.x % 2 == 0) { doA(); } else { doB(); }
// BETTER — divergence across warps is free (different warps run independently)
// Ensure condition is uniform within a warp:
if (threadIdx.x / 32 == 0) { doA(); } // entire warp 0 takes this branch
// Warp reduction — no shared memory needed
float val = /* per-thread value */;
for (int offset = 16; offset > 0; offset >>= 1)
val += __shfl_down_sync(0xffffffff, val, offset);
// Thread 0 of each warp holds the warp sum
// Warp vote
bool pred = (val > threshold);
unsigned mask = __ballot_sync(0xffffffff, pred); // bitmask of threads where pred=true
int count = __popc(mask);
template <int BLOCK_SIZE>
__global__ void reduce(const float* in, float* out, int N) {
__shared__ float smem[BLOCK_SIZE];
int tid = threadIdx.x;
int i = blockIdx.x * BLOCK_SIZE * 2 + tid;
smem[tid] = (i < N ? in[i] : 0.f) + (i + BLOCK_SIZE < N ? in[i + BLOCK_SIZE] : 0.f);
__syncthreads();
// Sequential addressing — no bank conflicts
for (int s = BLOCK_SIZE / 2; s > 32; s >>= 1) {
if (tid < s) smem[tid] += smem[tid + s];
__syncthreads();
}
// Warp-level unrolling (no sync needed within a warp)
if (tid < 32) {
smem[tid] += smem[tid + 32];
smem[tid] += __shfl_down_sync(0xffffffff, smem[tid], 16);
smem[tid] += __shfl_down_sync(0xffffffff, smem[tid], 8);
smem[tid] += __shfl_down_sync(0xffffffff, smem[tid], 4);
smem[tid] += __shfl_down_sync(0xffffffff, smem[tid], 2);
smem[tid] += __shfl_down_sync(0xffffffff, smem[tid], 1);
}
if (tid == 0) out[blockIdx.x] = smem[0];
}
Use Thrust for production:
#include <thrust/scan.h>
thrust::inclusive_scan(d_in, d_in + N, d_out);
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
// Overlap kernel execution with H2D copy
cudaMemcpyAsync(d_buf, h_buf, size, cudaMemcpyHostToDevice, s1);
kernelA<<<grid, block, 0, s1>>>(d_buf);
kernelB<<<grid, block, 0, s2>>>(d_other); // runs concurrently if SM resources allow
cudaStreamSynchronize(s1);
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
// Macro — use in all CUDA API calls
#define CUDA_CHECK(call) \
do { \
cudaError_t err = (call); \
if (err != cudaSuccess) { \
fprintf(stderr, "CUDA error at %s:%d — %s\n", \
__FILE__, __LINE__, cudaGetErrorString(err)); \
std::exit(EXIT_FAILURE); \
} \
} while (0)
// Kernel launch errors are async — always sync and check after launch
myKernel<<<grid, block>>>(args);
CUDA_CHECK(cudaGetLastError());
CUDA_CHECK(cudaDeviceSynchronize()); // remove in production hot paths
# Full kernel profile
ncu --set full -o report ./my_app
# Target one kernel
ncu --kernel-name myKernel --set full ./my_app
# Key metrics to check first
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum,\
sm__warps_active.avg.pct_of_peak_sustained_active \
./my_app
| Metric | Target | If Low → | |--------|--------|----------| | SM throughput % | >80% | Low occupancy or memory bound | | Global load efficiency | >80% | Uncoalesced accesses | | Shared memory bank conflict rate | 0 | Pad shared arrays | | Warp execution efficiency | >85% | Warp divergence | | Achieved occupancy | >50% (app-dep) | Register/smem pressure | | L2 hit rate | >50% | Poor data reuse, increase tiling |
nsys profile --trace=cuda,nvtx ./my_app
nsys-ui report.nsys-rep # open GUI
Work through these in order — fix the biggest bottleneck first (Amdahl's law).
ncu to identify the limiter.cudaOccupancyMaxPotentialBlockSize.--ptxas-options=-v in compiler output.__shfl_* where possible.cudaMemcpyAsync.#pragma unroll. For small, fixed-bound inner loops.new/malloc inside __global__.| GPU Family | Arch | Compute Cap | Key Feature | |-----------|------|-------------|-------------| | Volta | V100 | 7.0 | Tensor Cores (FP16), NVLink | | Turing | T4/RTX20 | 7.5 | INT8 Tensor Cores, RT cores | | Ampere | A100/RTX30 | 8.0/8.6 | BF16, TF32, 3rd-gen Tensor Cores | | Hopper | H100 | 9.0 | FP8, Transformer Engine, NVLink4 | | Ada Lovelace | RTX40 | 8.9 | FP8, DLSS 3 |
| Library | Purpose | |---------|---------| | Thrust | STL-like parallel algorithms (sort, scan, reduce) | | cuBLAS | Dense linear algebra (GEMM, TRSM…) | | cuDNN | Deep learning primitives (conv, norm, attention) | | cuSPARSE | Sparse matrix operations | | cuFFT | Fast Fourier Transforms | | CUTLASS | Template library for custom GEMM/conv | | CUB | Low-level building blocks (warp/block/device primitives) |
documentation
Translate visa application documents (images) to English and create a bilingual PDF with original and translation
development
A comprehensive verification system for Claude Code sessions.
development
Use this skill when writing new features, fixing bugs, or refactoring code. Enforces test-driven development with 80%+ coverage including unit, integration, and E2E tests.
tools
SwiftUI architecture patterns, state management with @Observable, view composition, navigation, performance optimization, and modern iOS/macOS UI best practices.