internal/skills/content/cuda-guide/SKILL.md
CUDA/GPU computing guardrails, patterns, and best practices for AI-assisted development. Use when working with CUDA files (.cu, .cuh), or when the user mentions CUDA/GPU programming. Provides kernel design patterns, memory hierarchy guidelines, and occupancy optimization specific to this project's coding standards.
npx skillsauth add ar4mirez/samuel cuda-guideInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Applies to: CUDA 11+, GPU Computing, Deep Learning, Scientific Computing, HPC
cudaGetLastError() after every kernel launchcudaDeviceSynchronize() before reading kernel results on the hostcompute-sanitizer (successor to cuda-memcheck) in development buildscudaErrorMemoryAllocation gracefully; never assume GPU memory is infinite#define CUDA_CHECK(call) \
do { \
cudaError_t err = call; \
if (err != cudaSuccess) { \
fprintf(stderr, "CUDA error at %s:%d: %s\n", \
__FILE__, __LINE__, cudaGetErrorString(err)); \
exit(EXIT_FAILURE); \
} \
} while (0)
#define CUDA_CHECK_KERNEL() \
do { \
cudaError_t err = cudaGetLastError(); \
if (err != cudaSuccess) { \
fprintf(stderr, "Kernel launch error at %s:%d: %s\n", \
__FILE__, __LINE__, cudaGetErrorString(err)); \
exit(EXIT_FAILURE); \
} \
} while (0)
cudaMalloc with a cudaFree; prefer RAII wrappers in C++ host codecudaMallocManaged (Unified Memory) for prototyping; switch to explicit transfers for productioncudaMallocHost (pinned memory) when streaming data to the GPU; pageable memory cannot overlap with computecudaMemcpyAsync with streams over synchronous cudaMemcpycudaMemset or cudaMemsetAsync to zero-initialize device buffers(n + block_size - 1) / block_sizeif (idx < n) at the top of every kernel__device__, host+device as __host__ __device__// Grid-stride loop: works with any grid size, any data size
__global__ void saxpy(float a, const float* x, float* y, int n) {
for (int i = blockIdx.x * blockDim.x + threadIdx.x;
i < n;
i += blockDim.x * gridDim.x) {
y[i] = a * x[i] + y[i];
}
}
__syncthreads() after every shared memory write before any thread reads another thread's value__syncthreads() inside a conditional branch that not all threads in a block will reach (deadlock)__syncwarp() (CUDA 9+) for warp-level synchronization instead of relying on implicit warp-synchronous executioncudaDeviceSynchronize() sparingly in production; prefer stream synchronization with cudaStreamSynchronize()cudaEventRecord / cudaEventSynchronize) for fine-grained inter-stream orderingfloat over double on consumer GPUs (2x throughput difference)Understanding the memory hierarchy is the single most important factor in CUDA performance.
| Memory Type | Scope | Latency (cycles) | Size | Cached | Read/Write | |-------------|-------|-------------------|------|--------|------------| | Registers | Thread | 1 | ~255 per thread | N/A | R/W | | Shared | Block | ~5 | 48-164 KB per SM | N/A | R/W | | L1 Cache | SM | ~28 | 48-192 KB per SM | Auto | R | | L2 Cache | Device | ~200 | 4-40 MB | Auto | R/W | | Global | Device | ~400-600 | 4-80 GB (HBM/GDDR) | Yes | R/W | | Constant | Device | ~5 (cached) | 64 KB | Yes (broadcast) | R | | Texture | Device | ~400 (cached) | Global pool | Yes (spatial) | R |
Decision guide:
__shared__ memory__constant__ memory// Query device for optimal configuration
void launch_optimized(const float* input, float* output, int n) {
int block_size;
int min_grid_size;
// Let the runtime suggest optimal block size for maximum occupancy
cudaOccupancyMaxPotentialBlockSize(
&min_grid_size, &block_size, my_kernel, 0, n);
int grid_size = (n + block_size - 1) / block_size;
my_kernel<<<grid_size, block_size>>>(input, output, n);
CUDA_CHECK_KERNEL();
}
// BAD: Strided access -- adjacent threads access non-adjacent memory
// Each warp issues 32 separate memory transactions
__global__ void transpose_naive(const float* in, float* out, int W, int H) {
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x < W && y < H) {
out[x * H + y] = in[y * W + x]; // Write is strided
}
}
// GOOD: Use shared memory to coalesce both reads and writes
__global__ void transpose_coalesced(
const float* in, float* out, int W, int H
) {
__shared__ float tile[32][33]; // +1 padding avoids bank conflicts
int x = blockIdx.x * 32 + threadIdx.x;
int y = blockIdx.y * 32 + threadIdx.y;
if (x < W && y < H) {
tile[threadIdx.y][threadIdx.x] = in[y * W + x]; // Coalesced read
}
__syncthreads();
x = blockIdx.y * 32 + threadIdx.x;
y = blockIdx.x * 32 + threadIdx.y;
if (x < H && y < W) {
out[y * H + x] = tile[threadIdx.x][threadIdx.y]; // Coalesced write
}
}
// Dot product of two vectors using shared memory reduction
__global__ void dot_product(
const float* a, const float* b, float* result, int n
) {
__shared__ float cache[256];
int tid = threadIdx.x;
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Each thread computes its partial sum via grid-stride
float partial = 0.0f;
for (int i = idx; i < n; i += blockDim.x * gridDim.x) {
partial += a[i] * b[i];
}
cache[tid] = partial;
__syncthreads();
// Tree reduction in shared memory
for (int s = blockDim.x / 2; s > 0; s >>= 1) {
if (tid < s) {
cache[tid] += cache[tid + s];
}
__syncthreads();
}
if (tid == 0) {
atomicAdd(result, cache[0]);
}
}
// Warp-level reduction using shuffle instructions -- no shared memory needed
__device__ float warp_reduce_sum(float val) {
for (int offset = warpSize / 2; offset > 0; offset /= 2) {
val += __shfl_down_sync(0xFFFFFFFF, val, offset);
}
return val;
}
// Block-level reduction combining warp shuffles and shared memory
__device__ float block_reduce_sum(float val) {
__shared__ float warp_sums[32]; // One slot per warp (max 32 warps/block)
int lane = threadIdx.x % warpSize;
int warp_id = threadIdx.x / warpSize;
val = warp_reduce_sum(val);
if (lane == 0) {
warp_sums[warp_id] = val;
}
__syncthreads();
// First warp reduces the warp sums
int num_warps = (blockDim.x + warpSize - 1) / warpSize;
val = (threadIdx.x < num_warps) ? warp_sums[threadIdx.x] : 0.0f;
if (warp_id == 0) {
val = warp_reduce_sum(val);
}
return val;
}
// Query occupancy at compile time for tuning
void report_occupancy() {
int block_size = 256;
int num_blocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&num_blocks, my_kernel, block_size, 0);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int active_warps = num_blocks * (block_size / prop.warpSize);
int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
float occupancy = (float)active_warps / max_warps;
printf("Occupancy: %.1f%% (%d/%d warps)\n",
occupancy * 100, active_warps, max_warps);
}
# System-level trace: find CPU/GPU idle gaps, stream concurrency
nsys profile -o trace ./program
nsys stats trace.nsys-rep
# Kernel-level analysis: roofline, memory throughput, occupancy
ncu --set full -o kernel_report ./program
ncu -i kernel_report.ncu-rep # Open in Nsight Compute GUI
# Quick single-metric check
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed ./program
// Measure effective bandwidth of a kernel
void measure_bandwidth(int n) {
size_t bytes = 2 * n * sizeof(float); // Read A + Write B
cudaEvent_t start, stop;
CUDA_CHECK(cudaEventCreate(&start));
CUDA_CHECK(cudaEventCreate(&stop));
CUDA_CHECK(cudaEventRecord(start));
copy_kernel<<<grid, block>>>(d_in, d_out, n);
CUDA_CHECK(cudaEventRecord(stop));
CUDA_CHECK(cudaEventSynchronize(stop));
float ms = 0;
CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
float gb_per_sec = bytes / (ms * 1e6);
printf("Effective bandwidth: %.2f GB/s\n", gb_per_sec);
CUDA_CHECK(cudaEventDestroy(start));
CUDA_CHECK(cudaEventDestroy(stop));
}
# Compile CUDA code
nvcc -arch=sm_80 -O3 -o program main.cu # Single file
nvcc -arch=native -lineinfo -o program main.cu # With debug line info
# CMake build
cmake -B build -DCMAKE_CUDA_ARCHITECTURES="70;80;86"
cmake --build build -j$(nproc)
# Runtime debugging
compute-sanitizer ./program # Memory errors (replaces cuda-memcheck)
compute-sanitizer --tool racecheck ./program # Shared memory race conditions
compute-sanitizer --tool initcheck ./program # Uninitialized device memory reads
compute-sanitizer --tool synccheck ./program # Synchronization errors
# Profiling
nsys profile ./program # System-level timeline
ncu ./program # Kernel-level metrics
ncu --kernel-name my_kernel --launch-skip 2 --launch-count 1 ./program
# Device info
nvidia-smi # GPU status and memory usage
nvcc --version # CUDA compiler version
cmake_minimum_required(VERSION 3.18)
project(myproject LANGUAGES CXX CUDA)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_ARCHITECTURES 70 80 86)
set(CMAKE_CUDA_SEPARABLE_COMPILATION ON)
find_package(CUDAToolkit REQUIRED)
add_library(kernels src/kernels.cu)
target_include_directories(kernels PUBLIC include)
add_executable(main src/main.cpp)
target_link_libraries(main kernels CUDA::cudart)
enable_testing()
add_executable(tests tests/test_kernels.cu)
target_link_libraries(tests kernels CUDA::cudart)
add_test(NAME gpu_tests COMMAND tests)
For detailed patterns and examples, see:
development
Zig language guardrails, patterns, and best practices for AI-assisted development. Use when working with Zig files (.zig), build.zig, or when the user mentions Zig. Provides comptime patterns, allocator conventions, C interop guidelines, and testing standards specific to this project's coding standards.
tools
WordPress framework guardrails, patterns, and best practices for AI-assisted development. Use when working with WordPress projects, or when the user mentions WordPress. Provides theme development, plugin architecture, REST API, blocks, and security guidelines.
tools
Toolkit for interacting with and testing local web applications using Playwright. Supports verifying frontend functionality, debugging UI behavior, capturing browser screenshots, and viewing browser logs. Use when testing web apps, automating browser interactions, or debugging frontend issues.
tools
Suite of tools for creating elaborate, multi-component web applications using modern frontend technologies (React, Tailwind CSS, shadcn/ui). Use for complex projects requiring state management, routing, or shadcn/ui components - not for simple single-file HTML/JSX pages.