skills/pipeline-gpu-kernel/SKILL.md
Apply software pipelining (double-buffering) to a tiled GPU kernel to overlap global memory loads with Tensor Core computation. Covers prologue/loop/epilogue restructuring, LDG-register vs cp.async (LDGSTS) variant selection based on compute/load ratio, shared memory budget verification against architecture-specific occupancy cliffs, and SASS-level verification of load/compute overlap.
npx skillsauth add pjt222/agent-almanac pipeline-gpu-kernelInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Apply software pipelining (double-buffering) to a tiled GPU kernel so that global memory loads for tile N+1 overlap with Tensor Core computation on tile N. Transform a sequential load-sync-compute-sync K-loop into a prologue/loop/epilogue structure, choose between LDG-register and cp.async (LDGSTS) variants based on compute/load ratio, verify shared memory stays under the architecture occupancy cliff, and confirm load/compute overlap in the final SASS.
analyze-kernel-bottleneck identifies a memory-bound kernel with low compute/load ratio per tile.cu) with a tiled K-loop containing separate load and compute phasesanalyze-kernel-bottleneck; will be estimated if not provided)Confirm the kernel has a tiled K-loop with distinct load and compute phases separated by __syncthreads(). Calculate the doubled shared memory cost and verify it stays under the architecture occupancy cliff.
__syncthreads(), compute (HMMA/IMMA/FFMA) on the shared memory tiles, __syncthreads().smem_a_size = BM * BK * sizeof(T) and smem_b_size = BK * BN * sizeof(T).smem_doubled = smem_a_size * 2 + smem_b_size * 2.Single buffer: smem_a[BM*BK] + smem_b[BK*BN] = 2 KB + 2 KB = 4 KB
Double buffer: smem_a[2][BM*BK] + smem_b[2][BK*BN] = 4 KB + 4 KB = 8 KB
8 KB << 50 KB cliff -> 2 blocks/SM -> 8 warps
num_tiles = K / BK. Pipelining requires num_tiles >= 2 (at least one prologue + one main loop iteration).Expected: A shared memory budget table showing single-buffer and double-buffer costs, confirming the doubled allocation stays under the architecture cliff with at least 2 blocks/SM occupancy.
On failure: If double-buffer exceeds the cliff, reduce tile size (halve BK or BM) until smem_doubled <= 50 KB for GA104. Alternatively, use register-only prefetch (LDG variant) without doubling shared memory — store prefetched data in registers and write to the same single buffer after __syncthreads().
Select between LDG-register and cp.async (LDGSTS) based on the compute/load ratio per tile.
ratio = (2 * BM * BN * BK) / ((BM * BK + BK * BN) * sizeof(T)) for GEMM-like kernels (2 FLOPs per multiply-add, bytes loaded per tile).LDG-register variant (ratio >= 5 or CUDA < 11.0):
buf[N % 2] (overlaps with outstanding LDGs).__syncthreads(), then STS registers into buf[(N+1) % 2], __syncthreads().(BM * BK + BK * BN) / BLOCK_SIZE registers per thread for staging.cp.async (LDGSTS) variant (ratio < 5, CUDA >= 11.0):
__pipeline_memcpy_async tile N+1 directly to buf[(N+1) % 2] (async, bypasses register file).__pipeline_commit() before compute.buf[N % 2].__pipeline_wait_prior(0) + __syncthreads() after compute.#include <cuda_pipeline.h>.Expected: Selected variant with justification based on compute/load ratio and target architecture.
On failure: If the ratio is ambiguous (5-20:1 range), implement both variants and benchmark. The cp.async variant is the safer default when CUDA version supports it.
Transform the sequential load-sync-compute-sync loop into a pipelined prologue/loop/epilogue structure.
Identify the three sections: The original loop body becomes three pieces:
buf[0], synchronize, then enter the main loop.num_tiles - 1, overlap loading tile N+1 with computing tile N.LDG-register variant structure:
// === LDG-register variant ===
// Prologue: load tile 0 into buf[0]
cooperative_load_tile(smem_a[0], smem_b[0], global_a, global_b, /*k_offset=*/0);
__syncthreads();
for (int tile = 0; tile < num_tiles - 1; tile++) {
int cur_buf = tile & 1;
int next_buf = 1 - cur_buf;
// Phase 1: LDG next tile into registers (non-blocking)
float reg_a[ELEMS_PER_THREAD_A], reg_b[ELEMS_PER_THREAD_B];
prefetch_tile_to_registers(reg_a, reg_b, global_a, global_b,
(tile + 1) * BK);
// Phase 2: Compute on current buffer (overlaps with LDG flight)
tensor_core_mma(smem_a[cur_buf], smem_b[cur_buf], acc);
// Phase 3: Drain registers into next buffer
__syncthreads();
store_registers_to_smem(smem_a[next_buf], smem_b[next_buf],
reg_a, reg_b);
__syncthreads();
}
// Epilogue: compute last tile
tensor_core_mma(smem_a[(num_tiles - 1) & 1], smem_b[(num_tiles - 1) & 1], acc);
// === cp.async variant ===
#include <cuda_pipeline.h>
// Prologue: async load tile 0 into buf[0]
cpasync_load_tile(smem_a[0], smem_b[0], global_a, global_b, /*k_offset=*/0);
__pipeline_commit();
__pipeline_wait_prior(0);
__syncthreads();
for (int tile = 0; tile < num_tiles - 1; tile++) {
int cur_buf = tile & 1;
int next_buf = 1 - cur_buf;
// Phase 1: cp.async next tile into next buffer (async, direct to smem)
cpasync_load_tile(smem_a[next_buf], smem_b[next_buf],
global_a, global_b, (tile + 1) * BK);
__pipeline_commit();
// Phase 2: Compute on current buffer (overlaps with LDGSTS in flight)
tensor_core_mma(smem_a[cur_buf], smem_b[cur_buf], acc);
// Phase 3: Wait for async copies to complete
__pipeline_wait_prior(0);
__syncthreads();
}
// Epilogue: compute last tile
tensor_core_mma(smem_a[(num_tiles - 1) & 1], smem_b[(num_tiles - 1) & 1], acc);
num_tiles - 1 iterations (tiles 0 through num_tiles - 2 indexing which tiles to compute, loading tiles 1 through num_tiles - 1). The epilogue computes the tile loaded in the last iteration.Expected: Restructured K-loop source code with clear prologue, main loop, and epilogue sections for the chosen variant.
On failure: The most common bug is an off-by-one in buffer indexing or forgetting the epilogue compute pass. Verify: prologue loads into buf[0], first main loop iteration computes buf[0] and loads into buf[1], second iteration computes buf[1] and loads into buf[0], and so on. The epilogue computes buf[(num_tiles - 1) & 1].
Declare the double-buffered shared memory and implement the load functions.
// Before (single buffer)
__shared__ half smem_a[BM * BK];
__shared__ half smem_b[BK * BN];
// After (double buffer)
__shared__ half smem_a[2][BM * BK];
__shared__ half smem_b[2][BK * BN];
__device__ void cpasync_load_tile(half* dst_a, half* dst_b,
const half* src_a, const half* src_b,
int k_offset) {
// Each thread copies its portion (16 bytes = 8 half values per cp.async)
int tid = threadIdx.x;
int bytes_per_thread = 16; // cp.async.cg supports 4, 8, or 16 bytes
// A tile: BM * BK elements, distributed across BLOCK_SIZE threads
int elems_a = BM * BK / BLOCK_SIZE;
for (int i = 0; i < elems_a; i += 8) {
int idx = tid * elems_a + i;
__pipeline_memcpy_async(dst_a + idx,
src_a + k_offset * BM + idx,
bytes_per_thread);
}
// B tile: BK * BN elements, distributed similarly
int elems_b = BK * BN / BLOCK_SIZE;
for (int i = 0; i < elems_b; i += 8) {
int idx = tid * elems_b + i;
__pipeline_memcpy_async(dst_b + idx,
src_b + k_offset * BN + idx,
bytes_per_thread);
}
}
// Declare register staging (size = elements per thread)
half reg_a[BM * BK / BLOCK_SIZE];
half reg_b[BK * BN / BLOCK_SIZE];
// Prefetch: LDG from global to registers (non-blocking, issued early)
for (int i = 0; i < BM * BK / BLOCK_SIZE; i++) {
int idx = threadIdx.x * (BM * BK / BLOCK_SIZE) + i;
reg_a[i] = global_a[k_offset * BM + idx];
}
// ... similarly for reg_b
// Store: STS from registers to shared memory (after __syncthreads)
for (int i = 0; i < BM * BK / BLOCK_SIZE; i++) {
int idx = threadIdx.x * (BM * BK / BLOCK_SIZE) + i;
smem_a[next_buf][idx] = reg_a[i];
}
__launch_bounds__(BLOCK_SIZE) on the kernel to give the compiler accurate occupancy information.nvcc --cubin -arch=sm_86 -O2 -o kernel.sm_86.cubin kernel.cu.Expected: Compilable kernel with double-buffered shared memory and the chosen load mechanism. Successful cubin generation with no errors.
On failure: If compilation fails on pipeline API calls, ensure #include <cuda_pipeline.h> is present and CUDA toolkit is >= 11.0. If register spills occur (check nvcc --resource-usage), reduce the register staging array sizes by increasing BLOCK_SIZE or reducing BK.
Run the pipelined kernel against the CPU reference to confirm identical numerical output.
nvcc -arch=sm_86 -O2 -o bench bench.cu -lcuda -I../../phase2/common.abs=0.5, rel=0.1abs=1e-2, rel=1e-2abs=1e-3, rel=1e-3Expected: PASS at both small and target problem sizes with error bounds identical to the non-pipelined baseline.
On failure: Buffer indexing bug is the most likely cause. Verify: compute reads from buf[tile & 1] while loads write to buf[1 - (tile & 1)]. Check the epilogue processes buffer index (num_tiles - 1) & 1, not num_tiles & 1. For cp.async, verify __pipeline_wait_prior(0) completes before __syncthreads() — otherwise compute may read partially-written data.
Measure the pipelined kernel against the non-pipelined baseline at the target problem size.
speedup = pipelined_metric / baseline_metric.| Variant | GFLOPS | Speedup vs Baseline |
|------------------|--------|---------------------|
| Baseline | XXX | 1.00x |
| LDG-register | XXX | X.XXx |
| cp.async (LDGSTS)| XXX | X.XXx |
Expected: Performance comparison table showing improvement. The chosen variant should show measurable speedup consistent with the compute/load ratio prediction.
On failure: If performance regresses, check three things: (1) SASS for unexpected instruction overhead (extra BAR.SYNC, register spills). (2) Shared memory did not cross the occupancy cliff — verify with nvcc --resource-usage or cuobjdump -res-usage. (3) The problem size produces enough tiles (K / BK >= 4) for pipelining to amortize the prologue/epilogue overhead.
Inspect the compiled SASS to confirm that global loads and Tensor Core instructions overlap within the main loop body.
cuobjdump -sass kernel.sm_86.cubin | grep -E 'IMMA|HMMA|LDGSTS|LDG|BAR'.LDGSTS or LDG instructions appear before HMMA or IMMA instructions.BAR.SYNC between the load instructions and the compute instructions (they must be free to overlap in the warp scheduler).BAR.SYNC appears after the compute block, gating the next iteration's use of the loaded data.# Full SASS pipeline verification
cuobjdump -sass kernel.sm_86.cubin | grep -E 'IMMA|HMMA|LDGSTS|LDG|BAR'
# Count compute instructions per loop
cuobjdump -sass kernel.sm_86.cubin | grep -c 'HMMA\|IMMA'
# Check for register spills
nvcc --resource-usage --cubin -arch=sm_86 -O2 kernel.cu 2>&1 | grep -i spill
Expected: Annotated SASS excerpt showing the load-before-compute pattern with no intervening barriers. Zero register spills.
On failure: If the compiler reordered loads after compute (defeating the overlap), try: (1) #pragma unroll 1 on the main loop to prevent over-aggressive unrolling. (2) Separate load and compute into distinct inline functions to create a sequencing hint. (3) Use asm volatile("" ::: "memory") as a compiler fence between load and compute blocks (last resort — may inhibit other optimizations).
buf[tile & 1] pattern)buf[0]buf[(num_tiles - 1) & 1]BAR.SYNC between LDGSTS/LDG and IMMA/HMMA)nvcc --resource-usage)smem_doubled before implementing. A kernel using 28 KB single-buffered jumps to 56 KB doubled, crossing the cliff and halving occupancy. This can turn a +20% pipelining gain into a -50% occupancy regression.buf[tile & 1] for the current compute buffer and buf[1 - (tile & 1)] for the next load buffer. A common mistake is using buf[(tile + 1) & 1] for the next buffer, which is equivalent to buf[1 - (tile & 1)] only when the buffer count is 2 — but reads wrong if accidentally applied to the compute index.__pipeline_commit() must be called BEFORE the compute phase (it seals the batch of async copies). __pipeline_wait_prior(0) must be called AFTER the compute phase (it blocks until all committed copies complete). Swapping these makes the async copies synchronous, eliminating all overlap benefit.__syncthreads() is needed between compute and the STS drain (so compute finishes reading the current buffer before it gets overwritten). Another __syncthreads() is needed after the STS drain (so all threads finish writing before the next iteration reads). In the cp.async variant, __syncthreads() after __pipeline_wait_prior(0) ensures all threads see the completed async copies.__pipeline_memcpy_async requires the source address to be valid and aligned. At matrix edges where K is not a multiple of BK, the last tile may read out of bounds. Fall back to scalar loads with bounds checking for the final tile, or pad the input matrices to a multiple of BK.analyze-kernel-bottleneck — identify whether the kernel is memory-bound and calculate the compute/load ratio that drives variant selectiontesting
Launch all available agents in parallel waves for open-ended hypothesis generation on problems where the correct domain is unknown. Use when facing a cross-domain problem with no clear starting point, when single-agent approaches have stalled, or when diverse perspectives are more valuable than deep expertise. Produces a ranked hypothesis set with convergence analysis and adversarial refinement.
tools
Write integration tests for a Node.js CLI application using the built-in node:test module. Covers the exec helper pattern, output assertions, filesystem state verification, cleanup hooks, JSON output parsing, error case testing, and state restoration after destructive tests. Use when adding tests to an existing CLI, testing a new command, verifying adapter behavior across frameworks, or setting up CI for a CLI tool.
development
Screen a proposed trademark for conflicts and distinctiveness before filing. Covers trademark database searches (TMview, WIPO Global Brand Database, USPTO TESS), distinctiveness analysis using the Abercrombie spectrum, likelihood of confusion assessment using DuPont factors and EUIPO relative grounds, common law rights evaluation, and goods/services overlap analysis. Produces a conflict report with a risk matrix. Use before adopting a new brand name, logo, or slogan — distinct from patent prior art search, which uses different databases, legal frameworks, and analysis methods.
tools
Scaffold a new CLI command using Commander.js with options, action handler, three output modes (human-readable, quiet, JSON), and optional ceremony variant. Covers command naming, option design, shared context patterns, error handling, and integration testing. Use when adding a command to an existing Commander.js CLI, designing a new CLI tool from scratch, or standardizing command structure across a multi-command CLI.