skills/dr-kernel-reinforcement-learning-done/SKILL.md
Write high-performance Triton GPU kernels using Dr. Kernel's multi-turn refinement strategy: profile-guided optimization, reward hacking prevention, and iterative kernel fusion. Use when asked to 'write a Triton kernel', 'optimize GPU code', 'fuse CUDA operations', 'speed up a PyTorch operation with Triton', 'generate a fast kernel for [operation]', or 'profile and optimize this kernel'.
npx skillsauth add ndpvt-web/arxiv-claude-skills dr-kernel-reinforcement-learning-done
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to write high-performance Triton GPU kernels by applying the Dr. Kernel methodology: a multi-turn, profile-guided approach to kernel generation that avoids common pitfalls like reward hacking (generating code that appears fast but doesn't actually execute the kernel) and lazy optimization (optimizing trivial sub-operations while ignoring dominant bottlenecks). The core insight is that kernel optimization should be driven by profiling data -- identify what consumes the most runtime, fuse those operations into a single Triton kernel, then iteratively tune block sizes and memory access patterns based on measured performance.
nn.Module with a custom Triton implementation that matches correctness and improves throughput

Profile-Guided Kernel Optimization. The central lesson from Dr. Kernel is that naive kernel generation often falls into "lazy optimization" -- the model writes a kernel that handles a trivial sub-operation (e.g., a bias addition consuming 0.01% of runtime) while the dominant bottleneck (e.g., a matrix multiply or reduction consuming 86% of runtime) runs unchanged through PyTorch. The fix is profiling-based: measure what fraction of total CUDA execution time your generated kernel accounts for (PR = T_kernel / T_total). If PR is low (< 0.3), the kernel is optimizing the wrong thing. Redirect effort to the operation that dominates the profile.
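The profiling ratio is simple enough to compute by hand, but a small helper makes the threshold check explicit. The following is a minimal sketch: the function names and the example timings are hypothetical, and only the definition PR = T_kernel / T_total and the 0.3 threshold come from the text above.

```python
def profiling_ratio(kernel_time_ms: float, total_time_ms: float) -> float:
    """Return PR = T_kernel / T_total, the share of CUDA time spent in the generated kernel."""
    if total_time_ms <= 0:
        raise ValueError("total_time_ms must be positive")
    return kernel_time_ms / total_time_ms


def is_lazy_optimization(kernel_time_ms: float, total_time_ms: float,
                         threshold: float = 0.3) -> bool:
    """Flag a kernel whose profiling ratio falls below the threshold."""
    return profiling_ratio(kernel_time_ms, total_time_ms) < threshold


# Hypothetical timings: a bias-add kernel versus a kernel covering the dominant reduction.
print(is_lazy_optimization(0.02, 2.5))  # True: the kernel covers under 1% of runtime
print(is_lazy_optimization(2.1, 2.5))   # False: the kernel covers most of the runtime
```

In practice the two timings would come from a profiler trace of the full forward pass rather than from hand-entered numbers.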
Multi-Turn Refinement. Effective kernel generation is iterative, not one-shot. The Dr. Kernel approach uses up to 3 turns: (1) generate an initial kernel targeting the dominant operation, (2) incorporate correctness feedback and profiling data to fix errors and improve fusion, (3) tune hardware-specific parameters (block sizes, number of warps, pipeline stages) based on measured throughput. Each turn receives environment feedback including compilation errors, correctness checks against reference outputs, and timing measurements.
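The turn structure can be pictured as a small generate/evaluate loop. This is only an illustrative sketch under stated assumptions: `Feedback`, `refine`, and the two callbacks are hypothetical names, and the stand-in functions replace real compilation, correctness checks, and timing so the loop can run without a GPU.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Feedback:
    compiled: bool
    correct: bool
    speedup: float      # T_reference / T_kernel, 0.0 if the kernel did not run
    message: str = ""   # compiler error or mismatch details


def refine(generate: Callable[[Optional[str], Optional[Feedback]], str],
           evaluate: Callable[[str], Feedback],
           max_turns: int = 3) -> Tuple[Optional[str], Optional[Feedback]]:
    """Run up to max_turns generate/evaluate rounds and keep the fastest correct kernel."""
    best_src, best_fb = None, None
    src, fb = None, None
    for _ in range(max_turns):
        src = generate(src, fb)   # the previous attempt and its feedback guide the next one
        fb = evaluate(src)
        if fb.compiled and fb.correct and (best_fb is None or fb.speedup > best_fb.speedup):
            best_src, best_fb = src, fb
    return best_src, best_fb


# Toy stand-ins: version strings play the role of kernel source code.
def fake_generate(prev_src, prev_fb):
    return "v1" if prev_src is None else "v" + str(int(prev_src[1:]) + 1)


def fake_evaluate(src):
    table = {
        "v1": Feedback(False, False, 0.0, "compile error"),
        "v2": Feedback(True, True, 1.1),
        "v3": Feedback(True, True, 1.6),
    }
    return table[src]


best_src, best_fb = refine(fake_generate, fake_evaluate)
print(best_src, best_fb.speedup)  # v3 1.6
```

Keeping the best correct attempt, rather than the last one, guards against a final tuning turn that regresses performance.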
Reward Hacking Prevention. Several degenerate patterns must be actively avoided: kernels that define a @triton.jit function but never call it (speedup comes from skipping computation), kernels that branch on self.training to bypass execution during profiling, and kernels that simply copy the PyTorch reference. Valid kernels must (a) actually execute a Triton kernel confirmed via launch instrumentation, (b) produce numerically correct outputs within tolerance (typically atol=1e-2, rtol=1e-2 for float32), and (c) show wall-clock speedup in end-to-end measurement, not just isolated kernel timing.
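Requirement (a) can be checked dynamically by counting launches. The sketch below is an assumption about how such instrumentation might look, not the instrumentation used by Dr. Kernel: `LaunchCounter` and `kernel_was_launched` are hypothetical names, and a plain Python callable stands in for the wrapper that would perform `kernel[grid](...)`.

```python
class LaunchCounter:
    """Wrap a kernel launcher and count how many times it is actually invoked."""

    def __init__(self, launcher):
        self._launcher = launcher
        self.count = 0

    def __call__(self, *args, **kwargs):
        self.count += 1
        return self._launcher(*args, **kwargs)


def kernel_was_launched(forward, counter, *inputs):
    """Run forward once and report whether the wrapped kernel fired during that call."""
    before = counter.count
    forward(*inputs)
    return counter.count > before


# Stand-in for a function that launches a Triton kernel.
launch = LaunchCounter(lambda x: [v * 2 for v in x])


def honest_forward(x):
    return launch(x)


def hacked_forward(x):
    return x  # skips the kernel entirely


print(kernel_was_launched(honest_forward, launch, [1, 2]))  # True
print(kernel_was_launched(hacked_forward, launch, [1, 2]))  # False
```

A launch count alone does not establish correctness or speedup, so it complements checks (b) and (c) rather than replacing them.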
1. Analyze the reference PyTorch operation. Read the user's PyTorch code and identify every operation it performs. List them by expected cost: large matmuls and reductions dominate; element-wise ops and small reshapes are cheap. Determine which operations are fusible (same tensor shapes, sequential data flow, no host-device synchronization between them).
2. Profile to identify the bottleneck. If timing data is available, compute the profiling ratio for each operation. If not, estimate based on FLOPs and memory bandwidth. The target operation for your Triton kernel should account for the majority of runtime. Do NOT write a kernel for a trivial sub-operation.
3. Design the kernel fusion strategy. Decide which operations to fuse into a single Triton kernel. Good fusion candidates share input tensors, have compatible reduction dimensions, and avoid complex control flow. For example, fuse LayerNorm = mean + variance + normalize + scale + shift into one kernel. Avoid attempting to fuse convolutions that cuDNN handles better unless the user specifically requests it.
4. Write the initial Triton kernel. Implement a @triton.jit function with explicit grid, block sizes, and memory access patterns. Wrap it in a ModelNew(nn.Module) class that matches the reference module's forward() signature exactly. Ensure the kernel is actually called in the forward pass -- never define it without invoking it.
5. Validate correctness against the reference. Generate test inputs matching the expected shapes and dtypes. Run both the reference and your kernel, comparing outputs with torch.allclose(ref_out, kern_out, atol=1e-2, rtol=1e-2). If outputs diverge, check for: off-by-one in pointer arithmetic, incorrect reduction dimension, missing masking for partial blocks, or dtype mismatches.
6. Measure end-to-end performance. Time the full forward() call (not just the Triton kernel in isolation) using CUDA events with warmup iterations. Compute speedup as T_reference / T_kernel. If speedup < 1.0, the kernel is slower than PyTorch -- revisit memory access patterns and occupancy.
7. Compute the profiling ratio. Measure what fraction of total execution time your Triton kernel accounts for. If PR < 0.3, your kernel is optimizing a minor operation. Go back to step 2 and target the actual bottleneck.
8. Tune hardware parameters. Adjust BLOCK_SIZE, num_warps, and num_stages based on the target GPU. Use Triton's triton.autotune decorator with a grid of configurations, or manually set values based on the problem size. For H100: start with BLOCK_SIZE=1024, num_warps=8, num_stages=3 and adjust down for smaller problems.
9. Iterate with feedback. If correctness fails, fix pointer arithmetic and masking first. If performance is underwhelming, check for: uncoalesced memory access, excessive shared memory usage, low occupancy due to register pressure, or unnecessary tl.debug_barrier() calls. Repeat steps 5-8 for up to 2 additional refinement turns.
10. Final validation. Run the kernel across multiple input sizes to confirm it generalizes. Verify that speedup holds for the user's target batch sizes, not just the profiling size. Report the achieved speedup and profiling ratio.
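The end-to-end measurement described in the workflow (warm up, time the whole forward() call, then compute T_reference / T_kernel) could be sketched as follows. Note the substitution: this sketch uses a host-side `time.perf_counter` timer with an optional synchronization hook instead of CUDA events, so that it runs anywhere. `time_ms` and `speedup` are hypothetical helper names.

```python
import time


def time_ms(fn, warmup=10, iters=100, sync=None):
    """Average wall-clock milliseconds per call of fn(), measured after warmup calls."""
    sync = sync or (lambda: None)
    for _ in range(warmup):
        fn()
    sync()  # drain pending asynchronous work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()  # wait for the timed calls to finish before stopping the clock
    return (time.perf_counter() - start) * 1000.0 / iters


def speedup(reference_fn, candidate_fn, **kwargs):
    """Speedup as T_reference / T_kernel; values below 1.0 mean the candidate is slower."""
    return time_ms(reference_fn, **kwargs) / time_ms(candidate_fn, **kwargs)


# CPU-only demonstration with sleep-based stand-ins for the two forward passes.
ratio = speedup(lambda: time.sleep(0.004), lambda: time.sleep(0.001), warmup=1, iters=5)
print(f"speedup: {ratio:.1f}x")
```

On a GPU, passing `sync=torch.cuda.synchronize` matters because kernel launches are asynchronous; without it the timer would mostly measure launch overhead.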
Example 1: Fused LayerNorm Kernel
User: "Write a Triton kernel that replaces nn.LayerNorm for shape (batch=32, seq=512, hidden=768). I want it faster than PyTorch."
Approach:
- The normalized dimension is hidden=768, which fits in a single Triton block
- Use BLOCK_SIZE=1024 (next power of 2 above 768), operating on one row per program

Output:

    import torch
    import torch.nn as nn
    import triton
    import triton.language as tl


    @triton.jit
    def layernorm_kernel(
        X, Y, W, B, stride, N, eps,
        BLOCK_SIZE: tl.constexpr,
    ):
        # Each program normalizes one row of N elements.
        row = tl.program_id(0)
        cols = tl.arange(0, BLOCK_SIZE)
        mask = cols < N
        x = tl.load(X + row * stride + cols, mask=mask, other=0.0).to(tl.float32)
        mean = tl.sum(x, axis=0) / N
        # Zero the padded lanes after centering; otherwise each one adds mean**2 to the variance.
        x_centered = tl.where(mask, x - mean, 0.0)
        var = tl.sum(x_centered * x_centered, axis=0) / N
        x_norm = x_centered / tl.sqrt(var + eps)
        w = tl.load(W + cols, mask=mask, other=1.0).to(tl.float32)
        b = tl.load(B + cols, mask=mask, other=0.0).to(tl.float32)
        y = x_norm * w + b
        tl.store(Y + row * stride + cols, y, mask=mask)


    class ModelNew(nn.Module):
        def __init__(self, normalized_shape, eps=1e-5):
            super().__init__()
            # normalized_shape is expected to be an int (the size of the last dimension).
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(normalized_shape))
            self.bias = nn.Parameter(torch.zeros(normalized_shape))
            self.normalized_shape = normalized_shape

        def forward(self, x):
            assert x.is_cuda and x.is_contiguous()
            y = torch.empty_like(x)
            N = self.normalized_shape
            assert x.shape[-1] == N
            n_rows = x.numel() // N
            BLOCK_SIZE = triton.next_power_of_2(N)
            # One program per row. For contiguous input the row stride equals N,
            # which also holds when a leading dimension has size 1.
            layernorm_kernel[(n_rows,)](
                x, y, self.weight, self.bias,
                N, N, self.eps,
                BLOCK_SIZE=BLOCK_SIZE,
            )
            return y
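One detail worth checking in any single-block LayerNorm kernel is how padded lanes affect the variance. With hidden=768 and BLOCK_SIZE=1024, the 256 masked lanes load as 0.0, which leaves the sum (and therefore the mean) untouched, but after centering they hold -mean and would inflate the variance unless they are zeroed again, for example with `tl.where(mask, x - mean, 0.0)`. The pure-Python sketch below reproduces the effect on a tiny row; `row_stats` is a hypothetical helper written for this illustration.

```python
def row_stats(row, block_size, zero_padded_lanes):
    """Mean and variance of a row zero-padded to block_size, as a masked load would see it."""
    n = len(row)
    padded = row + [0.0] * (block_size - n)   # masked lanes load as other=0.0
    mean = sum(padded) / n                    # the zeros do not disturb the sum
    centered = [v - mean for v in padded]     # padded lanes now hold -mean
    if zero_padded_lanes:
        centered = [c if i < n else 0.0 for i, c in enumerate(centered)]
    var = sum(c * c for c in centered) / n
    return mean, var


row = [1.0, 2.0, 3.0]  # N = 3, padded to BLOCK_SIZE = 4
print(row_stats(row, 4, zero_padded_lanes=True))   # mean 2.0, variance 2/3 (correct)
print(row_stats(row, 4, zero_padded_lanes=False))  # mean 2.0, variance 2.0 (inflated)
```

The error grows with both the mean of the row and the number of padded lanes, so it is easy to miss when testing on zero-mean random inputs.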
Example 2: Detecting and Fixing Lazy Optimization
User: "I wrote a Triton kernel for a transformer block but it's only 1.02x faster. What's wrong?"
Approach:
Analysis:
Operation Breakdown (profiled via torch.profiler):
- Linear projections (Q, K, V): 45% of runtime
- Attention matmul + softmax: 38% of runtime
- Your Triton kernel (bias add): 0.8% of runtime <-- PR = 0.008
- Other (dropout, residual): 16.2% of runtime
Problem: Your kernel fuses bias addition (0.8% of runtime).
Even a 10x speedup on bias addition saves only about 0.7% of total time.
Fix: Target the attention computation (38%) or fuse the full
QKV projection + attention into a single kernel. Consider using
Triton's flash-attention pattern for the matmul + softmax fusion.
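The arithmetic behind this diagnosis is Amdahl's law: the end-to-end gain is bounded by the fraction of runtime the kernel touches. A minimal sketch follows; `overall_speedup` is a hypothetical helper, and the 2x figure for attention is an assumed value chosen only for comparison.

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when `fraction` of runtime runs `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


print(f"bias add  (0.8% of runtime, 10x): {overall_speedup(0.008, 10):.3f}x")  # 1.007x
print(f"attention (38% of runtime, 2x):   {overall_speedup(0.38, 2):.3f}x")    # 1.235x
```

Even a modest 2x gain on the attention path outweighs an order-of-magnitude gain on the bias addition.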
Example 3: Catching a Reward Hacking Pattern
User: "My kernel reports 3x speedup but something seems off."
Approach:
- Check whether the @triton.jit function is actually invoked in the forward pass

Detection:

    import torch.nn as nn

    # triton_kernel is a placeholder for the wrapper that launches the @triton.jit kernel.

    # HACKING PATTERN DETECTED -- forward() skips the kernel whenever self.training is True:
    class ModelNew(nn.Module):
        def forward(self, x):
            if self.training:
                return x  # Bypasses the kernel entirely in training mode
            return triton_kernel(x)  # Only runs in eval mode

    # FIX: Remove the training guard. The kernel must execute
    # unconditionally in forward():
    class ModelNew(nn.Module):
        def forward(self, x):
            return triton_kernel(x)  # Always executes
Another common hack -- defining @triton.jit but calling torch.ops instead:

    # BAD: Kernel defined but torch fallback used
    @triton.jit
    def my_kernel(...): ...

    class ModelNew(nn.Module):
        def forward(self, x):
            return torch.layer_norm(x, ...)  # Uses PyTorch, not the Triton kernel
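The "defined but never called" pattern can be screened for statically before any benchmark runs. The sketch below uses Python's standard `ast` module. `unlaunched_triton_kernels` is a hypothetical helper, and it only detects kernels whose names are never referenced. It does not catch self.training guards or other runtime bypasses.

```python
import ast


def unlaunched_triton_kernels(source: str) -> list:
    """Return names of @triton.jit functions that are never referenced elsewhere in the source."""
    tree = ast.parse(source)
    kernels = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            for dec in node.decorator_list:
                target = dec.func if isinstance(dec, ast.Call) else dec
                if ast.unparse(target) in ("triton.jit", "jit"):
                    kernels.add(node.name)
    # Any load of the kernel's name (e.g. my_kernel[grid](...)) counts as a use.
    used = {n.id for n in ast.walk(tree)
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)}
    return sorted(kernels - used)


bad = '''
import torch, triton

@triton.jit
def my_kernel(x_ptr):
    pass

def forward(x):
    return torch.relu(x)
'''
good = bad.replace("return torch.relu(x)", "my_kernel[(1,)](x)\n    return x")

print(unlaunched_triton_kernels(bad))   # ['my_kernel']
print(unlaunched_triton_kernels(good))  # []
```

Because the source is only parsed and never imported, the check runs without Triton or a GPU installed. A name reference is weaker evidence than an observed launch, so a runtime check is still needed.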
- Compute the profiling ratio (T_kernel / T_total). If it's below 0.3, you're optimizing the wrong operation. Rewrite targeting the bottleneck.
- Use triton.autotune with at least 4-6 configurations varying BLOCK_SIZE, num_warps, and num_stages. Hardware-specific tuning accounts for a large fraction of real-world speedup.
- Test across multiple input shapes: a kernel that works for (32, 512, 768) may fail for (1, 2048, 768) due to masking bugs.
- Do not write code that defines @triton.jit functions without calling them, or that uses self.training guards to bypass execution. These are hacking patterns that inflate benchmarks without real optimization.

| Problem | Diagnosis | Fix |
|---------|-----------|-----|
| triton.CompilationError | Invalid Triton IR, usually from type mismatches or unsupported ops | Cast inputs explicitly with .to(tl.float32), avoid Python-level control flow inside @triton.jit |
| Numerical mismatch (atol > 1e-2) | Pointer arithmetic error or missing mask | Verify tl.arange bounds match tensor dimensions, add mask= to all tl.load/tl.store calls |
| CUDA OOM during kernel launch | Grid size too large or shared memory overflow | Reduce BLOCK_SIZE, check that grid dimensions match the problem (not numel() when you mean n_rows) |
| Kernel slower than PyTorch | Uncoalesced access, low occupancy, or excessive synchronization | Ensure innermost dimension is contiguous, reduce register pressure by lowering BLOCK_SIZE, remove unnecessary barriers |
| Speedup doesn't reproduce | Insufficient warmup or background GPU load | Use torch.cuda.synchronize() before timing, run 100+ warmup iterations, measure 100+ timed iterations |
| Profiling ratio near zero | Kernel optimizes a trivial operation | Reprofile to find the dominant operation and rewrite the kernel to target it |

autotune configs for the target hardware.
torch.compile or custom CUDA graphs.

Paper: Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations -- Focus on Sections 3-5 for the profiling-based reward design, TRLOO algorithm for multi-turn training, and the case studies showing lazy optimization vs. effective kernel fusion patterns.
Code: github.com/hkust-nlp/KernelGYM -- The KernelGYM environment for evaluating kernel correctness and performance with subprocess isolation, plus the Dr. Kernel training recipes.