skills/flashinfer-ai/debug-cuda-crash/SKILL.md
Tutorial for debugging CUDA crashes using API logging
npx skillsauth add aiskillstore/marketplace debug-cuda-crashInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
This tutorial shows you how to debug CUDA crashes and errors in FlashInfer using the @flashinfer_api logging decorator.
When your code crashes with CUDA errors (illegal memory access, out-of-bounds, NaN/Inf), use API logging to:
Problem: CUDA errors often crash the program, leaving no debugging information.
Solution: FlashInfer's @flashinfer_api decorator logs inputs BEFORE execution, so you can see what caused the crash even after the program terminates.
export FLASHINFER_LOGLEVEL=1 # Log function names
export FLASHINFER_LOGDEST=stdout # Log to console
python my_script.py
Output:
[2025-12-18 10:30:45] FlashInfer API Call: batch_decode_with_padded_kv_cache
export FLASHINFER_LOGLEVEL=3 # Log inputs/outputs with metadata
export FLASHINFER_LOGDEST=debug.log # Save to file
python my_script.py
Output in debug.log:
================================================================================
[2025-12-18 10:30:45] FlashInfer API Logging - System Information
================================================================================
FlashInfer version: 0.6.0
CUDA toolkit version: 12.1
GPU 0: NVIDIA H100 PCIe
Compute capability: 9.0 (SM90)
PyTorch version: 2.1.0
================================================================================
================================================================================
[2025-12-18 10:30:46] FlashInfer API Call: batch_decode_with_padded_kv_cache
--------------------------------------------------------------------------------
Positional input arguments:
arg[0]:
Tensor(
shape=(32, 8, 128)
dtype=torch.bfloat16
device=cuda:0
requires_grad=False
is_contiguous=True
)
Keyword input arguments:
kv_cache=
Tensor(
shape=(1024, 2, 8, 128)
dtype=torch.bfloat16
device=cuda:0
requires_grad=False
is_contiguous=True
)
export FLASHINFER_LOGLEVEL=5 # Log with min/max/mean/nan/inf
export FLASHINFER_LOGDEST=debug.log
python my_script.py
Additional output:
Tensor(
shape=(32, 8, 128)
dtype=torch.bfloat16
device=cuda:0
requires_grad=False
is_contiguous=True
min=-3.125000
max=4.250000
mean=0.015625
nan_count=0
inf_count=0
)
Your code crashes with:
RuntimeError: CUDA error: an illegal memory access was encountered
Enable logging and run again:
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=crash_log.txt
python my_script.py
The log shows inputs before the crash:
[2025-12-18 10:32:15] FlashInfer API Call: batch_decode_with_padded_kv_cache
Positional input arguments:
arg[0]:
Tensor(
shape=(32, 8, 128) # Query tensor
...
)
Keyword input arguments:
kv_cache=
Tensor(
shape=(1024, 2, 8, 64) # ❌ Wrong! Should be (..., 128) not (..., 64)
...
)
Found the bug: head_dim mismatch (64 vs 128)
Error Message:
RuntimeError: CUDA error: an illegal memory access was encountered
Enable logging:
export FLASHINFER_LOGLEVEL=3
python my_script.py
What to check in logs:
is_contiguous=True (if required)Common causes:
Error Message:
RuntimeError: Function ... returned nan or inf
Enable statistics logging:
export FLASHINFER_LOGLEVEL=5 # Level 5 shows nan_count, inf_count
python my_script.py
What to check in logs:
Tensor(
...
min=-1234567.000000 # ❌ Suspiciously large
max=9876543.000000 # ❌ Suspiciously large
mean=nan # ❌ NaN detected
nan_count=128 # ❌ 128 NaN values!
inf_count=0
)
Common causes:
Error Message:
RuntimeError: CUDA out of memory
Enable logging:
export FLASHINFER_LOGLEVEL=3
python my_script.py
What to check in logs:
Example:
Tensor(
shape=(1024, 8192, 128, 128) # ❌ Way too large! Should be (1024, 128, 128)?
...
)
Error Message:
RuntimeError: expected scalar type BFloat16 but found Float16
Enable logging:
export FLASHINFER_LOGLEVEL=3
python my_script.py
What to check in logs:
Tensor(
dtype=torch.float16 # ❌ Should be torch.bfloat16
...
)
When running with multiple GPUs/processes, use %i pattern:
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=debug_rank_%i.txt # %i = process ID
torchrun --nproc_per_node=4 my_script.py
This creates separate logs:
debug_rank_12345.txt (process 12345)debug_rank_12346.txt (process 12346)debug_rank_12347.txt (process 12347)debug_rank_12348.txt (process 12348)Now you can debug each rank independently.
For harder bugs, combine API logging with CUDA tools:
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=debug.log
compute-sanitizer --tool memcheck python my_script.py
Output shows exact memory errors:
========= COMPUTE-SANITIZER
========= Invalid __global__ write of size 4 bytes
========= at 0x1234 in ScaleKernel<float>
========= by thread (256,0,0) in block (10,0,0)
========= Address 0x7f1234567890 is out of bounds
Check debug.log to see what inputs caused this kernel to fail.
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=debug.log
cuda-gdb --args python my_script.py
In gdb:
(cuda-gdb) run
(cuda-gdb) where # Show stack trace when it crashes
Check debug.log for the inputs that led to the crash.
You can use printf() inside CUDA kernels for debugging:
__global__ void MyKernel(const float* input, float* output, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Print from one thread to avoid spam
if (threadIdx.x == 0 && blockIdx.x == 0) {
printf("n=%d, input[0]=%f\n", n, input[0]);
}
if (idx < n) {
output[idx] = input[idx] * 2.0f;
}
}
Important: Flush printf buffer after kernel:
my_kernel(input, output)
torch.cuda.synchronize() # ← Flushes printf output
Problem: threadIdx.x == 0 doesn't work for all warps (warp starting at thread 32 won't have thread 0).
Solution: Choose one representative thread per specialization group.
__global__ void WarpSpecializedKernel(...) {
// Define your group's representative thread
// e.g., first thread of each warp: threadIdx.x % 32 == 0
// e.g., first thread of each 4-warp group: threadIdx.x % 128 == 0
if (is_group_representative) {
printf("Group %d processing\n", group_id);
}
}
Common mistake ❌:
// ❌ Only warp 0 will print!
if (threadIdx.x == 0) {
printf("Warp %d processing\n", threadIdx.x / 32);
}
| Kernel Type | Print Condition | Notes |
|-------------|-----------------|-------|
| Simple kernel | threadIdx.x == 0 | One thread per block |
| Warp-specialized | One thread per group | Depends on kernel design |
// Assert for invariants
assert(value >= 0.0f && "Value must be non-negative");
// Compile-time checks
static_assert(BLOCK_SIZE % 32 == 0, "BLOCK_SIZE must be multiple of warp size");
| Variable | Values | Description |
|----------|--------|-------------|
| FLASHINFER_LOGLEVEL | 0 | No logging (default) |
| | 1 | Function names only |
| | 3 | Inputs/outputs with metadata |
| | 5 | + Tensor statistics (min/max/mean/nan/inf) |
| FLASHINFER_LOGDEST | stdout | Log to console (default) |
| | stderr | Log to stderr |
| | <path> | Log to file |
| | log_%i.txt | Multi-process: %i = process ID |
export FLASHINFER_LOGLEVEL=3
Level 3 provides tensor metadata (shape, dtype, device) without overwhelming output.
export FLASHINFER_LOGLEVEL=5
Only use level 5 when debugging NaN/Inf problems (adds statistics).
export FLASHINFER_LOGDEST=crash_log.txt
Console output may be lost when program crashes. File logs persist.
Enable logging and compare:
unset FLASHINFER_LOGLEVEL # or export FLASHINFER_LOGLEVEL=0
Logging has zero overhead when disabled (decorator returns original function).
Problem: Set FLASHINFER_LOGLEVEL=3 but no logs appear
Solutions:
Check if API has the decorator: Not all FlashInfer APIs have @flashinfer_api yet (work in progress)
Verify environment variable:
echo $FLASHINFER_LOGLEVEL # Should print "3"
Check log destination:
echo $FLASHINFER_LOGDEST # Should print path or "stdout"
Problem: Level 5 produces too much output
Solution: Use level 3 instead:
export FLASHINFER_LOGLEVEL=3 # Skip tensor statistics
Warning: [statistics skipped: CUDA graph capture in progress]
What it means: Level 5 statistics are automatically skipped during CUDA graph capture (to avoid synchronization)
This is normal: The framework protects you from graph capture issues.
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=stdout
python my_script.py
# Check tensor shapes in output
export FLASHINFER_LOGLEVEL=5 # Show statistics
export FLASHINFER_LOGDEST=debug.log
python my_script.py
# Check nan_count and inf_count in debug.log
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=rank_%i.log # Separate log per rank
torchrun --nproc_per_node=8 train.py
# Check rank_*.log files
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=inputs.log
compute-sanitizer --tool memcheck python my_script.py
# inputs.log shows what data caused the memory error
import torch
from flashinfer import batch_decode_with_padded_kv_cache
q = torch.randn(32, 8, 128, dtype=torch.bfloat16, device="cuda")
kv = torch.randn(1024, 2, 8, 64, dtype=torch.bfloat16, device="cuda") # Wrong dim!
output = batch_decode_with_padded_kv_cache(q, kv) # ❌ Crashes
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=debug.log
python test.py
[2025-12-18 10:45:23] FlashInfer API Call: batch_decode_with_padded_kv_cache
Positional input arguments:
arg[0]:
Tensor(
shape=(32, 8, 128)
dtype=torch.bfloat16
...
)
arg[1]:
Tensor(
shape=(1024, 2, 8, 64) # ❌ Found it! Last dim should be 128
dtype=torch.bfloat16
...
)
kv = torch.randn(1024, 2, 8, 128, dtype=torch.bfloat16, device="cuda") # ✅ Fixed
python test.py
# No crash, outputs logged successfully
Enable logging before the crash:
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=debug.log
Run your code - inputs are logged BEFORE crash
Check the log - last API call shows what caused the crash
Fix the issue based on logged input metadata
Disable logging when done:
export FLASHINFER_LOGLEVEL=0
flashinfer/api_logging.py for implementationdevelopment
Apple Human Interface Guidelines for content display components. Use this skill when the user asks about charts component, collection view, image view, web view, color well, image well, activity view, lockup, data visualization, content display, displaying images, rendering web content, color pickers, or presenting collections of items in Apple apps. Also use when the user says how should I display charts, what's the best way to show images, should I use a web view, how do I build a grid of items, what component shows media, or how do I present a share sheet. Cross-references: hig-foundations for color/typography/accessibility, hig-patterns for data visualization patterns, hig-components-layout for structural containers, hig-platforms for platform-specific component behavior.
tools
Automate HelpDesk tasks via Rube MCP (Composio): list tickets, manage views, use canned responses, and configure custom fields. Always search tools first for current schemas.
testing
Expert Haskell engineer specializing in advanced type systems, pure functional design, and high-reliability software. Use PROACTIVELY for type-level programming, concurrency, and architecture guidance.
tools
GraphQL gives clients exactly the data they need - no more, no less. One endpoint, typed schema, introspection. But the flexibility that makes it powerful also makes it dangerous. Without proper controls, clients can craft queries that bring down your server. This skill covers schema design, resolvers, DataLoader for N+1 prevention, federation for microservices, and client integration with Apollo/urql. Key insight: GraphQL is a contract. The schema is the API documentation. Design it carefully.