skills/flashinfer-ai/benchmark-kernel/SKILL.md
Guide for benchmarking FlashInfer kernels with CUPTI timing
This tutorial shows you how to accurately benchmark FlashInfer kernels.
Measure the performance of FlashInfer kernels:
FlashInfer supports two timing methods:
- CUPTI (Preferred): Hardware-level profiling for the most accurate GPU kernel time. Requires cupti-python >= 13.0.0 (CUDA 13+).
- CUDA Events (Fallback): Standard CUDA event timing.
The framework automatically uses CUPTI if available, otherwise falls back to CUDA events.
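The selection logic described above can be pictured roughly as follows. This is a minimal sketch, not FlashInfer's actual implementation: the helper name `pick_timing_backend` is hypothetical, and it assumes that the cupti-python package is importable as a module named `cupti`.

```python
import importlib.util


def pick_timing_backend(module_name: str = "cupti") -> str:
    """Return "cupti" if `module_name` is importable, else "cuda_events".

    Hypothetical helper for illustration only. The default module name is
    an assumption about how cupti-python is imported.
    """
    if importlib.util.find_spec(module_name) is not None:
        return "cupti"
    # Mirror the fallback message shown in this guide.
    print("CUPTI is not installed. Falling back to CUDA events.")
    return "cuda_events"


if __name__ == "__main__":
    print(pick_timing_backend())
```

Checking with `find_spec` avoids actually importing the profiler just to decide which timing path to take.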
For the most accurate benchmarking:
pip install -U cupti-python
Requirements: CUDA 13+ (CUPTI version 13+)
If you don't install CUPTI, the framework will print a warning and fall back to CUDA events:
CUPTI is not installed. Falling back to CUDA events.
Available routines:
- Attention: BatchDecodeWithPagedKVCacheWrapper, BatchPrefillWithPagedKVCacheWrapper, BatchPrefillWithRaggedKVCacheWrapper, BatchMLAPagedAttentionWrapper
- GEMM: bmm_fp8, gemm_fp8_nt_groupwise, group_gemm_fp8_nt_groupwise, mm_fp4
- MoE: trtllm_fp4_block_scale_moe, trtllm_fp8_block_scale_moe, trtllm_fp8_per_tensor_scale_moe, cutlass_fused_moe
Example - Benchmark decode attention:
# CUPTI will be used automatically if installed
python benchmarks/flashinfer_benchmark.py \
--routine BatchDecodeWithPagedKVCacheWrapper \
--backends fa2 fa2_tc cudnn \
--page_size 16 \
--batch_size 32 \
--s_qo 1 \
--s_kv 2048 \
--num_qo_heads 32 \
--num_kv_heads 8 \
--head_dim_qk 128 \
--head_dim_vo 128 \
--q_dtype bfloat16 \
--kv_dtype bfloat16 \
--num_iters 30 \
--dry_run_iters 5 \
--refcheck \
-vv
Example - Benchmark FP8 GEMM:
python benchmarks/flashinfer_benchmark.py \
--routine bmm_fp8 \
--backends cudnn cublas cutlass \
--batch_size 256 \
--m 1 \
--n 1024 \
--k 7168 \
--input_dtype fp8_e4m3 \
--mat2_dtype fp8_e4m3 \
--out_dtype bfloat16 \
--refcheck \
-vv \
--generate_repro_command
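Timing behavior: CUPTI is used automatically when it is available. To force CUDA events instead, pass the --use_cuda_events flag.
Example output:
[INFO] FlashInfer version: 0.6.0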
[VVERBOSE] gpu_name = 'NVIDIA_H100_PCIe'
[PERF] fa2 :: median time 0.145 ms; std 0.002 ms; achieved tflops 125.3 TFLOPs/sec; achieved tb_per_sec 1.87 TB/sec
[PERF] fa2_tc :: median time 0.138 ms; std 0.001 ms; achieved tflops 131.5 TFLOPs/sec; achieved tb_per_sec 1.96 TB/sec
[PERF] cudnn :: median time 0.142 ms; std 0.001 ms; achieved tflops 127.8 TFLOPs/sec; achieved tb_per_sec 1.91 TB/sec
Key metrics: median time (ms), std (ms), achieved tflops (TFLOPs/sec), and achieved tb_per_sec (TB/sec).
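If you want to compare backends programmatically, the [PERF] lines above can be parsed with a regular expression. The sketch below assumes the log format is exactly as in the sample output shown here; `parse_perf_line` and the dictionary keys are hypothetical names, not part of FlashInfer.

```python
import re

# Pattern derived from the sample [PERF] lines in this guide (an assumption
# about the log format, which may differ between FlashInfer versions).
PERF_RE = re.compile(
    r"\[PERF\]\s+(?P<backend>\S+)\s+::\s+"
    r"median time (?P<median_ms>[\d.]+) ms; "
    r"std (?P<std_ms>[\d.]+) ms; "
    r"achieved tflops (?P<tflops>[\d.]+) TFLOPs/sec; "
    r"achieved tb_per_sec (?P<tb_per_sec>[\d.]+) TB/sec"
)


def parse_perf_line(line):
    """Return a dict of metrics for one [PERF] line, or None if it does not match."""
    match = PERF_RE.search(line)
    if match is None:
        return None
    row = {"backend": match.group("backend")}
    for key in ("median_ms", "std_ms", "tflops", "tb_per_sec"):
        row[key] = float(match.group(key))
    return row


# Sample lines copied from the output above.
SAMPLE = [
    "[PERF] fa2 :: median time 0.145 ms; std 0.002 ms; achieved tflops 125.3 TFLOPs/sec; achieved tb_per_sec 1.87 TB/sec",
    "[PERF] fa2_tc :: median time 0.138 ms; std 0.001 ms; achieved tflops 131.5 TFLOPs/sec; achieved tb_per_sec 1.96 TB/sec",
    "[PERF] cudnn :: median time 0.142 ms; std 0.001 ms; achieved tflops 127.8 TFLOPs/sec; achieved tb_per_sec 1.91 TB/sec",
]

if __name__ == "__main__":
    rows = [parse_perf_line(line) for line in SAMPLE]
    fastest = min(rows, key=lambda r: r["median_ms"])
    print(fastest["backend"])
```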
Create a test list file my_benchmarks.txt:
--routine BatchDecodeWithPagedKVCacheWrapper --backends fa2 cudnn --page_size 16 --batch_size 32 --s_kv 2048 --num_qo_heads 32 --num_kv_heads 8 --head_dim_qk 128 --head_dim_vo 128
--routine BatchDecodeWithPagedKVCacheWrapper --backends fa2 cudnn --page_size 16 --batch_size 64 --s_kv 4096 --num_qo_heads 32 --num_kv_heads 8 --head_dim_qk 128 --head_dim_vo 128
--routine bmm_fp8 --backends cudnn cutlass --batch_size 256 --m 1 --n 1024 --k 7168 --input_dtype fp8_e4m3 --mat2_dtype fp8_e4m3 --out_dtype bfloat16
Run all tests:
python benchmarks/flashinfer_benchmark.py \
--testlist my_benchmarks.txt \
--output_path results.csv \
--generate_repro_command \
--refcheck
Results are saved to results.csv with all metrics and reproducer commands.
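For larger sweeps, the test list can be generated instead of written by hand. The following is a minimal sketch that only reuses flags already shown in the test list above; the function name `make_testlist_lines` and the fixed head configuration are illustrative choices, not part of the benchmark tool.

```python
from itertools import product


def make_testlist_lines(batch_sizes, kv_lens, backends=("fa2", "cudnn")):
    """Build one decode-attention test line per (batch_size, s_kv) pair."""
    lines = []
    for batch_size, s_kv in product(batch_sizes, kv_lens):
        lines.append(
            "--routine BatchDecodeWithPagedKVCacheWrapper "
            f"--backends {' '.join(backends)} "
            f"--page_size 16 --batch_size {batch_size} --s_kv {s_kv} "
            "--num_qo_heads 32 --num_kv_heads 8 "
            "--head_dim_qk 128 --head_dim_vo 128"
        )
    return lines


if __name__ == "__main__":
    # Print the lines; redirect to a file to use them with --testlist.
    for line in make_testlist_lines([32, 64], [2048, 4096]):
        print(line)
```

Redirecting the output to my_benchmarks.txt gives a file that can be passed to --testlist as shown above.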
| Flag | Description | Default |
|------|-------------|---------|
| --num_iters | Measurement iterations | 30 |
| --dry_run_iters | Warmup iterations | 5 |
| --refcheck | Verify output correctness | False |
| --allow_output_mismatch | Continue on mismatch | False |
| --use_cuda_events | Force CUDA events (skip CUPTI) | False |
| --no_cuda_graph | Disable CUDA graph | False |
| -vv | Very verbose output | - |
| --generate_repro_command | Print reproducer command | False |
| --case_tag | Tag for CSV output | None |
For custom benchmarking in your own code:
import torch
from flashinfer.testing import bench_gpu_time

# Setup your kernel
def my_kernel_wrapper(q, k, v):
    # Your kernel call here
    output = ...  # placeholder: replace with the result of your kernel
    return output

# Create test inputs
device = torch.device("cuda")
q = torch.randn(32, 8, 128, dtype=torch.bfloat16, device=device)
k = torch.randn(2048, 8, 128, dtype=torch.bfloat16, device=device)
v = torch.randn(2048, 8, 128, dtype=torch.bfloat16, device=device)

# Benchmark - CUPTI preferred, CUDA events if CUPTI unavailable
median_time, std_time = bench_gpu_time(
    my_kernel_wrapper,
    args=(q, k, v),
    enable_cupti=True,  # Prefer CUPTI, fallback to CUDA events
    num_iters=30,       # Number of measurement iterations
    dry_run_iters=5,    # Warmup iterations
)
print(f"Kernel time: {median_time:.3f} ms ± {std_time:.3f} ms")

# Calculate FLOPS if you know the operation count
flops = ...  # Your FLOP count
tflops = (flops / 1e12) / (median_time / 1000)  # median_time is in ms
print(f"Achieved: {tflops:.2f} TFLOPS/sec")
Note: If CUPTI is not installed, you'll see a warning and the function will automatically use CUDA events instead.
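As a worked example of the FLOP count left as a placeholder above, the sketch below counts operations for a batched matrix multiply using the common convention of 2 FLOPs per multiply-add. The helper names are hypothetical, and the benchmark script's own FLOP accounting may differ from this convention.

```python
def bmm_flops(batch: int, m: int, n: int, k: int) -> int:
    """FLOPs for a batched (m x k) @ (k x n) matmul, at 2 FLOPs per multiply-add."""
    return 2 * batch * m * n * k


def achieved_tflops(flops: float, time_ms: float) -> float:
    """Convert a FLOP count and a time in milliseconds to TFLOPs/sec."""
    return (flops / 1e12) / (time_ms / 1000)


if __name__ == "__main__":
    # Shape taken from the bmm_fp8 example in this guide.
    flops = bmm_flops(batch=256, m=1, n=1024, k=7168)
    print(flops)
```

Plugging the measured median time from bench_gpu_time into `achieved_tflops` gives the same quantity as the `tflops` expression above.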
python my_benchmark.py
Output with CUPTI:
Kernel time: 0.145 ms ± 0.002 ms
Achieved: 125.3 TFLOPS/sec
Output without CUPTI (automatic fallback):
[WARNING] CUPTI is not installed. Try 'pip install -U cupti-python'. Falling back to CUDA events.
Kernel time: 0.147 ms ± 0.003 ms
Achieved: 124.1 TFLOPS/sec
# Cold L2 cache benchmarking (optional)
median_time, std_time = bench_gpu_time(
    my_kernel,
    args=(x, y),
    enable_cupti=True,   # Will use CUDA events if CUPTI unavailable
    cold_l2_cache=True,  # Flush L2 or rotate buffers automatically
    num_iters=30,
)

# Force CUDA events (skip CUPTI even if installed)
median_time, std_time = bench_gpu_time(
    my_kernel,
    args=(x, y),
    enable_cupti=False,  # Explicitly use CUDA events
    num_iters=30,
)
Warning: CUPTI is not installed. Falling back to CUDA events.
What it means: CUPTI is not available, using CUDA events instead
Impact: Less accurate for very fast kernels (5-50 us) due to synchronization overhead, but becomes negligible for longer-running kernels
Solution (optional): Install CUPTI for best accuracy:
pip install -U cupti-python
If installation fails, check your CUDA version (CUDA 13+ is required) and your cupti-python version (>= 13.0.0).
You can still run benchmarks without CUPTI; the framework handles this automatically.
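To see why the fallback matters mostly for very fast kernels, consider a fixed per-measurement overhead relative to the kernel duration. The sketch below is plain arithmetic; the 2 us overhead is a hypothetical placeholder chosen for illustration, not a measured property of CUDA events.

```python
def relative_overhead(kernel_us: float, overhead_us: float) -> float:
    """Fraction of the kernel time that a fixed timing overhead represents."""
    return overhead_us / kernel_us


if __name__ == "__main__":
    ASSUMED_OVERHEAD_US = 2.0  # hypothetical value, for illustration only
    for kernel_us in (5.0, 50.0, 500.0, 5000.0):
        share = relative_overhead(kernel_us, ASSUMED_OVERHEAD_US)
        print(f"{kernel_us:>7.0f} us kernel: overhead is {share:.2%} of kernel time")
```

Under this assumption the same absolute overhead is a large share of a 5 us kernel and a very small share of a multi-millisecond one.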
Problem: Large standard deviation or varying results
Solutions:
Increase warmup iterations:
--dry_run_iters 10
Increase measurement iterations:
--num_iters 50
Use cold L2 cache (in Python):
bench_gpu_time(..., cold_l2_cache=True)
Disable GPU boost (advanced):
sudo nvidia-smi -lgc <base_clock>
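When judging whether results are stable, it helps to look at how the per-iteration timings are summarized. The sketch below uses made-up timings (not measurements) to show that the median is barely affected by a single slow iteration, while the mean and standard deviation are. The use of the population standard deviation is an assumption; this guide does not state which variant FlashInfer reports.

```python
import statistics


def summarize(times_ms):
    """Return (median, mean, population std) for a list of timings in ms."""
    return (
        statistics.median(times_ms),
        statistics.mean(times_ms),
        statistics.pstdev(times_ms),
    )


if __name__ == "__main__":
    # Illustrative timings with one outlier, e.g. from a clock change.
    times = [0.145, 0.146, 0.144, 0.145, 0.300]
    median, mean, std = summarize(times)
    print(f"median={median:.3f} ms mean={mean:.3f} ms std={std:.3f} ms")
```

If the standard deviation is large relative to the median, more warmup or more measurement iterations are usually the first things to try.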
Error: [ERROR] Output mismatch between backends
What it means: Different backends produce different results
Solutions:
Allow mismatch and continue:
--allow_output_mismatch
Check numerical tolerance: Some backends use different precisions (FP32 vs FP16)
Investigate the difference:
-vv # Very verbose mode shows tensor statistics
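To reason about whether a mismatch is a real bug or just a precision difference, you can count how many elements disagree under a chosen tolerance. The sketch below is a pure-Python illustration; `count_mismatches` is a hypothetical helper and the tolerance values are arbitrary examples, not the ones used by --refcheck.

```python
import math


def count_mismatches(reference, candidate, rel_tol=1e-2, abs_tol=1e-2):
    """Count element pairs that differ by more than the given tolerances."""
    return sum(
        not math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


if __name__ == "__main__":
    ref = [1.0, 2.0, 3.0]
    out = [1.001, 2.0, 3.5]
    print(count_mismatches(ref, out))
```

A handful of small deviations is consistent with different accumulation precisions, whereas widespread or large deviations point to a genuine error.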
Error: [WARNING] fa3 for routine ... is not supported on compute capability X.X
Solution: Check the backend support matrix in benchmarks/README.md or remove that backend from --backends list
Install CUPTI for best accuracy (but not required):
pip install -U cupti-python
Use reference checking to verify correctness:
--refcheck
Use verbose mode to see input shapes and dtypes:
-vv
Generate reproducer commands for sharing results:
--generate_repro_command
Run multiple iterations for statistical significance:
--num_iters 30 --dry_run_iters 5
Save results to CSV for later analysis:
--output_path results.csv
Compare multiple backends to find the best:
--backends fa2 fa3 cudnn cutlass
python benchmarks/flashinfer_benchmark.py \
--routine BatchDecodeWithPagedKVCacheWrapper \
--backends fa2 fa2_tc cudnn trtllm-gen \
--page_size 16 --batch_size 128 --s_kv 8192 \
--num_qo_heads 64 --num_kv_heads 8 \
--head_dim_qk 128 --head_dim_vo 128 \
--refcheck -vv --generate_repro_command
python benchmarks/flashinfer_benchmark.py \
--routine BatchPrefillWithRaggedKVCacheWrapper \
--backends fa2 fa3 cudnn cutlass \
--batch_size 16 --s_qo 1024 --s_kv 1024 \
--num_qo_heads 128 --num_kv_heads 128 \
--head_dim_qk 192 --head_dim_vo 128 \
--causal --random_actual_seq_len \
--q_dtype bfloat16 --kv_dtype bfloat16 \
--refcheck -vv
python benchmarks/flashinfer_benchmark.py \
--routine bmm_fp8 \
--backends cudnn cublas cutlass \
--batch_size 256 --m 1 --n 1024 --k 7168 \
--input_dtype fp8_e4m3 --mat2_dtype fp8_e4m3 \
--out_dtype bfloat16 \
--refcheck -vv
python benchmarks/flashinfer_benchmark.py \
--routine trtllm_fp8_block_scale_moe \
--backends trtllm \
--num_tokens 1024 --hidden_size 5120 \
--intermediate_size 13824 --num_experts 256 \
--top_k 8 --n_group 8 --topk_group 1 \
--routing_method deepseek_v3 \
--routed_scaling_factor 2.5 \
--use_routing_bias \
-vv
| Aspect | CUPTI (Preferred) | CUDA Events (Fallback) |
|--------|-------------------|------------------------|
| Accuracy | Highest (hardware-level) | Good (slight overhead) |
| Installation | pip install cupti-python | Built-in with CUDA |
| Requirements | CUDA 13+ | Any CUDA version |
| Fallback | N/A | Automatic if CUPTI unavailable |
| When to use | Always (if available) | When CUPTI can't be installed |
Recommendation: Install CUPTI for best results, but benchmarks work fine without it.
- nsys or ncu for detailed analysis
- FLASHINFER_LOGLEVEL=3
- benchmarks/README.md for full flag documentation
- benchmarks/samples/sample_testlist.txt for more examples