kernel-builder/skills/xpu-kernels/SKILL.md
Provides guidance for writing, optimizing, and benchmarking Triton kernels for Intel XPU GPUs (Battlemage/Arc Pro B50) using the Xe-Forge optimization framework. Includes an LLM-driven trial-loop workflow (analyze, validate, benchmark, profile, finalize), XPU-specific patterns (tensor descriptors, GRF mode, tile swizzling), KernelBench fused kernels, and Flash Attention.
npx skillsauth add huggingface/kernels xpu-kernelsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
This skill provides patterns and guidance for developing optimized Triton kernels targeting Intel XPU GPUs (Battlemage/Arc Pro B50). It integrates the Xe-Forge optimization framework — an LLM-driven loop that transforms PyTorch code into fast Triton kernels.
The full optimization workflow analyzes a PyTorch baseline, generates Triton kernel variants in a branching trial tree, benchmarks each on XPU hardware, and finalizes the best result.
# 1. Analyze the baseline
python scripts/analyze_kernel.py test_kernels/70_Gemm_Sigmoid_Scaling_ResidualAdd_pytorch.py
# 2. Initialize trial tracking
python scripts/trial_manager.py init 70_Gemm_Sigmoid test_kernels/70_Gemm_Sigmoid_Scaling_ResidualAdd_pytorch.py
# 3. Validate a generated kernel (no GPU needed)
python scripts/validate_triton.py my_kernel.py
# 4. Benchmark correctness + performance
python scripts/benchmark.py test_kernels/70_Gemm_Sigmoid_Scaling_ResidualAdd_pytorch.py my_kernel.py
# 5. Profile with VTune (optional)
python scripts/xpu_profiler.py my_kernel.py
# 6. Finalize best trial
python scripts/trial_manager.py finalize 70_Gemm_Sigmoid optimized_triton.py
| GPU | Architecture | XVEs | Mem BW | Key Feature | Verified | |-----|-------------|------|--------|-------------|:--------:| | Battlemage G21 / Arc Pro B50 | Xe2 | 128 | ~500 GB/s | Tensor descriptors, GRF 256 | Yes |
See the Intel XPU Backend for Triton for supported hardware.
Use this skill when:
Transform PyTorch code into optimized Triton kernels for Intel XPU. Kernels must be numerically equivalent and faster than baseline.
config.yaml firstAt the start of every session, read scripts/config.yaml. It controls:
max_trials — hard cap on optimization trials; always run all of them (use this instead of hardcoded "10")vtune_enabled — if false, skip ALL VTune profiling steps (Step 3.6 and profiler-related decisions)vtune_bin — path to the VTune binary (also settable via VTUNE_BIN env var)test_kernels/*_triton.py or trial files t<trial_id>.py).scripts/benchmark.py.max_trials trials from config.yaml. Do NOT stop early due to plateau — LLM sampling can discover new ideas at any point. The only valid early stop is speedup > 5x.CRITICAL — Single-XPU serialization: There is only ONE XPU on this machine. You MUST NOT run multiple GPU workloads in parallel. benchmark.py and xpu_profiler.py must execute strictly one at a time — concurrent GPU jobs produce wrong results. CPU-only tools (analyze_kernel.py, validate_triton.py, trial_manager.py) are safe to parallelize with each other and with anything else.
| Tool | Command | Purpose |
|------|---------|---------|
| Analyze | python scripts/analyze_kernel.py <file> | Static analysis: operations, shapes, fusion opportunities |
| Validate | python scripts/validate_triton.py <file> | Syntax + constraint checks before GPU time |
| Benchmark | python scripts/benchmark.py <baseline> <triton> [--triton-baseline] [--baseline-us <cached>] | Correctness + performance via ai-bench |
| Profile | python scripts/xpu_profiler.py <file> | VTune GPU hardware counters + recommendations |
| Init trials | python scripts/trial_manager.py init <kernel_name> <baseline_file> [--triton-baseline] | Initialize trial tracking |
| Save trial | python scripts/trial_manager.py save <kernel_name> <file> [--parent <parent_id>] [--strategy "..."] | Save trial to tree |
| Record result | python scripts/trial_manager.py result <kernel_name> <trial_id> --validation pass --correctness <pass\|fail> --speedup <float> --baseline_us <float> --triton_us <float> | Record benchmark result |
| Check status | python scripts/trial_manager.py status <kernel_name> | View trial tree |
| Best trial | python scripts/trial_manager.py best <kernel_name> | Get best trial |
| Baseline time | python scripts/trial_manager.py baseline-us <kernel_name> | Cached baseline time for --baseline-us |
| Finalize | python scripts/trial_manager.py finalize <kernel_name> <name>_triton.py | Copy best trial to output |
python scripts/analyze_kernel.py <pytorch_file>.--triton-baseline): skip analyze_kernel.py (it only supports PyTorch). Read the Triton file directly.references/correctness.yaml and references/xpu_optimizations.yaml.references/implementation_reference.md for templates and the Model class pattern.python scripts/trial_manager.py init <kernel_name> <baseline_file> [--triton-baseline]
max_trials from config.yaml)For each trial:
references/implementation_reference.md.python scripts/validate_triton.py <triton_file> (fix until passing; doesn't count as a trial).python scripts/trial_manager.py save <kernel_name> <triton_file> --parent <parent_id> --strategy "description". Omit --parent for the first trial (t0).python scripts/benchmark.py <baseline_file> <triton_file> [--triton-baseline] (measures both baseline and triton).python scripts/trial_manager.py baseline-us <kernel_name>, then run python scripts/benchmark.py <baseline_file> <triton_file> [--triton-baseline] --baseline-us <cached_value> (skips baseline perf, saves time).finalize: Re-run benchmark.py without --baseline-us for final accurate comparison.python scripts/trial_manager.py result <kernel_name> <trial_id> --validation pass --correctness <pass|fail> --speedup <float> --baseline_us <float> --triton_us <float> (runtimes from benchmark output).vtune_enabled is true in config.yaml) — Run python scripts/xpu_profiler.py <triton_file> after your first benchmarked trial. Use its output to guide subsequent trial strategies. Run again if speedup plateaus after 2+ additional trials. Skip this step entirely if vtune_enabled is false.references/xpu_optimizations.yamlreferences/optimization_levels.yamlreferences/optimization_strategies.md for the full "try harder" decision treepython scripts/trial_manager.py finalize <kernel_name> <name>_triton.py
| Doc | Contents |
|-----|----------|
| references/implementation_reference.md | Code templates, Model class pattern, GEMM example |
| references/optimization_strategies.md | Strategy reference, optimization levels, checklist |
| references/workflow_details.md | Detailed workflow, decision tree, benchmarking/validation details |
| references/correctness.yaml | Critical constraints to avoid bugs |
| references/xpu_optimizations.yaml | XPU-specific patterns (tensor descriptors, GRF, swizzling) |
| references/fusion_patterns.yaml | When to fuse vs split operations |
| references/optimization_levels.yaml | Progressive optimization with "try harder" decision tree |
The test_kernels/*.py Triton files (non-pytorch) are unoptimized baselines. They use manual pointer arithmetic, lack autotune, and miss XPU optimizations. Do NOT copy their patterns. Use references/implementation_reference.md instead.
Tensor descriptors produce better address generation and memory access codegen than block pointers on Intel XPU.
desc = tl.make_tensor_descriptor(
base=ptr, shape=[M, N],
strides=[stride_m, stride_n],
block_shape=[BLOCK_M, BLOCK_N],
)
block = tl.load(desc, [pid_m, pid_n], boundary_check=(0, 1))
Use the large register file for compute-heavy kernels:
@triton.autotune(
configs=[triton.Config({'BLOCK_M': 256, 'BLOCK_N': 256}, num_warps=32)],
key=['M', 'N', 'K'],
)
@triton.jit(launch_metadata=lambda *args, **kwargs: {'grf_mode': '256'})
def kernel(...):
...
Use 1D grid with GROUP_SIZE_M for L2 locality:
grid = lambda META: (triton.cdiv(M, META['BLOCK_M']) * triton.cdiv(N, META['BLOCK_N']),)
# Inside kernel:
pid = tl.program_id(0)
num_pid_n = tl.cdiv(N, BLOCK_N)
group_id = pid // (GROUP_SIZE_M * num_pid_n)
a = tl.load(a_desc, [pid_m, k], boundary_check=(0, 1))
b = tl.load(b_desc, [k, pid_n], boundary_check=(0, 1))
acc += tl.dot(a.to(tl.bfloat16), b.to(tl.bfloat16), acc=acc) # fp32 accumulator
@triton.autotune meta-parameters in kernel signatureboundary_check uses dimension indices (0, 1), not booleansint64 before stride multiplicationnn.Module with nn.Linear)get_inputs(), get_init_inputs(), and module-level constants from *_pytorch.pyFull constraint list: correctness.yaml
Measured on Intel Battlemage G21 / Arc Pro B50 (128 XVEs). All runtimes are median of benchmark trials.
Speedup is vs. PyTorch eager baseline. Includes GEMM+Sigmoid+Scaling, GEMM+GELU+Softmax, Conv+BatchNorm+ReLU, and other fused patterns.
Baseline is the flash attention kernel from the Intel XPU Triton backend; speedup is vs. that kernel across multiple sequence lengths.
Full results: see the Xe-Forge repository.
| Issue | Symptom | Fix |
|-------|---------|-----|
| Autotune BLOCK_D | Wrong results (max_abs 4-8+) | Never autotune BLOCK_D. Use triton.next_power_of_2(D) |
| Python min/max | Runtime error | tl.minimum()/tl.maximum() |
xpu-kernels/
├── SKILL.md # This file (skill definition + workflow)
├── manifest.txt # Files included in this skill
│
├── scripts/ # Standalone CLI tools
│ ├── analyze_kernel.py # PyTorch → operations, shapes, fusion opportunities
│ ├── validate_triton.py # Syntax + constraint checks
│ ├── benchmark.py # Correctness + performance via ai-bench
│ ├── trial_manager.py # Tree-structured trial management
│ ├── xpu_profiler.py # VTune GPU hardware counters
│ ├── config.py # Shared configuration loader
│ ├── config.yaml # Session config (max_trials, vtune)
│ └── requirements.txt # Python dependencies
│
└── references/ # Knowledge base + integration guides
├── correctness.yaml # Hard constraints for XPU Triton
├── xpu_optimizations.yaml # Tensor descriptors, GRF, swizzling
├── implementation_reference.md # Code templates, Model class pattern
├── implementation_reference.md # Code templates, Model class pattern
├── optimization_strategies.md # Strategy reference + "try harder" tree
├── optimization_levels.yaml # Progressive L1-L5 optimization levels
├── workflow_details.md # Detailed workflow and decision tree
├── fusion_patterns.yaml # When to fuse vs split
├── memory_patterns.yaml # Access patterns and coalescing
├── dtype_optimizations.yaml # Mixed precision choices
├── persistent_kernel_patterns.yaml # Stream K and persistent kernels
├── kernel-templates.md # Triton kernel templates for XPU
└── kernelbench-classification.md # KernelBench operator taxonomy
devops
Provides guidance for writing and benchmarking optimized Triton kernels for AMD GPUs (MI355X, R9700) on ROCm, targeting HuggingFace diffusers (LTX-Video, SD3, FLUX) and transformers. Core kernels: RMSNorm, RoPE 3D, GEGLU, AdaLN. Includes XCD swizzle, autotune, diffusers integration patterns, and LTX-Video pipeline injection.
development
Provides guidance for writing and benchmarking optimized CUDA kernels for NVIDIA GPUs (H100, A100, T4) targeting HuggingFace diffusers and transformers libraries. Supports models like LTX-Video, Stable Diffusion, LLaMA, Mistral, and Qwen. Includes integration with HuggingFace Kernels Hub (get_kernel) for loading pre-compiled kernels. Includes benchmarking scripts to compare kernel performance against baseline implementations.
development
Maintainer-only workflow for handling GitHub Secret Scanning alerts on OpenClaw. Use when Codex needs to triage, redact, clean up, and resolve secret leakage found in issue comments, issue bodies, PR comments, or other GitHub content.
development
Maintainer workflow for OpenClaw releases, prereleases, changelog release notes, and publish validation. Use when Codex needs to prepare or verify stable or beta release steps, align version naming, assemble release notes, check release auth requirements, or validate publish-time commands and artifacts.