XPU Triton Kernels for Intel GPUs

This skill provides patterns and guidance for developing optimized Triton kernels targeting Intel XPU GPUs (Battlemage/Arc Pro B50). It integrates the Xe-Forge optimization framework — an LLM-driven loop that transforms PyTorch code into fast Triton kernels.

Quick Start

Optimize a Kernel (Xe-Forge Workflow)

The full optimization workflow analyzes a PyTorch baseline, generates Triton kernel variants in a branching trial tree, benchmarks each on XPU hardware, and finalizes the best result.

# 1. Analyze the baseline
python scripts/analyze_kernel.py test_kernels/70_Gemm_Sigmoid_Scaling_ResidualAdd_pytorch.py

# 2. Initialize trial tracking
python scripts/trial_manager.py init 70_Gemm_Sigmoid test_kernels/70_Gemm_Sigmoid_Scaling_ResidualAdd_pytorch.py

# 3. Validate a generated kernel (no GPU needed)
python scripts/validate_triton.py my_kernel.py

# 4. Benchmark correctness + performance
python scripts/benchmark.py test_kernels/70_Gemm_Sigmoid_Scaling_ResidualAdd_pytorch.py my_kernel.py

# 5. Profile with VTune (optional)
python scripts/xpu_profiler.py my_kernel.py

# 6. Finalize best trial
python scripts/trial_manager.py finalize 70_Gemm_Sigmoid optimized_triton.py

Supported Hardware

| GPU | Architecture | XVEs | Mem BW | Key Feature | Verified | |-----|-------------|------|--------|-------------|:--------:| | Battlemage G21 / Arc Pro B50 | Xe2 | 128 | ~500 GB/s | Tensor descriptors, GRF 256 | Yes |

See the Intel XPU Backend for Triton for supported hardware.

When This Skill Applies

Use this skill when:

Optimizing PyTorch operations into Triton kernels for Intel XPU
Writing GEMM, fused kernels, reductions, or Flash Attention for Intel GPUs
Running the Xe-Forge optimization loop (analyze → validate → benchmark → profile → finalize)
Benchmarking kernel performance against PyTorch baseline on XPU

Xe-Forge Optimization Workflow

Transform PyTorch code into optimized Triton kernels for Intel XPU. Kernels must be numerically equivalent and faster than baseline.

Configuration — Read `config.yaml` first

At the start of every session, read scripts/config.yaml. It controls:

max_trials — hard cap on optimization trials; always run all of them (use this instead of hardcoded "10")
vtune_enabled — if false, skip ALL VTune profiling steps (Step 3.6 and profiler-related decisions)
vtune_bin — path to the VTune binary (also settable via VTUNE_BIN env var)

Rules — Never Violate

ONLY create Triton kernel files (test_kernels/*_triton.py or trial files t<trial_id>.py).
NEVER create benchmark scripts, test scripts, helper utilities, or any other Python files.
NEVER write custom scripts to measure performance or test correctness — ONLY use scripts/benchmark.py.
If a tool fails, STOP and report the error. Do NOT work around it with custom scripts.
Generated kernels must be self-contained — all helper functions inline.
You MUST run all max_trials trials from config.yaml. Do NOT stop early due to plateau — LLM sampling can discover new ideas at any point. The only valid early stop is speedup > 5x.

Mandatory Tools

CRITICAL — Single-XPU serialization: There is only ONE XPU on this machine. You MUST NOT run multiple GPU workloads in parallel. benchmark.py and xpu_profiler.py must execute strictly one at a time — concurrent GPU jobs produce wrong results. CPU-only tools (analyze_kernel.py, validate_triton.py, trial_manager.py) are safe to parallelize with each other and with anything else.

| Tool | Command | Purpose | |------|---------|---------| | Analyze | python scripts/analyze_kernel.py <file> | Static analysis: operations, shapes, fusion opportunities | | Validate | python scripts/validate_triton.py <file> | Syntax + constraint checks before GPU time | | Benchmark | python scripts/benchmark.py <baseline> <triton> [--triton-baseline] [--baseline-us <cached>] | Correctness + performance via ai-bench | | Profile | python scripts/xpu_profiler.py <file> | VTune GPU hardware counters + recommendations | | Init trials | python scripts/trial_manager.py init <kernel_name> <baseline_file> [--triton-baseline] | Initialize trial tracking | | Save trial | python scripts/trial_manager.py save <kernel_name> <file> [--parent <parent_id>] [--strategy "..."] | Save trial to tree | | Record result | python scripts/trial_manager.py result <kernel_name> <trial_id> --validation pass --correctness <pass\|fail> --speedup <float> --baseline_us <float> --triton_us <float> | Record benchmark result | | Check status | python scripts/trial_manager.py status <kernel_name> | View trial tree | | Best trial | python scripts/trial_manager.py best <kernel_name> | Get best trial | | Baseline time | python scripts/trial_manager.py baseline-us <kernel_name> | Cached baseline time for --baseline-us | | Finalize | python scripts/trial_manager.py finalize <kernel_name> <name>_triton.py | Copy best trial to output |

Workflow Steps

Step 1: Analyze

Read the baseline source file. Identify shapes, dtypes, operations, fusion opportunities.
If baseline is PyTorch: run python scripts/analyze_kernel.py <pytorch_file>.
If baseline is Triton (--triton-baseline): skip analyze_kernel.py (it only supports PyTorch). Read the Triton file directly.
Read relevant knowledge base files: start with references/correctness.yaml and references/xpu_optimizations.yaml.
Read references/implementation_reference.md for templates and the Model class pattern.

Step 2: Initialize

python scripts/trial_manager.py init <kernel_name> <baseline_file> [--triton-baseline]

Step 3: Trial Loop (always run all `max_trials` from config.yaml)

For each trial:

Write kernel — start from templates or modify previous trial. See references/implementation_reference.md.
Validate — python scripts/validate_triton.py <triton_file> (fix until passing; doesn't count as a trial).
Save — python scripts/trial_manager.py save <kernel_name> <triton_file> --parent <parent_id> --strategy "description". Omit --parent for the first trial (t0).
Benchmark (MANDATORY every trial):
- Trial t0: python scripts/benchmark.py <baseline_file> <triton_file> [--triton-baseline] (measures both baseline and triton).
- Trials t1+: Get cached baseline via python scripts/trial_manager.py baseline-us <kernel_name>, then run python scripts/benchmark.py <baseline_file> <triton_file> [--triton-baseline] --baseline-us <cached_value> (skips baseline perf, saves time).
- After finalize: Re-run benchmark.py without --baseline-us for final accurate comparison.
Record — python scripts/trial_manager.py result <kernel_name> <trial_id> --validation pass --correctness <pass|fail> --speedup <float> --baseline_us <float> --triton_us <float> (runtimes from benchmark output).
Profile (MANDATORY after t1, if vtune_enabled is true in config.yaml) — Run python scripts/xpu_profiler.py <triton_file> after your first benchmarked trial. Use its output to guide subsequent trial strategies. Run again if speedup plateaus after 2+ additional trials. Skip this step entirely if vtune_enabled is false.
Decide next action (use profiler output from step 6 to inform decisions):
- Speedup > 5x → stop (excellent), finalize
- Speedup improved → continue on this branch, try next optimization level
- Speedup regressed → branch back to best trial, try different strategy
- Correctness failed → fix on same branch
- Profiler says low occupancy (if vtune_enabled) → increase tile sizes, check references/xpu_optimizations.yaml
- Profiler says overhead kernels dominate (if vtune_enabled) → pre-pack to bf16, see references/optimization_levels.yaml
- Plateau → do NOT stop. Try a fundamentally different approach (different algorithm, tiling, fusion strategy). LLM sampling can discover new ideas.
- See references/optimization_strategies.md for the full "try harder" decision tree

Step 4: Finalize

python scripts/trial_manager.py finalize <kernel_name> <name>_triton.py

Reference Docs — Read During Step 1

| Doc | Contents | |-----|----------| | references/implementation_reference.md | Code templates, Model class pattern, GEMM example | | references/optimization_strategies.md | Strategy reference, optimization levels, checklist | | references/workflow_details.md | Detailed workflow, decision tree, benchmarking/validation details | | references/correctness.yaml | Critical constraints to avoid bugs | | references/xpu_optimizations.yaml | XPU-specific patterns (tensor descriptors, GRF, swizzling) | | references/fusion_patterns.yaml | When to fuse vs split operations | | references/optimization_levels.yaml | Progressive optimization with "try harder" decision tree |

Existing Baselines Are Naive

The test_kernels/*.py Triton files (non-pytorch) are unoptimized baselines. They use manual pointer arithmetic, lack autotune, and miss XPU optimizations. Do NOT copy their patterns. Use references/implementation_reference.md instead.

Core XPU Kernel Patterns

Tensor Descriptors (Preferred on XPU)

Tensor descriptors produce better address generation and memory access codegen than block pointers on Intel XPU.

desc = tl.make_tensor_descriptor(
    base=ptr, shape=[M, N],
    strides=[stride_m, stride_n],
    block_shape=[BLOCK_M, BLOCK_N],
)
block = tl.load(desc, [pid_m, pid_n], boundary_check=(0, 1))

GRF Mode '256'

Use the large register file for compute-heavy kernels:

@triton.autotune(
    configs=[triton.Config({'BLOCK_M': 256, 'BLOCK_N': 256}, num_warps=32)],
    key=['M', 'N', 'K'],
)
@triton.jit(launch_metadata=lambda *args, **kwargs: {'grf_mode': '256'})
def kernel(...):
    ...

Register-Spill-Aware Autotune Pruning

For autotuned XPU kernels, prune configs that are likely to spill even after large-GRF recompilation. This avoids slow double-compile/benchmark cycles where the Intel backend reports spills, recompiles in large GRF mode, and still spills.

Use an early_config_prune that estimates per-thread GRF pressure from the live accumulator and operand tiles:

regs ~= (accumulators * BM * BN + stages * (BM * BK + weight_tiles * BN * BK))
        / (num_warps * warp_size)

Guidelines:

Apply this only on XPU; leave other backends' configs unchanged.
Read BM, BN, and BK from either autotune config kwargs or launch args.
Count accumulator tiles explicitly: gate/up fused kernels often have 2 accumulators, down/simple GEMM usually has 1.
Count weight tiles explicitly: gate/up fused kernels often load 2 weight tiles, down/simple GEMM usually loads 1.
Use num_stages for pipelined MMA paths; use 1 for scalar/non-pipelined paths.
Keep a never-empty fallback, such as the config with the lowest estimated register pressure.
Treat the model as a coarse filter for catastrophic spillers, not a perfect predictor; leave borderline configs for real benchmarking.

This complements grf_mode='256': large GRF increases the register budget, while pruning removes configs that are still too register-heavy to be useful.

For an implementation pattern, see references/xpu_optimizations.yaml (xpu_register_spill_autotune_pruning).

Tile Swizzling

Use 1D grid with GROUP_SIZE_M for L2 locality:

grid = lambda META: (triton.cdiv(M, META['BLOCK_M']) * triton.cdiv(N, META['BLOCK_N']),)
# Inside kernel:
pid = tl.program_id(0)
num_pid_n = tl.cdiv(N, BLOCK_N)
group_id = pid // (GROUP_SIZE_M * num_pid_n)

bf16 Inputs with fp32 Accumulation

a = tl.load(a_desc, [pid_m, k], boundary_check=(0, 1))
b = tl.load(b_desc, [k, pid_n], boundary_check=(0, 1))
acc += tl.dot(a.to(tl.bfloat16), b.to(tl.bfloat16), acc=acc)  # fp32 accumulator

Critical XPU Constraints

NO default values for @triton.autotune meta-parameters in kernel signature
1D grid when using tile swizzling (GROUP_SIZE_M)
boundary_check uses dimension indices (0, 1), not booleans
Cast batch indices to int64 before stride multiplication
Prefer tensor descriptors over block pointers for all new XPU kernels
Do NOT mix block pointer and tensor descriptor APIs on same operation
Pre-zero output buffers when using atomic accumulation
Model class must be compatible with ai-bench (nn.Module with nn.Linear)
Match get_inputs(), get_init_inputs(), and module-level constants from *_pytorch.py

Full constraint list: correctness.yaml

Performance Results

Measured on Intel Battlemage G21 / Arc Pro B50 (128 XVEs). All runtimes are median of benchmark trials.

KernelBench Level 2 — Fused Kernels (bf16)

Speedup is vs. PyTorch eager baseline. Includes GEMM+Sigmoid+Scaling, GEMM+GELU+Softmax, Conv+BatchNorm+ReLU, and other fused patterns.

Flash Attention Forward (fp16)

Baseline is the flash attention kernel from the Intel XPU Triton backend; speedup is vs. that kernel across multiple sequence lengths.

Full results: see the Xe-Forge repository.

Common Issues

| Issue | Symptom | Fix | |-------|---------|-----| | Autotune BLOCK_D | Wrong results (max_abs 4-8+) | Never autotune BLOCK_D. Use triton.next_power_of_2(D) | | Python min/max | Runtime error | tl.minimum()/tl.maximum() |

Project Structure

xpu-kernels/
├── SKILL.md                                    # This file (skill definition + workflow)
├── manifest.txt                                # Files included in this skill
│
├── scripts/                                    # Standalone CLI tools
│   ├── analyze_kernel.py                       # PyTorch → operations, shapes, fusion opportunities
│   ├── validate_triton.py                      # Syntax + constraint checks
│   ├── benchmark.py                            # Correctness + performance via ai-bench
│   ├── trial_manager.py                        # Tree-structured trial management
│   ├── xpu_profiler.py                         # VTune GPU hardware counters
│   ├── config.py                               # Shared configuration loader
│   ├── config.yaml                             # Session config (max_trials, vtune)
│   └── requirements.txt                        # Python dependencies
│
└── references/                                 # Knowledge base + integration guides
    ├── correctness.yaml                        # Hard constraints for XPU Triton
    ├── xpu_optimizations.yaml                  # Tensor descriptors, GRF, swizzling
    ├── implementation_reference.md             # Code templates, Model class pattern
    ├── implementation_reference.md             # Code templates, Model class pattern
    ├── optimization_strategies.md              # Strategy reference + "try harder" tree
    ├── optimization_levels.yaml                # Progressive L1-L5 optimization levels
    ├── workflow_details.md                     # Detailed workflow and decision tree
    ├── fusion_patterns.yaml                    # When to fuse vs split
    ├── memory_patterns.yaml                    # Access patterns and coalescing
    ├── dtype_optimizations.yaml                # Mixed precision choices
    ├── persistent_kernel_patterns.yaml         # Stream K and persistent kernels
    ├── kernel-templates.md                     # Triton kernel templates for XPU
    └── kernelbench-classification.md           # KernelBench operator taxonomy

XPU Triton Kernels for Intel GPUs

Quick Start

Optimize a Kernel (Xe-Forge Workflow)

The full optimization workflow analyzes a PyTorch baseline, generates Triton kernel variants in a branching trial tree, benchmarks each on XPU hardware, and finalizes the best result.

# 1. Analyze the baseline
python scripts/analyze_kernel.py test_kernels/70_Gemm_Sigmoid_Scaling_ResidualAdd_pytorch.py

# 2. Initialize trial tracking
python scripts/trial_manager.py init 70_Gemm_Sigmoid test_kernels/70_Gemm_Sigmoid_Scaling_ResidualAdd_pytorch.py

# 3. Validate a generated kernel (no GPU needed)
python scripts/validate_triton.py my_kernel.py

# 4. Benchmark correctness + performance
python scripts/benchmark.py test_kernels/70_Gemm_Sigmoid_Scaling_ResidualAdd_pytorch.py my_kernel.py

# 5. Profile with VTune (optional)
python scripts/xpu_profiler.py my_kernel.py

# 6. Finalize best trial
python scripts/trial_manager.py finalize 70_Gemm_Sigmoid optimized_triton.py

Supported Hardware

See the Intel XPU Backend for Triton for supported hardware.

When This Skill Applies

Use this skill when:

Optimizing PyTorch operations into Triton kernels for Intel XPU
Writing GEMM, fused kernels, reductions, or Flash Attention for Intel GPUs
Running the Xe-Forge optimization loop (analyze → validate → benchmark → profile → finalize)
Benchmarking kernel performance against PyTorch baseline on XPU

Xe-Forge Optimization Workflow

Transform PyTorch code into optimized Triton kernels for Intel XPU. Kernels must be numerically equivalent and faster than baseline.

Configuration — Read `config.yaml` first

At the start of every session, read scripts/config.yaml. It controls:

max_trials — hard cap on optimization trials; always run all of them (use this instead of hardcoded "10")
vtune_enabled — if false, skip ALL VTune profiling steps (Step 3.6 and profiler-related decisions)
vtune_bin — path to the VTune binary (also settable via VTUNE_BIN env var)

Rules — Never Violate

ONLY create Triton kernel files (test_kernels/*_triton.py or trial files t<trial_id>.py).
NEVER create benchmark scripts, test scripts, helper utilities, or any other Python files.
NEVER write custom scripts to measure performance or test correctness — ONLY use scripts/benchmark.py.
If a tool fails, STOP and report the error. Do NOT work around it with custom scripts.
Generated kernels must be self-contained — all helper functions inline.
You MUST run all max_trials trials from config.yaml. Do NOT stop early due to plateau — LLM sampling can discover new ideas at any point. The only valid early stop is speedup > 5x.

Mandatory Tools

Workflow Steps

Step 1: Analyze

Read the baseline source file. Identify shapes, dtypes, operations, fusion opportunities.
If baseline is PyTorch: run python scripts/analyze_kernel.py <pytorch_file>.
If baseline is Triton (--triton-baseline): skip analyze_kernel.py (it only supports PyTorch). Read the Triton file directly.
Read relevant knowledge base files: start with references/correctness.yaml and references/xpu_optimizations.yaml.
Read references/implementation_reference.md for templates and the Model class pattern.

Step 2: Initialize

python scripts/trial_manager.py init <kernel_name> <baseline_file> [--triton-baseline]

Step 3: Trial Loop (always run all `max_trials` from config.yaml)

For each trial:

Write kernel — start from templates or modify previous trial. See references/implementation_reference.md.
Validate — python scripts/validate_triton.py <triton_file> (fix until passing; doesn't count as a trial).
Save — python scripts/trial_manager.py save <kernel_name> <triton_file> --parent <parent_id> --strategy "description". Omit --parent for the first trial (t0).
Benchmark (MANDATORY every trial):
- Trial t0: python scripts/benchmark.py <baseline_file> <triton_file> [--triton-baseline] (measures both baseline and triton).
- Trials t1+: Get cached baseline via python scripts/trial_manager.py baseline-us <kernel_name>, then run python scripts/benchmark.py <baseline_file> <triton_file> [--triton-baseline] --baseline-us <cached_value> (skips baseline perf, saves time).
- After finalize: Re-run benchmark.py without --baseline-us for final accurate comparison.
Record — python scripts/trial_manager.py result <kernel_name> <trial_id> --validation pass --correctness <pass|fail> --speedup <float> --baseline_us <float> --triton_us <float> (runtimes from benchmark output).
Profile (MANDATORY after t1, if vtune_enabled is true in config.yaml) — Run python scripts/xpu_profiler.py <triton_file> after your first benchmarked trial. Use its output to guide subsequent trial strategies. Run again if speedup plateaus after 2+ additional trials. Skip this step entirely if vtune_enabled is false.
Decide next action (use profiler output from step 6 to inform decisions):
- Speedup > 5x → stop (excellent), finalize
- Speedup improved → continue on this branch, try next optimization level
- Speedup regressed → branch back to best trial, try different strategy
- Correctness failed → fix on same branch
- Profiler says low occupancy (if vtune_enabled) → increase tile sizes, check references/xpu_optimizations.yaml
- Profiler says overhead kernels dominate (if vtune_enabled) → pre-pack to bf16, see references/optimization_levels.yaml
- Plateau → do NOT stop. Try a fundamentally different approach (different algorithm, tiling, fusion strategy). LLM sampling can discover new ideas.
- See references/optimization_strategies.md for the full "try harder" decision tree

Step 4: Finalize

python scripts/trial_manager.py finalize <kernel_name> <name>_triton.py

Reference Docs — Read During Step 1

Existing Baselines Are Naive

Core XPU Kernel Patterns

Tensor Descriptors (Preferred on XPU)

Tensor descriptors produce better address generation and memory access codegen than block pointers on Intel XPU.

desc = tl.make_tensor_descriptor(
    base=ptr, shape=[M, N],
    strides=[stride_m, stride_n],
    block_shape=[BLOCK_M, BLOCK_N],
)
block = tl.load(desc, [pid_m, pid_n], boundary_check=(0, 1))

GRF Mode '256'

Use the large register file for compute-heavy kernels:

@triton.autotune(
    configs=[triton.Config({'BLOCK_M': 256, 'BLOCK_N': 256}, num_warps=32)],
    key=['M', 'N', 'K'],
)
@triton.jit(launch_metadata=lambda *args, **kwargs: {'grf_mode': '256'})
def kernel(...):
    ...

Register-Spill-Aware Autotune Pruning

Use an early_config_prune that estimates per-thread GRF pressure from the live accumulator and operand tiles:

regs ~= (accumulators * BM * BN + stages * (BM * BK + weight_tiles * BN * BK))
        / (num_warps * warp_size)

Guidelines:

Apply this only on XPU; leave other backends' configs unchanged.
Read BM, BN, and BK from either autotune config kwargs or launch args.
Count accumulator tiles explicitly: gate/up fused kernels often have 2 accumulators, down/simple GEMM usually has 1.
Count weight tiles explicitly: gate/up fused kernels often load 2 weight tiles, down/simple GEMM usually loads 1.
Use num_stages for pipelined MMA paths; use 1 for scalar/non-pipelined paths.
Keep a never-empty fallback, such as the config with the lowest estimated register pressure.
Treat the model as a coarse filter for catastrophic spillers, not a perfect predictor; leave borderline configs for real benchmarking.

This complements grf_mode='256': large GRF increases the register budget, while pruning removes configs that are still too register-heavy to be useful.

For an implementation pattern, see references/xpu_optimizations.yaml (xpu_register_spill_autotune_pruning).

Tile Swizzling

Use 1D grid with GROUP_SIZE_M for L2 locality:

grid = lambda META: (triton.cdiv(M, META['BLOCK_M']) * triton.cdiv(N, META['BLOCK_N']),)
# Inside kernel:
pid = tl.program_id(0)
num_pid_n = tl.cdiv(N, BLOCK_N)
group_id = pid // (GROUP_SIZE_M * num_pid_n)

bf16 Inputs with fp32 Accumulation

a = tl.load(a_desc, [pid_m, k], boundary_check=(0, 1))
b = tl.load(b_desc, [k, pid_n], boundary_check=(0, 1))
acc += tl.dot(a.to(tl.bfloat16), b.to(tl.bfloat16), acc=acc)  # fp32 accumulator

Critical XPU Constraints

NO default values for @triton.autotune meta-parameters in kernel signature
1D grid when using tile swizzling (GROUP_SIZE_M)
boundary_check uses dimension indices (0, 1), not booleans
Cast batch indices to int64 before stride multiplication
Prefer tensor descriptors over block pointers for all new XPU kernels
Do NOT mix block pointer and tensor descriptor APIs on same operation
Pre-zero output buffers when using atomic accumulation
Model class must be compatible with ai-bench (nn.Module with nn.Linear)
Match get_inputs(), get_init_inputs(), and module-level constants from *_pytorch.py

Full constraint list: correctness.yaml

Performance Results

Measured on Intel Battlemage G21 / Arc Pro B50 (128 XVEs). All runtimes are median of benchmark trials.

KernelBench Level 2 — Fused Kernels (bf16)

Speedup is vs. PyTorch eager baseline. Includes GEMM+Sigmoid+Scaling, GEMM+GELU+Softmax, Conv+BatchNorm+ReLU, and other fused patterns.

Flash Attention Forward (fp16)

Baseline is the flash attention kernel from the Intel XPU Triton backend; speedup is vs. that kernel across multiple sequence lengths.

Full results: see the Xe-Forge repository.

Common Issues

Project Structure

xpu-kernels/
├── SKILL.md                                    # This file (skill definition + workflow)
├── manifest.txt                                # Files included in this skill
│
├── scripts/                                    # Standalone CLI tools
│   ├── analyze_kernel.py                       # PyTorch → operations, shapes, fusion opportunities
│   ├── validate_triton.py                      # Syntax + constraint checks
│   ├── benchmark.py                            # Correctness + performance via ai-bench
│   ├── trial_manager.py                        # Tree-structured trial management
│   ├── xpu_profiler.py                         # VTune GPU hardware counters
│   ├── config.py                               # Shared configuration loader
│   ├── config.yaml                             # Session config (max_trials, vtune)
│   └── requirements.txt                        # Python dependencies
│
└── references/                                 # Knowledge base + integration guides
    ├── correctness.yaml                        # Hard constraints for XPU Triton
    ├── xpu_optimizations.yaml                  # Tensor descriptors, GRF, swizzling
    ├── implementation_reference.md             # Code templates, Model class pattern
    ├── implementation_reference.md             # Code templates, Model class pattern
    ├── optimization_strategies.md              # Strategy reference + "try harder" tree
    ├── optimization_levels.yaml                # Progressive L1-L5 optimization levels
    ├── workflow_details.md                     # Detailed workflow and decision tree
    ├── fusion_patterns.yaml                    # When to fuse vs split
    ├── memory_patterns.yaml                    # Access patterns and coalescing
    ├── dtype_optimizations.yaml                # Mixed precision choices
    ├── persistent_kernel_patterns.yaml         # Stream K and persistent kernels
    ├── kernel-templates.md                     # Triton kernel templates for XPU
    └── kernelbench-classification.md           # KernelBench operator taxonomy

Adoption

huggingface/xpu-kernels

$ install --global

Security Scan Results

SKILL.md

XPU Triton Kernels for Intel GPUs

Quick Start

Optimize a Kernel (Xe-Forge Workflow)

Supported Hardware

When This Skill Applies

Xe-Forge Optimization Workflow

Configuration — Read config.yaml first

Rules — Never Violate

Mandatory Tools

Workflow Steps

Step 1: Analyze

Step 2: Initialize

Step 3: Trial Loop (always run all max_trials from config.yaml)

Step 4: Finalize

Reference Docs — Read During Step 1

Existing Baselines Are Naive

Core XPU Kernel Patterns

Tensor Descriptors (Preferred on XPU)

GRF Mode '256'

Register-Spill-Aware Autotune Pruning

Tile Swizzling

bf16 Inputs with fp32 Accumulation

Critical XPU Constraints

Performance Results

KernelBench Level 2 — Fused Kernels (bf16)

Flash Attention Forward (fp16)

Common Issues

Project Structure

See Also

Xe-Forge Tools

XPU Optimization References

Other References

External Resources

Related Skills

huggingface/kernel-builder/skills/triton-kernels

huggingface/cuda-kernels

huggingface/cpu-kernels

huggingface/rocm-kernels

huggingface/xpu-kernels

$ install --global

Security Scan Results

SKILL.md

XPU Triton Kernels for Intel GPUs

Quick Start

Optimize a Kernel (Xe-Forge Workflow)

Supported Hardware

When This Skill Applies

Xe-Forge Optimization Workflow

Configuration — Read config.yaml first

Rules — Never Violate

Mandatory Tools

Workflow Steps

Step 1: Analyze

Step 2: Initialize

Step 3: Trial Loop (always run all max_trials from config.yaml)

Step 4: Finalize

Reference Docs — Read During Step 1

Existing Baselines Are Naive

Core XPU Kernel Patterns

Tensor Descriptors (Preferred on XPU)

GRF Mode '256'

Register-Spill-Aware Autotune Pruning

Tile Swizzling

bf16 Inputs with fp32 Accumulation

Critical XPU Constraints

Performance Results

KernelBench Level 2 — Fused Kernels (bf16)

Flash Attention Forward (fp16)

Common Issues

Project Structure

See Also

Xe-Forge Tools

XPU Optimization References

Other References

External Resources

Configuration — Read `config.yaml` first

Step 3: Trial Loop (always run all `max_trials` from config.yaml)

Configuration — Read `config.yaml` first

Step 3: Trial Loop (always run all `max_trials` from config.yaml)