skills/skillxiv-v0.0.2-claude-opus-4.6/automatic-triton-programming/SKILL.md
Train specialized LLMs to generate optimized Triton GPU kernels using RL with dual rewards for correctness and syntax compliance. 8B model achieves parity with Claude-Sonnet and DeepSeek-R1 by combining supervised fine-tuning on curated code pairs with RL exploration beyond imitation learning ceilings.
npx skillsauth add ADu2021/skillXiv automatic-triton-programmingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Manually writing Triton kernels requires deep GPU compute expertise—developers must understand thread block dimensions, memory coalescing, register pressure, and hardware-specific optimization tricks. Existing code generation models struggle because Triton syntax is specialized and test-driven validation is critical: broken kernels produce wrong outputs or hang indefinitely. AutoTriton solves this through a three-stage pipeline: first, automatically harvest real PyTorch kernels from GitHub and generate corresponding Triton implementations through distillation; second, fine-tune a compact 8B model on these code pairs with chain-of-thought reasoning; third, use reinforcement learning with dual rewards—execution-based (tests pass) and rule-based (Triton syntax valid)—to push performance beyond supervised learning limits.
When developers need generated kernels for custom operations, custom data types, or domain-specific acceleration, autoregressive generation without RL produces code that looks reasonable but fails silently or violates Triton constraints. The dual reward prevents "reward hacking" where models game the system (e.g., producing syntactically valid but incorrect kernels). RL enables exploration of kernel implementation strategies supervised fine-tuning cannot reach.
AutoTriton separates training into supervised learning (establish baseline coding competency) and RL (push beyond baseline through directed exploration). The supervised stage learns from 14,102 instruction-triton code pairs with explanations extracted from GitHub. RL uses GRPO (Group Relative Policy Optimization), a variant of PPO, with two reward signals: (1) execution-based rewards from test cases—if kernel outputs match ground truth, high reward—and (2) rule-based rewards from Triton syntax validators—ensuring code respects block size limits, instruction counts, memory patterns. The combination prevents models from gaming metrics; a syntactically invalid kernel gets zero RL credit regardless of test performance, forcing genuine implementation learning.
This example demonstrates the supervised fine-tuning stage on curated instruction-code pairs. SFT establishes baseline coding competency.
# Supervised fine-tuning on Triton code pairs
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
class TritonSFTTrainer:
def __init__(self, base_model_name="meta-llama/Llama-2-8b"):
self.model = AutoModelForCausalLM.from_pretrained(base_model_name)
self.tokenizer = AutoTokenizer.from_pretrained(base_model_name)
self.optimizer = torch.optim.AdamW(self.model.parameters(), lr=1e-5)
def prepare_triton_dataset(self, github_pytorch_kernels):
"""Harvest Triton implementations from PyTorch kernels.
Distill PyTorch kernel logic into Triton code."""
instruction_code_pairs = []
for kernel in github_pytorch_kernels:
# Parse PyTorch kernel semantics
operation = extract_operation(kernel) # e.g., "element-wise multiply"
io_spec = extract_io_spec(kernel) # input/output shapes and types
# Generate instruction describing the operation
instruction = f"Implement a Triton kernel for {operation}. Input shapes: {io_spec}"
# Generate Triton implementation (via torch.compile or manual translation)
triton_code = translate_pytorch_to_triton(kernel)
# Add chain-of-thought explanation
reasoning = generate_reasoning(kernel, triton_code)
instruction_code_pairs.append({
'instruction': instruction,
'code': triton_code,
'reasoning': reasoning,
'operation': operation,
'io_spec': io_spec
})
return instruction_code_pairs
def format_training_example(self, instruction, reasoning, code):
"""Format instruction-reasoning-code for causal language modeling."""
text = f"""Instruction: {instruction}
Reasoning:
{reasoning}
Triton Code:
{code}
<|end_of_code|>"""
return text
def training_step(self, instruction, reasoning, code):
"""SFT on instruction -> reasoning -> code format."""
# Format example
text = self.format_training_example(instruction, reasoning, code)
# Tokenize
tokens = self.tokenizer(text, return_tensors='pt', truncation=True, max_length=2048)
input_ids = tokens['input_ids']
# Forward pass
outputs = self.model(input_ids, labels=input_ids)
loss = outputs.loss
# Backward
self.optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
self.optimizer.step()
return loss.item()
This example shows the kernel validation pipeline with execution and syntax rewards used in RL training.
class TritonKernelValidator:
def __init__(self):
self.syntax_errors = []
self.execution_errors = []
def validate_execution(self, triton_code, test_inputs, ground_truth_outputs):
"""Execute kernel and compare against ground truth.
Returns reward: 1.0 if correct, 0.0 if incorrect."""
try:
# Compile and run Triton kernel
compiled_kernel = compile_triton_kernel(triton_code)
actual_outputs = run_kernel(compiled_kernel, test_inputs)
# Compare against ground truth
correct = torch.allclose(actual_outputs, ground_truth_outputs, rtol=1e-5)
execution_reward = 1.0 if correct else 0.0
return execution_reward, None
except RuntimeError as e:
# Kernel crashed or hung
return 0.0, str(e)
def validate_syntax(self, triton_code):
"""Check code against Triton language constraints.
Returns reward: 1.0 if valid, 0.0 if violated constraints."""
constraints_violated = []
# Check 1: Block size must be power of 2 and <= 2048
block_dims = extract_block_dims(triton_code)
for dim in block_dims:
if not is_power_of_2(dim) or dim > 2048:
constraints_violated.append(f"Invalid block dim: {dim}")
# Check 2: Total threads per block <= 1024
total_threads = 1
for dim in block_dims:
total_threads *= dim
if total_threads > 1024:
constraints_violated.append(f"Total threads {total_threads} exceeds 1024")
# Check 3: Memory access patterns must be coalesced-friendly
if not check_memory_coalescing(triton_code):
constraints_violated.append("Memory access pattern not coalesced")
# Check 4: Instruction count heuristic (rough estimate)
if estimate_instruction_count(triton_code) > 2000:
constraints_violated.append("Estimated instruction count too high")
syntax_reward = 0.0 if constraints_violated else 1.0
return syntax_reward, constraints_violated
def combined_reward(self, triton_code, test_inputs, ground_truth_outputs):
"""Compute dual reward: execution + syntax."""
execution_reward, exec_error = self.validate_execution(
triton_code, test_inputs, ground_truth_outputs
)
syntax_reward, syntax_errors = self.validate_syntax(triton_code)
# Prevent reward hacking: syntax failure = zero credit even if tests pass
if syntax_errors:
execution_reward = 0.0
# Combined reward emphasizes both equally
total_reward = 0.5 * execution_reward + 0.5 * syntax_reward
return total_reward, {
'execution_reward': execution_reward,
'syntax_reward': syntax_reward,
'exec_error': exec_error,
'syntax_errors': syntax_errors
}
This example demonstrates RL fine-tuning using GRPO with dual rewards to push performance beyond supervised learning.
class TritonRLTrainer:
def __init__(self, model, tokenizer, validator, learning_rate=5e-6):
self.model = model
self.tokenizer = tokenizer
self.validator = validator
self.optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
def rl_training_step(self, instruction, reference_code, test_batches):
"""GRPO fine-tuning: sample diverse generations and optimize with rewards."""
# Sample multiple code completions (group diversity)
num_samples = 8
generated_codes = []
rewards = []
reward_details = []
for _ in range(num_samples):
# Generate code from instruction
generated_code = self.generate_code(instruction, temperature=0.9)
generated_codes.append(generated_code)
# Validate against all test cases
batch_rewards = []
for test_inputs, ground_truth in test_batches:
total_reward, details = self.validator.combined_reward(
generated_code, test_inputs, ground_truth
)
batch_rewards.append(total_reward)
avg_reward = sum(batch_rewards) / len(batch_rewards)
rewards.append(avg_reward)
reward_details.append(batch_rewards)
# Group relative reward (GRPO): normalize within group
rewards = torch.tensor(rewards)
normalized_rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-10)
# Compute policy loss for all generations
total_loss = 0.0
for code, norm_reward in zip(generated_codes, normalized_rewards):
# Tokenize generated code
tokens = self.tokenizer(code, return_tensors='pt', truncation=True)
input_ids = tokens['input_ids']
# Forward pass
outputs = self.model(input_ids)
logits = outputs.logits
# Policy gradient: maximize log-prob of generated code weighted by reward
log_probs = F.log_softmax(logits, dim=-1)
policy_loss = -(log_probs.mean() * norm_reward)
# Add entropy bonus to prevent collapse
entropy = -torch.sum(F.softmax(logits, dim=-1) * log_probs, dim=-1).mean()
policy_loss -= 0.01 * entropy
total_loss += policy_loss
# Average over samples
total_loss = total_loss / num_samples
# Backward
self.optimizer.zero_grad()
total_loss.backward()
torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
self.optimizer.step()
return {
'avg_reward': rewards.mean().item(),
'max_reward': rewards.max().item(),
'min_reward': rewards.min().item(),
'policy_loss': total_loss.item()
}
def generate_code(self, instruction, temperature=0.9, max_length=512):
"""Generate Triton code from instruction."""
prompt = f"Instruction: {instruction}\n\nTriton Code:"
input_ids = self.tokenizer(prompt, return_tensors='pt')['input_ids']
with torch.no_grad():
output = self.model.generate(
input_ids,
max_length=max_length,
temperature=temperature,
top_p=0.95,
do_sample=True
)
code = self.tokenizer.decode(output[0], skip_special_tokens=True)
return code.split("Triton Code:")[1].strip() if "Triton Code:" in code else code
| Hyperparameter | Recommended Value | Purpose | |---|---|---| | SFT learning rate | 1e-5 | Stable supervised learning on curated data | | RL learning rate | 5e-6 | Conservative RL to prevent divergence | | GRPO sample size (group) | 8 | Balance diversity vs. training cost | | Execution reward weight | 0.5 | Primary optimization signal | | Syntax reward weight | 0.5 | Prevent rule violations | | Entropy coefficient | 0.01 | Prevent policy collapse | | Max code length | 512 tokens | Typical kernel size | | Test cases per kernel | 3-5 | Coverage of edge cases |
When to use: Apply AutoTriton when generating Triton kernels for custom deep learning operations, datatype-specific implementations, or specialized accelerators. Use when manual kernel writing is a bottleneck but you have reference PyTorch implementations to learn from. Ideal for research and optimization when developer time is expensive.
When NOT to use: Skip if you have a small number of fixed kernels—manual optimization is simpler. Avoid for real-time code generation requiring latency < 500ms. Don't use if your target kernels have highly novel semantics not seen in training data; the model may fail to generalize. Skip if you lack test cases to validate generated code—dual rewards require ground truth comparisons.
Common pitfalls: Using only execution rewards without syntax validation allows models to game metrics with invalid code. Overfitting during SFT on small datasets (< 10k examples) causes poor RL exploration. Setting syntax constraints too loose defeats rule-based rewards. Not including diverse test inputs means generated code passes simple cases but fails edge cases. Skipping the chain-of-thought reasoning stage during SFT hurts interpretability and RL performance. Using uniform sampling for GRPO instead of diversity-aware sampling reduces exploration.
AutoTriton Team. (2025). AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs. arXiv preprint arXiv:2507.05687. https://arxiv.org/abs/2507.05687
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.