Automatic Triton Programming: Generating Optimized GPU Kernels via RL-Guided Code Generation

Manually writing Triton kernels requires deep GPU compute expertise—developers must understand thread block dimensions, memory coalescing, register pressure, and hardware-specific optimization tricks. Existing code generation models struggle because Triton syntax is specialized and test-driven validation is critical: broken kernels produce wrong outputs or hang indefinitely. AutoTriton solves this through a three-stage pipeline: first, automatically harvest real PyTorch kernels from GitHub and generate corresponding Triton implementations through distillation; second, fine-tune a compact 8B model on these code pairs with chain-of-thought reasoning; third, use reinforcement learning with dual rewards—execution-based (tests pass) and rule-based (Triton syntax valid)—to push performance beyond supervised learning limits.

When developers need generated kernels for custom operations, custom data types, or domain-specific acceleration, autoregressive generation without RL produces code that looks reasonable but fails silently or violates Triton constraints. The dual reward prevents "reward hacking" where models game the system (e.g., producing syntactically valid but incorrect kernels). RL enables exploration of kernel implementation strategies supervised fine-tuning cannot reach.

Core Concept

AutoTriton separates training into supervised learning (establish baseline coding competency) and RL (push beyond baseline through directed exploration). The supervised stage learns from 14,102 instruction-triton code pairs with explanations extracted from GitHub. RL uses GRPO (Group Relative Policy Optimization), a variant of PPO, with two reward signals: (1) execution-based rewards from test cases—if kernel outputs match ground truth, high reward—and (2) rule-based rewards from Triton syntax validators—ensuring code respects block size limits, instruction counts, memory patterns. The combination prevents models from gaming metrics; a syntactically invalid kernel gets zero RL credit regardless of test performance, forcing genuine implementation learning.

Architecture Overview

Code Instruction Pairs: 14,102 PyTorch-to-Triton mappings with chain-of-thought explanations
Supervised Fine-Tuning Stage: Trains 8B base model to output valid Triton code from descriptions
Execution Validator: Runs kernels on test inputs and compares outputs to ground truth
Triton Syntax Validator: Checks code against Triton language constraints (block dims, memory patterns)
RL Training Loop (GRPO): Combines execution + syntax rewards to guide policy optimization
Test Case Generator: Creates diverse inputs and expected outputs for kernel validation

Implementation

This example demonstrates the supervised fine-tuning stage on curated instruction-code pairs. SFT establishes baseline coding competency.

# Supervised fine-tuning on Triton code pairs
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

class TritonSFTTrainer:
    def __init__(self, base_model_name="meta-llama/Llama-2-8b"):
        self.model = AutoModelForCausalLM.from_pretrained(base_model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_name)
        self.optimizer = torch.optim.AdamW(self.model.parameters(), lr=1e-5)

    def prepare_triton_dataset(self, github_pytorch_kernels):
        """Harvest Triton implementations from PyTorch kernels.
        Distill PyTorch kernel logic into Triton code."""

        instruction_code_pairs = []

        for kernel in github_pytorch_kernels:
            # Parse PyTorch kernel semantics
            operation = extract_operation(kernel)  # e.g., "element-wise multiply"
            io_spec = extract_io_spec(kernel)  # input/output shapes and types

            # Generate instruction describing the operation
            instruction = f"Implement a Triton kernel for {operation}. Input shapes: {io_spec}"

            # Generate Triton implementation (via torch.compile or manual translation)
            triton_code = translate_pytorch_to_triton(kernel)

            # Add chain-of-thought explanation
            reasoning = generate_reasoning(kernel, triton_code)

            instruction_code_pairs.append({
                'instruction': instruction,
                'code': triton_code,
                'reasoning': reasoning,
                'operation': operation,
                'io_spec': io_spec
            })

        return instruction_code_pairs

    def format_training_example(self, instruction, reasoning, code):
        """Format instruction-reasoning-code for causal language modeling."""

        text = f"""Instruction: {instruction}

Reasoning:
{reasoning}

Triton Code:
{code}
<|end_of_code|>"""

        return text

    def training_step(self, instruction, reasoning, code):
        """SFT on instruction -> reasoning -> code format."""

        # Format example
        text = self.format_training_example(instruction, reasoning, code)

        # Tokenize
        tokens = self.tokenizer(text, return_tensors='pt', truncation=True, max_length=2048)
        input_ids = tokens['input_ids']

        # Forward pass
        outputs = self.model(input_ids, labels=input_ids)
        loss = outputs.loss

        # Backward
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
        self.optimizer.step()

        return loss.item()

This example shows the kernel validation pipeline with execution and syntax rewards used in RL training.

class TritonKernelValidator:
    def __init__(self):
        self.syntax_errors = []
        self.execution_errors = []

    def validate_execution(self, triton_code, test_inputs, ground_truth_outputs):
        """Execute kernel and compare against ground truth.
        Returns reward: 1.0 if correct, 0.0 if incorrect."""

        try:
            # Compile and run Triton kernel
            compiled_kernel = compile_triton_kernel(triton_code)
            actual_outputs = run_kernel(compiled_kernel, test_inputs)

            # Compare against ground truth
            correct = torch.allclose(actual_outputs, ground_truth_outputs, rtol=1e-5)
            execution_reward = 1.0 if correct else 0.0

            return execution_reward, None

        except RuntimeError as e:
            # Kernel crashed or hung
            return 0.0, str(e)

    def validate_syntax(self, triton_code):
        """Check code against Triton language constraints.
        Returns reward: 1.0 if valid, 0.0 if violated constraints."""

        constraints_violated = []

        # Check 1: Block size must be power of 2 and <= 2048
        block_dims = extract_block_dims(triton_code)
        for dim in block_dims:
            if not is_power_of_2(dim) or dim > 2048:
                constraints_violated.append(f"Invalid block dim: {dim}")

        # Check 2: Total threads per block <= 1024
        total_threads = 1
        for dim in block_dims:
            total_threads *= dim
        if total_threads > 1024:
            constraints_violated.append(f"Total threads {total_threads} exceeds 1024")

        # Check 3: Memory access patterns must be coalesced-friendly
        if not check_memory_coalescing(triton_code):
            constraints_violated.append("Memory access pattern not coalesced")

        # Check 4: Instruction count heuristic (rough estimate)
        if estimate_instruction_count(triton_code) > 2000:
            constraints_violated.append("Estimated instruction count too high")

        syntax_reward = 0.0 if constraints_violated else 1.0
        return syntax_reward, constraints_violated

    def combined_reward(self, triton_code, test_inputs, ground_truth_outputs):
        """Compute dual reward: execution + syntax."""

        execution_reward, exec_error = self.validate_execution(
            triton_code, test_inputs, ground_truth_outputs
        )

        syntax_reward, syntax_errors = self.validate_syntax(triton_code)

        # Prevent reward hacking: syntax failure = zero credit even if tests pass
        if syntax_errors:
            execution_reward = 0.0

        # Combined reward emphasizes both equally
        total_reward = 0.5 * execution_reward + 0.5 * syntax_reward

        return total_reward, {
            'execution_reward': execution_reward,
            'syntax_reward': syntax_reward,
            'exec_error': exec_error,
            'syntax_errors': syntax_errors
        }

This example demonstrates RL fine-tuning using GRPO with dual rewards to push performance beyond supervised learning.

class TritonRLTrainer:
    def __init__(self, model, tokenizer, validator, learning_rate=5e-6):
        self.model = model
        self.tokenizer = tokenizer
        self.validator = validator
        self.optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

    def rl_training_step(self, instruction, reference_code, test_batches):
        """GRPO fine-tuning: sample diverse generations and optimize with rewards."""

        # Sample multiple code completions (group diversity)
        num_samples = 8
        generated_codes = []
        rewards = []
        reward_details = []

        for _ in range(num_samples):
            # Generate code from instruction
            generated_code = self.generate_code(instruction, temperature=0.9)
            generated_codes.append(generated_code)

            # Validate against all test cases
            batch_rewards = []
            for test_inputs, ground_truth in test_batches:
                total_reward, details = self.validator.combined_reward(
                    generated_code, test_inputs, ground_truth
                )
                batch_rewards.append(total_reward)

            avg_reward = sum(batch_rewards) / len(batch_rewards)
            rewards.append(avg_reward)
            reward_details.append(batch_rewards)

        # Group relative reward (GRPO): normalize within group
        rewards = torch.tensor(rewards)
        normalized_rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-10)

        # Compute policy loss for all generations
        total_loss = 0.0
        for code, norm_reward in zip(generated_codes, normalized_rewards):
            # Tokenize generated code
            tokens = self.tokenizer(code, return_tensors='pt', truncation=True)
            input_ids = tokens['input_ids']

            # Forward pass
            outputs = self.model(input_ids)
            logits = outputs.logits

            # Policy gradient: maximize log-prob of generated code weighted by reward
            log_probs = F.log_softmax(logits, dim=-1)
            policy_loss = -(log_probs.mean() * norm_reward)

            # Add entropy bonus to prevent collapse
            entropy = -torch.sum(F.softmax(logits, dim=-1) * log_probs, dim=-1).mean()
            policy_loss -= 0.01 * entropy

            total_loss += policy_loss

        # Average over samples
        total_loss = total_loss / num_samples

        # Backward
        self.optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
        self.optimizer.step()

        return {
            'avg_reward': rewards.mean().item(),
            'max_reward': rewards.max().item(),
            'min_reward': rewards.min().item(),
            'policy_loss': total_loss.item()
        }

    def generate_code(self, instruction, temperature=0.9, max_length=512):
        """Generate Triton code from instruction."""

        prompt = f"Instruction: {instruction}\n\nTriton Code:"
        input_ids = self.tokenizer(prompt, return_tensors='pt')['input_ids']

        with torch.no_grad():
            output = self.model.generate(
                input_ids,
                max_length=max_length,
                temperature=temperature,
                top_p=0.95,
                do_sample=True
            )

        code = self.tokenizer.decode(output[0], skip_special_tokens=True)
        return code.split("Triton Code:")[1].strip() if "Triton Code:" in code else code

Practical Guidance

| Hyperparameter | Recommended Value | Purpose | |---|---|---| | SFT learning rate | 1e-5 | Stable supervised learning on curated data | | RL learning rate | 5e-6 | Conservative RL to prevent divergence | | GRPO sample size (group) | 8 | Balance diversity vs. training cost | | Execution reward weight | 0.5 | Primary optimization signal | | Syntax reward weight | 0.5 | Prevent rule violations | | Entropy coefficient | 0.01 | Prevent policy collapse | | Max code length | 512 tokens | Typical kernel size | | Test cases per kernel | 3-5 | Coverage of edge cases |

When to use: Apply AutoTriton when generating Triton kernels for custom deep learning operations, datatype-specific implementations, or specialized accelerators. Use when manual kernel writing is a bottleneck but you have reference PyTorch implementations to learn from. Ideal for research and optimization when developer time is expensive.

When NOT to use: Skip if you have a small number of fixed kernels—manual optimization is simpler. Avoid for real-time code generation requiring latency < 500ms. Don't use if your target kernels have highly novel semantics not seen in training data; the model may fail to generalize. Skip if you lack test cases to validate generated code—dual rewards require ground truth comparisons.

Common pitfalls: Using only execution rewards without syntax validation allows models to game metrics with invalid code. Overfitting during SFT on small datasets (< 10k examples) causes poor RL exploration. Setting syntax constraints too loose defeats rule-based rewards. Not including diverse test inputs means generated code passes simple cases but fails edge cases. Skipping the chain-of-thought reasoning stage during SFT hurts interpretability and RL performance. Using uniform sampling for GRPO instead of diversity-aware sampling reduces exploration.

Reference

AutoTriton Team. (2025). AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs. arXiv preprint arXiv:2507.05687. https://arxiv.org/abs/2507.05687

Automatic Triton Programming: Generating Optimized GPU Kernels via RL-Guided Code Generation

Core Concept

Architecture Overview

Code Instruction Pairs: 14,102 PyTorch-to-Triton mappings with chain-of-thought explanations
Supervised Fine-Tuning Stage: Trains 8B base model to output valid Triton code from descriptions
Execution Validator: Runs kernels on test inputs and compares outputs to ground truth
Triton Syntax Validator: Checks code against Triton language constraints (block dims, memory patterns)
RL Training Loop (GRPO): Combines execution + syntax rewards to guide policy optimization
Test Case Generator: Creates diverse inputs and expected outputs for kernel validation

Implementation

This example demonstrates the supervised fine-tuning stage on curated instruction-code pairs. SFT establishes baseline coding competency.

# Supervised fine-tuning on Triton code pairs
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

class TritonSFTTrainer:
    def __init__(self, base_model_name="meta-llama/Llama-2-8b"):
        self.model = AutoModelForCausalLM.from_pretrained(base_model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_name)
        self.optimizer = torch.optim.AdamW(self.model.parameters(), lr=1e-5)

    def prepare_triton_dataset(self, github_pytorch_kernels):
        """Harvest Triton implementations from PyTorch kernels.
        Distill PyTorch kernel logic into Triton code."""

        instruction_code_pairs = []

        for kernel in github_pytorch_kernels:
            # Parse PyTorch kernel semantics
            operation = extract_operation(kernel)  # e.g., "element-wise multiply"
            io_spec = extract_io_spec(kernel)  # input/output shapes and types

            # Generate instruction describing the operation
            instruction = f"Implement a Triton kernel for {operation}. Input shapes: {io_spec}"

            # Generate Triton implementation (via torch.compile or manual translation)
            triton_code = translate_pytorch_to_triton(kernel)

            # Add chain-of-thought explanation
            reasoning = generate_reasoning(kernel, triton_code)

            instruction_code_pairs.append({
                'instruction': instruction,
                'code': triton_code,
                'reasoning': reasoning,
                'operation': operation,
                'io_spec': io_spec
            })

        return instruction_code_pairs

    def format_training_example(self, instruction, reasoning, code):
        """Format instruction-reasoning-code for causal language modeling."""

        text = f"""Instruction: {instruction}

Reasoning:
{reasoning}

Triton Code:
{code}
<|end_of_code|>"""

        return text

    def training_step(self, instruction, reasoning, code):
        """SFT on instruction -> reasoning -> code format."""

        # Format example
        text = self.format_training_example(instruction, reasoning, code)

        # Tokenize
        tokens = self.tokenizer(text, return_tensors='pt', truncation=True, max_length=2048)
        input_ids = tokens['input_ids']

        # Forward pass
        outputs = self.model(input_ids, labels=input_ids)
        loss = outputs.loss

        # Backward
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
        self.optimizer.step()

        return loss.item()

This example shows the kernel validation pipeline with execution and syntax rewards used in RL training.

class TritonKernelValidator:
    def __init__(self):
        self.syntax_errors = []
        self.execution_errors = []

    def validate_execution(self, triton_code, test_inputs, ground_truth_outputs):
        """Execute kernel and compare against ground truth.
        Returns reward: 1.0 if correct, 0.0 if incorrect."""

        try:
            # Compile and run Triton kernel
            compiled_kernel = compile_triton_kernel(triton_code)
            actual_outputs = run_kernel(compiled_kernel, test_inputs)

            # Compare against ground truth
            correct = torch.allclose(actual_outputs, ground_truth_outputs, rtol=1e-5)
            execution_reward = 1.0 if correct else 0.0

            return execution_reward, None

        except RuntimeError as e:
            # Kernel crashed or hung
            return 0.0, str(e)

    def validate_syntax(self, triton_code):
        """Check code against Triton language constraints.
        Returns reward: 1.0 if valid, 0.0 if violated constraints."""

        constraints_violated = []

        # Check 1: Block size must be power of 2 and <= 2048
        block_dims = extract_block_dims(triton_code)
        for dim in block_dims:
            if not is_power_of_2(dim) or dim > 2048:
                constraints_violated.append(f"Invalid block dim: {dim}")

        # Check 2: Total threads per block <= 1024
        total_threads = 1
        for dim in block_dims:
            total_threads *= dim
        if total_threads > 1024:
            constraints_violated.append(f"Total threads {total_threads} exceeds 1024")

        # Check 3: Memory access patterns must be coalesced-friendly
        if not check_memory_coalescing(triton_code):
            constraints_violated.append("Memory access pattern not coalesced")

        # Check 4: Instruction count heuristic (rough estimate)
        if estimate_instruction_count(triton_code) > 2000:
            constraints_violated.append("Estimated instruction count too high")

        syntax_reward = 0.0 if constraints_violated else 1.0
        return syntax_reward, constraints_violated

    def combined_reward(self, triton_code, test_inputs, ground_truth_outputs):
        """Compute dual reward: execution + syntax."""

        execution_reward, exec_error = self.validate_execution(
            triton_code, test_inputs, ground_truth_outputs
        )

        syntax_reward, syntax_errors = self.validate_syntax(triton_code)

        # Prevent reward hacking: syntax failure = zero credit even if tests pass
        if syntax_errors:
            execution_reward = 0.0

        # Combined reward emphasizes both equally
        total_reward = 0.5 * execution_reward + 0.5 * syntax_reward

        return total_reward, {
            'execution_reward': execution_reward,
            'syntax_reward': syntax_reward,
            'exec_error': exec_error,
            'syntax_errors': syntax_errors
        }

This example demonstrates RL fine-tuning using GRPO with dual rewards to push performance beyond supervised learning.

class TritonRLTrainer:
    def __init__(self, model, tokenizer, validator, learning_rate=5e-6):
        self.model = model
        self.tokenizer = tokenizer
        self.validator = validator
        self.optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

    def rl_training_step(self, instruction, reference_code, test_batches):
        """GRPO fine-tuning: sample diverse generations and optimize with rewards."""

        # Sample multiple code completions (group diversity)
        num_samples = 8
        generated_codes = []
        rewards = []
        reward_details = []

        for _ in range(num_samples):
            # Generate code from instruction
            generated_code = self.generate_code(instruction, temperature=0.9)
            generated_codes.append(generated_code)

            # Validate against all test cases
            batch_rewards = []
            for test_inputs, ground_truth in test_batches:
                total_reward, details = self.validator.combined_reward(
                    generated_code, test_inputs, ground_truth
                )
                batch_rewards.append(total_reward)

            avg_reward = sum(batch_rewards) / len(batch_rewards)
            rewards.append(avg_reward)
            reward_details.append(batch_rewards)

        # Group relative reward (GRPO): normalize within group
        rewards = torch.tensor(rewards)
        normalized_rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-10)

        # Compute policy loss for all generations
        total_loss = 0.0
        for code, norm_reward in zip(generated_codes, normalized_rewards):
            # Tokenize generated code
            tokens = self.tokenizer(code, return_tensors='pt', truncation=True)
            input_ids = tokens['input_ids']

            # Forward pass
            outputs = self.model(input_ids)
            logits = outputs.logits

            # Policy gradient: maximize log-prob of generated code weighted by reward
            log_probs = F.log_softmax(logits, dim=-1)
            policy_loss = -(log_probs.mean() * norm_reward)

            # Add entropy bonus to prevent collapse
            entropy = -torch.sum(F.softmax(logits, dim=-1) * log_probs, dim=-1).mean()
            policy_loss -= 0.01 * entropy

            total_loss += policy_loss

        # Average over samples
        total_loss = total_loss / num_samples

        # Backward
        self.optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
        self.optimizer.step()

        return {
            'avg_reward': rewards.mean().item(),
            'max_reward': rewards.max().item(),
            'min_reward': rewards.min().item(),
            'policy_loss': total_loss.item()
        }

    def generate_code(self, instruction, temperature=0.9, max_length=512):
        """Generate Triton code from instruction."""

        prompt = f"Instruction: {instruction}\n\nTriton Code:"
        input_ids = self.tokenizer(prompt, return_tensors='pt')['input_ids']

        with torch.no_grad():
            output = self.model.generate(
                input_ids,
                max_length=max_length,
                temperature=temperature,
                top_p=0.95,
                do_sample=True
            )

        code = self.tokenizer.decode(output[0], skip_special_tokens=True)
        return code.split("Triton Code:")[1].strip() if "Triton Code:" in code else code

Practical Guidance

Reference

AutoTriton Team. (2025). AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs. arXiv preprint arXiv:2507.05687. https://arxiv.org/abs/2507.05687

Adoption

ADu2021/automatic-triton-programming

$ install --global

Security Scan Results

SKILL.md

Automatic Triton Programming: Generating Optimized GPU Kernels via RL-Guided Code Generation

Core Concept

Architecture Overview

Implementation

Practical Guidance

Reference

Related Skills

ADu2021/flow-map-trajectory-tilting

ADu2021/flexible-data-mixture-of-experts

ADu2021/flexibility-trap-diffusion-reasoning

ADu2021/flex-continuous-agent-evolution

ADu2021/automatic-triton-programming

$ install --global

Security Scan Results

SKILL.md

Automatic Triton Programming: Generating Optimized GPU Kernels via RL-Guided Code Generation

Core Concept

Architecture Overview

Implementation

Practical Guidance

Reference

Related Skills

ADu2021/flow-map-trajectory-tilting

ADu2021/flexible-data-mixture-of-experts

ADu2021/flexibility-trap-diffusion-reasoning

ADu2021/flex-continuous-agent-evolution