templates/setup/skills/load-inline-native-code/SKILL.md
Helps write CUDA and HIP kernels using torch.utils.cpp_extension.load_inline(). Use when users want to write native GPU code (CUDA/HIP) inside a Python submission file.
Use this skill when the user wants to write a custom CUDA or HIP kernel inside their Python submission file using torch.utils.cpp_extension.load_inline().
load_inline() compiles C++/CUDA/HIP source code at runtime and loads it as a Python module. This lets you write raw GPU kernels directly in your submission.py without a separate build system.
# CUDA example (NVIDIA)
import torch
from torch.utils.cpp_extension import load_inline
from task import input_t, output_t

CUDA_SRC = """
// Element-wise copy kernel, templated on the tensor's scalar type.
template <typename scalar_t>
__global__ void my_kernel(const scalar_t* __restrict__ input,
                          scalar_t* __restrict__ output,
                          int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {  // guard: the last block may have threads past the end
        output[idx] = input[idx];
    }
}

torch::Tensor my_op(torch::Tensor input, torch::Tensor output) {
    int N = input.numel();
    const int threads = 256;
    const int blocks = (N + threads - 1) / threads;  // ceiling division

    // Instantiate the kernel for float, double, and half inputs.
    AT_DISPATCH_FLOATING_TYPES_AND_HALF(input.scalar_type(), "my_kernel", ([&] {
        my_kernel<scalar_t><<<blocks, threads>>>(
            input.data_ptr<scalar_t>(),
            output.data_ptr<scalar_t>(),
            N
        );
    }));

    // Surface launch errors as an exception instead of failing silently.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        throw std::runtime_error(cudaGetErrorString(err));
    }
    return output;
}
"""

# Declaration of my_op for the C++ side of the extension.
CPP_SRC = """
torch::Tensor my_op(torch::Tensor input, torch::Tensor output);
"""

# Module level: compiled once, when the submission file is imported.
module = load_inline(
    name='my_module',
    cpp_sources=[CPP_SRC],
    cuda_sources=[CUDA_SRC],
    functions=['my_op'],
    verbose=True,
)

def custom_kernel(data: input_t) -> output_t:
    input, output = data
    return module.my_op(input, output)
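The launch configuration in my_op rounds the block count up so that every element gets a thread, which is why the kernel guards with idx < N. The following is a minimal pure-Python sketch of that arithmetic only. The helper name launch_config is hypothetical, and nothing here touches a GPU.

```python
def launch_config(n: int, threads: int = 256) -> tuple:
    """Return (blocks, idle_threads) for a 1-D launch over n elements.

    Hypothetical helper: mirrors `(N + threads - 1) / threads` from my_op.
    """
    blocks = (n + threads - 1) // threads  # ceiling division
    idle_threads = blocks * threads - n    # threads that fail the `idx < N` guard
    return blocks, idle_threads


for n in (1, 256, 257, 1000):
    blocks, idle = launch_config(n)
    print(f"N={n}: blocks={blocks}, idle threads={idle}")
```

For N=257 this gives 2 blocks with 255 idle threads. Without the bounds check, those idle threads would read and write past the end of the tensors.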
# HIP example (AMD, gfx942)
import os

# Set the build environment before torch is imported.
os.environ['PYTORCH_ROCM_ARCH'] = 'gfx942'
os.environ['CXX'] = 'clang++'

import torch
from torch.utils.cpp_extension import load_inline
from task import input_t, output_t

CUDA_SRC = """
#include <hip/amd_detail/amd_hip_bf16.h>

// Element-wise copy kernel for float tensors.
__global__ void my_kernel(const float* input, float* output, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {  // guard: the last block may have threads past the end
        output[idx] = input[idx];
    }
}

void my_op(torch::Tensor input, torch::Tensor output) {
    int N = input.numel();
    const int threads = 256;
    const int blocks = (N + threads - 1) / threads;  // ceiling division
    my_kernel<<<blocks, threads>>>(
        input.data_ptr<float>(),
        output.data_ptr<float>(),
        N
    );
}
"""

# Declaration of my_op for the C++ side of the extension.
CPP_SRC = """
void my_op(torch::Tensor input, torch::Tensor output);
"""

# Module level: compiled once, when the submission file is imported.
module = load_inline(
    name='my_module',
    cpp_sources=[CPP_SRC],
    cuda_sources=[CUDA_SRC],
    functions=['my_op'],
    verbose=True,
    extra_cuda_cflags=["--offload-arch=gfx942", "-std=c++20"],
)

def custom_kernel(data: input_t) -> output_t:
    input, output = data
    module.my_op(input, output)  # writes into output in place
    return output
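The HIP example depends on PYTORCH_ROCM_ARCH and CXX being set before torch is imported. One way to make that ordering harder to get wrong is a small guard like the sketch below. The helper configure_rocm_build is hypothetical (it is not part of PyTorch), and the default values are simply the ones used in the example above.

```python
import os
import sys
import warnings


def configure_rocm_build(arch: str = "gfx942", cxx: str = "clang++") -> bool:
    """Set the build-related env vars used in the HIP example.

    Hypothetical helper, not a PyTorch API. Returns True when torch had not
    been imported yet, i.e. when the ordering requirement is satisfied.
    """
    set_in_time = "torch" not in sys.modules
    if not set_in_time:
        warnings.warn(
            "torch is already imported; PYTORCH_ROCM_ARCH and CXX may be set too late"
        )
    os.environ["PYTORCH_ROCM_ARCH"] = arch
    os.environ["CXX"] = cxx
    return set_in_time


configure_rocm_build()
# Only after this point: import torch, then load_inline from torch.utils.cpp_extension.
```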
- Use extra_cuda_cflags for architecture flags (e.g. --offload-arch=gfx942) or C++ standard selection.
- Use AT_DISPATCH_FLOATING_TYPES_AND_HALF to handle multiple dtypes in CUDA kernels.
- On AMD targets, set the PYTORCH_ROCM_ARCH and CXX env vars before any torch import.
- Check cudaGetLastError() after kernel launches for NVIDIA targets.
- The load_inline call compiles on first run and caches the result. Subsequent runs reuse the cache unless the source changes.
- Place the load_inline() call outside custom_kernel(), at module level, so compilation happens once at import time, not on every call. Do not call load_inline() inside custom_kernel().
- The custom_kernel function signature must match def custom_kernel(data: input_t) -> output_t:.
- Use torch::Tensor in C++ signatures for seamless Python-C++ tensor passing.
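The placement advice for load_inline() can be illustrated without a GPU or compiler. In the sketch below, fake_load_inline is a hypothetical stand-in that only counts how often it is called. It is not the real API, and the list-based custom_kernel is a toy used solely to show the call pattern.

```python
compile_calls = 0


def fake_load_inline(name: str) -> dict:
    """Stand-in for load_inline() so this sketch runs anywhere."""
    global compile_calls
    compile_calls += 1
    return {"name": name}


# Module level: runs once, when the submission file is imported.
module = fake_load_inline("my_module")


def custom_kernel(data):
    # Reuses the module built above; nothing is compiled on this path.
    inp, out = data
    out[:] = inp
    return out


for _ in range(3):
    custom_kernel(([1, 2, 3], [0, 0, 0]))

print(compile_calls)  # 1
```

Moving the fake_load_inline call into custom_kernel would raise the count to 3 here, which is the per-call overhead the module-level placement avoids.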