a5c-ai/hip-rocm

Name: hip-rocm
Author: a5c-ai

library/specializations/gpu-programming/skills/hip-rocm/SKILL.md

npx skillsauth add a5c-ai/babysitter hip-rocm

hip-rocm

You are hip-rocm - a specialized skill for AMD HIP and ROCm ecosystem development. This skill provides expert capabilities for cross-platform GPU programming targeting AMD GPUs.

Overview

This skill enables AI-powered AMD GPU development including:

Execute hipify conversion tools (hipify-perl, hipify-clang)
Generate HIP-compatible kernel code
Handle CUDA/HIP API differences
Configure ROCm toolchain compilation
Profile with rocprof and omniperf
Support MI100/MI200/MI300 architectures
Maintain single-source NVIDIA/AMD code
Benchmark cross-platform performance

Prerequisites

ROCm 5.0+
HIP runtime
hipify tools
AMD GPU (or NVIDIA GPU with HIP)

Capabilities

1. CUDA to HIP Conversion

Convert CUDA code to HIP:

# Using hipify-perl (quick conversion)
hipify-perl cuda_file.cu > hip_file.cpp

# Using hipify-clang (more accurate)
hipify-clang cuda_file.cu -o hip_file.cpp

# Batch conversion
hipify-perl -inplace *.cu
hipconvertinplace.sh .

# Generate conversion statistics
hipify-perl --print-stats cuda_file.cu

# Exclude certain patterns (confirm this flag with `hipify-perl --help`; option support varies by version)
hipify-perl --skip-includes cuda_file.cu > hip_file.cpp

2. HIP Kernel Development

Write HIP-compatible kernels:

#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

// HIP kernel (portable to CUDA and AMD)
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    // Problem size and host buffers (example values)
    const int n = 1 << 20;
    const size_t size = n * sizeof(float);
    std::vector<float> h_a(n, 1.0f), h_b(n, 2.0f), h_c(n);

    // Allocate device memory
    float *d_a, *d_b, *d_c;
    hipMalloc(&d_a, size);
    hipMalloc(&d_b, size);
    hipMalloc(&d_c, size);

    // Copy to device
    hipMemcpy(d_a, h_a.data(), size, hipMemcpyHostToDevice);
    hipMemcpy(d_b, h_b.data(), size, hipMemcpyHostToDevice);

    // Launch kernel
    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;
    hipLaunchKernelGGL(vectorAdd, dim3(numBlocks), dim3(blockSize),
        0, 0, d_a, d_b, d_c, n);

    // Alternative launch syntax (same as CUDA); use one form or the other:
    // vectorAdd<<<numBlocks, blockSize>>>(d_a, d_b, d_c, n);

    // Check for launch errors, then synchronize and copy back
    hipError_t err = hipGetLastError();
    if (err != hipSuccess) {
        std::fprintf(stderr, "launch failed: %s\n", hipGetErrorString(err));
        return 1;
    }
    hipDeviceSynchronize();
    hipMemcpy(h_c.data(), d_c, size, hipMemcpyDeviceToHost);
    std::printf("h_c[0] = %f\n", h_c[0]);

    // Cleanup
    hipFree(d_a);
    hipFree(d_b);
    hipFree(d_c);
    return 0;
}
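
The launch configuration above rounds the block count up so that every element gets a thread, and the `idx < n` guard in the kernel masks off the surplus threads in the last block. The following host-only sketch spells out that arithmetic; `blocksFor` and `idleThreads` are hypothetical helper names used for illustration, not part of the HIP API.

```cpp
#include <cstddef>

// Hypothetical helper (not a HIP call): smallest block count such that
// numBlocks * blockSize >= n, i.e. ceiling division.
constexpr std::size_t blocksFor(std::size_t n, std::size_t blockSize) {
    return (n + blockSize - 1) / blockSize;
}

// Threads in the final block that fail the `idx < n` guard and do no work.
constexpr std::size_t idleThreads(std::size_t n, std::size_t blockSize) {
    return blocksFor(n, blockSize) * blockSize - n;
}
```

For example, 1000 elements with a block size of 256 need 4 blocks, and 24 threads in the last block are masked off by the bounds check.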

3. API Compatibility Macros

Handle CUDA/HIP differences:

// Platform detection
#ifdef __HIP_PLATFORM_AMD__
    // AMD-specific code
#elif defined(__HIP_PLATFORM_NVIDIA__)
    // NVIDIA HIP code
#elif defined(__CUDACC__)
    // Native CUDA code (nvcc); __CUDA_ARCH__ is only defined during device compilation
#endif

// Common compatibility header
#if defined(__HIPCC__) || defined(__HIP__)
    #include <hip/hip_runtime.h>
    #define DEVICE_SYNC hipDeviceSynchronize
    #define MALLOC hipMalloc
    #define FREE hipFree
    #define MEMCPY hipMemcpy
#else
    #include <cuda_runtime.h>
    #define DEVICE_SYNC cudaDeviceSynchronize
    #define MALLOC cudaMalloc
    #define FREE cudaFree
    #define MEMCPY cudaMemcpy
#endif

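// Warp size handling (compile-time default; query warpSize at run time when the exact width matters)
#ifdef __HIP_PLATFORM_AMD__
    #define WARP_SIZE 64  // AMD wavefront on MI100/MI200/MI300 (RDNA GPUs default to 32)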
#else
    #define WARP_SIZE 32  // NVIDIA warp
#endif

4. ROCm Compilation

Compile HIP code:

# Compile for AMD GPU
hipcc -o program program.cpp

# Specify target architecture
hipcc --offload-arch=gfx90a -o program program.cpp  # MI200
hipcc --offload-arch=gfx942 -o program program.cpp  # MI300

# Multiple targets
hipcc --offload-arch=gfx908 --offload-arch=gfx90a -o program program.cpp

# With optimization
hipcc -O3 -o program program.cpp

# Generate assembly
hipcc -S --offload-arch=gfx90a program.cpp

# Verbose compilation
hipcc -v -o program program.cpp

# CMake configuration
set(CMAKE_CXX_COMPILER hipcc)
set(GPU_TARGETS "gfx90a" CACHE STRING "GPU architectures")
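
When a build script has to assemble the multi-target command shown above, the only rule to respect is one `--offload-arch` flag per target. The sketch below builds such a command string; `hipccCommand` is a hypothetical helper for illustration and does not invoke the compiler.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: builds a hipcc command line with one --offload-arch
// flag per target, in the same form as the "Multiple targets" example.
std::string hipccCommand(const std::vector<std::string>& archs,
                         const std::string& source,
                         const std::string& output) {
    std::string cmd = "hipcc";
    for (const auto& arch : archs) {
        cmd += " --offload-arch=" + arch;
    }
    cmd += " -o " + output + " " + source;
    return cmd;
}
```

Calling `hipccCommand({"gfx908", "gfx90a"}, "program.cpp", "program")` reproduces the multi-target command line from the list above.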

5. Profiling with rocprof

Profile AMD GPU applications:

# Basic profiling
rocprof ./program

# Collect specific metrics
rocprof -i metrics.txt ./program

# Generate trace
rocprof --hip-trace ./program
rocprof --hsa-trace ./program

# System trace
rocprof --sys-trace ./program

# Summary statistics plus a trace (tracing runs also write a JSON trace file)
rocprof --stats --hip-trace ./program

# Metrics file example (metrics.txt)
# pmc: SQ_WAVES, SQ_INSTS_VALU, SQ_INSTS_SMEM
# pmc: TCC_HIT_sum, TCC_MISS_sum

6. Omniperf Analysis

Deep performance analysis:

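# Profile application (the workload command follows "--")
omniperf profile -n workload_name -- ./program

# Analyze profile (results are written under workloads/<name>/<arch>;
# the MI200 directory name used here is an example)
omniperf analyze -p workloads/workload_name/MI200

# Web-based GUI
omniperf analyze -p workloads/workload_name/MI200 --gui

# Compare profiles: pass the baseline path and a second workload path
omniperf analyze -p workloads/baseline/MI200 -p workloads/optimized/MI200

# Specific analysis sections: list the metric IDs, then filter by block
omniperf analyze -p workloads/workload_name/MI200 --list-metrics gfx90a
omniperf analyze -p workloads/workload_name/MI200 -b 2  # pick IDs from the --list-metrics output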

7. Architecture-Specific Optimization

Optimize for AMD architectures:

// Wave-aware reduction: start at half the wavefront width
// (32 on a 64-lane wavefront, 16 on a 32-lane one)
__device__ int waveReduceSum(int val) {
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        val += __shfl_down(val, offset);
    }
    return val;
}

// Use LDS (Local Data Share) efficiently
__shared__ __align__(16) float lds[256];

// Memory coalescing for AMD (256-byte granularity)
__global__ void coalescedKernel(float4* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float4 val = data[idx];  // 16-byte aligned load
        // Process...
        data[idx] = val;
    }
}

// Architecture-specific kernels (macros are defined only in device compilation for that target)
#if defined(__gfx90a__) || defined(__gfx942__)
    // MI200/MI300 optimizations
    // Use matrix cores (MFMA instructions)
#elif defined(__gfx908__)
    // MI100 optimizations
#endif
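
A shuffle-down reduction only sums the whole wavefront when its first offset is half the wavefront width, which is why code ported between 32-lane and 64-lane hardware needs review. The host-side model below is a sketch of that behaviour rather than device code: `simulateWaveReduce` is a hypothetical name, and it assumes the usual shuffle-down rule that a lane whose source index is out of range keeps its own value.

```cpp
#include <vector>

// Host-side model of one shuffle-down reduction over a single wavefront.
// `lanes` holds one value per lane; `startOffset` is the first shuffle distance.
// A lane whose source index falls outside the wavefront reads its own value.
// Returns the value lane 0 holds after the loop.
int simulateWaveReduce(std::vector<int> lanes, int startOffset) {
    const int width = static_cast<int>(lanes.size());
    for (int offset = startOffset; offset > 0; offset >>= 1) {
        std::vector<int> next = lanes;  // all lanes read before any lane writes
        for (int lane = 0; lane < width; ++lane) {
            const int src = lane + offset;
            const int shuffled = (src < width) ? lanes[src] : lanes[lane];
            next[lane] = lanes[lane] + shuffled;
        }
        lanes = next;
    }
    return lanes[0];
}
```

With every lane set to 1, a 64-lane wavefront reduced from offset 32 yields 64 in lane 0, but starting at offset 16 yields only 32 because the upper half is never read. Going the other way, a 32-lane warp reduced from offset 32 double-counts and also yields 64.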

8. hipBLAS and rocBLAS

GPU math libraries:

#include <hipblas/hipblas.h>
// Or for ROCm-native
#include <rocblas/rocblas.h>

hipblasHandle_t handle;
hipblasCreate(&handle);

// GEMM operation
float alpha = 1.0f, beta = 0.0f;
hipblasSgemm(handle,
    HIPBLAS_OP_N, HIPBLAS_OP_N,
    M, N, K,
    &alpha,
    d_A, M,
    d_B, K,
    &beta,
    d_C, M);

// rocBLAS with explicit stream
rocblas_handle roc_handle;
rocblas_create_handle(&roc_handle);
rocblas_set_stream(roc_handle, stream);

rocblas_sgemm(roc_handle,
    rocblas_operation_none, rocblas_operation_none,
    M, N, K,
    &alpha, d_A, M, d_B, K, &beta, d_C, M);
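
The leading dimensions passed above (`M` for A, `K` for B, `M` for C) follow from the column-major layout that BLAS-style libraries use. The CPU reference below is a minimal sketch for checking GPU results on small inputs; `referenceSgemm` is a hypothetical name, and it covers only the no-transpose case shown in the calls above.

```cpp
#include <vector>

// CPU reference for C = alpha * A * B + beta * C with no transposes, using
// column-major storage: element (row, col) of a matrix with leading
// dimension ld lives at index row + col * ld.
// A is M x K (lda = M), B is K x N (ldb = K), C is M x N (ldc = M).
void referenceSgemm(int M, int N, int K, float alpha,
                    const std::vector<float>& A, const std::vector<float>& B,
                    float beta, std::vector<float>& C) {
    for (int col = 0; col < N; ++col) {
        for (int row = 0; row < M; ++row) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += A[row + k * M] * B[k + col * K];
            }
            C[row + col * M] = alpha * acc + beta * C[row + col * M];
        }
    }
}
```

Note that a row-major array handed to these routines unchanged is interpreted as its transpose, which is a common source of wrong results when porting host code.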

9. RCCL Collective Operations

AMD's NCCL equivalent:

#include <rccl/rccl.h>

// Initialize RCCL (RCCL exposes the NCCL API, so symbols keep the nccl prefix)
ncclComm_t comm;
ncclUniqueId id;
ncclGetUniqueId(&id);  // call on one rank, then share id with the other ranks (e.g. via MPI)
ncclCommInitRank(&comm, worldSize, id, rank);

// All-reduce
ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, comm, stream);

// Cleanup
ncclCommDestroy(comm);
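
The all-reduce call above leaves every rank holding the element-wise sum of all ranks' send buffers. The host-only model below sketches that contract for testing expectations; `allReduceSum` is a hypothetical name, it assumes all buffers have the same length, and a real collective runs across devices or processes rather than over a vector of vectors.

```cpp
#include <cstddef>
#include <vector>

// Host-side model of an all-reduce with a sum operator.
// sendBuffers[r] stands in for rank r's send buffer; the result holds one
// receive buffer per rank. Assumes every buffer has the same length.
std::vector<std::vector<float>> allReduceSum(
        const std::vector<std::vector<float>>& sendBuffers) {
    if (sendBuffers.empty()) {
        return {};
    }
    const std::size_t count = sendBuffers[0].size();
    std::vector<float> total(count, 0.0f);
    for (const auto& buf : sendBuffers) {
        for (std::size_t i = 0; i < count; ++i) {
            total[i] += buf[i];
        }
    }
    // Every rank ends up with an identical copy of the reduced buffer.
    return std::vector<std::vector<float>>(sendBuffers.size(), total);
}
```

With three ranks sending {1, 2}, {10, 20}, and {100, 200}, each rank receives {111, 222}.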

Process Integration

This skill integrates with the following processes:

hip-porting-cross-platform.js - Cross-platform porting
multi-gpu-programming.js - Multi-GPU development

Output Format

{
  "operation": "hipify",
  "status": "success",
  "input_files": ["kernel.cu", "main.cu"],
  "output_files": ["kernel.cpp", "main.cpp"],
  "conversion_stats": {
    "cuda_calls_converted": 45,
    "manual_review_needed": 3,
    "warnings": ["__shfl_sync not directly portable to HIP"]
  },
  "target_architectures": ["gfx90a", "gfx942"],
  "recommendations": [
    "Review wavefront size (64 vs 32) in reduction kernels",
    "Consider using rocBLAS for BLAS operations"
  ]
}

Dependencies

ROCm 5.0+
HIP runtime
hipify-perl or hipify-clang
rocprof/omniperf (for profiling)

Constraints

Warp/wavefront size differs (32 vs 64)
Some CUDA intrinsics need manual porting
Texture memory API differs
CUDA-specific features may not port

AMD HIP and ROCm ecosystem for cross-platform GPU development. Execute hipify conversion tools, generate HIP-compatible kernel code, handle CUDA/HIP API differences, configure ROCm toolchain, and profile with rocprof.

514 stars

tools

Updated Apr 2, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add a5c-ai/babysitter hip-rocm

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status.

Trivy - Container and dependency vulnerability scanner: Clean (95%)
Semgrep - Static code analysis for vulnerabilities: Clean (95%)
mcp-scan (Snyk) - Model Context Protocol security validation: Clean (95%)
Snyk (dep) - Open source security scanning: Skipped (50%)
Socket.dev - Supply chain security analysis: Skipped (50%)
VirusTotal - Multi-engine malware detection: Skipped (50%)
CrowdStrike - Advanced threat intelligence: Skipped (50%)
OSV-Scanner - Open Source Vulnerability database check: Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: Apr 2, 2026, 12:29 PM (64.7s, 2 files scanned)

SKILL.md

name: hip-rocm
description: AMD HIP and ROCm ecosystem for cross-platform GPU development. Execute hipify conversion tools, generate HIP-compatible kernel code, handle CUDA/HIP API differences, configure ROCm toolchain, and profile with rocprof.
allowed-tools: Bash(*) Read Write Edit Glob Grep WebFetch
author: babysitter-sdk
version: 1.0.0
category: cross-platform
backlog-id: SK-009

Install manually

# Clone the repo
git clone https://github.com/a5c-ai/babysitter.git

# Copy into Claude Code skills folder (global)
cp -r babysitter/library/specializations/gpu-programming/skills/hip-rocm ~/.claude/skills/

Repository: a5c-ai/babysitter (514 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT