Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

miaodi/cuda-performance

Name: cuda-performance
Author: miaodi

skills/cuda-performance/SKILL.md

npx skillsauth add miaodi/llm_config cuda-performance

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

CUDA Performance Skill

Purpose

Optimize CUDA code for throughput, latency, memory efficiency, and predictable scaling while preserving correctness.

When To Use

Use for kernel optimization, CUDA profiler report analysis, launch configuration tuning, memory hierarchy tuning, transfer reduction, and end-to-end GPU pipeline improvement.

Priorities

Preserve correctness.
Measure before optimizing.
Optimize memory behavior before instruction-level tuning.
Prefer shared memory when it meaningfully reduces global memory traffic or enables cooperative data reuse.
Prefer 1D thread/block launch geometry and threadIdx.x indexing unless multidimensional launch geometry is required by an existing interface or clearly improves correctness/readability.
Optimize end-to-end runtime, not just isolated kernel duration.

Workflow

Identify the dominant kernels and transfer costs from profiling reports or benchmarks.
Confirm the target metric: latency, throughput, utilization, or total pipeline time.
Check global memory access patterns for coalescing and unnecessary traffic.
Check whether shared memory can be used for tiling, staging, reuse, or cooperative access within a block.
Check shared memory bank conflicts, synchronization cost, and occupancy impact.
Check register pressure, launch configuration, and whether occupancy is being limited by registers or shared memory.
For new or refactored kernels, prefer a flat global thread id from blockIdx.x * blockDim.x + threadIdx.x; convert that id to logical 2D or 3D coordinates inside the kernel only when needed.
Check divergence, atomics, reductions, and other serialization points.
Check whether host-device transfers can be reduced, fused, or overlapped with compute.
Propose the smallest high-confidence optimization first.
Re-profile under the same conditions and report both kernel-level and end-to-end impact.

CUDA Source File Policy

Prefer .cpp / .hpp for host-side orchestration, CUDA runtime or toolkit-library calls, and code that uses or launches existing kernels through host-callable project APIs.
Create or convert to .cu only when implementing or refactoring kernel/device code such as __global__ / __device__ functions, device lambdas, shared-memory tiling, warp-level device logic, or when following an existing project convention for CUDA-only implementation files.
When existing kernel launch plumbing requires CUDA compilation, prefer reusing or extending the project's existing .cu wrapper boundary instead of creating new .cu callers by default.
When adding new GPU behavior to a project that already separates kernels from callers, keep kernel implementations in .cu and put host-only integration, scheduling, validation, and higher-level API code in .cpp.

GPU Execution And Sandbox Access

When running a CUDA executable, test, benchmark, or profiler command inside the sandbox, treat "no GPU found" style failures as possibly caused by sandboxed device access rather than absent hardware.
This includes errors such as no CUDA-capable device, cudaGetDeviceCount returning zero, CUDA_ERROR_NO_DEVICE, missing /dev/nvidia*, inaccessible nvidia-smi, or profiler failures that cannot see a GPU.
If the command otherwise appears valid, ask to rerun the same executable or profiler command outside the sandbox with escalated permissions before concluding that the machine has no compatible GPU.
Keep the rerun command as close as possible to the failing command so performance, correctness, and profiler results remain comparable.

Profile Report Analysis

When given Nsight Compute (ncu), Nsight Systems (nsys), CUPTI, or benchmark profiler output:

Treat the report as the source of truth and cite the specific kernel, metric, table, or timeline observation driving each conclusion.
Separate kernel time, API time, transfer time, synchronization gaps, and total application time.
Rank findings by measured impact, not by how interesting the metric looks.
For Nsight Compute, map symptoms to likely causes: low memory throughput, poor coalescing, high replay, bank conflicts, low achieved occupancy, register pressure, divergence, instruction mix, or dependency stalls.
For Nsight Systems, look for CPU-side launch overhead, serialized streams, implicit synchronizations, transfer/compute overlap gaps, allocator overhead, and idle GPU intervals.
Distinguish measured facts from hypotheses; propose the next profiling counter or experiment when evidence is incomplete.
Avoid recommending kernel rewrites from a single red metric unless it explains a meaningful share of runtime.

Review Checklist

Are global memory loads and stores coalesced?
Is shared memory reducing expensive global traffic or enabling meaningful reuse?
Are shared memory bank conflicts present?
Is synchronization within the block justified and minimized?
Is occupancy limited by registers, shared memory, or launch configuration?
Does the kernel use a 1D launch and threadIdx.x where practical, mapping from flat ids to logical 2D/3D coordinates only when needed?
Is warp divergence hurting throughput?
Are atomics, reductions, or serialization points on the hot path?
Is register pressure preventing useful parallelism?
Are host-device transfers excessive or poorly overlapped?
Is the optimization improving total runtime rather than only one kernel metric?
Does the profile show whether the bottleneck is inside kernels, transfers, synchronization, launch overhead, or CPU-side orchestration?
If a CUDA run reports no GPU, was sandboxed device access ruled out by asking to rerun the same command outside the sandbox?

Constraints

Do not claim performance improvements without profiling or benchmark evidence.
Do not overfit to profiler percentages without checking absolute time and total pipeline impact.
Do not use shared memory by default if coalesced global access and cache behavior are already sufficient.
Do not optimize for occupancy alone; verify the actual bottleneck.
Do not introduce synchronization or shared memory complexity unless the reuse benefit is plausible.
Do not introduce threadIdx.y, threadIdx.z, blockIdx.y, or blockIdx.z by default; prefer flat 1D indexing and explain any exception.
State tradeoffs in maintainability, numerical behavior, portability, and hardware sensitivity.
Call out race-condition risks, reduction-order changes, and determinism issues when relevant.

Output

Provide:

the likely bottleneck
whether it is memory-bound, compute-bound, or transfer-bound
the profiler evidence behind that classification
the proposed shared-memory, launch, or memory-access optimization
why it is expected to help
the main risks or tradeoffs
the validation plan

miaodi/cuda-performance

skills/cuda-performance/SKILL.md

Use when optimizing CUDA kernels, analyzing CUDA profiler reports, GPU memory hierarchy, shared memory, coalescing, launch configuration, occupancy, warp divergence, host-device transfers, and end-to-end GPU pipeline.

development

Updated May 16, 2026

$ install --global

skillsauth

npx skillsauth add miaodi/llm_config cuda-performance

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: May 16, 2026, 4:30 AM166.8s1 file scanned

SKILL.md

name:: cuda-performance
description:: Use when optimizing CUDA kernels, analyzing CUDA profiler reports, GPU memory hierarchy, shared memory, coalescing, launch configuration, occupancy, warp divergence, host-device transfers, and end-to-end GPU pipeline.

CUDA Performance Skill

Purpose

Optimize CUDA code for throughput, latency, memory efficiency, and predictable scaling while preserving correctness.

When To Use

Use for kernel optimization, CUDA profiler report analysis, launch configuration tuning, memory hierarchy tuning, transfer reduction, and end-to-end GPU pipeline improvement.

Priorities

Preserve correctness.
Measure before optimizing.
Optimize memory behavior before instruction-level tuning.
Prefer shared memory when it meaningfully reduces global memory traffic or enables cooperative data reuse.
Prefer 1D thread/block launch geometry and threadIdx.x indexing unless multidimensional launch geometry is required by an existing interface or clearly improves correctness/readability.
Optimize end-to-end runtime, not just isolated kernel duration.

Workflow

Identify the dominant kernels and transfer costs from profiling reports or benchmarks.
Confirm the target metric: latency, throughput, utilization, or total pipeline time.
Check global memory access patterns for coalescing and unnecessary traffic.
Check whether shared memory can be used for tiling, staging, reuse, or cooperative access within a block.
Check shared memory bank conflicts, synchronization cost, and occupancy impact.
Check register pressure, launch configuration, and whether occupancy is being limited by registers or shared memory.
For new or refactored kernels, prefer a flat global thread id from blockIdx.x * blockDim.x + threadIdx.x; convert that id to logical 2D or 3D coordinates inside the kernel only when needed.
Check divergence, atomics, reductions, and other serialization points.
Check whether host-device transfers can be reduced, fused, or overlapped with compute.
Propose the smallest high-confidence optimization first.
Re-profile under the same conditions and report both kernel-level and end-to-end impact.

CUDA Source File Policy

Prefer .cpp / .hpp for host-side orchestration, CUDA runtime or toolkit-library calls, and code that uses or launches existing kernels through host-callable project APIs.
Create or convert to .cu only when implementing or refactoring kernel/device code such as __global__ / __device__ functions, device lambdas, shared-memory tiling, warp-level device logic, or when following an existing project convention for CUDA-only implementation files.
When existing kernel launch plumbing requires CUDA compilation, prefer reusing or extending the project's existing .cu wrapper boundary instead of creating new .cu callers by default.
When adding new GPU behavior to a project that already separates kernels from callers, keep kernel implementations in .cu and put host-only integration, scheduling, validation, and higher-level API code in .cpp.

GPU Execution And Sandbox Access

When running a CUDA executable, test, benchmark, or profiler command inside the sandbox, treat "no GPU found" style failures as possibly caused by sandboxed device access rather than absent hardware.
This includes errors such as no CUDA-capable device, cudaGetDeviceCount returning zero, CUDA_ERROR_NO_DEVICE, missing /dev/nvidia*, inaccessible nvidia-smi, or profiler failures that cannot see a GPU.
If the command otherwise appears valid, ask to rerun the same executable or profiler command outside the sandbox with escalated permissions before concluding that the machine has no compatible GPU.
Keep the rerun command as close as possible to the failing command so performance, correctness, and profiler results remain comparable.

Profile Report Analysis

When given Nsight Compute (ncu), Nsight Systems (nsys), CUPTI, or benchmark profiler output:

Treat the report as the source of truth and cite the specific kernel, metric, table, or timeline observation driving each conclusion.
Separate kernel time, API time, transfer time, synchronization gaps, and total application time.
Rank findings by measured impact, not by how interesting the metric looks.
For Nsight Compute, map symptoms to likely causes: low memory throughput, poor coalescing, high replay, bank conflicts, low achieved occupancy, register pressure, divergence, instruction mix, or dependency stalls.
For Nsight Systems, look for CPU-side launch overhead, serialized streams, implicit synchronizations, transfer/compute overlap gaps, allocator overhead, and idle GPU intervals.
Distinguish measured facts from hypotheses; propose the next profiling counter or experiment when evidence is incomplete.
Avoid recommending kernel rewrites from a single red metric unless it explains a meaningful share of runtime.

Review Checklist

Are global memory loads and stores coalesced?
Is shared memory reducing expensive global traffic or enabling meaningful reuse?
Are shared memory bank conflicts present?
Is synchronization within the block justified and minimized?
Is occupancy limited by registers, shared memory, or launch configuration?
Does the kernel use a 1D launch and threadIdx.x where practical, mapping from flat ids to logical 2D/3D coordinates only when needed?
Is warp divergence hurting throughput?
Are atomics, reductions, or serialization points on the hot path?
Is register pressure preventing useful parallelism?
Are host-device transfers excessive or poorly overlapped?
Is the optimization improving total runtime rather than only one kernel metric?
Does the profile show whether the bottleneck is inside kernels, transfers, synchronization, launch overhead, or CPU-side orchestration?
If a CUDA run reports no GPU, was sandboxed device access ruled out by asking to rerun the same command outside the sandbox?

Constraints

Do not claim performance improvements without profiling or benchmark evidence.
Do not overfit to profiler percentages without checking absolute time and total pipeline impact.
Do not use shared memory by default if coalesced global access and cache behavior are already sufficient.
Do not optimize for occupancy alone; verify the actual bottleneck.
Do not introduce synchronization or shared memory complexity unless the reuse benefit is plausible.
Do not introduce threadIdx.y, threadIdx.z, blockIdx.y, or blockIdx.z by default; prefer flat 1D indexing and explain any exception.
State tradeoffs in maintainability, numerical behavior, portability, and hardware sensitivity.
Call out race-condition risks, reduction-order changes, and determinism issues when relevant.

Output

Provide:

the likely bottleneck
whether it is memory-bound, compute-bound, or transfer-bound
the profiler evidence behind that classification
the proposed shared-memory, launch, or memory-access optimization
why it is expected to help
the main risks or tradeoffs
the validation plan

Related Skills

miaodi/computational-learning-notes

development

VerifiedTrustedCommunity

Use when creating C++ learning notes or minimal experiments for low-level computational, numerical, CPU/GPU, compiler, and hardware concepts such as false sharing, floating point, registers, caches, SIMD, atomics, numerical stability, and benchmarking pitfalls.

SKILL.mdUpdated Jun 2, 2026

miaodi/computational-learning-notes

miaodi/latex-project-build

development

VerifiedTrustedCommunity

Use when configuring, diagnosing, or compiling LaTeX projects, especially multi-file reports, theses, books, chapter-based projects, Overleaf exports, latexmk/arara/Makefile workflows, bibliography/index/glossary passes, or projects that require pdflatex, xelatex, lualatex, latex->dvips, biber, or bibtex.

SKILL.mdUpdated May 28, 2026

miaodi/latex-project-build

miaodi/graph-algorithms

development

VerifiedTrustedCommunity

Use when working with graph traversals (BFS, DFS, level-order), minimum spanning trees, strongly connected components, topological sort, graph coloring, bipartite detection, elimination trees, level-set extraction, parallel graph algorithms, task-tree parallelism, sparse graph representations, and exploiting graph structure for parallel sparse computations.

SKILL.mdUpdated May 21, 2026

miaodi/graph-algorithms

miaodi/git-workflow

testing

VerifiedTrustedCommunity

Use when planning or executing Git branch workflows, especially merge/rebase across branches, conflict resolution, safe history rewriting, and recovery from mistakes.

SKILL.mdUpdated May 21, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/miaodi/llm_config.git

# Copy into Claude Code skills folder (global)
cp -r llm_config/skills/cuda-performance ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

miaodi/llm_config

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT