miaodi/ncu-analysis

Name: ncu-analysis
Author: miaodi

skills/ncu-analysis/SKILL.md

npx skillsauth add miaodi/llm_config ncu-analysis

NCU Analysis Skill

Purpose

Automate Nsight Compute profiling and report analysis so CUDA kernel performance regressions and bottlenecks can be found quickly and reproducibly.

When To Use

Use for .ncu-rep generation, programmatic extraction of NCU metrics, profile-to-profile comparisons, CI performance checks, and memory-vs-compute bottleneck diagnosis.

Priorities

Preserve measurement validity before optimizing.
Keep profiling runs reproducible (same kernel, shape, device, clocks, and launch path).
Extract a minimal, decision-driving metric set first.
Separate measured facts from hypotheses.
Report kernel-level and end-to-end impact.

Workflow

Confirm experiment parity: same GPU, driver/CUDA stack, workload shape, warmup policy, and iteration count.
Capture .ncu-rep with stable commands and explicit output naming.
Extract key metrics (runtime, occupancy, memory throughput/utilization, cache behavior, stall reasons) using ncu_report or ncu --csv where appropriate.
Normalize and aggregate results by kernel name and shape, then export JSON/CSV artifacts.
Compare baseline vs candidate profile and compute deltas and percent changes.
Classify likely bottleneck type (memory-bound, compute-bound, latency/launch-bound, or occupancy-limited) using evidence from counters.
Propose the smallest high-confidence optimization and a re-profile validation plan.
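
The normalize-and-aggregate step above can be sketched as follows. This is a minimal illustration rather than part of the skill: the helper name aggregate_by_kernel, the "launches" field, and the choice to sum durations while averaging every other metric are assumptions. Rows are assumed to be per-launch dictionaries with a "kernel" key plus metric values. Grouping by workload shape as well would need an extra key that is not modeled here.

```python
import json
from collections import defaultdict

DURATION_METRIC = "gpu__time_duration.sum"


def aggregate_by_kernel(rows):
    """Group per-launch rows by kernel name (hypothetical helper).

    Durations are summed across launches; every other numeric metric is
    averaged. Missing (None) values are ignored.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    launches = defaultdict(int)
    for row in rows:
        name = row["kernel"]
        launches[name] += 1
        for key, value in row.items():
            if key != "kernel" and isinstance(value, (int, float)):
                grouped[name][key].append(value)

    out = []
    for name in sorted(launches):
        agg = {"kernel": name, "launches": launches[name]}
        for key, values in grouped[name].items():
            if key == DURATION_METRIC:
                agg[key] = sum(values)
            else:
                agg[key] = sum(values) / len(values)
        out.append(agg)
    return out


SM_METRIC = "sm__throughput.avg.pct_of_peak_sustained_elapsed"
example_rows = [
    {"kernel": "gemm", DURATION_METRIC: 100.0, SM_METRIC: 80.0},
    {"kernel": "gemm", DURATION_METRIC: 120.0, SM_METRIC: 70.0},
    {"kernel": "copy", DURATION_METRIC: 30.0, SM_METRIC: None},
]
aggregated = aggregate_by_kernel(example_rows)
print(json.dumps(aggregated, indent=2))
```

Summing durations keeps the end-to-end contribution of a kernel visible, while averaging percentage metrics avoids values above 100. A duration-weighted average may be more appropriate when launch sizes differ widely.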

Reference Commands

# 1) Produce report files
ncu --set full --target-processes all -o run_a ./your_binary --your-args
ncu --set full --target-processes all -o run_b ./your_binary --your-args

# 2) Optional quick CSV export (without Python API)
ncu --import run_a.ncu-rep --csv --page raw > run_a.csv
ncu --import run_b.ncu-rep --csv --page raw > run_b.csv
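
When the Python API is not available, the CSV export above can be read with the standard library. The sketch below is an assumption-laden illustration: the function name load_raw_csv is hypothetical, and it assumes the raw page has a "Kernel Name" column, an optional units row directly under the header, and possibly thousands separators in numeric cells. Check these against the CSV your NCU version actually produces.

```python
import csv
import io


def load_raw_csv(text, metrics, name_column="Kernel Name"):
    """Parse text exported with `ncu --csv --page raw` into per-launch rows.

    Assumes one header row and possibly a units row directly below it
    (identified here by an empty kernel-name cell).
    """
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        name = record.get(name_column, "")
        if not name:
            continue  # units row or malformed line
        row = {"kernel": name}
        for metric in metrics:
            raw = (record.get(metric) or "").replace(",", "")
            try:
                row[metric] = float(raw)
            except ValueError:
                row[metric] = None  # counter absent or non-numeric
        rows.append(row)
    return rows


# Synthetic sample for illustration only; not real profiler output.
sample = (
    '"ID","Kernel Name","gpu__time_duration.sum",'
    '"dram__throughput.avg.pct_of_peak_sustained_elapsed"\n'
    '"","","nsecond","%"\n'
    '"0","vector_add","12,345","61.5"\n'
)
csv_rows = load_raw_csv(
    sample,
    ["gpu__time_duration.sum", "dram__throughput.avg.pct_of_peak_sustained_elapsed"],
)
print(csv_rows)
```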

Programmatic Extraction Pattern

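# Requires Nsight Compute's Python module, ncu_report. It typically ships in
# the extras/python directory of the Nsight Compute installation; add that
# directory to PYTHONPATH.
import json

try:
    import ncu_report
except ImportError:  # compare_rows() stays usable without Nsight Compute
    ncu_report = None

# Metric names and availability vary by GPU architecture and NCU version.
# Confirm them with `ncu --query-metrics` before relying on this list.
KEY_METRICS = [
    "gpu__time_duration.sum",
    "sm__throughput.avg.pct_of_peak_sustained_elapsed",
    "smsp__warps_active.avg.pct_of_peak_sustained_active",
    "dram__throughput.avg.pct_of_peak_sustained_elapsed",
    "l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_hit_rate.pct",
    "lts__t_sectors_srcunit_tex_op_read_lookup_hit_rate.pct",
]


def summarize_report(report_path: str) -> dict:
    """Collect KEY_METRICS for every profiled kernel launch in a report."""
    if ncu_report is None:
        raise RuntimeError(
            "ncu_report is not importable; add Nsight Compute's "
            "extras/python directory to PYTHONPATH"
        )
    report = ncu_report.load_report(report_path)
    rows = []
    for range_idx in range(report.num_ranges()):
        current_range = report.range_by_idx(range_idx)
        for action_idx in range(current_range.num_actions()):
            action = current_range.action_by_idx(action_idx)
            row = {"kernel": action.name()}
            for m in KEY_METRICS:
                metric = action.metric_by_name(m)
                # A counter missing from the report stays None so the gap is visible.
                row[m] = metric.as_double() if metric is not None else None
            rows.append(row)
    return {"report": report_path, "rows": rows}


def _is_number(value) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def compare_rows(a_rows: list[dict], b_rows: list[dict]) -> list[dict]:
    """Return per-kernel deltas of candidate (b) against baseline (a)."""
    # If a kernel name repeats, the last launch wins; aggregate per kernel
    # beforehand when a kernel is launched more than once.
    by_kernel_a = {r["kernel"]: r for r in a_rows}
    out = []
    for rb in b_rows:
        ra = by_kernel_a.get(rb["kernel"])
        if ra is None:
            continue  # kernel exists only in the candidate run
        delta = {"kernel": rb["kernel"]}
        for k, vb in rb.items():
            if k == "kernel":
                continue
            va = ra.get(k)
            if not (_is_number(va) and _is_number(vb)):
                continue
            delta[f"{k}_delta"] = vb - va
            if va != 0:  # percent change is undefined for a zero baseline
                delta[f"{k}_pct"] = (vb - va) / abs(va) * 100.0
        out.append(delta)
    return out


if __name__ == "__main__":
    # Example artifact shape
    summary_a = summarize_report("run_a.ncu-rep")
    summary_b = summarize_report("run_b.ncu-rep")
    comparison = compare_rows(summary_a["rows"], summary_b["rows"])
    print(json.dumps(comparison, indent=2))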

Bottleneck Heuristics

Memory-bound signal: high DRAM throughput utilization with low SM throughput and low arithmetic utilization.
Compute-bound signal: high SM throughput with relatively lower memory pressure.
Occupancy-limited signal: low active warps/occupancy with register or shared-memory pressure clues.
Cache inefficiency signal: low L1/L2 hit rates with elevated memory transactions.
Launch/latency issues: short kernels dominated by launch/synchronization overhead in the end-to-end timeline.
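
The counter-based heuristics above can be turned into a first-pass classifier. This is a rough sketch under stated assumptions: the function name classify and the 60%/40% thresholds are hypothetical placeholders, not values from NCU or from this skill, and should be tuned per device. Cache inefficiency and launch/latency issues are left out because they need hit-rate context and end-to-end timeline data respectively.

```python
SM = "sm__throughput.avg.pct_of_peak_sustained_elapsed"
DRAM = "dram__throughput.avg.pct_of_peak_sustained_elapsed"
WARPS = "smsp__warps_active.avg.pct_of_peak_sustained_active"


def classify(row, high=60.0, low=40.0):
    """Return a tentative bottleneck label for one kernel row.

    `high` and `low` are illustrative percent-of-peak thresholds.
    """
    sm, dram, warps = row.get(SM), row.get(DRAM), row.get(WARPS)
    if sm is None or dram is None:
        return "unknown (required counters missing)"
    if dram >= high and sm < low:
        return "memory-bound"
    if sm >= high and dram < sm:
        return "compute-bound"
    if warps is not None and warps < low:
        return "occupancy-limited"
    return "inconclusive"


labels = [
    classify({SM: 20.0, DRAM: 85.0, WARPS: 70.0}),
    classify({SM: 90.0, DRAM: 30.0}),
    classify({SM: 30.0, DRAM: 30.0, WARPS: 20.0}),
    classify({SM: 50.0}),
    classify({SM: 50.0, DRAM: 50.0, WARPS: 80.0}),
]
print(labels)
```

The label is a hypothesis to confirm with further counters and a re-profile, in line with the rule to separate measured facts from hypotheses.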

Review Checklist

Are baseline and candidate runs comparable (shape, clocks, software stack, warmup, repetitions)?
Were hot kernels identified by absolute time contribution, not just percentages?
Are metric names and units recorded exactly as reported by NCU?
Did comparison output include both absolute deltas and percent deltas?
Are conclusions tied to measured counters instead of assumptions?
Is there a concrete validation loop (change, re-profile, confirm total runtime impact)?

Constraints

Do not claim wins without before/after measurements under matching conditions.
Do not over-interpret a single red metric without checking total runtime contribution.
Do not compare profiles across different hardware or incompatible software stacks.
Keep metric sets stable across runs to preserve comparability.
Call out uncertainty when required counters are missing.
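
Some of these constraints can be checked mechanically before any comparison is reported. The sketch below is one possible approach, assuming rows shaped as a "kernel" key plus metric values; the function name check_comparability and the warning wording are hypothetical. It cannot detect hardware or software-stack differences, which must be recorded separately.

```python
def check_comparability(a_rows, b_rows, required):
    """Return warnings about metric-set drift, missing counters, and kernels
    that appear in only one run. An empty list means these checks passed."""

    def present(rows):
        # Metrics that have a non-None value in at least one row.
        keys = set()
        for r in rows:
            keys |= {k for k, v in r.items() if k != "kernel" and v is not None}
        return keys

    warnings = []
    metrics_a, metrics_b = present(a_rows), present(b_rows)
    for m in sorted(metrics_a ^ metrics_b):
        warnings.append(f"metric collected in only one run: {m}")
    for m in required:
        if m not in metrics_a or m not in metrics_b:
            warnings.append(f"required counter missing: {m}")
    kernels_a = {r["kernel"] for r in a_rows}
    kernels_b = {r["kernel"] for r in b_rows}
    for k in sorted(kernels_a ^ kernels_b):
        warnings.append(f"kernel present in only one run: {k}")
    return warnings


TIME = "gpu__time_duration.sum"
DRAM_PCT = "dram__throughput.avg.pct_of_peak_sustained_elapsed"
baseline = [{"kernel": "gemm", TIME: 1.0, DRAM_PCT: 50.0}]
candidate = [
    {"kernel": "gemm", TIME: 0.9, DRAM_PCT: None},
    {"kernel": "copy", TIME: 0.1, DRAM_PCT: None},
]
issues = check_comparability(baseline, candidate, [TIME, DRAM_PCT])
for issue in issues:
    print(issue)
```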

Output

Provide:

profiling command(s) used and environment assumptions
top kernels by runtime contribution
extracted key metrics per kernel
baseline vs candidate deltas (absolute and percent)
bottleneck classification with evidence
next optimization action and validation steps
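
The "top kernels by runtime contribution" item can be produced with a short ranking helper. This is a minimal sketch, assuming per-launch rows that carry gpu__time_duration.sum; the function name top_kernels and the output field names are hypothetical. It reports absolute time alongside the share so that hot kernels are not judged by percentages alone.

```python
def top_kernels(rows, n=5, duration_key="gpu__time_duration.sum"):
    """Rank kernels by total duration across launches (hypothetical helper).

    Durations are kept in whatever unit the report uses.
    """
    totals = {}
    for r in rows:
        d = r.get(duration_key)
        if isinstance(d, (int, float)):
            totals[r["kernel"]] = totals.get(r["kernel"], 0.0) + d
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return [
        {
            "kernel": name,
            "total_time": total,
            "share_pct": total * 100.0 / grand_total if grand_total else 0.0,
        }
        for name, total in ranked
    ]


launch_rows = [
    {"kernel": "gemm", "gpu__time_duration.sum": 300.0},
    {"kernel": "copy", "gpu__time_duration.sum": 250.0},
    {"kernel": "gemm", "gpu__time_duration.sum": 300.0},
    {"kernel": "relu", "gpu__time_duration.sum": 150.0},
]
top = top_kernels(launch_rows, n=2)
for entry in top:
    print(entry)
```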

Use when automating Nsight Compute (.ncu-rep) profiling, extracting metrics with ncu_report, comparing profiles, and diagnosing CUDA kernel bottlenecks.

business

Updated May 12, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add miaodi/llm_config ncu-analysis

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status.

Trivy: container and dependency vulnerability scanner. Clean (95%).
Semgrep: static code analysis for vulnerabilities. Clean (95%).
mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%).
Snyk (dep): open source security scanning. Skipped (50%).
Socket.dev: supply chain security analysis. Skipped (50%).
VirusTotal: multi-engine malware detection. Skipped (50%).
CrowdStrike: advanced threat intelligence. Skipped (50%).
OSV-Scanner: Open Source Vulnerability database check. Skipped (50%).
OWASP Dep-Check. Skipped (50%).

Last scanned: May 12, 2026, 4:26 AM (168.2s, 1 file scanned)

Install manually

# Clone the repo
git clone https://github.com/miaodi/llm_config.git

# Copy into Claude Code skills folder (global); create the folder first if it does not exist
mkdir -p ~/.claude/skills
cp -r llm_config/skills/ncu-analysis ~/.claude/skills/

Repository

miaodi/llm_config

Compatible with

Claude Code, OpenAI Codex CLI, ChatGPT