skills/ncu-analysis/SKILL.md
Use when automating Nsight Compute (.ncu-rep) profiling, extracting metrics with ncu_report, comparing profiles, and diagnosing CUDA kernel bottlenecks.
npx skillsauth add miaodi/llm_config ncu-analysis
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Automate Nsight Compute profiling and report analysis so CUDA kernel performance regressions and bottlenecks can be found quickly and reproducibly.
Use for .ncu-rep generation, programmatic extraction of NCU metrics, profile-to-profile comparisons, CI performance checks, and memory-vs-compute bottleneck diagnosis.
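The memory-vs-compute diagnosis mentioned above can be sketched as a simple comparison of two throughput percentages, such as `sm__throughput.avg.pct_of_peak_sustained_elapsed` and `dram__throughput.avg.pct_of_peak_sustained_elapsed`. The function name `classify_bottleneck` and the `high` and `margin` thresholds are hypothetical, illustrative choices rather than values taken from Nsight Compute; treat the labels as a first triage hint, not a verdict.

```python
def classify_bottleneck(sm_pct: float, dram_pct: float,
                        high: float = 60.0, margin: float = 10.0) -> str:
    """Roughly label a kernel from two throughput percentages (0-100).

    `high` and `margin` are assumed thresholds; tune them for your workload.
    """
    if sm_pct < high and dram_pct < high:
        # Neither unit is near peak: often latency, occupancy, or launch overhead.
        return "latency-or-occupancy-bound"
    if dram_pct - sm_pct > margin:
        return "memory-bound"
    if sm_pct - dram_pct > margin:
        return "compute-bound"
    return "balanced"


print(classify_bottleneck(sm_pct=35.0, dram_pct=88.0))
print(classify_bottleneck(sm_pct=82.0, dram_pct=40.0))
print(classify_bottleneck(sm_pct=20.0, dram_pct=25.0))
```

A kernel labelled this way still deserves a look at cache hit rates and occupancy before any optimization decision is made.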
Generate .ncu-rep with stable commands and explicit output naming.
Extract metrics with ncu_report or ncu --csv where appropriate.
# 1) Produce report files
ncu --set full --target-processes all -o run_a ./your_binary --your-args
ncu --set full --target-processes all -o run_b ./your_binary --your-args
# 2) Optional quick CSV export (without Python API)
ncu --import run_a.ncu-rep --csv --page raw > run_a.csv
ncu --import run_b.ncu-rep --csv --page raw > run_b.csv
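If the Python module is not available, the CSV export above can be read with the standard library alone. The sketch below is a minimal example under two assumptions that should be checked against your own export: that the raw page has one column per metric plus a `Kernel Name` column, and that a units row (with an empty kernel name) follows the header. The helper name `read_raw_csv` and the inline sample data are hypothetical.

```python
import csv
import io


def read_raw_csv(text: str, metrics: list[str],
                 kernel_col: str = "Kernel Name") -> list[dict]:
    """Parse CSV text into rows shaped like {"kernel": ..., metric: value}."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        name = (rec.get(kernel_col) or "").strip()
        if not name:
            continue  # skip the assumed units row and any blank records
        row = {"kernel": name}
        for m in metrics:
            raw = rec.get(m)
            try:
                # Drop thousands separators in case the export contains them.
                row[m] = float(raw.replace(",", ""))
            except (AttributeError, ValueError):
                row[m] = None  # metric column missing or not numeric
        rows.append(row)
    return rows


# Made-up sample standing in for the contents of run_a.csv.
SAMPLE = '''"ID","Kernel Name","gpu__time_duration.sum","dram__throughput.avg.pct_of_peak_sustained_elapsed"
"","","nsecond","%"
"0","vec_add","12,345","71.5"
'''

wanted = ["gpu__time_duration.sum",
          "dram__throughput.avg.pct_of_peak_sustained_elapsed"]
rows = read_raw_csv(SAMPLE, wanted)
print(rows)
```

The rows use the same `kernel` key plus metric-name keys as the Python summary, so the same comparison logic can be applied to either source.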
# Requires Nsight Compute's Python module (commonly imported as ncu_report)
import json
# import ncu_report  # environment-specific import path

# Metric availability varies by GPU architecture and Nsight Compute version;
# check names with `ncu --query-metrics` before relying on them.
KEY_METRICS = [
    "gpu__time_duration.sum",
    "sm__throughput.avg.pct_of_peak_sustained_elapsed",
    "smsp__warps_active.avg.pct_of_peak_sustained_active",
    "dram__throughput.avg.pct_of_peak_sustained_elapsed",
    "l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_hit_rate.pct",
    "lts__t_sectors_srcunit_tex_op_read_lookup_hit_rate.pct",
]
def summarize_report(report_path: str) -> dict:
    # Pseudocode: adapt to the exact ncu_report API available in your environment.
    # report = ncu_report.load_report(report_path)
    # rng = report.range_by_idx(0)
    # kernels = [rng.action_by_idx(i) for i in range(rng.num_actions())]
    kernels = []
    rows = []
    for k in kernels:
        row = {"kernel": k.name()}
        for m in KEY_METRICS:
            # metric = k.metric_by_name(m)  # guard against metrics that were not collected
            # row[m] = metric.as_double() if metric is not None else None
            row[m] = None
        rows.append(row)
    return {"report": report_path, "rows": rows}
def compare_rows(a_rows: list[dict], b_rows: list[dict]) -> list[dict]:
    # Index the baseline (run A) by kernel name; duplicate names keep the last row.
    by_kernel_a = {r["kernel"]: r for r in a_rows}
    out = []
    for rb in b_rows:
        ra = by_kernel_a.get(rb["kernel"])
        if not ra:
            continue  # kernel only present in run B
        delta = {"kernel": rb["kernel"]}
        for k, vb in rb.items():
            if k == "kernel":
                continue
            va = ra.get(k)
            # Skip missing values and zero baselines (percent change undefined).
            if isinstance(va, (int, float)) and isinstance(vb, (int, float)) and va != 0:
                delta[f"{k}_delta"] = vb - va
                delta[f"{k}_pct"] = (vb - va) / abs(va) * 100.0
        out.append(delta)
    return out
# Example artifact shape
# summary_a = summarize_report("run_a.ncu-rep")
# summary_b = summarize_report("run_b.ncu-rep")
# comparison = compare_rows(summary_a["rows"], summary_b["rows"])
# print(json.dumps(comparison, indent=2))
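For the CI performance checks mentioned earlier, the comparison output can be turned into a pass/fail gate. This is a minimal sketch that assumes the `<metric>_pct` key naming produced by `compare_rows`; the function name `find_regressions`, the 5% default threshold, and the sample data are hypothetical.

```python
def find_regressions(comparison: list[dict],
                     metric: str = "gpu__time_duration.sum",
                     max_pct: float = 5.0) -> list[tuple[str, float]]:
    """Return (kernel, pct_change) pairs where `metric` grew by more than max_pct."""
    key = f"{metric}_pct"
    return [
        (d["kernel"], d[key])
        for d in comparison
        if d.get(key) is not None and d[key] > max_pct
    ]


# Made-up comparison rows in the shape compare_rows() returns.
comparison = [
    {"kernel": "vec_add",
     "gpu__time_duration.sum_delta": 300.0,
     "gpu__time_duration.sum_pct": 12.0},
    {"kernel": "reduce",
     "gpu__time_duration.sum_delta": -50.0,
     "gpu__time_duration.sum_pct": -2.0},
]

regressions = find_regressions(comparison)
for kernel, pct in regressions:
    print(f"REGRESSION {kernel}: +{pct:.1f}% gpu__time_duration.sum")

# In a CI job, pass this value to sys.exit() so the step fails on regressions.
exit_code = 1 if regressions else 0
```

Gating on duration alone keeps the check simple; for metrics where higher is better (hit rates, throughput) the comparison direction would need to be flipped.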
Provide: