Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

Ascend/triton-operator-performance-optim

Name: triton-operator-performance-optim
Author: Ascend

skills/triton-operator-performance-optim/SKILL.md

npx skillsauth add Ascend/agent-skills triton-operator-performance-optim

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Triton 算子性能优化（Ascend NPU）

底线（不可突破）

精度：优化后 rtol=1e-3, atol=1e-3 对齐 PyTorch-NPU。不通过则回退。
泛化性：支持原有所有输入形状和 dtype，不能 hardcode 特定尺寸。

性能比定义：Ratio = torch_npu 耗时 / Triton 耗时（耗时倒数）。Ratio > 1.0 表示 Triton 更快。

优先级：正确性 > 泛化性 > 性能。

优化工作流

Phase 0: 算法审视

优化前先审视算法本身。低效算法再优化也有先天不足。

Phase 1: 分层评估

快速筛选：time.time() 测端到端（覆盖小/中/大），达标则完成
精确诊断：不达标时用 msprof 测 kernel 侧耗时，定位真正瓶颈

Phase 2: 瓶颈优化

| 瓶颈 | 优化重点 | |------|---------| | Memory-Bound | 向量化访存、UB 缓存复用、算子融合 | | Compute-Bound | Cube 适配、Block 尺寸调优 | | Latency-Bound | 增大并行度、减少同步 |

基础四板斧（按顺序）：Block/Grid Size → 连续访存 → UB 复用 → 编译时常量

加载：optimization-patterns.md, ascend-terminology.md

Phase 3: 硬件特化

Cube：BLOCK_M/N/K 为 16 倍数，累加器 FP32
UB：缓冲区总大小 < 192KB，单值缓冲区 32B 对齐
Grid：1D Grid ≤ 物理核数，核内循环处理多行
对角线调度：大矩阵（BLOCK_THRESHOLD 以上）使用对角线 Grid 调度提升 L2 缓存命中率
多 Vector Core：post-dot 操作使用 tl.parallel(bind_sub_block=True) 分配到 2 个 vector cores

加载：triton-ascend-api.md, tiling-strategies.md, triton-api-reference.md

Phase 4: 高级优化（按需）

算子融合、Double Buffer（tl.multibuffer(tensor, 2) 或 tl.compile_hint(tensor, "multi_buffer", 2)）

Phase 5: 验证（MANDATORY）

精度 + 泛化性 + 性能 + 端到端回归

反模式清单（NEVER）

❌ 仅凭单一规模数据做优化决策
❌ 端到端不达标时直接优化 kernel（应先 msprof 确认瓶颈）
❌ 为性能牺牲精度 / hardcode 破坏泛化性
❌ FP16 直接归约 / 非 16 倍数 BLOCK 做矩阵乘
❌ BLOCK_SIZE 超 UB（192KB）/ 非连续访存
❌ 热路径用 tensor.item()（触发 CPU-NPU 同步）
❌ 循环内用 if 分支修改变量（Triton 编译为 masked 操作，灾难性性能下降）
❌ 2D Tiling 只算数据 buffer 的 UB（必须包含 offset/mask/index 数组）
❌ 用预计算的 offset tensor 做 2D broadcasting（触发编译器 addptr 多用户 assertion）
❌ kernel 内用 broadcast stride 访问辅助张量（cos/sin 等），应改为 host 侧 expand+contiguous。expand 额外内存 < 非连续访存的性能损失
❌ 归约类算子用双 pass（多次 load 同一数据分别算统计量和归一化）——应单 pass，一次 load 后 UB 内全计算，详见踩坑 8
❌ msprof 对比时两个 kernel 同名——msprof 按 OP Type 聚合，同名会混在一起
❌ 大矩阵不用对角线调度（L2 缓存颠簸，BLOCK_THRESHOLD 以上必须启用）

检查清单

[ ] 精度对齐 PyTorch-NPU（rtol=1e-3, atol=1e-3）
[ ] 非对齐维度和边界通过
[ ] 性能测试覆盖小/中/大
[ ] grid ≤ 物理核数，BLOCK_SIZE 为编译时常量
[ ] 缓冲区 < 192KB，所有 load/store 有 Mask
[ ] 归约升 FP32，矩阵乘 BLOCK 为 16 倍数
[ ] 归约类算子是否单 pass（D ≤ UB 时必须）

常见瓶颈速查

| msprof 指标 | 瓶颈 | 典型优化 | |-------------|------|---------| | aiv_scalar > 80% | Scalar Bound | 检查双 pass / 逐行循环+tl.where 累加，改单 pass | | aiv_mte2 > 50% | Memory Bound | 连续访存、expand+contiguous、增大 BLOCK | | aiv_vec > 50% | Compute Bound | 算法优化、减少冗余计算 | | aic_cube_ratio < 50% | Cube 利用率低 | 检查对齐（512B/元素大小）、BLOCK 是否 16 倍数、用 compile_hint('dot_pad_only_k') |

Ascend/triton-operator-performance-optim

skills/triton-operator-performance-optim/SKILL.md

优化 Ascend NPU 亲和的 Triton 算子性能。解决 UB 溢出、提高 Cube 利用率、Tiling 策略设计。关键词：性能优化、performance optimization、tiling、UB。

13 stars

tools

Updated May 1, 2026

$ install --global

skillsauth

npx skillsauth add Ascend/agent-skills triton-operator-performance-optim

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: May 1, 2026, 2:44 AM140.7s5 files scanned

SKILL.md

name:: triton-operator-performance-optim
description:: 优化 Ascend NPU 亲和的 Triton 算子性能。解决 UB 溢出、提高 Cube 利用率、Tiling 策略设计。关键词：性能优化、performance optimization、tiling、UB。

Triton 算子性能优化（Ascend NPU）

底线（不可突破）

精度：优化后 rtol=1e-3, atol=1e-3 对齐 PyTorch-NPU。不通过则回退。
泛化性：支持原有所有输入形状和 dtype，不能 hardcode 特定尺寸。

性能比定义：Ratio = torch_npu 耗时 / Triton 耗时（耗时倒数）。Ratio > 1.0 表示 Triton 更快。

优先级：正确性 > 泛化性 > 性能。

优化工作流

Phase 0: 算法审视

优化前先审视算法本身。低效算法再优化也有先天不足。

Phase 1: 分层评估

快速筛选：time.time() 测端到端（覆盖小/中/大），达标则完成
精确诊断：不达标时用 msprof 测 kernel 侧耗时，定位真正瓶颈

Phase 2: 瓶颈优化

基础四板斧（按顺序）：Block/Grid Size → 连续访存 → UB 复用 → 编译时常量

加载：optimization-patterns.md, ascend-terminology.md

Phase 3: 硬件特化

Cube：BLOCK_M/N/K 为 16 倍数，累加器 FP32
UB：缓冲区总大小 < 192KB，单值缓冲区 32B 对齐
Grid：1D Grid ≤ 物理核数，核内循环处理多行
对角线调度：大矩阵（BLOCK_THRESHOLD 以上）使用对角线 Grid 调度提升 L2 缓存命中率
多 Vector Core：post-dot 操作使用 tl.parallel(bind_sub_block=True) 分配到 2 个 vector cores

加载：triton-ascend-api.md, tiling-strategies.md, triton-api-reference.md

Phase 4: 高级优化（按需）

算子融合、Double Buffer（tl.multibuffer(tensor, 2) 或 tl.compile_hint(tensor, "multi_buffer", 2)）

Phase 5: 验证（MANDATORY）

精度 + 泛化性 + 性能 + 端到端回归

反模式清单（NEVER）

❌ 仅凭单一规模数据做优化决策
❌ 端到端不达标时直接优化 kernel（应先 msprof 确认瓶颈）
❌ 为性能牺牲精度 / hardcode 破坏泛化性
❌ FP16 直接归约 / 非 16 倍数 BLOCK 做矩阵乘
❌ BLOCK_SIZE 超 UB（192KB）/ 非连续访存
❌ 热路径用 tensor.item()（触发 CPU-NPU 同步）
❌ 循环内用 if 分支修改变量（Triton 编译为 masked 操作，灾难性性能下降）
❌ 2D Tiling 只算数据 buffer 的 UB（必须包含 offset/mask/index 数组）
❌ 用预计算的 offset tensor 做 2D broadcasting（触发编译器 addptr 多用户 assertion）
❌ kernel 内用 broadcast stride 访问辅助张量（cos/sin 等），应改为 host 侧 expand+contiguous。expand 额外内存 < 非连续访存的性能损失
❌ 归约类算子用双 pass（多次 load 同一数据分别算统计量和归一化）——应单 pass，一次 load 后 UB 内全计算，详见踩坑 8
❌ msprof 对比时两个 kernel 同名——msprof 按 OP Type 聚合，同名会混在一起
❌ 大矩阵不用对角线调度（L2 缓存颠簸，BLOCK_THRESHOLD 以上必须启用）

检查清单

[ ] 精度对齐 PyTorch-NPU（rtol=1e-3, atol=1e-3）
[ ] 非对齐维度和边界通过
[ ] 性能测试覆盖小/中/大
[ ] grid ≤ 物理核数，BLOCK_SIZE 为编译时常量
[ ] 缓冲区 < 192KB，所有 load/store 有 Mask
[ ] 归约升 FP32，矩阵乘 BLOCK 为 16 倍数
[ ] 归约类算子是否单 pass（D ≤ UB 时必须）

常见瓶颈速查

Related Skills

Ascend/k8s-check-fix

testing

VerifiedTrustedCommunity

Kubernetes 集群健康检查与安全修复 — 诊断问题，用户确认后执行修复

13SKILL.mdUpdated May 8, 2026

Ascend/cann-nnal-installer

tools

VerifiedTrustedCommunity

昇腾NPU CANN Toolkit+Kernels+NNAL安装部署技能。支持从官网下载run包安装和从Docker镜像提取两种方式，覆盖驱动检查、包下载、安装、环境变量配置与验证全流程。当用户需要安装CANN全套组件或指定版本CANN到自定义路径时调用。

13SKILL.mdUpdated May 8, 2026

Ascend/cann-nnal-installer

Ascend/atb-testframework-build

development

VerifiedTrustedCommunity

编译 ATB (Ascend Transformer Boost) 测试框架。当用户需要编译 ATB 测试框架、运行 CSV 测试、或构建 atb_test_framework 时调用。支持全量编译（含第三方依赖克隆与源替换）和增量编译两种模式。需在 Docker 容器内配合 CANN 环境执行。

13SKILL.mdUpdated May 7, 2026

Ascend/atb-testframework-build

Ascend/atb-ops-to-aclnn-migration-workflow

databases

VerifiedTrustedCommunity

ATB OPS→ACLNN 迁移标准化工作流主模板。整合前置学习、设计文档生成、CSV用例设计、实际迁移、编译验证、测试验证全流程，提供明确的阶段 Gates 和用户确认机制。

13SKILL.mdUpdated May 7, 2026

Ascend/atb-ops-to-aclnn-migration-workflow

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/Ascend/agent-skills.git

# Copy into Claude Code skills folder (global)
cp -r agent-skills/skills/triton-operator-performance-optim ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

Ascend/agent-skills

13 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT