skills/ascend-profiling-anomaly/SKILL.md
Analyze Huawei Ascend NPU profiling data to discover hidden performance anomalies and produce a detailed model architecture report reverse-engineered from profiling. Trigger on Ascend profiling traces, NPU bottlenecks, device idle gaps, host-device issues, kernel_details.csv / trace_view.json / op_summary / communication.json. Also trigger on "profiling", "step time", "device bubble", "underfeed", "host bound", "device bound", "AICPU", "wait anchor", "kernel gap", "Ascend performance", "model architecture", "layer structure", "forward pass", "model structure". Runs anomaly discovery (bubble detection, wait-anchor, AICPU exposure) alongside model architecture analysis (layer classification, per-layer sub-structure, communication pipeline). Outputs a separate Markdown architecture report alongside anomaly analysis.
Analyze Ascend NPU profiling data through three parallel pipelines:
The core philosophy is separation of concerns: "anomaly exists" is a hard fact derived from device intervals; "why it exists" is a soft attribution that may require additional evidence. Even under weak profiling configurations (no stacks, no shapes, sparse host events), the skill must still reliably surface device idle bubbles and risk labels.
Read these before starting analysis:
| File | When to read | What it contains |
|---|---|---|
| references/kernel_data_guide.md | Always — read first | Raw data column schemas for kernel_details.csv, op_summary, trace_view.json; the step → structure → block/side → op hierarchy; how to parse, filter, assign kernels at each level; multi-stream handling; per-level timing aggregation |
| references/rulebook.md | Always | Anomaly thresholds, tagging rules, decision tables, soft attribution rules, AICPU classification, wait-anchor rules |
| references/architecture_report_template.md | Always — read before producing the architecture report | Full template for the standalone Markdown architecture report: required sections, formatting rules, analysis techniques for layer classification, communication overlap measurement, per-layer timing breakdowns |
| references/schema.json | When producing structured JSON output | Full JSON schema for the anomaly_discovery output object |
| scripts/reference_host_gap_branch.py | When writing analysis scripts | Reference Python implementation for interval merging, bubble metrics, soft attribution, wait-anchor scoring |
The full state machine:
INGEST → INVENTORY → FACT_EXTRACTION → CANDIDATE_STEP_DETECTION →
STEP_GROUPING → MACRO_STEP_RESOLUTION → SEGMENTATION →
CLOCK_ACCOUNTING → ANOMALY_DISCOVERY → PERF_JUDGEMENT →
RECOMMENDATION → RENDER → DONE
↘
ARCHITECTURE_ANALYSIS → ARCH_REPORT_RENDER → ARCH_REPORT_SAVE
ANOMALY_DISCOVERY sits after CLOCK_ACCOUNTING and before PERF_JUDGEMENT. It receives already-segmented steps and structures, and runs the bubble detection pipeline on top of them.
ARCHITECTURE_ANALYSIS runs in parallel with PERF_JUDGEMENT, using the same segmented data from CLOCK_ACCOUNTING plus FIA timeline analysis. It produces a separate Markdown report file saved alongside the profiling data.
Understanding how raw kernel data maps to each level is essential. Read references/kernel_data_guide.md for the full column schemas and parsing details. Here is the conceptual overview:
Each row in kernel_details.csv represents a single device kernel execution — one invocation of an AI Core task, an AI CPU task, or an HCCL communication task. Key fields: Name, Task Type, Start Time(us), Duration(us), Wait Time(us), Accelerator Core, Stream ID, Input Shapes, Output Shapes.
A step is one training/inference iteration. Steps are identified by ProfilerStep#N user annotations or Iteration#N markers in trace_view.json. Each step defines a service window [S_i, S_{i+1}). All kernels whose start time falls within this window belong to step i.
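As a minimal sketch of the window assignment described above, the following assumes the step marker timestamps have already been extracted into a sorted list. The function name, the `capture_end_us` parameter, and the sample values are hypothetical, not part of the skill.

```python
from bisect import bisect_right


def assign_kernels_to_steps(step_starts_us, capture_end_us, kernel_starts_us):
    """Map each kernel start time to the index i of the window [S_i, S_{i+1}).

    step_starts_us must be sorted ascending. The last window is closed by
    capture_end_us (an assumption of this sketch). Kernels starting outside
    every window map to None.
    """
    assignments = []
    for start in kernel_starts_us:
        if start < step_starts_us[0] or start >= capture_end_us:
            assignments.append(None)
            continue
        # bisect_right counts the step starts <= start; subtract 1 for the index.
        assignments.append(bisect_right(step_starts_us, start) - 1)
    return assignments


step_starts = [1000.0, 2000.0, 3000.0]  # S_0, S_1, S_2 (hypothetical values)
kernel_starts = [999.0, 1000.0, 1999.9, 2000.0, 3500.0, 4000.0]
print(assign_kernels_to_steps(step_starts, 4000.0, kernel_starts))
# [None, 0, 0, 1, 2, None]
```

The half-open window means a kernel starting exactly at S_{i+1} belongs to step i+1, not step i.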
At step level, compute:
Within a step, kernels form repeating structures — typically corresponding to model layers (e.g., transformer blocks, attention layers, MLP blocks). Segmentation identifies these by:
Each structure contains a contiguous span of kernels within the step window.
At structure level, compute:
Within each structure, kernels split into:
At block/side level, maintain four timing perspectives simultaneously:
- wall_ms: wall-clock time from first kernel start to last kernel end
- busy_union_ms: merged device-busy time (accounts for multi-stream overlap)
- kernel_sum_ms: arithmetic sum of all kernel durations (ignores overlap)
- total_cost_ms: sum of duration + wait for all kernels

Conclusions based on only one metric are incomplete. A block appearing heavy in kernel_sum but light in wall means high stream parallelism. A side appearing heavy in total_cost but light in duration means wait-anchor false hotspot risk.
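The four perspectives can be computed side by side from the same kernel list. This is a rough sketch, assuming each kernel is reduced to a `(start_us, duration_us, wait_us)` tuple; the function name and sample values are hypothetical.

```python
def timing_perspectives(kernels):
    """kernels: list of (start_us, duration_us, wait_us) for one block or side."""
    keys = ("wall_ms", "busy_union_ms", "kernel_sum_ms", "total_cost_ms")
    if not kernels:
        return dict.fromkeys(keys, 0.0)

    intervals = sorted((s, s + d) for s, d, _ in kernels)

    # Merge overlapping intervals so multi-stream overlap is counted once.
    busy_us = 0.0
    cur_s, cur_e = intervals[0]
    for s, e in intervals[1:]:
        if s <= cur_e:
            cur_e = max(cur_e, e)
        else:
            busy_us += cur_e - cur_s
            cur_s, cur_e = s, e
    busy_us += cur_e - cur_s

    wall_us = max(e for _, e in intervals) - intervals[0][0]
    kernel_sum_us = sum(d for _, d, _ in kernels)
    total_cost_us = sum(d + w for _, d, w in kernels)

    values_us = (wall_us, busy_us, kernel_sum_us, total_cost_us)
    return {k: v / 1000.0 for k, v in zip(keys, values_us)}


# Two overlapping kernels on different streams, then one kernel with a long wait.
sample = [(0, 100, 0), (50, 100, 10), (300, 50, 200)]
print(timing_perspectives(sample))
```

In this sample, kernel_sum (250 us) exceeds busy_union (200 us) because of the overlap, and total_cost (460 us) is dominated by the wait on the last kernel rather than by execution time.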
The finest grain. Each op may produce one or many device kernels. Op-level analysis handles:
- wait_ratio > 0.95 and tiny duration but high total_cost
- masked_ratio

For each step, collect all device kernel intervals from kernel_details.csv:
device_intervals = []
for each kernel row where Start_Time_us is within step window:
    s = max(step_start_us, row.Start_Time_us)
    e = min(step_end_us, row.Start_Time_us + row.Duration_us)
    if e > s:
        device_intervals.append(Interval(s, e))
Key rules:
- See references/kernel_data_guide.md, section on comm dedup.

Sort intervals by start time, merge overlapping ones:
merged = merge(device_intervals) # see scripts/reference_host_gap_branch.py
busy_union = sum of merged segment durations
This produces the merged busy segments from which all bubble metrics derive.
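The clipping and merging steps above, plus the gaps that fall out of the merged segments, can be sketched in runnable form as follows. This is not the reference implementation in scripts/reference_host_gap_branch.py. It assumes kernels are reduced to `(start_us, duration_us)` pairs and treats touching intervals as contiguous; all function names and sample values are hypothetical.

```python
def clip_to_step(rows, step_start_us, step_end_us):
    """Keep kernels that start inside the step window, clipped to the window."""
    intervals = []
    for start_us, duration_us in rows:
        if not (step_start_us <= start_us < step_end_us):
            continue
        s = max(step_start_us, start_us)
        e = min(step_end_us, start_us + duration_us)
        if e > s:
            intervals.append((s, e))
    return intervals


def merge_intervals(intervals):
    """Sort by start time and merge overlapping (or touching) intervals."""
    merged = []
    for s, e in sorted(intervals):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def gap_summary(merged, step_start_us, step_end_us):
    """Derive busy time and idle gaps (all in microseconds) from merged segments."""
    busy_union = sum(e - s for s, e in merged)
    internal_gaps = [nxt[0] - cur[1] for cur, nxt in zip(merged, merged[1:])]
    return {
        "busy_union_us": busy_union,
        "underfeed_us": (step_end_us - step_start_us) - busy_union,
        "prelaunch_gap_us": merged[0][0] - step_start_us if merged else None,
        "tail_gap_us": step_end_us - merged[-1][1] if merged else None,
        "internal_gaps_us": internal_gaps,
    }


rows = [(10, 20), (25, 10), (50, 100), (200, 5)]  # hypothetical kernels
merged = merge_intervals(clip_to_step(rows, 0, 120))
print(merged)  # [(10, 35), (50, 120)]
print(gap_summary(merged, 0, 120))
```

In the sample, the third kernel is clipped at the step end and the fourth is dropped because it starts outside the window. The idle time (25 us) is exactly the sum of the prelaunch gap, the internal gaps, and the tail gap.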
From the merged segments, compute per-step metrics:
- service_ms = step window duration
- device_busy_union_ms = sum of merged segment durations
- underfeed_ms = service − busy_union
- underfeed_ratio = underfeed / service
- prelaunch_gap_ms = start(first_merged_segment) − step_start
- tail_gap_ms = step_end − end(last_merged_segment)
- internal_bubble_total_ms = sum of gaps between consecutive merged segments
- largest_internal_bubble_ms = max gap between consecutive merged segments
- bubble_count = number of inter-segment gaps

For each bubble window (gap between merged segments, or prelaunch/tail gap), scan the same time range in host events from trace_view.json:
Host event categories to collect:
- cpu_op / python_function / user_annotation: general host activity
- AscendCL@*: ACL runtime events
- HostToDevice / torch_to_npu / aclrtMemcpy* / aclrtSynchronize*: sync/copy markers
- c10d / Hccl / hcom / StreamWaitEvent / Notify_Wait: communication markers

Compute overlap ratios:
- host_visible_coverage_ratio = fraction of bubble covered by any host event
- sync_marker_overlap_ratio = fraction covered by sync/copy markers
- comm_marker_overlap_ratio = fraction covered by communication markers

For each significant bubble, assign probability-level labels based on overlap ratios:
| Condition | Label |
|---|---|
| sync_overlap ≥ 0.20 | possible_sync_or_h2d |
| comm_overlap ≥ 0.20 | possible_comm_wait |
| host_coverage < 0.05 | possible_untraced_host_blocking |
| host_coverage ≥ 0.10 but no sync/comm dominance | possible_host_launch_lag |
| host_parallelism < 1.2 and none of above | possible_python_serialization_or_lock |
| nothing applies | insufficient_evidence |
Multiple labels can co-exist. These are explicitly NOT unique root causes.
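The overlap ratios and the label table above can be sketched as two small functions. This is an illustrative reading of the table, not the rulebook's implementation. It interprets "no sync/comm dominance" as both overlaps falling below 0.20, and it takes `host_parallelism` as a precomputed input because this section does not define it. Function names are hypothetical.

```python
def covered_fraction(window, events):
    """Fraction of window=(start, end) covered by the union of event intervals."""
    w_s, w_e = window
    if w_e <= w_s:
        return 0.0
    clipped = sorted(
        (max(w_s, s), min(w_e, e)) for s, e in events if e > w_s and s < w_e
    )
    covered, cur_end = 0.0, w_s
    for s, e in clipped:
        if e > cur_end:
            covered += e - max(s, cur_end)
            cur_end = e
    return covered / (w_e - w_s)


def soft_labels(host_coverage, sync_overlap, comm_overlap, host_parallelism):
    """Return every label whose condition holds; labels may co-exist."""
    labels = []
    if sync_overlap >= 0.20:
        labels.append("possible_sync_or_h2d")
    if comm_overlap >= 0.20:
        labels.append("possible_comm_wait")
    if host_coverage < 0.05:
        labels.append("possible_untraced_host_blocking")
    # Assumed reading of "no sync/comm dominance": neither overlap reaches 0.20.
    if host_coverage >= 0.10 and sync_overlap < 0.20 and comm_overlap < 0.20:
        labels.append("possible_host_launch_lag")
    if host_parallelism < 1.2 and not labels:
        labels.append("possible_python_serialization_or_lock")
    return labels or ["insufficient_evidence"]


bubble = (0.0, 100.0)
host_events = [(10.0, 30.0), (20.0, 50.0), (90.0, 200.0)]
coverage = covered_fraction(bubble, host_events)
print(coverage)  # 0.5
print(soft_labels(coverage, 0.0, 0.0, 2.0))  # ['possible_host_launch_lag']
```

Returning a list rather than a single value keeps the output consistent with the rule that these labels are not unique root causes.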
Apply the anomaly tags from references/rulebook.md decision tables. Core tags:
Bubble severity: DEVICE_IDLE_GAP_HEAVY, PRELAUNCH_GAP_HEAVY, TAIL_GAP_HEAVY, INTERNAL_BUBBLE_HEAVY
Risk tags: HOST_ORIGINATED_RISK, COMM_SYNC_RISK, WAIT_POLLUTION_RISK, WAIT_ANCHOR_FALSE_HOTSPOT, AICPU_EXPOSED_RISK, UNTRACED_HOST_BLOCKING_RISK, PARTIAL_CAPTURE_BOUNDARY, VARIABLE_SHAPE_SAME_TEMPLATE
At op level, scan for false hotspots:
wait_ratio = wait_us / (duration_us + wait_us)
if wait_ratio > 0.95 and duration_us < 10.0 and total_cost_rank <= 10:
    tag WAIT_ANCHOR_FALSE_HOTSPOT
These ops absorb idle wait time and appear expensive, but their kernel execution is negligible. Demote them in root-cause ranking.
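A runnable sketch of this scan follows. It assumes `total_cost_rank` is the 1-based rank of an op when all ops are sorted by `duration_us + wait_us` in descending order; the dictionary layout and the op names in the sample are hypothetical.

```python
def find_wait_anchors(ops, top_n=10, ratio_threshold=0.95, max_duration_us=10.0):
    """Return names of ops that rank high by total cost but barely execute.

    ops: list of dicts with keys "name", "duration_us", "wait_us".
    """
    ranked = sorted(ops, key=lambda o: o["duration_us"] + o["wait_us"], reverse=True)
    flagged = []
    for rank, op in enumerate(ranked, start=1):
        if rank > top_n:
            break
        total_cost = op["duration_us"] + op["wait_us"]
        if total_cost <= 0:
            continue  # avoid division by zero on empty rows
        wait_ratio = op["wait_us"] / total_cost
        if wait_ratio > ratio_threshold and op["duration_us"] < max_duration_us:
            flagged.append(op["name"])
    return flagged


ops = [
    {"name": "MatMul", "duration_us": 500.0, "wait_us": 5.0},
    {"name": "Cast", "duration_us": 2.0, "wait_us": 900.0},
    {"name": "Add", "duration_us": 3.0, "wait_us": 1.0},
]
print(find_wait_anchors(ops))  # ['Cast']
```

Here the Cast op tops the total-cost ranking only because it absorbed 900 us of wait, which is the pattern the tag is meant to demote.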
Aggregate step-level metrics by step_group_id:
- recurring_bubble_pattern = true if ≥60% of steps in group have bubble_count > 0
- dominant_idle_pattern = whichever of prelaunch/internal_bubble/tail contributes most

Merge anomaly results with structure breakdown. The report MUST include:
For the dominant step, break down bubble contributions per structure:
This pipeline produces a standalone Markdown report file that documents the model architecture as reverse-engineered from profiling data. Read references/architecture_report_template.md for the full template, formatting rules, and analysis techniques.
The report is saved as model_architecture_report_<profiling_dir_name>.md in the profiling or output directory.
Always. Every profiling analysis MUST produce this report alongside the anomaly discovery output. The architecture report provides essential context that makes the anomaly findings interpretable.
Use FusedInferAttentionScore (FIA) invocations as the primary structural marker:
- kernel_details.csv (match by name containing FusedInferAttentionScore)
- Start Time(us)
- num_passes = total_FIA / FIA_per_pass

For each forward pass, determine:
Cross-pass variation (FIA duration, wall time) should be noted — it may reveal KV cache growth or memory pressure effects.
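The FIA-based pass detection can be sketched as below, assuming kernels are available as `(name, start_us)` pairs and that `fia_per_pass` is already known to the caller. The function name and sample data are hypothetical.

```python
def split_forward_passes(kernels, fia_per_pass):
    """Group FIA start times into forward passes.

    Returns (num_passes, leftover, passes), where leftover counts FIA
    invocations that do not fill a complete pass (for example, a pass cut
    off at the capture boundary).
    """
    fia_starts = sorted(
        start for name, start in kernels if "FusedInferAttentionScore" in name
    )
    num_passes, leftover = divmod(len(fia_starts), fia_per_pass)
    passes = [
        fia_starts[i * fia_per_pass:(i + 1) * fia_per_pass]
        for i in range(num_passes)
    ]
    return num_passes, leftover, passes


kernels = [("FusedInferAttentionScore", float(t)) for t in (10, 20, 30, 110, 120, 130, 210)]
kernels.append(("MatMul", 15.0))  # non-FIA kernels are ignored
print(split_forward_passes(kernels, fia_per_pass=3))
# (2, 1, [[10.0, 20.0, 30.0], [110.0, 120.0, 130.0]])
```

A non-zero leftover is worth reporting rather than discarding, since it may indicate a truncated pass or an incorrect `fia_per_pass`.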
For each layer (delimited by consecutive FIA invocations), extract the kernel sequence and classify:
| Classifier kernel | Layer type |
|---|---|
| No MoE markers (no MoeGatingTopK, no DFC, no GroupedMatmul) | Dense |
| DispatchFFNCombine present | MoE+DFC |
| GroupedMatmul + alltoallv present (no DFC) | MoE+GMM |
| Sampling ops (ArgMax, rejection_*) present | includes decode/sampling logic |
Build a summary table: Layer Type | Layer Range | Count | Characteristics
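The classification table can be expressed as a small function over the kernel names of one layer. This sketch assumes case-insensitive substring matching on kernel names and treats the sampling row as an extra flag rather than a layer type; both choices, the function name, and the "Unclassified" fallback are assumptions made for illustration.

```python
def classify_layer(kernel_names):
    """Return (layer_type, has_sampling) for the kernels of one layer."""
    lowered = [name.lower() for name in kernel_names]

    def has(marker):
        return any(marker.lower() in name for name in lowered)

    has_dfc = has("DispatchFFNCombine")
    has_gmm = has("GroupedMatmul")
    has_gating = has("MoeGatingTopK")

    if has_dfc:
        layer_type = "MoE+DFC"
    elif has_gmm and has("alltoallv"):
        layer_type = "MoE+GMM"
    elif not (has_gating or has_gmm):
        layer_type = "Dense"
    else:
        # MoE markers present but no matching row in the table.
        layer_type = "Unclassified"

    has_sampling = has("ArgMax") or has("rejection_")
    return layer_type, has_sampling


# Kernel names other than the classifier markers are hypothetical examples.
print(classify_layer(["FusedInferAttentionScore", "MatMulV2", "RmsNorm"]))
# ('Dense', False)
print(classify_layer(["MoeGatingTopK", "GroupedMatmul", "hcom_alltoallv_"]))
# ('MoE+GMM', False)
```

Keeping an explicit fallback makes it easier to spot layers that need manual review when the per-pass op counts are cross-checked.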
Count key ops per pass and verify they match the layer classification:
Discrepancies indicate classification errors — resolve before proceeding.
For EACH distinct layer type, analyze the kernel execution sequence:
Present as kernel sequence trees with timing annotations and stream labels. See the template for the exact tree notation format.
Decode layers require separate analysis because they have fundamentally different cost profiles:
Produce a dominant costs table and explain why the cost profile differs from prefill.
Document the multi-stream overlap strategy:
Assemble all findings into the Markdown report following the template in references/architecture_report_template.md. All 10 required sections must be present:
Save the report as model_architecture_report_<profiling_dir_name>.md. Inform the user of the file location.
- insufficient_evidence or possible_untraced_host_blocking when host evidence is sparse; never silently omit the anomaly section.
- possible / probable / insufficient evidence for root causes.

| Missing data | Impact | Action |
|---|---|---|
| record_shapes=false | Cannot detect shape variation | Bubble detection continues; tag VARIABLE_SHAPE_SAME_TEMPLATE skipped |
| with_stack=false | Soft attribution specificity degrades | Lower confidence; bubble detection unaffected |
| Sparse host events | Cannot narrow root-cause family | UNTRACED_HOST_BLOCKING_RISK, requires_host_followup=true |
| Capture boundary truncation | Edge gaps may be artifacts | PARTIAL_CAPTURE_BOUNDARY on boundary-adjacent gaps |
| No communication.json | Cannot assess comm wait | Skip comm overlap, note in evidence gaps. Architecture report omits comm pipeline bandwidth stats but still documents stream roles |
| No step markers | Cannot define step windows | Fall back to global capture span as single pseudo-step |
| op_summary only (no kernel_details) | Coarser granularity | Use op-level intervals instead; note in limitations. Architecture report uses op counts for layer classification but cannot produce per-layer kernel sequence trees |
| No FIA kernels detected | Cannot determine layer boundaries via FIA | Architecture report falls back to alternative structural markers (e.g., repeating kernel patterns, communication boundaries). Note reduced confidence in layer count |
| Single forward pass captured | Cannot cross-validate pass consistency | Architecture report documents single pass; notes that cross-pass variation analysis is unavailable |
| No decode FIA detected | Inference-only or prefill-only capture | Architecture report omits decode phase analysis section; notes capture scope limitation |
Every analysis must produce TWO outputs:
The anomaly_discovery top-level object containing: enabled, dominant_group_id, global_device_gap_analysis, step_group_anomalies, bubble_windows, wait_anchor_ops, soft_root_cause_summary, requires_host_followup, confidence.
Each step result must include bubble metrics, anomaly tags, and soft root-cause labels.
For the full JSON schema, read references/schema.json.
A standalone Markdown file saved as model_architecture_report_<profiling_dir_name>.md containing all 10 required sections from the architecture template. Read references/architecture_report_template.md for the full specification.
The report must include at minimum:
The architecture report is the primary deliverable for understanding model structure. It must be self-contained — a reader should be able to understand the full model execution without referring to the anomaly output.
Each recommendation must include scope (global/step_group/structure/side/op), followup_required, evidence_gap, and priority (P0–P3).
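One way to enforce these required fields is a small validated record type. The sketch below is illustrative only: the field types and the free-text `summary` field are assumptions, and references/schema.json remains the authority for the actual output shape.

```python
from dataclasses import dataclass

SCOPES = {"global", "step_group", "structure", "side", "op"}
PRIORITIES = {"P0", "P1", "P2", "P3"}


@dataclass
class Recommendation:
    scope: str
    followup_required: bool
    evidence_gap: str
    priority: str
    summary: str = ""  # hypothetical free-text field, not named in this document

    def __post_init__(self):
        if self.scope not in SCOPES:
            raise ValueError(f"unknown scope: {self.scope!r}")
        if self.priority not in PRIORITIES:
            raise ValueError(f"unknown priority: {self.priority!r}")


rec = Recommendation(
    scope="step_group",
    followup_required=True,
    evidence_gap="host events too sparse to narrow the root-cause family",
    priority="P1",
    summary="Re-capture with richer host tracing",
)
print(rec.scope, rec.priority)  # step_group P1
```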
Common follow-up patterns:
- with_stack=true
- record_shapes=true