plugins/sagemaker-ai/skills/hyperpod-performance-debugger/SKILL.md
Diagnose performance issues on Amazon SageMaker HyperPod clusters — uneven NCCL bandwidth across nodes and poor filesystem throughput. Read-only. Surfaces host-side signals (Xid, ECC, NVLink, EFA reachability, FSx saturation) and routes to the appropriate sibling skill (hyperpod-node-debugger, hyperpod-nccl, hyperpod-version-checker, hyperpod-issue-report) for any remediation. Triggers on uneven NCCL across nodes, straggler node, FSx slow, checkpoint slow, dataloader slow, filesystem bottleneck, FSx throughput, cross-AZ latency, topology mismatch.
Route findings outside the two in-scope scenarios to the owner skill below.
| Concern observed | Route to |
| ---------------------------------------------------------------------- | ------------------------------------------------------------ |
| GPU hardware fault, ECC, NVLink, Xid, DCGM diagnostics, drain/replace | hyperpod-node-debugger (§ F Hardware/Auto-Repair, § G GPU) |
| Cannot allocate memory at os.fork(), root volume exhausted | hyperpod-node-debugger (§ I Resource Exhaustion) |
| NCCL timeouts, hangs, AllReduce stalls, EFA TCP fallback, RDMA memlock | hyperpod-nccl |
| EFA / NCCL / CUDA / NVIDIA driver version drift across nodes | hyperpod-version-checker |
| EFA self-referencing security-group rule missing — single node | hyperpod-node-debugger § A (EFA / Security Group) |
| EFA self-referencing security-group rule missing — cluster-wide | hyperpod-cluster-debugger § A (EFA Health Checks) |
| Slurm node state changes (drain / resume / reboot) | hyperpod-slurm-debugger |
| Diagnostic bundle for AWS Support | hyperpod-issue-report |
| Shell access on a node | hyperpod-ssm |
- hyperpod-version-checker.
- hyperpod-node-debugger § G.

1. Run scripts/perf-snapshot.sh (read-only) to gather host-side signals for the suspect node and the FSx filesystems mounted on it.
2. For each [CONCERN] line in the script output, open the matching section below and read the supporting reference.

bash scripts/perf-snapshot.sh --cluster <CLUSTER_NAME_OR_ARN> --region <REGION>
# Scope to one suspect node:
bash scripts/perf-snapshot.sh --cluster <C> --region <R> --node <INSTANCE_ID>
The script samples one node by default. It collects host-side data via hyperpod-ssm: nvidia-smi output (temperature, SM clocks, PCIe link width, ECC, NVLink, topo -m), recent dmesg Xid lines, EFA port state and fi_info provider visibility, EFA installer + kernel module versions, CPU governor, NVL72 Fabric Manager state, FSx CloudWatch utilization, df -h / lfs df -h per mount, host iowait, /dev/shm size, and root-volume usage. All read-only.
Tags: [OK] healthy · [CONCERN] signal worth investigating (carries a → pointer to the owner skill) · [INFO] informational.
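When the snapshot output is long, it can help to pull out only the lines that need follow-up. The helper below is a minimal sketch, assuming each finding is printed on a single line that contains its literal tag; `concern_lines` is a hypothetical name and is not part of scripts/perf-snapshot.sh.

```shell
#!/usr/bin/env bash
# Print only the [CONCERN] lines from perf-snapshot output read on stdin.
# Assumption: one finding per line, each carrying its literal tag.
concern_lines() {
  # -F treats the brackets literally; "|| true" keeps a clean run
  # (no concerns found) from being reported as a failure.
  grep -F '[CONCERN]' || true
}

# Usage (placeholders as in the commands above):
#   bash scripts/perf-snapshot.sh --cluster <C> --region <R> | concern_lines
```

Each surviving line should still carry its → pointer, so the routing table can be applied directly.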
Host vs container scope. The script runs on the host via SSM and reports host-scope values. Many setups ship the EFA / libfabric / OFI-NCCL / CUDA stack inside the training container by design — a host value of unknown is not by itself a defect. What matters for performance is the stack the workload actually uses. Verify versions inside the container (and across nodes) via hyperpod-version-checker before drawing conclusions.
| Observation | Section |
| ----------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| Pairwise NCCL bandwidth varies across node pairs / suspected straggler | A: Uneven NCCL Performance |
| Nodes spread across AZs / network-node-layer labels / UltraServer boundaries | A |
| EFA port not ACTIVE on a node, missing OFI plugin, or FI provider not visible | A + route to hyperpod-node-debugger § A; hyperpod-version-checker for cross-node version compare |
| iostat shows high iowait, FSx CloudWatch utilization sustained near 100% | B: Poor Filesystem Performance |
| DataLoader stalls, checkpoint dominates step time | B |
| Xid line in dmesg, uncorrectable ECC, inactive NVLink lane, GPU ≥ 88°C | Route to hyperpod-node-debugger § G |
| Container vs host version drift suspected | Route to hyperpod-version-checker |
| Cannot allocate memory at os.fork(), root volume full, OOM events | Route to hyperpod-node-debugger § I |
| NCCL timeout, hang, TCP fallback (NET/OFI Using TCP), RDMA memlock | Route to hyperpod-nccl |
The customer reports identical training jobs running with different step times on different node sets, pairwise bandwidth variance, or some allocations consistently slower than others despite identical code.
Per the official troubleshooting guide, the common contributing factors are network topology differences between nodes (cross-AZ, cross-rack, cross-UltraServer), degraded EFA performance on some nodes, mixed instance types or generations within an instance group, and CPU frequency scaling differences.
The host-side data points — GPU thermal/ECC/PCIe/clocks, Xid, NVLink lanes, EFA port state and provider visibility, CPU governor, EFA/OFI/driver versions, nvidia-smi topo -m — are all collected by scripts/perf-snapshot.sh (Step 1 above). The script tags [CONCERN] with thresholds and emits routing pointers; rerun it per suspect node via --node <INSTANCE_ID>.
For driver / CUDA / NCCL / EFA / OFI version drift across nodes, run the hyperpod-version-checker skill.
Run the standard nccl-tests recipes from awslabs/awsome-distributed-training. For an N-node cluster, run all-reduce across every node pair and record busbw for each pair. Pairs more than ~5% below the run mean (the threshold the AWS validation script flags) are the candidates to investigate.
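The ~5% rule is easy to apply mechanically once the per-pair results are collected. The sketch below assumes the results have been written as one `<pair> <busbw>` line per node pair; that format and the name `flag_slow_pairs` are illustrative assumptions, not output of nccl-tests or of the AWS validation script.

```shell
#!/usr/bin/env bash
# Flag node pairs whose busbw is more than PCT percent below the run mean.
# stdin: "<pair> <busbw>" per line (hypothetical format; adapt to your records).
# $1:    threshold in percent (default 5).
flag_slow_pairs() {
  local pct="${1:-5}"
  awk -v pct="$pct" '
    NF >= 2 { n++; pair[n] = $1; bw[n] = $2; sum += $2 }
    END {
      if (n == 0) exit
      mean = sum / n
      cutoff = mean * (1 - pct / 100)
      for (i = 1; i <= n; i++)
        if (bw[i] < cutoff)
          printf "%s busbw=%.1f mean=%.1f\n", pair[i], bw[i], mean
    }'
}

# Usage:
#   flag_slow_pairs 5 < pairwise-busbw.txt
```

A node that appears in most of the flagged pairs is the likely straggler; rerun the snapshot against it with --node.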
Expected busbw per SKU is published in the AI-on-HyperPod NCCL test guide. Benchmark the specific instance type before relying on a number.
Pairwise scripts, HyperPod topology surfaces (HyperPod API, EKS labels, Slurm topology.conf), and GB200 NVL72 specifics are in references/perf-details.md § Uneven NCCL.
HyperPod exposes topology through three operator-visible surfaces:
- HyperPod API: aws sagemaker describe-cluster-node returns NodeDetails.Placement.AvailabilityZone / AvailabilityZoneId and NodeDetails.UltraServerInfo.Id (UltraServer SKUs only).
- EKS labels: topology.kubernetes.io/zone, topology.k8s.aws/network-node-layer-{1,2,3} (highest-numbered = closest to instance), topology.k8s.aws/ultraserver-id.
- Slurm: topology.conf. Inspect via scontrol show topology.

Tightly coupled work shares the same AZ, the same highest-numbered network-node-layer label (EKS) or the same Slurm topology block, and, for NVL72 jobs, the same UltraServerInfo.Id / topology.k8s.aws/ultraserver-id. If the cluster is spread across AZs or layers, topology must be re-established at provisioning time. Route provisioning changes to hyperpod-cluster-debugger § B (Capacity & AZ).
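Whichever surface the values come from, the check is the same: do all nodes in the job share one value for the key? The sketch below assumes the node-to-value mapping has already been reduced to `<node> <value>` lines; `topology_spread` is a hypothetical helper, and the kubectl pipeline in the usage comment is one possible way to produce the input on EKS.

```shell
#!/usr/bin/env bash
# Summarise how nodes are spread across one topology key.
# stdin:   "<node> <value>" per line, where <value> is the AZ, the
#          highest-numbered network-node-layer label, or the UltraServer ID.
# stdout:  "<value> <node-count>" per distinct value, sorted by value.
# returns: 0 when every node shares one value, 1 when they span several.
topology_spread() {
  awk 'NF >= 2 { print $2 }' | sort | uniq -c |
    awk '{ print $2, $1; groups++ } END { exit (groups > 1) }'
}

# Usage on EKS (assumes every node carries the zone label, so it is the
# last column of the kubectl output):
#   kubectl get nodes --no-headers -L topology.kubernetes.io/zone \
#     | awk '{ print $1, $NF }' | topology_spread
```

A non-zero return only reports that the nodes are spread; whether that spread explains the slowdown still depends on the pairwise busbw results.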
The customer reports training bottlenecked on data loading, checkpoint save/load dominating step time, executables/scripts loading slowly, or iowait high.
Per the official troubleshooting guide, the resolution path follows this order:
This skill covers steps 1–3. Steps 4–5 are customer decisions; surface the data and let the customer pick.
scripts/perf-snapshot.sh (Step 1 above) covers the on-node side of this pass: it discovers FSx mounts, calls aws cloudwatch get-metric-statistics on DataReadBytes and (for OpenZFS) FileServerDiskIopsUtilization, prints df -h for /fsx /opt/dlami/nvme /opt/sagemaker, runs lfs df -h per Lustre mount, and reports iostat iowait. It tags [CONCERN] when OpenZFS IOPS utilization sustains ≥ 80% or iowait > 20%.
For longer windows or additional metrics (DataWriteBytes, Lustre DiskIopsUtilization, OpenZFS FileServerDiskThroughputUtilization), drive the query directly:
aws cloudwatch get-metric-statistics --region <REGION> \
--namespace AWS/FSx --metric-name DataReadBytes \
--dimensions Name=FileSystemId,Value=<FSID> \
--start-time "$(date -u -d '3 hours ago' +%Y-%m-%dT%H:%M:%S)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
--period 60 --statistics Sum Maximum
The full per-filesystem-type metric catalog is in references/perf-details.md § Filesystem.
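DataReadBytes is reported as a byte count per period, so comparing it against provisioned throughput needs a division by the period length. The sketch below assumes the Sum values were extracted by appending `--query 'Datapoints[].Sum' --output text` to the query above; `read_throughput_mbps` is a hypothetical helper, and it uses 1 MB = 10^6 bytes.

```shell
#!/usr/bin/env bash
# Convert DataReadBytes "Sum" datapoints (bytes per period) into MB/s.
# stdin: whitespace-separated Sum values.
# $1:    the --period used in the query, in seconds (default 60).
read_throughput_mbps() {
  local period="${1:-60}"
  tr -s '[:space:]' '\n' | awk -v p="$period" '
    $1 != "" { v = $1 / p / 1000000; sum += v; if (v > max) max = v; n++ }
    END {
      if (n) printf "avg %.1f MB/s, peak %.1f MB/s over %d datapoints\n", sum / n, max, n
    }'
}

# Usage:
#   aws cloudwatch get-metric-statistics ... --period 60 --statistics Sum \
#     --query 'Datapoints[].Sum' --output text | read_throughput_mbps 60
```

With a 60-second period, a Sum of 6,000,000,000 bytes corresponds to 100 MB/s averaged over that minute; shorter bursts inside the period are not visible at this resolution.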
Provisioned capacity is saturated. CloudWatch utilization sustained near 100% across the workload window. Customer decision: scale up the filesystem.
- StorageCapacity × PerUnitStorageThroughput; capacity changes are non-disruptive.

I/O pattern is inefficient. CloudWatch shows headroom but the workload is still I/O-bound. Customer decision: change the application.
- num_workers, set pin_memory=True, persistent_workers=True.
- torch.distributed.checkpoint.async_save plus FSDP SHARDED_STATE_DICT). FULL_STATE_DICT serializes through rank 0 and is a frequent root cause.

Filesystem-selection guidance and the async-checkpoint pattern are in references/perf-details.md § Filesystem.
Once the immediate incident is diagnosed, recommend HyperPod's built-in health features so problems are caught before the next training run rather than after another customer-reported regression.
Enable NodeRecovery=Automatic on the cluster. The Health Monitoring Agent (HMA) continuously monitors GPU- and Trainium-based instances and marks instances unhealthy on detected failure. With auto-recovery enabled, HyperPod reboots or replaces the node — no operator intervention.
Enable OnStartDeepHealthChecks on every GPU instance group with both check categories:
- InstanceStress: stress-ng on CPU/memory/disk, GPU and PCI device count verification, DCGM level-4 diagnostics (memory test included), and EFA loopback bandwidth/latency.
- InstanceConnectivity: multi-node NCCL all-reduce.

Every newly provisioned or auto-replaced node passes the same hardware bar before accepting jobs.
Run on-demand deep health checks when this skill or any sibling surfaces a hardware concern but the cluster is mid-workload. aws sagemaker start-cluster-health-check runs the same checks against a specific instance group; nodes are placed in a Slurm maintenance reservation and the check is queued until any running job completes (not preempted). Console: HyperPod → Clusters → Instances → Run deep health checks.
Not supported when NodeProvisioningMode=Continuous; one on-demand request per cluster at a time. Requires the latest AMI — run UpdateClusterSoftware first.
Logs land in CloudWatch at /aws/sagemaker/Clusters/<cluster_name>/<cluster_id> under DeepHealthCheckResults/<log_stream_id>, and on each node at /var/log/aws/clusters/sagemaker-deep-health-check.log.
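To read the CloudWatch copy of these logs from a terminal, one option is `aws logs tail` (AWS CLI v2). The sketch below only builds the log group name from the pattern above; `dhc_log_group` is a hypothetical helper, and the stream prefix in the usage comment assumes the streams are named exactly as that pattern describes.

```shell
#!/usr/bin/env bash
# Build the CloudWatch log group name for a cluster's deep health check logs.
# $1: cluster name, $2: cluster ID.
dhc_log_group() {
  printf '/aws/sagemaker/Clusters/%s/%s\n' "$1" "$2"
}

# Usage (requires AWS CLI v2 and credentials; <...> are placeholders):
#   aws logs tail "$(dhc_log_group <cluster_name> <cluster_id>)" \
#     --region <REGION> \
#     --log-stream-name-prefix DeepHealthCheckResults/ --since 6h
```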
External:
- AI-on-HyperPod NCCL test guide (expected busbw per SKU): https://awslabs.github.io/ai-on-sagemaker-hyperpod/docs/slurm-orchestration/validation-and-testing/performance-testing/nccl-tests