plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/SKILL.md
Diagnose and remediate cluster-wide HyperPod (EKS or Slurm) problems — creation / deployment failures (CloudFormation, EFA health check, lifecycle scripts, capacity), EKS access, node replacement, CloudFormation nested-stack errors, post-maintenance rollback state, dangling nodes, autoscaler conflicts. Includes `--validate` pre-flight. Read-only.
Operating policy. Run read-only diagnostics yourself. Never run a command that changes cluster, node, or workload state — present each one as a Suggested command (run this yourself) block and wait for the customer to run it. Destructive order: investigate → reboot → replace (replace destroys root + secondary volumes; not supported on Slurm controller nodes).
Before any state-changing CLI: ask if it's IaC-managed. HyperPod clusters, SGs, EKS access entries, and IAM are usually provisioned via CloudFormation / CDK / Terraform. If yes, the fix belongs in IaC — running the CLI will drift and the next deploy reverts it. Use the CLI only when IaC is unavailable (locked out, predates IaC, mid-review).
scripts/diagnose-cluster.sh is read-only: it collects state via AWS APIs (and SSM for Slurm controller health) and prints each issue as [FAIL] ... → references/<file>.md § <section>.
| Reference | Open when |
| ------------------------------------------------------------------------- | ------------------------------------------------------------------- |
| cluster-diagnostics-detail.md | Per-finding remediation runbook (§ A–L) |
| cluster-operations.md | Operational deep-dives (EFA SG, EKS access, SSM, Slurm, filesystem) |
| cloudformation-errors.md | § H needs the full per-resource CFN error catalog |
| capacity-planning.md | § B or --validate flags capacity / subnet sizing |
| lifecycle-scripts.md | § C points at a specific lifecycle failure |
| iam-permissions.md | Full IAM policy for the diagnostic |
1. Run scripts/diagnose-cluster.sh (or --validate for pre-create).
2. For each [FAIL] line, read the referenced section.

# Diagnose an existing cluster:
bash scripts/diagnose-cluster.sh --cluster <CLUSTER_NAME_OR_ARN> --region <REGION>
# Pre-flight (no cluster needed) — validates SGs, subnets, IAM, VPC endpoints,
# optionally S3 lifecycle scripts and per-AZ capacity:
bash scripts/diagnose-cluster.sh --validate --region <REGION> \
--sg-ids <sg-1,sg-2> --subnet-ids <sub-1,sub-2> [--iam-role <role-arn>] \
[--s3-uri s3://<BUCKET>/path/] [--instance-type ml.p5.48xlarge]
Pass --instance-type when the target instance type is known — enables the per-AZ capacity check (warns if none of the provided subnets are in an AZ that offers that type, which causes insufficient-capacity failures at creation time).
Tags: [PASS] · [FAIL] (counted, has → references/... pointer) · [WARN] · [INFO]. Priorities: P0 blocks operation · P1 degraded · P2 informational.
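A saved run can be triaged by tallying the tags and pulling the reference pointers out of the [FAIL] lines. The sketch below is not part of the skill: `count_tags` and `fail_pointers` are hypothetical helper names, and it assumes each finding line carries its tag literally and that [FAIL] lines end with the `→ references/...` pointer described above.

```bash
# Hypothetical helpers (not shipped with the skill) for a saved report such as diag.txt.
# Assumption: tags appear literally, e.g. "[FAIL] ... → references/<file>.md § <section>".

# Print one NAME=count line per tag; reads the report on stdin.
count_tags() {
  local report tag n
  report=$(cat)
  for tag in PASS FAIL WARN INFO; do
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
    n=$(printf '%s\n' "$report" | grep -c -F "[$tag]" || true)
    printf '%s=%s\n' "$tag" "$n"
  done
}

# Print the "references/... § ..." pointer of every [FAIL] line; reads the report on stdin.
fail_pointers() {
  grep -F '[FAIL]' | sed -n 's|.*→ *\(references/.*\)$|\1|p'
}
```

Typical use would be `count_tags < diag.txt` followed by `fail_pointers < diag.txt | sort -u` to get the list of sections to read.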
Error messages / events:
| Signal | Section |
| ---------------------------------------------------------------------------- | -------------------------------------------------------------- |
| "EFA health checks did not run successfully" (public-doc verbatim signal) | A: EFA Health Checks |
| Insufficient-capacity or AZ-mismatch failure at creation | B: Capacity & AZ |
| Lifecycle-script failure or timeout during provisioning | C: Lifecycle Scripts |
| kubectl auth error (server asks for credentials / no API group list) | D: EKS Access |
| InService but not all instances visible | E: Cluster Provisioning |
| "Target is not connected" / SSM errors | F: SSM Connectivity |
| Node replacement not happening / batch-replace not working | G: Node Replacement |
| "Embedded stack failed" / any CloudFormation error | H: CloudFormation Errors |
| UpdateClusterSoftware failed or cluster in post-maintenance rollback state | J: AMI & Cluster Updates |
| Dangling / orphaned nodes in EKS vs list-cluster-nodes | K: Dangling Nodes & Cleanup |
| Cluster Autoscaler breaks after HyperPod attached | L: Autoscaler Compatibility |
| Slow I/O, FSx throughput saturated | cluster-operations.md § 9 |
| Slurm node name → instance ID lookup | I: Utilities |
SG missing self-reference. Add inbound + outbound self-ref to every SG on the cluster, plus least-privilege egress for the AWS APIs the node needs (HTTPS 443 to S3 / ECR / SageMaker / SSM / STS / CloudWatch Logs — via VPC-endpoint prefix-lists when possible). Full procedure: cluster-diagnostics-detail.md § A.
Instance type unavailable in the requested AZ. Verify with describe-instance-type-offerings, then change AZ, use Flexible Training Plans, or request ODCR. Full: § B · strategy: capacity-planning.md.
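The AZ check can be reproduced by hand with two read-only EC2 calls. This is a rough sketch rather than the script's own logic: the function names are hypothetical, and it assumes the EC2 API expects the instance type without the SageMaker `ml.` prefix.

```bash
# Hypothetical helpers; only the aws CLI subcommands and flags are real.

# EC2 names instance types without the SageMaker "ml." prefix (ml.p5.48xlarge -> p5.48xlarge).
ec2_type() { printf '%s\n' "${1#ml.}"; }

# Read-only: print the AZs in REGION where EC2 offers the instance type.
offering_azs() {
  local region=$1 type
  type=$(ec2_type "$2")
  aws ec2 describe-instance-type-offerings --region "$region" \
    --location-type availability-zone \
    --filters "Name=instance-type,Values=$type" \
    --query 'InstanceTypeOfferings[].Location' --output text
}

# Read-only: print "subnet-id<TAB>AZ" for each subnet so the two lists can be compared.
subnet_azs() {
  local region=$1
  shift
  aws ec2 describe-subnets --region "$region" --subnet-ids "$@" \
    --query 'Subnets[].[SubnetId,AvailabilityZone]' --output text
}
```

For example, `offering_azs <REGION> ml.p5.48xlarge` next to `subnet_azs <REGION> <sub-1> <sub-2>` shows whether any subnet sits in an offering AZ. An offering only means the type exists in that AZ; it does not prove capacity is free.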
Script failed or timed out during provisioning. Read CloudWatch under /aws/sagemaker/Clusters/<name>/<id> — common causes: missing S3 VPC endpoint, IAM gap, CRLF line endings, instance-group name mismatch. Full: § C · layout: lifecycle-scripts.md.
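Two of these causes are quick to check locally. The sketch below uses hypothetical helper names; `has_crlf` detects carriage returns in a lifecycle script before upload, and `tail_lifecycle_logs` assumes the log group path shown above and the `aws logs tail` command from AWS CLI v2.

```bash
# Return 0 when FILE contains a carriage return, i.e. was likely saved with CRLF line endings.
has_crlf() {
  local cr
  cr=$(printf '\r')
  grep -q "$cr" "$1"
}

# Read-only: show the last hour of lifecycle logs.
# Arguments: cluster name, cluster id, region (filling /aws/sagemaker/Clusters/<name>/<id>).
tail_lifecycle_logs() {
  aws logs tail "/aws/sagemaker/Clusters/$1/$2" --region "$3" --since 1h
}
```

Running `has_crlf on_create.sh && echo "convert to LF before upload"` against each script is one way to use it; `on_create.sh` here is only a placeholder file name.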
IAM identity not in EKS access entries. Verify with sts get-caller-identity, create an access entry with admin policy, update kubeconfig. Full: § D.
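When the caller is an assumed role, `sts get-caller-identity` returns a session ARN, while access entries are keyed on the IAM role ARN. The sketch below shows one way to map between the two; `iam_principal_arn` is a hypothetical helper, and the choice of `AmazonEKSClusterAdminPolicy` as the "admin policy" is an assumption. The state-changing commands are listed as comments for the customer to run, after confirming the cluster is not IaC-managed.

```bash
# Map the caller ARN from `aws sts get-caller-identity` to the IAM principal ARN.
# arn:aws:sts::<acct>:assumed-role/<role>/<session> becomes arn:aws:iam::<acct>:role/<role>;
# any other ARN is printed unchanged.
# Caveat: the STS ARN omits the role path, so roles created under a path need their ARN from IAM.
iam_principal_arn() {
  local arn=$1 rest partition account role
  case $arn in
    arn:*:sts::*:assumed-role/*/*)
      rest=${arn#arn:};            partition=${rest%%:*}
      rest=${arn#*:sts::};         account=${rest%%:*}
      rest=${arn#*:assumed-role/}; role=${rest%%/*}
      printf 'arn:%s:iam::%s:role/%s\n' "$partition" "$account" "$role"
      ;;
    *) printf '%s\n' "$arn" ;;
  esac
}

# Read-only checks:
#   aws sts get-caller-identity --query Arn --output text
#   aws eks list-access-entries --cluster-name <EKS_CLUSTER> --region <REGION>
#
# Suggested commands (run these yourself; they change cluster state):
#   aws eks create-access-entry --cluster-name <EKS_CLUSTER> --region <REGION> \
#     --principal-arn <IAM_PRINCIPAL_ARN>
#   aws eks associate-access-policy --cluster-name <EKS_CLUSTER> --region <REGION> \
#     --principal-arn <IAM_PRINCIPAL_ARN> --access-scope type=cluster \
#     --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy
#   aws eks update-kubeconfig --name <EKS_CLUSTER> --region <REGION>
```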
InService without all instances is expected under Continuous Provisioning — failures surface as events, not cluster errors. For stuck Creating/Updating/Deleting: check CFN nested stacks (§ H), IAM, capacity, events; if stuck Deleting check VPC ENI dependencies. Full: § E.
Target is not connected: use sagemaker-cluster:<CLUSTER_ID>_<GROUP>-<INSTANCE_ID> format (not raw EC2 ID), install session-manager-plugin, confirm node Running. Check IAM + VPC endpoints on timeouts. Full: § F.
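The target string is easy to get wrong by hand, so it can help to build it from its parts. A minimal sketch with hypothetical helper names, assuming the cluster ID is the final path segment of the cluster ARN:

```bash
# Assumption: the cluster ID is the last "/"-separated segment of the cluster ARN.
cluster_id_from_arn() { printf '%s\n' "${1##*/}"; }

# Build the SSM target in the sagemaker-cluster:<CLUSTER_ID>_<GROUP>-<INSTANCE_ID> format.
# Arguments: cluster id, instance group name, instance id.
ssm_target() { printf 'sagemaker-cluster:%s_%s-%s\n' "$1" "$2" "$3"; }

# Example (opens an interactive session, so run it yourself):
#   aws ssm start-session --region <REGION> \
#     --target "$(ssm_target <CLUSTER_ID> <GROUP> <INSTANCE_ID>)"
```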
Auto-repair: confirm NodeRecovery=Automatic, check Health Monitoring Agent (HMA) logs + node labels / Slurm reason, confirm capacity. Manual: reboot first, replace only if reboot fails. Replace requires the cluster to have been patched via UpdateClusterSoftware at least once and cannot target a Slurm controller node. Full: § G.
Embedded stack failed hides the real error. Drill into nested stacks via Events tab (filter Failed) until you reach a non-stack resource. CLI: describe-stack-events --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`]'. Also covers SLR creation failures and permission-boundary denials. Full: § H · catalog: cloudformation-errors.md.
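The drill-down can be sketched as a small recursive, read-only function. `drill_failed` is a hypothetical name, not something the skill ships; it assumes a nested stack's PhysicalResourceId is the nested stack's ARN and that a stack's own event carries its StackId, and it only looks at CREATE_FAILED events as in the query above.

```bash
# Hypothetical helper; the aws CLI subcommands, flags, and event fields are real.
# Prints "ResourceType<TAB>LogicalResourceId<TAB>Reason" for every failed non-stack
# resource, descending into nested stacks by their physical ID.
drill_failed() {
  local stack=$1 region=$2 stack_id tab type logical reason physical
  tab=$(printf '\t')
  stack_id=$(aws cloudformation describe-stacks --stack-name "$stack" --region "$region" \
    --query 'Stacks[0].StackId' --output text)
  # PhysicalResourceId goes last because it can be empty for resources never created.
  aws cloudformation describe-stack-events --stack-name "$stack" --region "$region" \
    --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`].[ResourceType,LogicalResourceId,ResourceStatusReason,PhysicalResourceId]' \
    --output text |
  while IFS="$tab" read -r type logical reason physical; do
    [ -n "$type" ] || continue
    if [ "$type" = "AWS::CloudFormation::Stack" ]; then
      # Skip the stack's own event and nested stacks that never received an ID.
      if [ -z "$physical" ] || [ "$physical" = "None" ] || [ "$physical" = "$stack_id" ]; then
        continue
      fi
      drill_failed "$physical" "$region" </dev/null
    else
      printf '%s\t%s\t%s\n' "$type" "$logical" "$reason"
    fi
  done
}
```

Calling `drill_failed <ROOT_STACK_NAME> <REGION>` prints only leaf failures, which is usually the line to look up in the error catalog.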
Map Slurm node names (ip-10-x-y-z) to HyperPod instance IDs via list-cluster-nodes or on-node /opt/ml/config/resource_config.json. Full: § I.
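One possible way to do the lookup from outside the node is sketched below. The helper names are hypothetical, and the sketch assumes that `describe-cluster-node` reports the node address as `NodeDetails.PrivatePrimaryIp`; if that field differs in your CLI version, adjust the query.

```bash
# Convert a Slurm node name such as ip-10-1-2-3 to its private IP (10.1.2.3).
slurm_name_to_ip() { printf '%s\n' "${1#ip-}" | tr '-' '.'; }

# Read-only: print the HyperPod instance ID whose private IP matches the Slurm node name.
# Arguments: cluster name, region, Slurm node name.
# Assumption: the address is exposed as NodeDetails.PrivatePrimaryIp.
find_instance_id() {
  local cluster=$1 region=$2 want id ip
  want=$(slurm_name_to_ip "$3")
  for id in $(aws sagemaker list-cluster-nodes --cluster-name "$cluster" --region "$region" \
      --query 'ClusterNodeSummaries[].InstanceId' --output text); do
    ip=$(aws sagemaker describe-cluster-node --cluster-name "$cluster" --node-id "$id" \
      --region "$region" --query 'NodeDetails.PrivatePrimaryIp' --output text)
    if [ "$ip" = "$want" ]; then
      printf '%s\n' "$id"
      return 0
    fi
  done
  return 1
}
```

This makes one API call per node, so on large clusters the on-node resource_config.json route may be quicker.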
UpdateClusterSoftware fails and rolls back, or the cluster stays in a post-maintenance rollback state. Common causes: lifecycle script incompatible with new AMI, HMA version too old, insufficient rolling-update capacity. If the cluster has active nodes, collect diagnostics and escalate rather than delete-and-recreate. Full: § J.
Nodes in kubectl get nodes but not in list-cluster-nodes (ghost EKS nodes), or the inverse (HyperPod nodes that never registered kubelet). Script flags both. Full: § K.
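The same comparison can be approximated by hand with two ID lists and a set difference. This is a sketch under a stated assumption, namely that EKS node names embed the EC2-style instance ID; `only_in` is a hypothetical helper, and the script's own matching logic may differ.

```bash
# Print lines of file A that do not appear in file B (exact, whole-line match).
only_in() {
  grep -vxF -f "$2" "$1" || true   # grep exits non-zero when nothing is left to print
}

# Read-only collection of the two lists (assumes node names contain the instance ID):
#   kubectl get nodes -o name | grep -o 'i-[0-9a-f]\{8,\}' | sort -u > eks.txt
#   aws sagemaker list-cluster-nodes --cluster-name <CLUSTER> --region <REGION> \
#     --query 'ClusterNodeSummaries[].InstanceId' --output text | tr '\t' '\n' | sort -u > hyperpod.txt
#
#   only_in eks.txt hyperpod.txt    # ghost EKS nodes
#   only_in hyperpod.txt eks.txt    # HyperPod nodes that never registered
```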
Cluster Autoscaler errors on HyperPod provider IDs and breaks autoscaling for all node groups. No officially endorsed workaround — escalate to AWS Support. Karpenter does not conflict with HyperPod nodes by default. Full: § L.
- aws CLI v2.13+ authenticated to the cluster's account
- jq, python3, bash 4.2+
- kubectl authenticated to the EKS cluster (EKS checks skipped if absent)
- session-manager-plugin (Slurm controller health checks only)

IAM policy: references/iam-permissions.md.
- --region or set $AWS_DEFAULT_REGION.
- --cluster <NAME> (diagnose) or --validate (pre-create).
- --no-color to force off.

| Failure | Script | Tell the customer |
| --------------------------------------------------- | ---------------------------------------------------------- | ----------------------------------------------------- |
| aws sts get-caller-identity fails | Exit 1 | "Fix AWS credentials and rerun." |
| Cluster not found | Exit 1 after listing region's clusters | "Confirm HyperPod cluster name (not EKS) and region." |
| sagemaker:* / ec2:* / eks:* / logs:* denied | Warn, add Missing IAM permission for <API>, continue | "Grant the listed IAM action and rerun." |
| kubectl absent or unauthenticated | Skip EKS checks (access entries, add-ons, aws-auth, nodes) | "Install/authenticate kubectl." |
| session-manager-plugin absent (Slurm) | Skip Slurm controller probe | "Install session-manager-plugin." |
| SSM throttled / times out (180s) | Retry with backoff; warn and continue if still failing | "Rerun later — script is idempotent." |
| CloudWatch log group not found | Skip CloudWatch check | "CloudWatch not configured on this cluster." |
Exit codes: 0 no critical failures · 1 one or more critical failures (cluster not found, fatal prerequisite missing, or any [FAIL] in diagnose or --validate mode). [WARN] lines do not affect the exit code.
| Need | Use |
| ------------------------------- | -------------------------- |
| Shell on nodes | hyperpod-ssm |
| Version comparison across nodes | hyperpod-version-checker |
Escalate when:
- The cluster is stuck in a transitional state (Creating, Updating, or a post-maintenance rollback state) for an extended period.

Run these commands and attach the output. Goal: AWS Support has everything at case open.
# 1. Cluster identity + status (confirms region, ARN, orchestrator, instance groups)
aws sagemaker describe-cluster --cluster-name <CLUSTER> --region <REGION>
# 2. Full cluster-level diagnostic bundle
bash scripts/diagnose-cluster.sh --cluster <CLUSTER> --region <REGION> > diag.txt
# 3. Per-node log/config bundle to S3 (delegates to hyperpod-issue-report skill)
# See skills/hyperpod-issue-report/SKILL.md for the exact invocation.
- Cluster ARN (with ClusterId suffix) and AWS region
- ClusterStatus + FailureMessage from describe-cluster
- NodeLogicalIds / instance group names
- diag.txt from step 2 above
- hyperpod-issue-report bundle from step 3