plugins/sagemaker-ai/skills/hyperpod-slurm-debugger/SKILL.md
Diagnostic-only skill for Slurm scheduler and node-daemon issues on Amazon SageMaker HyperPod Slurm clusters. Scope mirrors the HyperPod troubleshooting guide. Invoke when the user reports a Slurm node stuck in down/drain, "Node unexpectedly rebooted" after auto-repair, slurmd not running, jobs stuck PENDING with REASON=Resources while sinfo shows idle nodes, jobs stuck COMPLETING after node replacement, GRES/GPU counts wrong, scontrol ping failing, slurmctld unresponsive, an Action:Reboot/Replace request that did not trigger HyperPod auto-recovery, or auto-resume not restarting a job. Also triggers on "drain before reboot", "diagnose a Slurm node", "investigate stuck jobs."
Diagnostic-only. Identify and classify Slurm scheduler and node-daemon issues on HyperPod Slurm clusters. Do not run, recommend, or print any state-mutating command. For remediation, link to the official AWS or Slurm documentation.
Invoke when the user reports any of the symptoms in the decision table.
When NOT to invoke:
- Cluster has Orchestrator.Eks: invoke hyperpod-node-debugger or hyperpod-nccl.
- … hyperpod-node-debugger.
- … hyperpod-nccl.
- … hyperpod-ssm.

Canonical recovery URLs: references/slurm-details.md → Authoritative recovery documentation.
Prerequisites:
- sagemaker:DescribeCluster, sagemaker:ListClusterNodes
- ssm:StartSession on the HyperPod-created SSM document
- jq ≥ 1.6.
- unbuffer (from the expect package). Required: without it, aws ssm start-session intermittently returns empty stdout with Cannot perform start session: EOF, and every check silently misreports. Install the expect package on Amazon Linux / RHEL / Debian / Ubuntu / macOS. The script exits at the prerequisite check if unbuffer is missing.

Ask the user for the values the commands below take: the cluster name or ARN, the region, and (optionally) a Slurm node name.
```shell
aws sagemaker describe-cluster --cluster-name <NAME/ARN> --region <REGION> \
  --query 'Orchestrator' --output json
```
If Orchestrator.Eks is present, stop. Route per When NOT to invoke.
```shell
bash scripts/slurm-diagnose.sh --cluster <NAME> --region <REGION>
# Scope to a node:
bash scripts/slurm-diagnose.sh --cluster <NAME> --region <REGION> --node <SLURM_NODE>
```
Relay the script output to the user verbatim.
For each finding, look up the section in the decision table and link the user to the corresponding AWS / Slurm doc. Do not type out remediation commands.
| Symptom (sinfo -o "%N %T %30E" or script finding) | Section |
| ----------------------------------------------------------- | ------------------------------------------------------ |
| Node state = down or down*, reason other than below | A: Node Down |
| Node state = down*, Reason = Node unexpectedly rebooted | B: Unexpected Reboot |
| Jobs PENDING with REASON=Resources while nodes are idle | C: Controller State |
| Jobs stuck COMPLETING after node replacement | C: Controller State |
| scontrol ping returns DOWN for the controller | C: Controller State |
| GRES (GPU) counts incorrect or not released | C: Controller State |
| state=fail issued but no recovery occurred | D: Action Reason Mismatch |
| Accounting errors or RPC errors mentioning dbd | C: Controller State (slurmdbd) |
| slurm.conf edited; new partitions or nodes not visible | C: Controller State (config) |
| Job exited on a hardware failure but did not restart | E: Auto-resume |
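
The two node-state rows of the table can be read as a small classifier. The sketch below is illustrative only and is not taken from scripts/slurm-diagnose.sh: the function name `classify_node` is hypothetical, and it assumes one line of `sinfo -o "%N %T %30E"` output has already been split into state and reason. It only inspects strings and runs no Slurm command.

```shell
# Hypothetical helper: map a node's state and reason to a decision-table section.
# Read-only: it compares strings and never calls scontrol.
classify_node() (
  state=$1
  reason=$2
  case "$state" in
    'down*')
      # down* plus the reboot reason is Section B; any other reason is Section A.
      case "$reason" in
        *"Node unexpectedly rebooted"*) echo "B: Unexpected Reboot" ;;
        *) echo "A: Node Down" ;;
      esac ;;
    down) echo "A: Node Down" ;;
    *) echo "no node-state match" ;;
  esac
)

# Sample line in the "%N %T %30E" layout (values are made up).
line='ip-10-1-2-3 down* Node unexpectedly rebooted'
node=${line%% *}      # first field: node name
rest=${line#* }       # everything after the node name
state=${rest%% *}     # second field: state
reason=${rest#* }     # remainder: reason text
echo "$node -> $(classify_node "$state" "$reason")"
```

The other table rows (pending jobs, controller ping, GRES counts) depend on squeue and controller output rather than node state, so they are left out of this sketch.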
Script defaults and overrides:
| Behavior | Default | Override |
| -------------------- | -------------------------------------------------------------------------------------------------- | -------------------------- |
| Mode | read-only — always; no remediation flag exists | n/a |
| Region | $AWS_DEFAULT_REGION, falling back to us-east-1 | --region <R> |
| Scope | all nodes in down / drain / fail / "unexpectedly rebooted" | --node <SLURM_NODE_NAME> |
| Output | colorized terminal | --no-color |
| SSM target format | sagemaker-cluster:<clusterId>_<instanceGroupName>-<instanceId> (derived) | n/a |
| Controller discovery | --controller-group (if set) → SlurmConfig.NodeType=Controller → provisioning_parameters.json | --controller-group <N> |
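
As a minimal sketch of two rows in the table above, the snippet below assembles the SSM target string from its three parts and applies the region fallback. The function name `ssm_target` and all example values are hypothetical; how the script itself obtains the cluster ID, instance group name, and instance ID is not shown here.

```shell
# Hypothetical helper: build the SSM target in the documented format
# sagemaker-cluster:<clusterId>_<instanceGroupName>-<instanceId>.
ssm_target() (
  cluster_id=$1
  instance_group=$2
  instance_id=$3
  printf 'sagemaker-cluster:%s_%s-%s\n' "$cluster_id" "$instance_group" "$instance_id"
)

# Region default from the table: $AWS_DEFAULT_REGION, falling back to us-east-1.
region=${AWS_DEFAULT_REGION:-us-east-1}
echo "region: $region"

# Made-up example values.
ssm_target abc123example worker-group i-0123456789abcdef0
```

The resulting string is the kind of value passed as the target of an SSM session; the sketch only formats it and opens no session.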
Failure handling:
| Failure | Skill behavior | Required user action |
| -------------------------------------------------- | -------------------------------------- | ----------------------------------------------- |
| describe-cluster fails | Print AWS error; exit 1 | Fix credentials/region; verify cluster name |
| Cluster has Orchestrator.Eks | Exit 1 with pointer to EKS-side skills | Use hyperpod-node-debugger or hyperpod-nccl |
| session-manager-plugin missing / SSM unreachable | sinfo returns empty; exit 1 | Install plugin; verify node InService |
| Disk ≥ 95 % full on a down node | Report finding disk-full-<node> | Refer to AWS troubleshooting docs |
| Missing jq or aws | Exit 1 at prerequisite check | Install per Prerequisites |
Section A: Node Down

Node is down because slurmd stopped responding. Causes: slurmd crash, disk full,
OOM, network partition, hardware fault.
Script checks: systemctl is-active slurmd, srun -w <NODE> hostname (RPC layer), disk,
memory.
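
The disk check can be pictured roughly as follows. This is a sketch under assumptions, not the script's actual code: the helper names are hypothetical, the input is assumed to be `df -P` output, and the 95 % threshold is the one listed for the disk-full finding.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the use
# percentage of the first filesystem row as a bare integer.
disk_use_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeeds when usage meets the 95 % threshold.
disk_is_full() {
  [ "$1" -ge 95 ]
}

# Canned sample in `df -P` layout (values are made up).
sample='Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/nvme0n1p1 100 97 3 97% /'

pct=$(printf '%s\n' "$sample" | disk_use_pct)
if disk_is_full "$pct"; then
  echo "disk-full: ${pct}% used"
else
  echo "disk ok: ${pct}% used"
fi
```

On a live node the same function could be fed with `df -P / | disk_use_pct`, which reads filesystem statistics and changes nothing.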
Link: https://github.com/aws/sagemaker-hyperpod-cluster-setup/blob/troubleshooting-doc-20250917/troubleshoot/index.md
If node returns to down after a manual resume → escalate to hyperpod-node-debugger.
Context: references/slurm-details.md § A.
Section B: Unexpected Reboot

Node is down* with Reason "Node unexpectedly rebooted" because slurmd
re-registered after an out-of-band reboot. This is upstream Slurm behavior, not HyperPod.
The node is typically healthy.
Links:
- … state=resume semantics)

If the node reboots again within minutes → escalate to hyperpod-node-debugger.
Context: references/slurm-details.md § B.
Section C: Controller State

slurmctld in-memory state can desync from the on-disk state. A controller restart reloads from StateSaveLocation and clears bad caches. The user decides whether to restart and performs it; this skill only diagnoses.
Restart may help:
| Symptom | Why |
| -------------------------------------------------- | ------------------------------------------- |
| PENDING with REASON=Resources, idle nodes | Re-evaluates the queue |
| Jobs stuck COMPLETING after node replacement | Controller held a reference to the old node |
| GRES (GPU, EFA) not released after a job ends | Resource accounting de-synced |
| Nodes stuck Unknown after reboot, slurmd is up | Re-registration was not processed |
| scontrol ping times out | Controller event loop is hung |
| Lost connection to slurmdbd / RPC errors | DBD connection wedged |
Do NOT restart when:
- … Action:Replace) in progress on any node: concurrent changes fail the replacement.
- … slurmd on that node.
- sinfo and squeue are responsive: the problem is elsewhere.
- journalctl -u slurmctld has not been reviewed yet: a panic / OOM will reproduce.
- slurm.conf was just edited: try scontrol reconfigure first.

Controller State (slurmdbd): sacct fails, accounting fields show Unknown, and the controller log spams Unable to contact slurmdbd. Restore slurmdbd before considering a controller restart.
https://slurm.schedmd.com/accounting.html · details.

Controller State (config): slurm.conf / topology.conf mtime > slurmctld start. scontrol reconfigure first; restart is the fallback.
https://slurm.schedmd.com/scontrol.html · details.

Restart procedure / what's preserved:
Context: references/slurm-details.md § C.
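
The config check in this section (slurm.conf / topology.conf mtime later than the slurmctld start time) reduces to comparing two timestamps. The sketch below is a hedged illustration: `config_newer_than_start` is a hypothetical name, both arguments are assumed to be epoch seconds, and the commented wiring assumes GNU stat, GNU date, and systemd, with `SLURM_CONF_PATH` as a placeholder for wherever the cluster keeps slurm.conf.

```shell
# Hypothetical helper: succeed when the config file mtime (epoch seconds)
# is later than the slurmctld start time (epoch seconds).
config_newer_than_start() {
  [ "$1" -gt "$2" ]
}

# Possible read-only wiring on the controller (assumptions: GNU stat/date,
# systemd, and SLURM_CONF_PATH set by the caller):
#   conf_mtime=$(stat -c %Y "$SLURM_CONF_PATH")
#   started=$(date -d "$(systemctl show -p ActiveEnterTimestamp --value slurmctld)" +%s)

# Made-up epoch values: the config was touched 600 s after the daemon started.
if config_newer_than_start 1700000600 1700000000; then
  echo "config edited after slurmctld start"
else
  echo "config older than slurmctld start"
fi
```

A positive result only indicates that the running controller may not have loaded the latest file; what to do about it remains the user's decision.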
Section D: Action Reason Mismatch

scontrol update state=fail reason=... was issued with a reason that does not match
Action:Reboot or Action:Replace exactly. HyperPod silently ignores anything else.
The script detects near-misses on nodes in the fail state.
Required strings (case-sensitive, no extra whitespace or punctuation):
- Action:Reboot
- Action:Replace

Link: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance.html
Context: references/slurm-details.md § Action reason-string validation.
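
One way to picture near-miss detection is to compare the reason first verbatim and then after normalization. This is a sketch, not the script's actual rule: `check_action_reason` is a hypothetical name, and the normalization (lowercase, letters only) is an assumption about what counts as a near-miss.

```shell
# Hypothetical helper: classify a reason string as exact, near-miss, or other.
check_action_reason() (
  reason=$1
  # Exact, case-sensitive match on the two required strings.
  case "$reason" in
    Action:Reboot|Action:Replace) echo "exact"; exit 0 ;;
  esac
  # Normalize: lowercase, then drop everything except letters.
  norm=$(printf '%s' "$reason" | tr '[:upper:]' '[:lower:]' | tr -cd 'a-z')
  case "$norm" in
    actionreboot|actionreplace) echo "near-miss" ;;
    *) echo "other" ;;
  esac
)

# Made-up reason strings.
for r in 'Action:Replace' 'action:replace' 'Action: Reboot' 'bad gpu'; do
  echo "$r -> $(check_action_reason "$r")"
done
```

A near-miss result explains why a node sits in fail with no recovery: the string resembles a required reason but would not be matched exactly.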
Section E: Auto-resume

--auto-resume=1 is an srun step option. It re-runs the step after HMA (the Health
Monitoring Agent) flags a node and Automatic node recovery replaces it.
Why it didn't restart the job:
- sbatch not srun: the option is per-step; sbatch directives are silently ignored.
- NodeRecovery is None: faulty nodes are labeled but not replaced.

Link: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm-auto-resume.html
Context: references/slurm-details.md § HyperPod auto-resume.
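
The first cause above (the flag given to sbatch rather than srun) can be spotted by reading the job script. The sketch below is an assumption-laden illustration: `auto_resume_usage` is a hypothetical name, and the patterns only catch the flag when it sits on the same line as `srun` or `#SBATCH`, so continuation lines and wrapper scripts are missed.

```shell
# Hypothetical helper: report where --auto-resume appears in a batch script.
# Read-only: it only greps the file passed as the first argument.
auto_resume_usage() (
  script=$1
  if grep -Eq '^[[:space:]]*srun[[:space:]].*--auto-resume=1' "$script"; then
    echo "srun-step"          # flag is on an srun line, where it takes effect
  elif grep -Eq '^[[:space:]]*#SBATCH[[:space:]].*--auto-resume' "$script"; then
    echo "sbatch-directive"   # flag is only an sbatch directive and is ignored
  else
    echo "absent"
  fi
)

# Usage (job.sbatch is a placeholder file name):
#   auto_resume_usage job.sbatch
```

A result of sbatch-directive or absent matches the first cause listed; the NodeRecovery setting has to be checked separately on the cluster.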
| Condition | Next skill |
| --------------------------------------------------------------- | ------------------------------------- |
| Node returns to down shortly after a manual resume | hyperpod-node-debugger (hardware) |
| slurmd logs contain CUDA / NVIDIA / XID errors | hyperpod-node-debugger § G |
| Disk full or /dev/shm exhausted | hyperpod-node-debugger § I |
| Node unreachable via SSM | hyperpod-ssm |
| Controller restart does not clear COMPLETING after 2 attempts | hyperpod-issue-report + AWS Support |