plugins/sagemaker-ai/skills/hyperpod-ssm/SKILL.md
Remote command execution and file transfer on SageMaker HyperPod cluster nodes via AWS Systems Manager (SSM). This is the primary interface for accessing HyperPod nodes — direct SSH is not available. Use when any skill, workflow, or user request needs to execute commands on cluster nodes, upload files to nodes, read/download files from nodes, run diagnostics, install packages, or perform any operation requiring shell access to HyperPod instances. Other HyperPod skills depend on this skill for all node-level operations.
Prerequisites:
- `aws` CLI v2, authenticated for the target account/Region.
- `session-manager-plugin`, installed alongside the AWS CLI.
- `jq`: the scripts build JSON payloads with it.
- `unbuffer` (from the `expect` package): wraps `aws ssm start-session` with a PTY so the session-manager-plugin flushes stdout instead of racing to close. Without it, calls intermittently return empty output with `Cannot perform start session: EOF` even when the command ran. Install with `sudo yum install expect`, `sudo apt install expect`, or `brew install expect`. `ssm-exec.sh` detects and uses it automatically, and falls back with a warning if it is missing.

Target: sagemaker-cluster:<CLUSTER_ID>_<GROUP_NAME>-<INSTANCE_ID>
- CLUSTER_ID: Last segment of the cluster ARN (NOT the cluster name). Extract via `get-cluster-info.sh`.
- GROUP_NAME: Instance group name. Retrieve via `list-nodes.sh`.
- INSTANCE_ID: EC2 instance ID (e.g., i-0123456789abcdef0).

Three scripts under scripts/. Resolve cluster info and nodes once, then execute per node.
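The target format above can be assembled with a few lines of shell. This is a minimal sketch, not part of the skill's scripts: the function names `cluster_id_from_arn` and `build_target` are hypothetical, and the ARN, group name, and instance ID are made-up example values.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; the skill's own scripts may differ.

# Print the last "/"-separated segment of a cluster ARN (the CLUSTER_ID).
cluster_id_from_arn() {
  printf '%s\n' "${1##*/}"
}

# Assemble sagemaker-cluster:<CLUSTER_ID>_<GROUP_NAME>-<INSTANCE_ID>.
build_target() {
  local cluster_id=$1 group=$2 instance_id=$3
  printf 'sagemaker-cluster:%s_%s-%s\n' "$cluster_id" "$group" "$instance_id"
}

# Example with made-up values:
cid=$(cluster_id_from_arn "arn:aws:sagemaker:us-west-2:111122223333:cluster/abc123xyz")
build_target "$cid" "workers" "i-0123456789abcdef0"
```

The resulting string can be passed to `ssm-exec.sh` through `--target`.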
scripts/get-cluster-info.sh CLUSTER_NAME [--region REGION]
# Output: {"cluster_id":"...","cluster_arn":"...","cluster_name":"...","region":"..."}
scripts/list-nodes.sh CLUSTER_NAME [--region REGION] [--instance-group GROUP] [--instance-id ID]
# Output: JSON array of ClusterNodeSummaries (InstanceId, InstanceGroupName, InstanceStatus, etc.)
list-cluster-nodes paginates at 100 nodes. This script handles pagination automatically.
# Execute — with pre-built target
scripts/ssm-exec.sh --target "sagemaker-cluster:CLUSTERID_GROUP-INSTANCEID" 'command' [--region REGION]
# Execute — with parts
scripts/ssm-exec.sh --cluster-id ID --group GROUP --instance-id INSTANCE_ID 'command' [--region REGION]
# Upload
scripts/ssm-exec.sh --target TARGET --upload LOCAL_PATH REMOTE_PATH [--region REGION]
# Read remote file
scripts/ssm-exec.sh --target TARGET --read REMOTE_PATH [--region REGION]
SSM start-session rate limit: 3 TPS per account. Plan batch size and delay accordingly.
aws ssm send-command does NOT support sagemaker-cluster: targets — only start-session works.
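One way to respect the start-session rate limit when visiting many nodes is to pause between calls. The sketch below is an assumption about how such pacing could look, not the skill's implementation: `run_paced` is a hypothetical helper, and the 0.5 second delay in the usage comment is a conservative guess rather than a documented value.

```shell
#!/usr/bin/env bash
# Hypothetical pacing helper; not part of the skill's scripts.

# run_paced DELAY RUNNER ITEM...
# Calls RUNNER once per ITEM, sleeping DELAY seconds between calls
# (no sleep before the first call).
run_paced() {
  local delay=$1 runner=$2
  shift 2
  local first=1 item
  for item in "$@"; do
    if [ "$first" -eq 0 ]; then
      sleep "$delay"
    fi
    first=0
    "$runner" "$item"
  done
}

# Usage sketch (targets are placeholders):
#   exec_one() { scripts/ssm-exec.sh --target "$1" 'nvidia-smi'; }
#   run_paced 0.5 exec_one "$TARGET_A" "$TARGET_B"
```

Running nodes sequentially with a fixed delay keeps the call rate predictable; parallel batches would need the same budget divided across workers.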
When the scripts aren't suitable, use aws ssm start-session directly with AWS-StartNonInteractiveCommand. Wrap every invocation in unbuffer — without it, stdout is intermittently empty (see Prerequisites).
cat > /tmp/cmd.json << 'EOF'
{"command": ["bash -c 'echo hello && whoami'"]}
EOF
unbuffer aws ssm start-session \
--target sagemaker-cluster:{CLUSTER_ID}_{GROUP_NAME}-{INSTANCE_ID} \
--region REGION \
--document-name AWS-StartNonInteractiveCommand \
--parameters file:///tmp/cmd.json
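The parameters file in the example above can also be generated with jq, which takes care of JSON quoting when the command contains quotes or backslashes. This is a minimal sketch under that assumption; `make_params_json` is a hypothetical name, and the skill's scripts may build the payload differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not part of the skill's scripts. Requires jq.

# Emit {"command":[<string>]} for AWS-StartNonInteractiveCommand.
# --arg passes the command in as a JSON string, so jq does the escaping.
make_params_json() {
  jq -cn --arg cmd "$1" '{command: [$cmd]}'
}

# Usage sketch (file path, TARGET, and REGION are placeholders):
#   make_params_json "bash -c 'df -h && lsblk'" > /tmp/cmd.json
#   unbuffer aws ssm start-session --target TARGET --region REGION \
#     --document-name AWS-StartNonInteractiveCommand \
#     --parameters file:///tmp/cmd.json
```

Note that this only handles the JSON layer. A script that itself contains single quotes still needs care inside `bash -c '...'`.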
- Pass `--parameters` as a `file://` path: inline parameters break with special characters.
- The `command` parameter is argv, not shell input. Wrap multi-statement scripts in `bash -c '...'` so pipes, semicolons, and redirects evaluate.

| Task | Command |
| ---------------- | -------------------------------------------------------------- |
| Lifecycle logs | cat /var/log/provision/provisioning.log |
| Memory | free -h |
| Disk/mounts | df -h && lsblk |
| GPU status | nvidia-smi |
| GPU memory | nvidia-smi --query-gpu=memory.used,memory.total --format=csv |
| EFA/network | fi_info -p efa |
| CloudWatch agent | sudo systemctl status amazon-cloudwatch-agent |
| Top processes | ps aux --sort=-%mem \| head -20 |
- root.
- `--document-name` to get a shell.
- `AWS-StartNonInteractiveCommand`.