plugins/sagemaker-ai/skills/hyperpod-issue-report/SKILL.md
Generate comprehensive issue reports from HyperPod clusters (EKS and Slurm) by collecting diagnostic logs and configurations for troubleshooting and AWS Support cases. Use when users need to collect diagnostics from HyperPod cluster nodes, generate issue reports for AWS Support, investigate node failures or performance problems, document cluster state, or create diagnostic snapshots. Triggers on requests involving issue reports, diagnostic collection, support case preparation, or cluster troubleshooting that requires gathering logs and system information from multiple nodes.
npx skillsauth add awslabs/agent-plugins hyperpod-issue-report
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Collect diagnostic logs from HyperPod cluster nodes via SSM, store results in S3. Supports both EKS and Slurm clusters with auto-detection. Uses the bundled scripts/hyperpod_issue_report.py for reliable parallel collection.
Required IAM permissions:
- sagemaker:DescribeCluster, sagemaker:ListClusterNodes, ssm:StartSession, s3:PutObject, s3:GetObject, eks:DescribeCluster
- s3:GetObject/s3:PutObject on the report bucket

Collect from the user:
- Cluster name or ARN (e.g., arn:aws:sagemaker:us-west-2:123456789012:cluster/abc123)
- S3 path for the report (e.g., s3://bucket/prefix). If the user doesn't have a bucket, create one (e.g., s3://hyperpod-diagnostics-<account-id>-<region>)

aws sts get-caller-identity
aws sagemaker describe-cluster --cluster-name <name-or-arn> --region <region>
If the S3 bucket doesn't exist, create it:
aws s3 mb s3://<bucket-name> --region <region>
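The inputs collected above can be checked before any AWS call is made. The following is a minimal sketch, not part of the bundled script: the helper names `parse_cluster_arn`, `split_s3_path`, and `default_bucket_name` are hypothetical, and the default bucket name simply follows the `hyperpod-diagnostics-<account-id>-<region>` pattern suggested above.

```python
from typing import NamedTuple, Tuple


class ClusterArn(NamedTuple):
    region: str
    account_id: str
    cluster_id: str


def parse_cluster_arn(arn: str) -> ClusterArn:
    """Split a SageMaker cluster ARN into region, account ID, and cluster ID."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sagemaker":
        raise ValueError(f"not a SageMaker ARN: {arn!r}")
    resource_type, _, cluster_id = parts[5].partition("/")
    if resource_type != "cluster" or not cluster_id:
        raise ValueError(f"not a cluster ARN: {arn!r}")
    return ClusterArn(region=parts[3], account_id=parts[4], cluster_id=cluster_id)


def split_s3_path(s3_path: str) -> Tuple[str, str]:
    """Split s3://bucket[/prefix] into (bucket, prefix); prefix may be empty."""
    if not s3_path.startswith("s3://"):
        raise ValueError(f"S3 path must start with s3://: {s3_path!r}")
    bucket, _, prefix = s3_path[len("s3://"):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {s3_path!r}")
    return bucket, prefix.strip("/")


def default_bucket_name(account_id: str, region: str) -> str:
    """Hypothetical helper: build the suggested fallback bucket name."""
    return f"hyperpod-diagnostics-{account_id}-{region}"
```

When the user supplies an ARN, the region and account ID parsed from it can feed `default_bucket_name`, so the fallback bucket lands in the same region as the cluster.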
For EKS clusters (check Orchestrator.Eks in describe-cluster output):
Ensure kubectl is installed (which kubectl). If missing, install it for the current platform.
Configure kubeconfig using the EKS cluster name from the describe-cluster response:
aws eks update-kubeconfig --name <eks-cluster-name> --region <region>
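The EKS check above can be sketched in Python against the JSON returned by describe-cluster. This is an illustration only: `detect_orchestrator` is a hypothetical helper, it assumes the EKS configuration appears under `Orchestrator.Eks.ClusterArn`, and it treats a response without `Orchestrator.Eks` as a Slurm cluster, which is an assumption rather than documented behavior of the bundled script.

```python
from typing import Optional, Tuple


def detect_orchestrator(describe_cluster: dict) -> Tuple[str, Optional[str]]:
    """Return ("eks", <eks-cluster-name>) or ("slurm", None)."""
    eks = (describe_cluster.get("Orchestrator") or {}).get("Eks")
    if not eks:
        # Assumption: no Orchestrator.Eks block means a Slurm cluster.
        return "slurm", None
    # EKS cluster ARNs end in "cluster/<name>"; keep only the name.
    eks_cluster_name = eks["ClusterArn"].rsplit("/", 1)[-1]
    return "eks", eks_cluster_name
```

The returned name is the value to pass to `aws eks update-kubeconfig --name`.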
uv run scripts/hyperpod_issue_report.py \
--cluster <cluster-name-or-arn> \
--region <region> \
--s3-path s3://<bucket>[/prefix]
Use --help for all options including --instance-groups, --nodes, --command, --max-workers, and --debug. Note: --instance-groups and --nodes are mutually exclusive. Node identifiers accept instance IDs (i-*), EKS names (hyperpod-i-*), or Slurm names (ip-*).
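To make the targeting rules concrete, here is a rough sketch of how mutually exclusive `--instance-groups` and `--nodes` options and the three node identifier forms could be handled with `argparse`. It is not the bundled script's implementation: `classify_node` and `build_parser` are hypothetical, the regular expressions are a guess at the `i-*`, `hyperpod-i-*`, and `ip-*` shapes, and accepting multiple space-separated values per flag is an assumption.

```python
import argparse
import re

# Order matters only for readability; each pattern is anchored at both ends.
_NODE_PATTERNS = (
    ("eks-name", re.compile(r"^hyperpod-i-[0-9a-f]+$")),
    ("instance-id", re.compile(r"^i-[0-9a-f]+$")),
    ("slurm-name", re.compile(r"^ip-.+$")),
)


def classify_node(identifier: str) -> str:
    """Label a node identifier as an instance ID, EKS name, or Slurm name."""
    for kind, pattern in _NODE_PATTERNS:
        if pattern.match(identifier):
            return kind
    raise ValueError(f"unrecognized node identifier: {identifier!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="hyperpod_issue_report.py")
    parser.add_argument("--cluster", required=True)
    parser.add_argument("--region", required=True)
    parser.add_argument("--s3-path", required=True)
    # argparse rejects a command line that supplies both options in this group.
    target = parser.add_mutually_exclusive_group()
    target.add_argument("--instance-groups", nargs="+")
    target.add_argument("--nodes", nargs="+")
    return parser
```

With this layout, passing both flags makes `argparse` exit with a usage error before any collection starts, which matches the mutual exclusivity noted above.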
After collection, the script shows statistics and offers interactive download. Report the S3 location and offer to:
See references/troubleshooting.md for error handling, large cluster tuning, and known limitations.