plugins/sagemaker-ai/skills/hyperpod-version-checker/SKILL.md
Check and compare software component versions on SageMaker HyperPod cluster nodes - NVIDIA drivers, CUDA toolkit, cuDNN, NCCL, EFA, AWS OFI NCCL, GDRCopy, MPI, Neuron SDK (Trainium/Inferentia), Python, and PyTorch. Use when checking component versions, verifying CUDA/driver compatibility, detecting version mismatches across nodes, planning upgrades, documenting cluster configuration, or troubleshooting version-related issues on HyperPod. Triggers on requests about versions, compatibility, component checks, or upgrade planning for HyperPod clusters.
npx skillsauth add awslabs/agent-plugins hyperpod-version-checkerInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Upload to cluster nodes via hyperpod-ssm skill, then execute.
# Text report to console + file
bash hyperpod_check_versions.sh
# JSON only to stdout (text report still saved to file) — best for piping/parsing
bash hyperpod_check_versions.sh --json
# Custom output file
bash hyperpod_check_versions.sh --output /tmp/versions.txt
# No color (for logging)
bash hyperpod_check_versions.sh --no-color
Output file: component_versions_<hostname>_<timestamp>.txt (default)
| Component | Detection Method | Applicable When |
| ----------------- | ----------------------------------------------- | --------------------------------------------- |
| NVIDIA Driver | nvidia-smi | GPU instances (p3/p4/p5/g5) |
| CUDA Toolkit | nvcc, /usr/local/cuda symlink | GPU instances |
| cuDNN | Header file, packages | GPU instances doing deep learning |
| NCCL | Library filename, header, packages | Distributed GPU training |
| EFA | /opt/amazon/efa_installed_packages, fi_info | EFA-capable instances (p4d/p4de/p5/trn1/trn2) |
| AWS OFI NCCL | efa_installed_packages, library search | EFA + NCCL workloads |
| GDRCopy | rpm/dpkg, kernel module | GPU instances with RDMA (p4d+/p5) |
| MPI | mpirun, /opt/amazon/openmpi | Distributed training |
| Neuron SDK | neuronx-cc, neuron-ls, packages | Trainium/Inferentia (trn1/trn2/inf1/inf2) |
| Python/PyTorch | python3, torch import | ML workloads |
| Container runtime | docker, containerd, kubectl, nvidia-ctk | EKS clusters |
Run on each node individually via the hyperpod-ssm skill. With --json, stdout is clean JSON for easy diffing.
The script automatically analyzes CUDA/driver compatibility. For reference:
| Driver Series | Supported CUDA | | ------------- | ----------------------------- | | 580+ | 13.x, 12.x, 11.x | | 570+ | 12.8+ (Blackwell), 12.x, 11.x | | 545+ | 12.3-12.7, 11.x | | 525-535 | 12.0-12.2, 11.x | | 450+ | 11.x only |
NCCL: Use 2.18+ for CUDA 12.x, 2.12+ for CUDA 11.x. Must be consistent across all nodes.
| EFA Installer | AWS OFI NCCL | | ------------- | --------------------- | | 1.29+ | v1.7.3+ (recommended) | | 1.26-1.28 | v1.7.0-v1.7.2 | | 1.20-1.25 | v1.6.0+ |
development
Deploy to AWS Elastic Beanstalk. Triggers on: elastic beanstalk, EB, managed EC2 platform, web app with managed patching, worker on EC2, Heroku alternative, don't want to manage servers or containers, migrate from Heroku, managed operational lifecycle. Covers Elastic Beanstalk on EC2 for web and worker applications.
testing
Evaluate, configure, and migrate workloads to AWS Lambda Managed Instances (LMI). Triggers on: Lambda Managed Instances, LMI, capacity provider, multi-concurrency Lambda, dedicated instance Lambda, EC2-backed Lambda, cold start elimination, Graviton Lambda, instance type for Lambda, Lambda cost optimization with Reserved Instances or Savings Plans. Also trigger when users describe high-volume predictable workloads seeking cost savings, or compare Lambda vs EC2 for steady-state traffic. For standard Lambda without LMI, use the aws-lambda skill instead.
development
Deploy applications to AWS. Triggers on phrases like: deploy to AWS, host on AWS, run this on AWS, AWS architecture, estimate AWS cost, generate infrastructure. Analyzes any codebase and deploys to optimal AWS services.
development
Build with Aurora DSQL — manage schemas, execute queries, handle migrations, diagnose query plans, load data, and develop applications with a serverless, distributed SQL database. Covers IAM auth, multi-tenant patterns, MySQL-to-DSQL migration, DDL operations, query plan explainability, SQL compatibility validation, and bulk data loading. Triggers on phrases like: DSQL, Aurora DSQL, create DSQL table, DSQL schema, migrate to DSQL, distributed SQL database, serverless PostgreSQL-compatible database, DSQL query plan, DSQL EXPLAIN ANALYZE, why is my DSQL query slow, aurora-dsql-loader, load CSV into DSQL.