.github/skills/infrastructure/SKILL.md
Deploy and manage Azure cloud infrastructure for the Physical AI Toolchain — Terraform IaC, AKS cluster configuration, GPU node pools, and network topology.
| Tool | Requirement |
|------|-------------|
| Azure CLI | Authenticated with `az login` |
| Terraform | 1.5+ |
| kubectl | Matching cluster version |
| Helm | 3.x |
| shellcheck | For script validation |
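A quick preflight check for the tools in the table can be sketched as below. This is a minimal sketch, not part of the repository: the function name `missing_tools` is hypothetical, and it only checks that each binary is on `PATH`, not that its version meets the requirement.

```shell
# Print every tool that is not on PATH; return non-zero if any is missing.
# `missing_tools` is a hypothetical helper, not a script shipped with the repo.
missing_tools() {
  local tool status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      status=1
    fi
  done
  return "$status"
}

# Example call using the binaries behind the table above (`az` is the Azure CLI):
# missing_tools az terraform kubectl helm shellcheck
```

Version checks (Terraform 1.5+, Helm 3.x) still need to be done separately, for example by reading `terraform version` and `helm version`.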
Follow these steps in order for a complete deployment.
source infrastructure/terraform/prerequisites/az-sub-init.sh
Exports `ARM_SUBSCRIPTION_ID` and validates Azure CLI authentication.
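Because the script exports a variable, it has to be sourced rather than executed; otherwise the export is lost when the child shell exits. A small guard like the following can confirm the variable is present before running Terraform. It is a sketch under that assumption, and `require_subscription` is a hypothetical name.

```shell
# Fail early if the subscription ID was not exported into the current shell.
# `require_subscription` is a hypothetical helper used only for illustration.
require_subscription() {
  if [ -z "${ARM_SUBSCRIPTION_ID:-}" ]; then
    echo "ARM_SUBSCRIPTION_ID is not set; source az-sub-init.sh first" >&2
    return 1
  fi
  echo "Using subscription ${ARM_SUBSCRIPTION_ID}"
}
```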
cd infrastructure/terraform
cp terraform.tfvars.example terraform.tfvars
Edit `terraform.tfvars` with environment-specific values. Example configurations are in `infrastructure/examples/`:
| File | Scenario |
|------|----------|
| terraform.tfvars.dev | Single spot GPU pool, public networking |
| terraform.tfvars.prod | Multiple GPU pools, full private networking, HA |
| terraform.tfvars.hybrid | Private data services, public AKS API server |
terraform init
terraform plan -var-file=terraform.tfvars
terraform apply -var-file=terraform.tfvars
A VPN deployment is required when `should_enable_private_aks_cluster = true`:
cd infrastructure/terraform/vpn
terraform init && terraform apply
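To decide whether the VPN step applies, the flag can be read back from the tfvars file. The sketch below assumes the flag is written as a plain `true`/`false` literal on its own line; `private_aks_setting` is a hypothetical helper, and it reports `unset` when the file does not mention the flag, in which case the variable's default applies.

```shell
# Print the value of should_enable_private_aks_cluster in a tfvars file,
# or "unset" if the file does not assign it (the Terraform default then applies).
# Assumes a simple `name = true|false` assignment; this is not a full HCL parser.
private_aks_setting() {
  local file="$1" value
  value="$(sed -En 's/^[[:space:]]*should_enable_private_aks_cluster[[:space:]]*=[[:space:]]*(true|false).*/\1/p' "$file" | tail -n 1)"
  echo "${value:-unset}"
}

# Example: private_aks_setting infrastructure/terraform/terraform.tfvars
```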
az aks get-credentials --resource-group <rg> --name <aks>
kubectl cluster-info
cd infrastructure/setup
./01-deploy-robotics-charts.sh
./02-deploy-azureml-extension.sh
./03-deploy-osmo-control-plane.sh
./04-deploy-osmo-backend.sh
Scripts must run in numeric order. Each supports `--config-preview` for dry-run output.
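One way to review all four scripts before deploying is to run each with `--config-preview` in order and stop at the first failure. The loop below is a sketch: `preview_setup_scripts` is a hypothetical name, it assumes the scripts are bash scripts, and it relies on the shell expanding the glob in sorted order, which matches the numeric prefixes.

```shell
# Run every numbered setup script in dry-run mode, in numeric order.
# `preview_setup_scripts` is a hypothetical helper; the directory defaults to
# infrastructure/setup relative to the repository root.
preview_setup_scripts() {
  local dir="${1:-infrastructure/setup}" script
  for script in "$dir"/0[1-4]-deploy-*.sh; do
    # An unmatched glob stays literal, so check that the file exists.
    [ -e "$script" ] || { echo "no setup scripts found in $dir" >&2; return 1; }
    echo "==> $script --config-preview"
    bash "$script" --config-preview || return 1
  done
}
```

Running the real deployment follows the same order without the `--config-preview` flag.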
Three network modes control connectivity and security:
| Mode | should_enable_private_endpoint | should_enable_private_aks_cluster | VPN Required |
|------|----------------------------------|-------------------------------------|--------------|
| Full Private | true | true | Yes |
| Hybrid | true | false | No |
| Full Public | false | false | No |
Full Private is the default and is recommended for production. Hybrid mode allows `kubectl` access without a VPN while keeping data services private.
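The table can be read as a lookup from the two flags to a mode. The function below is only an illustration of that mapping: `network_mode` is a hypothetical name, and the combination `false`/`true`, which the table does not list, is treated as unsupported here as an assumption.

```shell
# Map the two Terraform flags to the network mode named in the table.
# Arguments: <should_enable_private_endpoint> <should_enable_private_aks_cluster>
network_mode() {
  local private_endpoint="$1" private_aks="$2"
  case "${private_endpoint}:${private_aks}" in
    true:true)   echo "Full Private (VPN required)" ;;
    true:false)  echo "Hybrid (no VPN)" ;;
    false:false) echo "Full Public (no VPN)" ;;
    *)
      # Not one of the three listed modes (assumption: treat as an error).
      echo "unlisted combination: ${private_endpoint}/${private_aks}" >&2
      return 1
      ;;
  esac
}
```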
cd infrastructure/terraform
terraform plan -var-file=terraform.tfvars
terraform apply -var-file=terraform.tfvars
terraform destroy -var-file=terraform.tfvars
cd infrastructure/terraform/vpn
terraform init && terraform apply
cd infrastructure/terraform/dns
terraform init && terraform apply
shellcheck infrastructure/setup/01-deploy-robotics-charts.sh
infrastructure/setup/01-deploy-robotics-charts.sh --config-preview
terraform fmt -check -recursive infrastructure/terraform/
infrastructure/
├── terraform/ # Infrastructure as Code
│ ├── main.tf # Module composition
│ ├── variables.tf # Input variables
│ ├── outputs.tf # Output values
│ ├── versions.tf # Provider requirements
│ ├── terraform.tfvars.example # Example configuration
│ ├── prerequisites/ # Azure subscription setup
│ ├── modules/ # Terraform modules
│ ├── vpn/ # Standalone VPN deployment
│ ├── automation/ # Standalone automation deployment
│ └── dns/ # Standalone DNS deployment
├── setup/ # Post-deploy cluster configuration
│ ├── 01-deploy-robotics-charts.sh # GPU Operator, KAI Scheduler
│ ├── 02-deploy-azureml-extension.sh # AzureML K8s extension
│ ├── 03-deploy-osmo-control-plane.sh # OSMO control plane
│ ├── 04-deploy-osmo-backend.sh # OSMO backend services
│ ├── defaults.conf # Central version and namespace config
│ └── lib/ # Shared shell libraries
├── specifications/ # Domain specification documents
└── examples/ # Example tfvars configurations
| GPU | VM SKU | Driver Source | gpu_driver | MIG Strategy |
|-----|--------|--------------|--------------|--------------|
| A10 | Standard_NV36ads_A10_v5 | AKS-managed | Install | N/A |
| RTX PRO 6000 | Standard_NC128ds_xl_RTXPRO6000BSE_v6 | GRID DaemonSet | None | single |
| H100 | Standard_NC40ads_H100_v5 | GPU Operator | None | Disabled |
RTX PRO 6000 nodes require the `nvidia.com/gpu.deploy.driver=false` label to prevent GPU Operator driver conflicts.
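The source does not say how this label is applied; in practice it may be set on the node pool itself. For checking or patching a single node by hand, the command can be composed as in this sketch, where `label_no_driver` is a hypothetical helper that only prints the `kubectl` command so it can be reviewed before running.

```shell
# Print (not run) the kubectl command that sets the label on one node.
# `label_no_driver` is a hypothetical helper; the node name is supplied by the caller.
label_no_driver() {
  local node="$1"
  echo kubectl label node "$node" nvidia.com/gpu.deploy.driver=false --overwrite
}

# To list nodes that already carry the label:
# kubectl get nodes -l nvidia.com/gpu.deploy.driver=false
```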
| Guide | Description |
|-------|-------------|
| Infrastructure README | Domain overview and quick start |
| Terraform README | Terraform configuration reference |
| Setup README | Setup script reference |
| Infrastructure Deployment | Full deployment walkthrough |
| GPU Configuration | Detailed GPU driver and operator reference |