skills/osmo-deploy/SKILL.md
How to deploy OSMO to a Kubernetes cluster on Azure (AKS), AWS (EKS), MicroK8s (single-node), or any kubectl-reachable cluster (BYO). Use this skill whenever the user asks to install, deploy, set up, or stand up OSMO; whenever they ask to provision an OSMO cluster; whenever they mention deploy-osmo-minimal.sh, deploy-k8s.sh, or "OSMO helm install"; whenever they ask to wire up workflow storage (MinIO / Azure Blob / S3); or whenever they ask to add a GPU pool to an OSMO cluster, install KAI scheduler, install the NVIDIA GPU Operator, or run the post-install smoke tests. Targets OSMO 6.3 (ConfigMap mode).
Activate when the user asks to install, deploy, set up, stand up, or provision OSMO; when they reference deploy-osmo-minimal.sh / deploy-k8s.sh / "OSMO helm install"; when they ask to wire up workflow storage (MinIO / Azure Blob / S3); or when they ask to add a GPU pool, install KAI Scheduler, install the NVIDIA GPU Operator, or run post-install smoke tests on an OSMO cluster.
Do not activate for general OSMO usage questions (running workflows, CLI usage, troubleshooting a running deployment) — those belong to the osmo-user skill.
This skill requires OSMO >= 6.3 (ConfigMap mode). The earlier 6.2 CLI-write mode is not supported; the chart's HTTP 409 on osmo config update is the runtime signal that the cluster landed on 6.2. If the default latest channel still resolves to a 6.2 release, a deploy with no env-var pins lands on the unsupported variant. You MUST set the three version pins to a 6.3.x release before invoking the script (see "Picking chart, image, and CLI versions" below). Run ./scripts/deploy-osmo-minimal.sh --list-chart-versions to discover the latest available tags.
The canonical entry point is scripts/deploy-osmo-minimal.sh under osmo/external/deployments/. Run from inside that directory:
cd osmo/external/deployments
./scripts/deploy-osmo-minimal.sh --provider <azure|aws|microk8s|byo> [options]
The script orchestrates these phases:
- services.configs.workflow.workflow_*.credential.secretName
- service release (the 6.3 chart bundles router + UI) + backend-operator release. Static base values come from values/service.yaml and values/backend-operator.yaml; per-cluster overrides ride on --set; auto-detected fragments (pod-monitor-on.yaml when prometheus-operator CRDs exist, gpu-pool.yaml when GPU nodes exist) and the storage fragment are layered with additional -f flags.
- verify-hello.yaml (CPU) + verify-gpu.yaml (GPU; skipped under --no-gpu)
- osmo-gateway:9000 (gateway-aware target; falls back to osmo-service when the gateway is disabled) and osmo-ui:3000

| Provider | When to use |
|---|---|
| azure | Cloud install on Azure AKS with managed PostgreSQL + Redis. Optional GPU node pool + Blob storage account when --gpu-node-pool and storage_account_enabled=true. |
| aws | Cloud install on AWS EKS with RDS + ElastiCache. Optional GPU node group via --gpu-node-pool. |
| microk8s | Single-node K8s on a Brev/local Ubuntu box. The script bootstraps MicroK8s itself (snapd, addons, optional NVIDIA addon). |
| byo | A cluster you already have. Skips bootstrap and TF entirely. Required env vars: POSTGRES_HOST POSTGRES_USERNAME POSTGRES_PASSWORD POSTGRES_DB_NAME REDIS_HOST REDIS_PORT REDIS_PASSWORD (IS_PRIVATE_CLUSTER optional, defaults to false). |
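
For the byo row, a small preflight can report which required variables are still unset before the script is invoked. This is a minimal sketch: `missing_byo_vars` is a hypothetical helper, not part of deploy-osmo-minimal.sh, and it only checks that each variable is non-empty.

```shell
# Hypothetical helper (not shipped with the deploy scripts): print the names of
# the BYO-required variables that are unset or empty, space-separated.
missing_byo_vars() {
  local missing="" name value
  for name in POSTGRES_HOST POSTGRES_USERNAME POSTGRES_PASSWORD \
              POSTGRES_DB_NAME REDIS_HOST REDIS_PORT REDIS_PASSWORD; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    [ -n "$value" ] || missing="${missing:+$missing }$name"
  done
  echo "$missing"
}
```

An agent could call `missing_byo_vars` first and prompt only for the names it prints. IS_PRIVATE_CLUSTER is left out because it is optional.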
When the user asks to deploy OSMO without supplying every flag/env, prompt for these inputs first. Map answers to env vars (or --flag equivalents) and only then invoke deploy-osmo-minimal.sh.
Universal prompts (every provider):
1. TF_GPU_NODE_POOL_ENABLED=false, skip prompts 2-4.
2. TF_GPU_COUNT=<n> and TF_GPU_NODE_POOL_ENABLED=true.
3. Standard_NC40ads_H100_v5) or AWS instance type (e.g. p4d.24xlarge). If the user gives an informal name (H100, A10, T4), translate to the canonical SKU: H100 → Standard_NC40ads_H100_v5, A10 → Standard_NV36ads_A10_v5, T4 → Standard_NC4as_T4_v3 on Azure. Set TF_GPU_VM_SIZE=<sku>. Default to Standard_NC40ads_H100_v5 on Azure / p4d.24xlarge on AWS if the user leaves it blank.
4. idk.
   - idk → call ./scripts/deploy-osmo-minimal.sh --find-gpu-region "$TF_GPU_VM_SIZE" "$TF_GPU_COUNT". The script iterates TF_REGION_CANDIDATES (env-overridable; default covers H100-likely Azure regions: eastus2 swedencentral westus3 southcentralus westeurope) and prints the first region whose quota fits. Set TF_REGION to that. Exits non-zero if no candidate has quota; surface the error to the user with a suggestion to expand TF_REGION_CANDIDATES.
   - TF_REGION to it directly.

Azure-only prompts (provider=azure, when not already supplied via flags/env):
- --subscription-id or TF_SUBSCRIPTION_ID, default to $(az account show --query id -o tsv) and ask the user to confirm or override.
- --resource-group or TF_RESOURCE_GROUP. The deploy preflight auto-creates the group (tagged osmo-deploy-managed=true) when TF_RESOURCE_GROUP is set and the group doesn't already exist, so the agent should pass the chosen name and let the script handle creation. Check with az group show -n <rg> if you want to see whether it pre-exists; create manually with az group create -n <rg> -l <region> only if you want the group to outlive future terraform destroy runs.

AWS-only prompts: AWS region (--aws-region) and profile (--aws-profile). The defaults (us-west-2 / default) are usually fine; only re-prompt if the user explicitly didn't pick a region.
Once these are collected, invoke the script in --non-interactive mode with the answers passed as env vars (or --flag equivalents) — that avoids the script re-asking the same questions and keeps the agent's prompts as the single source of truth.
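
The informal-name translation for Azure described in the prompts can be sketched as a lookup. `azure_gpu_sku` is a hypothetical helper name; it covers only the three mappings and the Azure default stated in this document, and it passes any other input through on the assumption that it is already a canonical SKU.

```shell
# Hypothetical helper: map an informal GPU name to the Azure VM size listed in
# this skill. Unknown input is echoed back unchanged (assumed to be canonical).
azure_gpu_sku() {
  case "${1:-}" in
    "")        echo "Standard_NC40ads_H100_v5" ;;  # blank answer: documented Azure default
    H100|h100) echo "Standard_NC40ads_H100_v5" ;;
    A10|a10)   echo "Standard_NV36ads_A10_v5" ;;
    T4|t4)     echo "Standard_NC4as_T4_v3" ;;
    *)         echo "$1" ;;
  esac
}
```

For example, `export TF_GPU_VM_SIZE="$(azure_gpu_sku T4)"` would set the variable before the non-interactive run.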
This skill requires OSMO >= 6.3 (ConfigMap mode). Pinning all three env vars below is mandatory, not optional.
The script's default behavior (no env vars set) resolves to:
- OSMO_CHART_VERSION empty → helm picks the latest stable chart in repo
- OSMO_IMAGE_TAG=latest → most recent GA image
- OSMO_CLI_REF=main → bootstraps the latest GA CLI via the upstream install.sh

If the latest GA is still on 6.2, leaving any of these at default lands you on the unsupported CLI-write-mode variant. The Helm install fails (e.g. on Ingress validation), or the CLI's wire format doesn't match the service, or the chart attempts osmo config update and gets HTTP 409 in ConfigMap mode.
# Step 1: discover the latest available 6.3.x chart/image/CLI tags
# (passes --devel so prerelease RCs appear).
./scripts/deploy-osmo-minimal.sh --list-chart-versions
# Step 2: pin all three env vars to the same release before invoking the deploy.
# - Use the latest non-prerelease 6.3.x release if one exists.
# - Otherwise pin to the latest 6.3.x prerelease RC.
export OSMO_CHART_VERSION=<chart version from step 1>
export OSMO_IMAGE_TAG=<matching app/image tag>
export OSMO_CLI_REF=<matching CLI release tag>
The chart version + image/app tag + CLI tag are published together as a release pair. Match them — don't mix.
- OSMO_CHART_VERSION: helm's "latest" resolution can roll forward unexpectedly. Pinning a specific 6.3.x chart prevents both (a) accidentally landing on a 6.2 chart and (b) drifting between deploys.
- OSMO_IMAGE_TAG: must match the chart's expected app version. The chart's templates assume specific image entrypoints and env contracts that change across minor releases.
- OSMO_CLI_REF: the osmo CLI's wire format (auth, workflow submit/get, configmap loading) must match the service. A CLI from a different minor version often connects but fails at the first non-trivial call. The deploy script's install_osmo_cli_if_missing honors OSMO_CLI_REF by downloading the matching installer directly to $HOME/.local/bin (no sudo); override the destination via OSMO_CLI_TARGET.

Both stable and prerelease 6.3.x versions are published to nvidia/osmo. Prerelease tags (*-prerelease-rc*) are hidden from helm search by default, which is why --list-chart-versions passes --devel. The pinning workflow above is identical either way.
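
A guard run before the deploy can catch a pin left at its default. The sketch below is an assumption-laden illustration: `check_osmo_pins` is a hypothetical function, and it assumes every 6.3.x chart, image, and CLI tag contains the substring `6.3.`, which this document does not confirm for all tag formats.

```shell
# Hypothetical guard: fail unless all three pins are set and look like 6.3.x.
# Assumption: every valid pin contains the substring "6.3." (a loose match).
check_osmo_pins() {
  local name value rc=0
  for name in OSMO_CHART_VERSION OSMO_IMAGE_TAG OSMO_CLI_REF; do
    eval "value=\${$name:-}"
    case "$value" in
      *6.3.*) ;;  # looks like a 6.3.x pin
      *) echo "$name is unset or not a 6.3.x pin: '$value'" >&2; rc=1 ;;
    esac
  done
  return "$rc"
}
```

With the defaults described above (empty chart version, `latest`, `main`) the function returns non-zero, so `check_osmo_pins && ./scripts/deploy-osmo-minimal.sh ...` would refuse to start an unpinned deploy.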
Use --storage-backend {auto|minio|azure-blob|s3|byo|none}:
- s3_bucket output → osmo Azure TF storage_account output
- minio addon; otherwise installs the bitnami MinIO chart
- STORAGE_ACCOUNT/STORAGE_KEY env vars first, falls back to osmo Azure TF outputs
- STORAGE_BUCKET/STORAGE_ACCESS_KEY_ID/STORAGE_ACCESS_KEY env vars first, falls back to osmo AWS TF outputs (when s3_bucket_enabled=true). For IAM-role-based auth (IRSA), use --storage-backend byo --auth-method workload-identity instead.
- STORAGE_ACCESS_KEY_ID, STORAGE_ACCESS_KEY, STORAGE_ENDPOINT, optional STORAGE_REGION, STORAGE_OVERRIDE_URL). No resources created.

In static-auth mode the helper writes K8s Secrets (osmo-workflow-{data,log,app}-cred) and a Helm values fragment that the chart consumes via services.configs.workflow.workflow_*.credential.secretName. There are no osmo config update or osmo credential set CLI calls; those return HTTP 409 in 6.3 ConfigMap mode.
--auth-method {static|workload-identity} controls how OSMO services authenticate to the cloud storage backend.
- azure-blob + WI = AKS Workload Identity (UAMI + federated credential)
- byo + WI = AWS IRSA (IAM role + EKS OIDC trust policy)
- minio + WI = not supported (MinIO has no cloud-vendor IdP)

⚠ Workload identity mode requires caller-provisioned cloud-side identity. The deploy scripts do not create the UAMI / IAM role, attach RBAC, or create the federated credential; those are owned by the caller (typically the platform/security team). The script does the K8s-side wiring (SA annotation + pod labels for the AKS WI mutating webhook + DefaultDataCredential values fragment) and surfaces a prominent prerequisite checklist before any work begins. If prerequisites aren't met, OSMO will start successfully but workflows will fail at runtime with 401/403 from the storage backend.
# 1. AKS cluster has OIDC issuer + Workload Identity addons
az aks update -g <rg> -n <cluster> --enable-oidc-issuer --enable-workload-identity
# 2. Provision UAMI
az identity create -g <rg> -n osmo-data-uami
# 3. Grant Storage Blob Data Contributor on the storage account
az role assignment create \
--role "Storage Blob Data Contributor" \
--assignee <UAMI-principal-id> \
--scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>
# 4. Federate UAMI to the chart's ServiceAccount (default name: osmo-minimal)
az identity federated-credential create \
--name osmo-osmo-minimal \
--identity-name osmo-data-uami \
--resource-group <rg> \
--issuer "$(az aks show -g <rg> -n <cluster> --query oidcIssuerProfile.issuerUrl -o tsv)" \
--subject "system:serviceaccount:osmo-minimal:osmo-minimal"
# 5. Run deploy with WI
./scripts/deploy-osmo-minimal.sh --provider byo \
--storage-backend azure-blob --auth-method workload-identity \
--workload-identity-client-id "$(az identity show -g <rg> -n osmo-data-uami --query clientId -o tsv)"
# 1. EKS cluster has an OIDC identity provider (most clusters already do)
aws eks describe-cluster --name <cluster> --query "cluster.identity.oidc.issuer"
# 2. Create IAM role with S3 access + EKS OIDC trust policy
# Trust policy admits: system:serviceaccount:osmo-minimal:osmo-minimal
# 3. Run deploy with WI
./scripts/deploy-osmo-minimal.sh --provider byo \
--storage-backend byo --auth-method workload-identity \
--workload-identity-role-arn arn:aws:iam::<acct>:role/osmo-data-access
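
The backend and auth-method combinations listed earlier can be checked before the script is called. This is a sketch only: `wi_flavor` and the labels it prints are hypothetical, and any combination this document does not describe is rejected rather than guessed.

```shell
# Hypothetical pre-check: print which identity mechanism a
# (storage-backend, auth-method) pair implies, or fail with a reason.
wi_flavor() {
  local backend="$1" auth="$2"
  if [ "$auth" = "static" ]; then
    echo "static"
    return 0
  fi
  case "$backend" in
    azure-blob) echo "aks-workload-identity" ;;  # UAMI + federated credential
    byo)        echo "aws-irsa" ;;               # IAM role + EKS OIDC trust policy
    minio)      echo "unsupported: MinIO has no cloud-vendor IdP" >&2; return 1 ;;
    s3)         echo "for IRSA, use --storage-backend byo instead" >&2; return 1 ;;
    *)          echo "combination not documented: $backend + $auth" >&2; return 1 ;;
  esac
}
```

Failing early on `minio` with workload identity avoids a deploy that starts cleanly and then returns 401/403 at workflow runtime.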
Pass --with-nfs-storage on Azure when a downstream skill on the cluster needs an Azure Files Premium NFS-backed ReadWriteMany StorageClass. The canonical consumer is NIM Operator multi-node inference (its NIMCache + shared-model volumes are RWX-only per https://docs.nvidia.com/nim-operator/latest/multi-node.html); other RWX users (shared dataset caches, KServe RWX models) need it too.
What osmo provisions: a Premium FileStorage Azure Storage Account (VNet-restricted, output as nfs_storage_account) and the four AKS role assignments file.csi.azure.com needs to dynamically provision NFS file shares against it — Network Contributor on the VNet, on the AKS NSG, and on the database NSG, plus Storage Account Contributor scoped to the NFS SA. Without all four, dynamic PVC provisioning fails with LinkedAuthorizationFailed.
What osmo does NOT provision: the StorageClass manifest, the default-SC swap, or any consumer-specific PVC. Those belong to the consumer skill (per separation of concerns — osmo doesn't own RWX, the consumers do). The consumer reads terraform output -raw nfs_storage_account to learn the SA name and uses it to render its own StorageClass against file.csi.azure.com with protocol: nfs.
Cost note: Premium FileStorage SAs bill on provisioned capacity (~$0.16/GiB-month, 100 GiB minimum). Opt in only when a downstream consumer needs RWX.
Without the flag (default): no NFS SA is created. RWX PVCs created later sit Pending forever — the AKS default managed-csi / default classes only support RWO. The error is the consumer's to surface.
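
On the consumer side, rendering the StorageClass from the Terraform output might look like the sketch below. `render_nfs_storageclass` and the class name `azurefile-nfs-rwx` are hypothetical; the `protocol` and `storageAccount` parameters are the Azure Files CSI driver's, and a real consumer may need further parameters or mount options that this document does not specify.

```shell
# Hypothetical consumer-side helper: print a StorageClass manifest that points
# file.csi.azure.com at the NFS storage account named in $1.
render_nfs_storageclass() {
  local sa_name="$1"
  cat <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-nfs-rwx
provisioner: file.csi.azure.com
parameters:
  protocol: nfs
  storageAccount: ${sa_name}
EOF
}
```

A consumer could pipe the result to the cluster with `render_nfs_storageclass "$(terraform output -raw nfs_storage_account)" | kubectl apply -f -`.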
Hand-editable static values live in deployments/values/:
- service.yaml: base values for the service chart (router + UI bundled)
- backend-operator.yaml: base values for the backend-operator chart
- gpu-pool.yaml: opt-in fragment, layered when GPU nodes are detected
- pod-monitor-on.yaml: opt-in fragment, layered when prometheus-operator CRDs are detected

Per-cluster values (PG/Redis hosts, image registry/tag, NGC pull secret name, namespace) are not in those files; they're injected at install time via --set so users can edit the YAML for things that don't change per-cluster. service.yaml mirrors the docs minimal-deploy reference. See the values README for layering details.
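
The layering of base values and opt-in fragments can be pictured as building a list of -f flags. The sketch below is illustrative only: `helm_values_args` is a hypothetical function, the detection results are passed in as plain true/false arguments, and the order of the fragments and which release each one applies to are assumptions not confirmed by this document.

```shell
# Hypothetical sketch: assemble helm -f flags from a base values file plus the
# opt-in fragments. Later -f files override earlier ones in helm.
# Paths containing spaces are not handled.
helm_values_args() {
  local base="$1" has_prom_crds="$2" has_gpu_nodes="$3" storage_fragment="${4:-}"
  local args="-f $base"
  if [ "$has_prom_crds" = "true" ]; then
    args="$args -f values/pod-monitor-on.yaml"
  fi
  if [ "$has_gpu_nodes" = "true" ]; then
    args="$args -f values/gpu-pool.yaml"
  fi
  if [ -n "$storage_fragment" ]; then
    args="$args -f $storage_fragment"
  fi
  echo "$args"
}
```

Per-cluster settings would still be appended as --set flags after this list, matching the split between hand-edited YAML and injected values.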
Security note:
service.yaml ships with the gateway's OAuth2 Proxy + authz disabled (matching the docs minimal example). The gateway then trusts client-supplied x-osmo-{user,roles,allowed-pools} headers. Do not expose this gateway to untrusted networks. For production deploys, use the standard deployment guide path, which keeps OAuth2 + authz enabled.
# Azure with GPU pool + Blob storage
./scripts/deploy-osmo-minimal.sh --provider azure --gpu-node-pool --storage-backend azure-blob
# AWS, CPU only
./scripts/deploy-osmo-minimal.sh --provider aws --no-gpu
# Single-node MicroK8s on a fresh Ubuntu box (with GPU)
./scripts/deploy-osmo-minimal.sh --provider microk8s --gpu --storage-backend minio
# Existing cluster (orion-cluster-azure, etc.) — caller exports DB/Redis env vars first
export POSTGRES_HOST=... REDIS_HOST=... # (full list above)
./scripts/deploy-osmo-minimal.sh --provider byo --storage-backend azure-blob
# Tear down
./scripts/deploy-osmo-minimal.sh --provider <x> --destroy
Every cluster-agnostic install is safe to no-op when its target is already present (CRD checks for KAI, multi-signal detection for GPU Operator, addon/release detection for MinIO). This makes the script safe to layer on top of clusters that already have these components installed (e.g. when an upstream skill provisioned the cluster + KAI + GPU Operator first). Re-runs are also safe — backend-operator tokens are reused if a non-placeholder value already exists in osmo-operator-token.
- osmo CLI not found: the script will install from GitHub on first run; if it fails, install manually then re-run.
- kubectl logs -n osmo-minimal -l app=osmo-service
- nvidia.com/gpu.present=true; kubectl describe node to verify
- pkill -f 'osmo-pf-watchdog:'
- az aks command invoke for kubectl access; the script sets IS_PRIVATE_CLUSTER=true automatically when detected

- install-kai-scheduler.sh, install-gpu-operator.sh, install-minio.sh, configure-storage.sh (+ storage/{minio,azure-blob,s3,byo}.sh), port-forward.sh, verify.sh, microk8s/install.sh
- verify-hello.yaml, verify-gpu.yaml