skills/osmo-agent/SKILL.md
Operate the OSMO CLI to discover GPU resources, submit and monitor workflows, debug PENDING/FAILED/stuck workflows, interpret OSMO errors, surface OSMO workflow Grafana and Kubernetes dashboard links, and publish workflows as OSMO apps. Trigger when the user asks about OSMO pools, quota, GPUs, workflow status/logs/submission, OSMO errors, OSMO apps, or about the Grafana or Kubernetes dashboard for an OSMO workflow — even if they don't say "OSMO" explicitly. Do NOT use for general kubectl install/configuration, raw Kubernetes setup unrelated to an OSMO workflow, NVIDIA hardware/product questions unrelated to OSMO, or non-OSMO compute platforms.
Run, monitor, and debug OSMO workflows from natural-language requests. OSMO is
NVIDIA's cloud platform for robotics compute and data storage; this skill maps
user requests to the right osmo CLI commands and walks the user through failure
diagnosis when workflows go wrong.
- osmo CLI installed and on PATH (verify: osmo --version). If osmo is not found, tell the user to install it from the OSMO public repository and stop. Never fabricate command output to fill the gap.
- An authenticated session (osmo login). If commands return auth errors, ask the user to re-run osmo login and stop until they confirm.
- Access to at least one pool (check with osmo profile list and osmo pool list).
- Non-OSMO compute platforms are out of scope; use kubectl or that platform's own tooling.
- grafana_url and dashboard_url can be null for workflows that haven't started yet or that completed long enough ago that metrics were retired. Surface them as "not available"; never silently omit them.

The agents/ directory contains instructions for specialized subagents. Read them when you need to spawn the relevant subagent.

- agents/workflow-expert.md: workflow generation, resource check, submission, failure diagnosis
- agents/logs-reader.md: log fetching and summarization for monitoring and failure diagnosis

The references/ directory has additional documentation:

- references/workflow-patterns.md: Multi-task, parallel execution, data dependencies, Jinja templating
- references/advanced-patterns.md: Checkpointing, retry/exit behavior, node exclusion
- references/cookbook-fetching.md: How to fetch a cookbook example and decide submission count
- references/resource-check-format.md: Output format spec for resource availability responses
- references/troubleshooting.md: Catalog of common failure modes, exit codes, and fixes
- references/validation-error-recovery.md: Resource sizing rules when submission fails capacity assertions
- references/workflow-status-handling.md: Link rendering, PENDING diagnosis, post-completion follow-ups

Always consult this file (SKILL.md) and the relevant reference file(s) before
running any osmo command — the use cases below specify the right command
sequence, the expected output format, and which reference holds the detail. Do
not guess at command names or flags from memory; follow the use case's steps.
Match the user's intent to a use case below, follow its steps in order, and read the linked reference when the steps say so. For diagnosing failures, jump straight to "Debug a Failed or Stuck Workflow" or the Troubleshooting section near the bottom.
When to use: The user asks what resources, nodes, GPUs, or pools are available (e.g. "what resources are available?", "what nodes can I use?", "do I have GPU quota?", "what pools do I have access to?").
Check accessible pools. Run the following to see which pools the user's profile has access to:
osmo profile list
This returns the user's profile settings, including which pools they belong to.
Check pool resources. Run the following to see GPU availability across all accessible pools:
osmo pool list
By default this shows used/total GPU counts. To see what's free instead:
osmo pool list --mode free
Effective availability = min(Quota Free, Total Free) — both quota and physical
limits apply. Always highlight any LOW-priority opportunity: when a pool has
Quota Free = 0 but Total Free > 0, the user can still submit with
--priority LOW to run on idle capacity (with preemption risk).
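As a rough illustration of the availability rule above, the following Python sketch computes effective availability and flags the LOW-priority opportunity. The `PoolRow` type, its field names, and the sample pool names are hypothetical; the real columns of the pool listing are defined in references/resource-check-format.md.

```python
from dataclasses import dataclass


@dataclass
class PoolRow:
    """One pool's free-GPU figures. Field names are illustrative, not OSMO's schema."""
    name: str
    quota_free: int
    total_free: int


def effective_free(row: PoolRow) -> int:
    # Both the quota limit and the physical limit apply, so take the smaller.
    return min(row.quota_free, row.total_free)


def low_priority_opportunity(row: PoolRow) -> bool:
    # Quota is exhausted but idle GPUs exist: a LOW-priority submission
    # could still run, at the risk of preemption.
    return row.quota_free == 0 and row.total_free > 0


# Hypothetical sample data, not real pool output.
rows = [
    PoolRow("pool-a", quota_free=8, total_free=3),
    PoolRow("pool-b", quota_free=0, total_free=16),
]
for row in rows:
    note = " (LOW-priority opportunity)" if low_priority_opportunity(row) else ""
    print(f"{row.name}: effective free = {effective_free(row)}{note}")
```

Keeping the two checks separate mirrors the text: a pool with zero effective availability may still be worth highlighting.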
references/resource-check-format.md is required reading before generating the
response — it defines column meanings, the grouped-table layout, sorting,
callouts, and GPU-type derivation rules.
When to use: The user wants to submit a job to run on OSMO (e.g. "submit a workflow to run SDG", "run RL training for me", "submit this yaml to OSMO").
If the user also wants monitoring, debugging, or reporting results, use the "Orchestrate a Workflow End-to-End" use case instead.
Get or generate a workflow spec.
If the user provides a workflow YAML, use it as-is. Otherwise, generate one based on
what they want to run. Write the spec to workflow.yaml in the current directory.
When generating a workflow spec:
- To start from a cookbook example, follow references/cookbook-fetching.md. Read it before generating.
- For multi-task workflows, parallel execution, data dependencies, or Jinja templating, see references/workflow-patterns.md for the correct spec patterns.
- For checkpointing, retry/exit behavior, or node exclusion, see references/advanced-patterns.md.
- See references/cookbook-fetching.md as well. Use {{output}} as the placeholder for the output mount path; OSMO substitutes it at runtime.

Ask the user what GPU type they want (e.g. H100, L40, GB200), then check availability using the steps in the "Check Available Resources" use case to confirm the right pool to use.
Ask the user for confirmation with this exact wording:
Would you like me to submit this workflow to this pool?
Then execute the command yourself — do not tell the user to run it. Once confirmed, run:
osmo workflow submit workflow.yaml --pool <pool_name> --set key=value other_key=value
Include --set only when the workflow has Jinja template variables to override
(e.g. --set num_gpu=4). Omit it if the YAML has no template variables.
If the user wants to run the same workflow multiple times (e.g. "submit 2 of these"),
submit the same YAML file multiple times — do not create duplicate YAML files.
Report each workflow ID returned by the CLI so the user can track them.
When quota is exhausted but GPUs are physically free (Quota Free = 0, Total Free > 0):
Offer to submit with --priority LOW, which bypasses quota limits and schedules on
idle capacity. LOW priority jobs may be preempted if quota-holding jobs need those
GPUs, so let the user know before proceeding. If they agree, run:
osmo workflow submit workflow.yaml --pool <pool_name> --priority LOW
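The submission commands above can be assembled programmatically. The sketch below is a minimal illustration that only builds the argument list and runs nothing. The function names and the pool name `my-pool` are hypothetical. Combining `--set` and `--priority` in a single call is an assumption, since the text shows them only in separate commands.

```python
from typing import Mapping, Optional


def build_submit_argv(
    spec_path: str,
    pool: str,
    overrides: Optional[Mapping[str, object]] = None,
    low_priority: bool = False,
) -> list[str]:
    """Assemble the argument list for one `osmo workflow submit` call."""
    argv = ["osmo", "workflow", "submit", spec_path, "--pool", pool]
    if low_priority:
        # Only offered when quota is exhausted but GPUs are physically free.
        argv += ["--priority", "LOW"]
    if overrides:
        # --set takes space-separated key=value pairs; it is omitted entirely
        # when the YAML has no template variables to override.
        argv.append("--set")
        argv.extend(f"{key}={value}" for key, value in overrides.items())
    return argv


def build_repeated_submits(count: int, spec_path: str, pool: str) -> list[list[str]]:
    # Multiple runs reuse the same YAML file: the same argv, issued `count` times.
    return [build_submit_argv(spec_path, pool) for _ in range(count)]


print(build_submit_argv("workflow.yaml", "my-pool", {"num_gpu": 4}))
```

Each resulting list could be passed to `subprocess.run` once the user has confirmed the submission.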
Validation errors: If submission fails with a validation error indicating that
resources failed assertions, read the node capacity values from the error table,
adjust the hard-coded values in the resources section of workflow.yaml, and
resubmit. The exact sizing rules (storage/memory/CPU caps, GPU pairing, proportional
scaling) are in references/validation-error-recovery.md. Do not touch Jinja
template variables like {{num_gpu}} — those are resolved at runtime via --set.
When to use: The user wants to see all their workflows or recent submissions (e.g. "what are my workflows?", "show me my recent jobs", "what's the status of my workflows?").
List all workflows:
osmo workflow list --format-type json
Summarize results in a table showing workflow name, pool, status, and duration. Group or sort by status if helpful. Use clear symbols to indicate each workflow's outcome.
When to use: The user asks about the status or logs of a workflow (e.g. "what's the status of workflow abc-123?", "is my workflow done?", "show me the logs for xyz", "show me the resource usage for my workflow", "give me the Kubernetes dashboard link"). Also used as the polling step during end-to-end orchestration.
Query the workflow:
osmo workflow query <workflow name> --format-type json
Cache the JSON for the rest of the conversation — do not re-query just to extract a field.
Fetch logs based on task count:
- 1 task: fetch the logs inline with: osmo workflow logs <workflow_id> -n 10000
- 2 or more tasks: delegate to agents/logs-reader.md subagents (one per 5 tasks). Do not fetch logs inline yourself in the main conversation.

Report to the user. State the current status, summarize logs concisely, and
include the Grafana and Kubernetes dashboard links by default for detailed
reports. Exact phrasing for link rendering, null handling, and resource-usage
triggers is in references/workflow-status-handling.md. If the status is
PENDING, follow that reference's pending-diagnosis steps (events + resource
list, translated to plain language).
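The log-fetching rule in this use case (inline for a single task, subagents otherwise) can be sketched as a small planning function. This is an illustration only: the function name and return shape are hypothetical, and it assumes "one per 5 tasks" means each subagent covers up to five tasks.

```python
import math


def plan_log_fetch(task_count: int, tasks_per_subagent: int = 5) -> dict:
    """Decide how logs should be fetched for a workflow with `task_count` tasks."""
    if task_count < 1:
        raise ValueError("a workflow is expected to have at least one task")
    if task_count == 1:
        # Single task: fetch logs inline in the main conversation.
        return {"mode": "inline", "subagents": 0}
    # Two or more tasks: delegate, rounding up so every task is covered.
    return {
        "mode": "delegate",
        "subagents": math.ceil(task_count / tasks_per_subagent),
    }


for n in (1, 2, 5, 6, 11):
    print(n, plan_log_fetch(n))
```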
For COMPLETED workflows, offer the output dataset download and proactively
suggest creating an OSMO app from the workflow. Exact prompts, name suggestion
rules, and batch-monitoring behavior are in
references/workflow-status-handling.md.
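To illustrate the null-handling rule for dashboard links, here is a minimal sketch that renders both links from a cached query result. It assumes `grafana_url` and `dashboard_url` are top-level keys of the query JSON and that `dashboard_url` is the Kubernetes dashboard link. The sample JSON and the `status` key are made up. The exact phrasing to use with users is in references/workflow-status-handling.md.

```python
import json


def render_links(workflow: dict) -> list[str]:
    """Return one line per link, never omitting a link that is null or missing."""
    labels = {
        "grafana_url": "Grafana",
        "dashboard_url": "Kubernetes dashboard",
    }
    lines = []
    for key, label in labels.items():
        url = workflow.get(key)
        # A null or absent URL is surfaced explicitly rather than dropped.
        lines.append(f"{label}: {url if url else 'not available'}")
    return lines


# Hypothetical cached result for a workflow that has not started yet.
cached = json.loads('{"status": "PENDING", "grafana_url": null, "dashboard_url": null}')
for line in render_links(cached):
    print(line)
```

Reading from the cached dictionary keeps with the rule of not re-querying just to extract a field.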
When to use: The user wants to create a workflow, submit it, and monitor it to completion (e.g. "train GR00T on my data", "submit and monitor my workflow", "run end-to-end training", "submit this and tell me when it's done").
The lifecycle is split between the workflow-expert subagent (workflow generation,
resource check, submission, failure diagnosis) and you (live monitoring so the
user sees real-time updates).
Spawn the workflow-expert subagent for setup and submission.
Ask it to write workflow YAML if needed, check resources, and submit only. Do NOT ask it to monitor, poll status, or report results — that is your job.
Example prompt:
Create a workflow based on the user's request, if any. Check resources first, then submit the workflow to an available resource pool. Return the workflow ID when done.
The subagent returns: workflow ID, pool name, and OSMO Web link.
Monitor the workflow inline (you do this — user sees live updates).
Use the "Check Workflow Status" use case to poll and report. Repeat until a terminal state is reached. Adjust the polling interval based on how long you expect the workflow to take — poll more frequently for short jobs (every 10-15s) and less frequently for long training runs (every 30-60s). Report each state transition to the user:
- Status: SCHEDULING (queued 15s)
- Workflow transitioned: SCHEDULING → RUNNING
- Status: RUNNING (task "train" active, 2m elapsed)

Handle the outcome.
If COMPLETED: Report results — workflow ID, OSMO Web link, output datasets. Then follow Step 4 of "Check Workflow Status" (download offer + app creation).
If FAILED: First, fetch logs using the log-fetching rule from "Check Workflow
Status" Step 2 (1 task = inline, 2+ tasks = delegate to logs-reader subagents).
Then resume the workflow-expert subagent (use the resume parameter with the
agent ID from Step 1) and pass the logs summary: "Workflow <id> FAILED. Here is
the logs summary: <summary>. Diagnose and fix." It returns a new workflow ID.
Resume monitoring from Step 2. Max 3 retries before asking the user for guidance.
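The monitor-and-retry lifecycle described in this use case might be sketched as follows. This is not the skill's implementation: the callables `submit`, `monitor`, and `diagnose_and_resubmit` are hypothetical stand-ins for the subagent and polling steps, the sleep between polls is left out, and only the terminal states named in this document are modeled.

```python
from typing import Callable, Iterable

TERMINAL = {"COMPLETED", "FAILED"}  # only the terminal states named in this document


def transitions(statuses: Iterable[str]) -> list[str]:
    """Turn a stream of polled statuses into the transition messages to report."""
    messages, previous = [], None
    for status in statuses:
        if previous is not None and status != previous:
            messages.append(f"Workflow transitioned: {previous} → {status}")
        previous = status
        if status in TERMINAL:
            break  # stop polling once a terminal state is reached
    return messages


def orchestrate(
    submit: Callable[[], str],
    monitor: Callable[[str], str],
    diagnose_and_resubmit: Callable[[str], str],
    max_retries: int = 3,
) -> tuple[str, str]:
    """Submit, monitor to a terminal state, and retry failures up to `max_retries`."""
    workflow_id = submit()
    retries = 0
    while True:
        status = monitor(workflow_id)  # returns the terminal status
        if status == "COMPLETED":
            return ("COMPLETED", workflow_id)
        if retries == max_retries:
            # Retry budget spent: hand control back to the user.
            return ("ASK_USER", workflow_id)
        # Each fix-and-resubmit yields a new workflow ID to monitor.
        workflow_id = diagnose_and_resubmit(workflow_id)
        retries += 1


print(transitions(["SCHEDULING", "SCHEDULING", "RUNNING", "COMPLETED"]))
```

Passing the steps in as callables keeps the retry cap testable without touching a real cluster.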
When to use: The user asks why a workflow failed, why it's stuck, or how to fix an OSMO error (e.g. "my workflow keeps failing", "what does this OSMO error mean?", "my pod won't start", "training crashed with exit 137", "image pull keeps failing"). This is the manual debugging path. If the user wants you to also fix and resubmit automatically, use "Orchestrate a Workflow End-to-End" instead.
Establish current state. Run the steps from "Check Workflow Status" to get the workflow's status, recent logs, and (for PENDING workflows) events. Cache the query JSON so you don't re-fetch.
Match the symptom. Open references/troubleshooting.md and look up the
matching pattern by symptom — exit code, error keyword, status, or behavior.
Common patterns covered there:
- PENDING for an unusually long time (scheduling block, quota exhausted)
- Exit codes: 137 (OOM kill), 139 (segfault), 143 (SIGTERM / preempted), 127 (command not found)
- ImagePullBackOff / ErrImagePull
- Init:CrashLoopBackOff (init container failure)
- Submission-time validation errors (see references/validation-error-recovery.md)

Explain the diagnosis in plain language. State the root cause without raw Kubernetes jargon. Say "the container ran out of memory and was killed" rather than "exit 137 / OOMKilled". If multiple causes are plausible from the logs, list them in order of likelihood and explain how to confirm each.
Recommend a concrete fix. Pull the fix recipe from the matched troubleshooting
pattern. If the fix involves editing workflow.yaml, show the user the exact diff
you would apply. Do not edit the YAML without confirmation unless the user
pre-authorized you to fix-and-resubmit.
Offer to apply the fix and resubmit. Ask the user whether to apply the fix
yourself. If they agree, edit workflow.yaml per the troubleshooting recipe and
submit using the steps in "Generate and Submit a Workflow".
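The exit codes named in this use case can be kept in a small lookup that produces plain-language explanations. The sketch below covers only the four codes this document mentions; the wording of each explanation and the function name are illustrative, and references/troubleshooting.md remains the authoritative catalog.

```python
# Plain-language meanings for the exit codes listed in this document.
EXIT_CODE_MEANINGS = {
    137: "the container ran out of memory and was killed",
    139: "the process crashed with a segmentation fault",
    143: "the container was asked to stop (SIGTERM), for example because it was preempted",
    127: "the command the container tried to run was not found",
}


def explain_exit_code(code: int) -> str:
    """Translate an exit code into a sentence suitable for the user."""
    meaning = EXIT_CODE_MEANINGS.get(code)
    if meaning is None:
        # Unknown codes are not guessed at; point to the full catalog instead.
        return f"Exit code {code} is not in this short table; check references/troubleshooting.md."
    return f"Exit code {code}: {meaning}."


print(explain_exit_code(137))
```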
If no pattern matches, share the status, the logs (via the logs-reader subagent for
multi-task workflows), and recent events, then ask the user how they want to
proceed. Do not invent fixes.

When to use: The user asks what a workflow does, what it's configured to run, or wants to understand its purpose (e.g. "what does workflow abc-123 do?", "explain this workflow", "what is workflow xyz running?").
Fetch the workflow template:
osmo workflow spec <workflow name> --template
This returns the original workflow spec YAML that was used to submit the job, including the container image, entrypoint scripts, environment variables, and resource requests.
Read and summarize the spec. Based on the YAML output, give the user a concise plain-language summary covering:
Keep the summary short — a few sentences or a brief bullet list. The user asked what it does, not for a line-by-line YAML walkthrough.
When to use: The user wants to publish a workflow as an OSMO app (e.g. "create an app for this workflow", "make an app from my workflow", "publish this as an app"), or you are proactively offering app creation after a workflow completes.
Determine the workflow file path. If the user already has a workflow YAML (e.g.
workflow.yaml in the current directory), use that path. If they're coming from a
completed workflow, use the spec file that was submitted.
Decide on a name and description.
If the user explicitly asked to create an app, ask them what they'd like to
name it. Suggest a name based on the workflow name (e.g. sdg-run → sdg-run-app)
so they have a sensible default to accept or override. Also generate a one-sentence
description summarizing what the workflow does, and confirm it with the user before
proceeding.
If you are proactively offering (post-completion), present your suggested name and description upfront — don't ask two separate questions. Something like:
"Would you like to create an app for this workflow? I'd suggest naming it
sdg-isaac-app with the description: 'Runs Isaac Lab SDG to generate synthetic training data.' Does that work, or would you like to change anything?"
Create the app — once the user confirms name and description, run:
osmo app create <app-name> --description "<description>" --file <path-to-workflow.yaml>
Execute this yourself — do not ask the user to run it.
Report the result — confirm the app was created and share any URL or identifier returned by the CLI.
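The naming suggestion and the create command above can be sketched together. The helper names and the sample description are hypothetical, and the suffix rule simply follows the sdg-run → sdg-run-app example. The code only assembles the command and does not run it.

```python
import shlex


def suggest_app_name(workflow_name: str) -> str:
    # Default suggestion: the workflow name with an "-app" suffix.
    return f"{workflow_name}-app"


def build_app_create_argv(app_name: str, description: str, workflow_path: str) -> list[str]:
    """Assemble the argument list for `osmo app create`."""
    return [
        "osmo", "app", "create", app_name,
        "--description", description,
        "--file", workflow_path,
    ]


argv = build_app_create_argv(suggest_app_name("sdg-run"), "Runs SDG.", "workflow.yaml")
# shlex.join quotes the description so the printed command is copy-paste safe.
print(shlex.join(argv))
```

Building the command as a list avoids quoting mistakes when the description contains spaces or punctuation.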
When the user reports a failed, stuck, or misbehaving workflow, follow "Use Case:
Debug a Failed or Stuck Workflow" above. The detailed catalog of failure
signatures, diagnoses, and fixes — including exit-code lookups (137/139/143/127),
image pull errors, init container failures, NCCL timeouts, missing-output
patterns, and PENDING capacity vs quota distinction — is in
references/troubleshooting.md. For submission-time validation errors, the
resource-sizing recipe is in references/validation-error-recovery.md.