nvidia-nemo/launching-evals

Name: launching-evals
Author: nvidia-nemo

packages/nemo-evaluator-launcher/.claude/skills/launching-evals/SKILL.md

npx skillsauth add nvidia-nemo/evaluator launching-evals

NeMo Evaluator Skill

Quick Reference

nemo-evaluator-launcher CLI

# Run evaluation
uv run nemo-evaluator-launcher run --config <path.yaml>
uv run nemo-evaluator-launcher run --config <path.yaml> -t <a_single_task_to_be_run_by_name>
uv run nemo-evaluator-launcher run --config <path.yaml> -t <task_name_1> -t <task_name_2> ...
uv run nemo-evaluator-launcher run --config <path.yaml> -o evaluation.nemo_evaluator_config.config.params.limit_samples=10 ...

# Preview the resolved config and the sbatch script without running the evaluation
uv run nemo-evaluator-launcher run --config <path.yaml> --dry-run

# Check status (--json for machine-readable output)
uv run nemo-evaluator-launcher status <invocation_id> --json

# Get evaluation run info (output paths, slurm job IDs, cluster hostname, etc.)
uv run nemo-evaluator-launcher info <invocation_id>

# Copy just the logs (quick — good for debugging)
uv run nemo-evaluator-launcher info <invocation_id> --copy-logs ./evaluation-results/

# For artifacts: use `nel info` to discover paths. If remote, SSH to explore and rsync what you need.
# If local, just read directly from the paths shown by `nel info`.
# ssh <user>@<hostname> "ls <artifacts_path>/"
# rsync -avzP <user>@<hostname>:<artifacts_path>/{results.yml,eval_factory_metrics.json,config.yml} ./evaluation-results/<invocation_id>.<job_index>/artifacts/

# List past runs
uv run nemo-evaluator-launcher ls runs --since 1d   

# List available evaluation tasks (by default, only shows tasks from the latest released containers)
uv run nemo-evaluator-launcher ls tasks
uv run nemo-evaluator-launcher ls tasks --from_container gitlab-master.nvidia.com/dl/joc/competitive_evaluation/nvidia-core-evals/ci-llm/long-context-eval:dev-2025-12-16T14-37-1693de28-amd64
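When the run command is assembled programmatically, the flags shown above compose as sketched here. This is a minimal illustration, not part of the launcher: build_run_command is a hypothetical helper, eval.yaml and the task names are placeholders, and repeating -o once per override (the way -t is repeated) is an assumption.

```python
from typing import Optional, Sequence


def build_run_command(
    config_path: str,
    tasks: Sequence[str] = (),
    overrides: Optional[dict] = None,
    dry_run: bool = False,
) -> list:
    """Assemble argv for `nemo-evaluator-launcher run` (hypothetical helper)."""
    cmd = ["uv", "run", "nemo-evaluator-launcher", "run", "--config", config_path]
    # One -t per task name, as in the quick reference.
    for task in tasks:
        cmd += ["-t", task]
    # Assumption: -o may be repeated, one key=value override per flag.
    for key, value in (overrides or {}).items():
        cmd += ["-o", f"{key}={value}"]
    # --dry-run previews the resolved config and sbatch script without running.
    if dry_run:
        cmd.append("--dry-run")
    return cmd


cmd = build_run_command(
    "eval.yaml",  # placeholder config path
    tasks=["task_a", "task_b"],  # placeholder task names
    overrides={"evaluation.nemo_evaluator_config.config.params.limit_samples": 10},
    dry_run=True,
)
print(" ".join(cmd))
```

Passing the resulting list to subprocess.run avoids shell quoting issues; starting with dry_run=True is a cheap way to check the resolved config before submitting anything.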

Workflow

The complete evaluation workflow is divided into the following steps you should follow IN ORDER.

1. Create or modify a config using the nel-assistant skill. If the user provides a past run, use its config.yml artifact as a starting point.
2. Run the evaluation. See references/run-evaluation.md when executing this step.
3. Check progress (while RUNNING). See references/check-progress.md when executing this step.
4. Post-run actions (when a terminal state is reached):
   1. When the evaluation status is SUCCESS, analyze the results. See references/analyze-results.md when executing this step.
   2. When the evaluation status is FAILED, debug the failed run. See references/debug-failed-runs.md when executing this step.
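The status-dependent part of the workflow can be sketched as a small lookup. The status names and reference paths come from the steps above; next_reference is a hypothetical helper, and any status not named in this document is deliberately left unmapped rather than guessed.

```python
from typing import Optional

# Status -> reference file to read next, as listed in the workflow.
NEXT_REFERENCE = {
    "RUNNING": "references/check-progress.md",
    "SUCCESS": "references/analyze-results.md",
    "FAILED": "references/debug-failed-runs.md",
}


def next_reference(status: str) -> Optional[str]:
    """Return the reference file for a status, or None if it is not covered."""
    return NEXT_REFERENCE.get(status.upper())


print(next_reference("FAILED"))
```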

Key Facts

Benchmark-specific info learned during launching/analyzing evals should be added to references/benchmarks/
PPP = Slurm account (the account field in cluster_config.yaml). When the user says "change PPP to X", update the account value (e.g., coreai_dlalgo_compeval → coreai_dlalgo_llm).
Slurm job pairs: NEL (nemo-evaluator-launcher) submits paired Slurm jobs — a RUNNING job + a PENDING restart job (for when the 4h walltime expires). Never cancel the pending restart jobs — they are expected and necessary.
HF cache requirement: For configs with HF_HUB_OFFLINE=1, models must be pre-downloaded to the HF cache on each cluster before launching. Before running a model on a new cluster, always ask the user if the model is already cached there. If not, run on the cluster login node:
    python3 -m venv hf_cli && source hf_cli/bin/activate && pip install huggingface_hub
    HF_HOME=/lustre/fsw/portfolios/coreai/users/<username>/cache/huggingface hf download <model>
Without this, vLLM will fail with LocalEntryNotFoundError.
data_parallel_size is per node: dp_size=1 with num_nodes=8 means 8 model instances total (one per node), load-balanced by haproxy. Do NOT interpret dp_size as the global replica count.
payload_modifier interceptor: The params_to_remove list (e.g. [max_tokens, max_completion_tokens]) strips those fields from the outgoing payload, intentionally lifting output length limits so reasoning models can think as long as they need.
Auto-export git workaround: The export container (python:3.12-slim) lacks git. When installing the launcher from a git URL, set auto_export.launcher_install_cmd to install git first (e.g., apt-get update -qq && apt-get install -qq -y git && pip install "nemo-evaluator-launcher[all] @ git+...#subdirectory=packages/nemo-evaluator-launcher").
Do NOT use nemo-evaluator-launcher export --dest local: it only writes a summary JSON (processed_results.json) and does NOT copy actual logs or artifacts, despite accepting the --copy_logs and --copy-artifacts flags. nel info --copy-artifacts works but copies everything (very slow for large benchmarks). Preferred approach: use nel info to discover paths; if local, read directly; if remote, SSH to explore and rsync only what you need. Note that nel info prints only the standard artifacts, while benchmarks produce additional artifacts in subdirectories, so explore to find them.
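Two of the facts above reduce to simple operations, sketched here for illustration. Both functions are hypothetical and are not the launcher's code: the first assumes the per-node reading of data_parallel_size extends multiplicatively beyond the dp_size=1 example, and the second assumes params_to_remove applies to top-level payload fields only.

```python
def total_model_instances(data_parallel_size: int, num_nodes: int) -> int:
    """data_parallel_size counts replicas per node, so the total scales with nodes."""
    return data_parallel_size * num_nodes


def strip_params(payload: dict, params_to_remove: list) -> dict:
    """Return a copy of the payload without the listed top-level fields."""
    return {key: value for key, value in payload.items() if key not in params_to_remove}


# dp_size=1 with num_nodes=8 gives 8 instances, one per node.
print(total_model_instances(1, 8))

# Placeholder request body; only the length-limit fields are removed.
request = {"model": "<model>", "messages": [], "max_tokens": 512}
print(strip_params(request, ["max_tokens", "max_completion_tokens"]))
```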

Run, monitor, analyze, and debug LLM evaluations via nemo-evaluator-launcher. Covers running evaluations, checking status and live progress, debugging failed runs, exporting artifacts and logs, and analyzing results. ALWAYS triggers on mentions of running evaluations, checking progress, debugging failed evals, analyzing or analysing runs or results, run directories or artifact paths on clusters, Slurm job issues, invocation IDs, or inspecting logs (client logs, server logs, SSH to cluster, tail logs, grep logs). Do NOT use for creating or modifying evaluation configs.

255 stars

tools

Updated Apr 15, 2026

Install (global)

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add nvidia-nemo/evaluator launching-evals

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanner          Status   %    Description
Trivy            Clean    95%  Container and dependency vulnerability scanner
Semgrep          Clean    95%  Static code analysis for vulnerabilities
mcp-scan (Snyk)  Clean    95%  Model Context Protocol security validation
Snyk (dep)       Skipped  50%  Open source security scanning
Socket.dev       Skipped  50%  Supply chain security analysis
VirusTotal       Skipped  50%  Multi-engine malware detection
CrowdStrike      Skipped  50%  Advanced threat intelligence
OSV-Scanner      Skipped  50%  Open Source Vulnerability database check
OWASP Dep-Check  Skipped  50%

Last scanned: Apr 15, 2026, 11:28 AM (13.1s, 9 files scanned)

Related Skills

nvidia-nemo/byob (development)
Create custom LLM evaluation benchmarks using the BYOB decorator framework. Use when the user wants to (1) create a new benchmark from a dataset, (2) pick or write a scorer, (3) compile and run a BYOB benchmark, (4) containerize a benchmark, or (5) use LLM-as-Judge evaluation. Triggers on mentions of BYOB, custom benchmark, bring your own benchmark, scorer, or benchmark compilation.
255 | SKILL.md | Updated Apr 15, 2026

nvidia-nemo/nel-assistant (development)
Interactive config wizard for NeMo Evaluator Launcher (NEL). Use when the user wants to create a new evaluation config from scratch, set up an evaluation from existing configs, or modify a NEL config (deployment, tasks, multi-node, interceptors). ALWAYS triggers on mentions of creating configs, setting up evaluations, configuring models for evaluation, or modifying NEL YAML files. Do NOT use for monitoring, debugging, or analyzing already-running evaluations.
255 | SKILL.md | Updated Apr 15, 2026

nvidia-nemo/accessing-mlflow (tools)
Query and browse evaluation results stored in MLflow. Use when the user wants to look up runs by invocation ID, compare metrics across models, fetch artifacts (configs, logs, results), or set up the MLflow MCP server. ALWAYS triggers on mentions of MLflow, experiment results, run comparison, invocation IDs in the context of results, or MLflow MCP setup.
255 | SKILL.md | Updated Apr 15, 2026

openclaw/taskflow (tools)
Use when work should span one or more detached tasks but still behave like one job with a single owner context. TaskFlow is the durable flow substrate under authoring layers like Lobster, ACPX, plugins, or plain code. Keep conditional logic in the caller; use TaskFlow for flow identity, child-task linkage, waiting state, revision-checked mutations, and user-facing emergence.
357,764 | SKILL.md | Updated Apr 10, 2026

Install manually

# Clone the repo
git clone https://github.com/nvidia-nemo/evaluator.git

# Copy into Claude Code skills folder (global); create the folder first if it does not exist
mkdir -p ~/.claude/skills
cp -r evaluator/packages/nemo-evaluator-launcher/.claude/skills/launching-evals ~/.claude/skills/

Repository: nvidia-nemo/evaluator (255 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT