Add Benchmark to NeMo-Gym

Determine Integration Type

Before starting, determine which type of benchmark you're adding:

Native benchmark — verification logic implemented directly in a Gym resources server:

Resources server implements verify() with reward logic
Agent server orchestrates model calls (use simple_agent for single-turn, or custom agent for multi-turn)
Example: code_gen, instruction_following, math_with_judge

External benchmark — wrapping a 3rd-party library that has its own orchestration:

Integrate at the agent server level (not resources server)
Agent's /run endpoint wraps the external library
Pre-process from Gym schema to library input, post-process back to BaseVerifyResponse
Reproduce publicly reported numbers with the original repo first, then reproduce again after Gym integration
Add the dependency in requirements.txt

Workflow

Step 1: Scaffold the server

Run ng_init_resources_server to generate the directory structure:

ng_init_resources_server +entrypoint=resources_servers/my_benchmark

This creates:

resources_servers/my_benchmark/
├── app.py              # Server template
├── configs/my_benchmark.yaml
├── data/.gitignore
├── tests/test_app.py
├── requirements.txt
└── README.md

For external benchmarks, create the agent server manually under responses_api_agents/my_agent/ with the same structure.

Step 2: Prepare data

Convert your source dataset to Gym JSONL format. Each line must have responses_create_params.input (OpenAI message format). Task-specific verification data goes in verifier_metadata.

{
  "responses_create_params": {
    "input": [
      {"role": "system", "content": "System prompt"},
      {"role": "user", "content": "Problem statement"}
    ]
  },
  "verifier_metadata": {
    "test_cases": [{"input": "...", "expected_output": "..."}],
    "task_id": "unique_id"
  }
}

Data conversion: Write conversion scripts in the source repo (e.g. your dataset repository), not in NeMo-Gym. Prompt files also belong in the source repo. Exception: when there is no external source repo. See references/patterns.md § "Data Conversion Script Pattern".

example.jsonl: Generate 5 entries for smoke testing. This file is committed directly to git in data/example.jsonl.

train/validation datasets: Upload to the GitLab dataset registry — these must NOT be committed to git.

ng_upload_dataset_to_gitlab \
    +dataset_name=my_benchmark \
    +version=0.0.1 \
    +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl

Requires MLflow credentials in env.yaml (or passed via CLI):

mlflow_tracking_uri: <your-gitlab-mlflow-tracking-uri>
mlflow_tracking_token: <your-gitlab-api-token>

data/.gitignore: The scaffold generates default patterns (*train.jsonl, *validation.jsonl, etc.). If your filename doesn't match (e.g. my_eval.jsonl), add a custom pattern (e.g. *eval.jsonl). If data was previously tracked, run git rm --cached <file>.

Validate your data:

# Validate example data (for PR submission)
ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]" \
    +output_dirpath=/tmp/prepare +mode=example_validation

# Download and prepare train/validation from GitLab
ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]" \
    +output_dirpath=data/my_benchmark +mode=train_preparation +should_download=true +data_source=gitlab

Step 3: Implement verify()

Edit app.py. The verify() method receives model output + verifier_metadata, returns reward.

For code execution benchmarks, see references/patterns.md § "Subprocess Execution with Ray" and "Resources Server Pattern".

Critical rules:

Return reward as 0.0 or 1.0 (binary)
Handle empty/missing model output gracefully — return 0.0, don't crash
Must handle 4k-65k concurrent requests without crashing
Use asyncio.Semaphore for subprocess concurrency control
For Ray remote tasks: result = await future (Ray futures are directly awaitable). Never call ray.get() in async context.
Decode subprocess output with errors="replace"
Strip <think>/<thinking> blocks before parsing model output (thinking models emit these)
Tests should pytest.mark.skipif when external tools aren't installed
If the benchmark auto-installs its tool (see Step 3b), add a pytest_configure hook in conftest.py to run the install before test collection — skipif evaluates at import time, before fixtures run

Step 3b: Auto-install external tools (if applicable)

If the benchmark requires an external tool (compiler, runtime, etc.), auto-install it on server startup so users don't need manual setup. See references/patterns.md § "External Tool Auto-Install Pattern".

Key points:

Create setup_<tool>.py with ensure_<tool>() — checks PATH, forks on sys.platform (brew on macOS, build from source on Linux)
Call it in model_post_init() before semaphore init
Build scripts should be idempotent and install into a local gitignored prefix
Add a pytest_configure hook in tests/conftest.py that calls ensure_<tool>() before collection

Step 4: Wire YAML config

Edit configs/my_benchmark.yaml. Define the resources server instance and agent pairing(s). See references/patterns.md § "YAML Config Pattern".

Key points:

verified: false is auto-added by pre-commit hook (set to true after baselining)
license is required for train and validation datasets
Agent references resources server and model server by instance name

For multi-turn benchmarks, either use proof_refinement_agent or create a custom agent. See references/patterns.md § "Agent Patterns".

For train/validation datasets, add gitlab_identifier alongside jsonl_fpath:

datasets:
- name: my_dataset
  type: train
  jsonl_fpath: resources_servers/my_benchmark/data/my_dataset.jsonl
  gitlab_identifier:
    dataset_name: my_benchmark
    version: 0.0.1
    artifact_fpath: my_dataset.jsonl
  license: MIT
- name: example
  type: example
  jsonl_fpath: resources_servers/my_benchmark/data/example.jsonl

Both fields must coexist: jsonl_fpath is the local download destination, gitlab_identifier tells the system where to fetch from. example datasets don't need gitlab_identifier — they're committed to git directly.

Step 5: Test

# Run server tests (creates isolated .venv, slow on first run)
ng_test +entrypoint=resources_servers/my_benchmark

# Run core library tests to check nothing broke
pytest tests/unit_tests/ -x

Test coverage must be >= 95%. Write tests for: verify pass, verify fail (wrong output), verify fail (no code extracted), verify fail (compilation error if applicable), verify timeout.

Step 6: Smoke test end-to-end

# Start servers
ng_run "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml,responses_api_models/openai_model/configs/openai_model.yaml]"

# Quick test with example data
ng_collect_rollouts +agent_name=my_benchmark_simple_agent \
  +input_jsonl_fpath=resources_servers/my_benchmark/data/example.jsonl \
  +output_jsonl_fpath=results/example_rollouts.jsonl \
  +num_repeats=1 \
  "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"

# Inspect results

Step 7: Baseline (reward profiling)

Run against multiple models to validate correctness. Recommended suite:

Your policy model of interest
At least one open-source instruct model (e.g. Qwen 3 30B A3B Instruct)
At least one open-source thinking model (e.g. Qwen 3 30B A3B Thinking)
At least one closed-source model (e.g. GPT-5 Nano or GPT-5)

# Collect rollouts
ng_collect_rollouts +agent_name=my_benchmark_simple_agent \
  +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl \
  +output_jsonl_fpath=results/rollouts.jsonl \
  +num_repeats=5 \
  "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"

# Compute per-task pass rates
ng_reward_profile +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl \
  +rollouts_jsonl_fpath=results/rollouts.jsonl \
  +output_jsonl_fpath=results/profiled.jsonl \
  +pass_threshold=1.0

# Aggregate metrics (pass@1 = avg_reward, pass@k from max_reward)
python scripts/print_aggregate_results.py +jsonl_fpath=results/profiled.jsonl

Increase num_repeats until variance < 1% across runs on the same model.

Closed-source models should score at or above open-source models. If not, investigate for bugs. Inspect actual failure cases in the rollout JSONL, not just aggregate numbers.

For external benchmarks: reproduce the original repo's published numbers first. Then reproduce after Gym integration. Scores should match.

Step 8: Pre-commit and PR

pre-commit run --all-files

First run may fail as hooks auto-modify files (verified: false flag, README table). Stage changes and run again.

Set verified: true in YAML config after successful baselining. Include W&B links and screenshots of results in the PR description.

To avoid committing unrelated auto-fixes from other servers, scope pre-commit to your files:

pre-commit run --files resources_servers/my_benchmark/**/*

If hooks modify files in other directories, discard those changes:

git checkout -- resources_servers/other_server/

Constraints

Use NeMo Gym's OpenAI client (nemo_gym/openai_utils.py), not LiteLLM/Anthropic/other
Use aiohttp, not httpx, for async HTTP. All async HTTP calls must go through nemo_gym.server_utils.request() (aiohttp). httpx has O(n^2) connection pooling that hangs at high concurrency. When wrapping external libraries that use httpx internally, replace their HTTP transport with an aiohttp adapter — see resources_servers/tavily_search/app.py (TavilySearchAIOHTTPClient) for the pattern and docs/infrastructure/engineering-notes/aiohttp-vs-httpx.md for the rationale.
Pass configuration through Gym config (YAML), not environment variables
Code must run on Linux
/run endpoint must be async
Errors from tool execution or bad model output must return error responses, not crash
All commits require DCO sign-off (-s) and cryptographic signature (-S)

Reference

For detailed code patterns, schemas, and examples: see references/patterns.md.

Add Benchmark to NeMo-Gym

Determine Integration Type

Before starting, determine which type of benchmark you're adding:

Native benchmark — verification logic implemented directly in a Gym resources server:

Resources server implements verify() with reward logic
Agent server orchestrates model calls (use simple_agent for single-turn, or custom agent for multi-turn)
Example: code_gen, instruction_following, math_with_judge

External benchmark — wrapping a 3rd-party library that has its own orchestration:

Integrate at the agent server level (not resources server)
Agent's /run endpoint wraps the external library
Pre-process from Gym schema to library input, post-process back to BaseVerifyResponse
Reproduce publicly reported numbers with the original repo first, then reproduce again after Gym integration
Add the dependency in requirements.txt

Workflow

Step 1: Scaffold the server

Run ng_init_resources_server to generate the directory structure:

ng_init_resources_server +entrypoint=resources_servers/my_benchmark

This creates:

resources_servers/my_benchmark/
├── app.py              # Server template
├── configs/my_benchmark.yaml
├── data/.gitignore
├── tests/test_app.py
├── requirements.txt
└── README.md

For external benchmarks, create the agent server manually under responses_api_agents/my_agent/ with the same structure.

Step 2: Prepare data

Convert your source dataset to Gym JSONL format. Each line must have responses_create_params.input (OpenAI message format). Task-specific verification data goes in verifier_metadata.

{
  "responses_create_params": {
    "input": [
      {"role": "system", "content": "System prompt"},
      {"role": "user", "content": "Problem statement"}
    ]
  },
  "verifier_metadata": {
    "test_cases": [{"input": "...", "expected_output": "..."}],
    "task_id": "unique_id"
  }
}

example.jsonl: Generate 5 entries for smoke testing. This file is committed directly to git in data/example.jsonl.

train/validation datasets: Upload to the GitLab dataset registry — these must NOT be committed to git.

ng_upload_dataset_to_gitlab \
    +dataset_name=my_benchmark \
    +version=0.0.1 \
    +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl

Requires MLflow credentials in env.yaml (or passed via CLI):

mlflow_tracking_uri: <your-gitlab-mlflow-tracking-uri>
mlflow_tracking_token: <your-gitlab-api-token>

Validate your data:

# Validate example data (for PR submission)
ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]" \
    +output_dirpath=/tmp/prepare +mode=example_validation

# Download and prepare train/validation from GitLab
ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]" \
    +output_dirpath=data/my_benchmark +mode=train_preparation +should_download=true +data_source=gitlab

Step 3: Implement verify()

Edit app.py. The verify() method receives model output + verifier_metadata, returns reward.

For code execution benchmarks, see references/patterns.md § "Subprocess Execution with Ray" and "Resources Server Pattern".

Critical rules:

Return reward as 0.0 or 1.0 (binary)
Handle empty/missing model output gracefully — return 0.0, don't crash
Must handle 4k-65k concurrent requests without crashing
Use asyncio.Semaphore for subprocess concurrency control
For Ray remote tasks: result = await future (Ray futures are directly awaitable). Never call ray.get() in async context.
Decode subprocess output with errors="replace"
Strip <think>/<thinking> blocks before parsing model output (thinking models emit these)
Tests should pytest.mark.skipif when external tools aren't installed
If the benchmark auto-installs its tool (see Step 3b), add a pytest_configure hook in conftest.py to run the install before test collection — skipif evaluates at import time, before fixtures run

Step 3b: Auto-install external tools (if applicable)

Key points:

Create setup_<tool>.py with ensure_<tool>() — checks PATH, forks on sys.platform (brew on macOS, build from source on Linux)
Call it in model_post_init() before semaphore init
Build scripts should be idempotent and install into a local gitignored prefix
Add a pytest_configure hook in tests/conftest.py that calls ensure_<tool>() before collection

Step 4: Wire YAML config

Edit configs/my_benchmark.yaml. Define the resources server instance and agent pairing(s). See references/patterns.md § "YAML Config Pattern".

Key points:

verified: false is auto-added by pre-commit hook (set to true after baselining)
license is required for train and validation datasets
Agent references resources server and model server by instance name

For multi-turn benchmarks, either use proof_refinement_agent or create a custom agent. See references/patterns.md § "Agent Patterns".

For train/validation datasets, add gitlab_identifier alongside jsonl_fpath:

datasets:
- name: my_dataset
  type: train
  jsonl_fpath: resources_servers/my_benchmark/data/my_dataset.jsonl
  gitlab_identifier:
    dataset_name: my_benchmark
    version: 0.0.1
    artifact_fpath: my_dataset.jsonl
  license: MIT
- name: example
  type: example
  jsonl_fpath: resources_servers/my_benchmark/data/example.jsonl

Step 5: Test

# Run server tests (creates isolated .venv, slow on first run)
ng_test +entrypoint=resources_servers/my_benchmark

# Run core library tests to check nothing broke
pytest tests/unit_tests/ -x

Test coverage must be >= 95%. Write tests for: verify pass, verify fail (wrong output), verify fail (no code extracted), verify fail (compilation error if applicable), verify timeout.

Step 6: Smoke test end-to-end

# Start servers
ng_run "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml,responses_api_models/openai_model/configs/openai_model.yaml]"

# Quick test with example data
ng_collect_rollouts +agent_name=my_benchmark_simple_agent \
  +input_jsonl_fpath=resources_servers/my_benchmark/data/example.jsonl \
  +output_jsonl_fpath=results/example_rollouts.jsonl \
  +num_repeats=1 \
  "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"

# Inspect results

Step 7: Baseline (reward profiling)

Run against multiple models to validate correctness. Recommended suite:

Your policy model of interest
At least one open-source instruct model (e.g. Qwen 3 30B A3B Instruct)
At least one open-source thinking model (e.g. Qwen 3 30B A3B Thinking)
At least one closed-source model (e.g. GPT-5 Nano or GPT-5)

# Collect rollouts
ng_collect_rollouts +agent_name=my_benchmark_simple_agent \
  +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl \
  +output_jsonl_fpath=results/rollouts.jsonl \
  +num_repeats=5 \
  "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"

# Compute per-task pass rates
ng_reward_profile +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl \
  +rollouts_jsonl_fpath=results/rollouts.jsonl \
  +output_jsonl_fpath=results/profiled.jsonl \
  +pass_threshold=1.0

# Aggregate metrics (pass@1 = avg_reward, pass@k from max_reward)
python scripts/print_aggregate_results.py +jsonl_fpath=results/profiled.jsonl

Increase num_repeats until variance < 1% across runs on the same model.

Closed-source models should score at or above open-source models. If not, investigate for bugs. Inspect actual failure cases in the rollout JSONL, not just aggregate numbers.

For external benchmarks: reproduce the original repo's published numbers first. Then reproduce after Gym integration. Scores should match.

Step 8: Pre-commit and PR

pre-commit run --all-files

First run may fail as hooks auto-modify files (verified: false flag, README table). Stage changes and run again.

Set verified: true in YAML config after successful baselining. Include W&B links and screenshots of results in the PR description.

To avoid committing unrelated auto-fixes from other servers, scope pre-commit to your files:

pre-commit run --files resources_servers/my_benchmark/**/*

If hooks modify files in other directories, discard those changes:

git checkout -- resources_servers/other_server/

Constraints

Use NeMo Gym's OpenAI client (nemo_gym/openai_utils.py), not LiteLLM/Anthropic/other
Use aiohttp, not httpx, for async HTTP. All async HTTP calls must go through nemo_gym.server_utils.request() (aiohttp). httpx has O(n^2) connection pooling that hangs at high concurrency. When wrapping external libraries that use httpx internally, replace their HTTP transport with an aiohttp adapter — see resources_servers/tavily_search/app.py (TavilySearchAIOHTTPClient) for the pattern and docs/infrastructure/engineering-notes/aiohttp-vs-httpx.md for the rationale.
Pass configuration through Gym config (YAML), not environment variables
Code must run on Linux
/run endpoint must be async
Errors from tool execution or bad model output must return error responses, not crash
All commits require DCO sign-off (-s) and cryptographic signature (-S)

Reference

For detailed code patterns, schemas, and examples: see references/patterns.md.

Adoption

nvidia-nemo/add-benchmark

$ install --global

Security Scan Results

SKILL.md

Add Benchmark to NeMo-Gym

Determine Integration Type

Workflow

Step 1: Scaffold the server

Step 2: Prepare data

Step 3: Implement verify()

Step 3b: Auto-install external tools (if applicable)

Step 4: Wire YAML config

Step 5: Test

Step 6: Smoke test end-to-end

Step 7: Baseline (reward profiling)

Step 8: Pre-commit and PR

Constraints

Reference

Related Skills

openclaw/openclaw-secret-scanning-maintainer

openclaw/openclaw-release-maintainer

openclaw/openclaw-qa-testing

openclaw/openclaw-parallels-smoke

nvidia-nemo/add-benchmark

$ install --global

Security Scan Results

SKILL.md

Add Benchmark to NeMo-Gym

Determine Integration Type

Workflow

Step 1: Scaffold the server

Step 2: Prepare data

Step 3: Implement verify()

Step 3b: Auto-install external tools (if applicable)

Step 4: Wire YAML config

Step 5: Test

Step 6: Smoke test end-to-end

Step 7: Baseline (reward profiling)

Step 8: Pre-commit and PR

Constraints

Reference

Related Skills

openclaw/openclaw-secret-scanning-maintainer

openclaw/openclaw-release-maintainer

openclaw/openclaw-qa-testing

openclaw/openclaw-parallels-smoke