.claude/skills/add-benchmark/SKILL.md
Guide for adding a new benchmark or training environment to NeMo-Gym. Use when the user asks to add, create, or integrate a benchmark, evaluation, training environment, or resources server into NeMo-Gym. Also use when wrapping an existing 3rd-party benchmark library. Covers the full workflow: data preparation, resources server implementation, agent wiring, YAML config, testing, and reward profiling (baselining). Triggered by: "add benchmark", "new resources server", "integrate benchmark", "wrap benchmark", "add training environment", "add eval".
npx skillsauth add nvidia-nemo/gym add-benchmarkInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Before starting, determine which type of benchmark you're adding:
Native benchmark — verification logic implemented directly in a Gym resources server:
verify() with reward logicsimple_agent for single-turn, or custom agent for multi-turn)code_gen, instruction_following, math_with_judgeExternal benchmark — wrapping a 3rd-party library that has its own orchestration:
/run endpoint wraps the external libraryBaseVerifyResponserequirements.txtRun ng_init_resources_server to generate the directory structure:
ng_init_resources_server +entrypoint=resources_servers/my_benchmark
This creates:
resources_servers/my_benchmark/
├── app.py # Server template
├── configs/my_benchmark.yaml
├── data/.gitignore
├── tests/test_app.py
├── requirements.txt
└── README.md
For external benchmarks, create the agent server manually under responses_api_agents/my_agent/ with the same structure.
Convert your source dataset to Gym JSONL format. Each line must have responses_create_params.input (OpenAI message format). Task-specific verification data goes in verifier_metadata.
{
"responses_create_params": {
"input": [
{"role": "system", "content": "System prompt"},
{"role": "user", "content": "Problem statement"}
]
},
"verifier_metadata": {
"test_cases": [{"input": "...", "expected_output": "..."}],
"task_id": "unique_id"
}
}
Data conversion: Write conversion scripts in the source repo (e.g. your dataset repository), not in NeMo-Gym. Prompt files also belong in the source repo. Exception: when there is no external source repo. See references/patterns.md § "Data Conversion Script Pattern".
example.jsonl: Generate 5 entries for smoke testing. This file is committed directly to git in data/example.jsonl.
train/validation datasets: Upload to the GitLab dataset registry — these must NOT be committed to git.
ng_upload_dataset_to_gitlab \
+dataset_name=my_benchmark \
+version=0.0.1 \
+input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl
Requires MLflow credentials in env.yaml (or passed via CLI):
mlflow_tracking_uri: <your-gitlab-mlflow-tracking-uri>
mlflow_tracking_token: <your-gitlab-api-token>
data/.gitignore: The scaffold generates default patterns (*train.jsonl, *validation.jsonl, etc.). If your filename doesn't match (e.g. my_eval.jsonl), add a custom pattern (e.g. *eval.jsonl). If data was previously tracked, run git rm --cached <file>.
Validate your data:
# Validate example data (for PR submission)
ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]" \
+output_dirpath=/tmp/prepare +mode=example_validation
# Download and prepare train/validation from GitLab
ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]" \
+output_dirpath=data/my_benchmark +mode=train_preparation +should_download=true +data_source=gitlab
Edit app.py. The verify() method receives model output + verifier_metadata, returns reward.
For code execution benchmarks, see references/patterns.md § "Subprocess Execution with Ray" and "Resources Server Pattern".
Critical rules:
reward as 0.0 or 1.0 (binary)asyncio.Semaphore for subprocess concurrency controlresult = await future (Ray futures are directly awaitable). Never call ray.get() in async context.errors="replace"<think>/<thinking> blocks before parsing model output (thinking models emit these)pytest.mark.skipif when external tools aren't installedpytest_configure hook in conftest.py to run the install before test collection — skipif evaluates at import time, before fixtures runIf the benchmark requires an external tool (compiler, runtime, etc.), auto-install it on server startup so users don't need manual setup. See references/patterns.md § "External Tool Auto-Install Pattern".
Key points:
setup_<tool>.py with ensure_<tool>() — checks PATH, forks on sys.platform (brew on macOS, build from source on Linux)model_post_init() before semaphore initpytest_configure hook in tests/conftest.py that calls ensure_<tool>() before collectionEdit configs/my_benchmark.yaml. Define the resources server instance and agent pairing(s). See references/patterns.md § "YAML Config Pattern".
Key points:
verified: false is auto-added by pre-commit hook (set to true after baselining)license is required for train and validation datasetsFor multi-turn benchmarks, either use proof_refinement_agent or create a custom agent. See references/patterns.md § "Agent Patterns".
For train/validation datasets, add gitlab_identifier alongside jsonl_fpath:
datasets:
- name: my_dataset
type: train
jsonl_fpath: resources_servers/my_benchmark/data/my_dataset.jsonl
gitlab_identifier:
dataset_name: my_benchmark
version: 0.0.1
artifact_fpath: my_dataset.jsonl
license: MIT
- name: example
type: example
jsonl_fpath: resources_servers/my_benchmark/data/example.jsonl
Both fields must coexist: jsonl_fpath is the local download destination, gitlab_identifier tells the system where to fetch from. example datasets don't need gitlab_identifier — they're committed to git directly.
# Run server tests (creates isolated .venv, slow on first run)
ng_test +entrypoint=resources_servers/my_benchmark
# Run core library tests to check nothing broke
pytest tests/unit_tests/ -x
Test coverage must be >= 95%. Write tests for: verify pass, verify fail (wrong output), verify fail (no code extracted), verify fail (compilation error if applicable), verify timeout.
# Start servers
ng_run "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml,responses_api_models/openai_model/configs/openai_model.yaml]"
# Quick test with example data
ng_collect_rollouts +agent_name=my_benchmark_simple_agent \
+input_jsonl_fpath=resources_servers/my_benchmark/data/example.jsonl \
+output_jsonl_fpath=results/example_rollouts.jsonl \
+num_repeats=1 \
"+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
# Inspect results
Run against multiple models to validate correctness. Recommended suite:
# Collect rollouts
ng_collect_rollouts +agent_name=my_benchmark_simple_agent \
+input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl \
+output_jsonl_fpath=results/rollouts.jsonl \
+num_repeats=5 \
"+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
# Compute per-task pass rates
ng_reward_profile +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl \
+rollouts_jsonl_fpath=results/rollouts.jsonl \
+output_jsonl_fpath=results/profiled.jsonl \
+pass_threshold=1.0
# Aggregate metrics (pass@1 = avg_reward, pass@k from max_reward)
python scripts/print_aggregate_results.py +jsonl_fpath=results/profiled.jsonl
Increase num_repeats until variance < 1% across runs on the same model.
Closed-source models should score at or above open-source models. If not, investigate for bugs. Inspect actual failure cases in the rollout JSONL, not just aggregate numbers.
For external benchmarks: reproduce the original repo's published numbers first. Then reproduce after Gym integration. Scores should match.
pre-commit run --all-files
First run may fail as hooks auto-modify files (verified: false flag, README table). Stage changes and run again.
Set verified: true in YAML config after successful baselining. Include W&B links and screenshots of results in the PR description.
To avoid committing unrelated auto-fixes from other servers, scope pre-commit to your files:
pre-commit run --files resources_servers/my_benchmark/**/*
If hooks modify files in other directories, discard those changes:
git checkout -- resources_servers/other_server/
nemo_gym/openai_utils.py), not LiteLLM/Anthropic/othernemo_gym.server_utils.request() (aiohttp). httpx has O(n^2) connection pooling that hangs at high concurrency. When wrapping external libraries that use httpx internally, replace their HTTP transport with an aiohttp adapter — see resources_servers/tavily_search/app.py (TavilySearchAIOHTTPClient) for the pattern and docs/infrastructure/engineering-notes/aiohttp-vs-httpx.md for the rationale./run endpoint must be async-s) and cryptographic signature (-S)For detailed code patterns, schemas, and examples: see references/patterns.md.
development
Maintainer-only workflow for handling GitHub Secret Scanning alerts on OpenClaw. Use when Codex needs to triage, redact, clean up, and resolve secret leakage found in issue comments, issue bodies, PR comments, or other GitHub content.
development
Maintainer workflow for OpenClaw releases, prereleases, changelog release notes, and publish validation. Use when Codex needs to prepare or verify stable or beta release steps, align version naming, assemble release notes, check release auth requirements, or validate publish-time commands and artifacts.
development
Run, watch, debug, and extend OpenClaw QA testing with qa-lab and qa-channel. Use when Codex needs to execute the repo-backed QA suite, inspect live QA artifacts, debug failing scenarios, add new QA scenarios, or explain the OpenClaw QA workflow. Prefer the live OpenAI lane with regular openai/gpt-5.4 in fast mode; do not use gpt-5.4-pro or gpt-5.4-mini unless the user explicitly overrides that policy.
development
End-to-end Parallels smoke, upgrade, and rerun workflow for OpenClaw across macOS, Windows, and Linux guests. Use when Codex needs to run, rerun, debug, or interpret VM-based install, onboarding, gateway smoke tests, latest-release-to-main upgrade checks, fresh snapshot retests, or optional Discord roundtrip verification under Parallels.