06-post-training/openrlhf/SKILL.md
High-performance RLHF framework with Ray+vLLM acceleration. Use for PPO, GRPO, RLOO, DPO training of large models (7B-70B+). Built on Ray, vLLM, ZeRO-3. 2× faster than DeepSpeedChat with distributed architecture and GPU resource sharing.
npx skillsauth add Orchestra-Research/AI-Research-SKILLs openrlhf-training
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
OpenRLHF is a Ray-based RLHF framework optimized for distributed training with vLLM inference acceleration.
Installation:
# Launch Docker container
docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN \
-v $PWD:/openrlhf nvcr.io/nvidia/pytorch:25.02-py3 bash
# Uninstall conflicts
sudo pip uninstall xgboost transformer_engine flash_attn pynvml -y
# Install OpenRLHF with vLLM
pip install openrlhf[vllm]
PPO Training (Hybrid Engine):
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{"working_dir": "/openrlhf"}' \
-- python3 -m openrlhf.cli.train_ppo_ray \
--ref_num_nodes 1 --ref_num_gpus_per_node 8 \
--reward_num_nodes 1 --reward_num_gpus_per_node 8 \
--critic_num_nodes 1 --critic_num_gpus_per_node 8 \
--actor_num_nodes 1 --actor_num_gpus_per_node 8 \
--vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
--colocate_all_models \
--vllm_gpu_memory_utilization 0.5 \
--pretrain OpenRLHF/Llama-3-8b-sft-mixture \
--reward_pretrain OpenRLHF/Llama-3-8b-rm-700k \
--save_path ./output/llama3-8b-rlhf \
--micro_train_batch_size 8 --train_batch_size 128 \
--micro_rollout_batch_size 16 --rollout_batch_size 1024 \
--max_epochs 1 --prompt_max_len 1024 --generate_max_len 1024 \
--zero_stage 3 --bf16 \
--actor_learning_rate 5e-7 --critic_learning_rate 9e-6 \
--init_kl_coef 0.01 --normalize_reward \
--gradient_checkpointing --packing_samples \
--vllm_enable_sleep --deepspeed_enable_sleep
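The GPU and batch-size flags in the command above have to agree with each other arithmetically. The sketch below is one way to sanity-check them before submitting a job. It assumes the usual DeepSpeed relation (train batch size = micro batch size × gradient accumulation steps × data-parallel ranks) and assumes that with --colocate_all_models the vLLM engines are expected to cover exactly the actor's GPUs. `check_ppo_layout` is a hypothetical helper, not part of OpenRLHF.

```python
def check_ppo_layout(actor_nodes, actor_gpus_per_node,
                     vllm_engines, vllm_tensor_parallel,
                     train_batch_size, micro_train_batch_size,
                     rollout_batch_size):
    """Hypothetical helper: derive quantities implied by the CLI flags."""
    actor_gpus = actor_nodes * actor_gpus_per_node
    vllm_gpus = vllm_engines * vllm_tensor_parallel

    # Assumption: in colocated mode, vLLM shares exactly the actor's GPUs.
    if vllm_gpus != actor_gpus:
        raise ValueError(
            f"vLLM needs {vllm_gpus} GPUs but the actor has {actor_gpus}")

    # Assumption: every actor GPU is one data-parallel rank (ZeRO-3).
    samples_per_micro_step = micro_train_batch_size * actor_gpus
    if train_batch_size % samples_per_micro_step != 0:
        raise ValueError("train_batch_size must be divisible by "
                         "micro_train_batch_size * number of GPUs")

    return {
        "actor_gpus": actor_gpus,
        "grad_accum_steps": train_batch_size // samples_per_micro_step,
        # Assumption: one sample per prompt and --max_epochs 1.
        "optimizer_steps_per_rollout": rollout_batch_size // train_batch_size,
    }


# Values taken from the PPO command above.
print(check_ppo_layout(1, 8, 4, 2, 128, 8, 1024))
```

Under these assumptions the example command implies 2 gradient accumulation steps and 8 optimizer steps per rollout batch.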
GRPO Training (Group Relative Policy Optimization):
# Same command as PPO, but add:
--advantage_estimator group_norm
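To illustrate what group normalization means here: GRPO samples several responses per prompt and scores each one relative to the others in its group, so no critic is needed to supply a baseline. The following is a minimal sketch of that idea in plain Python, not OpenRLHF's implementation. The function name, the `eps` value, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_norm_advantages(rewards, eps=1e-9):
    """Normalize the rewards of one prompt's sampled responses.

    Illustrative only: each advantage is the reward's distance from the
    group mean, divided by the group standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to the same prompt, two judged correct (reward 1.0).
print(group_norm_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every response in a group receives the same reward, all advantages are zero and that prompt contributes no policy gradient.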
Step 1: Train reward model:
deepspeed --module openrlhf.cli.train_rm \
--save_path ./output/llama3-8b-rm \
--save_steps -1 --logging_steps 1 \
--eval_steps -1 --train_batch_size 256 \
--micro_train_batch_size 1 --pretrain meta-llama/Meta-Llama-3-8B \
--bf16 --max_epochs 1 --max_len 8192 \
--zero_stage 3 --learning_rate 9e-6 \
--dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \
--apply_chat_template --chosen_key chosen \
--rejected_key rejected --flash_attn --gradient_checkpointing
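For intuition about what this step optimizes, reward models trained on chosen/rejected pairs commonly use a pairwise (Bradley-Terry style) objective: the loss shrinks as the chosen response's score rises above the rejected one's. The sketch below shows that objective for a single pair, assuming scalar rewards. The function names are hypothetical and this is not OpenRLHF's code.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_rm_loss(chosen_reward, rejected_reward):
    """-log(sigmoid(chosen - rejected)) for one preference pair."""
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) is the same as softplus(-m).
    return softplus(-margin)


print(pairwise_rm_loss(2.0, 0.0))  # chosen scored higher: small loss
print(pairwise_rm_loss(0.0, 2.0))  # rejected scored higher: large loss
```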
Step 2: PPO training:
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8
ray job submit --address="http://127.0.0.1:8265" \
-- python3 -m openrlhf.cli.train_ppo_ray \
--ref_num_nodes 1 --ref_num_gpus_per_node 8 \
--reward_num_nodes 1 --reward_num_gpus_per_node 8 \
--critic_num_nodes 1 --critic_num_gpus_per_node 8 \
--actor_num_nodes 1 --actor_num_gpus_per_node 8 \
--vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
--colocate_all_models \
--pretrain OpenRLHF/Llama-3-8b-sft-mixture \
--reward_pretrain ./output/llama3-8b-rm \
--save_path ./output/llama3-8b-ppo \
--micro_train_batch_size 8 --train_batch_size 128 \
--micro_rollout_batch_size 16 --rollout_batch_size 1024 \
--max_epochs 1 --prompt_max_len 1024 --generate_max_len 1024 \
--zero_stage 3 --bf16 \
--actor_learning_rate 5e-7 --critic_learning_rate 9e-6 \
--init_kl_coef 0.01 --normalize_reward \
--vllm_enable_sleep --deepspeed_enable_sleep
GRPO training, a memory-efficient alternative to PPO (no critic model):
ray job submit --address="http://127.0.0.1:8265" \
-- python3 -m openrlhf.cli.train_ppo_ray \
--advantage_estimator group_norm \
--ref_num_nodes 1 --ref_num_gpus_per_node 8 \
--reward_num_nodes 1 --reward_num_gpus_per_node 8 \
--actor_num_nodes 1 --actor_num_gpus_per_node 8 \
--vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
--colocate_all_models \
--pretrain OpenRLHF/Llama-3-8b-sft-mixture \
--reward_pretrain OpenRLHF/Llama-3-8b-rm-700k \
--save_path ./output/llama3-8b-grpo \
--micro_train_batch_size 8 --train_batch_size 128 \
--micro_rollout_batch_size 16 --rollout_batch_size 1024 \
--max_epochs 1 --bf16 \
--actor_learning_rate 5e-7 \
--init_kl_coef 0.01 --use_kl_loss --kl_estimator k3 \
--normalize_reward --no_advantage_std_norm
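The k1, k2, and k3 values accepted by --kl_estimator refer to the standard sample-based approximations of KL divergence between the policy and the reference model. The sketch below computes each one for a single token from its two log-probabilities. It is a plain-Python illustration under that assumption, with a hypothetical function name, and is not OpenRLHF's code.

```python
import math


def approx_kl(logp_policy, logp_ref, estimator="k3"):
    """Per-token KL estimate from policy and reference log-probabilities."""
    log_ratio = logp_policy - logp_ref
    if estimator == "k1":
        # Unbiased but high variance; individual samples can be negative.
        return log_ratio
    if estimator == "k2":
        # Biased, lower variance, never negative.
        return 0.5 * log_ratio ** 2
    if estimator == "k3":
        # Unbiased, lower variance than k1, never negative.
        neg = -log_ratio
        return math.exp(neg) - 1.0 - neg
    raise ValueError(f"unknown estimator: {estimator}")


for name in ("k1", "k2", "k3"):
    print(name, approx_kl(-2.0, -1.0, name))
```

All three estimators return zero when the policy and the reference assign the token the same probability.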
Key GRPO parameters:
--advantage_estimator group_norm - Enables GRPO
--use_kl_loss - KL loss from GRPO paper
--kl_estimator k3 - Loss function (k2 ≈ k1)
--no_advantage_std_norm - Disables std normalization

Simpler alternative without reward model:
deepspeed --module openrlhf.cli.train_dpo \
--save_path ./output/llama3-8b-dpo \
--save_steps -1 --logging_steps 1 \
--eval_steps -1 --train_batch_size 256 \
--micro_train_batch_size 2 --pretrain meta-llama/Meta-Llama-3-8B \
--bf16 --max_epochs 1 --max_len 8192 \
--zero_stage 3 --learning_rate 5e-7 --beta 0.1 \
--dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \
--apply_chat_template --chosen_key chosen \
--rejected_key rejected --flash_attn --gradient_checkpointing
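To show where --beta enters, here is a minimal sketch of the standard DPO objective for one preference pair. It assumes each argument is the summed log-probability of a full response under the policy or the frozen reference model. The function names are hypothetical and this is not OpenRLHF's code.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair."""
    # Implicit rewards: how far the policy moved away from the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(logits)


# Policy identical to the reference: loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy raised the chosen response and lowered the rejected one.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

A larger beta makes the loss more sensitive to the same log-ratio gap.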
Use OpenRLHF when:
Algorithm selection:
Use alternatives instead:
Issue: GPU OOM with large models
Disable model colocation:
# Remove --colocate_all_models flag
# Allocate separate GPUs for each model
--actor_num_gpus_per_node 8 \
--critic_num_gpus_per_node 8 \
--reward_num_gpus_per_node 8 \
--ref_num_gpus_per_node 8
Issue: DeepSpeed GPU index out of range
Set environment variable:
export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1
Issue: Training instability
Use Hybrid Engine instead of async:
--colocate_all_models \
--vllm_enable_sleep \
--deepspeed_enable_sleep
Adjust KL coefficient:
--init_kl_coef 0.05 # Increase from 0.01
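To see why a larger coefficient can stabilize training: in PPO-style RLHF the KL term is typically charged as a per-token penalty, with the reward model's score added on the final token. The sketch below assumes that common formulation. The function name and the numbers are illustrative, not taken from OpenRLHF.

```python
def shaped_rewards(sequence_reward, per_token_kl, kl_coef):
    """Per-token rewards: KL penalty everywhere, score on the last token."""
    if not per_token_kl:
        raise ValueError("response must contain at least one token")
    rewards = [-kl_coef * kl for kl in per_token_kl]
    rewards[-1] += sequence_reward
    return rewards


kl = [0.1, 0.2, 0.3]  # illustrative per-token KL estimates
for coef in (0.01, 0.05):
    print(coef, sum(shaped_rewards(1.0, kl, coef)))
```

With the higher coefficient, the same drift from the reference model costs more total reward, which pulls the policy back toward the reference.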
Issue: Slow generation during PPO
Enable vLLM acceleration:
--vllm_num_engines 4 \
--vllm_tensor_parallel_size 2 \
--vllm_gpu_memory_utilization 0.5
Hybrid Engine GPU sharing: See references/hybrid-engine.md for vLLM sleep mode, DeepSpeed sleep mode, and optimal node allocation.
Algorithm comparison: See references/algorithm-comparison.md for PPO vs GRPO vs RLOO vs REINFORCE++ benchmarks and hyperparameters.
Multi-node setup: See references/multi-node-training.md for Ray cluster configuration and fault tolerance.
Custom reward functions: See references/custom-rewards.md for reinforced fine-tuning and agent RLHF.
Performance: