Orchestra-Research/torchforge-rl-training

Name: torchforge-rl-training
Author: Orchestra-Research

06-post-training/torchforge/SKILL.md

npx skillsauth add Orchestra-Research/AI-Research-SKILLs torchforge-rl-training

torchforge: PyTorch-Native Agentic RL Library

torchforge is Meta's PyTorch-native RL library that separates infrastructure concerns from algorithm concerns. It enables rapid RL research by letting you focus on algorithms while handling distributed training, inference, and weight sync automatically.

When to Use torchforge

Choose torchforge when you need:

Clean separation between RL algorithms and infrastructure
PyTorch-native abstractions (no Ray dependency)
Easy algorithm experimentation (GRPO, DAPO, SAPO in ~100 lines)
Scalable training with Monarch actor system
Integration with TorchTitan for model parallelism

Consider alternatives when:

You need production-ready stability → use miles or verl
You want Megatron-native training → use slime

Note: torchforge is experimental, and its APIs may change.

Key Features

Algorithm isolation: Implement RL algorithms without touching infrastructure
Scalability: From single GPU to thousands via Monarch
Modern stack: TorchTitan (training), vLLM (inference), TorchStore (sync)
Loss functions: GRPO, DAPO, CISPO, GSPO, SAPO built-in

Architecture Overview

┌─────────────────────────────────────────────────────────┐
│ Application Layer (Your Code)                           │
│ - Define reward models, loss functions, sampling        │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│ Forge API Layer                                         │
│ - Episode, Group dataclasses                            │
│ - Service interfaces (async/await)                      │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│ Distributed Services (Monarch)                          │
│ ├── Trainer (TorchTitan FSDP)                           │
│ ├── Generator (vLLM inference)                          │
│ ├── Reference Model (frozen KL baseline)                │
│ └── Reward Actors (compute rewards)                     │
└─────────────────────────────────────────────────────────┘
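The layering above can be sketched with plain asyncio. This is an illustration only: StubGenerator, StubReward, collect_group, and the Episode fields are hypothetical stand-ins, not torchforge or Monarch APIs. It shows application code awaiting a generator service and then scoring the responses concurrently with a reward actor.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical episode record; field names are illustrative only."""
    prompt: str
    response: str = ""
    reward: float = 0.0


class StubGenerator:
    """Stand-in for the generator service (vLLM in the real stack)."""

    async def generate(self, prompt: str, n: int) -> list[str]:
        await asyncio.sleep(0)  # yield control, as a remote call would
        return [f"{prompt} -> answer {i}" for i in range(n)]


class StubReward:
    """Stand-in for a reward actor."""

    async def score(self, response: str) -> float:
        await asyncio.sleep(0)
        return 1.0 if response.endswith("0") else 0.0


async def collect_group(prompt: str, n: int) -> list[Episode]:
    generator, reward = StubGenerator(), StubReward()
    responses = await generator.generate(prompt, n)
    # Score all responses concurrently rather than one at a time.
    rewards = await asyncio.gather(*(reward.score(r) for r in responses))
    return [Episode(prompt, r, s) for r, s in zip(responses, rewards)]


group = asyncio.run(collect_group("2+2", n=4))
print([e.reward for e in group])
```

The application layer only sees awaitable calls; where each service runs and on which GPU is left to the layer below.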

Installation

# Create environment
conda create -n forge python=3.12
conda activate forge

# Install (handles PyTorch nightly + dependencies)
./scripts/install.sh

# Verify
python -c "import torch, forge, vllm; print('OK')"

ROCm Installation

./scripts/install_rocm.sh

Quick Start

SFT Training (2+ GPUs)

python -m apps.sft.main --config apps/sft/llama3_8b.yaml

GRPO Training (3+ GPUs)

python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml

Workflow 1: GRPO Training for Math Reasoning

Use this workflow for training reasoning models with group-relative advantages.
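As a rough illustration of "group-relative advantages": each prompt gets several sampled responses, and every response is scored relative to the mean reward of its own group. The sketch below uses only the standard library. Dividing by the group standard deviation and the eps value are assumptions about the normalization; torchforge's own implementation may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Center rewards on the group mean and scale by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to one prompt: two correct, two incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# A group where every response gets the same reward carries no signal.
print(group_relative_advantages([1.0, 1.0]))
```

Because the baseline is the group mean, no separate value network is needed, and groups with identical rewards contribute zero advantage.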

Prerequisites Checklist

[ ] 3+ GPUs (GPU0: trainer, GPU1: ref_model, GPU2: generator)
[ ] Model from HuggingFace Hub
[ ] Training dataset (GSM8K, MATH, etc.)

Step 1: Create Configuration

# config/grpo_math.yaml
model: "Qwen/Qwen2.5-7B-Instruct"

dataset:
  path: "openai/gsm8k"
  split: "train"
  streaming: true

training:
  batch_size: 4
  learning_rate: 1e-6
  seq_len: 4096
  dtype: bfloat16
  gradient_accumulation_steps: 4

grpo:
  n_samples: 8           # Responses per prompt
  clip_low: 0.2
  clip_high: 0.28
  beta: 0.1              # KL penalty coefficient
  temperature: 0.7

services:
  generator:
    procs: 1
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 1
    num_replicas: 1
    with_gpus: true
  ref_model:
    procs: 1
    num_replicas: 1
    with_gpus: true
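The clip_low and clip_high values in this config are not symmetric. A minimal sketch of how such bounds are commonly applied to an importance ratio, assuming they map to the interval [1 - clip_low, 1 + clip_high]. That mapping and the name clipped_surrogate are assumptions for illustration, not confirmed torchforge behavior.

```python
def clipped_surrogate(
    ratio: float,
    advantage: float,
    clip_low: float = 0.2,
    clip_high: float = 0.28,
) -> float:
    """Per-token clipped objective with asymmetric bounds."""
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    # Taking the minimum keeps the more pessimistic of the two estimates.
    return min(ratio * advantage, clipped * advantage)


print(clipped_surrogate(1.5, 1.0))   # ratio above the upper bound, positive advantage
print(clipped_surrogate(0.5, -1.0))  # ratio below the lower bound, negative advantage
print(clipped_surrogate(1.1, 1.0))   # ratio inside the bounds, left unchanged
```

A wider upper bound gives low-probability tokens with positive advantage more room to grow before clipping takes effect.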

Step 2: Define Reward Function

# rewards.py
# Built-in reward functions live in forge.data.rewards
from forge.data.rewards import MathReward, ThinkingReward
import re

# Or define your own reward function
class CustomMathReward:
    def __call__(self, prompt: str, response: str, target: str) -> float:
        # Extract the answer from \boxed{...}; the braces are escaped so they
        # match literally. This simple pattern does not handle nested braces.
        match = re.search(r'\\boxed\{([^}]+)\}', response)
        if not match:
            return 0.0

        # Exact string match after trimming whitespace on both sides
        answer = match.group(1).strip()
        return 1.0 if answer == target.strip() else 0.0
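A single regular expression stops at the first closing brace, so an answer such as \boxed{\frac{1}{2}} is cut short. The sketch below is one possible workaround: a small brace-counting extractor. The helper name extract_boxed is hypothetical and not part of forge.data.rewards.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, honoring nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, out = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
    return None  # braces never balanced


print(extract_boxed(r"so the answer is \boxed{\frac{1}{2}}"))
print(extract_boxed(r"\boxed{42}"))
print(extract_boxed("no boxed answer here"))
```

Taking the last \boxed occurrence is a design choice here: models often restate intermediate results before the final answer.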

Step 3: Launch Training

python -m apps.grpo.main --config config/grpo_math.yaml

Step 4: Monitor Progress

[ ] Check W&B dashboard for loss curves
[ ] Verify entropy decreases gradually (the policy becomes more deterministic); a rapid drop toward zero signals policy collapse
[ ] Monitor KL divergence (should stay bounded)
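For orientation, here is how the two monitored quantities can be computed for a single token position. This is a standalone sketch; how torchforge actually computes and logs these metrics is not shown in this document. The KL function uses the non-negative "k3" estimator, which is one common choice.

```python
import math


def entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def k3_kl(logprob: float, ref_logprob: float) -> float:
    """Per-token KL estimate: exp(d) - d - 1 with d = ref_logprob - logprob."""
    d = ref_logprob - logprob
    return math.exp(d) - d - 1.0


print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 tokens: log(4)
print(entropy([1.0, 0.0, 0.0, 0.0]))      # fully deterministic: zero
print(k3_kl(-1.0, -1.0))                  # policy equals reference: zero
print(k3_kl(-1.0, -2.0))                  # policy has drifted: positive
```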

Workflow 2: Custom Loss Function

Use this workflow to implement new RL algorithms.

Step 1: Create Loss Class

# src/forge/losses/custom_loss.py
import torch
import torch.nn as nn

class CustomLoss(nn.Module):
    def __init__(self, clip_range: float = 0.2, beta: float = 0.1):
        super().__init__()
        self.clip_range = clip_range
        self.beta = beta

    def forward(
        self,
        logprobs: torch.Tensor,
        ref_logprobs: torch.Tensor,
        advantages: torch.Tensor,
        padding_mask: torch.Tensor,
    ) -> torch.Tensor:
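        # Importance ratio. NOTE: PPO/GRPO-style clipping normally uses the
        # log-probs of the policy that generated the samples; ref_logprobs
        # stands in for them here because no others are passed in.
        ratio = torch.exp(logprobs - ref_logprobs)

        # Clipped policy gradient
        clipped_ratio = torch.clamp(
            ratio,
            1 - self.clip_range,
            1 + self.clip_range,
        )
        pg_loss = -torch.min(ratio * advantages, clipped_ratio * advantages)

        # KL penalty: non-negative "k3" estimator of KL(policy || reference).
        # The raw difference ref_logprobs - logprobs is not a penalty;
        # minimizing it only pushes logprobs up.
        log_ratio = ref_logprobs - logprobs
        kl = torch.exp(log_ratio) - log_ratio - 1

        # Apply mask and average over non-padding tokens
        masked_loss = (pg_loss + self.beta * kl) * padding_mask
        loss = masked_loss.sum() / padding_mask.sum().clamp(min=1)

        return loss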
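The final aggregation step in this loss divides by the number of unmasked tokens across the whole batch. The sketch below, written with plain Python lists as stand-ins for tensors, contrasts that token-level mean with a per-sequence mean. The function names and the toy values are illustrative only.

```python
def token_mean(loss: list[list[float]], mask: list[list[int]]) -> float:
    """Average over every unmasked token in the batch."""
    total = sum(l * m for row_l, row_m in zip(loss, mask) for l, m in zip(row_l, row_m))
    count = sum(m for row in mask for m in row)
    return total / max(count, 1)


def sequence_mean(loss: list[list[float]], mask: list[list[int]]) -> float:
    """Average within each sequence first, then across sequences."""
    per_seq = [
        sum(l * m for l, m in zip(row_l, row_m)) / max(sum(row_m), 1)
        for row_l, row_m in zip(loss, mask)
    ]
    return sum(per_seq) / len(per_seq)


# Sequence 0 has four real tokens; sequence 1 has one real token and three pads.
loss = [[1.0, 1.0, 1.0, 1.0], [4.0, 9.0, 9.0, 9.0]]
mask = [[1, 1, 1, 1], [1, 0, 0, 0]]

print(token_mean(loss, mask))     # (4 * 1.0 + 4.0) / 5 tokens
print(sequence_mean(loss, mask))  # (1.0 + 4.0) / 2 sequences
```

Padded positions contribute nothing in either case. The token-level mean weights long responses more heavily, while the per-sequence mean gives every response equal weight.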

Step 2: Integrate into Application

# apps/custom/main.py
from forge.losses.custom_loss import CustomLoss

loss_fn = CustomLoss(clip_range=0.2, beta=0.1)

# In training loop
loss = loss_fn(
    logprobs=logprobs,
    ref_logprobs=ref_logprobs,
    advantages=advantages,
    padding_mask=padding_mask,
)

Workflow 3: Multi-GPU Distributed Training

Use this workflow for scaling to multiple GPUs or nodes.

Configuration for Distributed

# config/distributed.yaml
model: "meta-llama/Meta-Llama-3.1-8B-Instruct"

parallelism:
  tensor_parallel_degree: 2    # Split model across GPUs
  pipeline_parallel_degree: 1
  data_parallel_shard_degree: 2

services:
  generator:
    procs: 2                   # 2 processes for TP=2
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 4                   # TP (2) x PP (1) x DP shard (2) = 4 processes
    num_replicas: 1
    with_gpus: true
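A quick way to sanity-check a parallelism block is to multiply its degrees. The sketch assumes the TorchTitan-style convention that the number of trainer processes equals the product of the parallel degrees. The helper required_procs is hypothetical, and whether torchforge validates this the same way is an assumption.

```python
from math import prod


def required_procs(parallelism: dict[str, int]) -> int:
    """Number of processes implied by a set of parallel degrees."""
    return prod(parallelism.values())


parallelism = {
    "tensor_parallel_degree": 2,
    "pipeline_parallel_degree": 1,
    "data_parallel_shard_degree": 2,
}
print(required_procs(parallelism))
```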

Launch with SLURM

# Submit job
sbatch --nodes=2 --gpus-per-node=8 run_grpo.sh

Launch Locally (Multi-GPU)

# 8 GPU setup
python -m apps.grpo.main \
    --config config/distributed.yaml \
    --trainer.procs 4 \
    --generator.procs 4

Core API Reference

Training Batch Format

torchforge uses dictionary-based batches for training:

# inputs: list of dicts with torch.Tensor values
inputs = [{"tokens": torch.Tensor}]

# targets: list of dicts with training signals
targets = [{
    "response": torch.Tensor,
    "ref_logprobs": torch.Tensor,
    "advantages": torch.Tensor,
    "padding_mask": torch.Tensor
}]

# train_step returns loss as float
loss = trainer.train_step(inputs, targets)
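To make the role of padding_mask concrete, the sketch below pads variable-length token lists to a fixed length and builds the matching mask. Plain lists stand in for torch.Tensor values so the example has no dependencies. The helper pad_and_mask, the pad_id value, and right-padding are all assumptions for illustration.

```python
def pad_and_mask(
    token_ids: list[int], seq_len: int, pad_id: int = 0
) -> tuple[list[int], list[float]]:
    """Truncate or right-pad to seq_len; mask is 1.0 for real tokens, 0.0 for pads."""
    ids = token_ids[:seq_len]
    n_pad = seq_len - len(ids)
    return ids + [pad_id] * n_pad, [1.0] * len(ids) + [0.0] * n_pad


tokens, padding_mask = pad_and_mask([11, 12, 13], seq_len=5)
print(tokens)
print(padding_mask)
```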

Completion

Generated output from vLLM:

from dataclasses import dataclass

@dataclass
class Completion:
    text: str              # Generated text
    token_ids: list[int]   # Token IDs
    logprobs: list[float]  # Log probabilities
    metadata: dict         # Custom metadata
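As a usage illustration, per-token log probabilities can be summed into a sequence log-likelihood or turned into a perplexity. The class below is a local copy of the fields listed above so the snippet runs on its own. The default for metadata and the two helper functions are assumptions, not part of the library.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Completion:
    """Local mirror of the fields shown above, for illustration only."""
    text: str
    token_ids: list[int]
    logprobs: list[float]
    metadata: dict = field(default_factory=dict)


def sequence_logprob(c: Completion) -> float:
    """Log-likelihood of the whole response under the sampling policy."""
    return sum(c.logprobs)


def perplexity(c: Completion) -> float:
    """exp of the mean negative log probability per token."""
    return math.exp(-sum(c.logprobs) / len(c.logprobs))


c = Completion(text="42", token_ids=[19, 17], logprobs=[math.log(0.5), math.log(0.25)])
print(sequence_logprob(c))
print(perplexity(c))
```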

Built-in Loss Functions

Loss functions are in the forge.losses module:

from forge.losses import SimpleGRPOLoss, ReinforceLoss

# SimpleGRPOLoss for GRPO training
loss_fn = SimpleGRPOLoss(beta=0.1)

# Forward pass
loss = loss_fn(
    logprobs=logprobs,
    ref_logprobs=ref_logprobs,
    advantages=advantages,
    padding_mask=padding_mask
)

ReinforceLoss

from forge.losses.reinforce_loss import ReinforceLoss

# With optional importance ratio clipping
loss_fn = ReinforceLoss(clip_ratio=0.2)

Common Issues and Solutions

Issue: Not Enough GPUs

Symptoms: "Insufficient GPU resources" error

Solutions:

# Reduce service requirements
services:
  generator:
    procs: 1
    with_gpus: true
  trainer:
    procs: 1
    with_gpus: true
  # Remove ref_model (uses generator weights)

Or use CPU for reference model:

ref_model:
  with_gpus: false
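Before launching, it can help to total the GPUs a services block asks for. The sketch assumes one GPU per process per replica for every service with with_gpus set; that accounting rule and the helper gpus_needed are assumptions, not torchforge's actual scheduler logic.

```python
def gpus_needed(services: dict[str, dict]) -> int:
    """Total GPUs requested, assuming one GPU per process per replica."""
    total = 0
    for spec in services.values():
        if spec.get("with_gpus", False):
            total += spec.get("procs", 1) * spec.get("num_replicas", 1)
    return total


full = {
    "generator": {"procs": 1, "num_replicas": 1, "with_gpus": True},
    "trainer": {"procs": 1, "num_replicas": 1, "with_gpus": True},
    "ref_model": {"procs": 1, "num_replicas": 1, "with_gpus": True},
}
# Same layout with the reference model moved to CPU.
reduced = {**full, "ref_model": {"procs": 1, "num_replicas": 1, "with_gpus": False}}

print(gpus_needed(full))
print(gpus_needed(reduced))
```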

Issue: OOM During Generation

Symptoms: CUDA OOM in vLLM

Solutions:

# Reduce the number of responses sampled per prompt
grpo:
  n_samples: 4  # Reduce from 8

# Or reduce sequence length
training:
  seq_len: 2048

Issue: Slow Weight Sync

Symptoms: Long pauses between training and generation

Solutions:

# Enable RDMA (if available); set in the shell before launching
export TORCHSTORE_USE_RDMA=1

# Or reduce sync frequency (in the YAML config)
training:
  sync_interval: 10  # Sync every 10 steps

Issue: Policy Collapse

Symptoms: Entropy drops to zero, reward stops improving

Solutions:

# Increase KL penalty
grpo:
  beta: 0.2  # Increase from 0.1

# Or add entropy bonus
training:
  entropy_coef: 0.01
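Collapse is easier to act on when it is detected automatically. A minimal sketch of such a check over logged entropy values follows. The function name, window size, and threshold are hypothetical and would need tuning for a real run.

```python
def entropy_collapsed(history: list[float], window: int = 3, threshold: float = 0.05) -> bool:
    """True when the last `window` entropy readings are all below `threshold`."""
    recent = history[-window:]
    return len(recent) == window and all(h < threshold for h in recent)


healthy = [2.1, 1.8, 1.6, 1.5, 1.4]
collapsing = [2.1, 0.9, 0.04, 0.02, 0.01]

print(entropy_collapsed(healthy))
print(entropy_collapsed(collapsing))
```

Requiring several consecutive low readings avoids reacting to a single noisy step.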

Resources

Documentation: https://meta-pytorch.org/torchforge
GitHub: https://github.com/meta-pytorch/torchforge
Discord: https://discord.gg/YsTYBh6PD9
TorchTitan: https://github.com/pytorch/torchtitan
Monarch: https://github.com/meta-pytorch/monarch

Provides guidance for PyTorch-native agentic RL using torchforge, Meta's library separating infra from algorithms. Use when you want clean RL abstractions, easy algorithm experimentation, or scalable training with Monarch and TorchTitan.

5,311 stars

development

Updated Mar 20, 2026

Install globally

npx skillsauth add Orchestra-Research/AI-Research-SKILLs torchforge-rl-training

This one command installs the skill globally. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each entry below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
VirusTotal (multi-engine malware detection): Error, 70%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Mar 20, 2026, 12:15 PM (243.3s, 3 files scanned)

SKILL.md

name: torchforge-rl-training
description: Provides guidance for PyTorch-native agentic RL using torchforge, Meta's library separating infra from algorithms. Use when you want clean RL abstractions, easy algorithm experimentation, or scalable training with Monarch and TorchTitan.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Reinforcement Learning, PyTorch, GRPO, SFT, Monarch, TorchTitan, Meta]
dependencies: [torch>=2.9.0, torchtitan>=0.2.0, vllm, monarch]

Install manually

# Clone the repo
git clone https://github.com/Orchestra-Research/AI-Research-SKILLs.git

# Copy into the Claude Code skills folder (global); create it if missing
mkdir -p ~/.claude/skills
cp -r AI-Research-SKILLs/06-post-training/torchforge ~/.claude/skills/

Repository

Orchestra-Research/AI-Research-SKILLs

5,311 stars

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT