06-post-training/torchforge/SKILL.md
Provides guidance for PyTorch-native agentic RL using torchforge, Meta's library separating infra from algorithms. Use when you want clean RL abstractions, easy algorithm experimentation, or scalable training with Monarch and TorchTitan.
npx skillsauth add Orchestra-Research/AI-Research-SKILLs torchforge-rl-trainingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
torchforge is Meta's PyTorch-native RL library that separates infrastructure concerns from algorithm concerns. It enables rapid RL research by letting you focus on algorithms while handling distributed training, inference, and weight sync automatically.
Choose torchforge when you need:
Consider alternatives when:
┌─────────────────────────────────────────────────────────┐
│ Application Layer (Your Code) │
│ - Define reward models, loss functions, sampling │
└─────────────────────┬───────────────────────────────────┘
│
┌─────────────────────▼───────────────────────────────────┐
│ Forge API Layer │
│ - Episode, Group dataclasses │
│ - Service interfaces (async/await) │
└─────────────────────┬───────────────────────────────────┘
│
┌─────────────────────▼───────────────────────────────────┐
│ Distributed Services (Monarch) │
│ ├── Trainer (TorchTitan FSDP) │
│ ├── Generator (vLLM inference) │
│ ├── Reference Model (frozen KL baseline) │
│ └── Reward Actors (compute rewards) │
└─────────────────────────────────────────────────────────┘
# Create environment
conda create -n forge python=3.12
conda activate forge
# Install (handles PyTorch nightly + dependencies)
./scripts/install.sh
# Verify
python -c "import torch, forge, vllm; print('OK')"
./scripts/install_rocm.sh
python -m apps.sft.main --config apps/sft/llama3_8b.yaml
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
Use this workflow for training reasoning models with group-relative advantages.
# config/grpo_math.yaml
model: "Qwen/Qwen2.5-7B-Instruct"
dataset:
path: "openai/gsm8k"
split: "train"
streaming: true
training:
batch_size: 4
learning_rate: 1e-6
seq_len: 4096
dtype: bfloat16
gradient_accumulation_steps: 4
grpo:
n_samples: 8 # Responses per prompt
clip_low: 0.2
clip_high: 0.28
beta: 0.1 # KL penalty coefficient
temperature: 0.7
services:
generator:
procs: 1
num_replicas: 1
with_gpus: true
trainer:
procs: 1
num_replicas: 1
with_gpus: true
ref_model:
procs: 1
num_replicas: 1
with_gpus: true
# rewards.py
# Reward functions are in forge.data.rewards
from forge.data.rewards import MathReward, ThinkingReward
import re
# Or define your own reward function
class CustomMathReward:
def __call__(self, prompt: str, response: str, target: str) -> float:
# Extract answer from response
match = re.search(r'\\boxed{([^}]+)}', response)
if not match:
return 0.0
answer = match.group(1).strip()
return 1.0 if answer == target else 0.0
python -m apps.grpo.main --config config/grpo_math.yaml
Use this workflow to implement new RL algorithms.
# src/forge/losses/custom_loss.py
import torch
import torch.nn as nn
class CustomLoss(nn.Module):
def __init__(self, clip_range: float = 0.2, beta: float = 0.1):
super().__init__()
self.clip_range = clip_range
self.beta = beta
def forward(
self,
logprobs: torch.Tensor,
ref_logprobs: torch.Tensor,
advantages: torch.Tensor,
padding_mask: torch.Tensor,
) -> torch.Tensor:
# Compute importance ratio
ratio = torch.exp(logprobs - ref_logprobs)
# Clipped policy gradient
clipped_ratio = torch.clamp(
ratio,
1 - self.clip_range,
1 + self.clip_range
)
pg_loss = -torch.min(ratio * advantages, clipped_ratio * advantages)
# KL penalty
kl = ref_logprobs - logprobs
# Apply mask and aggregate
masked_loss = (pg_loss + self.beta * kl) * padding_mask
loss = masked_loss.sum() / padding_mask.sum()
return loss
# apps/custom/main.py
from forge.losses.custom_loss import CustomLoss
loss_fn = CustomLoss(clip_range=0.2, beta=0.1)
# In training loop
loss = loss_fn(
logprobs=logprobs,
ref_logprobs=ref_logprobs,
advantages=advantages,
padding_mask=padding_mask,
)
Use this workflow for scaling to multiple GPUs or nodes.
# config/distributed.yaml
model: "meta-llama/Meta-Llama-3.1-8B-Instruct"
parallelism:
tensor_parallel_degree: 2 # Split model across GPUs
pipeline_parallel_degree: 1
data_parallel_shard_degree: 2
services:
generator:
procs: 2 # 2 processes for TP=2
num_replicas: 1
with_gpus: true
trainer:
procs: 2
num_replicas: 1
with_gpus: true
# Submit job
sbatch --nodes=2 --gpus-per-node=8 run_grpo.sh
# 8 GPU setup
python -m apps.grpo.main \
--config config/distributed.yaml \
--trainer.procs 4 \
--generator.procs 4
torchforge uses dictionary-based batches for training:
# inputs: list of dicts with torch.Tensor values
inputs = [{"tokens": torch.Tensor}]
# targets: list of dicts with training signals
targets = [{
"response": torch.Tensor,
"ref_logprobs": torch.Tensor,
"advantages": torch.Tensor,
"padding_mask": torch.Tensor
}]
# train_step returns loss as float
loss = trainer.train_step(inputs, targets)
Generated output from vLLM:
@dataclass
class Completion:
text: str # Generated text
token_ids: list[int] # Token IDs
logprobs: list[float] # Log probabilities
metadata: dict # Custom metadata
Loss functions are in the forge.losses module:
from forge.losses import SimpleGRPOLoss, ReinforceLoss
# SimpleGRPOLoss for GRPO training
loss_fn = SimpleGRPOLoss(beta=0.1)
# Forward pass
loss = loss_fn(
logprobs=logprobs,
ref_logprobs=ref_logprobs,
advantages=advantages,
padding_mask=padding_mask
)
from forge.losses.reinforce_loss import ReinforceLoss
# With optional importance ratio clipping
loss_fn = ReinforceLoss(clip_ratio=0.2)
Symptoms: "Insufficient GPU resources" error
Solutions:
# Reduce service requirements
services:
generator:
procs: 1
with_gpus: true
trainer:
procs: 1
with_gpus: true
# Remove ref_model (uses generator weights)
Or use CPU for reference model:
ref_model:
with_gpus: false
Symptoms: CUDA OOM in vLLM
Solutions:
# Reduce batch size
grpo:
n_samples: 4 # Reduce from 8
# Or reduce sequence length
training:
seq_len: 2048
Symptoms: Long pauses between training and generation
Solutions:
# Enable RDMA (if available)
export TORCHSTORE_USE_RDMA=1
# Or reduce sync frequency
training:
sync_interval: 10 # Sync every 10 steps
Symptoms: Entropy drops to zero, reward stops improving
Solutions:
# Increase KL penalty
grpo:
beta: 0.2 # Increase from 0.1
# Or add entropy bonus
training:
entropy_coef: 0.01
development
Performs ARA Seal Level 2 semantic epistemic review on Agent-Native Research Artifacts, scoring six dimensions (evidence relevance, falsifiability, scope calibration, argument coherence, exploration integrity, methodological rigor) and producing a constructive, severity-ranked report with a Strong Accept-to-Reject recommendation. Use after Level 1 structural validation passes, when an ARA needs an objective epistemic critique before publication or release.
testing
Records research provenance as a post-task epilogue, scanning conversation history at the end of a coding or research session to extract decisions, experiments, dead ends, claims, heuristics, and pivots, and writing them into the ara/ directory with user-vs-AI provenance tags. Use as a session epilogue — never during execution — to maintain a faithful, auditable trace of how a research project actually evolved.
development
Compiles any research input — PDF papers, GitHub repositories, experiment logs, code directories, or raw notes — into a complete Agent-Native Research Artifact (ARA) with cognitive layer (claims, concepts, heuristics), physical layer (configs, code stubs), exploration graph, and grounded evidence. Use when ingesting a paper or codebase into a structured, machine-executable knowledge package, building an ARA from scratch, or converting research outputs into a falsifiable, agent-traversable form.
testing
Comprehensive guide for writing systems papers targeting OSDI, SOSP, ASPLOS, NSDI, and EuroSys. Provides paragraph-level structural blueprints, writing patterns, venue-specific checklists, reviewer guidelines, LaTeX templates, and conference deadlines. Use this skill for all systems conference paper writing.