skills/skillxiv-v0.0.2-claude-opus-4.6/bro-rl-broad-rollout-scaling/SKILL.md
Overcome reasoning model training plateaus by increasing rollouts per prompt (N=512) rather than training steps, addressing the unsampled coupling that destabilizes learning. Theoretical analysis shows that broad exploration eliminates the plateau bottleneck.
npx skillsauth add ADu2021/skillXiv bro-rl-broad-rollout-scaling
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
BroRL addresses a fundamental plateau problem in reasoning model RL: training stops improving after ~3,000 steps. Rather than training longer, the solution is broader exploration. By increasing rollouts per prompt from 16 to 512, models escape the plateau and continue improving. The approach rests on a theoretical analysis of unsampled coupling, which destabilizes learning.
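One way to build intuition for why more rollouts per prompt can help on hard prompts is a back-of-the-envelope calculation. This is an illustration only, not the paper's analysis: it assumes each rollout solves a prompt independently with probability `p`, so a group of `N` rollouts contains at least one correct sample with probability `1 - (1 - p)^N`. The function name and the value `p = 0.01` are hypothetical choices for the example.

```python
def prob_at_least_one_success(p: float, n_rollouts: int) -> float:
    """Chance that n independent rollouts include at least one success.

    Assumes every rollout succeeds independently with probability p,
    which is a simplification made for this illustration.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n_rollouts < 0:
        raise ValueError("n_rollouts must be non-negative")
    return 1.0 - (1.0 - p) ** n_rollouts


if __name__ == "__main__":
    p = 0.01  # hypothetical per-rollout success rate on a hard prompt
    for n in (16, 64, 512):
        print(f"N={n:>3}: P(at least one success) = "
              f"{prob_at_least_one_success(p, n):.3f}")
```

Under these assumptions, a prompt that is almost never solved with 16 rollouts is almost always solved at least once with 512, so it can contribute a learning signal instead of being skipped.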
Configure BroRL rollout and learning rate strategy:

    # Initialize BroRL trainer
    from bro_rl import BroadRLTrainer, RolloutScaler

    trainer = BroadRLTrainer(
        model=your_reasoning_llm,
        base_rollouts=16,      # standard GRPO rollout count
        target_rollouts=512,   # broad exploration scaling
        algorithm="GRPO"       # or other on-policy RL
    )

    # Configure learning rate scaling
    rollout_scaler = RolloutScaler(
        base_learning_rate=1e-5,
        rollout_schedule=[16, 32, 64, 128, 256, 512]
    )

    # Compute adjusted learning rates following scaling formula
    learning_rates = rollout_scaler.compute_schedule()
    # Example: [1e-5, 1.1e-5, 1.25e-5, 1.45e-5, 1.7e-5, 2e-5]
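The rollout schedule also determines the sampling cost. The sketch below is rough arithmetic, assuming the 3,000 training steps are split evenly over the six rollout counts (500 each) and that every step samples each prompt once per rollout. The helper name is hypothetical, and the numbers are a cost estimate rather than results from the paper.

```python
from typing import List, Sequence


def rollouts_per_prompt_by_phase(schedule: Sequence[int],
                                 steps_per_phase: int) -> List[int]:
    """Rollouts generated per prompt slot in each phase of the schedule."""
    return [n * steps_per_phase for n in schedule]


schedule = [16, 32, 64, 128, 256, 512]
steps_per_phase = 500  # assumption: 3000 steps split evenly over 6 phases

per_phase = rollouts_per_prompt_by_phase(schedule, steps_per_phase)
total_broad = sum(per_phase)

# Baseline for comparison: a fixed 16 rollouts for all 3000 steps.
total_baseline = 16 * steps_per_phase * len(schedule)

if __name__ == "__main__":
    for n, count in zip(schedule, per_phase):
        print(f"{n:>3} rollouts/prompt -> {count} rollouts in this phase")
    print(f"schedule total: {total_broad}, fixed-16 total: {total_baseline}")
    print(f"sampling cost ratio: {total_broad / total_baseline:.1f}x")
```

Most of the sampling budget goes to the final phases, so generation throughput at the largest rollout counts dominates the cost of this schedule.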
Execute broad rollout training:

    import torch

    # Assumes model, reference_model, training_dataloader, verifier,
    # compute_advantages, and compute_kl_divergence are defined elsewhere.

    # Training loop with increasing rollout counts
    training_steps = 3000
    epochs_per_rollout_config = 500  # 3000 steps / 6 configs = 500 steps each

    for config_idx, (rollout_count, learning_rate) in enumerate(
        zip(rollout_scaler.rollout_schedule, learning_rates)
    ):
        print(f"Phase {config_idx}: {rollout_count} rollouts, LR={learning_rate:.2e}")

        # A fresh optimizer per phase applies the new learning rate
        # (this also resets the AdamW moment estimates)
        optimizer = torch.optim.AdamW(
            model.parameters(),
            lr=learning_rate,
            betas=(0.9, 0.95),
            weight_decay=0.01
        )

        # Train for fixed steps with current rollout configuration
        for step in range(epochs_per_rollout_config):
            for batch in training_dataloader:
                prompts = batch["prompt"]

                # Generate many rollouts per prompt
                rollouts = []
                for _ in range(rollout_count):
                    rollout = model.generate(
                        prompts,
                        max_length=512,
                        temperature=1.0,
                        top_p=0.95
                    )
                    rollouts.append(rollout)

                # Verify solutions
                rewards = verifier.evaluate_batch(rollouts)

                # Compute advantages with all rollouts
                advantages = compute_advantages(
                    rewards=rewards,
                    baseline_method="group_mean"
                )

                # GRPO loss over all rollouts
                log_probs = model.compute_log_probs(rollouts)
                policy_loss = -((log_probs * advantages).sum(dim=1)).mean()

                # KL regularization
                kl_loss = compute_kl_divergence(
                    model=model,
                    reference_model=reference_model,
                    kl_weight=0.1
                )
                total_loss = policy_loss + kl_loss

                # Backward pass
                optimizer.zero_grad()
                total_loss.backward()

                # Gradient clipping and step
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
                optimizer.step()

            # Update reference model
            if step % 10 == 0:
                reference_model.load_state_dict(model.state_dict())

            # Logging (values come from the last batch of this step;
            # assumes rewards is a tensor)
            if step % 50 == 0:
                success_rate = (rewards > 0.5).float().mean()
                print(
                    f"  Step {step}: Loss={total_loss.item():.4f}, "
                    f"Success={success_rate.item():.1%}"
                )
When to use BroRL:
When NOT to use:
Hyperparameters:
Unsampled coupling analysis: The paper's Theorem 1 decomposes advantage estimation into:
BroRL suppresses unsampled coupling through:
Memory requirements:
Throughput:
Training time:
Escape plateau and continue improving:
Model scaling:
The broad rollout strategy empirically demonstrates that exploration breadth matters more than training depth for reasoning tasks, in contrast to many supervised learning domains.
Builds on policy gradient theory, advantage normalization, and empirical scaling studies.