Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

ADu2021/bro-rl-broad-rollout-scaling

Name: bro-rl-broad-rollout-scaling
Author: ADu2021

skills/skillxiv-v0.0.2-claude-opus-4.6/bro-rl-broad-rollout-scaling/SKILL.md

npx skillsauth add ADu2021/skillXiv bro-rl-broad-rollout-scaling

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

BroRL: Scaling via Broad Exploration Rather Than Longer Training

BroRL addresses a fundamental plateau problem in reasoning model RL: training stops improving after ~3,000 steps. Rather than longer training, the solution is broader exploration. By increasing rollouts per prompt from 16 to 512, models escape the plateau and continue improving, grounded in theoretical analysis of unsampled coupling destabilization.

Core Architecture

Broad exploration principle: Increase N (rollouts per prompt) instead of training steps
Theoretical foundation: Unsampled coupling term analysis explains plateau mechanism
Scaling formula: Learning rate adjustments based on N following principled schedule
Memory-bound to compute-bound: Shift from GPU memory bottleneck to compute bottleneck

Implementation Steps

Configure BroRL rollout and learning rate strategy:

# Initialize BroRL trainer
from bro_rl import BroadRLTrainer, RolloutScaler

trainer = BroadRLTrainer(
    model=your_reasoning_llm,
    base_rollouts=16,  # standard GRPO rollout count
    target_rollouts=512,  # broad exploration scaling
    algorithm="GRPO"  # or other on-policy RL
)

# Configure learning rate scaling
rollout_scaler = RolloutScaler(
    base_learning_rate=1e-5,
    rollout_schedule=[16, 32, 64, 128, 256, 512]
)

# Compute adjusted learning rates following scaling formula
learning_rates = rollout_scaler.compute_schedule()
# Example: [1e-5, 1.1e-5, 1.25e-5, 1.45e-5, 1.7e-5, 2e-5]

Execute broad rollout training:

# Training loop with increasing rollout counts
training_steps = 3000
epochs_per_rollout_config = 500  # 3000 steps / 6 configs = 500 steps each

for config_idx, (rollout_count, learning_rate) in enumerate(
    zip(rollout_scaler.rollout_schedule, learning_rates)
):
    print(f"Phase {config_idx}: {rollout_count} rollouts, LR={learning_rate:.2e}")

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=learning_rate,
        betas=(0.9, 0.95),
        weight_decay=0.01
    )

    # Train for fixed steps with current rollout configuration
    for step in range(epochs_per_rollout_config):
        for batch in training_dataloader:
            prompts = batch["prompt"]

            # Generate many rollouts per prompt
            rollouts = []
            for _ in range(rollout_count):
                rollout = model.generate(
                    prompts,
                    max_length=512,
                    temperature=1.0,
                    top_p=0.95
                )
                rollouts.append(rollout)

            # Verify solutions
            rewards = verifier.evaluate_batch(rollouts)

            # Compute advantages with all rollouts
            advantages = compute_advantages(
                rewards=rewards,
                baseline_method="group_mean"
            )

            # GRPO loss over all rollouts
            log_probs = model.compute_log_probs(rollouts)
            policy_loss = -((log_probs * advantages).sum(dim=1)).mean()

            # KL regularization
            kl_loss = compute_kl_divergence(
                model=model,
                reference_model=reference_model,
                kl_weight=0.1
            )

            total_loss = policy_loss + kl_loss

            # Backward pass
            optimizer.zero_grad()
            total_loss.backward()

            # Gradient clipping and step
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()

            # Update reference model
            if step % 10 == 0:
                reference_model.load_state_dict(model.state_dict())

            # Logging
            if step % 50 == 0:
                success_rate = (rewards > 0.5).float().mean()
                print(
                    f"  Step {step}: Loss={total_loss:.4f}, "
                    f"Success={success_rate:.1%}"
                )

Practical Guidance

When to use BroRL:

Reasoning model training plateaus after 3K+ steps with standard GRPO
Compute resources available (N=512 requires more GPU hours)
Mathematical reasoning, code generation, logical tasks
Scenarios where continuing improvement matters more than initial speed

When NOT to use:

Memory-constrained settings (N=512 requires large batch capacity)
Tasks where initial performance sufficient (plateau not a problem)
Fast iteration needed (broad training slower per-step than standard GRPO)
Models already exceeding baseline at plateau point

Hyperparameters:

Max rollouts (512): Theoretical optimum for most 1.5B models. Test 256-512 range
Rollout schedule: Gradual increase (16→32→64→128→256→512) prevents instability
Learning rate scale: Increase ~15% per 2x rollout increase
Training steps per phase (500): Decrease to 200 for rapid prototyping; 1000 for refinement
Temperature (1.0): Keep high for diversity; affects rollout quality distribution

Theoretical Insights

Unsampled coupling analysis: The paper's Theorem 1 decomposes advantage estimation into:

Sampled coupling: Controlled by advantage normalization
Unsampled coupling: Destabilizes when group size N too small

BroRL suppresses unsampled coupling through:

Large N increases empirical advantage concentration
Reduces variance of group-relative estimation
Enables stable training despite non-stationary policy

Computational Characteristics

Memory requirements:

N=16: 1 A100 GPU typical
N=512: 8 A100 GPUs or equivalent (32x memory scaling)

Throughput:

Per-step: Memory-bound → Compute-bound
~2x GPU utilization improvement despite 32x rollout increase

Training time:

Same total steps, but slower wall-clock (more rollouts per step)
Benefits amortize over continued improvement beyond plateau

Performance Results

Escape plateau and continue improving:

Standard GRPO: Plateaus at ~3K steps
BroRL: Improves linearly through 8K+ steps
Final performance: +5-8% above standard GRPO plateau

Model scaling:

1.5B models: Most dramatic improvement (plateau at lower performance)
7B models: Moderate improvement (already near convergence)
34B models: Minimal improvement (rarely plateau with standard training)

Architecture Notes

The broad rollout strategy empirically demonstrates that exploration breadth matters more than training depth for reasoning tasks—a contrast to many supervised learning domains.

References

Builds on policy gradient theory, advantage normalization, and empirical scaling studies.

ADu2021/bro-rl-broad-rollout-scaling

skills/skillxiv-v0.0.2-claude-opus-4.6/bro-rl-broad-rollout-scaling/SKILL.md

Overcome reasoning model training plateaus by increasing rollouts per prompt (N=512) rather than training steps, addressing unsampled coupling that destabilizes learning. Theoretical analysis shows broad exploration eliminates plateau bottleneck.

2 stars

data-ai

Updated Apr 16, 2026

$ install --global

skillsauth

npx skillsauth add ADu2021/skillXiv bro-rl-broad-rollout-scaling

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 24, 2026, 8:55 PM1.8s1 file scanned

SKILL.md

name:: bro-rl-broad-rollout-scaling
title:: BroRL: Scaling via Broad Exploration Rather Than Longer Training
version:: 0.0.2
engine:: skillxiv-v0.0.2-claude-opus-4.6
license:: MIT
url:: https://arxiv.org/abs/2510.01180
keywords:: [RLVR, scaling, exploration, reasoning, efficiency]
description:: Overcome reasoning model training plateaus by increasing rollouts per prompt (N=512) rather than training steps, addressing unsampled coupling that destabilizes learning. Theoretical analysis shows broad exploration eliminates plateau bottleneck.

BroRL: Scaling via Broad Exploration Rather Than Longer Training

Core Architecture

Broad exploration principle: Increase N (rollouts per prompt) instead of training steps
Theoretical foundation: Unsampled coupling term analysis explains plateau mechanism
Scaling formula: Learning rate adjustments based on N following principled schedule
Memory-bound to compute-bound: Shift from GPU memory bottleneck to compute bottleneck

Implementation Steps

Configure BroRL rollout and learning rate strategy:

# Initialize BroRL trainer
from bro_rl import BroadRLTrainer, RolloutScaler

trainer = BroadRLTrainer(
    model=your_reasoning_llm,
    base_rollouts=16,  # standard GRPO rollout count
    target_rollouts=512,  # broad exploration scaling
    algorithm="GRPO"  # or other on-policy RL
)

# Configure learning rate scaling
rollout_scaler = RolloutScaler(
    base_learning_rate=1e-5,
    rollout_schedule=[16, 32, 64, 128, 256, 512]
)

# Compute adjusted learning rates following scaling formula
learning_rates = rollout_scaler.compute_schedule()
# Example: [1e-5, 1.1e-5, 1.25e-5, 1.45e-5, 1.7e-5, 2e-5]

Execute broad rollout training:

# Training loop with increasing rollout counts
training_steps = 3000
epochs_per_rollout_config = 500  # 3000 steps / 6 configs = 500 steps each

for config_idx, (rollout_count, learning_rate) in enumerate(
    zip(rollout_scaler.rollout_schedule, learning_rates)
):
    print(f"Phase {config_idx}: {rollout_count} rollouts, LR={learning_rate:.2e}")

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=learning_rate,
        betas=(0.9, 0.95),
        weight_decay=0.01
    )

    # Train for fixed steps with current rollout configuration
    for step in range(epochs_per_rollout_config):
        for batch in training_dataloader:
            prompts = batch["prompt"]

            # Generate many rollouts per prompt
            rollouts = []
            for _ in range(rollout_count):
                rollout = model.generate(
                    prompts,
                    max_length=512,
                    temperature=1.0,
                    top_p=0.95
                )
                rollouts.append(rollout)

            # Verify solutions
            rewards = verifier.evaluate_batch(rollouts)

            # Compute advantages with all rollouts
            advantages = compute_advantages(
                rewards=rewards,
                baseline_method="group_mean"
            )

            # GRPO loss over all rollouts
            log_probs = model.compute_log_probs(rollouts)
            policy_loss = -((log_probs * advantages).sum(dim=1)).mean()

            # KL regularization
            kl_loss = compute_kl_divergence(
                model=model,
                reference_model=reference_model,
                kl_weight=0.1
            )

            total_loss = policy_loss + kl_loss

            # Backward pass
            optimizer.zero_grad()
            total_loss.backward()

            # Gradient clipping and step
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()

            # Update reference model
            if step % 10 == 0:
                reference_model.load_state_dict(model.state_dict())

            # Logging
            if step % 50 == 0:
                success_rate = (rewards > 0.5).float().mean()
                print(
                    f"  Step {step}: Loss={total_loss:.4f}, "
                    f"Success={success_rate:.1%}"
                )

Practical Guidance

When to use BroRL:

Reasoning model training plateaus after 3K+ steps with standard GRPO
Compute resources available (N=512 requires more GPU hours)
Mathematical reasoning, code generation, logical tasks
Scenarios where continuing improvement matters more than initial speed

When NOT to use:

Memory-constrained settings (N=512 requires large batch capacity)
Tasks where initial performance sufficient (plateau not a problem)
Fast iteration needed (broad training slower per-step than standard GRPO)
Models already exceeding baseline at plateau point

Hyperparameters:

Max rollouts (512): Theoretical optimum for most 1.5B models. Test 256-512 range
Rollout schedule: Gradual increase (16→32→64→128→256→512) prevents instability
Learning rate scale: Increase ~15% per 2x rollout increase
Training steps per phase (500): Decrease to 200 for rapid prototyping; 1000 for refinement
Temperature (1.0): Keep high for diversity; affects rollout quality distribution

Theoretical Insights

Unsampled coupling analysis: The paper's Theorem 1 decomposes advantage estimation into:

Sampled coupling: Controlled by advantage normalization
Unsampled coupling: Destabilizes when group size N too small

BroRL suppresses unsampled coupling through:

Large N increases empirical advantage concentration
Reduces variance of group-relative estimation
Enables stable training despite non-stationary policy

Computational Characteristics

Memory requirements:

N=16: 1 A100 GPU typical
N=512: 8 A100 GPUs or equivalent (32x memory scaling)

Throughput:

Per-step: Memory-bound → Compute-bound
~2x GPU utilization improvement despite 32x rollout increase

Training time:

Same total steps, but slower wall-clock (more rollouts per step)
Benefits amortize over continued improvement beyond plateau

Performance Results

Escape plateau and continue improving:

Standard GRPO: Plateaus at ~3K steps
BroRL: Improves linearly through 8K+ steps
Final performance: +5-8% above standard GRPO plateau

Model scaling:

1.5B models: Most dramatic improvement (plateau at lower performance)
7B models: Moderate improvement (already near convergence)
34B models: Minimal improvement (rarely plateau with standard training)

Architecture Notes

The broad rollout strategy empirically demonstrates that exploration breadth matters more than training depth for reasoning tasks—a contrast to many supervised learning domains.

References

Builds on policy gradient theory, advantage normalization, and empirical scaling studies.

Related Skills

ADu2021/flow-map-trajectory-tilting

testing

VerifiedTrustedCommunity

Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flow-map-trajectory-tilting

ADu2021/flexible-data-mixture-of-experts

testing

VerifiedTrustedCommunity

Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flexible-data-mixture-of-experts

ADu2021/flexibility-trap-diffusion-reasoning

data-ai

VerifiedTrustedCommunity

Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flexibility-trap-diffusion-reasoning

ADu2021/flex-continuous-agent-evolution

devops

VerifiedTrustedCommunity

Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flex-continuous-agent-evolution

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/ADu2021/skillXiv.git

# Copy into Claude Code skills folder (global)
cp -r skillXiv/skills/skillxiv-v0.0.2-claude-opus-4.6/bro-rl-broad-rollout-scaling ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

ADu2021/skillXiv

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT