skills/skillxiv-v0.0.2-claude-opus-4.6/exgrpo-experience-replay-reasoning/SKILL.md
Improve RLVR training efficiency by selectively replaying trajectories based on correctness and entropy. Medium-difficulty questions and low-entropy solutions are most valuable; selective replay yields +3.5-7.6% improvements.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add ADu2021/skillXiv exgrpo-experience-replay-reasoning
Standard on-policy RLVR (reinforcement learning with verifiable rewards) discards each batch of experience after a single update, losing valuable learning signal. ExGRPO observes that trajectories vary widely in training value, and that a question's correctness rate and a trajectory's entropy are effective indicators of that value. Medium-difficulty questions with low-entropy solutions provide the best learning signal.
Set up selective experience replay for GRPO:
# Initialize ExGRPO trajectory management
from exgrpo import TrajectoryBuffer, DifficultyBucketer, SelectiveReplay

trajectory_buffer = TrajectoryBuffer(
    max_size=100_000,
    difficulty_tiers=5,
    entropy_percentiles=[25, 50, 75],
)

difficulty_bucketer = DifficultyBucketer(
    metric="correctness_rate",
    window_size=100,  # recent trajectories only
)

replay_manager = SelectiveReplay(
    strategy="entropy_guided",
    replay_probability=0.3,  # 30% of updates use replay
    medium_difficulty_range=(0.3, 0.7),  # 30-70% success rate
)
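The selection rule configured above (medium-difficulty questions, low-entropy solutions) can be illustrated without the `exgrpo` package. The sketch below is a simplified stand-in, not the package's implementation. The function name `select_replay_trajectories`, the dictionary keys, and the choice to keep only the single lowest-entropy correct rollout per question are assumptions made for illustration.

```python
import random
from collections import defaultdict


def select_replay_trajectories(trajectories, low=0.3, high=0.7, k=2, seed=0):
    """Return up to k replay candidates (hypothetical helper).

    A question qualifies when its correctness rate lies in [low, high];
    for each qualifying question, the lowest-entropy correct rollout is kept.
    """
    by_prompt = defaultdict(list)
    for t in trajectories:
        by_prompt[t["prompt"]].append(t)

    candidates = []
    for group in by_prompt.values():
        # Correctness rate = fraction of correct rollouts for this question
        rate = sum(t["is_correct"] for t in group) / len(group)
        correct = [t for t in group if t["is_correct"]]
        if low <= rate <= high and correct:
            candidates.append(min(correct, key=lambda t: t["entropy"]))

    rng = random.Random(seed)
    return rng.sample(candidates, min(k, len(candidates)))


demo = [
    {"prompt": "a", "is_correct": True, "entropy": 0.9},
    {"prompt": "a", "is_correct": True, "entropy": 0.2},
    {"prompt": "a", "is_correct": False, "entropy": 0.1},
    {"prompt": "a", "is_correct": False, "entropy": 0.5},
    {"prompt": "b", "is_correct": True, "entropy": 0.3},   # too easy (rate 1.0)
    {"prompt": "c", "is_correct": False, "entropy": 0.4},  # too hard (rate 0.0)
]
picked = select_replay_trajectories(demo)
```

Here only question "a" has a correctness rate inside the medium range, so the single selected trajectory is its correct rollout with entropy 0.2.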
Execute GRPO training with experience replay:
# Training loop with selective trajectory replay
import numpy as np


def store_trajectories(prompts, rollouts, rewards):
    # Store trajectories for potential replay.
    # Assumes prompts, rollouts, and rewards are aligned one-to-one.
    for prompt, rollout, reward in zip(prompts, rollouts, rewards):
        trajectory_buffer.add({
            "prompt": prompt,
            "rollout": rollout,
            "reward": reward,
            "is_correct": reward > 0.5,
            "entropy": compute_entropy(rollout),
        })


trajectory_buffer.clear()
for epoch in range(num_epochs):
    for batch in on_policy_dataloader:
        # Epochs before replay_start_epoch are purely on-policy. Afterwards,
        # a fraction of updates (replay_probability) uses replayed trajectories,
        # provided the buffer already holds more than 1000 entries.
        use_replay = (
            epoch >= replay_start_epoch
            and len(trajectory_buffer) > 1000
            and np.random.rand() < replay_manager.replay_probability
        )

        if use_replay:
            # Sample replay batch with difficulty/entropy bias
            replay_batch = trajectory_buffer.sample(
                strategy="medium_difficulty_low_entropy",
                size=len(batch["prompt"]),
                difficulty_range=(0.3, 0.7),
                entropy_percentile=25,  # bottom 25% entropy
            )
            rollouts = replay_batch["rollouts"]
            rewards = replay_batch["rewards"]

            # Compute importance weights (for off-policy correction)
            importance_weights = compute_importance_weights(
                old_policy=trajectory_buffer.policy_snapshot,
                current_policy=model,
                rollouts=rollouts,
            )

            # GRPO loss with importance correction
            loss = compute_grpo_loss(
                rollouts=rollouts,
                rewards=rewards,
                importance_weights=importance_weights,
            )
        else:
            # Standard on-policy GRPO: generate rollouts, verify, compute loss
            prompts = batch["prompt"]
            rollouts = model.rollout(prompts=prompts, num_rollouts=4, temperature=1.0)
            rewards = verifier.evaluate(rollouts)
            loss = compute_grpo_loss(rollouts, rewards)

        # Assumes a PyTorch-style optimizer: clear stale gradients before backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if not use_replay:
            # Store only fresh on-policy trajectories; re-adding replayed
            # samples would duplicate them in the buffer.
            store_trajectories(prompts, rollouts, rewards)
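The loop above calls `compute_importance_weights` without showing it. One common form of off-policy correction is the PPO-style clipped probability ratio, sketched below on plain lists of per-token log-probabilities. This is an illustration of that standard technique, not necessarily the correction ExGRPO applies. The function name `clipped_surrogate` and the `clip=0.2` default are assumptions.

```python
import math


def clipped_surrogate(old_logprobs, new_logprobs, advantage, clip=0.2):
    """Mean PPO-style clipped objective over the tokens of one trajectory.

    old_logprobs: per-token log-probs under the policy that generated the rollout
    new_logprobs: per-token log-probs under the current policy
    """
    terms = []
    for lp_old, lp_new in zip(old_logprobs, new_logprobs):
        ratio = math.exp(lp_new - lp_old)  # importance weight pi_new / pi_old
        clipped = max(1.0 - clip, min(1.0 + clip, ratio))
        # Pessimistic choice between the raw and the clipped term
        terms.append(min(ratio * advantage, clipped * advantage))
    return sum(terms) / len(terms)


old = [math.log(0.5), math.log(0.5)]
new = [math.log(0.5), math.log(0.75)]  # second token is now 1.5x more likely
objective = clipped_surrogate(old, new, advantage=1.0)
```

With a positive advantage, the second token's ratio of 1.5 is capped at 1.2, which limits how far a stale trajectory can pull the current policy in one update.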
When to use ExGRPO:
When NOT to use:
Hyperparameters:
Per-model improvements:
Key finding: Replay stabilizes training on weaker models; stronger models are already stable.
High-value trajectories:
Low-value trajectories:
Difficulty tracking:
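As a rough illustration of difficulty tracking, the sketch below keeps a rolling window of outcomes per question and maps the resulting correctness rate to a bucket. The class name `RollingDifficulty`, the bucket labels, and the method names are hypothetical. The window size of 100 and the 0.3 to 0.7 medium range mirror the configuration shown earlier.

```python
from collections import defaultdict, deque


class RollingDifficulty:
    """Track per-question correctness over the most recent rollouts (hypothetical helper)."""

    def __init__(self, window_size=100):
        # Each question keeps only its last `window_size` outcomes
        self.history = defaultdict(lambda: deque(maxlen=window_size))

    def update(self, prompt_id, is_correct):
        self.history[prompt_id].append(bool(is_correct))

    def rate(self, prompt_id):
        outcomes = self.history.get(prompt_id)
        if not outcomes:
            return None  # no rollouts recorded yet
        return sum(outcomes) / len(outcomes)

    def bucket(self, prompt_id, low=0.3, high=0.7):
        r = self.rate(prompt_id)
        if r is None:
            return "unseen"
        if r < low:
            return "hard"
        if r > high:
            return "easy"
        return "medium"


tracker = RollingDifficulty(window_size=4)
for outcome in (True, False, True, False):
    tracker.update("q1", outcome)
first_bucket = tracker.bucket("q1")

# Four more correct rollouts push the earlier outcomes out of the window
for outcome in (True, True, True, True):
    tracker.update("q1", outcome)
second_bucket = tracker.bucket("q1")
```

Because the window is bounded, a question's bucket follows the current policy's ability and does not depend on its full history.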
Entropy computation:
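The `compute_entropy` helper used in the training loop is not defined in this skill. One plausible reading is the mean token-level Shannon entropy of the policy's output distributions along a rollout, sketched below. The function names and the input format (one probability list per generated token) are assumptions.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def trajectory_entropy(token_distributions):
    """Average token entropy over a rollout (hypothetical stand-in for compute_entropy)."""
    entropies = [token_entropy(d) for d in token_distributions]
    return sum(entropies) / len(entropies)


confident = [[1.0, 0.0], [1.0, 0.0]]   # model is certain at every step
uncertain = [[0.5, 0.5], [0.5, 0.5]]   # model is maximally unsure at every step
h_confident = trajectory_entropy(confident)
h_uncertain = trajectory_entropy(uncertain)
```

Under this definition a low value means the policy was confident at each step, which is the kind of trajectory the skill description treats as most valuable for replay.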
Builds on curriculum learning, experience replay in RL, and trajectory-based learning for language models.