skills/skillxiv-v0.0.2-claude-opus-4.6/blockwise-advantage-estimation/SKILL.md
Improve credit assignment in multi-objective RL by decomposing advantages into segment-specific values. Use Outcome-Conditioned Baselines to reduce cross-objective interference without expensive rollouts, enabling better training signals for multi-step completions with different reward functions per segment.
npx skillsauth add ADu2021/skillXiv blockwise-advantage-estimationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Language model training often requires optimizing multiple objectives sequentially within a single completion: a math solution requires correct intermediate reasoning before a correct final answer. Standard RL treats each completion as a single unit with one advantage signal, causing objective interference where optimizing one segment undermines progress on another. The temporal structure of problems demands that later segments depend on earlier ones—a segmentation-aware credit assignment mechanism is needed.
Blockwise Advantage Estimation (BAE) decomposes a completion into K blocks (segments), where each block k receives an advantage A^k(i) computed from only its own objective signal r_k(i). Rather than computing advantages over the entire completion, BAE applies block-specific advantages to block-specific tokens, reducing noise and interference between different task requirements.
The key innovation is the Outcome-Conditioned Baseline (OCB), which stratifies samples by intermediate outcomes (e.g., correctness of the prefix) and computes baselines within outcome groups, avoiding expensive conditional inference.
Identify segments and the outcomes that determine baseline stratification.
from dataclasses import dataclass
from typing import List, Dict, Tuple
import torch
@dataclass
class BlockStructure:
"""Define multi-block segmentation for completions."""
block_names: List[str] # e.g., ["reasoning", "answer", "verification"]
block_start_indices: List[int] # token index where each block starts
block_end_indices: List[int] # token index where each block ends
outcome_extractor: callable # function to extract intermediate outcome
class MultiObjectiveSegmentation:
"""Setup blocks and outcome extraction."""
def __init__(self, block_structure: BlockStructure):
self.block_structure = block_structure
self.num_blocks = len(block_structure.block_names)
def extract_outcomes(self, completions: List[str]) -> Dict[str, bool]:
"""
Extract intermediate outcomes that determine baseline groups.
Example: reasoning_correct=True, answer_correct=False
"""
outcomes = []
for completion in completions:
outcome = self.block_structure.outcome_extractor(completion)
outcomes.append(outcome)
return outcomes
def create_block_masks(self, sequence_length: int) -> List[torch.Tensor]:
"""Create binary masks indicating which tokens belong to each block."""
masks = []
for i in range(self.num_blocks):
mask = torch.zeros(sequence_length, dtype=torch.bool)
start_idx = self.block_structure.block_start_indices[i]
end_idx = self.block_structure.block_end_indices[i]
mask[start_idx:end_idx] = True
masks.append(mask)
return masks
Stratify samples and compute block-wise baselines from same-outcome groups.
class OutcomeConditionedBaseline:
"""
Compute conditional baselines by stratifying samples on intermediate outcomes.
Avoids expensive rollouts while reducing variance.
"""
def __init__(self, num_blocks: int, group_by_outcome: bool = True):
self.num_blocks = num_blocks
self.group_by_outcome = group_by_outcome
def compute_baselines(
self,
block_rewards: List[torch.Tensor], # [num_samples, num_blocks]
outcomes: List[Tuple[bool, ...]], # intermediate outcomes per sample
group_size_min: int = 2
) -> List[torch.Tensor]:
"""
Compute block-wise baselines by grouping samples with same outcome.
Args:
block_rewards: Rewards for each block of each sample
outcomes: Tuple of intermediate outcomes (e.g., (reasoning_correct, answer_correct))
group_size_min: Minimum group size for baseline estimation
Returns:
List[Tensor]: Baseline values for each block [num_blocks, num_samples]
"""
num_samples = len(outcomes)
baselines = [torch.zeros(num_samples) for _ in range(self.num_blocks)]
if not self.group_by_outcome:
# Simple mean baseline across all samples
for block_idx in range(self.num_blocks):
block_rewards_tensor = torch.stack(
[r[block_idx] if isinstance(r, torch.Tensor) else torch.tensor(r)
for r in block_rewards]
)
baselines[block_idx] = block_rewards_tensor.mean()
return baselines
# Stratify by outcome
outcome_groups: Dict[Tuple[bool, ...], List[int]] = {}
for sample_idx, outcome in enumerate(outcomes):
if outcome not in outcome_groups:
outcome_groups[outcome] = []
outcome_groups[outcome].append(sample_idx)
# Compute baselines within outcome groups
for outcome, group_indices in outcome_groups.items():
if len(group_indices) < group_size_min:
continue # Skip groups too small for reliable estimation
for block_idx in range(self.num_blocks):
group_block_rewards = torch.tensor([
float(block_rewards[idx][block_idx])
for idx in group_indices
])
baseline_value = group_block_rewards.mean()
# Assign baseline to all samples in this outcome group
for sample_idx in group_indices:
baselines[block_idx][sample_idx] = baseline_value
return baselines
Apply block-specific advantages to block-specific tokens.
def compute_blockwise_advantages(
block_rewards: List[torch.Tensor], # [num_blocks, num_samples]
baselines: List[torch.Tensor], # [num_blocks, num_samples]
block_masks: List[torch.Tensor], # [num_blocks, seq_length]
num_samples: int,
seq_length: int
) -> torch.Tensor:
"""
Compute advantages per block and tile across tokens in each block.
Args:
block_rewards: Scalar reward per block per sample
baselines: Baseline per block per sample
block_masks: Binary masks indicating token membership
num_samples: Number of samples in batch
seq_length: Sequence length (tokens)
Returns:
Tensor of shape [num_samples, seq_length] with block-wise advantages
"""
num_blocks = len(block_rewards)
# Compute advantages per block
block_advantages = []
for block_idx in range(num_blocks):
advantage = block_rewards[block_idx] - baselines[block_idx]
block_advantages.append(advantage)
# Expand advantages to token level
advantages_token_level = torch.zeros(num_samples, seq_length)
for sample_idx in range(num_samples):
for block_idx in range(num_blocks):
# Get mask for this block
block_mask = block_masks[block_idx]
# Get advantage for this sample-block pair
adv_value = block_advantages[block_idx][sample_idx]
# Assign advantage to all tokens in this block
advantages_token_level[sample_idx, block_mask] = adv_value
return advantages_token_level
Apply block-wise advantages in GRPO training.
class BlockwiseGRPO:
"""
Group Relative Policy Optimization with blockwise advantage estimation.
"""
def __init__(
self,
model,
optimizer,
num_blocks: int,
group_size: int = 8,
block_segmentation: BlockStructure = None
):
self.model = model
self.optimizer = optimizer
self.num_blocks = num_blocks
self.group_size = group_size
self.block_segmentation = block_segmentation
self.ocb = OutcomeConditionedBaseline(num_blocks, group_by_outcome=True)
def compute_loss(
self,
log_probs: torch.Tensor, # [num_samples, seq_length]
block_rewards: List[torch.Tensor], # [num_blocks] each [num_samples]
outcomes: List[Tuple[bool, ...]], # intermediate outcomes
block_masks: List[torch.Tensor], # [num_blocks, seq_length]
log_probs_ref: torch.Tensor = None
) -> Tuple[torch.Tensor, Dict]:
"""
Compute GRPO loss with blockwise advantage estimation.
"""
num_samples = log_probs.shape[0]
seq_length = log_probs.shape[1]
if log_probs_ref is None:
log_probs_ref = log_probs.detach()
# Step 1: Compute outcome-conditioned baselines
baselines = self.ocb.compute_baselines(block_rewards, outcomes)
# Step 2: Compute token-level advantages
advantages = compute_blockwise_advantages(
block_rewards, baselines, block_masks,
num_samples, seq_length
)
# Step 3: Standard GRPO loss with blockwise advantages
log_prob_ratio = log_probs - log_probs_ref
ratio = torch.exp(log_prob_ratio)
# Group relative computation
num_groups = num_samples // self.group_size
losses = []
for group_idx in range(num_groups):
group_start = group_idx * self.group_size
group_end = (group_idx + 1) * self.group_size
group_log_probs = log_probs[group_start:group_end]
group_advantages = advantages[group_start:group_end]
group_log_prob_ratio = log_prob_ratio[group_start:group_end]
group_ratio = ratio[group_start:group_end]
# Group relative baseline
group_mean_reward = group_advantages.mean(dim=0, keepdim=True)
relative_advantages = group_advantages - group_mean_reward
# Clipped ratio loss
clipped_ratio = torch.clamp(group_ratio, 0.5, 2.0)
loss = -torch.min(
group_log_prob_ratio * relative_advantages,
torch.log(clipped_ratio) * relative_advantages
).sum()
losses.append(loss)
total_loss = torch.stack(losses).mean()
return total_loss, {
'baseline_values': [b.mean().item() for b in baselines],
'advantage_mean': advantages.mean().item(),
'advantage_std': advantages.std().item()
}
Full training pipeline using blockwise advantage estimation.
def train_with_blockwise_advantage(
model, train_loader, verifier, optimizer,
block_segmentation: BlockStructure,
num_epochs: int = 3,
group_size: int = 8,
device: str = 'cuda'
):
"""
Train LLM with blockwise advantage estimation for multi-objective completion.
Args:
model: Language model to train
train_loader: Iterable of prompts
verifier: Function computing block rewards (returns list of block rewards)
optimizer: PyTorch optimizer
block_segmentation: BlockStructure defining blocks and outcome extraction
num_epochs: Number training epochs
group_size: GRPO group size
device: Training device
"""
blockwise_grpo = BlockwiseGRPO(
model, optimizer, block_segmentation.block_names.__len__(),
group_size=group_size,
block_segmentation=block_segmentation
)
segmentation = MultiObjectiveSegmentation(block_segmentation)
block_masks = segmentation.create_block_masks(seq_length=512)
for epoch in range(num_epochs):
total_loss = 0.0
num_batches = 0
for batch_idx, batch in enumerate(train_loader):
prompts = batch['prompts']
batch_size = len(prompts)
# Generate completions
completions = []
log_probs_list = []
for prompt in prompts:
completion, log_prob = model.generate_with_logprobs(
prompt, max_tokens=512
)
completions.append(completion)
log_probs_list.append(log_prob)
log_probs = torch.stack(log_probs_list).to(device)
# Extract intermediate outcomes
outcomes = segmentation.extract_outcomes(completions)
# Compute block-wise rewards
block_rewards = [[] for _ in range(segmentation.num_blocks)]
for completion in completions:
rewards = verifier(completion) # Returns list of per-block rewards
for block_idx, reward in enumerate(rewards):
block_rewards[block_idx].append(torch.tensor(reward, dtype=torch.float32))
block_rewards = [torch.stack(br).to(device) for br in block_rewards]
# Compute loss with blockwise advantages
loss, metrics = blockwise_grpo.compute_loss(
log_probs, block_rewards, outcomes, block_masks
)
# Backward pass
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
total_loss += loss.item()
num_batches += 1
if (batch_idx + 1) % 10 == 0:
print(f" Batch {batch_idx + 1}: loss={loss.item():.4f}, "
f"baseline_vals={metrics['baseline_values']}")
avg_loss = total_loss / num_batches
print(f"Epoch {epoch + 1}: Loss={avg_loss:.4f}")
return model
When to use: Multi-step tasks with sequential objectives (math solutions, code generation with correctness at different stages, verification-augmented reasoning)
Hyperparameters:
Key advantages:
Common pitfalls:
Scaling: Negligible overhead. Outcome extraction should be fast (regex, simple parsing).
Paper: https://arxiv.org/abs/2602.10231 Related work: GRPO, multi-task RL, credit assignment, verifiable rewards Benchmarks: Math problem solving, code generation with intermediate verification
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.