skills/skillxiv-v0.0.2-claude-opus-4.6/encoder-pretraining-strategy/SKILL.md
Choose optimal pretraining strategy for text encoders: pure MLM, pure CLM, or biphasic CLM-then-MLM training, with empirical guidance on performance across downstream tasks.
Install this skill globally with one command: `npx skillsauth add ADu2021/skillXiv encoder-pretraining-strategy`. Works with Claude Code, Cursor, and Windsurf.
The question of how to pretrain text encoders has evolved since BERT's introduction of masked language modeling. As decoder-based models like GPT demonstrate strong transfer learning capabilities, researchers now question whether MLM remains optimal for encoders. This work conducts a large-scale controlled study (38 models, 15,000+ evaluation runs) isolating the effects of learning objectives while controlling for model size and data exposure. The findings reveal that neither pure MLM nor pure CLM dominates universally, but biphasic pretraining—sequential CLM followed by MLM—outperforms both under fixed compute budgets.
The practical implication is significant: practitioners building encoders can start from pretrained decoder checkpoints of existing open-source models, continue pretraining them with MLM, and achieve better results than training encoders from scratch with MLM alone. This reuses existing infrastructure while gaining the benefits of both causal and bidirectional training.
Bidirectional and causal pretraining optimize different aspects of representation learning. Causal language modeling (predicting next tokens sequentially) produces strong representations early in training with good data efficiency and fine-tuning stability. Masked language modeling (bidirectional context) converges to better ultimate performance on tasks requiring full contextual understanding. Sequential training that combines both objectives balances these trade-offs, with optimal splits varying by compute budget.
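To make the distinction concrete, here is a minimal sketch in plain Python of the attention pattern and prediction targets each objective uses on a toy sequence. The helper names (`attention_mask`, `clm_targets`, `mlm_targets`) are illustrative and do not come from the paper.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len matrix; 1 means query row i may attend to key column j."""
    return [[1 if (not causal or j <= i) else 0 for j in range(seq_len)]
            for i in range(seq_len)]


def clm_targets(tokens):
    """Next-token prediction: each position is paired with the token that follows it."""
    return list(zip(tokens[:-1], tokens[1:]))


def mlm_targets(tokens, masked_positions, mask_token="[MASK]"):
    """Masked prediction: corrupt the chosen positions and train only on those positions."""
    corrupted = [mask_token if i in masked_positions else tok
                 for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked_positions}
    return corrupted, targets


tokens = ["the", "cat", "sat", "down"]
causal = attention_mask(len(tokens), causal=True)          # lower-triangular
bidirectional = attention_mask(len(tokens), causal=False)  # all ones
pairs = clm_targets(tokens)
corrupted, targets = mlm_targets(tokens, masked_positions={2})
```

CLM supervises every position but sees only the left context, while MLM sees the full context but supervises only the masked positions. That difference is one plausible reading of why CLM is more data-efficient early on and MLM is stronger on tasks that need full context.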
The choice of pretraining objective should therefore depend on downstream task requirements and compute constraints. For sequence classification and question answering, bidirectional pretraining dominates. For early stopping or parameter-constrained settings, CLM's efficiency matters more. The biphasic strategy covers both regimes: the CLM phase supplies early data efficiency, and the MLM phase supplies the final bidirectional quality.
The experimental setup compares three pretraining configurations: pure CLM, pure MLM, and biphasic training (CLM followed by MLM).
The same model architecture (transformer encoders, 210M to 1B parameters) is used across conditions to isolate the effect of the objective. Evaluation covers four task categories: sequence classification, token classification, question answering, and information retrieval. All models are trained on 100B uniform tokens from the FineWeb-Edu dataset.
Set up the training framework with configurable objectives:
```python
from typing import Literal

import numpy as np  # used by the evaluation code further down
import torch
import torch.nn.functional as F
from torch import nn


class ObjectiveConfig:
    """
    Configuration for different pretraining objectives.

    Supports pure CLM, pure MLM, and biphasic (CLM-then-MLM) training
    with a configurable phase split.
    """

    def __init__(self,
                 objective: Literal["clm", "mlm", "biphasic"],
                 mlm_probability: float = 0.15,
                 clm_weight: float = 1.0,
                 mlm_weight: float = 1.0,
                 biphasic_split: float = 0.25):
        """
        Initialize objective configuration.

        Args:
            objective: "clm" (causal), "mlm" (masked), "biphasic" (CLM then MLM)
            mlm_probability: Fraction of tokens to mask in MLM phase
            clm_weight: Loss weight for CLM component
            mlm_weight: Loss weight for MLM component
            biphasic_split: For biphasic, fraction of training as CLM
                (e.g., 0.25 = 25% CLM, 75% MLM)
        """
        if not 0.0 <= biphasic_split <= 1.0:
            raise ValueError("biphasic_split must be in [0, 1]")
        self.objective = objective
        self.mlm_probability = mlm_probability
        self.clm_weight = clm_weight
        self.mlm_weight = mlm_weight
        self.biphasic_split = biphasic_split if objective == "biphasic" else None


class PretrainingObjective(nn.Module):
    """
    Flexible loss module supporting CLM, MLM, and biphasic training.

    Picks the loss for the configured objective and switches from CLM to
    MLM at the configured point of biphasic pretraining. It only computes
    the loss: during CLM the encoder itself must use a causal attention mask.
    """

    def __init__(self, vocab_size: int, config: ObjectiveConfig,
                 hidden_size: int = 768):
        super().__init__()
        self.config = config
        self.vocab_size = vocab_size
        # hidden_size must match the encoder's output width (768 is only a default)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward_clm(self, logits, labels, loss_mask=None):
        """
        Compute causal language modeling loss.

        Standard next-token prediction loss. If loss_mask is given
        (1 = real token, 0 = padding), padding positions are ignored.
        """
        # Shift so that position t predicts token t+1, then flatten
        logits_flat = logits[..., :-1, :].reshape(-1, self.vocab_size)
        labels_flat = labels[..., 1:].reshape(-1)
        if loss_mask is None:
            return F.cross_entropy(logits_flat, labels_flat)
        # Per-token losses, averaged over non-padding positions only
        mask_flat = loss_mask[..., 1:].reshape(-1).to(logits_flat.dtype)
        per_token = F.cross_entropy(logits_flat, labels_flat, reduction="none")
        return (per_token * mask_flat).sum() / mask_flat.sum().clamp(min=1.0)

    def forward_mlm(self, logits, labels, mlm_mask):
        """
        Compute masked language modeling loss.

        Loss is computed only at positions where mlm_mask is True, so the
        mask must already exclude padding (and special tokens, if desired).
        """
        if mlm_mask is None:
            raise ValueError("MLM loss requires mlm_mask")
        # Boolean indexing yields (num_masked, vocab) and (num_masked,)
        return F.cross_entropy(logits[mlm_mask], labels[mlm_mask])

    def forward(self, logits, labels, mlm_mask=None, current_step=None,
                total_steps=None, loss_mask=None):
        """
        Compute loss based on configured objective.

        Supports pure CLM, pure MLM, and biphasic training with
        automatic phase switching based on training progress.
        """
        objective = self.config.objective
        if objective == "clm":
            return self.forward_clm(logits, labels, loss_mask)
        if objective == "mlm":
            return self.forward_mlm(logits, labels, mlm_mask)
        if objective == "biphasic":
            # The phase depends on training progress, so both counters are required
            if current_step is None or total_steps is None:
                raise ValueError("biphasic training requires current_step and total_steps")
            progress = current_step / total_steps
            if progress < self.config.biphasic_split:
                # Phase 1: CLM training
                return self.forward_clm(logits, labels, loss_mask) * self.config.clm_weight
            # Phase 2: MLM training
            return self.forward_mlm(logits, labels, mlm_mask) * self.config.mlm_weight
        raise ValueError(f"Unknown objective: {objective}")
```
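The phase switch above reduces to one comparison of training progress against the split. As a minimal, framework-free sketch (the function name `phase_for_step` is hypothetical), this makes the boundary behavior easy to check before launching a long run:

```python
def phase_for_step(step, total_steps, clm_fraction=0.25):
    """Return the objective used at a zero-based step: 'clm' first, then 'mlm'."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0.0 <= clm_fraction <= 1.0:
        raise ValueError("clm_fraction must be in [0, 1]")
    # Same rule as the loss module: CLM while progress < split
    return "clm" if step / total_steps < clm_fraction else "mlm"


total_steps = 1000
phases = [phase_for_step(s, total_steps) for s in range(total_steps)]
num_clm_steps = phases.count("clm")
first_mlm_step = phases.index("mlm")
```

With a split of 0.25 and 1000 steps, steps 0 through 249 use CLM and step 250 is the first MLM step. Any data-preparation code has to apply the same rule, otherwise the inputs and the loss can disagree about the current phase.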
Implement training loop with proper evaluation methodology:
```python
import torch
from torch.optim import AdamW  # transformers.AdamW is deprecated; use PyTorch's


class PretrainingTrainer:
    """
    Trainer for controlled encoder pretraining studies.

    Trains models with different objectives on identical data,
    controlling for learning rates, optimizers, and schedules.
    """

    def __init__(self, model, config: ObjectiveConfig, tokenizer,
                 learning_rate: float = 1e-4, weight_decay: float = 0.01):
        self.model = model
        self.config = config
        self.tokenizer = tokenizer
        self.lr = learning_rate
        self.objective = PretrainingObjective(tokenizer.vocab_size, config)
        # Optimize the encoder and the LM head together
        self.params = list(model.parameters()) + list(self.objective.parameters())
        self.optimizer = AdamW(self.params, lr=learning_rate,
                               weight_decay=weight_decay)

    def prepare_mlm_batch(self, batch):
        """
        Create MLM inputs and targets by corrupting random tokens.

        Follows the BERT strategy: of the selected positions, 80% become
        [MASK], 10% become a random token, and 10% stay unchanged.
        Padding positions are never selected.
        """
        labels = batch['input_ids']
        inputs = labels.clone()
        device = labels.device
        mlm_mask = torch.rand(labels.shape, device=device) < self.config.mlm_probability
        if batch.get('attention_mask') is not None:
            mlm_mask &= batch['attention_mask'].bool()  # never select padding
        roll = torch.rand(labels.shape, device=device)
        # 80% of selected positions: replace with the tokenizer's mask token
        inputs[mlm_mask & (roll < 0.8)] = self.tokenizer.mask_token_id
        # 10%: replace with a random token; the remaining 10% stay unchanged
        use_random = mlm_mask & (roll >= 0.8) & (roll < 0.9)
        random_ids = torch.randint(self.tokenizer.vocab_size, labels.shape,
                                   device=device)
        inputs[use_random] = random_ids[use_random]
        batch['input_ids'] = inputs
        batch['mlm_mask'] = mlm_mask
        batch['mlm_labels'] = labels
        return batch

    def train_step(self, batch, current_step, total_steps):
        """
        Single training step with appropriate loss computation.

        Handles data preparation based on objective type, computes loss,
        and performs optimization step.
        """
        input_ids = batch['input_ids']  # uncorrupted tokens, used as labels
        attention_mask = batch.get('attention_mask', None)

        # MLM applies to pure MLM and to the second phase of biphasic training
        in_mlm_phase = self.config.objective == 'mlm' or (
            self.config.objective == 'biphasic'
            and current_step / total_steps >= self.config.biphasic_split)

        if in_mlm_phase:
            batch = self.prepare_mlm_batch(batch)
            model_inputs, mlm_mask = batch['input_ids'], batch['mlm_mask']
        else:
            # CLM phase: the model must apply a causal attention mask itself
            model_inputs, mlm_mask = input_ids, None

        # Forward pass on the (possibly corrupted) inputs
        outputs = self.model(model_inputs, attention_mask=attention_mask)
        logits = self.objective.lm_head(outputs.last_hidden_state)

        # Compute loss against the original tokens
        loss = self.objective(
            logits=logits,
            labels=input_ids,
            mlm_mask=mlm_mask,
            current_step=current_step,
            total_steps=total_steps
        )

        # Backward pass
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.params, 1.0)
        self.optimizer.step()
        return loss.item()

    def train(self, train_dataloader, num_epochs, total_steps,
              eval_dataloader=None):
        """
        Full training loop.

        Stops after total_steps optimizer steps. eval_dataloader is
        accepted but not used in this sketch.
        """
        self.model.train()
        step = 0
        for epoch in range(num_epochs):
            for batch in train_dataloader:
                if step >= total_steps:
                    return
                loss = self.train_step(batch, step, total_steps)
                step += 1
                # Log progress
                if step % 100 == 0:
                    print(f"Epoch {epoch}, Step {step}, Loss: {loss:.4f}")
                    print(f"  Objective: {self.config.objective}")
                    if self.config.objective == "biphasic":
                        # Phase of the step that just ran (zero-based step - 1)
                        in_clm = (step - 1) / total_steps < self.config.biphasic_split
                        print(f"  Current phase: {'CLM' if in_clm else 'MLM'}")
```
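The training loop above keeps the learning rate fixed, while the hyperparameter table lists a warmup period. Below is a minimal sketch of a linear warmup schedule in plain Python. The function name `lr_at_step` is hypothetical, and holding the rate constant after warmup is an assumption: the source does not state the decay shape used in the paper.

```python
def lr_at_step(step, peak_lr=1e-4, warmup_steps=10_000):
    """Learning rate for a zero-based step: linear warmup, then constant.

    Assumption: no decay after warmup; swap in the schedule you actually use.
    """
    if warmup_steps <= 0:
        return peak_lr
    # The scale reaches 1.0 on the last warmup step and is capped afterwards
    scale = min(1.0, (step + 1) / warmup_steps)
    return peak_lr * scale


# Sample the schedule at a few points of a run
schedule = {step: lr_at_step(step) for step in (0, 4_999, 9_999, 50_000)}
```

In PyTorch, a function like this can be attached to the optimizer through `torch.optim.lr_scheduler.LambdaLR`, which expects the multiplicative scale rather than the absolute rate. Whatever schedule is chosen, it should be identical across the CLM, MLM, and biphasic runs so that the objective remains the only variable.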
Evaluate downstream performance following the paper's methodology:
```python
import copy

import numpy as np


def evaluate_downstream_tasks(model, tokenizer, task_type: str,
                              num_seeds: int = 5) -> dict:
    """
    Evaluate pretrained model on downstream task with multiple seeds.

    Following paper methodology: grid search over learning rates,
    5 random seeds per configuration, official metrics.
    Assumes the task metric is "higher is better".
    """
    results = {'task': task_type, 'seeds': []}
    learning_rates = [1e-5, 3e-5, 5e-5, 1e-4, 3e-4]
    for seed in range(num_seeds):
        seed_results = []
        for lr in learning_rates:
            # finetune_and_evaluate is a user-supplied helper (not defined
            # here) that fine-tunes on the task and returns the metric.
            # Each run gets a fresh copy so runs do not contaminate each other.
            task_metric = finetune_and_evaluate(
                model=copy.deepcopy(model),
                tokenizer=tokenizer,
                task=task_type,
                lr=lr,
                seed=seed
            )
            seed_results.append(task_metric)
        # Take best learning rate for this seed
        best_metric = max(seed_results)
        results['seeds'].append(best_metric)
    # Report mean and (population) std across seeds
    results['mean'] = np.mean(results['seeds'])
    results['std'] = np.std(results['seeds'])
    return results
```
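The aggregation step can be checked in isolation, without fine-tuning anything. The sketch below reproduces the "best learning rate per seed, then mean and standard deviation across seeds" logic using only the standard library. The function name `aggregate_best_per_seed` is hypothetical, and the scores are made-up placeholders, not results from the paper.

```python
from statistics import mean, pstdev


def aggregate_best_per_seed(scores):
    """Aggregate scores[seed][lr] -> metric, assuming higher is better.

    Takes the best learning rate per seed, then reports the mean and the
    population standard deviation (the default behavior of np.std).
    """
    best = [max(by_lr.values()) for by_lr in scores.values()]
    return {"seeds": best, "mean": mean(best), "std": pstdev(best)}


# Placeholder scores for two seeds and two learning rates
toy_scores = {
    0: {1e-5: 0.80, 3e-5: 0.84},
    1: {1e-5: 0.82, 3e-5: 0.81},
}
summary = aggregate_best_per_seed(toy_scores)
```

With only five seeds, the population standard deviation is noticeably smaller than the sample standard deviation (`statistics.stdev`, or `np.std(..., ddof=1)`), so it is worth stating which one a reported number uses.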
Hyperparameter Table:
| Parameter | CLM | MLM | Biphasic | Notes |
|-----------|-----|-----|----------|-------|
| Learning rate | 1e-4 | 1e-4 | 1e-4 | Keep constant across objectives for fair comparison |
| MLM probability | N/A | 0.15 | 0.15 | Fraction of tokens to mask |
| Biphasic split | N/A | N/A | 0.25 | 25% CLM, 75% MLM optimal in paper |
| Batch size | 256 | 256 | 256 | Same for all conditions |
| Warmup steps | 10K | 10K | 10K | As fraction of total steps |
| Weight decay | 0.01 | 0.01 | 0.01 | L2 regularization |
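To turn these settings into a step budget, the arithmetic below converts a token budget into optimizer steps and splits them between the two phases. It is a sketch under stated assumptions: the batch size of 256 is read as sequences per step, and the sequence length of 512 is a placeholder, since the source does not give one. `plan_budget` is a hypothetical helper name.

```python
import math

TOKEN_BUDGET = 100_000_000_000  # 100B tokens, as in the experimental setup
BATCH_SIZE = 256                # assumption: counted in sequences per step
SEQ_LEN = 512                   # assumption: not stated in the source
CLM_FRACTION = 0.25             # biphasic split from the table


def plan_budget(token_budget, batch_size, seq_len, clm_fraction):
    """Convert a token budget into total, CLM, and MLM optimizer steps."""
    tokens_per_step = batch_size * seq_len
    total_steps = token_budget // tokens_per_step
    # Matches the `progress < split` rule: CLM covers ceil(split * total) steps
    clm_steps = math.ceil(total_steps * clm_fraction)
    return {
        "tokens_per_step": tokens_per_step,
        "total_steps": total_steps,
        "clm_steps": clm_steps,
        "mlm_steps": total_steps - clm_steps,
    }


plan = plan_budget(TOKEN_BUDGET, BATCH_SIZE, SEQ_LEN, CLM_FRACTION)
```

Under these assumptions each step consumes 131,072 tokens, which gives 762,939 steps in total, of which 190,735 are CLM and 572,204 are MLM. Changing the sequence length changes all three numbers, so recompute them for the actual configuration.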
When to Use:
When NOT to Use:
Common Pitfalls:
Authors (2025). Should We Still Pretrain Encoders with Masked Language Modeling? arXiv preprint arXiv:2507.00994. https://arxiv.org/abs/2507.00994