skills/skillxiv-v0.0.2-claude-opus-4.6/evolution-strategies-llm-finetuning/SKILL.md
Scale Evolution Strategies to billion-parameter LLMs without backpropagation for superior robustness and stability across diverse models, reward horizons, and evaluation tasks. Outperforms RL methods while eliminating gradient computation overhead.
Fine-tune large language models through population-based direct parameter search, achieving robust model improvements across diverse architectures with 15.5× lower training variance than gradient-based RL methods and resistance to reward hacking without explicit penalties.
Current LLM fine-tuning relies on backpropagation through gradient-based reinforcement learning (PPO, GRPO), which struggles with the issues noted above: high training variance and susceptibility to reward hacking.
Evolution Strategies offer an alternative: direct parameter space search using only reward signals, no gradients required.
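Evolution Strategies treat model parameters as a genome subject to evolutionary pressure. The algorithm repeatedly samples a population of Gaussian perturbations of the current parameters, scores each perturbed model with the reward function, and shifts the parameters toward the perturbations that earned higher reward.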
Key insight: ES needs only reward values, not gradients, enabling response-level supervision (did the model solve the problem?) rather than loss gradients. This decouples optimization from model architecture and enables effective search in sparse reward regimes.
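To make the reward-only update concrete, here is a minimal sketch on a toy two-parameter problem rather than an LLM. The reward function `toy_reward`, its target point, and every hyperparameter value are illustrative assumptions, not settings from the paper.

```python
import random
import statistics

def toy_reward(theta):
    # Hypothetical black-box reward: highest when theta is at (3, -2)
    return -((theta[0] - 3.0) ** 2 + (theta[1] + 2.0) ** 2)

def es_step(theta, reward_fn, rng, population_size=30, sigma=0.1, lr=0.05):
    # Sample Gaussian perturbations and score each perturbed candidate
    noises = [[rng.gauss(0.0, 1.0) for _ in theta] for _ in range(population_size)]
    rewards = [reward_fn([t + sigma * e for t, e in zip(theta, eps)]) for eps in noises]
    # Z-score the rewards so the step does not depend on the reward scale
    mean, std = statistics.fmean(rewards), statistics.pstdev(rewards)
    z = [(r - mean) / (std + 1e-8) for r in rewards]
    # Move theta along the reward-weighted average of the perturbations
    return [
        t + lr * sum(zi * eps[d] for zi, eps in zip(z, noises)) / population_size
        for d, t in enumerate(theta)
    ]

rng = random.Random(0)
theta = [0.0, 0.0]
for _ in range(300):
    theta = es_step(theta, toy_reward, rng)
```

The only thing the loop ever asks of `toy_reward` is a number per candidate; no derivative of the reward is computed anywhere.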
At billion-parameter scale, seven engineering optimizations make ES tractable: noise reproducibility via random seeds, parallel GPU evaluation, in-place perturbation, reward normalization, greedy decoding, decomposed updates, and simplified learning rates.
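Two of these optimizations, seed-based noise reproducibility and in-place perturbation, can be illustrated with a small sketch. It assumes the parameters are a plain Python list and uses hypothetical helper names (`perturb_in_place`, `restore_in_place`); an LLM setup would apply the same idea to GPU tensors.

```python
import random

def perturb_in_place(params, seed, sigma):
    # Add sigma * noise to params; the noise is fully determined by the seed
    rng = random.Random(seed)
    for i in range(len(params)):
        params[i] += sigma * rng.gauss(0.0, 1.0)

def restore_in_place(params, seed, sigma):
    # Regenerate the same noise from the seed and subtract it again
    rng = random.Random(seed)
    for i in range(len(params)):
        params[i] -= sigma * rng.gauss(0.0, 1.0)

params = [0.5, -1.25, 2.0]
original = list(params)
perturb_in_place(params, seed=1234, sigma=0.017)
perturbed = list(params)
restore_in_place(params, seed=1234, sigma=0.017)
```

Only the integer seed has to be remembered per population member, not a noise vector as large as the model. Restoration is exact only up to floating-point round-off.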
Population-Based Search
Reward-Driven Parameter Updates
Memory & Compute Efficiency
Stability Properties
Prepare the Python environment and install dependencies for distributed GPU evaluation.
# Create and activate virtual environment
python3.10 -m venv es_env
source es_env/bin/activate
# Install dependencies (from repository)
pip install -r requirements.txt
# Key packages:
# - torch>=2.0.0
# - transformers>=4.40.0
# - accelerate>=0.27.0 (distributed training)
# - datasets>=2.18.0 (data loading)
# - numpy, pandas (utilities)
The reward function takes a model and returns a scalar score. ES optimizes this directly—no gradients needed.
import torch

def compute_reward(model, tokenizer, examples):
    """
    Evaluate model on a task and return scalar reward.

    Args:
        model: LLM instance (already loaded)
        tokenizer: Tokenizer for the model
        examples: List of {input, expected_output} dicts
    Returns:
        float: Aggregated reward (0-1 range recommended)
    """
    correct = 0
    for example in examples:
        # Generate response with greedy decoding
        inputs = tokenizer(example["input"], return_tensors="pt").to(model.device)
        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=256,
                do_sample=False,  # greedy
                pad_token_id=tokenizer.eos_token_id
            )
        # Decode only the newly generated tokens; generate() also returns the prompt
        prompt_len = inputs["input_ids"].shape[1]
        response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
        # Check correctness (task-specific)
        if is_correct(response, example["expected_output"]):
            correct += 1
    # Return fraction correct
    return correct / len(examples)

def is_correct(response, expected):
    """Task-specific correctness check."""
    # Example: exact match
    return response.strip() == expected.strip()
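Exact string match can be brittle when the model wraps the answer in explanation or punctuation. A more lenient check might look like the sketch below. `normalize_answer` and `is_correct_lenient` are hypothetical helpers, and the last-line heuristic is an assumption that should be adapted to the task's answer format.

```python
import re

def normalize_answer(text):
    # Lowercase, trim whitespace and trailing periods, collapse runs of spaces
    text = text.strip().lower().rstrip(".")
    return re.sub(r"\s+", " ", text)

def is_correct_lenient(response, expected):
    # Accept the expected answer at the end of the last non-empty line
    lines = [ln for ln in response.splitlines() if ln.strip()]
    if not lines:
        return False
    return normalize_answer(lines[-1]).endswith(normalize_answer(expected))
```

For example, `is_correct_lenient("Subtract 3 from both sides.\nSo x = 2.", "x = 2")` accepts a response that exact matching would reject. A looser check also makes the reward easier to game, so it is worth spot-checking accepted responses.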
Set up the ES state: mean parameters, step size, and population utilities.
import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model and tokenizer
model_name = "Qwen/Qwen2.5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Flatten parameters into a single vector (for ES state)
params_init = torch.nn.utils.parameters_to_vector(model.parameters()).detach().clone()
num_params = params_init.numel()
# ES hyperparameters
population_size = 30 # Small population due to engineering optimizations
learning_rate = 0.001
sigma = 0.017 # Standard deviation of perturbations (tune per task)
print(f"Model parameters: {num_params:,} | Population: {population_size}")
# Initialize rank-based utilities (index 0 = weight for the best-ranked member)
utilities = np.array([max(0, np.log(population_size/2 + 1) - np.log(i+1))
                      for i in range(population_size)])
utilities /= np.sum(utilities)  # Normalize
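To see what this utility formula produces, the sketch below evaluates it for a small population using only the standard library. The function name `rank_utilities` and the population size of 6 are illustrative choices.

```python
import math

def rank_utilities(population_size):
    # Same formula as above: max(0, log(n/2 + 1) - log(rank)), then normalize
    raw = [max(0.0, math.log(population_size / 2 + 1) - math.log(rank))
           for rank in range(1, population_size + 1)]
    total = sum(raw)
    return [u / total for u in raw]

weights = rank_utilities(6)
print([round(w, 3) for w in weights])
```

With six members, only the top three ranks receive non-zero weight, and the weights decrease with rank. Because each entry is tied to a rank position rather than to a particular member, members have to be ordered by reward before the weights are applied.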
Run ES for multiple generations, accumulating rewards and updating parameters.
def es_train_loop(
    model, tokenizer, params_init, reward_fn,
    generations=100, population_size=30, sigma=0.017, lr=0.001,
    seed_base=42, device="cuda"
):
    """
    Main Evolution Strategies training loop.

    Args:
        model: LLM to fine-tune
        tokenizer: Model tokenizer
        params_init: Initial parameter vector
        reward_fn: Function(model, tokenizer) -> float
        generations: Number of ES iterations
        population_size: Population members per iteration
        sigma: Perturbation std dev (controls exploration)
        lr: Step size for the parameter update
        seed_base: RNG seed for reproducibility
        device: "cuda" or "cpu"
    Returns:
        (final parameter vector, best reward per generation)
    """
    params_current = params_init.clone().to(device)
    num_params = params_current.numel()
    rewards_history = []

    # Rank-based utilities, computed here so they always match population_size
    # (index 0 = weight of the best-ranked member)
    utilities = np.maximum(
        0.0, np.log(population_size / 2 + 1) - np.log(np.arange(1, population_size + 1))
    )
    utilities /= utilities.sum()

    def sample_noise(gen, member_id):
        # Deterministic noise from seed (no storage overhead)
        rng = torch.Generator(device=device)
        rng.manual_seed(seed_base + gen * population_size + member_id)
        return torch.randn(num_params, generator=rng, dtype=params_current.dtype, device=device)

    for gen in range(generations):
        gen_rewards = []
        # Generate and evaluate population
        for member_id in range(population_size):
            params_perturbed = params_current + sigma * sample_noise(gen, member_id)
            # Write the perturbed vector into the model weights
            torch.nn.utils.vector_to_parameters(params_perturbed, model.parameters())
            # Evaluate (reward only, no gradients)
            gen_rewards.append(reward_fn(model, tokenizer))

        # Normalize rewards and rank members (rank 0 = highest reward)
        rewards_array = np.array(gen_rewards)
        rewards_normalized = (rewards_array - np.mean(rewards_array)) / (np.std(rewards_array) + 1e-8)
        ranks = np.argsort(np.argsort(-rewards_array))

        # Update: θ ← θ + (α/σ) * Σ util_rank(i) * noise_i * z_i, regenerating noise from seeds
        update = torch.zeros_like(params_current)
        for member_id in range(population_size):
            weight = float(utilities[ranks[member_id]] * rewards_normalized[member_id])
            if weight != 0.0:
                update += weight * sample_noise(gen, member_id)
        params_current = params_current + (lr / sigma) * update

        # Log progress
        best_reward = np.max(gen_rewards)
        mean_reward = np.mean(gen_rewards)
        rewards_history.append(best_reward)
        if (gen + 1) % 10 == 0:
            print(f"Gen {gen+1:3d} | Best: {best_reward:.4f} | Mean: {mean_reward:.4f} | Std: {np.std(gen_rewards):.4f}")
    return params_current, rewards_history
After training, restore final parameters and test performance.
def save_finetuned_model(model, params_final, output_path):
    """
    Write final parameters back to model and save to disk.

    Args:
        model: LLM with architecture to save
        params_final: Final parameter vector from ES
        output_path: Directory to save (will create via model.save_pretrained)
    """
    # Restore final parameters
    offset = 0
    for param in model.parameters():
        param_size = param.numel()
        # clone() gives each tensor its own storage instead of a view into params_final
        param.data = params_final[offset:offset+param_size].reshape(param.shape).clone()
        offset += param_size
    # Save to disk
    model.save_pretrained(output_path)
    print(f"Fine-tuned model saved to {output_path}")
# Example usage
if __name__ == "__main__":
    # Load data (example: math reasoning)
    train_examples = [
        {"input": "Solve: 2x + 3 = 7", "expected_output": "x = 2"},
        # ... more examples
    ]

    # Define reward function
    def reward_fn(m, t):
        return compute_reward(m, t, train_examples[:20])  # Subset for speed

    # Run ES fine-tuning
    params_final, history = es_train_loop(
        model, tokenizer, params_init, reward_fn,
        generations=100,
        population_size=30,
        sigma=0.017,
        lr=0.001
    )

    # Save and evaluate
    save_finetuned_model(model, params_final, "./model_finetuned")
For large models, distribute population evaluation across multiple GPUs or machines.
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer

def es_train_distributed(
    model_name, reward_fn,
    generations=100, population_size=30,
    num_processes=2, gpu_threads=15
):
    """
    Multi-GPU ES training using Hugging Face Accelerate.
    Total parallel evaluations = num_processes * gpu_threads.

    Args:
        model_name: HuggingFace model ID
        reward_fn: Reward function (called per process)
        generations: Number of ES iterations
        population_size: Population members per iteration
        num_processes: Number of GPUs (or machines)
        gpu_threads: Threads per GPU (model copies per GPU)
    """
    accelerator = Accelerator()
    # Each process loads model independently
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = accelerator.prepare(model)
    # Each process evaluates a subset of population
    local_pop_size = population_size // num_processes
    # Main ES loop (same as single-GPU, but rewards aggregated)
    # ...
    print(f"Rank {accelerator.process_index}: evaluating {local_pop_size} members")
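Integer division such as `population_size // num_processes` leaves members unassigned whenever the population does not divide evenly (30 members on 4 processes gives 7 each, covering only 28). One way to split the work is sketched below; `member_ids_for_rank` and `member_seed` are hypothetical helpers, and the round-robin scheme is an assumption rather than the repository's method.

```python
def member_ids_for_rank(population_size, num_processes, rank):
    # Round-robin assignment: rank r evaluates members r, r + P, r + 2P, ...
    return list(range(rank, population_size, num_processes))

def member_seed(seed_base, gen, population_size, member_id):
    # Same seed formula as the single-GPU loop, so every process
    # can regenerate any member's noise without receiving it
    return seed_base + gen * population_size + member_id

population_size, num_processes = 30, 4
shards = [member_ids_for_rank(population_size, num_processes, r)
          for r in range(num_processes)]
```

Because each member's noise is a function of its seed, processes in a scheme like this would only need to exchange one scalar reward per member to apply the same update everywhere.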
| Parameter | Typical Range | Notes |
|-----------|---------------|-------|
| population_size | 20–50 | Smaller than RL batch sizes; 30 is default. Increase for harder tasks. |
| sigma (noise std) | 0.01–0.05 | Controls exploration vs. exploitation. Start at 0.017; lower for final refinement. |
| learning_rate | 0.0001–0.01 | Step size for parameter updates. 0.001 is standard; reduce if oscillating. |
| generations | 50–500 | Task-dependent; monitor reward curve to detect plateau. |
| seed_base | any | Ensures reproducibility; increment per run if multiple trials needed. |
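These settings also determine the evaluation budget, since every population member is scored on every example in every generation. The sketch below works through the arithmetic for the values in the usage example (100 generations, 30 members, 20 examples); the helper name `generation_calls` is illustrative.

```python
def generation_calls(generations, population_size, examples_per_eval):
    # Every member is scored on every example once per generation
    return generations * population_size * examples_per_eval

total = generation_calls(generations=100, population_size=30, examples_per_eval=20)
print(f"Total model.generate() calls: {total:,}")
```

Doubling any one of the three factors doubles the total, which is worth checking before raising `population_size` or `generations`.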
Each generation requires population_size full model evaluations; if evaluation is expensive (e.g., human-in-the-loop), use smaller populations or RL with importance weighting.
σ too high or low: If the noise is too large, updates become effectively random; if too small, the search gets stuck in local optima. Adapt σ per task (start at 0.017, halve if rewards plateau).
Ignoring reward scale: Normalizing rewards per generation is critical for stable updates. Without it, a reward range of 0–1 versus 0–1000 would require retuning the learning rate; per-generation z-score normalization removes that dependence.
Small population on large tasks: With population_size < 15, gradient estimates become noisy. For complex reasoning, use 30+.
Sampling instead of greedy decoding: ES assumes a deterministic reward (same input → same output). Sampling during generation adds noise to the reward; use greedy decoding or fix the sampling seed.
Starting from mid-training checkpoint: ES searches from the current parameter point; if base model is undertrained, ES may optimize for weak behaviors. Fine-tune strong base models.
Incorrect utility weights: Utility weights are assigned by reward rank, so the ranking must be recomputed every generation from that generation's rewards (and not reused across different tasks).
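The reward-scale point can be checked numerically. The sketch below, using made-up reward values, z-scores the same four rewards on a 0–1 scale and on a 0–1000 scale; `z_scores` is an illustrative helper that mirrors the per-generation normalization.

```python
import statistics

def z_scores(rewards, eps=1e-8):
    # Subtract the mean and divide by the population standard deviation
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

small = [0.2, 0.5, 0.9, 0.4]           # hypothetical rewards on a 0-1 scale
large = [r * 1000 for r in small]      # the same rewards on a 0-1000 scale
z_small, z_large = z_scores(small), z_scores(large)
```

Both lists come out essentially identical, so the update direction and step size are unchanged by the rescaling. If every member earns the same reward, the z-scores are all zero and the generation produces no update.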
Paper: Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning
Authors: Xin Qiu, Yulu Gan, Conor F. Hayes, Qiyao Liang, Elliot Meyerson, Babak Hodjat, Risto Miikkulainen
ArXiv: 2509.24372
Code: GitHub – Cognizant AI Lab
Cited Baselines: PPO (Schulman et al., 2017), GRPO (Shao et al., 2024), DPO (Rafailov et al., 2023)