Evolution Strategies Fine-Tuning: Direct Parameter Optimization at Billion Scale

Outcome

Fine-tune large language models through population-based direct parameter search. The referenced paper reports robust improvements across diverse base models, 15.5× lower run-to-run training variance than GRPO, and resistance to reward hacking without explicit KL penalties.

Problem Context

Current LLM fine-tuning typically relies on gradient-based reinforcement learning (PPO, GRPO) trained with backpropagation, which struggles with:

Sparse, long-horizon rewards: Intermediate supervision often unavailable for reasoning tasks; gradients through long sequences become unstable
Reward hacking: Gradient-based optimization exploits loopholes (short-but-nonsensical outputs) without explicit KL constraints
Cross-model brittleness: Fine-tuning success varies dramatically across base model architectures; GRPO failed entirely on certain models
Training instability: High variance across runs (15.5× higher than ES) makes expensive fine-tuning unreliable for large deployments
Computational overhead: Backpropagation and KL penalty computation add substantial memory and compute burden

Evolution Strategies offer an alternative: direct parameter space search using only reward signals, no gradients required.

Core Concept

Evolution Strategies treat model parameters as a genome subject to evolutionary pressure. The algorithm repeatedly:

Sample parameter perturbations from a normal distribution
Evaluate perturbed models on the target task to obtain rewards
Update parameters in the direction of high-reward perturbations (natural gradient)

Key insight: ES needs only reward values, not gradients, enabling response-level supervision (did the model solve the problem?) rather than loss gradients. This decouples optimization from model architecture and enables effective search in sparse reward regimes.
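The three steps can be seen end to end on a toy problem. The sketch below is an illustration only, not the paper's implementation: it maximizes a hypothetical quadratic reward over two parameters in plain Python, and the names `toy_reward` and `es_step` as well as the hyperparameter values are invented for this example.

```python
import random
import statistics


def toy_reward(theta):
    # Hypothetical reward: highest (zero) at theta = (3, -1)
    return -((theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2)


def es_step(theta, rng, population_size=30, sigma=0.1, lr=0.05):
    # 1. Sample perturbations from a normal distribution
    noises = [[rng.gauss(0.0, 1.0) for _ in theta] for _ in range(population_size)]
    # 2. Evaluate each perturbed parameter vector (reward only, no gradients)
    rewards = [toy_reward([t + sigma * e for t, e in zip(theta, eps)]) for eps in noises]
    # 3. Move toward high-reward perturbations, weighting noise by z-scored reward
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    z = [(r - mean) / (std + 1e-8) for r in rewards]
    return [
        t + lr * sum(zi * eps[d] for zi, eps in zip(z, noises)) / population_size
        for d, t in enumerate(theta)
    ]


rng = random.Random(0)
theta = [0.0, 0.0]
for _ in range(300):
    theta = es_step(theta, rng)
print([round(t, 2) for t in theta])  # ends close to the optimum (3, -1)
```

The reward function is only ever called, never differentiated, which is the property that lets the same loop work with response-level scores from an LLM.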

At billion-parameter scale, seven engineering optimizations make ES tractable: noise reproducibility via random seeds, parallel GPU evaluation, in-place perturbation, reward normalization, greedy decoding, decomposed updates, and simplified learning rates.

Architecture Overview

Population-Based Search

Small fixed population (30 members vs. 10,000+ in prior work) evaluates perturbations in parallel
Each member: base weights + scaled Gaussian noise sampled from seed
Parallel evaluation across GPUs; single machines or distributed clusters via Hugging Face Accelerate

Reward-Driven Parameter Updates

Collect reward signal (scalar, delayed OK) from each population member
Normalize rewards to zero-mean unit-variance
Compute utility-weighted average of perturbations: Δθ ∝ Σ(utility_i × noise_i)
Apply learning rate: θ_new = θ_old + α × Δθ
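Written out in one consistent notation, the four steps above look as follows. This is a sketch under an assumption: the list does not define utility_i precisely, so the block takes it to be the z-scored reward divided by the population size N. The symbols ε_i, σ, r_i, and R are introduced here for illustration.

```latex
\begin{aligned}
\varepsilon_i &\sim \mathcal{N}(0, I), \qquad
r_i = R(\theta_{\text{old}} + \sigma \varepsilon_i), \qquad i = 1, \dots, N \\
z_i &= \frac{r_i - \bar{r}}{s_r}, \qquad
\bar{r} = \frac{1}{N} \sum_{j=1}^{N} r_j, \qquad
s_r = \sqrt{\frac{1}{N} \sum_{j=1}^{N} (r_j - \bar{r})^2} \\
\Delta\theta &= \sum_{i=1}^{N} \text{utility}_i \, \varepsilon_i, \qquad
\text{utility}_i = \frac{z_i}{N} \quad \text{(assumed)} \\
\theta_{\text{new}} &= \theta_{\text{old}} + \alpha \, \Delta\theta
\end{aligned}
```

The training loop in step 4 further multiplies each term by a rank weight and scales the step by 1/σ.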

Memory & Compute Efficiency

Noise retrieval: reconstruct perturbations from random seeds on-the-fly (no storage overhead)
Layer-level in-place perturbation: modify weights sequentially, evaluate, restore (single copy in memory)
Batch GPU evaluation: evaluate multiple perturbed models per GPU via threading
No backpropagation: ~50% memory reduction vs. gradient methods
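Noise retrieval and in-place perturbation can be illustrated without any deep learning library. The sketch below is a simplified stand-in, assuming weights are plain Python lists rather than tensors; `perturb_in_place` is a hypothetical helper, not a function from the repository.

```python
import random


def perturb_in_place(layers, seed, sigma, sign=1.0):
    # Walk the layers in a fixed order and add (sign=+1) or remove (sign=-1)
    # noise regenerated from the seed; the noise itself is never stored
    rng = random.Random(seed)
    for weights in layers:
        for i in range(len(weights)):
            weights[i] += sign * sigma * rng.gauss(0.0, 1.0)


layers = [[0.5, -0.2, 0.1], [1.0, 0.0]]  # stand-in for per-layer weight tensors
original = [list(w) for w in layers]

perturb_in_place(layers, seed=123, sigma=0.01)             # perturb
perturbed = [list(w) for w in layers]
# ... evaluate the perturbed model here ...
perturb_in_place(layers, seed=123, sigma=0.01, sign=-1.0)  # restore with the same seed

# Restoration is exact up to floating-point rounding
max_restore_error = max(
    abs(a - b) for la, lb in zip(layers, original) for a, b in zip(la, lb)
)
```

Because only the integer seed has to be remembered per population member, memory use stays at a single copy of the weights regardless of population size.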

Stability Properties

ES update is rank-based utility weighting (robust to reward outliers and scale)
No explicit KL penalties; ES naturally avoids reward hacking through population diversity
Variance reduction: 15.5× lower than GRPO across runs on identical problems
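The robustness claim about reward scale and outliers can be checked on a handful of numbers. The following sketch uses made-up reward values and two hypothetical helpers, `z_scores` and `ranks`, to show which quantities change when rewards are rescaled or contain an outlier.

```python
import statistics


def z_scores(rewards):
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]


def ranks(rewards):
    # Rank 0 = highest reward
    order = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    result = [0] * len(rewards)
    for rank, member_id in enumerate(order):
        result[member_id] = rank
    return result


rewards = [0.2, 0.5, 0.9, 0.4]
scaled = [1000 * r for r in rewards]  # same rewards on a 0-1000 scale
with_outlier = [0.2, 0.5, 90.0, 0.4]  # one extreme reward

print(ranks(rewards), ranks(scaled), ranks(with_outlier))
print(z_scores(rewards), z_scores(with_outlier))
```

Rescaling leaves both ranks and z-scores essentially unchanged, while a single outlier shifts the z-scores of every other member but not their ranks.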

Implementation

1. Environment Setup

Prepare the Python environment and install dependencies for distributed GPU evaluation.

# Create and activate virtual environment
python3.10 -m venv es_env
source es_env/bin/activate

# Install dependencies (from repository)
pip install -r requirements.txt

# Key packages:
# - torch>=2.0.0
# - transformers>=4.40.0
# - accelerate>=0.27.0 (distributed training)
# - datasets>=2.18.0 (data loading)
# - numpy, pandas (utilities)

2. Define the Reward Function

The reward function takes a model and returns a scalar score. ES optimizes this directly—no gradients needed.

def compute_reward(model, tokenizer, examples):
    """
    Evaluate model on a task and return scalar reward.

    Args:
        model: LLM instance (already loaded)
        tokenizer: Tokenizer for the model
        examples: List of {input, expected_output} dicts

    Returns:
        float: Aggregated reward (0-1 range recommended)
    """
    correct = 0
    for example in examples:
        # Generate response with greedy decoding
        inputs = tokenizer(example["input"], return_tensors="pt").to(model.device)
        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=256,
                do_sample=False,  # greedy
                pad_token_id=tokenizer.eos_token_id
            )
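        # Decode only the newly generated tokens; output[0] also contains the prompt
        prompt_length = inputs["input_ids"].shape[1]
        response = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)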

        # Check correctness (task-specific)
        if is_correct(response, example["expected_output"]):
            correct += 1

    # Return fraction correct
    return correct / len(examples)


def is_correct(response, expected):
    """Task-specific correctness check."""
    # Example: exact match
    return response.strip() == expected.strip()
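Exact string matching is strict for free-form generations, so a more tolerant check is often useful. The variant below is a hypothetical sketch, not part of the original skill: it assumes the final answer sits on the last non-empty line of the response and ignores case, whitespace, and trailing punctuation.

```python
import re


def normalize_answer(text):
    # Lowercase, remove all whitespace, and drop trailing sentence punctuation
    text = re.sub(r"\s+", "", text.lower())
    return text.rstrip(".!")


def is_correct_lenient(response, expected):
    # Compare only the last non-empty line (assumed to hold the final answer)
    lines = [line for line in response.splitlines() if line.strip()]
    if not lines:
        return False
    return normalize_answer(lines[-1]) == normalize_answer(expected)


print(is_correct_lenient("Subtract 3 from both sides.\nX = 2.", "x = 2"))  # True
print(is_correct_lenient("x = 3", "x = 2"))                                # False
```

Whatever check is used, it should stay deterministic so that the same perturbed model always receives the same reward.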

3. Initialize Population and State

Set up the ES state: mean parameters, step size, and population utilities.

import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "Qwen/Qwen2.5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Flatten parameters into a single vector (for ES state)
params_init = torch.nn.utils.parameters_to_vector(model.parameters()).detach().clone()
num_params = params_init.numel()

# ES hyperparameters
population_size = 30  # Small population due to engineering optimizations
learning_rate = 0.001
sigma = 0.017  # Standard deviation of perturbations (tune per task)

print(f"Model parameters: {num_params:,} | Population: {population_size}")

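# Initialize rank-based utilities: utilities[0] weights the best-ranked member,
# and the lower-ranked half of the population receives zero weight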
utilities = np.array([max(0, np.log(population_size/2 + 1) - np.log(i+1))
                       for i in range(population_size)])
utilities /= np.sum(utilities)  # Normalize
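To make the shape of these weights concrete, the sketch below recomputes them for a population of six in plain Python and maps them onto members by reward rank. The helper names `rank_utilities` and `assign_by_rank` and the reward values are assumptions made for this example.

```python
import math


def rank_utilities(population_size):
    # Same formula as above: log-shaped weights, zero for the lower-ranked half
    raw = [
        max(0.0, math.log(population_size / 2 + 1) - math.log(i + 1))
        for i in range(population_size)
    ]
    total = sum(raw)
    return [u / total for u in raw]


def assign_by_rank(rewards, utilities):
    # Best reward receives utilities[0], second best utilities[1], and so on
    order = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    weights = [0.0] * len(rewards)
    for rank, member_id in enumerate(order):
        weights[member_id] = utilities[rank]
    return weights


utilities = rank_utilities(6)
weights = assign_by_rank([0.1, 0.7, 0.3, 0.9, 0.0, 0.5], utilities)
print([round(u, 3) for u in utilities])
print([round(w, 3) for w in weights])
```

The weights are tied to rank positions, not to member indices, so the mapping from rank to member has to be redone whenever the rewards change.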

4. Main ES Loop: Mutation, Evaluation, and Update

Run ES for multiple generations, accumulating rewards and updating parameters.

def es_train_loop(
    model, tokenizer, params_init, reward_fn,
    generations=100, population_size=30, sigma=0.017, lr=0.001,
    seed_base=42, device="cuda", utilities=None
):
    """
    Main Evolution Strategies training loop.

    Args:
        model: LLM to fine-tune
        tokenizer: Model tokenizer
        params_init: Initial parameter vector
        reward_fn: Function(model, tokenizer) -> float
        generations: Number of ES iterations
        population_size: Population members per iteration
        sigma: Perturbation std dev (controls exploration)
        lr: Natural gradient step size
        seed_base: RNG seed for reproducibility
        device: "cuda" or "cpu"
        utilities: Rank weights, best rank first (default: log-rank weights as in step 3)

    Returns:
        (final parameter vector, best reward per generation)
    """
    params_current = params_init.clone().to(device)
    num_params = params_current.numel()
    rewards_history = []

    if utilities is None:
        utilities = np.array([max(0.0, np.log(population_size / 2 + 1) - np.log(i + 1))
                              for i in range(population_size)])
        utilities /= utilities.sum()

    def member_noise(gen, member_id):
        # Deterministic noise from seed: regenerated on demand, never stored.
        # For billion-parameter models, generate it layer by layer instead of
        # as one full-size vector.
        generator = torch.Generator(device=device)
        generator.manual_seed(seed_base + gen * population_size + member_id)
        return torch.randn(num_params, generator=generator,
                           dtype=torch.float32, device=device)

    for gen in range(generations):
        gen_rewards = []

        # Generate and evaluate population
        for member_id in range(population_size):
            noise = member_noise(gen, member_id)
            params_perturbed = params_current + (sigma * noise).to(params_current.dtype)

            # Load the perturbed weights into the model, layer by layer
            offset = 0
            for param in model.parameters():
                param_size = param.numel()
                param.data = params_perturbed[offset:offset+param_size].reshape(param.shape)
                offset += param_size

            # Evaluate (reward only, no gradients)
            gen_rewards.append(float(reward_fn(model, tokenizer)))

        # Normalize rewards (z-score within the generation)
        rewards_array = np.array(gen_rewards)
        rewards_normalized = (rewards_array - np.mean(rewards_array)) / (np.std(rewards_array) + 1e-8)

        # Assign utilities by reward rank, not by sampling order:
        # the best member of this generation receives utilities[0]
        weights = np.zeros(population_size)
        weights[np.argsort(-rewards_array)] = utilities

        # Update: θ ← θ + (α/σ) * Σ util_i * z_i * noise_i, with z_i the normalized reward
        update = torch.zeros(num_params, dtype=torch.float32, device=device)
        for member_id in range(population_size):
            coeff = float(weights[member_id] * rewards_normalized[member_id])
            if coeff != 0.0:
                update += coeff * member_noise(gen, member_id)
        params_current = params_current + ((lr / sigma) * update).to(params_current.dtype)

        # Log progress
        best_reward = np.max(gen_rewards)
        mean_reward = np.mean(gen_rewards)
        rewards_history.append(best_reward)

        if (gen + 1) % 10 == 0:
            print(f"Gen {gen+1:3d} | Best: {best_reward:.4f} | Mean: {mean_reward:.4f} | Std: {np.std(gen_rewards):.4f}")

    return params_current, rewards_history

5. Save and Evaluate Fine-Tuned Model

After training, restore final parameters and test performance.

def save_finetuned_model(model, params_final, output_path):
    """
    Write final parameters back to model and save to disk.

    Args:
        model: LLM with architecture to save
        params_final: Final parameter vector from ES
        output_path: Directory to save (will create via model.save_pretrained)
    """
    # Restore final parameters
    offset = 0
    for param in model.parameters():
        param_size = param.numel()
        param.data = params_final[offset:offset+param_size].reshape(param.shape)
        offset += param_size

    # Save to disk
    model.save_pretrained(output_path)
    print(f"Fine-tuned model saved to {output_path}")


# Example usage
if __name__ == "__main__":
    # Load data (example: math reasoning)
    train_examples = [
        {"input": "Solve: 2x + 3 = 7", "expected_output": "x = 2"},
        # ... more examples
    ]

    # Define reward function
    def reward_fn(m, t):
        return compute_reward(m, t, train_examples[:20])  # Subset for speed

    # Run ES fine-tuning
    params_final, history = es_train_loop(
        model, tokenizer, params_init, reward_fn,
        generations=100,
        population_size=30,
        sigma=0.017,
        lr=0.001
    )

    # Save and evaluate
    save_finetuned_model(model, params_final, "./model_finetuned")

6. Distributed Multi-GPU Setup (via Accelerate)

For large models, distribute population evaluation across multiple GPUs or machines.

import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer

def es_train_distributed(
    model_name, reward_fn,
    generations=100, population_size=30,
    num_processes=2, gpu_threads=15
):
    """
    Multi-GPU ES training using Hugging Face Accelerate.
    Total parallel evaluations = num_processes * gpu_threads.

    Args:
        model_name: HuggingFace model ID
        reward_fn: Reward function (called per process)
        num_processes: Number of GPUs (or machines)
        gpu_threads: Threads per GPU (model copies per GPU)
    """
    accelerator = Accelerator()

    # Each process loads the model independently. ES only runs inference, so the
    # model is moved to this process's device instead of being passed through
    # accelerator.prepare(), whose multi-GPU DDP wrapper does not expose generate().
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = model.to(accelerator.device)
    model.eval()

    # Each process evaluates a subset of population
    local_pop_size = population_size // num_processes

    # Main ES loop (same as single-GPU, but rewards aggregated)
    # ...

    print(f"Rank {accelerator.process_index}: evaluating {local_pop_size} members")
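One way to split the population across processes is sketched below in plain Python. It is an assumption about how the elided loop could partition work, not the repository's code: `members_for_rank` and `merge_rewards` are hypothetical helpers, and the gathering of per-rank results (for example through Accelerate's collective operations) is left out.

```python
def members_for_rank(population_size, num_processes, rank):
    # Round-robin assignment: every member is evaluated exactly once, even
    # when population_size is not divisible by num_processes
    return list(range(rank, population_size, num_processes))


def merge_rewards(per_rank_rewards, population_size):
    # per_rank_rewards: {rank: [(member_id, reward), ...]} collected from all processes
    rewards = [None] * population_size
    for pairs in per_rank_rewards.values():
        for member_id, reward in pairs:
            rewards[member_id] = reward
    return rewards


print(members_for_rank(30, 4, 0))  # [0, 4, 8, 12, 16, 20, 24, 28]
print(members_for_rank(30, 4, 3))  # [3, 7, 11, 15, 19, 23, 27]
print(merge_rewards({0: [(0, 0.5), (2, 0.1)], 1: [(1, 0.9)]}, 3))  # [0.5, 0.9, 0.1]
```

Because each member's noise is fully determined by its seed, processes only need to exchange (member_id, reward) pairs; every process can then rebuild the same update locally.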

Practical Guidance

Hyperparameter Recommendations

| Parameter | Typical Range | Notes |
|-----------|---------------|-------|
| population_size | 20–50 | Smaller than RL batch sizes; 30 is default. Increase for harder tasks. |
| sigma (noise std) | 0.01–0.05 | Controls exploration vs. exploitation. Start at 0.017; lower for final refinement. |
| learning_rate | 0.0001–0.01 | Step size for parameter updates. 0.001 is standard; reduce if oscillating. |
| generations | 50–500 | Task-dependent; monitor reward curve to detect plateau. |
| seed_base | any | Ensures reproducibility; increment per run if multiple trials needed. |

When to Use ES Fine-Tuning

Reasoning tasks with sparse, delayed rewards (math, logic, puzzle solving)
Heterogeneous base models: Need a method that works across Qwen, Llama, Mistral, etc.
Robustness critical: Training stability matters more than marginal reward gains
Reward specification difficult: You have outcome labels but not intermediate supervision
Small datasets: ES is sample-efficient (often < 20% of RL data needed)
Long-horizon tasks: Many generation steps with no intermediate reward; only the final answer can be scored

When NOT to Use ES Fine-Tuning

Dense reward signals: If you have loss gradients or detailed intermediate supervision, gradient-based methods (PPO, DPO) will be faster
Per-action credit assignment: ES perturbs all model weights at once and scores whole responses; when rewards can be attributed to individual actions or tokens, RL uses that signal more directly
Extreme speed required: ES requires multiple forward passes per update; if latency is critical, SFT or single-pass methods preferred
Highly model-specific optimization: If you're tuning for a single model and have unlimited compute for gradient tuning, RL may squeeze out extra performance
Limited evaluation budget: Each generation requires population_size full model evaluations; if evaluation is expensive (e.g., human-in-the-loop), use smaller populations or RL with importance weighting
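The evaluation cost in the last point is simple to estimate up front. The sketch below is a worked calculation using the default values from this document (100 generations, population of 30, 20 examples per reward call); `generation_budget` is a hypothetical helper.

```python
def generation_budget(generations, population_size, examples_per_eval):
    # One reward call per population member per generation
    model_evaluations = generations * population_size
    # Each reward call runs generate() once per example
    generate_calls = model_evaluations * examples_per_eval
    return model_evaluations, generate_calls


model_evals, generate_calls = generation_budget(100, 30, 20)
print(model_evals, generate_calls)  # 3000 60000
```

If each scored response needed a human judgment, this run would require 60,000 of them, which is why expensive evaluators call for smaller populations or fewer examples per reward call.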

Common Pitfalls

σ too high or low: If noise is too large, updates become random. If it is too small, the search gets stuck in local optima. Adapt σ per task (start at 0.017, halve if rewards plateau).
Ignoring reward scale: Normalizing rewards per generation is critical for stable updates. Without it, a reward range of 0–1000 instead of 0–1 would force a different learning rate; per-generation z-score normalization removes that dependence.
Small population on large tasks: With population_size < 15, gradient estimates become noisy. For complex reasoning, use 30+.
Not greedy decoding: ES assumes deterministic reward (same input → same output). Sampling during generation adds noise; use greedy decoding or fix seed.
Starting from mid-training checkpoint: ES searches from the current parameter point; if base model is undertrained, ES may optimize for weak behaviors. Fine-tune strong base models.
Incorrect utility weights: Utilities are assigned by reward rank within a generation, not by sampling order. Recompute the rank assignment every generation, and do not reuse it across tasks.
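The advice to halve σ when rewards plateau can be automated with a small rule. The sketch below is one possible heuristic, not something specified by the source: `adapt_sigma`, the window length, and the thresholds are all assumptions to be tuned per task.

```python
def adapt_sigma(sigma, mean_rewards, window=10, min_delta=1e-3, min_sigma=1e-4):
    # Compare the average mean reward of the last `window` generations with
    # the `window` generations before that; halve sigma if progress stalled
    if len(mean_rewards) < 2 * window:
        return sigma
    recent = sum(mean_rewards[-window:]) / window
    previous = sum(mean_rewards[-2 * window:-window]) / window
    if recent - previous < min_delta:
        return max(sigma / 2, min_sigma)
    return sigma


print(adapt_sigma(0.017, [0.5] * 20))               # plateau: sigma is halved
print(adapt_sigma(0.017, [0.1] * 10 + [0.5] * 10))  # still improving: unchanged
```

Calling this once per generation with the running list of mean rewards keeps exploration wide early on and narrows it only when the reward curve flattens.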

Reference

Paper: Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning
Authors: Xin Qiu, Yulu Gan, Conor F. Hayes, Qiyao Liang, Elliot Meyerson, Babak Hodjat, Risto Miikkulainen
ArXiv: 2509.24372
Code: GitHub – Cognizant AI Lab

Cited Baselines: PPO (Schulman et al., 2017), GRPO (Shao et al., 2024), DPO (Rafailov et al., 2023)
