cli-tool/components/skills/ai-research/post-training-simpo/SKILL.md
Simple Preference Optimization for LLM alignment. Reference-free alternative to DPO with better performance (+6.4 points on AlpacaEval 2.0). No reference model needed, more efficient than DPO. Use for preference alignment when want simpler, faster training than DPO/PPO.
npx skillsauth add davila7/claude-code-templates simpo-trainingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
SimPO is a reference-free preference optimization method that outperforms DPO without needing a reference model.
Installation:
# Create environment
conda create -n simpo python=3.10 && conda activate simpo
# Install PyTorch 2.2.2
# Visit: https://pytorch.org/get-started/locally/
# Install alignment-handbook
git clone https://github.com/huggingface/alignment-handbook.git
cd alignment-handbook
python -m pip install .
# Install Flash Attention 2
python -m pip install flash-attn --no-build-isolation
Training (Mistral 7B):
ACCELERATE_LOG_LEVEL=info accelerate launch \
--config_file accelerate_configs/deepspeed_zero3.yaml \
scripts/run_simpo.py \
training_configs/mistral-7b-base-simpo.yaml
Config (mistral-7b-base-simpo.yaml):
# Model
model_name_or_path: mistralai/Mistral-7B-v0.1
torch_dtype: bfloat16
# Dataset
dataset_mixer:
HuggingFaceH4/ultrafeedback_binarized: 1.0
dataset_splits:
- train_prefs
- test_prefs
# SimPO hyperparameters
beta: 2.0 # Reward scaling (2.0-10.0)
gamma_beta_ratio: 0.5 # Target margin (0-1)
loss_type: sigmoid # sigmoid or hinge
sft_weight: 0.0 # Optional SFT regularization
# Training
learning_rate: 5e-7 # Critical: 3e-7 to 1e-6
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
# Output
output_dir: ./outputs/mistral-7b-simpo
Launch training:
accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml \
scripts/run_simpo.py training_configs/mistral-7b-base-simpo.yaml
Config (llama3-8b-instruct-simpo.yaml):
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
dataset_mixer:
argilla/ultrafeedback-binarized-preferences-cleaned: 1.0
beta: 2.5
gamma_beta_ratio: 0.5
learning_rate: 5e-7
sft_weight: 0.1 # Add SFT loss to preserve capabilities
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
output_dir: ./outputs/llama3-8b-simpo
Launch:
accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml \
scripts/run_simpo.py training_configs/llama3-8b-instruct-simpo.yaml
For math/code tasks:
model_name_or_path: deepseek-ai/deepseek-math-7b-base
dataset_mixer:
argilla/distilabel-math-preference-dpo: 1.0
beta: 5.0 # Higher for stronger signal
gamma_beta_ratio: 0.7 # Larger margin
learning_rate: 3e-7 # Lower LR for reasoning
sft_weight: 0.0
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
Use SimPO when:
Algorithm selection:
Use alternatives instead:
Issue: Loss divergence
Reduce learning rate:
learning_rate: 3e-7 # Reduce from 5e-7
Reduce beta:
beta: 1.0 # Reduce from 2.0
Issue: Model forgets capabilities
Add SFT regularization:
sft_weight: 0.1 # Add SFT loss component
Issue: Poor preference separation
Increase beta and margin:
beta: 5.0 # Increase from 2.0
gamma_beta_ratio: 0.8 # Increase from 0.5
Issue: OOM during training
Reduce batch size:
per_device_train_batch_size: 1
gradient_accumulation_steps: 16 # Maintain effective batch
Enable gradient checkpointing:
gradient_checkpointing: true
Loss functions: See references/loss-functions.md for sigmoid vs hinge loss, mathematical formulations, and when to use each.
Hyperparameter tuning: See references/hyperparameters.md for beta, gamma, learning rate selection guide, and model-size-specific recommendations.
Dataset preparation: See references/datasets.md for preference data formats, quality filtering, and custom dataset creation.
Memory optimization:
tools
No-code automation democratizes workflow building. Zapier and Make (formerly Integromat) let non-developers automate business processes without writing code. But no-code doesn't mean no-complexity - these platforms have their own patterns, pitfalls, and breaking points. This skill covers when to use which platform, how to build reliable automations, and when to graduate to code-based solutions. Key insight: Zapier optimizes for simplicity and integrations (7000+ apps), Make optimizes for power
tools
Use only when the user explicitly asks to stage, commit, push, and open a GitHub pull request in one flow using the GitHub CLI (`gh`).
tools
Workflow automation is the infrastructure that makes AI agents reliable. Without durable execution, a network hiccup during a 10-step payment flow means lost money and angry customers. With it, workflows resume exactly where they left off. This skill covers the platforms (n8n, Temporal, Inngest) and patterns (sequential, parallel, orchestrator-worker) that turn brittle scripts into production-grade automation. Key insight: The platforms make different tradeoffs. n8n optimizes for accessibility
development
Trigger.dev expert for background jobs, AI workflows, and reliable async execution with excellent developer experience and TypeScript-first design. Use when: trigger.dev, trigger dev, background task, ai background job, long running task.