cli-tool/components/skills/ai-research/post-training-miles/SKILL.md
Provides guidance for enterprise-grade RL training using miles, a production-ready fork of slime. Use when training large MoE models with FP8/INT4, needing train-inference alignment, or requiring speculative RL for maximum throughput.
miles is a high-performance, enterprise-ready RL framework optimized for large-scale model post-training. Built as a production fork of slime, it addresses critical challenges in MoE training stability, low-precision training, and train-inference alignment.
Choose miles when you need:
Consider alternatives when:
# Recommended: Docker
docker pull radixark/miles:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
  -it radixark/miles:latest /bin/bash

# From source
git clone https://github.com/radixark/miles.git
cd miles
pip install -r requirements.txt
pip install -e .
miles inherits slime's configuration system. Basic training:
python train.py \
  --advantage-estimator grpo \
  --model-name qwen3-30b-a3b \
  --hf-checkpoint /path/to/qwen3-30b-a3b-hf \
  --rollout-batch-size 512 \
  --n-samples-per-prompt 8
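As a rough sizing aid, the sketch below multiplies the two rollout flags from the command above. It assumes the slime convention that `--rollout-batch-size` counts prompts per rollout and `--n-samples-per-prompt` counts responses sampled for each prompt. The helper name `samples_per_rollout` is illustrative and is not part of miles.

```python
def samples_per_rollout(rollout_batch_size: int, n_samples_per_prompt: int) -> int:
    """Total responses generated in one rollout: prompts x samples per prompt."""
    if rollout_batch_size <= 0 or n_samples_per_prompt <= 0:
        raise ValueError("both values must be positive")
    return rollout_batch_size * n_samples_per_prompt


# Values taken from the basic training command.
total = samples_per_rollout(rollout_batch_size=512, n_samples_per_prompt=8)
print(total)  # 4096
```

With GRPO, the 8 responses that share a prompt form one group, so advantages are computed relative to the other responses in that group.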
Use this workflow for training large MoE models like DeepSeek V3 or Qwen3-MoE.
# FP8 block scaling (recommended for stability)
export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
export CUDA_DEVICE_MAX_CONNECTIONS=1

python train.py \
  --actor-num-gpus-per-node 8 \
  --rollout-num-gpus 8 \
  --hf-checkpoint /path/to/deepseek-v3 \
  --advantage-estimator grpo \
  --tensor-model-parallel-size 8 \
  --expert-model-parallel-size 4 \
  --prompt-data /path/to/data.jsonl \
  --num-rollout 3000
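To illustrate why block scaling tends to help stability, here is a minimal pure-Python sketch of per-block scale computation. It is not the Transformer Engine implementation: the block size of 128 and the helper names are assumptions made for illustration, while 448 is the largest finite magnitude in the FP8 E4M3 format.

```python
E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def per_tensor_scale(values):
    """One scale for the whole tensor, set by its largest absolute value."""
    amax = max(abs(v) for v in values)
    return E4M3_MAX / amax if amax > 0 else 1.0


def per_block_scales(values, block_size=128):
    """One scale per contiguous block (block_size is an assumed value)."""
    scales = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scales.append(E4M3_MAX / amax if amax > 0 else 1.0)
    return scales


# Two blocks of small values; the second block contains a single outlier.
values = [0.5] * 128 + [0.5] * 127 + [64.0]

tensor_scale = per_tensor_scale(values)
block_scales = per_block_scales(values)
print(tensor_scale)   # 7.0
print(block_scales)   # [896.0, 7.0]
```

With a single tensor-wide scale, the outlier forces every small value into a narrow part of the FP8 range. With per-block scales, only the block that contains the outlier pays that cost.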
Use this workflow for maximum rollout throughput with EAGLE speculative decoding.
miles supports EAGLE speculative decoding via SGLang:
python train.py \
  --actor-num-gpus-per-node 8 \
  --hf-checkpoint /path/to/target-model \
  --sglang-speculative-algorithm EAGLE \
  --sglang-speculative-num-steps 3 \
  --sglang-speculative-eagle-topk 1 \
  --sglang-speculative-num-draft-tokens 4 \
  --sglang-speculative-draft-model-path /path/to/draft-model \
  --advantage-estimator grpo \
  --prompt-data /path/to/data.jsonl
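To reason about how the draft acceptance rate affects throughput, the sketch below uses the standard estimate for chain-style speculative decoding. It assumes a single draft chain (as with `--sglang-speculative-eagle-topk 1`) and that each draft token is accepted independently with the same probability, which is a simplification. The function name is illustrative and does not come from miles or SGLang.

```python
def expected_tokens_per_step(acceptance_rate: float, num_steps: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Computes 1 + p + p^2 + ... + p^num_steps: the target model always
    contributes one token, and draft token i survives only if all draft
    tokens before it were accepted.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    if acceptance_rate == 1.0:
        return float(num_steps + 1)
    return (1.0 - acceptance_rate ** (num_steps + 1)) / (1.0 - acceptance_rate)


# Compare draft lengths at a few acceptance rates.
for p in (0.5, 0.8):
    for steps in (2, 3):
        print(f"p={p} steps={steps} -> {expected_tokens_per_step(p, steps):.3f}")
```

Under this model, when the acceptance rate drops, the later draft steps contribute little, which is why a shorter draft can be the better trade-off as the policy drifts away from the draft model.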
For online SFT of the draft model during training, add:
--mtp-num-layers 1 \
--enable-mtp-training \
--mtp-loss-scaling-factor 0.2
Note: Online MTP training requires a torch dist checkpoint with MTP weights. Add --mtp-num-layers 1 during checkpoint conversion from HuggingFace.
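The role of `--mtp-loss-scaling-factor` can be sketched as follows. This assumes the Megatron convention in which the MTP losses are averaged across MTP layers, multiplied by the scaling factor, and added to the main loss as an auxiliary objective. How miles combines this term with the RL objective internally is not confirmed by this document, and `combined_loss` is a hypothetical name.

```python
def combined_loss(main_loss: float,
                  mtp_losses: list[float],
                  mtp_loss_scaling_factor: float = 0.2) -> float:
    """Main loss plus the scaled mean of the per-layer MTP losses."""
    if not mtp_losses:
        # No MTP layers configured: nothing to add.
        return main_loss
    mtp_mean = sum(mtp_losses) / len(mtp_losses)
    return main_loss + mtp_loss_scaling_factor * mtp_mean


# One MTP layer, matching --mtp-num-layers 1 and a factor of 0.2.
print(combined_loss(main_loss=1.0, mtp_losses=[2.0]))
```

A small factor keeps the draft-model objective from dominating the policy update while still training the MTP weights online.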
miles inherits all slime arguments. See slime API Reference for the complete list.
--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--rollout-num-gpus-per-engine 2
--colocate

--tensor-model-parallel-size 8
--pipeline-model-parallel-size 2
--expert-model-parallel-size 4 # MoE expert parallelism

--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4
--sglang-enable-draft-weights-cpu-backup
--sglang-speculative-draft-model-path /your/draft/model/path

--mtp-num-layers 1
--enable-mtp-training
--mtp-loss-scaling-factor 0.2
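The resource flags above relate to each other arithmetically. The sketch below assumes the slime convention that `--rollout-num-gpus` is the total GPU count for rollout and that each SGLang engine occupies exactly `--rollout-num-gpus-per-engine` GPUs. The helper names are illustrative, not part of miles.

```python
def num_rollout_engines(rollout_num_gpus: int, rollout_num_gpus_per_engine: int) -> int:
    """Number of inference engines that fit into the rollout GPU budget."""
    if rollout_num_gpus_per_engine <= 0:
        raise ValueError("rollout_num_gpus_per_engine must be positive")
    if rollout_num_gpus % rollout_num_gpus_per_engine != 0:
        raise ValueError("rollout GPUs must divide evenly across engines")
    return rollout_num_gpus // rollout_num_gpus_per_engine


def actor_gpus(actor_num_nodes: int, actor_num_gpus_per_node: int) -> int:
    """Total GPUs used by the training actor."""
    return actor_num_nodes * actor_num_gpus_per_node


print(num_rollout_engines(8, 2))  # 4
print(actor_gpus(1, 8))           # 8
```

With `--colocate`, training and rollout share the same GPUs, so in this example the 8 actor GPUs and the 8 rollout GPUs are the same devices.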
The following features are documented in miles, but the specific CLI flags may vary between versions. Consult the miles repository for the latest configuration.
miles provides end-to-end FP8 sampling and training, which eliminates the quantization-induced discrepancy that causes RL collapse in MoE models.
R3 records expert routing decisions during SGLang inference and replays them during Megatron training for bit-wise expert alignment.
How R3 Works:
The recorded routing decisions are stored in sample.rollout_routed_experts.

INT4 support enables single-machine deployment of 1TB+ models (e.g., on H200).
Memory Savings with INT4:
| Model Size | BF16 VRAM | INT4 VRAM | Reduction |
|------------|-----------|-----------|-----------|
| 70B | 140GB | 45GB | 3.1x |
| 235B | 470GB | 150GB | 3.1x |
| 671B | 1.3TB | 420GB | 3.1x |
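The BF16 column can be reproduced from the parameter count alone, as the sketch below shows. It counts weights only, at 2 bytes per parameter for BF16. Raw 4-bit weights come out smaller than the INT4 column in the table; the gap presumably covers quantization scales and tensors kept at higher precision, but the source does not break this down. The function name is illustrative.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


for size in (70, 235, 671):
    bf16 = weight_memory_gb(size, 16)
    int4_raw = weight_memory_gb(size, 4)
    print(f"{size}B: BF16 {bf16:.0f} GB, raw INT4 weights {int4_raw:.1f} GB")
```

For example, 70B parameters at 16 bits give 140 GB, matching the first row, while the raw 4-bit figure of 35 GB is a lower bound on the 45GB listed.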
miles achieves "exactly 0 KL divergence" between training and inference through:
- torch.compile integration

miles uses the same Sample dataclass as slime, with the rollout_routed_experts field for MoE routing replay:
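from __future__ import annotations  # keeps the Status annotation unevaluated in this excerpt

from dataclasses import dataclass


# Status is defined in slime (see the slime API Reference); it is not redefined here.
@dataclass
class Sample:
    prompt: str | list[dict]
    tokens: list[int]
    response: str
    reward: float | dict
    loss_mask: list[int]
    status: Status
    metadata: dict
    rollout_log_probs: list[float]
    rollout_routed_experts: list[list[int]]  # MoE routing for R3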
See slime API Reference for the complete Sample definition.
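The idea behind the rollout_routed_experts field can be sketched with a toy single-layer router. This is an illustration under stated assumptions, not the miles or Megatron implementation: the helper names are hypothetical, the logits are made up, and real models record routing per layer as well as per token.

```python
import math


def top_k_experts(router_logits, k):
    """Return the indices of the k largest router logits, highest first."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    return order[:k]


def gate_weights(router_logits, expert_ids):
    """Softmax restricted to the selected experts (numerically stabilized)."""
    peak = max(router_logits[i] for i in expert_ids)
    exps = [math.exp(router_logits[i] - peak) for i in expert_ids]
    total = sum(exps)
    return [e / total for e in exps]


# Rollout: the inference engine picks experts per token and records them.
rollout_logits = [[2.0, 1.0, 0.99, -1.0], [0.3, 0.2, 1.5, 1.4]]
rollout_routed_experts = [top_k_experts(row, k=2) for row in rollout_logits]

# Training: a tiny numerical difference flips the near-tie for token 0.
train_logits = [[2.0, 0.99, 1.0, -1.0], [0.3, 0.2, 1.5, 1.4]]
recomputed = [top_k_experts(row, k=2) for row in train_logits]

# Replay: keep the recorded expert choice, but compute the gate weights
# from the training-side logits.
replayed_weights = [gate_weights(row, ids)
                    for row, ids in zip(train_logits, rollout_routed_experts)]

print(rollout_routed_experts)  # [[0, 1], [2, 3]]
print(recomputed)              # [[0, 2], [2, 3]]
```

Without replay, token 0 would be sent to expert 2 during training even though expert 1 produced the sampled output. Replaying the recorded indices keeps both passes on the same experts.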
Symptoms: Loss explodes, NaN values
Solutions:
- export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
- --lr 5e-7

Symptoms: Low acceptance rate over time
Solutions:
- --sglang-speculative-num-steps 2
- --sglang-enable-draft-weights-cpu-backup

Symptoms: Policy divergence, reward collapse
Solutions:
- --use-tis --tis-threshold 0.9

| Family | Models | MoE Support |
|--------|--------|-------------|
| DeepSeek | R1, V3, V3.2 | Full |
| Qwen | 2, 2.5, 3 (including MoE) | Full |
| Llama | 3, 3.1, 3.3, 4 | Dense only |
| Gemma | 2, 3, 3N | Dense only |
| GLM | 4.5, 4.6, 4.7 | Dense only |
| MiniMax | M2, M2.1 | Full |