cli-tool/components/skills/ai-research/distributed-training-megatron-core/SKILL.md
Trains large language models (2B-462B parameters) using NVIDIA Megatron-Core with advanced parallelism strategies. Use it when training models with more than 1B parameters, when you need maximum GPU efficiency (47% MFU on H100), or when you require tensor, pipeline, sequence, context, or expert parallelism. Production-ready framework used for Nemotron, LLaMA, and DeepSeek.
npx skillsauth add davila7/claude-code-templates training-llms-megatron
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Megatron-Core trains LLMs from 2B to 462B parameters with up to 47% Model FLOP Utilization on H100 GPUs through advanced parallelism strategies.
Installation:
# Docker (recommended)
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:25.04-py3
# Or pip
pip install megatron-core
Simple distributed training:
# Train with 2 GPUs using data parallelism
torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py
# Or LLaMA-3 8B training
./examples/llama/train_llama3_8b_fp8.sh
Copy this checklist:
LLaMA Training Setup:
- [ ] Step 1: Choose parallelism configuration
- [ ] Step 2: Configure training hyperparameters
- [ ] Step 3: Launch distributed training
- [ ] Step 4: Monitor performance metrics
Step 1: Choose parallelism configuration
Model size determines parallelism strategy:
| Model Size | GPUs | Tensor Parallel | Pipeline Parallel | Data Parallel | Context Parallel |
|------------|------|-----------------|-------------------|---------------|------------------|
| 7B | 8 | 1 | 1 | 8 | 1 |
| 13B | 8 | 2 | 1 | 4 | 1 |
| 70B | 64 | 4 | 4 | 4 | 1 |
| 405B | 128 | 8 | 8 | 1 | 2 |
In every row, TP × PP × DP × CP equals the GPU count.
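Each layout is meant to satisfy one constraint: the product of the tensor, pipeline, context, and data parallel degrees equals the GPU count. A minimal Python sketch of that check follows; `data_parallel_size` is a hypothetical planning helper, not part of Megatron-Core.

```python
def data_parallel_size(total_gpus: int, tp: int, pp: int, cp: int = 1) -> int:
    """Return the data-parallel degree left over once TP, PP, and CP are fixed.

    Hypothetical helper; assumes total_gpus = TP * PP * CP * DP.
    """
    model_parallel = tp * pp * cp
    if total_gpus % model_parallel != 0:
        raise ValueError(
            f"{total_gpus} GPUs cannot be split evenly by TP*PP*CP={model_parallel}"
        )
    return total_gpus // model_parallel


if __name__ == "__main__":
    # 70B row: 64 GPUs with TP=4, PP=4, CP=1
    print("70B data-parallel size:", data_parallel_size(64, tp=4, pp=4))
    # 13B row: 8 GPUs with TP=2, PP=1, CP=1
    print("13B data-parallel size:", data_parallel_size(8, tp=2, pp=1))
```

Running the check before launching catches layouts whose degrees do not multiply out to the available GPUs.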
Step 2: Configure training hyperparameters
#!/bin/bash
# train_llama_70b.sh
GPUS_PER_NODE=8
NNODES=8 # 64 GPUs total
TP=4 # Tensor parallel
PP=4 # Pipeline parallel
CP=1 # Context parallel
# LLaMA 70B configuration
MODEL_SIZE=70 # Billion parameters
HIDDEN_SIZE=8192
NUM_LAYERS=80
NUM_HEADS=64
SEQ_LENGTH=4096
# Training hyperparameters
MICRO_BATCH=1
GLOBAL_BATCH=1024
LR=3e-4
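# For multi-node runs, set NODE_RANK, MASTER_ADDR, and MASTER_PORT on every node
torchrun \
--nproc_per_node=$GPUS_PER_NODE \
--nnodes=$NNODES \
--node_rank=${NODE_RANK:-0} \
--master_addr=${MASTER_ADDR:-localhost} \
--master_port=${MASTER_PORT:-6000} \
pretrain_gpt.py \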
--tensor-model-parallel-size $TP \
--pipeline-model-parallel-size $PP \
--context-parallel-size $CP \
--sequence-parallel \
--num-layers $NUM_LAYERS \
--hidden-size $HIDDEN_SIZE \
--num-attention-heads $NUM_HEADS \
--seq-length $SEQ_LENGTH \
--max-position-embeddings $SEQ_LENGTH \
--micro-batch-size $MICRO_BATCH \
--global-batch-size $GLOBAL_BATCH \
--lr $LR \
--train-iters 100000 \
--lr-decay-style cosine \
--lr-warmup-iters 2000 \
--weight-decay 0.1 \
--clip-grad 1.0 \
--bf16 \
--use-mcore-models \
--transformer-impl transformer_engine \
--data-path /path/to/data \
--vocab-file /path/to/vocab.json \
--merge-file /path/to/merges.txt
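The batch settings in this script imply a gradient-accumulation count that the script never states. The sketch below derives it, assuming the usual relation global batch = micro batch × data-parallel size × accumulation steps. Both function names are illustrative.

```python
def accumulation_steps(global_batch: int, micro_batch: int, dp: int) -> int:
    """Micro-batches each data-parallel rank processes per optimizer step."""
    per_pass = micro_batch * dp
    if global_batch % per_pass != 0:
        raise ValueError(
            f"global batch {global_batch} is not divisible by "
            f"micro_batch*dp={per_pass}"
        )
    return global_batch // per_pass


def tokens_per_iteration(global_batch: int, seq_length: int) -> int:
    """Tokens consumed by one optimizer step."""
    return global_batch * seq_length


if __name__ == "__main__":
    # Values from the 70B script: 64 GPUs, TP=4, PP=4, CP=1, so DP = 64 / 16 = 4
    dp = 64 // (4 * 4 * 1)
    print("accumulation steps:", accumulation_steps(1024, 1, dp))
    print("tokens per iteration:", tokens_per_iteration(1024, 4096))
```

With these settings each data-parallel rank runs 256 micro-batches per step, and one step consumes about 4.2M tokens.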
Step 3: Launch distributed training
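# Single node (8 GPUs): set NNODES=1 and choose TP × PP ≤ 8 first; the 70B settings above need 64 GPUs
bash train_llama_70b.sh
# Multi-node with SLURM (the script must start torchrun on every node, for example through srun)
sbatch --nodes=8 --gpus-per-node=8 train_llama_70b.sh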
Step 4: Monitor performance metrics
Key metrics to track:
Model FLOP Utilization (MFU): Target >40% on H100
Throughput: Tokens/sec/GPU
Memory usage: <80GB per GPU for 70B model
Loss: Should decrease steadily
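MFU can be estimated from measured throughput. The sketch below uses the common approximation of 6 FLOPs per parameter per token for a combined forward and backward pass, ignoring attention FLOPs, and assumes a dense BF16 peak of 989 TFLOP/s for an H100 SXM. Both are assumptions; adjust them for your hardware and model.

```python
H100_BF16_PEAK_FLOPS = 989e12  # assumed dense BF16 peak for H100 SXM


def mfu(tokens_per_sec_per_gpu: float, n_params: float,
        peak_flops: float = H100_BF16_PEAK_FLOPS) -> float:
    """Approximate Model FLOP Utilization using the 6 * N FLOPs/token rule."""
    achieved_flops = 6.0 * n_params * tokens_per_sec_per_gpu
    return achieved_flops / peak_flops


def tokens_per_sec_for_mfu(target_mfu: float, n_params: float,
                           peak_flops: float = H100_BF16_PEAK_FLOPS) -> float:
    """Per-GPU throughput needed to reach a target MFU."""
    return target_mfu * peak_flops / (6.0 * n_params)


if __name__ == "__main__":
    # A 70B model at 1000 tokens/sec/GPU
    print(f"MFU: {mfu(1000, 70e9):.1%}")
    # Throughput needed for the 40% target
    print(f"tokens/sec/GPU for 40% MFU: {tokens_per_sec_for_mfu(0.40, 70e9):.0f}")
```

Under these assumptions, a 70B model needs roughly 940 tokens/sec/GPU to reach the 40% target.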
Use this workflow for sparse Mixture-of-Experts (MoE) models such as Mixtral.
MoE Training:
- [ ] Step 1: Configure expert parallelism
- [ ] Step 2: Set MoE hyperparameters
- [ ] Step 3: Launch training with EP
Step 1: Configure expert parallelism
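# Mixtral 8x7B example
TENSOR_PARALLEL=2
PIPELINE_PARALLEL=1
EXPERT_PARALLEL=4 # Split 8 experts across 4 expert-parallel ranks (2 experts each)
DATA_PARALLEL=4
# Megatron-Core forms expert-parallel groups from the data-parallel ranks,
# so EP must divide DP and does not multiply the GPU count.
TOTAL_GPUS=$((TENSOR_PARALLEL * PIPELINE_PARALLEL * DATA_PARALLEL))
# = 2 * 1 * 4 = 8 GPUs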
Step 2: Set MoE hyperparameters
torchrun \
--nproc_per_node=8 \
pretrain_gpt.py \
--tensor-model-parallel-size 2 \
--pipeline-model-parallel-size 1 \
--expert-model-parallel-size 4 \
--num-experts 8 \
--moe-router-topk 2 \
--moe-router-load-balancing-type aux_loss \
--moe-aux-loss-coeff 0.01 \
--hidden-size 4096 \
--num-layers 32 \
--num-attention-heads 32 \
--seq-length 4096 \
--max-position-embeddings 4096 \
--bf16 \
--use-mcore-models \
--transformer-impl transformer_engine \
--data-path /path/to/data \
--vocab-file /path/to/vocab.json \
--merge-file /path/to/merges.txt
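The router flags above select two experts per token and add an auxiliary loss that penalizes uneven expert usage. The following pure-Python sketch illustrates the idea with a Switch-Transformer-style balancing term. It is an illustration under those assumptions, not Megatron-Core's implementation, and its scaling may differ from the library's.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits: List[float], k: int = 2) -> List[Tuple[int, float]]:
    """Pick the k most probable experts and renormalize their gate weights."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def load_balancing_loss(batch_logits: List[List[float]], k: int = 2,
                        coeff: float = 0.01) -> float:
    """coeff * num_experts * sum_i(f_i * P_i).

    f_i: fraction of routed token slots sent to expert i.
    P_i: mean router probability assigned to expert i.
    """
    num_tokens = len(batch_logits)
    num_experts = len(batch_logits[0])
    dispatch = [0.0] * num_experts
    mean_prob = [0.0] * num_experts
    for logits in batch_logits:
        for i, p in enumerate(softmax(logits)):
            mean_prob[i] += p / num_tokens
        for i, _ in top_k_route(logits, k):
            dispatch[i] += 1.0 / (num_tokens * k)
    return coeff * num_experts * sum(f * p for f, p in zip(dispatch, mean_prob))


if __name__ == "__main__":
    print(top_k_route([0.1, 2.0, -1.0, 1.5], k=2))
```

When routing is perfectly balanced the loss equals `coeff`; it grows as tokens concentrate on fewer experts.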
Step 3: Launch training with EP
Expert parallelism distributes different experts across GPUs, reducing memory while maintaining capacity.
Expert parameters per GPU without EP: 8 experts × 7B = 56B
Expert parameters per GPU with EP=4: 2 experts × 7B = 14B
Savings: 75% reduction in expert weight memory
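The expert-parallel arithmetic generalizes to any expert count and EP degree. A small sketch, assuming experts divide evenly across expert-parallel ranks and counting expert weights only (attention weights, optimizer state, and activations are excluded); the helper names are hypothetical.

```python
def experts_per_rank(num_experts: int, ep: int) -> int:
    """Number of experts each expert-parallel rank holds."""
    if num_experts % ep != 0:
        raise ValueError(f"{num_experts} experts do not divide evenly across EP={ep}")
    return num_experts // ep


def expert_weight_saving(ep: int) -> float:
    """Fraction of expert weight memory saved per GPU relative to EP=1."""
    return 1.0 - 1.0 / ep


if __name__ == "__main__":
    print("experts per rank:", experts_per_rank(8, 4))
    print(f"expert weight saving: {expert_weight_saving(4):.0%}")
```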
Use this workflow to approach the 47% MFU reported on H100.
Performance Optimization:
- [ ] Step 1: Enable optimizations
- [ ] Step 2: Use FP8 precision (H100)
- [ ] Step 3: Optimize micro-batch size
- [ ] Step 4: Tune parallelism degrees
Step 1: Enable optimizations
--use-mcore-models # Use Megatron Core models
--transformer-impl transformer_engine # Use Transformer Engine
--sequence-parallel # Reduce activation memory (use with TP)
Step 2: Use FP8 precision (H100 only)
--fp8-format hybrid # FP8 mixed precision training (older releases used --fp8-hybrid)
# Transformer Engine handles FP8 automatically
Result: 1.5-2x speedup on H100 vs BF16.
Step 3: Optimize micro-batch size
Find largest micro-batch that fits in memory:
# Start with 1, increase until OOM
for MBS in 1 2 4 8; do
echo "Testing micro-batch-size=$MBS"
# Stop at the first failing size; the previous value is the largest that fits
torchrun ... --micro-batch-size $MBS || break
done
Typical values:
Step 4: Tune parallelism degrees
Rules of thumb:
Tensor Parallel: Use ≤8 (limited by NVLink within node)
Pipeline Parallel: Use for >70B models
Context Parallel: Use for sequences >8K tokens
Data Parallel: Fill remaining GPUs
Example 405B on 128 H100s:
TP=8 (1 node)
PP=8 (across nodes)
CP=2 (long sequences)
DP=1
Total = 8 × 8 × 2 × 1 = 128 GPUs
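Some of these rules of thumb can be checked mechanically before a job is submitted. The sketch below encodes three of them (the GPU product, the per-node limit on TP, and the sequence-length threshold for CP). `check_layout` and its thresholds are illustrative assumptions taken from the rules above, not a Megatron-Core API.

```python
from typing import List, Optional


def check_layout(total_gpus: int, tp: int, pp: int, cp: int, dp: int,
                 gpus_per_node: int = 8,
                 seq_length: Optional[int] = None) -> List[str]:
    """Return human-readable warnings for a parallelism layout (empty if none)."""
    warnings = []
    product = tp * pp * cp * dp
    if product != total_gpus:
        warnings.append(f"TP*PP*CP*DP = {product}, but {total_gpus} GPUs requested")
    if tp > gpus_per_node:
        warnings.append(f"TP={tp} exceeds {gpus_per_node} GPUs per node (NVLink domain)")
    if cp > 1 and seq_length is not None and seq_length <= 8192:
        warnings.append(f"CP={cp} is rarely useful for sequences of {seq_length} tokens")
    return warnings


if __name__ == "__main__":
    # The 405B example: TP=8, PP=8, CP=2, DP=1 on 128 GPUs with long sequences
    print(check_layout(128, tp=8, pp=8, cp=2, dp=1, seq_length=32768))
    # A layout that spreads TP across nodes
    print(check_layout(128, tp=16, pp=4, cp=2, dp=1, seq_length=32768))
```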
Use Megatron-Core when:
Use alternatives instead:
Issue: Low GPU utilization (<30% MFU)
Causes:
Fixes:
# Increase micro-batch
--micro-batch-size 4 # Was 1
# Enable optimizations
--use-flash-attn
--sequence-parallel
# Reduce TP if >8
--tensor-model-parallel-size 4 # Was 16
Issue: Out of memory
Reduce memory with:
--tensor-model-parallel-size 2 # Split model across GPUs
--recompute-granularity full # Activation recomputation (gradient checkpointing)
--recompute-method block # Recompute a fixed number of layers per pipeline stage
--recompute-num-layers 1 # Layers recomputed per stage; raise to save more memory
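To judge how much recomputation helps, a rough per-layer activation estimate is useful. The sketch below uses the widely cited approximation of s·b·h·(34 + 5·a·s/h) bytes per transformer layer for 16-bit activations, divided by the TP degree when sequence parallelism is enabled, and about 2·s·b·h bytes when a layer is fully recomputed. It assumes standard (non-flash) attention and is an estimate, not a measurement.

```python
def activation_bytes_per_layer(seq_len: int, micro_batch: int, hidden: int,
                               heads: int, tp: int = 1,
                               full_recompute: bool = False) -> float:
    """Approximate activation memory (bytes) stored for one transformer layer.

    Assumes 16-bit activations and tensor plus sequence parallelism of degree tp.
    With full recomputation only the layer input (2 bytes per element) is kept;
    this sketch does not shard that input, so it is an upper bound.
    """
    sbh = seq_len * micro_batch * hidden
    if full_recompute:
        return 2.0 * sbh
    attention_term = 5.0 * heads * seq_len / hidden
    return sbh * (34.0 + attention_term) / tp


if __name__ == "__main__":
    gib = 1024 ** 3
    # 70B settings: seq 4096, micro-batch 1, hidden 8192, 64 heads, TP=4
    stored = activation_bytes_per_layer(4096, 1, 8192, 64, tp=4)
    recomputed = activation_bytes_per_layer(4096, 1, 8192, 64, tp=4,
                                            full_recompute=True)
    print(f"stored:     {stored / gib:.2f} GiB per layer")
    print(f"recomputed: {recomputed / gib:.3f} GiB per layer")
```

Multiply by the number of layers on a pipeline stage (and by the micro-batches in flight) to approximate the stage total.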
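Or offload the optimizer to CPU (flag names differ between Megatron-LM releases and forks, so confirm them with `python pretrain_gpt.py --help`):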
--cpu-optimizer # Offload optimizer to CPU
--cpu-optimizer-type ADAM # CPU Adam variant
Issue: Training slower than expected
Check:
--num-layers-per-virtual-pipeline-stage 2
--dataloader-type cyclic
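Interleaved (virtual) pipeline stages help because they shrink the pipeline bubble. A sketch of the standard estimate, where bubble time relative to ideal compute time is (p − 1) / (v · m) for p pipeline stages, m micro-batches per step, and v virtual stages per GPU; the function names are illustrative.

```python
def virtual_stages(num_layers: int, pp: int, layers_per_virtual_stage: int) -> int:
    """Virtual pipeline stages per GPU implied by the layer counts."""
    layers_per_stage = num_layers // pp
    return layers_per_stage // layers_per_virtual_stage


def bubble_fraction(pp: int, num_microbatches: int, v: int = 1) -> float:
    """Pipeline bubble time as a fraction of ideal compute time."""
    return (pp - 1) / (v * num_microbatches)


if __name__ == "__main__":
    # 70B settings: 80 layers, PP=4, and 256 micro-batches per step
    # (global batch 1024, micro-batch 1, DP=4)
    v = virtual_stages(80, 4, layers_per_virtual_stage=2)
    print("virtual stages per GPU:", v)
    print(f"bubble without interleaving: {bubble_fraction(4, 256):.4%}")
    print(f"bubble with interleaving:    {bubble_fraction(4, 256, v):.4%}")
```

With many micro-batches per step the bubble is already small, so interleaving matters most when the accumulation count is low.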
Issue: Diverging loss
Stabilize training:
--lr-warmup-iters 2000 # Longer warmup
--clip-grad 1.0 # Gradient clipping
--init-method-std 0.006 # Smaller init
--attention-dropout 0.0 # No dropout in attention
--hidden-dropout 0.0 # No dropout in FFN
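The warmup and cosine-decay settings interact, so it can help to print the learning rate at a few iterations before launching. The sketch below implements linear warmup followed by cosine decay using the values from the 70B script; it is a simplified model of the schedule, and the minimum learning rate of 0 is an assumption.

```python
import math


def lr_at(step: int, max_lr: float = 3e-4, min_lr: float = 0.0,
          warmup_iters: int = 2000, decay_iters: int = 100000) -> float:
    """Learning rate at a given iteration: linear warmup, then cosine decay."""
    if warmup_iters > 0 and step < warmup_iters:
        return max_lr * step / warmup_iters
    if step >= decay_iters:
        return min_lr
    progress = (step - warmup_iters) / (decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)


if __name__ == "__main__":
    for step in (0, 1000, 2000, 51000, 100000):
        print(f"iter {step:>6}: lr = {lr_at(step):.2e}")
```

A longer warmup lowers the learning rate seen during the earliest, least stable iterations, which is why it appears among the stabilization flags.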
Parallelism strategies: See references/parallelism-guide.md for detailed comparison of TP/PP/DP/CP/EP with performance analysis and when to use each.
Performance benchmarks: See references/benchmarks.md for MFU numbers across different model sizes and GPU configurations.
Production configurations: See references/production-examples.md for real-world setups from LLaMA 3 405B, Nemotron-4 340B, and DeepSeek-V3 671B.
Training recipes: See references/training-recipes.md for complete hyperparameter configurations for GPT/LLaMA/Mixtral architectures.