engineering/advanced-ml-engineering/skills/distributed-training/SKILL.md
This skill should be used when the user asks about "distributed training", "multi-GPU training", "data parallelism", "model parallelism", "pipeline parallelism", "tensor parallelism", "DDP", "FSDP", "ZeRO", "DeepSpeed", "Megatron-LM", "GPU utilization", "NCCL", "torchrun", "gradient communication", "checkpoint recovery", "spot instance preemption", "NVLink", "InfiniBand", "training throughput", or when a model is too large for a single GPU or training speed needs to be scaled.
Provides systematic guidance for scaling ML training across multiple GPUs, multiple nodes, and heterogeneous hardware configurations. Covers parallelism taxonomy, communication efficiency, fault tolerance, and GPU utilization optimization.
| Strategy | Model Fits on 1 GPU? | Reduces Memory? | Communication Overhead |
|---|---|---|---|
| Data Parallel (DDP) | Yes | No | Low (gradient sync) |
| FSDP | No | Yes (shards all states) | Medium (all-gather) |
| Tensor Parallel (TP) | No | Yes (shards weight matrices) | High (within-layer) |
| Pipeline Parallel (PP) | No | Yes (splits layers) | Medium (activation transfer) |
| 3D Hybrid (TP+PP+DP) | No | Maximum | Highest |
Quick heuristics:
- FSDP with the FULL_SHARD strategy
- 70B parameters → 3D Hybrid parallelism (Megatron-LM / DeepSpeed)
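```python
# Launch: torchrun --nproc_per_node=8 train.py
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DistributedSampler

# Assumes init_process_group("nccl") was called and model, dataset, local_rank exist
model = DistributedDataParallel(model, device_ids=[local_rank])
sampler = DistributedSampler(dataset)  # ensures no overlap across GPUs
```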
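To make the sampler's behaviour concrete, here is a minimal pure-Python sketch of how indices can be split across ranks when shuffling is disabled. The function name `shard_indices` is hypothetical, and the sketch assumes the dataset has at least as many samples as there are ranks. It mirrors the pad-then-stride scheme rather than reproducing the library's code.

```python
import math


def shard_indices(dataset_len: int, world_size: int, rank: int) -> list[int]:
    """Indices one rank would iterate over, with shuffling disabled."""
    indices = list(range(dataset_len))
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    # Pad by repeating indices from the start so every rank gets the same count.
    indices += indices[: total - dataset_len]
    # Each rank takes every world_size-th index, offset by its own rank.
    return indices[rank:total:world_size]


# Hypothetical example: 10 samples split over 4 ranks.
shards = [shard_indices(10, world_size=4, rank=r) for r in range(4)]
```

Every rank receives the same number of samples, so a few indices are repeated when the dataset size is not divisible by the number of ranks. When shuffling is enabled, calling `sampler.set_epoch(epoch)` at the start of each epoch gives a different ordering per epoch.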
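DDP synchronizes gradients automatically: during backward(), each gradient bucket is all-reduced across ranks via NCCL (ring all-reduce), overlapping communication with the remaining backward computation.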
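The ring all-reduce pattern can be illustrated with a small single-process simulation. This is a teaching sketch, not NCCL's implementation: the function name `ring_allreduce` is hypothetical, ranks are modelled as plain lists, and the gradient length is assumed to be divisible by the number of ranks. It runs a reduce-scatter phase followed by an all-gather phase, each taking `n - 1` steps.

```python
def ring_allreduce(grads: list[list[float]]) -> list[list[float]]:
    """Simulate a sum all-reduce over a ring; grads[r] is rank r's gradient."""
    n = len(grads)
    chunk = len(grads[0]) // n  # assumes length is divisible by n
    data = [list(g) for g in grads]

    # Phase 1: reduce-scatter. At each step, rank r sends one chunk to rank r+1,
    # which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = [(r, (r - step) % n) for r in range(n)]
        payloads = [data[r][c * chunk:(c + 1) * chunk] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, value in enumerate(payload):
                data[dst][c * chunk + i] += value

    # Now rank r holds the fully reduced chunk (r + 1) % n.
    # Phase 2: all-gather. Reduced chunks travel around the ring and overwrite.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        payloads = [data[r][c * chunk:(c + 1) * chunk] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            data[dst][c * chunk:(c + 1) * chunk] = payload
    return data


result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

Each rank sends only one chunk per step, so the per-rank traffic stays roughly constant as the ring grows. DDP additionally divides the summed gradients by the world size to obtain an average.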
Bucket tuning: bucket_cap_mb=100 (default: 25MB) — larger buckets reduce communication rounds but increase memory. Tune based on model size and interconnect bandwidth.
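As a rough back-of-the-envelope check on the trade-off, the number of all-reduce calls per backward pass can be approximated from the gradient size and the bucket cap. The helper `approx_num_buckets` is hypothetical and assumes fp32 gradients packed perfectly into buckets. DDP actually assigns whole parameter tensors to buckets, so real counts differ somewhat.

```python
import math


def approx_num_buckets(num_params: int, bucket_cap_mb: int,
                       bytes_per_grad: int = 4) -> int:
    """Approximate bucket count, assuming gradients fill buckets exactly."""
    grad_bytes = num_params * bytes_per_grad
    bucket_bytes = bucket_cap_mb * 1024 * 1024
    return math.ceil(grad_bytes / bucket_bytes)


# Hypothetical 1B-parameter model with fp32 gradients.
default_buckets = approx_num_buckets(1_000_000_000, bucket_cap_mb=25)
large_buckets = approx_num_buckets(1_000_000_000, bucket_cap_mb=100)
```

Under these assumptions the larger cap cuts the number of communication rounds by about a factor of four, at the cost of larger staging buffers.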
FSDP shards model parameters, gradients, and optimizer states across all ranks:

- Use transformer_auto_wrap_policy to wrap Transformer blocks for efficient sharding

FSDP can make 70B+ parameter models trainable on 8× A100s (80GB) without offloading, provided the sharded states and activations fit within the 640 GB of aggregate GPU memory.
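The memory effect of sharding can be estimated with simple arithmetic. The sketch below assumes mixed-precision Adam at roughly 16 bytes per parameter (2 for half-precision weights, 2 for gradients, 12 for the fp32 master weights and two Adam moments) and ignores activations and temporary buffers. The function name and the 7B example are illustrative assumptions, not measurements.

```python
def per_gpu_state_gb(num_params: int, world_size: int, sharded: bool,
                     bytes_per_param: int = 16) -> float:
    """Approximate per-GPU memory (GB) for parameters, gradients, optimizer state."""
    total_bytes = num_params * bytes_per_param
    if sharded:
        # Full sharding splits all three kinds of state evenly across ranks.
        total_bytes /= world_size
    return total_bytes / 1e9


# Hypothetical 7B-parameter model on 8 GPUs.
replicated = per_gpu_state_gb(7_000_000_000, world_size=8, sharded=False)
fully_sharded = per_gpu_state_gb(7_000_000_000, world_size=8, sharded=True)
```

With replication (DDP) the model state alone exceeds an 80 GB device in this example, while full sharding leaves most of each GPU free for activations.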
See references/parallelism-strategies.md for TP, PP, and 3D hybrid implementation patterns.
Gradient compression (for bandwidth-limited clusters):
DeepSpeed ZeRO Optimizer:
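A minimal configuration sketch for ZeRO, written as a Python dictionary. The keys shown are standard DeepSpeed config fields, but every value here (stage 2, batch sizes, bf16) is an illustrative assumption to be tuned per model and cluster.

```python
import json

# Hypothetical starting point; all values are assumptions, not recommendations.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                    # shard optimizer states and gradients
        "overlap_comm": True,          # overlap reduction with backward compute
        "contiguous_gradients": True,  # reduce memory fragmentation
    },
}

# Serialize so the same settings can be stored as a JSON config file.
ds_config_json = json.dumps(ds_config, indent=2)
```

A dictionary like this is typically passed to `deepspeed.initialize(...)` through its `config` argument, or saved to a JSON file and referenced from the launch command.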
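torch.compile(): Dynamo + Inductor compilation, 1.5–2.5× throughput with a one-line change (wrapping the model in torch.compile()).

torchrun --max_restarts=3 for automatic fault recovery. torchrun restarts the worker processes; the training script itself must reload its latest checkpoint on startup.

See references/cluster-management.md for Kubernetes ML cluster setup, autoscaling, and multi-cloud training patterns.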
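Automatic restarts only help if the script can resume from saved state. The sketch below shows one way to structure that using only the standard library: `pickle` stands in for `torch.save`/`torch.load`, and the function names and the `crash_at` parameter are hypothetical. The checkpoint is written to a temporary file and renamed into place, so a preemption mid-write cannot leave a truncated checkpoint behind.

```python
import os
import pickle
import tempfile


def save_checkpoint(state: dict, path: str) -> None:
    """Write to a temp file in the same directory, then atomically rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic when source and target share a filesystem


def load_checkpoint(path: str):
    """Return the saved state, or None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def train(path: str, total_steps: int, crash_at=None) -> int:
    """Resume from the last checkpoint if present; crash_at simulates preemption."""
    state = load_checkpoint(path) or {"step": 0}
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated preemption")
        state["step"] += 1  # placeholder for one real training step
        save_checkpoint(state, path)
    return state["step"]
```

In a real DDP job, typically only rank 0 writes the checkpoint, and saving every step as done here would be too frequent; the interval is a tuning decision.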
| Utilization | Status | Primary Cause |
|---|---|---|
| ≥ 85% | Excellent | None |
| 70–85% | Acceptable | Minor data loading or scheduling overhead |
| 50–70% | Poor | CPU/IO bottleneck or small batch size |
| < 50% | Critical | Synchronous data loading, GPU idle waiting |
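The thresholds in the table can be encoded as a small helper for monitoring scripts. The function name is hypothetical, and the sketch assumes each band includes its lower bound (so exactly 70% counts as Acceptable), which the table leaves ambiguous.

```python
def utilization_status(gpu_util_pct: float) -> str:
    """Map a GPU utilization percentage to the status bands in the table."""
    if gpu_util_pct >= 85:
        return "Excellent"
    if gpu_util_pct >= 70:
        return "Acceptable"
    if gpu_util_pct >= 50:
        return "Poor"
    return "Critical"


# Hypothetical readings sampled during a training run.
statuses = [utilization_status(p) for p in (92, 70, 55, 30)]
```

The input percentage could come from `nvidia-smi --query-gpu=utilization.gpu --format=csv`, averaged over a window rather than read once.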
Profile with torch.profiler.profile() for the first 100 steps; identify the bottleneck category before optimizing.
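Before reaching for a full profiler trace, a coarse timing loop can indicate whether the input pipeline is the bottleneck category. This sketch is a complement to `torch.profiler`, not a replacement. The function `data_wait_fraction` is hypothetical, and it assumes `step_fn` blocks until the step has finished (on GPU, that means calling `torch.cuda.synchronize()` inside it, since CUDA kernels launch asynchronously).

```python
import time


def data_wait_fraction(batches, step_fn, max_steps: int = 100) -> float:
    """Fraction of wall time spent waiting for the next batch."""
    wait = compute = 0.0
    iterator = iter(batches)
    for _ in range(max_steps):
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time blocked on the data loader
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)              # forward, backward, optimizer step
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    total = wait + compute
    return wait / total if total else 0.0
```

A fraction near zero suggests the GPU is being fed fast enough. A large fraction points at data loading, matching the low-utilization rows of the table above.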