Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

harsh040506/distributed-training

Name: distributed-training
Author: harsh040506

engineering/advanced-ml-engineering/skills/distributed-training/SKILL.md

npx skillsauth add harsh040506/claude-code-unified-skill-plugin-library distributed-training

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Distributed Training — Parallelism Strategies and Cluster Management

Provides systematic guidance for scaling ML training across multiple GPUs, multiple nodes, and heterogeneous hardware configurations. Covers parallelism taxonomy, communication efficiency, fault tolerance, and GPU utilization optimization.

Parallelism Strategy Selection

| Strategy | Model Fits on 1 GPU? | Reduces Memory? | Communication Overhead | |---|---|---|---| | Data Parallel (DDP) | Yes | No | Low (gradient sync) | | FSDP | No | Yes (shards all states) | Medium (all-gather) | | Tensor Parallel (TP) | No | Yes (shards weight matrices) | High (within-layer) | | Pipeline Parallel (PP) | No | Yes (splits layers) | Medium (activation transfer) | | 3D Hybrid (TP+PP+DP) | No | Maximum | Highest |

Quick heuristics:

< 7B parameters, fits in VRAM → DDP with automatic mixed precision
7B–70B parameters → FSDP with FULL_SHARD strategy
70B parameters → 3D Hybrid parallelism (Megatron-LM / DeepSpeed)

Data Parallel (DDP) — Standard Multi-GPU

# Launch: torchrun --nproc_per_node=8 train.py
model = DistributedDataParallel(model, device_ids=[local_rank])
sampler = DistributedSampler(dataset)  # ensures no overlap across GPUs

All-reduce gradient synchronization happens automatically via NCCL ring-allreduce.

Bucket tuning: bucket_cap_mb=100 (default: 25MB) — larger buckets reduce communication rounds but increase memory. Tune based on model size and interconnect bandwidth.

FSDP — Fully Sharded Data Parallel

Shards model parameters, gradients, and optimizer states across all ranks:

FULL_SHARD: maximum memory saving — parameter gathered only when needed
SHARD_GRAD_OP: shard gradients/optimizer states only — faster but more memory
Use transformer_auto_wrap_policy to wrap Transformer blocks for efficient sharding

FSDP enables training of 70B+ parameter models on 8× A100s (80GB) without offloading.

See references/parallelism-strategies.md for TP, PP, and 3D hybrid implementation patterns.

Communication Efficiency

Gradient compression (for bandwidth-limited clusters):

PowerSGD: low-rank gradient compression, 10–100× compression ratio
1-bit Adam: quantize gradients to 1-bit; apply error feedback accumulation
Use only when NCCL all-reduce is the bottleneck (profile first)

DeepSpeed ZeRO Optimizer:

Stage 1: shard optimizer states only (4× memory reduction)
Stage 2: + shard gradients (8× reduction)
Stage 3: + shard parameters (linear reduction in N GPUs)
CPU offload: ZeRO-Infinity offloads to CPU/NVMe (enabling trillion-parameter models)

Memory Optimization Techniques

Gradient checkpointing: discard activations during forward pass; recompute during backward. Saves O(√L) activations (L = number of layers). ~33% compute overhead, 60–80% VRAM savings.
Activation offloading: swap activations to CPU RAM when VRAM is full.
Flash Attention: I/O-aware attention algorithm — O(n) memory vs. O(n²) for standard attention.
torch.compile(): Dynamo + Inductor compilation, 1.5–2.5× throughput without code changes.

Fault Tolerance

Checkpointing cadence: every 500 steps to shared storage (NFS or S3); retain 3 most recent.
Checkpoint contents: weights, optimizer state, scheduler state, step count, RNG state, Python/library versions.
Elastic training: torchrun --max_restarts=3 for automatic fault recovery.
Spot instance resilience: catch SIGTERM (15s warning), flush checkpoint, exit cleanly, allow elasticity layer to restart.

See references/cluster-management.md for Kubernetes ML cluster setup, autoscaling, and multi-cloud training patterns.

GPU Utilization Targets

| Utilization | Status | Primary Cause | |---|---|---| | ≥ 85% | Excellent | — | | 70–85% | Acceptable | Minor data loading or scheduling overhead | | 50–70% | Poor | CPU/IO bottleneck or small batch size | | < 50% | Critical | Synchronous data loading, GPU idle waiting |

Profile with torch.profiler.profile() for the first 100 steps; identify the bottleneck category before optimizing.

harsh040506/distributed-training

engineering/advanced-ml-engineering/skills/distributed-training/SKILL.md

This skill should be used when the user asks about "distributed training", "multi-GPU training", "data parallelism", "model parallelism", "pipeline parallelism", "tensor parallelism", "DDP", "FSDP", "ZeRO", "DeepSpeed", "Megatron-LM", "GPU utilization", "NCCL", "torchrun", "gradient communication", "checkpoint recovery", "spot instance preemption", "NVLink", "InfiniBand", "training throughput", or when a model is too large for a single GPU or training speed needs to be scaled.

2 stars

testing

Updated Apr 5, 2026

$ install --global

skillsauth

npx skillsauth add harsh040506/claude-code-unified-skill-plugin-library distributed-training

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 5, 2026, 5:10 PM6.6s3 files scanned

SKILL.md

name:: distributed-training
description:: This skill should be used when the user asks about "distributed training", "multi-GPU training", "data parallelism", "model parallelism", "pipeline parallelism", "tensor parallelism", "DDP", "FSDP", "ZeRO", "DeepSpeed", "Megatron-LM", "GPU utilization", "NCCL", "torchrun", "gradient communication", "checkpoint recovery", "spot instance preemption", "NVLink", "InfiniBand", "training throughput", or when a model is too large for a single GPU or training speed needs to be scaled.
version:: 1.0.0

Distributed Training — Parallelism Strategies and Cluster Management

Parallelism Strategy Selection

Quick heuristics:

< 7B parameters, fits in VRAM → DDP with automatic mixed precision
7B–70B parameters → FSDP with FULL_SHARD strategy
70B parameters → 3D Hybrid parallelism (Megatron-LM / DeepSpeed)

Data Parallel (DDP) — Standard Multi-GPU

# Launch: torchrun --nproc_per_node=8 train.py
model = DistributedDataParallel(model, device_ids=[local_rank])
sampler = DistributedSampler(dataset)  # ensures no overlap across GPUs

All-reduce gradient synchronization happens automatically via NCCL ring-allreduce.

Bucket tuning: bucket_cap_mb=100 (default: 25MB) — larger buckets reduce communication rounds but increase memory. Tune based on model size and interconnect bandwidth.

FSDP — Fully Sharded Data Parallel

Shards model parameters, gradients, and optimizer states across all ranks:

FULL_SHARD: maximum memory saving — parameter gathered only when needed
SHARD_GRAD_OP: shard gradients/optimizer states only — faster but more memory
Use transformer_auto_wrap_policy to wrap Transformer blocks for efficient sharding

FSDP enables training of 70B+ parameter models on 8× A100s (80GB) without offloading.

See references/parallelism-strategies.md for TP, PP, and 3D hybrid implementation patterns.

Communication Efficiency

Gradient compression (for bandwidth-limited clusters):

PowerSGD: low-rank gradient compression, 10–100× compression ratio
1-bit Adam: quantize gradients to 1-bit; apply error feedback accumulation
Use only when NCCL all-reduce is the bottleneck (profile first)

DeepSpeed ZeRO Optimizer:

Stage 1: shard optimizer states only (4× memory reduction)
Stage 2: + shard gradients (8× reduction)
Stage 3: + shard parameters (linear reduction in N GPUs)
CPU offload: ZeRO-Infinity offloads to CPU/NVMe (enabling trillion-parameter models)

Memory Optimization Techniques

Gradient checkpointing: discard activations during forward pass; recompute during backward. Saves O(√L) activations (L = number of layers). ~33% compute overhead, 60–80% VRAM savings.
Activation offloading: swap activations to CPU RAM when VRAM is full.
Flash Attention: I/O-aware attention algorithm — O(n) memory vs. O(n²) for standard attention.
torch.compile(): Dynamo + Inductor compilation, 1.5–2.5× throughput without code changes.

Fault Tolerance

Checkpointing cadence: every 500 steps to shared storage (NFS or S3); retain 3 most recent.
Checkpoint contents: weights, optimizer state, scheduler state, step count, RNG state, Python/library versions.
Elastic training: torchrun --max_restarts=3 for automatic fault recovery.
Spot instance resilience: catch SIGTERM (15s warning), flush checkpoint, exit cleanly, allow elasticity layer to restart.

See references/cluster-management.md for Kubernetes ML cluster setup, autoscaling, and multi-cloud training patterns.

GPU Utilization Targets

Profile with torch.profiler.profile() for the first 100 steps; identify the bottleneck category before optimizing.

Related Skills

harsh040506/single-cell-rna-qc

testing

VerifiedTrustedCommunity

Performs quality control on single-cell RNA-seq data (.h5ad or .h5 files) using scverse best practices with MAD-based filtering and comprehensive visualizations. Use when users request QC analysis, filtering low-quality cells, assessing data quality, or following scverse/scanpy best practices for single-cell analysis.

2SKILL.mdUpdated Apr 5, 2026

harsh040506/single-cell-rna-qc

harsh040506/scvi-tools

tools

VerifiedTrustedCommunity

Deep learning for single-cell analysis using scvi-tools. This skill should be used when users need (1) data integration and batch correction with scVI/scANVI, (2) ATAC-seq analysis with PeakVI, (3) CITE-seq multi-modal analysis with totalVI, (4) multiome RNA+ATAC analysis with MultiVI, (5) spatial transcriptomics deconvolution with DestVI, (6) label transfer and reference mapping with scANVI/scArches, (7) RNA velocity with veloVI, or (8) any deep learning-based single-cell method. Triggers include mentions of scVI, scANVI, totalVI, PeakVI, MultiVI, DestVI, veloVI, sysVI, scArches, variational autoencoder, VAE, batch correction, data integration, multi-modal, CITE-seq, multiome, reference mapping, latent space.

2SKILL.mdUpdated Apr 5, 2026

harsh040506/scvi-tools

harsh040506/scientific-problem-selection

testing

VerifiedTrustedCommunity

This skill should be used when scientists need help with research problem selection, project ideation, troubleshooting stuck projects, or strategic scientific decisions. Use this skill when users ask to pitch a new research idea, work through a project problem, evaluate project risks, plan research strategy, navigate decision trees, or get help choosing what scientific problem to work on. Typical requests include "I have an idea for a project", "I'm stuck on my research", "help me evaluate this project", "what should I work on", or "I need strategic advice about my research".

2SKILL.mdUpdated Apr 5, 2026

harsh040506/scientific-problem-selection

harsh040506/nextflow-development

development

VerifiedTrustedCommunity

Run nf-core bioinformatics pipelines (rnaseq, sarek, atacseq) on sequencing data. Use when analyzing RNA-seq, WGS/WES, or ATAC-seq data—either local FASTQs or public datasets from GEO/SRA. Triggers on nf-core, Nextflow, FASTQ analysis, variant calling, gene expression, differential expression, GEO reanalysis, GSE/GSM/SRR accessions, or samplesheet creation.

2SKILL.mdUpdated Apr 5, 2026

harsh040506/nextflow-development

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/harsh040506/claude-code-unified-skill-plugin-library.git

# Copy into Claude Code skills folder (global)
cp -r claude-code-unified-skill-plugin-library/engineering/advanced-ml-engineering/skills/distributed-training ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

harsh040506/claude-code-unified-skill-plugin-library

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT