plugins/ml-master/skills/ml-training/SKILL.md
This skill should be used when the user asks to train, debug, scale, or improve ML models. PROACTIVELY activate for: (1) PyTorch, TensorFlow/Keras, JAX, Flax, Hugging Face Trainer/Accelerate training loops, (2) distributed training, DDP/FSDP/DeepSpeed, TPU/GPU setup, (3) mixed precision AMP/bf16, gradient accumulation, checkpointing, seeding, (4) overfitting, imbalance, loss functions, regularization, LR schedules, warmup, (5) memory optimization, gradient checkpointing, offloading, quantization-aware training. Provides: reproducible training best practices across deep learning and classical ML.
`npx skillsauth add JosiahSiegel/claude-plugin-marketplace ml-training`

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Use this skill for model training across PyTorch, TensorFlow/Keras, JAX/Flax, Hugging Face Transformers/Diffusers/Accelerate/PEFT, scikit-learn, XGBoost, LightGBM, CatBoost, Spark MLlib, and Ray. Optimize for correctness first: validated data, leakage-safe splits, reproducible configuration, meaningful metrics, and a simple baseline before complex distributed or accelerator-heavy runs.
For deep learning, every training loop should make the forward pass, loss computation, backward pass, optimizer step, scheduler step, gradient zeroing, metric logging, validation, checkpointing, and early stopping explicit.
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
# torch.cuda.amp still works but is deprecated in recent PyTorch releases,
# which expose the same tools under torch.amp.
from torch.cuda.amp import autocast, GradScaler


def train_one_epoch(
    model: nn.Module,
    dataloader: DataLoader,
    optimizer: torch.optim.Optimizer,
    criterion: nn.Module,
    device: torch.device,
    scaler: GradScaler,
    scheduler=None,
    grad_accum_steps: int = 1,
    max_grad_norm: float = 1.0,
):
    model.train()
    optimizer.zero_grad(set_to_none=True)
    total_loss = 0.0

    for step, (inputs, targets) in enumerate(dataloader):
        inputs = inputs.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)

        # Mixed precision forward pass
        with autocast(dtype=torch.float16):  # or torch.bfloat16 if hardware supports it
            outputs = model(inputs)
            loss = criterion(outputs, targets)

        # Scale loss for gradient accumulation
        loss = loss / grad_accum_steps

        # Scaled backpropagation
        scaler.scale(loss).backward()

        if (step + 1) % grad_accum_steps == 0 or (step + 1) == len(dataloader):
            # Unscale gradients before clipping
            scaler.unscale_(optimizer)
            # Gradient clipping to prevent exploding gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            # Step optimizer and update scale factor
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
            if scheduler is not None:
                scheduler.step()  # per-optimizer-step schedule

        # Undo the accumulation scaling so the logged loss is per batch
        total_loss += loss.item() * grad_accum_steps

    return total_loss / len(dataloader)


@torch.inference_mode()
def validate(model: nn.Module, dataloader: DataLoader, criterion: nn.Module, device: torch.device):
    model.eval()
    total_loss = 0.0
    correct = 0
    total = 0

    for inputs, targets in dataloader:
        inputs = inputs.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)
        outputs = model(inputs)
        loss = criterion(outputs, targets)

        total_loss += loss.item()
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()

    return total_loss / len(dataloader), correct / total
```
In TensorFlow/Keras, prefer built-in fit when callbacks and distribution strategies are sufficient; use tf.GradientTape for custom control. In JAX/Flax, keep state explicit, use pure update functions, and separate PRNG keys for dropout, augmentation, and sampling.
For classical ML, build preprocessing inside a pipeline object so cross-validation applies transformations only on training folds.
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from xgboost import XGBClassifier

# Hypothetical column names; replace with the columns of your own dataset.
numeric_cols = ["age", "income"]
categorical_cols = ["country", "device"]

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_cols),
    ('cat', categorical_transformer, categorical_cols)
])

# use_label_encoder is omitted: it is deprecated and ignored by recent XGBoost releases.
clf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier(eval_metric='logloss'))
])
```
Use AdamW as a strong deep-learning default when weight decay should be decoupled; use SGD with momentum for some vision regimes; use Adafactor or 8-bit optimizers only when memory pressure justifies it. Exclude bias and normalization parameters from weight decay for transformer-style models unless the architecture documentation says otherwise.
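The weight-decay exclusion can be sketched as a small helper. This is a minimal sketch, not a library API: `split_decay_groups` is a hypothetical name, and its rule (skip parameters whose name ends in `bias` or that have at most one dimension, which covers typical normalization scales) is a common heuristic and an assumption here. It only reads `.ndim` and `.requires_grad`, so it accepts the `(name, parameter)` pairs produced by `model.named_parameters()`.

```python
def split_decay_groups(named_params, weight_decay=0.01):
    """Build optimizer parameter groups with and without weight decay.

    Assumption (heuristic): biases and 1-D tensors such as normalization
    scales are excluded from decay; everything else is decayed.
    """
    decay, no_decay = [], []
    for name, param in named_params:
        if not param.requires_grad:
            continue  # frozen parameters are not handed to the optimizer
        if name.endswith("bias") or param.ndim <= 1:
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
```

With PyTorch, the result would typically be passed as `torch.optim.AdamW(split_decay_groups(model.named_parameters()), lr=3e-4)`, where the learning rate is only a placeholder.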
Learning-rate strategy usually matters more than optimizer micro-tuning. Common defaults:
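One widely used schedule is linear warmup followed by cosine decay. The sketch below computes it in plain Python so the shape is easy to inspect; `lr_at_step` and every default value in it are illustrative assumptions, not recommendations taken from this skill.

```python
import math


def lr_at_step(step, total_steps, base_lr=3e-4, warmup_steps=100, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine
```

In PyTorch, the same curve could be plugged into `torch.optim.lr_scheduler.LambdaLR` by returning `lr_at_step(step, total_steps) / base_lr` as the multiplier, stepped once per optimizer update.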
Detect overfitting with train/validation curves, not final metrics alone. Use data augmentation, dropout, label smoothing, weight decay, early stopping, mixup/cutmix, stronger validation splits, or smaller models. For underfitting, check labels, feature signal, loss scaling, optimizer learning rate, capacity, and preprocessing bugs before adding complexity.
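Early stopping on the validation curve can be sketched as follows. `EarlyStopping`, `patience`, and `min_delta` are hypothetical names chosen for illustration, and the loss values are made up to show the behavior.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` checks."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_checks = 0

    def step(self, val_loss):
        """Record one validation result; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


# Made-up validation losses: improvement stalls after the third epoch.
stopper = EarlyStopping(patience=2)
history = [0.90, 0.70, 0.65, 0.66, 0.67, 0.64]
stopped_at = next((i for i, loss in enumerate(history) if stopper.step(loss)), None)
# stopped_at == 4: two consecutive checks failed to beat the best loss of 0.65
```

Pairing this with a separately saved best-by-validation checkpoint means the stop decision never discards the best weights.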
Effective batch size equals per-device batch size times devices times gradient accumulation steps. Increase batch size for throughput only if validation quality remains stable. If memory is limited, use gradient accumulation, activation checkpointing, shorter sequences, smaller images, bucketing, parameter-efficient fine-tuning, optimizer state sharding, or offloading.
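The arithmetic above is easy to get wrong when the device count changes between runs. A minimal sketch, with hypothetical helper names:

```python
import math


def effective_batch_size(per_device_batch, num_devices, grad_accum_steps):
    """Samples contributing to one optimizer step."""
    return per_device_batch * num_devices * grad_accum_steps


def accum_steps_for(target_batch, per_device_batch, num_devices):
    """Smallest accumulation count that reaches at least target_batch."""
    return math.ceil(target_batch / (per_device_batch * num_devices))


# 8 samples per device, 4 devices, 16 accumulation steps
print(effective_batch_size(8, 4, 16))  # 512
# Keeping a 512-sample batch after dropping to 2 devices
print(accum_steps_for(512, 8, 2))      # 32
```

Recomputing the accumulation count this way keeps the effective batch, and therefore the tuned learning rate, comparable across hardware configurations.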
Use mixed precision for modern accelerators. Prefer bf16 on hardware with native bf16 support because it reduces loss-scaling issues; use fp16 AMP with dynamic loss scaling otherwise. Keep numerically sensitive operations such as some reductions, softmax/logits handling, and metric accumulation in fp32 when needed. Watch for NaNs, overflow, underflow, and unstable normalization.
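To see why fp16 needs loss scaling, the sketch below rounds values through IEEE 754 half precision using only the standard library. The gradient value is illustrative, and the scale of 2**16 is an assumption chosen to match the `initial_scale_power` of 16 used in the DeepSpeed config in this document.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_grad = 1e-8   # illustrative gradient below the smallest fp16 subnormal
scale = 2.0 ** 16  # example loss scale (assumption)

unscaled = to_fp16(tiny_grad)                   # underflows to 0.0
recovered = to_fp16(tiny_grad * scale) / scale  # survives when scaled first
```

Here `unscaled` is exactly zero, while `recovered` stays within fp16 rounding error of the original value because the division happens in higher precision, which is what unscaling before the optimizer step achieves.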
Choose the simplest distribution mode that satisfies the bottleneck:
| Pattern | Use when | Caveats |
|---|---|---|
| Data parallel / DDP | Model fits on one device; need faster throughput | Needs correct global batch, sampler sharding, synchronized metrics |
| FSDP / ZeRO | Model fits only with sharded parameters/optimizer state | More checkpoint complexity; tune wrapping and offload |
| Tensor parallel | Individual layers exceed one device or need large transformer scale | Requires framework/runtime support; communication-heavy |
| Pipeline parallel | Very deep models need layer partitioning | Bubble overhead; microbatch scheduling matters |
| Expert/model parallel | Mixture-of-experts or huge models | Routing/load-balance complexity |
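The "sampler sharding" caveat for data parallelism can be illustrated without any framework. The sketch below follows the pad-then-stride idea used by samplers such as PyTorch's `DistributedSampler`, with shuffling left out for clarity; `shard_indices` is a hypothetical helper, not a library function.

```python
import math


def shard_indices(num_samples, rank, world_size):
    """Return the sample indices one rank would process in an epoch.

    Pads by wrapping around to the start so every rank gets the same
    count, then takes every world_size-th index starting at `rank`.
    """
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    padded = [i % num_samples for i in range(total)]
    return padded[rank:total:world_size]


# 10 samples over 4 ranks: each rank sees 3 indices, two samples are repeated.
shards = [shard_indices(10, rank, 4) for rank in range(4)]
```

Because padding repeats a few samples, metrics summed across ranks can double-count them; this is one reason the table calls for synchronized metrics.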
```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def setup_distributed():
    # Environment variables set by torchrun
    world_size = int(os.environ["WORLD_SIZE"])
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])

    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        world_size=world_size,
        rank=rank,
    )
    torch.cuda.set_device(local_rank)
    return rank, local_rank, world_size


def cleanup():
    dist.destroy_process_group()


# Usage DDP:
# rank, local_rank, _ = setup_distributed()
# model = MyModel().to(local_rank)
# model = DDP(model, device_ids=[local_rank])

# Usage FSDP:
# rank, local_rank, _ = setup_distributed()
# model = MyModel().to(local_rank)
# model = FSDP(model)
```
DeepSpeed configuration (`deepspeed_config.json`). The `"auto"` values are placeholders that the Hugging Face Trainer integration fills in; standalone DeepSpeed needs concrete numbers.

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "cpu_offload": false
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": "auto"
    }
  }
}
```
A recoverable checkpoint includes model weights, optimizer state, scheduler state, scaler state, epoch/step, RNG states where feasible, tokenizer/preprocessor, label mapping, config, metrics history, and code or commit identifier. Save best-by-validation and last checkpoints separately. Test resume before long training. Store artifacts in durable storage rather than ephemeral worker disks.
```python
import os
import shutil

import torch


def save_checkpoint(state, is_best, checkpoint_dir="checkpoints"):
    """Write the latest checkpoint and keep a separate copy of the best one."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    filepath = os.path.join(checkpoint_dir, "last_checkpoint.pt")
    torch.save(state, filepath)
    if is_best:
        shutil.copyfile(filepath, os.path.join(checkpoint_dir, "best_model.pt"))


def load_checkpoint(filepath, model, optimizer=None, scheduler=None, scaler=None):
    """Restore whatever training state the checkpoint contains."""
    # Load onto CPU first; the state dicts are copied to the right device on load.
    checkpoint = torch.load(filepath, map_location="cpu")
    model.load_state_dict(checkpoint["state_dict"])

    if optimizer is not None and "optimizer" in checkpoint:
        optimizer.load_state_dict(checkpoint["optimizer"])
    if scheduler is not None and "scheduler" in checkpoint:
        scheduler.load_state_dict(checkpoint["scheduler"])
    if scaler is not None and "scaler" in checkpoint:
        scaler.load_state_dict(checkpoint["scaler"])

    # Defaults assume a lower-is-better metric such as validation loss.
    return checkpoint.get("epoch", 0), checkpoint.get("best_metric", float("inf"))
```
For class imbalance, first choose metrics that reflect the objective: PR-AUC, F1 variants, balanced accuracy, recall at precision, cost-weighted utility, or calibration. Use class weights, focal loss, balanced sampling, threshold tuning, or data collection. Avoid oversampling before splitting. For label noise, inspect high-loss examples, use robust loss functions, relabel critical samples, and report uncertainty.
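Class weights are often the cheapest of these remedies to try. The sketch below uses the inverse-frequency formula `n_samples / (n_classes * class_count)`, which is the same heuristic scikit-learn applies for its `"balanced"` class weights; `balanced_class_weights` is a hypothetical helper and the label counts are illustrative.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Illustrative training labels: 90 negatives, 10 positives.
weights = balanced_class_weights([0] * 90 + [1] * 10)
# The minority class gets weight 5.0, the majority class about 0.56.
```

Compute the weights from the training split only, for the same reason oversampling must happen after splitting: statistics taken over validation or test labels leak information.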
Apply scaling laws pragmatically: estimate whether quality is data-limited, model-limited, or compute-limited before scaling. Track loss versus compute, data size, and parameter count. For large models, consider curriculum learning, data quality filtering, deduplication, sequence packing, token budget accounting, and compute-optimal trade-offs. For RLHF or reinforcement learning, verify reward quality, offline evaluation, safety constraints, and distribution shift before online optimization.
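For token budget accounting, a rough starting point is the commonly cited approximation that training a dense transformer costs about 6 FLOPs per parameter per token. That constant is an assumption brought in here, not a figure from this skill, and it ignores attention cost, hardware utilization, and other overheads; the function names are hypothetical.

```python
def approx_training_flops(num_params, num_tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per token."""
    return 6 * num_params * num_tokens


def tokens_for_budget(flops_budget, num_params):
    """Tokens affordable for a given model size under the same rule of thumb."""
    return flops_budget / (6 * num_params)


# A 1B-parameter model trained on 20B tokens
flops = approx_training_flops(1_000_000_000, 20_000_000_000)
# Tokens affordable for a 2B-parameter model under the same compute budget
tokens = tokens_for_budget(flops, 2_000_000_000)
```

Comparing such estimates against measured loss curves helps show whether a run is limited by data, model size, or compute before committing to a larger one.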
train() mode, broken input pipeline.