engineering/ai-ml-engineering/skills/ml-engineering/SKILL.md
This skill should be used when the user asks about "machine learning", "deep learning", "neural network", "train a model", "PyTorch", "TensorFlow", "JAX", "scikit-learn", "XGBoost", "LightGBM", "fine-tune", "transfer learning", "model architecture", "loss function", "optimizer", "learning rate", "batch size", "epoch", "overfitting", "underfitting", "regularization", "dropout", "batch normalization", "gradient descent", "backpropagation", "training loop", "validation", "hyperparameter tuning", "Optuna", "Ray Tune", "Weights & Biases", "MLflow", "model checkpoint", "early stopping", "mixed precision", "distributed training", "GPU training", "CUDA", "model serving", "TorchServe", "ONNX", or "model deployment". Also trigger for "my model isn't converging", "loss is NaN", "training is slow", "model is overfitting", or "how do I improve my model accuracy".
Production-grade guidance for training, optimizing, and deploying machine learning models.
Problem definition → Data collection → EDA → Feature engineering
→ Model selection → Training → Evaluation → Hyperparameter tuning
→ Error analysis → Deployment → Monitoring → Retraining
Never skip the problem definition or data stages. The most common ML failures are:
For time-series or temporal data: Split by time, not randomly.
Random splits on time-series cause data leakage — future information leaks into training.
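A chronological split can be sketched with the standard library alone. The record layout, the `day` field, and the 80/20 ratio below are assumptions chosen for illustration, not part of the original guidance.

```python
from datetime import date


def time_based_split(rows, time_key, train_fraction=0.8):
    """Split rows chronologically: earliest rows train, latest rows test."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    ordered = sorted(rows, key=lambda row: row[time_key])
    cutoff = int(len(ordered) * train_fraction)
    return ordered[:cutoff], ordered[cutoff:]


# Hypothetical daily records; the field names are illustrative only.
rows = [{"day": date(2024, 1, d), "value": d * 10} for d in (5, 1, 4, 2, 3)]
train_rows, test_rows = time_based_split(rows, time_key="day")

# Every training row precedes every test row, so no future data leaks in.
latest_train = max(row["day"] for row in train_rows)
earliest_test = min(row["day"] for row in test_rows)
```

If several rows share a timestamp at the cutoff, this sketch may place them on both sides of the boundary; cutting on an explicit date avoids that.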
For i.i.d. data:
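One common choice here, assumed for illustration rather than taken from the original text, is a shuffled split that is stratified by label so that each class keeps the same ratio in both parts. The function name, the 20% test fraction, and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split at the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class before cutting
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = ["cat"] * 80 + ["dog"] * 20  # imbalanced toy labels
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
test_dogs = sum(1 for i in test_idx if labels[i] == "dog")
```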
Leakage = test-set information contaminating the training set. It makes models look better than they are.
Common leakage sources:
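from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# WRONG: the scaler sees test rows, so test statistics leak into training
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Fits on the entire dataset
X_train, X_test = train_test_split(X_scaled)

# CORRECT: split first, then fit the scaler on the training rows only
X_train, X_test = train_test_split(X)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit only on train
X_test_scaled = scaler.transform(X_test)  # Transform test with train statistics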
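The same rule applies inside cross-validation: the scaler is refit on the training part of every fold. The sketch below uses a tiny hand-written `Standardizer` and contiguous folds as stand-ins; both are hypothetical helpers written for this example, not library APIs.

```python
from statistics import mean, pstdev


class Standardizer:
    """Tiny stand-in for a scaler: learns mean/std in fit, applies them in transform."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # fall back to 1.0 for constant input
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs for contiguous folds."""
    fold_size = n_samples // n_folds
    for fold in range(n_folds):
        start = fold * fold_size
        stop = n_samples if fold == n_folds - 1 else start + fold_size
        val_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, val_idx


values = [1.0, 2.0, 3.0, 4.0, 100.0, 6.0]  # toy feature with one outlier
fold_means = []
for train_idx, val_idx in k_fold_indices(len(values), n_folds=3):
    # Fit on the training part of this fold only, then transform the held-out part.
    scaler = Standardizer().fit([values[i] for i in train_idx])
    val_scaled = scaler.transform([values[i] for i in val_idx])
    fold_means.append(scaler.mean_)
```

In the last fold the outlier sits in the validation part, so the fitted mean (2.5) knows nothing about it. That is what an honest evaluation should look like.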
| Task | Start here | Then try |
|------|-----------|---------|
| Tabular classification | LightGBM / XGBoost | Neural network if still underperforming |
| Tabular regression | LightGBM / XGBoost | Neural network |
| Image classification | ResNet50 (fine-tune) | EfficientNet, ConvNeXt |
| Text classification | DistilBERT / RoBERTa (fine-tune) | Larger model if needed |
| Text generation | Fine-tune instruction-tuned LLM (Mistral 7B, Llama 3) | GPT-4 API if quality > cost |
| Sequence-to-sequence | T5/mT5 fine-tune | LLM fine-tune |
| Tabular anomaly detection | Isolation Forest | Autoencoder |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 | Fine-tune on domain data |
Always establish a simple baseline first (logistic regression, rule-based heuristic, random). Beat the baseline before adding complexity.
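For classification, the simplest such baseline is to always predict the most frequent training label. A minimal sketch follows; the function name and toy labels are made up for illustration.

```python
from collections import Counter


def majority_class_baseline(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)


train_labels = [0, 0, 0, 1, 0, 1, 0, 0]
test_labels = [0, 1, 0, 0]
label, baseline_acc = majority_class_baseline(train_labels, test_labels)
```

Any trained model that cannot beat `baseline_acc` on the same test split is not yet adding value.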
import torch
import torch.nn as nn
from torch.cuda.amp import GradScaler, autocast
import wandb
from pathlib import Path


def train(
    model: nn.Module,
    train_loader,
    val_loader,
    config: dict,
    output_dir: str = "./checkpoints",
):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=config["lr"],
        weight_decay=config.get("weight_decay", 0.01),
    )
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=config["epochs"]
    )
    scaler = GradScaler()  # Mixed precision training
    criterion = nn.CrossEntropyLoss()

    wandb.init(project=config["project"], config=config)

    best_val_loss = float("inf")
    patience_counter = 0
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    for epoch in range(config["epochs"]):
        # ── Training ──────────────────────────────
        model.train()
        train_loss = 0.0
        for batch_idx, (inputs, targets) in enumerate(train_loader):
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()

            with autocast():  # Mixed precision: roughly 2x speedup on modern GPUs
                outputs = model(inputs)
                loss = criterion(outputs, targets)

            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)  # Unscale first so clipping sees true gradient norms
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Gradient clipping
            scaler.step(optimizer)
            scaler.update()
            train_loss += loss.item()

        # ── Validation ────────────────────────────
        model.eval()
        val_loss = 0.0
        correct = 0
        total = 0
        with torch.no_grad():
            for inputs, targets in val_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                with autocast():
                    outputs = model(inputs)
                    loss = criterion(outputs, targets)
                val_loss += loss.item()
                _, predicted = outputs.max(1)
                correct += predicted.eq(targets).sum().item()
                total += targets.size(0)

        train_loss /= len(train_loader)
        val_loss /= len(val_loader)
        val_acc = correct / total

        wandb.log({
            "epoch": epoch,
            "train/loss": train_loss,
            "val/loss": val_loss,
            "val/accuracy": val_acc,
            "lr": scheduler.get_last_lr()[0],
        })
        print(f"Epoch {epoch+1}/{config['epochs']} | "
              f"Train Loss: {train_loss:.4f} | "
              f"Val Loss: {val_loss:.4f} | "
              f"Val Acc: {val_acc:.4f}")

        # ── Checkpointing ─────────────────────────
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_counter = 0
            torch.save({
                "epoch": epoch,
                "model_state_dict": model.state_dict(),
                "optimizer_state_dict": optimizer.state_dict(),
                "val_loss": val_loss,
                "config": config,
            }, output_path / "best_model.pt")
            wandb.save(str(output_path / "best_model.pt"))
        else:
            patience_counter += 1
            if patience_counter >= config.get("patience", 10):
                print(f"Early stopping at epoch {epoch+1}")
                break

        scheduler.step()

    wandb.finish()
    return output_path / "best_model.pt"
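The patience counter in the loop above can also be factored into a small helper so the stopping rule is testable on its own. This is a sketch, not part of the original script: the class name and the `min_delta` threshold are assumptions added for illustration.

```python
class EarlyStopping:
    """Track validation loss and signal when it has stopped improving."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # assumed extra: minimum improvement that counts
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
losses = [0.90, 0.70, 0.72, 0.71, 0.69]  # made-up validation losses
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

With `patience=2` the run stops at epoch index 3 and never sees the later improvement to 0.69, which shows the trade-off of a small patience value.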
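When the loss is NaN, check:
- NaNs in the input data: df.isnull().any(), np.isnan(X).any()
- Exploding gradients: clip with clip_grad_norm_(model.parameters(), 1.0)
- Log of zero: use log(x + 1e-8)
- A raw batch: inspect next(iter(dataloader)), since a PyTorch DataLoader does not support dataloader[0]

Gap between training and validation metrics is growing: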
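A growing gap between training and validation loss can be flagged automatically from the logged history. The check below is one possible heuristic; the function name, the `window` of three epochs, and the loss values are illustrative assumptions.

```python
def gap_is_growing(train_losses, val_losses, window=3):
    """True if (val_loss - train_loss) rose on each of the last `window` epochs."""
    gaps = [v - t for t, v in zip(train_losses, val_losses)]
    if len(gaps) < window + 1:
        return False  # not enough history to judge
    recent = gaps[-(window + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Training loss keeps falling while validation loss turns upward.
overfit = gap_is_growing(
    train_losses=[1.00, 0.80, 0.60, 0.45, 0.30, 0.20],
    val_losses=[1.05, 0.90, 0.80, 0.78, 0.80, 0.85],
)

# Validation loss tracks training loss and the gap shrinks.
healthy = gap_is_growing(
    train_losses=[1.00, 0.80, 0.60, 0.50],
    val_losses=[1.20, 0.95, 0.70, 0.55],
)
```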
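When training is slow, check:
- Mixed precision: GradScaler + autocast, often a 2–3× speedup on modern GPUs
- GPU utilization (nvidia-smi)
- DataLoader workers: num_workers=4 (or the number of CPU cores)
- DataLoader(pin_memory=True) for faster CPU→GPU transfer
- torch.profiler to identify the bottleneck

Use Optuna for efficient hyperparameter search. Prefer Tree-structured Parzen Estimators (TPE), Optuna's default sampler, over grid search: TPE concentrates trials in promising regions, so it typically finds good parameters in far fewer trials.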
import optuna


def objective(trial):
    config = {
        "lr": trial.suggest_float("lr", 1e-5, 1e-2, log=True),
        "batch_size": trial.suggest_categorical("batch_size", [16, 32, 64, 128]),
        "dropout": trial.suggest_float("dropout", 0.1, 0.5),
        "hidden_dim": trial.suggest_categorical("hidden_dim", [128, 256, 512]),
        "epochs": 20,  # Short runs for search
        "patience": 5,
    }
    # Train with these params and return the validation metric.
    # train_and_evaluate is a user-supplied function; it is not defined here.
    val_loss = train_and_evaluate(config)
    return val_loss


study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50, timeout=3600)  # 50 trials or 1 hour, whichever comes first
print(f"Best params: {study.best_params}")
print(f"Best val loss: {study.best_value}")
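The `log=True` flag on the learning-rate range matters because it samples uniformly in log space, so each order of magnitude gets roughly equal attention. The stdlib sketch below mimics that behaviour for intuition only. It is not Optuna's implementation, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(1e-5, 1e-2, rng) for _ in range(10_000)]

# About a third of the draws land in each decade: 1e-5..1e-4, 1e-4..1e-3, 1e-3..1e-2.
share_lowest_decade = sum(s < 1e-4 for s in samples) / len(samples)

# A plain uniform draw on the same range almost never visits the lowest decade.
uniform_samples = [rng.uniform(1e-5, 1e-2) for _ in range(10_000)]
share_lowest_uniform = sum(s < 1e-4 for s in uniform_samples) / len(uniform_samples)
```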
import torch
import torch.onnx

model.eval()
# input_size is a placeholder for the model's feature dimension; set it to match your model.
dummy_input = torch.randn(1, input_size)

torch.onnx.export(
    model, dummy_input,
    "model.onnx",
    export_params=True,
    opset_version=17,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}},
)

# Validate export: load the file and run one inference on CPU
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
output = session.run(None, {"input": dummy_input.numpy()})
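Running the exported model confirms that it loads, but it does not confirm that its outputs agree with the PyTorch model. A tolerance-based comparison is a common follow-up; the sketch below works on plain nested lists, and the logit values are made up for illustration. With arrays, NumPy's `allclose` is the usual tool.

```python
import math


def outputs_match(reference, candidate, rel_tol=1e-3, abs_tol=1e-5):
    """Compare two equally shaped 2-D lists element by element within a tolerance."""
    if len(reference) != len(candidate):
        return False
    for ref_row, cand_row in zip(reference, candidate):
        if len(ref_row) != len(cand_row):
            return False
        for ref, cand in zip(ref_row, cand_row):
            if not math.isclose(ref, cand, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
    return True


# Hypothetical logits, standing in for the PyTorch and ONNX Runtime outputs.
torch_logits = [[2.3101, -0.4207, 0.1150]]
onnx_logits = [[2.3102, -0.4207, 0.1151]]
broken_logits = [[2.3101, -0.4207, 0.5000]]

export_ok = outputs_match(torch_logits, onnx_logits)
export_broken = outputs_match(torch_logits, broken_logits)
```

The tolerances here are assumptions; small differences are expected because the two runtimes use different kernels.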
from fastapi import FastAPI
from pydantic import BaseModel
from scipy.special import softmax  # softmax was used below without being defined
import onnxruntime as ort
import numpy as np

app = FastAPI()
# Prefer the GPU provider and fall back to CPU when CUDA is unavailable
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])


class PredictRequest(BaseModel):
    features: list[float]


class PredictResponse(BaseModel):
    prediction: int
    confidence: float


@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest):
    # Shape (1, n_features): a batch containing a single example
    input_array = np.array([request.features], dtype=np.float32)
    logits = session.run(None, {"input": input_array})[0]
    probs = softmax(logits[0])
    return PredictResponse(
        prediction=int(probs.argmax()),
        confidence=float(probs.max()),
    )
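The endpoint relies on a `softmax` function to turn logits into probabilities. A numerically stable version subtracts the largest logit before exponentiating. The pure-Python sketch below shows the idea on a plain list; it is an illustration, not the function the service must use.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # largest exponent is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]


# Calling math.exp(1000.0) directly would raise OverflowError; shifting by the max avoids it.
probs = softmax([1000.0, 1001.0, 1002.0])
prediction = probs.index(max(probs))
confidence = max(probs)
```

Subtracting the maximum does not change the result, because softmax is invariant to adding a constant to every logit.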
For complete training recipes and hyperparameter tuning guides, see:
- references/training-recipes.md: end-to-end PyTorch and HuggingFace training scripts with mixed-precision, gradient checkpointing, and distributed training
- references/hyperparameter-guide.md: learning rate schedules, batch size scaling rules, regularization strategies, and Optuna/Ray Tune search configurations