plugins/ml-master/skills/ml-hyperparameter-tuning/SKILL.md
This skill should be used when the user asks to tune hyperparameters, run sweeps, optimize search spaces, or use AutoML. PROACTIVELY activate for: (1) Optuna, Ray Tune, FLAML, AutoGluon, Hyperopt, Nevergrad, KerasTuner, W&B sweeps, (2) grid search, random search, Bayesian optimization, TPE, Gaussian processes, evolutionary search, (3) ASHA, Hyperband, successive halving, multi-fidelity optimization, population-based training, (4) learning-rate finder, batch-size search, early stopping, pruning, (5) reproducible sweep design and experiment analysis. Provides: budget-aware hyperparameter search strategy.
Use this skill for designing and running hyperparameter searches. Tuning should improve a valid baseline under a fixed evaluation protocol. Do not tune before data validation, leakage-safe splits, metric selection, reproducible training, and a simple baseline are in place.
| Strategy | Use when | Notes |
|---|---|---|
| Manual informed search | Early debugging or very small budgets | Best when guided by learning curves and domain knowledge |
| Grid search | Few categorical/discrete parameters | Wasteful in high dimensions |
| Random search | Strong default for broad spaces | Often beats grid when only some parameters matter |
| Bayesian/TPE | Moderate budgets and expensive trials | Good for structured continuous/discrete spaces |
| Hyperband/ASHA | Many deep-learning trials with early signal | Requires comparable learning curves and sensible early-stopping metric |
| Population-based training | Schedules and nonstationary hyperparameters | More complex; useful for RL and large training budgets |
| AutoML | Need strong baseline or tabular productivity | Validate leakage, explainability, and deployment constraints |
Optuna is a flexible default for Python search. Ray Tune is strong for distributed sweeps, schedulers, and Ray Train integration. FLAML emphasizes cost-effective AutoML. AutoGluon is productive for tabular, multimodal, and time-series baselines. W&B sweeps integrate well with experiment tracking.
Optuna allows pruning unpromising trials early in the training loop based on intermediate validation scores.
```python
import optuna
import torch
import torch.nn as nn
import torch.optim as optim


def train_and_validate(config, trial, model, train_loader, val_loader, device):
    criterion = nn.CrossEntropyLoss()
    # Build the optimizer the trial selected; the names match classes in torch.optim
    optimizer_cls = getattr(optim, config["optimizer"])
    optimizer = optimizer_cls(model.parameters(), lr=config["lr"])

    for epoch in range(10):  # at most 10 epochs
        model.train()
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()

        # Validation epoch
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for inputs, targets in val_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                outputs = model(inputs)
                val_loss += criterion(outputs, targets).item()
        val_loss /= len(val_loader)

        # Report intermediate value to Optuna for pruning check
        trial.report(val_loss, epoch)

        # Handle pruning (stop execution of this trial if criteria are met)
        if trial.should_prune():
            raise optuna.exceptions.TrialPruned()

    return val_loss


def objective(trial: optuna.Trial, model_class, train_loader, val_loader, device):
    # Search space configuration
    config = {
        "lr": trial.suggest_float("lr", 1e-5, 1e-2, log=True),
        "optimizer": trial.suggest_categorical("optimizer", ["Adam", "SGD", "RMSprop"]),
        # NOTE: batch_size only takes effect if the data loaders are rebuilt with it;
        # the loaders passed in here are pre-built with a fixed batch size.
        "batch_size": trial.suggest_categorical("batch_size", [16, 32, 64]),
        "dropout_rate": trial.suggest_float("dropout_rate", 0.1, 0.5),
    }
    model = model_class(dropout_rate=config["dropout_rate"]).to(device)
    return train_and_validate(config, trial, model, train_loader, val_loader, device)


# Running the study
# study = optuna.create_study(direction="minimize", pruner=optuna.pruners.MedianPruner())
# study.optimize(lambda t: objective(t, MyModel, train_loader, val_loader, device), n_trials=50)
```
Use distributions that reflect scale. Learning rate, weight decay, regularization strength, and tree min child weights usually need log-uniform or categorical log grids. Depth, layers, hidden sizes, batch size, and number of estimators are discrete. Optimizer, scheduler, augmentation policy, model family, and feature set are categorical.
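To make the scale argument concrete, here is a minimal sketch of log-uniform sampling using only the standard library. The function names, ranges, and the parameter set are illustrative assumptions, not part of any tuning library.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng):
    """Hypothetical search space: one draw per hyperparameter, by type."""
    return {
        "lr": sample_log_uniform(rng, 1e-5, 1e-2),            # scale parameter
        "weight_decay": sample_log_uniform(rng, 1e-6, 1e-2),  # scale parameter
        "num_layers": rng.randint(1, 6),                      # discrete, inclusive bounds
        "optimizer": rng.choice(["adam", "sgd", "rmsprop"]),  # categorical
    }


rng = random.Random(0)
configs = [sample_config(rng) for _ in range(1000)]

# With log-uniform sampling, each decade of 1e-5..1e-2 gets about a third of the draws.
share_below_1e4 = sum(c["lr"] < 1e-4 for c in configs) / len(configs)
```

A plain uniform draw over the same range would place roughly 1% of samples below 1e-4, which is why scale-type parameters are usually searched in log space.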
Avoid searching invalid combinations. Encode conditional spaces: max_depth only for tree models, LoRA rank only for PEFT, warmup ratio only for scheduled optimizers. Keep the first search broad and shallow; narrow around promising regions.
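A conditional space can be sketched as a sampler that picks the branching choice first and then samples only the parameters that apply. The parameter names and ranges below are hypothetical and chosen only for illustration.

```python
import random


def sample_conditional_config(rng):
    """Sample the model family first, then only the parameters valid for it."""
    family = rng.choice(["tree", "mlp"])
    config = {"model_family": family}
    if family == "tree":
        # Tree-only parameters
        config["max_depth"] = rng.randint(2, 12)
        config["n_estimators"] = rng.choice([100, 300, 1000])
    else:
        # Network-only parameters
        config["hidden_size"] = rng.choice([64, 128, 256])
        config["use_lr_schedule"] = rng.choice([True, False])
        if config["use_lr_schedule"]:
            # Warmup ratio exists only when a schedule is used
            config["warmup_ratio"] = rng.uniform(0.0, 0.1)
    return config


rng = random.Random(0)
configs = [sample_conditional_config(rng) for _ in range(200)]
```

In Optuna the same pattern is written by calling `trial.suggest_*` inside the relevant branch of the objective, so parameters that do not apply are never sampled.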
W&B Sweeps run hyperparameter searches across multiple distributed agents. A sweep is defined in a configuration file such as `sweep.yaml`:
```yaml
program: train.py
method: bayes  # Search algorithm: bayes, random, grid
metric:
  name: val_loss
  goal: minimize
parameters:
  learning_rate:
    distribution: log_uniform_values
    # Written with a decimal point so YAML 1.1 parsers read floats, not strings
    min: 1.0e-5
    max: 1.0e-2
  batch_size:
    values: [16, 32, 64]
  epochs:
    value: 20
  dropout:
    distribution: uniform
    min: 0.1
    max: 0.5
  optimizer:
    values: ["adam", "adamw", "sgd"]
early_terminate:
  type: hyperband
  min_iter: 3
  eta: 2
```
Ray Tune easily distributes sweeps across a cluster, managing hyperparameter tuning at scale.
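```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler

# NOTE: this uses Ray Tune's legacy function API (tune.report(**metrics), tune.run).
# Newer Ray releases report a metrics dict and launch sweeps through tune.Tuner,
# so check the documentation for the installed Ray version.


def run_validation():
    """Placeholder: replace with a real validation pass that returns a float loss."""
    raise NotImplementedError


def train_fn(config):
    # Training routine pulling from config, e.g. config["lr"]
    for epoch in range(100):
        # ... training step ...
        val_loss = run_validation()
        # Report intermediate score back to Ray so the scheduler can stop weak trials
        tune.report(loss=val_loss, epoch=epoch)


# Define Async Successive Halving (ASHA) scheduler
asha_scheduler = ASHAScheduler(
    time_attr="epoch",
    metric="loss",
    mode="min",
    max_t=100,
    grace_period=5,
    reduction_factor=2,
)

# Run distributed sweep
# analysis = tune.run(
#     train_fn,
#     resources_per_trial={"cpu": 2, "gpu": 0.5},  # Run two trials per GPU
#     config={
#         "lr": tune.loguniform(1e-5, 1e-2),
#         "batch_size": tune.choice([16, 32, 64]),
#     },
#     num_samples=30,
#     scheduler=asha_scheduler,
# )
```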
Deep learning:
Tree/tabular models:
RAG and embedding systems:
Early stopping prevents wasted compute but can bias the search toward fast-starting configurations. Use patience and minimum resource thresholds. ASHA and Hyperband assume that early values of the metric predict final performance and that learning curves are comparable across trials. For noisy metrics, smooth the curve or require multiple evaluations before stopping a trial. Always run promising configurations to full budget before final selection.
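The core idea behind Hyperband and ASHA can be sketched with plain synchronous successive halving. This is a simplified illustration, not the asynchronous algorithm that ASHA implements, and `toy_loss` is a made-up stand-in for real training.

```python
import math


def successive_halving(configs, evaluate, min_resource=1, max_resource=27, eta=3):
    """Synchronous successive halving (lower loss is better).

    evaluate(config, resource) returns the loss after training `config`
    for `resource` units, for example epochs.
    """
    survivors = list(configs)
    resource = min_resource
    history = []  # (resource, number of configs evaluated) per rung
    while True:
        scored = sorted(
            ((evaluate(c, resource), c) for c in survivors),
            key=lambda pair: pair[0],
        )
        history.append((resource, len(scored)))
        if resource >= max_resource or len(scored) == 1:
            return scored[0][1], history
        keep = max(1, len(scored) // eta)  # keep the top 1/eta of the rung
        survivors = [c for _, c in scored[:keep]]
        resource = min(resource * eta, max_resource)


def toy_loss(config, resource):
    """Hypothetical curve: best at lr = 1e-3 and improving with more resource."""
    return (math.log10(config["lr"]) + 3.0) ** 2 + 1.0 / resource


configs = [{"lr": 10 ** (-5 + 0.5 * i)} for i in range(9)]
best, history = successive_halving(configs, toy_loss)
```

In this toy run the winner is decided after 9 resource units rather than the full 27, which is exactly the situation where the surviving configurations should be re-run at full budget before a final choice.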
A learning-rate range test can quickly find a useful LR interval for neural networks. Increase the LR over a short run, plot the loss against the LR, and choose a value below the point of divergence, often below the point of steepest descent. Re-run proper training afterward; LR finder output is a guide, not a final experiment.
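The mechanics of a range test can be sketched on a toy problem. Everything here is an assumption made for illustration: the `ToyQuadratic` class stands in for a real model, the LR grows geometrically, and the final suggestion uses one common heuristic (the LR at the lowest recorded loss divided by ten) rather than the steepest-descent rule.

```python
def lr_range_test(train_step, lr_start=1e-4, lr_end=10.0, num_steps=100, diverge_factor=4.0):
    """Call train_step(lr) with a geometrically increasing lr; stop on divergence."""
    ratio = (lr_end / lr_start) ** (1.0 / (num_steps - 1))
    lrs, losses = [], []
    best = float("inf")
    lr = lr_start
    for _ in range(num_steps):
        loss = train_step(lr)
        lrs.append(lr)
        losses.append(loss)
        best = min(best, loss)
        if loss > diverge_factor * best:  # loss has blown up relative to the best seen
            break
        lr *= ratio
    return lrs, losses


class ToyQuadratic:
    """Toy problem: loss = 0.5 * curvature * w**2, stable only for lr < 2 / curvature."""

    def __init__(self, curvature=10.0, w=1.0):
        self.curvature = curvature
        self.w = w

    def train_step(self, lr):
        loss = 0.5 * self.curvature * self.w ** 2
        self.w -= lr * self.curvature * self.w  # one gradient-descent step
        return loss


problem = ToyQuadratic()
lrs, losses = lr_range_test(problem.train_step)

# Heuristic (an assumption, not the only option): a tenth of the LR at the lowest loss.
suggested_lr = lrs[losses.index(min(losses))] / 10.0
```

With a real model, `train_step` would run one mini-batch update at the given LR and return the batch loss, and the weights would be reset before proper training starts.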
Log every trial's parameters, random seed, code version, data version, hardware, dependencies, metric, artifacts, and failure reason. Use deterministic trial IDs. Save top-k configs, not only the best. For distributed sweeps, make sure failed or preempted trials are marked correctly and that resumed trials do not duplicate results.
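One way to get deterministic trial IDs and avoid double-counting resumed trials is to hash the inputs that define a trial. The sketch below uses only the standard library; the record fields and the example values (`"abc123"`, `"v1"`) are hypothetical.

```python
import hashlib
import json


def trial_id(config, seed, code_version, data_version):
    """Stable ID: identical inputs always hash to the same trial ID."""
    payload = json.dumps(
        {"config": config, "seed": seed, "code": code_version, "data": data_version},
        sort_keys=True,  # key order must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def make_record(config, seed, code_version, data_version,
                metric=None, status="completed", failure_reason=None):
    """Build one log record for a trial, including failures and preemptions."""
    return {
        "trial_id": trial_id(config, seed, code_version, data_version),
        "config": config,
        "seed": seed,
        "code_version": code_version,
        "data_version": data_version,
        "metric": metric,
        "status": status,
        "failure_reason": failure_reason,
    }


def dedupe(records):
    """Keep one record per trial ID so a resumed trial is not counted twice."""
    latest = {}
    for record in records:
        latest[record["trial_id"]] = record  # later records overwrite earlier ones
    return list(latest.values())


config_a = {"lr": 0.001, "batch_size": 32}
config_b = {"batch_size": 32, "lr": 0.001}  # same values, different key order

records = [
    make_record(config_a, 0, "abc123", "v1", status="preempted", failure_reason="node lost"),
    make_record(config_b, 0, "abc123", "v1", metric=0.41),  # resumed run of the same trial
]
unique = dedupe(records)
```

Hardware, dependency versions, and artifact paths can be added to the record in the same way; they are left out of the hash here so that re-running a trial on a different machine keeps the same ID.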
After a sweep, inspect parameter importance, parallel coordinate plots, metric distributions, and learning curves. Look for unstable regions, overfitting to validation, invalid trials, and interactions. Confirm the winning configuration on an untouched test set or repeated seeds. If the improvement is within noise, prefer the simpler or cheaper model.
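A rough check for "within noise" is to compare repeated-seed results against the standard error of the difference in means. This is a simple heuristic sketch, not a formal significance test, and the scores below are made-up numbers used only to exercise the function.

```python
import math
import statistics


def improvement_exceeds_noise(baseline_scores, candidate_scores, k=2.0):
    """Lower is better. True if the mean improvement is larger than
    k standard errors of the difference between the two means."""
    mean_b = statistics.mean(baseline_scores)
    mean_c = statistics.mean(candidate_scores)
    var_b = statistics.variance(baseline_scores)   # sample variance across seeds
    var_c = statistics.variance(candidate_scores)
    std_err = math.sqrt(var_b / len(baseline_scores) + var_c / len(candidate_scores))
    return (mean_b - mean_c) > k * std_err


# Hypothetical validation losses over five seeds each (illustrative only)
baseline = [0.412, 0.405, 0.419, 0.409, 0.415]
candidate_small_gain = [0.408, 0.401, 0.417, 0.411, 0.406]
candidate_large_gain = [0.371, 0.366, 0.378, 0.369, 0.374]

small_gain_is_real = improvement_exceeds_noise(baseline, candidate_small_gain)
large_gain_is_real = improvement_exceeds_noise(baseline, candidate_large_gain)
```

When the check fails, as it does for the small gain in this example, the text's advice applies: prefer the simpler or cheaper configuration.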