distributions/codex/skills/ml-experiment-tracker/SKILL.md
Guides ML experiment logging, versioning, and reproducibility using tools like MLflow, Weights & Biases, and DVC for systematic model development.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add a-organvm/a-i--skills ml-experiment-tracker
This skill provides guidance for systematic machine learning experimentation with proper tracking, versioning, and reproducibility practices.
Every experiment should log:
| Category | Items | Why |
|----------|-------|-----|
| Code | Git commit hash, branch, diff | Reproduce exact code state |
| Data | Dataset version, hash, lineage | Know which data was used |
| Environment | Python version, dependencies, hardware | Reproduce runtime |
| Hyperparameters | All config values | Understand what changed |
| Metrics | Loss, accuracy, custom metrics | Compare performance |
| Artifacts | Models, plots, predictions | Preserve outputs |
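As a rough illustration of the code, data, environment, and hyperparameter rows above, the following standard-library sketch gathers that metadata into one dictionary. The function names (`git_commit_hash`, `file_sha256`, `collect_run_metadata`) and the choice of SHA-256 are assumptions made for this example, not part of any tracking tool's API.

```python
import hashlib
import platform
import subprocess
from pathlib import Path


def git_commit_hash() -> str:
    """Return the current commit hash, or "unknown" outside a Git repo."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a data file in chunks so large datasets do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def collect_run_metadata(data_path: Path, hyperparameters: dict) -> dict:
    """Gather code, data, environment, and hyperparameter metadata (hypothetical helper)."""
    return {
        "code": {"git_commit": git_commit_hash()},
        "data": {"path": str(data_path), "sha256": file_sha256(data_path)},
        "environment": {
            "python": platform.python_version(),
            "platform": platform.platform(),
        },
        "hyperparameters": dict(hyperparameters),
    }
```

The resulting dictionary is JSON-serializable, so it could be saved next to the run's other artifacts or passed to whichever tracker is in use.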
project/
├── experiments/
│   ├── baseline/              # Initial experiments
│   ├── feature-engineering/   # Data improvements
│   ├── architecture/          # Model changes
│   └── hyperparameter/        # Tuning runs
├── data/
│   ├── raw/                   # Original data (versioned)
│   ├── processed/             # Cleaned data
│   └── features/              # Feature store
└── models/
    ├── staging/               # Candidates
    └── production/            # Deployed models
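If it helps to start from an empty repository, the layout above could be scaffolded with a few lines of Python. This is only a sketch: `LAYOUT` and `scaffold` are names invented for the example, and the directory list simply mirrors the tree shown.

```python
from pathlib import Path

# Directories taken from the tree above, relative to the project root.
LAYOUT = [
    "experiments/baseline",
    "experiments/feature-engineering",
    "experiments/architecture",
    "experiments/hyperparameter",
    "data/raw",
    "data/processed",
    "data/features",
    "models/staging",
    "models/production",
]


def scaffold(root: Path) -> list[Path]:
    """Create every directory in LAYOUT under root; safe to run repeatedly."""
    created = []
    for relative in LAYOUT:
        target = root / relative
        target.mkdir(parents=True, exist_ok=True)
        created.append(target)
    return created
```

Calling `scaffold(Path("project"))` creates the nine leaf directories and their parents without touching anything that already exists.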
import mlflow
import mlflow.pytorch

# Assumes model, train_loader, val_loader, train_epoch, and evaluate
# are defined elsewhere in the training script.
epochs = 100

# Set experiment (creates if not exists)
mlflow.set_experiment("my-classification-project")

with mlflow.start_run(run_name="baseline-v1"):
    # Log parameters
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("batch_size", 32)
    mlflow.log_param("epochs", epochs)

    # Training loop
    for epoch in range(epochs):
        train_loss = train_epoch(model, train_loader)
        val_loss, val_acc = evaluate(model, val_loader)
        # Log metrics with step
        mlflow.log_metrics({
            "train_loss": train_loss,
            "val_loss": val_loss,
            "val_accuracy": val_acc
        }, step=epoch)

    # Log model
    mlflow.pytorch.log_model(model, "model")
    # Log artifacts (plots, configs)
    mlflow.log_artifact("confusion_matrix.png")
    mlflow.log_artifact("config.yaml")
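When the configuration is nested rather than a handful of scalars, one possible approach is to flatten it into dotted keys before logging, so that every value lands as its own parameter. The helper below is a hypothetical sketch (`flatten_config` is not an MLflow function); it uses only the standard library.

```python
def flatten_config(config: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted keys, e.g. {"train": {"epochs": 1}} -> {"train.epochs": 1}."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into nested sections, carrying the dotted prefix along.
            flat.update(flatten_config(value, name))
        else:
            flat[name] = value
    return flat
```

Inside an active run, the flattened dictionary could then be passed to `mlflow.log_params(...)` in a single call instead of one `mlflow.log_param` call per value.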
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Training   │───▶│   Staging    │───▶│  Production  │
│     Runs     │    │    Review    │    │   Deployed   │
└──────────────┘    └──────────────┘    └──────────────┘
       │                   │                   │
       ▼                   ▼                   ▼
   Candidate           Validated           Monitored
    Models              Models              Models
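Read left to right, the diagram describes a one-way promotion path. A minimal sketch of that ordering, assuming a model may only advance one step at a time, might look like this; `STAGES` and `promote` are hypothetical names for illustration and do not correspond to any registry API.

```python
# Stage order taken from the diagram; the lowercase labels are an assumption.
STAGES = ["training", "staging", "production"]


def promote(current: str) -> str:
    """Return the next stage in the flow, refusing unknown or final stages."""
    if current not in STAGES:
        raise ValueError(f"unknown stage: {current!r}")
    index = STAGES.index(current)
    if index == len(STAGES) - 1:
        raise ValueError(f"{current!r} is already the final stage")
    return STAGES[index + 1]
```

Keeping the transition rule in one place makes it harder for a candidate to skip the staging review by accident.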
Stages:
import wandb

# Assumes model, train_loader, val_loader, train_and_eval, and pred_grid
# are defined elsewhere in the training script.

# Initialize with config
config = {
    "learning_rate": 0.01,
    "architecture": "ResNet50",
    "dataset": "imagenet-subset",
    "epochs": 100
}
run = wandb.init(
    project="image-classification",
    group="architecture-experiments",  # Group related runs
    tags=["baseline", "resnet"],
    config=config,
    notes="Testing ResNet50 baseline on subset"
)

# Training with automatic logging
for epoch in range(config["epochs"]):
    metrics = train_and_eval(model, train_loader, val_loader)
    wandb.log(metrics)

# Log media
wandb.log({"predictions": wandb.Image(pred_grid)})
wandb.log({"confusion_matrix": wandb.plot.confusion_matrix(...)})
wandb.finish()
# sweep_config.yaml
program: train.py
method: bayes  # or grid, random
metric:
  name: val_accuracy
  goal: maximize
parameters:
  learning_rate:
    distribution: log_uniform_values
    min: 0.0001
    max: 0.1
  batch_size:
    values: [16, 32, 64, 128]
  optimizer:
    values: ["adam", "sgd", "adamw"]
early_terminate:
  type: hyperband
  min_iter: 10
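To make the `log_uniform_values` setting more concrete: it draws values whose logarithm is uniform between `log(min)` and `log(max)`, so each order of magnitude in the range is sampled about equally often. The sketch below reproduces that idea with the standard library only; `sample_log_uniform` is a hypothetical helper, not how W&B implements its sweeps.

```python
import math
import random


def sample_log_uniform(minimum: float, maximum: float, rng: random.Random) -> float:
    """Draw x in [minimum, maximum] such that log(x) is uniformly distributed."""
    log_value = rng.uniform(math.log(minimum), math.log(maximum))
    return math.exp(log_value)


# Same bounds as the learning_rate entry in the sweep config above.
rng = random.Random(0)
learning_rates = [sample_log_uniform(0.0001, 0.1, rng) for _ in range(5)]
```

Compared with a plain uniform draw, this spends far more trials on small learning rates, which is usually what a learning-rate search needs.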
# Initialize DVC in git repo
dvc init

# Track large files
dvc add data/training.csv
git add data/training.csv.dvc data/.gitignore
git commit -m "Add training data v1"

# Push to remote storage
dvc remote add -d storage s3://bucket/dvc
dvc push

# Create pipeline stage (writes it to dvc.yaml; `dvc run` was removed in DVC 3.0)
dvc stage add -n preprocess \
    -d src/preprocess.py -d data/raw \
    -o data/processed \
    python src/preprocess.py

# Run or reproduce the pipeline
dvc repro
# dvc.yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw/
    outs:
      - data/processed/
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/
    params:
      - train.epochs
      - train.learning_rate
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
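The `train` stage above declares `metrics.json` as a metrics file, so `src/train.py` has to write it at the end of training. A minimal sketch of that step follows; the `write_metrics` helper and the metric names in the usage note are assumptions for illustration.

```python
import json
from pathlib import Path


def write_metrics(metrics: dict, path: Path = Path("metrics.json")) -> None:
    """Write final metrics as JSON at the path the DVC stage declares."""
    # Sorted keys keep the file stable between runs, which makes Git diffs readable.
    path.write_text(json.dumps(metrics, indent=2, sort_keys=True) + "\n")
```

At the end of training, a call such as `write_metrics({"val_accuracy": val_acc, "val_loss": val_loss})` would produce the file. Because the stage sets `cache: false`, the file stays in Git rather than the DVC cache, and `dvc metrics show` can display it.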
experiment: {project}-{objective}
run: {date}-{description}-{variant}
model: {architecture}-{dataset}-{version}
Examples:
experiment: fraud-detection-baseline
run: 2024-01-15-xgboost-tuning-lr001
model: xgboost-transactions-v2.3.1
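The templates above are simple enough to generate in code, which avoids hand-typed inconsistencies. The following is one possible sketch; the three helper names are hypothetical, and the version string is assumed to carry its own `v` prefix as in the example.

```python
from datetime import date


def experiment_name(project: str, objective: str) -> str:
    """Build an experiment name as {project}-{objective}."""
    return f"{project}-{objective}"


def run_name(day: date, description: str, variant: str) -> str:
    """Build a run name as {date}-{description}-{variant} with an ISO date."""
    return f"{day.isoformat()}-{description}-{variant}"


def model_name(architecture: str, dataset: str, version: str) -> str:
    """Build a model name as {architecture}-{dataset}-{version}."""
    return f"{architecture}-{dataset}-{version}"
```

For instance, `run_name(date(2024, 1, 15), "xgboost-tuning", "lr001")` reproduces the run name in the examples above.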
Track these metrics for model comparison:
Each significant experiment should document:
- references/mlflow-setup.md - MLflow installation and configuration
- references/wandb-patterns.md - Advanced W&B features and sweeps
- references/reproducibility-checklist.md - Detailed reproducibility guide