.claude/skills/experiment-tracking/SKILL.md
# Experiment Tracking

## When to Load This Skill

Load when working with: MLflow experiments, run logging, model registry, artifact management, experiment comparison, cross-validation with tracking.

## Core Concepts

| Concept | Purpose |
|---------|---------|
| **Run** | Single training execution; logs params, metrics, artifacts |
| **Experiment** | Named collection of runs; logical grouping by model type or task |
| **Model Registry** | Versioned model store; stages: None → Staging → Production |
| **Artifact** | Any file output: model weights, plots, feature importance |
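
## Run Logging

Always use the context manager; never log outside a run:

```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier

mlflow.set_experiment("my-experiment")

# X_train, y_train, X_val, y_val are assumed to be prepared upstream.
# A random forest is assumed here to match the "baseline-rf" run name.
params = {
    "n_estimators": 100,
    "max_depth": 5,
    "random_state": 42,
}
model = RandomForestClassifier(**params)

with mlflow.start_run(run_name="baseline-rf") as run:
    mlflow.log_params(params)

    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)

    mlflow.log_metric("val_accuracy", score)
    mlflow.sklearn.log_model(model, "model")

    # Keep the run ID for later registration
    run_id = run.info.run_id
```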
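
## Autolog

Use autolog for quick iteration; switch to explicit logging before production so you control exactly what is recorded:

```python
# Enable autologging before fitting so the training call is captured
mlflow.sklearn.autolog(
    log_input_examples=True,
    log_model_signatures=True,
    log_models=True,
    silent=True,
)

with mlflow.start_run():
    model.fit(X_train, y_train)
```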

## Cross-Validation with Tracking

Log CV results as metrics, using the fold number as the step index:

```python
from sklearn.model_selection import cross_val_score

with mlflow.start_run():
    mlflow.log_params({"cv_folds": 5, "model": "RandomForest"})

    scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")

    # One metric point per fold, indexed by step
    for i, score in enumerate(scores):
        mlflow.log_metric("cv_f1", score, step=i)

    # Summary statistics across folds
    mlflow.log_metric("cv_f1_mean", scores.mean())
    mlflow.log_metric("cv_f1_std", scores.std())
```
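
The summary metrics can also be collected into one dictionary before logging, which keeps the metric names in one place. The following is a minimal sketch, assuming a hypothetical helper `summarize_cv` (not part of MLflow) that works on plain Python floats. It uses `statistics.pstdev` because NumPy's `std()` computes the population standard deviation by default, so the two should agree:

```python
from statistics import fmean, pstdev
from typing import Dict, Sequence


def summarize_cv(scores: Sequence[float], prefix: str = "cv_f1") -> Dict[str, float]:
    """Return mean and population std of fold scores, keyed for metric logging."""
    values = [float(s) for s in scores]
    if not values:
        raise ValueError("scores must contain at least one fold result")
    return {
        f"{prefix}_mean": fmean(values),
        f"{prefix}_std": pstdev(values),
    }


# Illustrative fold scores, not real results
summary = summarize_cv([0.80, 0.90, 1.00])
print(summary)
```

Inside an active run, a dictionary like this can be passed to `mlflow.log_metrics(summary)` in a single call.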
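
## Model Registry

```python
# run_id comes from the training run that logged the model
model_uri = f"runs:/{run_id}/model"
registered = mlflow.register_model(model_uri, "my-classifier")

# Note: model stages are deprecated in recent MLflow releases (2.9+) in favor of aliases
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="my-classifier",
    version=registered.version,
    stage="Staging",
)
```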

Loading a registered model:

```python
model = mlflow.sklearn.load_model("models:/my-classifier/Staging")
```
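
The two URI schemes used in this section (`runs:/` for a model logged under a run, `models:/` for a registered model) are plain strings. A small sketch of building them consistently, assuming hypothetical helper names `run_model_uri` and `registry_model_uri` that are not part of MLflow:

```python
def run_model_uri(run_id: str, artifact_path: str = "model") -> str:
    """URI of a model logged under a run, in the form runs:/<run_id>/<artifact_path>."""
    if not run_id:
        raise ValueError("run_id must be non-empty")
    return f"runs:/{run_id}/{artifact_path}"


def registry_model_uri(name: str, version_or_stage: str) -> str:
    """URI of a registered model, addressed by version number or stage name."""
    if not name or not str(version_or_stage):
        raise ValueError("name and version_or_stage must be non-empty")
    return f"models:/{name}/{version_or_stage}"


# "abc123" is a placeholder run ID used only for illustration
print(run_model_uri("abc123"))
print(registry_model_uri("my-classifier", "Staging"))
```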
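
## Experiment Comparison

```python
client = mlflow.tracking.MlflowClient()

# Look up the experiment ID by name instead of hard-coding it
experiment = client.get_experiment_by_name("my-experiment")

runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.val_f1 DESC"],
    max_results=10,
)

# Metrics that a run never logged come back as None
for run in runs:
    print(run.info.run_id, run.data.metrics.get("val_f1"))
```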
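
`run.data.metrics.get(...)` returns `None` for runs that never logged the metric, so any manual comparison has to skip those runs. A minimal sketch over plain dictionaries, assuming a hypothetical `best_run` helper and made-up run IDs and values purely for illustration:

```python
from typing import Dict, Optional, Tuple


def best_run(
    metrics_by_run: Dict[str, Dict[str, float]],
    metric: str,
) -> Optional[Tuple[str, float]]:
    """Return (run_id, value) with the highest metric, ignoring runs that lack it."""
    candidates = [
        (run_id, metrics[metric])
        for run_id, metrics in metrics_by_run.items()
        if metric in metrics
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[1])


# Illustrative data only; in practice it could be built as
# {run.info.run_id: run.data.metrics for run in runs}.
example = {
    "run-a": {"val_f1": 0.81},
    "run-b": {"val_accuracy": 0.90},  # never logged val_f1
    "run-c": {"val_f1": 0.87},
}
print(best_run(example, "val_f1"))
```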
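
## Model Serving

`--env-manager local` serves the model from the current environment (older MLflow 1.x releases used `--no-conda` for this):

```bash
mlflow models serve -m "models:/my-classifier/Production" --port 5001 --env-manager local
```

Request format:

```bash
curl -X POST http://localhost:5001/invocations \
  -H "Content-Type: application/json" \
  -d '{"dataframe_records": [{"feature1": 1.0, "feature2": 2.0}]}'
```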
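
The same request can be prepared from Python. This is a sketch using only the standard library, assuming a hypothetical helper name `build_invocation_request` and the local URL from the `curl` example. It builds the request without sending it, so it runs without a model server:

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_invocation_request(
    records: List[Dict[str, Any]],
    url: str = "http://localhost:5001/invocations",
) -> urllib.request.Request:
    """Build (but do not send) a POST request in the dataframe_records format."""
    body = json.dumps({"dataframe_records": records}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_invocation_request([{"feature1": 1.0, "feature2": 2.0}])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending it with `urllib.request.urlopen(request)` requires the server from the previous command to be running.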
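
## Artifact Management

```python
with mlflow.start_run():
    # fig is assumed to be an existing matplotlib Figure
    fig.savefig("confusion_matrix.png")
    mlflow.log_artifact("confusion_matrix.png", artifact_path="plots")

    # Dictionaries and strings can be logged directly as files
    mlflow.log_dict(feature_importance_dict, "feature_importance.json")
    mlflow.log_text(classification_report_str, "classification_report.txt")
```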

## Project Structure

```
src/{project_name}/
├── training/
│   ├── train.py       # entry point: sets experiment, calls fit
│   ├── evaluate.py    # eval loop: logs metrics per epoch/fold
│   └── register.py    # promotes best run to Model Registry
├── mlruns/            # local tracking store (gitignore this)
└── mlflow.db          # local SQLite backend (gitignore this)
```

## Common Pitfalls

- Always use `with mlflow.start_run():`. Orphan runs (logged outside a context) pollute the experiment registry and are hard to clean up.
- Do not call `mlflow.end_run()` manually. The context manager handles it, and manual calls can corrupt the run state.
- Set the `MLFLOW_TRACKING_URI` env var in CI. The default is `./mlruns` (relative), which breaks across working directories.
- Call `mlflow.autolog()` before `model.fit()`. Calling it afterwards has no effect.
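
One way to avoid the relative `./mlruns` problem is to resolve the tracking URI once at startup. A minimal sketch, assuming a hypothetical helper `resolve_tracking_uri` and a made-up server address in the example call. It prefers the `MLFLOW_TRACKING_URI` environment variable and otherwise falls back to an absolute file URI under the project root:

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_tracking_uri(
    project_root: Path,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Prefer MLFLOW_TRACKING_URI; otherwise point at an absolute mlruns directory."""
    env = os.environ if env is None else env
    configured = env.get("MLFLOW_TRACKING_URI", "").strip()
    if configured:
        return configured
    # Absolute file URI, so the store does not depend on the working directory
    return (project_root.resolve() / "mlruns").as_uri()


# env is passed explicitly here so the example does not depend on the real environment
print(resolve_tracking_uri(Path.cwd(), env={}))
print(resolve_tracking_uri(Path.cwd(), env={"MLFLOW_TRACKING_URI": "http://tracking.example:5000"}))
```

The result can be passed to `mlflow.set_tracking_uri(...)` before `mlflow.set_experiment(...)` is called.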