plugins/ml-master/skills/ml-cloud-deployment/SKILL.md
This skill should be used when the user asks to deploy, scale, or cost-optimize ML workloads on cloud platforms. PROACTIVELY activate for: (1) AWS SageMaker Studio, Training, Processing, Pipelines, Endpoints, Model Monitor, Feature Store, Clarify, Ground Truth, EC2 GPUs, EKS, Lambda, Inferentia, Trainium, (2) GCP Vertex AI Training, Pipelines, Endpoints, Feature Store, Model Monitoring, AutoML, Matching Engine, TPU, GKE, Cloud Run, (3) Azure ML workspaces, pipelines, managed endpoints, AutoML, Responsible ML, AKS/ACI, (4) Databricks, Modal, Replicate, RunPod, Lambda Labs, Anyscale. Provides: cloud ML architecture, autoscaling, hardware, security, and cost guidance.
Use this skill for deploying ML workloads to managed platforms, Kubernetes, serverless systems, GPU/TPU providers, and lakehouse environments. Start from workload requirements: training or inference, batch or online, latency SLO, throughput, model size, data gravity, compliance, region, hardware, team expertise, and budget.
| Requirement | Strong choices |
|---|---|
| AWS-native managed lifecycle | SageMaker Studio, Training, Processing, Pipelines, Model Registry, Endpoints, Feature Store, Clarify, Model Monitor |
| GCP-native managed lifecycle | Vertex AI Training, Pipelines, Endpoints, Feature Store, Model Monitoring, AutoML, Matching Engine, TPUs |
| Azure-native managed lifecycle | Azure ML workspaces, compute clusters, pipelines, registries, managed online/batch endpoints, AutoML, Responsible ML |
| Lakehouse/Spark-centric ML | Databricks on AWS/Azure/GCP with MLflow, Delta, Feature Store, Workflows |
| Kubernetes control | EKS/GKE/AKS with Kubeflow, KServe, Seldon, Ray, Triton, custom operators |
| Serverless or fast GPU apps | Modal, Replicate, Cloud Run with GPU where available, Lambda for small CPU inference |
| Flexible GPU rental | Lambda Labs, RunPod, self-managed cloud GPU VMs |
| Ray-native scale-out | Anyscale or Ray clusters on Kubernetes/cloud VMs |
Prefer managed services when governance, observability, and team velocity matter more than runtime customization. Prefer Kubernetes or VMs when custom networking, specialized runtimes, or cost/performance tuning dominate.
SageMaker handles managed training jobs, feature stores, and real-time or serverless endpoints. The AWS CLI (aws sagemaker and aws sagemaker-runtime) is used to orchestrate these systems programmatically.
Endpoint configuration (endpoint_config.json):
{
  "EndpointConfigName": "production-llm-classifier-v1-cfg",
  "ProductionVariants": [
    {
      "VariantName": "AllTraffic",
      "ModelName": "llm-classifier-model-v1",
      "InitialInstanceCount": 2,
      "InstanceType": "ml.g5.2xlarge",
      "InitialVariantWeight": 1.0,
      "VolumeSizeInGB": 50,
      "ManagedInstanceScaling": {
        "MinInstanceCount": 2,
        "MaxInstanceCount": 10,
        "Status": "ENABLED"
      },
      "RoutingConfig": {
        "RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"
      }
    }
  ]
}
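Before submitting a configuration like the one above, it can help to sanity-check the scaling bounds locally. The sketch below is a minimal illustration: `check_variant_scaling` is a hypothetical helper, not part of any AWS SDK, and the rule that the initial count should sit between the minimum and maximum is an assumption about what a sensible config looks like rather than a documented service constraint.

```python
import json


def check_variant_scaling(config: dict) -> list[str]:
    """Return human-readable problems found in each variant's scaling bounds."""
    problems = []
    for variant in config.get("ProductionVariants", []):
        name = variant.get("VariantName", "<unnamed>")
        initial = variant.get("InitialInstanceCount", 0)
        scaling = variant.get("ManagedInstanceScaling")
        if scaling is None:
            continue  # nothing to check when managed scaling is not configured
        low = scaling.get("MinInstanceCount", initial)
        high = scaling.get("MaxInstanceCount", initial)
        if low > high:
            problems.append(f"{name}: MinInstanceCount {low} > MaxInstanceCount {high}")
        elif not low <= initial <= high:
            problems.append(f"{name}: InitialInstanceCount {initial} outside [{low}, {high}]")
    return problems


# Same scaling values as the endpoint configuration in this section
config = json.loads("""
{
  "EndpointConfigName": "production-llm-classifier-v1-cfg",
  "ProductionVariants": [
    {
      "VariantName": "AllTraffic",
      "InitialInstanceCount": 2,
      "ManagedInstanceScaling": {"MinInstanceCount": 2, "MaxInstanceCount": 10, "Status": "ENABLED"}
    }
  ]
}
""")

problems = check_variant_scaling(config)
print(problems or "scaling bounds look consistent")
```

Running a check like this in CI catches inverted bounds before the deployment call is ever made.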
# 1. Set up a Domain with default IAM role and VPC settings
aws sagemaker create-domain \
--domain-name ProdMLDomain \
--auth-mode IAM \
--default-user-settings ExecutionRole=arn:aws:iam::123456789012:role/SageMakerExecutionRole \
--subnet-ids subnet-1a2b3c,subnet-4d5e6f \
--vpc-id vpc-0abc123 \
--app-network-access-type PublicInternetOnly
# 2. Provision and start/stop an interactive notebook instance
aws sagemaker create-notebook-instance \
--notebook-instance-name dev-notebook-instance \
--instance-type ml.t3.medium \
--role-arn arn:aws:iam::123456789012:role/SageMakerExecutionRole
aws sagemaker start-notebook-instance --notebook-instance-name dev-notebook-instance
aws sagemaker stop-notebook-instance --notebook-instance-name dev-notebook-instance
# 3. Submit a managed GPU training job
aws sagemaker create-training-job \
--training-job-name custom-pytorch-train-job \
--algorithm-specification TrainingImage=763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04,TrainingInputMode=File \
--role-arn arn:aws:iam::123456789012:role/SageMakerExecutionRole \
--input-data-config '[{"ChannelName": "training", "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": "s3://my-ml-bucket/train-data/", "S3DataDistributionType": "FullyReplicated"}}}]' \
--output-data-config S3OutputPath=s3://my-ml-bucket/checkpoints/ \
--resource-config InstanceType=ml.g5.2xlarge,InstanceCount=1,VolumeSizeInGB=50 \
--stopping-condition MaxRuntimeInSeconds=86400
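# 4. Create a Model Registry group and a deployable SageMaker model
# Note: create-model defines a hosting model; registering a version in the group
# requires a separate create-model-package call.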
aws sagemaker create-model-package-group \
--model-package-group-name credit-scoring-group \
--model-package-group-description "Model registry group for credit scoring neural nets"
aws sagemaker create-model \
--model-name credit-scoring-v1 \
--primary-container Image=763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.12.0-gpu-py38,ModelDataUrl=s3://my-ml-bucket/checkpoints/custom-pytorch-train-job/output/model.tar.gz \
--execution-role-arn arn:aws:iam::123456789012:role/SageMakerExecutionRole
# 5. Create Endpoint Configuration and Deploy Endpoint
aws sagemaker create-endpoint-config \
--endpoint-config-name credit-scoring-v1-cfg \
--production-variants '[{"VariantName": "AllTraffic", "ModelName": "credit-scoring-v1", "InitialInstanceCount": 2, "InstanceType": "ml.m5.xlarge", "InitialVariantWeight": 1.0}]'
aws sagemaker create-endpoint \
--endpoint-name credit-scoring-prod-endpoint \
--endpoint-config-name credit-scoring-v1-cfg
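# 6. Invoke/Test Endpoint via CLI
# AWS CLI v2 expects base64 for blob parameters by default, so pass the raw JSON body explicitly
aws sagemaker-runtime invoke-endpoint \
--endpoint-name credit-scoring-prod-endpoint \
--content-type application/json \
--cli-binary-format raw-in-base64-out \
--body '{"features": [0.25, 1.4, 0.9]}' \
response_output.json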
Vertex AI manages training pipelines, hyperparameter tuning sweeps, custom jobs, the model registry, and predictions via gcloud ai.
Hyperparameter tuning job spec (tpu_job.yaml):
displayName: vertex-ai-tpu-training-job
studySpec:
  metrics:
  - metricId: val_loss
    goal: MINIMIZE
  parameters:
  - parameterId: learning_rate
    doubleValueSpec:
      # Written with a decimal point so YAML 1.1 parsers read these as floats, not strings
      minValue: 1.0e-5
      maxValue: 1.0e-3
    scaleType: UNIT_LOG_SCALE
trialJobSpec:
  workerPoolSpecs:
  - machineSpec:
      machineType: n1-standard-8
    replicaCount: 1
    containerSpec:
      imageUri: gcr.io/my-project/ml-trainer:latest
  # TPU machine and accelerator names differ by TPU generation; confirm them in the Vertex AI docs
  - machineSpec:
      machineType: cloud-tpu-v4-podslice
      acceleratorType: TPU_V4
      acceleratorCount: 4
    replicaCount: 1
    containerSpec:
      imageUri: gcr.io/my-project/ml-tpu-trainer:latest
      args: [
        "--data_path", "gs://my-bucket/dataset-v1",
        "--epochs", "10"
      ]
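The `UNIT_LOG_SCALE` setting in the spec above means the learning-rate range is mapped onto the unit interval logarithmically before the search runs. As a rough illustration of that mapping (this is only the arithmetic, not the tuning service's search algorithm, and the function names are hypothetical):

```python
import math


def to_unit_log(value: float, low: float, high: float) -> float:
    """Map value in [low, high] to [0, 1] on a log scale; bounds must be positive."""
    if not 0 < low < high:
        raise ValueError("log scaling needs 0 < low < high")
    return (math.log(value) - math.log(low)) / (math.log(high) - math.log(low))


def from_unit_log(unit: float, low: float, high: float) -> float:
    """Inverse mapping: take a point in [0, 1] back to the original range."""
    if not 0 < low < high:
        raise ValueError("log scaling needs 0 < low < high")
    return math.exp(math.log(low) + unit * (math.log(high) - math.log(low)))


# The middle of the unit interval lands on the geometric mean of the bounds,
# so each decade of learning rates gets equal search attention.
midpoint = from_unit_log(0.5, 1e-5, 1e-3)
print(f"{midpoint:.2e}")
```

With a linear scale the same midpoint would be roughly 5e-4, leaving the lower decade of the range almost unexplored.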
# 1. Create a user-managed Jupyter Workbench instance
gcloud notebooks instances create dev-workbench-instance \
--location=us-central1-a \
--vm-image-project=deeplearning-platform-release \
--vm-image-family=common-gpu \
--machine-type=n1-standard-4
# 2. Submit containerized custom training job
gcloud ai custom-jobs create \
--region=us-central1 \
--display-name=gpu-pytorch-training-run \
--worker-pool-spec=replica-count=1,machine-type=n1-standard-8,accelerator-type=NVIDIA_TESLA_T4,accelerator-count=1,container-image-uri=gcr.io/my-gcp-project/pytorch-train-custom:v1 \
--args="--epochs=20,--data-path=gs://my-bucket/training-gold"
# 3. Upload model artifact to Model Registry
gcloud ai models upload \
--region=us-central1 \
--display-name=iris-classifier-v1 \
--container-image-uri=us-docker.pkg.dev/vertex-ai/prediction/pytorch-cpu.1-11:latest \
--artifact-uri=gs://my-bucket/models/iris-classifier/
# 4. Provision endpoint and deploy model with 100% traffic allocation
# Note: replace <ENDPOINT-ID> and <MODEL-ID> with their UUIDs
gcloud ai endpoints create \
--region=us-central1 \
--display-name=production-iris-endpoint
gcloud ai endpoints deploy-model <ENDPOINT-ID> \
--region=us-central1 \
--model=<MODEL-ID> \
--display-name=iris-v1-blue-deployment \
--machine-type=n1-standard-4 \
--min-replica-count=2 \
--max-replica-count=10 \
--traffic-split=0=100
# 5. Predict via Vertex AI Endpoint CLI
gcloud ai endpoints predict <ENDPOINT-ID> \
--region=us-central1 \
--json-request=sample_payload.json
Azure AI Foundry / Azure Machine Learning Workspace orchestrates the entire ML lifecycle. Workspace, computes, environments, data assets, training jobs, models, and endpoints can be managed natively using the Azure CLI ml extension. For Azure ML code asset registration from CI, ADF orchestration, ADF WebActivity networking, result.version propagation, pointer blobs, or private storage firewall handling, load ml-azureml-adf-automation.
Managed online endpoint and deployment (azure_deploy.yaml):
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: credit-risk-endpoint
auth_mode: key
---
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: credit-risk-blue
endpoint_name: credit-risk-endpoint
model: azureml:credit-risk-model:1
code_configuration:
  code: ./src
  scoring_script: score.py
environment: azureml:credit-risk-env:1
instance_type: Standard_DS3_v2
instance_count: 2
request_settings:
  request_timeout_ms: 3000
  max_concurrent_requests_per_instance: 10
scale_settings:
  type: default
Azure CLI commands (az ml):
# Add/update the Machine Learning CLI extension
az extension add -n ml -y
az login
# 1. Create a Resource Group and Azure ML Workspace (Azure AI Foundry Hub)
az group create --name my-ml-rg --location eastus
az ml workspace create --name my-ml-workspace --resource-group my-ml-rg --location eastus
# 2. Provision Computes (Instance, Training Cluster, CPU cluster)
az ml compute create --name dev-compute-ci --type ComputeInstance --size Standard_DS3_v2 -g my-ml-rg -w my-ml-workspace
az ml compute create --name gpu-train-cluster --type AmlCompute --size Standard_NC6s_v3 --min-instances 0 --max-instances 4 -g my-ml-rg -w my-ml-workspace
az ml compute create --name cpu-proc-cluster --type AmlCompute --size Standard_DS12_v2 --min-instances 0 --max-instances 8 -g my-ml-rg -w my-ml-workspace
# 3. Create a URI Data Asset
az ml data create --file data.yaml -g my-ml-rg -w my-ml-workspace
# 4. Create an Environment definition
az ml environment create --file env.yaml -g my-ml-rg -w my-ml-workspace
# 5. Submit, Monitor, and Stream a Command Job
az ml job create --file job.yaml -g my-ml-rg -w my-ml-workspace --web --stream
# View, list and monitor jobs
az ml job list -g my-ml-rg -w my-ml-workspace --output table
az ml job stream --name <job-id> -g my-ml-rg -w my-ml-workspace
az ml job download --name <job-id> --download-path ./outputs -g my-ml-rg -w my-ml-workspace
# 6. Register a Model
az ml model create --file model.yaml -g my-ml-rg -w my-ml-workspace
# 7. Create Endpoint, Deploy, Test, and Cleanup
az ml online-endpoint create --file endpoint.yaml -g my-ml-rg -w my-ml-workspace
az ml online-deployment create --file deployment.yaml --all-traffic -g my-ml-rg -w my-ml-workspace
# List and monitor endpoints:
az ml online-endpoint show --name credit-risk-endpoint -g my-ml-rg -w my-ml-workspace
# Test endpoint with a sample payload file:
az ml online-endpoint invoke --name credit-risk-endpoint --request-file ./sample_request.json -g my-ml-rg -w my-ml-workspace
# Delete endpoint:
az ml online-endpoint delete --name credit-risk-endpoint --yes -g my-ml-rg -w my-ml-workspace
For local development or custom multi-cloud Kubernetes clusters, MLflow and KServe provide standardization.
# 1. Start a local tracking and model registry server
mlflow server \
--host 127.0.0.1 \
--port 5000 \
--backend-store-uri sqlite:///mlflow.db \
--default-artifact-root ./mlflow_artifacts_store
# 2. Run a packaged local MLflow Project with runtime parameters
mlflow run . -P alpha=0.1 -P l1_ratio=0.2 --experiment-name iris-training
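# 3. Serve a registered model locally
# MLflow 2.x replaced --no-conda with --env-manager local (use the current Python environment)
mlflow models serve \
--model-uri models:/iris-classifier/Production \
--port 5001 \
--env-manager local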
# 4. Invoke the local serving endpoint
curl -X POST -H "Content-Type: application/json" \
-d '{"dataframe_split": {"columns": ["sepal_len", "sepal_wid"], "data": [[5.1, 3.5]]}}' \
http://127.0.0.1:5001/invocations
# kserve-inference.yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
  namespace: "ml-serving"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
        requests:
          cpu: "500m"
          memory: 1Gi
# 1. Apply inference manifests via kubectl
kubectl apply -f kserve-inference.yaml -n ml-serving
# 2. Monitor status and readiness of inference service
kubectl get inferenceservice sklearn-iris -n ml-serving
# 3. Port forward or direct payload query through Istio ingress gateway
SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -n ml-serving -o jsonpath='{.status.url}' | cut -d/ -f3)
INGRESS_IP=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
curl -v -H "Host: ${SERVICE_HOSTNAME}" \
http://${INGRESS_IP}:${INGRESS_PORT}/v1/models/sklearn-iris:predict \
-d @sample_payload.json
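The curl call above posts `sample_payload.json`, which is never shown. A minimal sketch of generating it follows, assuming the KServe v1 prediction protocol (a JSON object with an `instances` list) and the four-feature iris model. `build_v1_predict_payload` is a hypothetical helper name and the feature values are illustrative.

```python
import json
from pathlib import Path


def build_v1_predict_payload(rows, n_features=4):
    """Wrap feature rows in the {"instances": [...]} envelope of the v1 protocol."""
    if not rows:
        raise ValueError("at least one instance is required")
    instances = []
    for row in rows:
        if len(row) != n_features:
            raise ValueError(f"expected {n_features} features, got {len(row)}")
        instances.append([float(value) for value in row])
    return {"instances": instances}


# Two illustrative iris measurements: sepal length/width, petal length/width
payload = build_v1_predict_payload([[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]])
Path("sample_payload.json").write_text(json.dumps(payload), encoding="utf-8")
print(json.dumps(payload))
```

Validating the row width before writing the file avoids a round trip to the cluster just to learn that the payload shape was wrong.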
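For serverless GPU inference that scales to zero between requests, use Modal. Loading the model in a container-start hook (`@modal.enter()`) keeps the load cost out of each request, although a cold container still pays it once.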
import modal

app = modal.App("serverless-image-classifier")

# Slim Debian base image with the Python dependencies installed via pip
image = (
    modal.Image.debian_slim()
    .pip_install("torch", "torchvision", "transformers")
)


@app.cls(gpu="A10G", image=image)
class ImageClassifier:
    @modal.enter()
    def load_model(self):
        # Runs once per container start, so the weights are loaded once and reused
        import torch
        from transformers import ViTForImageClassification, ViTImageProcessor

        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
        self.model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224").to(self.device)

    @modal.method()
    def predict(self, image_bytes: bytes):
        import io

        import torch
        from PIL import Image

        # ViT expects 3-channel input, so normalize grayscale or RGBA uploads to RGB
        image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
        inputs = self.processor(images=image, return_tensors="pt").to(self.device)
        with torch.no_grad():  # inference only, skip gradient tracking
            outputs = self.model(**inputs)
        predicted_class_idx = outputs.logits.argmax(-1).item()
        return self.model.config.id2label[predicted_class_idx]
Match hardware to the bottleneck. Estimate memory before launching: model weights, optimizer states, gradients, activations, KV cache, batch size, sequence length, and runtime overhead. For inference, benchmark p50/p95/p99 latency under realistic concurrency and payload sizes.
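The memory estimate described above can be roughed out before any instance is launched. The sketch below is a back-of-the-envelope calculation under stated assumptions: mixed-precision Adam training at about 16 bytes per parameter (2 for weights, 2 for gradients, 12 for fp32 master weights and two optimizer moments), a flat overhead fraction, and a hypothetical model shape. Activation memory is passed in as a guess because it depends heavily on batch size, sequence length, and checkpointing.

```python
GIB = 1024 ** 3


def training_memory_gib(num_params, weight_bytes=2, grad_bytes=2,
                        optimizer_bytes=12, activation_gib=0.0,
                        overhead_fraction=0.10):
    """Approximate per-replica training memory in GiB (no sharding assumed)."""
    state_gib = num_params * (weight_bytes + grad_bytes + optimizer_bytes) / GIB
    return (state_gib + activation_gib) * (1 + overhead_fraction)


def kv_cache_gib(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                 bytes_per_value=2):
    """Approximate inference KV cache size in GiB."""
    # The leading factor of 2 covers the separate key and value tensors per layer
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bytes_per_value / GIB


# Hypothetical 7B-parameter model with an 8 GiB activation guess
train_gib = training_memory_gib(7_000_000_000, activation_gib=8.0)

# Hypothetical shape: 32 layers, 32 KV heads of dim 128, 4096 tokens, batch 1, fp16
cache_gib = kv_cache_gib(32, 32, 128, 4096, 1)

print(f"training estimate: {train_gib:.1f} GiB")
print(f"KV cache estimate: {cache_gib:.1f} GiB")  # 2.0 GiB for this shape
```

Estimates like these only bound the problem; the latency benchmark under realistic concurrency remains the deciding measurement.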
Define request schema, response schema, timeout, max payload, auth, rate limits, autoscaling policy, health checks, logging, and rollback. Use canary or blue/green releases for production. For GPU endpoints, tune dynamic batching and concurrency carefully; too much concurrency can increase tail latency or cause OOM. For multi-region deployments, replicate artifacts, keep model versions consistent, and plan data residency and failover.
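The canary guidance above can be sketched as a small gate that either widens the canary's traffic share or rolls it back based on tail latency. This is an illustrative policy only: the function names, the 10-point step, and the single p99 criterion are assumptions, and a real rollout would usually also watch error rates and business metrics.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def next_canary_weight(current_weight, canary_latencies_ms, p99_budget_ms, step=10):
    """Return the next canary traffic percentage, or 0 to signal a rollback."""
    if percentile(canary_latencies_ms, 99) > p99_budget_ms:
        return 0  # tail latency over budget: send all traffic back to the stable version
    return min(100, current_weight + step)


# Healthy canary: one slow outlier in 100 requests stays within the p99 budget
healthy = [100] * 99 + [500]
print(next_canary_weight(10, healthy, p99_budget_ms=300))

# Unhealthy canary: two slow requests in 100 push p99 over the budget
unhealthy = [100] * 98 + [500, 500]
print(next_canary_weight(10, unhealthy, p99_budget_ms=300))
```

Returning 0 rather than raising keeps the rollback decision explicit for whatever orchestration applies the new traffic split.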
Use private networking when models access sensitive data. Store secrets in cloud secret managers. Use least-privilege IAM/service accounts/managed identities. Encrypt data and artifacts. Log access and deployment events. Scan containers. Avoid baking credentials into images. For public endpoints, apply auth, input validation, abuse controls, rate limits, and prompt-injection defenses for LLM/RAG systems.
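For the input-validation point above, a minimal sketch of a request check that could sit in front of a public endpoint is shown here. It assumes the three-feature `{"features": [...]}` body used in the SageMaker invoke example; the size limit, constant names, and `validate_request` function are hypothetical choices for illustration.

```python
import json
import math

MAX_PAYLOAD_BYTES = 64 * 1024  # hypothetical limit; tune to the real payload size
EXPECTED_FEATURES = 3          # matches the three-feature example request


def validate_request(raw: bytes) -> list[float]:
    """Return cleaned features, or raise ValueError for anything unexpected."""
    if len(raw) > MAX_PAYLOAD_BYTES:
        raise ValueError("payload too large")
    try:
        body = json.loads(raw)
    except ValueError as exc:  # covers malformed JSON and invalid UTF-8
        raise ValueError("body is not valid JSON") from exc
    if not isinstance(body, dict) or set(body) != {"features"}:
        raise ValueError("body must be an object with only a 'features' key")
    features = body["features"]
    if not isinstance(features, list) or len(features) != EXPECTED_FEATURES:
        raise ValueError(f"'features' must be a list of {EXPECTED_FEATURES} numbers")
    cleaned = []
    for value in features:
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("features must be numbers")
        try:
            number = float(value)
        except OverflowError as exc:
            raise ValueError("feature out of range") from exc
        if not math.isfinite(number):
            raise ValueError("features must be finite")
        cleaned.append(number)
    return cleaned


print(validate_request(b'{"features": [0.25, 1.4, 0.9]}'))
```

Rejecting unknown keys and non-finite values at the edge keeps malformed input away from the model container, where failures are slower and harder to diagnose.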
Use spot/preemptible instances for checkpointed training and stateless batch jobs. Use autoscaling endpoints with scale-to-zero when acceptable. Prefer batch inference for non-real-time workloads. Quantize or distill models before scaling replicas. Right-size GPU memory, not just GPU count. Use reserved/committed capacity only after workload shape is stable. Track cost per run, per model version, and per 1,000 predictions.
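The cost-per-1,000-predictions metric mentioned above is simple arithmetic once hourly price and sustained throughput are known. The sketch below uses made-up prices and request rates purely for illustration; `cost_per_1k_predictions` is a hypothetical helper, and it assumes instances are billed for the full hour regardless of utilization.

```python
def cost_per_1k_predictions(hourly_rate_usd, instance_count, requests_per_second):
    """Serving cost in USD per 1,000 predictions at a sustained request rate."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    hourly_cost = hourly_rate_usd * instance_count
    predictions_per_hour = requests_per_second * 3600
    return hourly_cost / predictions_per_hour * 1000


# Hypothetical numbers: two instances at $1.50/hour serving 50 requests/second in total
baseline = cost_per_1k_predictions(1.50, 2, 50)

# Same fleet at a quarter of the traffic: idle capacity makes each prediction 4x as expensive
low_traffic = cost_per_1k_predictions(1.50, 2, 12.5)

print(f"baseline:    ${baseline:.4f} per 1k predictions")
print(f"low traffic: ${low_traffic:.4f} per 1k predictions")
```

Tracking this number per model version makes it visible when a larger model or an over-provisioned fleet is driving cost, which is where scale-to-zero, batch inference, or quantization pay off.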