plugins/ml-master/skills/ml-azureml-adf-automation/SKILL.md
This skill should be used when the user asks to automate Azure ML and Azure Data Factory production workflows. PROACTIVELY activate for: (1) Azure ML code asset registration, azure-ai-ml SDK, AML code versions, `result.version`, requested-vs-actual versions, (2) ADF to Azure ML orchestration, ADF WebActivity, managed identity blob reads, `connectVia`, managed VNet IR, (3) code asset version pointer blobs, latest.json contracts, training/scoring code version propagation, (4) private storage firewalls, Microsoft-hosted CI agents, temporary network rules, storage data-plane smoke tests, (5) marshmallow<4 pinning, AML SDK import failures, runtime validation for Azure ML infrastructure. Provides: operationally safe Azure ML + ADF automation patterns.
Use this skill for Azure Machine Learning automation that registers code assets in CI and orchestrates training, scoring, registration, or deployment through Azure Data Factory. The main invariant is that runtime systems must consume the exact Azure ML asset versions that were actually registered, not the versions a pipeline attempted to request. Validate every recommendation against runtime behavior because Azure ML, ADF, storage networking, and SDK dependency behavior can diverge from static API documentation.
Prefer the Python SDK for registering Azure ML code assets when automation must reliably capture the registered version. Use the Azure CLI only after confirming the target environment's az ml extension supports the needed code commands and returns enough information for downstream automation.
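```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities._assets._artifacts.code import Code
from azure.identity import AzureCliCredential

# subscription_id, resource_group, workspace_name, code_name,
# requested_version, and staged_code_path are supplied by the CI step.
ml_client = MLClient(
    AzureCliCredential(),
    subscription_id,
    resource_group,
    workspace_name,
)

# _code is a private operations group; re-verify it after SDK upgrades.
result = ml_client._code.create_or_update(
    Code(name=code_name, version=requested_version, path=staged_code_path)
)

# The version the service actually registered; it may differ from requested_version.
actual_version = result.version
print(actual_version)
```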
Do not assume requested_version == result.version. Azure ML code assets can deduplicate uploads by content hash and return an existing version when the staged directory matches a prior asset. That is useful storage behavior but dangerous if CI publishes a requested build identifier instead of the SDK-returned version.
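A minimal sketch of guarding against that mismatch, assuming the CI step holds both the requested version and the SDK-returned one. The helper `resolve_registered_version` and the `RegisteredVersion` type are hypothetical names for illustration, not part of azure-ai-ml.

```python
from typing import NamedTuple


class RegisteredVersion(NamedTuple):
    version: str                # the value downstream steps must consume
    differs_from_request: bool  # True when the service returned another version


def resolve_registered_version(requested: str, returned: str) -> RegisteredVersion:
    """Pick the SDK-returned version and flag a requested/actual mismatch.

    Hypothetical helper: `returned` is expected to be `result.version`
    from the registration call.
    """
    if not returned:
        raise ValueError("registration did not return a version")
    returned = str(returned)
    return RegisteredVersion(returned, returned != str(requested))


# Illustrative values only.
resolved = resolve_registered_version(requested="1042", returned="1017")
if resolved.differs_from_request:
    print(f"requested version was not used; publishing {resolved.version}")
```

Logging the mismatch rather than failing keeps deduplication harmless while making it visible in CI output.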
Publish the returned version as a pipeline output variable and wire downstream steps to that output.
```python
print(
    "##vso[task.setvariable "
    f"variable=trainingCodeVersion;isOutput=true]{result.version}"
)
```
Prefer:

```powershell
$registeredVersion = '$(RegisterTrainingCode.trainingCodeVersion)'
```

Avoid:

```powershell
$registeredVersion = '$(Build.BuildId)'
```
If unique asset versions are operationally required even when code content repeats, stage the code directory and write a small marker file such as .aml-code-asset-version before registration. Treat this only as a dedup workaround. The real contract remains the SDK-returned result.version.
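The staging step could look roughly like the sketch below. It assumes the code lives in a local directory and that a build identifier is available as a string; the function name `stage_with_marker` and the `code` subdirectory name are hypothetical. Only the marker file name comes from the text above.

```python
import shutil
import tempfile
from pathlib import Path

MARKER_NAME = ".aml-code-asset-version"


def stage_with_marker(source_dir: str, build_id: str) -> Path:
    """Copy source_dir into a fresh staging directory and add a marker file.

    The marker changes the content hash of the staged directory, so a
    repeated upload of otherwise identical code is not deduplicated.
    The source directory itself is left untouched.
    """
    staging_root = Path(tempfile.mkdtemp(prefix="aml-code-"))
    staged = staging_root / "code"
    shutil.copytree(source_dir, staged)
    (staged / MARKER_NAME).write_text(f"{build_id}\n", encoding="utf-8")
    return staged
```

The returned path would be passed as `path` when registering the asset. The value to publish downstream is still whatever the SDK returns.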
Avoid making ADF discover AML code versions through Azure ML ARM code-container endpoints unless that exact path has passed runtime validation. Some AML management endpoints can appear valid in REST documentation but fail from ADF WebActivity at execution time with unsupported-operation behavior. Treat that as service behavior until proven otherwise, not primarily an RBAC problem.
Use a CI-written pointer blob as the runtime contract between registration and orchestration:
https://<storage-account>.blob.core.windows.net/ml-globals/code-assets/training-code/latest.json
Example payload:
```json
{
  "assetName": "training-code",
  "version": "<actual-sdk-returned-version>",
  "workspaceName": "<workspace-name>",
  "resourceGroup": "<resource-group>",
  "subscriptionId": "<subscription-id>",
  "buildId": "<build-or-run-id>",
  "sourceBranch": "<branch>",
  "sourceVersion": "<source-version>",
  "registeredAtUtc": "<timestamp>"
}
```
ADF reads version from this blob and passes it as a parameter to training, scoring, model registration, or deployment pipelines. The payload may include provenance fields, but downstream jobs should depend only on fields that are deliberately part of the contract.
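One way a consumer might enforce that narrow contract is sketched below. It assumes that only `assetName` and `version` are contract fields, which is an illustrative choice rather than something the payload itself declares; the function name `read_contract_version` is hypothetical.

```python
import json

# Assumption: only these fields are part of the runtime contract. The
# remaining payload fields are treated as provenance and ignored here.
CONTRACT_FIELDS = ("assetName", "version")


def read_contract_version(pointer_text: str, expected_asset: str) -> str:
    """Return the version from a pointer blob, or raise if the contract is broken."""
    payload = json.loads(pointer_text)
    if not isinstance(payload, dict):
        raise ValueError("pointer blob must be a JSON object")
    missing = [field for field in CONTRACT_FIELDS if not payload.get(field)]
    if missing:
        raise ValueError(f"pointer blob is missing contract fields: {missing}")
    if payload["assetName"] != expected_asset:
        raise ValueError(
            f"pointer is for {payload['assetName']!r}, expected {expected_asset!r}"
        )
    return str(payload["version"])
```

Because provenance fields are never read, they can be added or renamed in CI without breaking consumers.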
Read the pointer blob with managed identity authentication against Azure Storage:
```json
{
  "name": "ReadLatestTrainingCodeVersion",
  "type": "WebActivity",
  "typeProperties": {
    "method": "GET",
    "url": {
      "type": "Expression",
      "value": "@concat('https://', pipeline().globalParameters.StorageAccountName, '.blob.core.windows.net/ml-globals/code-assets/training-code/latest.json')"
    },
    "headers": {
      "x-ms-version": "2023-11-03",
      "Accept": "application/json"
    },
    "authentication": {
      "type": "MSI",
      "resource": "https://storage.azure.com/"
    },
    "connectVia": {
      "referenceName": "<managed-vnet-ir-name>",
      "type": "IntegrationRuntimeReference"
    }
  }
}
```
Critical placement rule: for ADF WebActivity, connectVia belongs inside typeProperties. If it is placed at the activity root, it can be ignored, causing traffic to leave over the public internet and fail against storage accounts with defaultAction: Deny.
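A small pre-publish lint can catch the misplacement before it reaches a factory. The sketch below assumes the pipeline JSON uses the common `{"properties": {"activities": [...]}}` layout and inspects top-level activities only; the function name `misplaced_connect_via` is hypothetical. It is a static check and does not replace a runtime test through the integration runtime.

```python
import json
from typing import List


def misplaced_connect_via(pipeline_json: str) -> List[str]:
    """Return names of WebActivity entries with connectVia outside typeProperties.

    Activities nested inside containers such as ForEach are not walked
    in this sketch.
    """
    pipeline = json.loads(pipeline_json)
    activities = pipeline.get("properties", {}).get("activities", [])
    offenders = []
    for activity in activities:
        if activity.get("type") != "WebActivity":
            continue
        at_root = "connectVia" in activity
        in_type_properties = "connectVia" in activity.get("typeProperties", {})
        if at_root and not in_type_properties:
            offenders.append(activity.get("name", "<unnamed>"))
    return offenders
```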
Required access commonly includes:
- Storage Blob Data Reader on the pointer container or account scope.
- Storage Blob Data Contributor to write pointer blobs.
- Storage Account Contributor when the pipeline manages storage firewall rules.

For storage accounts with private endpoints and defaultAction: Deny, Microsoft-hosted CI agents usually egress from public per-run IP addresses. Correct RBAC is not enough if the agent cannot reach the storage data plane. Before blaming the Azure ML SDK, ADF, or IAM, prove storage reachability from the agent.
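Safe CI pattern:

1. Detect the agent's public egress IP.
2. Add a temporary storage network rule for that IP.
3. Wait for the rule to propagate.
4. Smoke-test the storage data plane before running registration.
5. Remove the rule in an `always()` cleanup step.

```powershell
# Detect the public IP this agent egresses from.
$agentIp = (Invoke-RestMethod -Uri 'https://api.ipify.org' -TimeoutSec 20).Trim()

# Temporarily allow that IP through the storage firewall.
az storage account network-rule add `
  --resource-group $rg `
  --account-name $storageAccount `
  --action Allow `
  --ip-address $agentIp `
  --only-show-errors

# Give the rule time to propagate before testing.
Start-Sleep -Seconds 30

# Data-plane smoke test: fails if the agent still cannot reach storage.
az storage container list `
  --account-name $storageAccount `
  --auth-mode login `
  --only-show-errors `
  -o none
```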
Cleanup should run even when registration fails. In Azure DevOps YAML, put network-rule removal in a step with condition: always().
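Where the rule is managed from a Python step instead of separate YAML steps, the same guarantee can be sketched with a context manager. This assumes the Azure CLI is installed and logged in on the agent; `temporary_ip_rule` and `run_az` are hypothetical names, and the `run` parameter exists only so the CLI call can be swapped out in tests.

```python
import contextlib
import shutil
import subprocess
from typing import Callable, Iterator, Sequence

Runner = Callable[[Sequence[str]], None]


def run_az(args: Sequence[str]) -> None:
    """Invoke the Azure CLI and raise on a non-zero exit code."""
    # shutil.which resolves az.cmd on Windows agents.
    az_path = shutil.which("az") or "az"
    subprocess.run([az_path, *args], check=True)


@contextlib.contextmanager
def temporary_ip_rule(
    resource_group: str, account: str, ip: str, run: Runner = run_az
) -> Iterator[None]:
    """Add a storage firewall rule for ip and always remove it afterwards."""
    target = [
        "--resource-group", resource_group,
        "--account-name", account,
        "--ip-address", ip,
    ]
    run(["storage", "account", "network-rule", "add", "--action", "Allow", *target])
    try:
        yield
    finally:
        # Runs even when the body raises, e.g. when registration fails.
        run(["storage", "account", "network-rule", "remove", *target])
```

Registration would then run inside `with temporary_ip_rule(rg, account, agent_ip):`. A hard agent kill can still skip the `finally` block, so a pipeline-level `condition: always()` step remains a useful backstop.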
Some azure-ai-ml versions import private marshmallow symbols that are unavailable in marshmallow 4.x. Hosted agents can install an incompatible transitive version and fail before any Azure ML API call runs. Pin the SDK and transitive dependency together when using affected versions.
```powershell
python -m pip install --upgrade `
  "azure-ai-ml==1.24.0" `
  "azure-identity==1.19.0" `
  "marshmallow>=3.18,<4.0"
```
If using a newer SDK, verify the dependency behavior in CI rather than removing the pin based on local success.
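A fail-fast check at the start of the CI job makes that verification explicit. The sketch below assumes the affected SDK versions need marshmallow below 4, as described above; `check_marshmallow` is a hypothetical helper, and the `installed` parameter exists so the logic can be exercised without the package present.

```python
from importlib import metadata
from typing import Optional


def check_marshmallow(installed: Optional[str] = None) -> str:
    """Return the installed marshmallow version, or raise if it is 4.x or newer."""
    if installed is None:
        # Raises PackageNotFoundError when marshmallow is not installed.
        installed = metadata.version("marshmallow")
    major = int(installed.split(".", 1)[0])
    if major >= 4:
        raise RuntimeError(
            f"marshmallow {installed} is installed; pin marshmallow<4 "
            "or verify that this azure-ai-ml version supports it"
        )
    return installed
```

Calling `check_marshmallow()` before importing `azure.ai.ml` turns an opaque import error into a message that names the dependency at fault.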
Confirm which ADF execution mode reads unpublished Git-branch state and which mode runs the published factory definition. Debug runs may exercise branch state, while scheduled and production runs typically execute the last published factory. Manual trigger behavior depends on how the factory is configured and invoked. Pick the mode that actually exercises the change being validated.
Accept runtime evidence, not structural plausibility. Validate:
- Registration returned `result.version` and CI propagated that exact value.

Insufficient validation includes: docs showing an endpoint exists, JSON parsing, a plausible ARM URL, a successful deployment template, a requested version printed in logs without checking `result.version`, or review comments without runtime evidence.
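- Does CI publish the SDK-returned `result.version`?
- Is `connectVia` inside `typeProperties` when private networking is required?
- Does the WebActivity MSI authentication use the resource `https://storage.azure.com/`?
- Does the identity that reads the pointer blob have `Storage Blob Data Reader`?
- Does the identity that writes the pointer blob have `Storage Blob Data Contributor`?

- Use `result.version` as the source of truth.
- Avoid `/codes/...` discovery from ADF without runtime testing.
- Place `connectVia` inside `typeProperties`.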