plugins/azure-master/skills/aks-automatic-2025/SKILL.md
Azure Kubernetes Service (AKS) Automatic mode and 2025 platform features. PROACTIVELY activate for: (1) AKS Automatic (managed Kubernetes, zero operational overhead), (2) Karpenter-based autoscaling on AKS, (3) NodePool CRD usage, (4) HPA, VPA, KEDA on AKS, (5) workload identity and Microsoft Entra integration, (6) AKS billing model (Automatic vs Standard), (7) AKS 2025 cluster defaults (RBAC, Azure CNI overlay, Cilium), (8) AKS upgrade and version management, (9) GitOps on AKS (Flux, ArgoCD), (10) AKS observability (Azure Monitor for containers, Managed Prometheus). Provides: AKS Automatic vs Standard comparison, Karpenter setup, workload-identity recipes, KEDA scaler patterns, and an end-to-end AKS Automatic deployment guide.
Complete knowledge base for Azure Kubernetes Service Automatic mode (GA October 2025).
AKS Automatic is a fully managed Kubernetes offering that reduces operational overhead by automating node provisioning, scaling, and upgrades and by applying best-practice defaults.
az aks create \
  --resource-group MyRG \
  --name MyAKSAutomatic \
  --sku automatic \
  --kubernetes-version 1.34 \
  --location eastus
# Collect the flags in an array so each group can carry a comment.
# A comment placed between backslash-continued lines would end the command early.
args=(
  --resource-group MyRG
  --name MyAKSAutomatic
  --location eastus
  --sku automatic
  --tier standard

  # Kubernetes version
  --kubernetes-version 1.34

  # Node auto-provisioning with Karpenter (default in automatic mode)
  --node-provisioning-mode Auto

  # Networking
  --network-plugin azure
  --network-plugin-mode overlay
  --network-dataplane cilium
  --service-cidr 10.0.0.0/16
  --dns-service-ip 10.0.0.10
  --load-balancer-sku standard

  # Use custom VNet (optional)
  --vnet-subnet-id "/subscriptions/<sub-id>/resourceGroups/MyRG/providers/Microsoft.Network/virtualNetworks/MyVNet/subnets/AKSSubnet"

  # Availability zones
  --zones 1 2 3

  # Authentication and authorization
  --enable-managed-identity
  --enable-aad
  --enable-azure-rbac
  --aad-admin-group-object-ids "<group-object-id>"

  # Auto-upgrade
  --auto-upgrade-channel stable
  --node-os-upgrade-channel NodeImage

  # Security
  --enable-defender
  --enable-workload-identity
  --enable-oidc-issuer

  # Monitoring
  --enable-addons monitoring
  --workspace-resource-id "/subscriptions/<sub-id>/resourceGroups/MyRG/providers/Microsoft.OperationalInsights/workspaces/MyWorkspace"

  # Tags
  --tags Environment=Production ManagedBy=AKSAutomatic
)

az aks create "${args[@]}"
az aks create \
  --resource-group MyRG \
  --name MyAKSAutomatic \
  --sku automatic \
  --enable-addons azure-policy \
  --kubernetes-version 1.34
AKS Automatic uses Karpenter for intelligent node provisioning. Customize node provisioning with AKSNodeClass and NodePool CRDs.
apiVersion: karpenter.azure.com/v1alpha1
kind: AKSNodeClass
metadata:
  name: default
spec:
  # OS Image - Ubuntu 24.04 for K8s 1.34+
  osImage:
    sku: Ubuntu
    version: "24.04"
  # VM Series
  vmSeries:
    - Standard_D
    - Standard_E
  # Max pods per node
  maxPodsPerNode: 110
  # Security
  securityProfile:
    sshAccess: Disabled
    securityType: Standard
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    metadata:
      # Node labels
      labels:
        workload-type: general
    spec:
      # Constraints
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: kubernetes.azure.com/agentpool
          operator: In
          values: ["general"]
      # Taints (optional)
      taints:
        - key: "dedicated"
          value: "general"
          effect: "NoSchedule"
      # NodeClass reference
      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: default
      # In karpenter.sh/v1, node expiry is set on the template spec
      expireAfter: 720h # 30 days
  # Limits
  limits:
    cpu: "1000"
    memory: 4000Gi
  # Disruption budget
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
    budgets:
      # A budget's duration is only valid together with a schedule, so it is omitted here
      - nodes: "10%"
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-workloads
spec:
  template:
    metadata:
      labels:
        workload-type: gpu
        gpu-type: nvidia-v100
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["Standard_NC6s_v3", "Standard_NC12s_v3", "Standard_NC24s_v3"]
      taints:
        - key: "nvidia.com/gpu"
          value: "true"
          effect: "NoSchedule"
      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: gpu-nodeclass
  limits:
    cpu: "200"
    memory: 800Gi
    nvidia.com/gpu: "16"
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 300s
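Because the `gpu-workloads` pool is tainted with `nvidia.com/gpu=true:NoSchedule`, a pod only lands there if it tolerates that taint. The sketch below prints a one-off Pod manifest that does so, selects the `workload-type: gpu` label, and requests one GPU. The function name `render_gpu_test_pod`, the pod name `gpu-smoke-test`, and the image argument are hypothetical. It also assumes the image ships `nvidia-smi` and that the cluster exposes the `nvidia.com/gpu` resource.

```shell
# Sketch: print a Pod manifest that targets the gpu-workloads NodePool.
# $1 = container image (hypothetical; supply a CUDA-capable image of your own).
render_gpu_test_pod() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  # Match the label the NodePool puts on its nodes
  nodeSelector:
    workload-type: gpu
  # Tolerate the taint the NodePool puts on its nodes
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: cuda
      image: $1
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}

# Usage (not executed here):
#   render_gpu_test_pod "<your-cuda-image>" | kubectl apply -f -
```

Piping the output to `kubectl apply -f -` should leave the pod pending until Karpenter provisions a matching node, which is a quick way to check that the pool's requirements and limits behave as intended.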
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 15
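To make the HPA targets concrete: for a single metric, the Kubernetes HPA controller computes `desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)` and then keeps the result within `minReplicas` and `maxReplicas`. The helper below is a rough sketch of that arithmetic only; `hpa_desired_replicas` is a hypothetical name, and it ignores the controller's tolerance band, the `behavior` rate limits, and the rule that the largest result wins when several metrics are configured.

```shell
# Sketch of the HPA replica formula for one metric, using integer arithmetic.
# Arguments: current replicas, current utilization (%), target utilization (%),
#            minReplicas, maxReplicas.
hpa_desired_replicas() {
  current=$1
  metric=$2
  target=$3
  min=$4
  max=$5

  # Integer ceiling of current * metric / target
  desired=$(( (current * metric + target - 1) / target ))

  # Clamp to the configured bounds
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi

  echo "$desired"
}

# Usage with the values from myapp-hpa (CPU target 70, min 2, max 50):
#   hpa_desired_replicas 4 90 70 2 50     # prints 6
#   hpa_desired_replicas 40 140 70 2 50   # prints 50 (capped by maxReplicas)
```

With 4 replicas at 90% CPU the raw result is 4 * 90 / 70 ≈ 5.14, which rounds up to 6. The `scaleUp` policies then decide how quickly the Deployment is allowed to reach that number.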
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Auto" # Auto, Recreate, Initial, Off
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
        controlledResources: ["cpu", "memory"]
        controlledValues: RequestsAndLimits
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: myapp-queue-scaler
spec:
  scaleTargetRef:
    name: myapp
  minReplicaCount: 0 # Scale to zero
  maxReplicaCount: 100
  pollingInterval: 30
  cooldownPeriod: 300
  triggers:
    # Azure Service Bus Queue
    - type: azure-servicebus
      metadata:
        queueName: myqueue
        namespace: myservicebus
        messageCount: "5"
      authenticationRef:
        name: azure-servicebus-auth
    # Azure Storage Queue
    - type: azure-queue
      metadata:
        queueName: myqueue
        queueLength: "10"
        accountName: mystorageaccount
      authenticationRef:
        name: azure-storage-auth
    # Prometheus metrics
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        metricName: http_requests_per_second
        threshold: "100"
        query: sum(rate(http_requests_total[2m]))
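The ScaledObject refers to `azure-servicebus-auth` and `azure-storage-auth`, which have to exist as KEDA `TriggerAuthentication` objects in the same namespace. One possible shape, sketched below, uses KEDA's `azure-workload` pod identity provider so no connection string is stored in the cluster. This is an assumption about how authentication is set up here, not something the guide prescribes. The function name `render_trigger_auth` is hypothetical, and the managed identity passed in would also need a federated credential for the KEDA operator's service account plus a role on the queue resource.

```shell
# Sketch: print a KEDA TriggerAuthentication that uses workload identity.
# $1 = TriggerAuthentication name (must match authenticationRef.name)
# $2 = client ID of the managed identity KEDA should use (assumption)
render_trigger_auth() {
  cat <<EOF
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: $1
spec:
  podIdentity:
    provider: azure-workload
    identityId: $2
EOF
}

# Usage (not executed here):
#   render_trigger_auth azure-servicebus-auth "$IDENTITY_CLIENT_ID" | kubectl apply -f -
#   render_trigger_auth azure-storage-auth "$IDENTITY_CLIENT_ID" | kubectl apply -f -
```

Rendering both objects from one function keeps the two names in step with the `authenticationRef` entries in the ScaledObject.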
# Workload identity is enabled by default in AKS Automatic
# Create managed identity
az identity create \
  --name myapp-identity \
  --resource-group MyRG
# Get identity details
export IDENTITY_CLIENT_ID=$(az identity show -g MyRG -n myapp-identity --query clientId -o tsv)
export IDENTITY_OBJECT_ID=$(az identity show -g MyRG -n myapp-identity --query principalId -o tsv)
# Assign role to identity (by object ID, so a freshly created identity needs no directory lookup)
az role assignment create \
  --assignee-object-id "$IDENTITY_OBJECT_ID" \
  --assignee-principal-type ServicePrincipal \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/MyRG/providers/Microsoft.Storage/storageAccounts/mystorage"
# Create federated credential
export AKS_OIDC_ISSUER=$(az aks show -g MyRG -n MyAKSAutomatic --query oidcIssuerProfile.issuerUrl -o tsv)
az identity federated-credential create \
  --name myapp-federated-credential \
  --identity-name myapp-identity \
  --resource-group MyRG \
  --issuer "$AKS_OIDC_ISSUER" \
  --subject system:serviceaccount:default:myapp-sa
# Service Account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: myapp-sa
  namespace: default
  annotations:
    azure.workload.identity/client-id: "<IDENTITY_CLIENT_ID>"
---
# Deployment using workload identity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
        azure.workload.identity/use: "true" # Enable workload identity
    spec:
      serviceAccountName: myapp-sa
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:latest
          # The workload identity webhook injects these variables and the token
          # volume automatically when the label above is set; they are listed
          # explicitly here for reference.
          env:
            - name: AZURE_CLIENT_ID
              value: "<IDENTITY_CLIENT_ID>"
            - name: AZURE_TENANT_ID
              value: "<TENANT_ID>"
            - name: AZURE_FEDERATED_TOKEN_FILE
              value: /var/run/secrets/azure/tokens/azure-identity-token
          volumeMounts:
            - name: azure-identity-token
              mountPath: /var/run/secrets/azure/tokens
              readOnly: true
      volumes:
        - name: azure-identity-token
          projected:
            sources:
              - serviceAccountToken:
                  path: azure-identity-token
                  expirationSeconds: 3600
                  audience: api://AzureADTokenExchange
# Already enabled with --enable-addons monitoring
# Query logs using Azure Monitor
# Get cluster logs
az monitor log-analytics query \
  --workspace "<workspace-id>" \
  --analytics-query "KubePodInventory | where ClusterName == 'MyAKSAutomatic' | take 10" \
  --output table
# Get Karpenter logs
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter
# Enable managed Prometheus
az aks update \
  --resource-group MyRG \
  --name MyAKSAutomatic \
  --enable-azure-monitor-metrics
# Access Grafana dashboards through Azure Portal
- key: karpenter.sh/capacity-type
  operator: In
  values: ["spot"]
disruption:
  # karpenter.sh/v1 renamed WhenUnderutilized to WhenEmptyOrUnderutilized
  consolidationPolicy: WhenEmptyOrUnderutilized
  consolidateAfter: 30s
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: myapp
AKS Automatic is a new cluster mode - in-place migration is not supported. Follow these steps:
✓ Use AKS Automatic for new production clusters
✓ Enable workload identity for pod authentication
✓ Configure custom NodePools for specific workload types
✓ Implement HPA, VPA, and KEDA for comprehensive scaling
✓ Use spot instances for batch and fault-tolerant workloads
✓ Enable Container Insights and Managed Prometheus
✓ Configure Pod Disruption Budgets for critical apps
✓ Use network policies for microsegmentation
✓ Enable Azure Policy add-on for compliance
✓ Implement GitOps with Flux or Argo CD
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter --tail=100
kubectl get nodepools
kubectl get nodeclaims
kubectl get events --field-selector involvedObject.kind=NodePool -A
# Check service account annotation
kubectl get sa myapp-sa -o yaml
# Check pod labels
kubectl get pod <pod-name> -o yaml | grep azure.workload.identity
# Check federated credential
az identity federated-credential show \
  --identity-name myapp-identity \
  --resource-group MyRG \
  --name myapp-federated-credential
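A common cause of workload identity failures is a federated credential whose subject does not exactly match `system:serviceaccount:<namespace>:<service-account>`. The sketch below builds the expected subject and compares it with a value you pass in. The helper names `expected_subject` and `subject_matches` are hypothetical, and reading the `subject` field through `--query subject` is shown only as commented usage.

```shell
# Sketch: check that a federated credential subject matches a service account.

# Print the subject string expected for a namespace ($1) and service account ($2).
expected_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Succeed (exit 0) only if the actual subject ($1) equals the expected one
# for namespace $2 and service account $3.
subject_matches() {
  [ "$1" = "$(expected_subject "$2" "$3")" ]
}

# Usage (not executed here):
#   actual=$(az identity federated-credential show \
#     --identity-name myapp-identity \
#     --resource-group MyRG \
#     --name myapp-federated-credential \
#     --query subject -o tsv)
#   if subject_matches "$actual" default myapp-sa; then
#     echo "subject OK"
#   else
#     echo "subject mismatch: $actual"
#   fi
```

The comparison is a plain string match, so a wrong namespace or a typo in the service account name shows up as a mismatch.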
AKS Automatic represents the future of managed Kubernetes on Azure - zero operational overhead with maximum automation!