Azure Well-Architected Framework (WAF) for cloud architecture review. PROACTIVELY activate for: (1) Azure architecture review or design, (2) Reliability pillar (availability zones, geo-replication, backup/restore, RPO/RTO), (3) Security pillar (Zero Trust, encryption at rest/in transit, identity, network segmentation), (4) Cost Optimization pillar (rightsizing, reserved instances, savings plans, FinOps), (5) Operational Excellence pillar (IaC, observability, automation), (6) Performance Efficiency pillar (caching, autoscaling, async patterns), (7) Sustainability pillar, (8) WAF Reviews via the WAF Assessment Tool, (9) Microsoft Cloud Adoption Framework (CAF) alignment. Provides: pillar-by-pillar checklist, WAF assessment workflow, common antipatterns by pillar, and Azure Advisor mapping.
The Azure Well-Architected Framework is a set of guiding tenets for building high-quality cloud solutions. It consists of five pillars of architectural excellence.
Purpose: Help architects and engineers build secure, high-performing, resilient, and efficient infrastructure for applications.
The Five Pillars: Reliability, Security, Cost Optimization, Operational Excellence, and Performance Efficiency.
Reliability Pillar:
Definition: The ability of a system to recover from failures and continue to function.
Key Principles:
Best Practices:
Availability Zones:
# Deploy a VM into availability zone 1 (place additional VMs in zones 2 and 3 for zone redundancy)
az vm create \
--resource-group MyRG \
--name MyVM \
--zone 1 \
--image Ubuntu2204 \
--size Standard_D2s_v3
# Availability SLAs:
# - Single VM (Premium SSD for all disks): 99.9%
# - Availability Set (2+ VMs): 99.95%
# - Availability Zones (2+ VMs across 2+ zones): 99.99%
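To make these percentages concrete, the sketch below converts an SLA into the maximum downtime it permits in a month. It assumes a 30-day month (43,200 minutes). The official SLA terms define the measurement period precisely, and `sla_downtime_minutes` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: maximum downtime (minutes) an SLA allows over a
# 30-day month (30 * 24 * 60 = 43,200 minutes). Usage: sla_downtime_minutes 99.9
sla_downtime_minutes() {
  awk -v sla="$1" 'BEGIN { printf "%.2f\n", (100 - sla) / 100 * 30 * 24 * 60 }'
}

for sla in 99.9 99.95 99.99; do
  echo "$sla% -> $(sla_downtime_minutes "$sla") min/month"
done
```

At 99.9% this works out to about 43 minutes per month. At 99.99% it is about 4.3 minutes.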
Backup and Disaster Recovery:
# Enable Azure Backup (MyVault must be an existing Recovery Services vault)
az backup protection enable-for-vm \
--resource-group MyRG \
--vault-name MyVault \
--vm MyVM \
--policy-name DefaultPolicy
# Recovery Point Objective (RPO): maximum acceptable data loss, measured as time since the last recoverable point
# Recovery Time Objective (RTO): maximum acceptable time to restore service after an outage
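The following sketch offers a quick sanity check of a backup policy against an RPO target. With periodic backups, the worst-case data loss is roughly the interval between recovery points, so that interval must not exceed the RPO. The helper `meets_rpo` is hypothetical, and it ignores backup duration and any replication in place.

```shell
#!/bin/sh
# Hypothetical check: worst-case data loss is roughly the backup interval,
# so the interval (hours) must be <= the RPO target (hours).
meets_rpo() {
  backup_interval_hours=$1
  rpo_hours=$2
  if [ "$backup_interval_hours" -le "$rpo_hours" ]; then
    echo "OK: ${backup_interval_hours}h backups satisfy a ${rpo_hours}h RPO"
    return 0
  fi
  echo "GAP: ${backup_interval_hours}h backups exceed a ${rpo_hours}h RPO"
  return 1
}

meets_rpo 24 24                                        # daily backups, 24h RPO
meets_rpo 24 4 || echo "consider more frequent backups or replication"
```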
Health Probes:
Security Pillar:
Definition: Protecting applications and data from threats.
Key Principles:
Best Practices:
Identity and Access:
# Use managed identities (no credentials in code)
az vm identity assign \
--resource-group MyRG \
--name MyVM
# RBAC assignment: grant the narrowest role at the narrowest scope that works (Contributor is broad)
az role assignment create \
--assignee <principal-id> \
--role "Contributor" \
--scope /subscriptions/<subscription-id>/resourceGroups/MyRG
Network Security:
Data Protection:
# Encryption at rest is on by default for most services (e.g., Azure Storage service-side encryption)
# Enforce TLS 1.2+ for data in transit
# Azure Storage: require a TLS 1.2 minimum and HTTPS-only traffic
az storage account update \
--name mystorageaccount \
--resource-group MyRG \
--min-tls-version TLS1_2 \
--https-only true
Security Monitoring:
# Enable Microsoft Defender for Cloud (Defender for Servers plan on virtual machines)
az security pricing create \
--name VirtualMachines \
--tier Standard
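# Enable Microsoft Sentinel (formerly Azure Sentinel); requires the CLI extension: az extension add --name sentinel
az sentinel onboarding-state create \
--resource-group MyRG \
--workspace-name MyWorkspace \
--name default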
Cost Optimization Pillar:
Definition: Managing costs to maximize the value delivered.
Key Principles:
Best Practices:
Right-Sizing:
# Use Azure Advisor recommendations
az advisor recommendation list \
--category Cost \
--output table
# Common optimizations:
# 1. Shut down (deallocate) dev/test VMs when not in use
# 2. Use Azure Hybrid Benefit for Windows/SQL
# 3. Purchase reservations for consistent workloads
# 4. Use autoscaling to match demand
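For item 1, the sketch below gives a rough estimate of how many compute hours a working-hours schedule avoids compared with running 24x7 (168 hours per week). It rests on two assumptions. First, compute is billed only while the VM is running, not when it is stopped (deallocated). Second, disks and public IPs keep billing either way. `hours_saved_pct` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical estimate: percentage of weekly compute hours avoided by running
# a VM only hours_per_day x days_per_week instead of 24x7 (168 h/week).
hours_saved_pct() {
  hours_per_day=$1
  days_per_week=$2
  awk -v h="$hours_per_day" -v d="$days_per_week" \
    'BEGIN { printf "%.1f\n", (1 - (h * d) / 168) * 100 }'
}

echo "12h x 5 days: $(hours_saved_pct 12 5)% fewer compute hours"
```

A 12-hour, 5-day schedule avoids about 64% of compute hours. The shutdown itself can be scheduled, for example with `az vm auto-shutdown --resource-group MyRG --name MyVM --time 1900` (time in UTC).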
Reserved Instances:
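A reservation is a commitment: you pay for the full term whether or not the VM runs. A useful check is the break-even utilization, which is the reserved effective hourly rate divided by the pay-as-you-go rate. Above that utilization, the reservation is cheaper. The rates below are hypothetical placeholders, not Azure list prices, and `reservation_breakeven_pct` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: utilization (% of hours) above which a reservation
# beats pay-as-you-go. Both inputs are hourly rates in the same currency.
reservation_breakeven_pct() {
  payg_hourly=$1
  reserved_hourly=$2
  awk -v p="$payg_hourly" -v r="$reserved_hourly" \
    'BEGIN { printf "%.1f\n", r / p * 100 }'
}

# Placeholder rates: 0.10/h pay-as-you-go vs 0.06/h effective reserved
echo "Break-even utilization: $(reservation_breakeven_pct 0.10 0.06)%"
```

With these placeholder rates, the reservation pays off if the VM runs more than 60% of the time. This is why reservations suit steady, always-on workloads.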
Azure Hybrid Benefit:
# Apply Azure Hybrid Benefit using an existing Windows Server license
az vm update \
--resource-group MyRG \
--name MyVM \
--license-type Windows_Server
# SQL Server Hybrid Benefit (registers the VM with the SQL IaaS Agent extension using AHUB licensing)
az sql vm create \
--resource-group MyRG \
--name MySQLVM \
--license-type AHUB
Cost Management:
# Create budget
az consumption budget create \
--budget-name MyBudget \
--category cost \
--amount 1000 \
--time-grain monthly \
--start-date 2025-01-01 \
--end-date 2025-12-31
# Set up alerts at 80%, 100%, 120% of budget
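The alert thresholds above can be reasoned about with a small helper that reports which percentages of the budget current spend has reached. It is a hypothetical illustration of the threshold logic, not the Cost Management notification engine, and `crossed_thresholds` is an assumed name.

```shell
#!/bin/sh
# Hypothetical helper: print each threshold (as % of budget) that spend has reached.
# Usage: crossed_thresholds <budget> <spend> [threshold_pct ...]
crossed_thresholds() {
  budget=$1
  spend=$2
  shift 2
  for pct in "$@"; do
    if awk -v b="$budget" -v s="$spend" -v p="$pct" \
         'BEGIN { exit !(s >= b * p / 100) }'; then
      echo "$pct"
    fi
  done
}

crossed_thresholds 1000 950 80 100 120   # prints 80
```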
Operational Excellence Pillar:
Definition: Operations processes that keep a system running in production.
Key Principles:
Best Practices:
Infrastructure as Code:
# Use ARM, Bicep, or Terraform
# Version control all infrastructure
# Implement CI/CD for infrastructure
# Example: Bicep deployment
az deployment group create \
--resource-group MyRG \
--template-file main.bicep \
--parameters @parameters.json
Monitoring and Alerting:
# Application Insights for apps
az monitor app-insights component create \
--app MyApp \
--location eastus \
--resource-group MyRG
# Log Analytics for infrastructure
az monitor log-analytics workspace create \
--resource-group MyRG \
--workspace-name MyWorkspace
# Create alerts
az monitor metrics alert create \
--name HighCPU \
--resource-group MyRG \
--scopes <vm-id> \
--condition "avg Percentage CPU > 80" \
--description "CPU usage is above 80%"
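The sketch below illustrates the condition `avg Percentage CPU > 80`. The alert compares the average of the metric samples in the evaluation window with the threshold, so a short spike may not fire it. The window size and frequency are set on the rule (`--window-size` and `--evaluation-frequency` in `az monitor metrics alert create`). The helper `should_fire` and the sample values are hypothetical.

```shell
#!/bin/sh
# Hypothetical illustration: fire when the average of the samples in the
# window exceeds the threshold. Usage: should_fire <threshold> <sample> ...
should_fire() {
  threshold=$1
  shift
  printf '%s\n' "$@" | awk -v t="$threshold" '
    { sum += $1; n++ }
    END { exit !(n > 0 && sum / n > t) }'
}

if should_fire 80 75 85 90 88; then echo "alert fires"; fi
```

Here the window averages 84.5%, so the alert fires. Samples of 70 and 85 average 77.5% and would not fire it.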
DevOps Practices:
Performance Efficiency Pillar:
Definition: The ability of a system to adapt to changes in load.
Key Principles:
Best Practices:
Scaling:
# Horizontal scaling (preferred)
# VM Scale Sets
az vmss create \
--resource-group MyRG \
--name MyVMSS \
--image Ubuntu2204 \
--instance-count 3 \
--vm-sku Standard_D2s_v3
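# Autoscaling (--count is the default instance count; the profile needs rules to scale)
az monitor autoscale create \
--resource-group MyRG \
--resource MyVMSS \
--resource-type Microsoft.Compute/virtualMachineScaleSets \
--name MyAutoscale \
--min-count 2 \
--max-count 10 \
--count 3
# Scale out by 1 instance when average CPU exceeds 70% over 5 minutes
az monitor autoscale rule create \
--resource-group MyRG \
--autoscale-name MyAutoscale \
--condition "Percentage CPU > 70 avg 5m" \
--scale out 1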
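Choosing sensible min/max counts starts with a capacity estimate. The sketch below divides peak load by per-instance capacity, rounds up, and clamps the result to the autoscale bounds (2 and 10 above). It assumes you have measured how much load one instance sustains. The 400 requests/second figure and the `instances_needed` helper are hypothetical.

```shell
#!/bin/sh
# Hypothetical capacity estimate: instances needed for a given load,
# clamped to the autoscale min/max.
# Usage: instances_needed <load> <capacity_per_instance> <min> <max>
instances_needed() {
  load=$1; per_instance=$2; min=$3; max=$4
  n=$(( (load + per_instance - 1) / per_instance ))   # integer ceiling division
  [ "$n" -lt "$min" ] && n=$min
  [ "$n" -gt "$max" ] && n=$max
  echo "$n"
}

instances_needed 1700 400 2 10   # -> 5
```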
Caching:
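One common caching approach is the cache-aside pattern. The application reads from the cache first; on a miss, it loads from the data store and populates the cache. The sketch below uses a temporary directory as a stand-in for a real cache such as Azure Cache for Redis. `load_from_source` and `cache_get` are hypothetical, and keys are assumed to be safe file names.

```shell
#!/bin/sh
# Minimal cache-aside sketch; a temp directory stands in for a real cache.
CACHE_DIR=$(mktemp -d)
trap 'rm -rf "$CACHE_DIR"' EXIT

load_from_source() {
  # Hypothetical placeholder for a slow database or API call
  echo "value-for-$1"
}

cache_get() {
  key=$1
  if [ -f "$CACHE_DIR/$key" ]; then
    echo "hit" >&2
  else
    echo "miss" >&2
    load_from_source "$key" > "$CACHE_DIR/$key"   # populate the cache on a miss
  fi
  cat "$CACHE_DIR/$key"
}

cache_get user42   # miss: loaded from source, then cached
cache_get user42   # hit: served from the cache
```

A production cache would also set expirations (TTLs) and invalidate entries when the source data changes.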
Data Access:
Networking:
# Use Azure Front Door for global apps
az afd profile create \
--profile-name MyFrontDoor \
--resource-group MyRG \
--sku Premium_AzureFrontDoor
# Features:
# - Global load balancing
# - CDN capabilities
# - Web Application Firewall
# - SSL offloading
# - Caching
Azure Well-Architected Review:
# Self-assessment available on Microsoft Learn (Assessments)
# Generates recommendations per pillar
# Provides actionable guidance
Azure Advisor:
# Get recommendations
az advisor recommendation list --output table
# Categories:
# - Reliability (High Availability)
# - Security
# - Performance
# - Cost
# - Operational Excellence
Reliability:
Security:
Cost Optimization:
Operational Excellence:
Performance Efficiency:
Highly Available Web Application:
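A highly available web application usually chains several services. The composite SLA of components in series is the product of their individual SLAs, so it is lower than any single one. The 99.95% and 99.99% inputs below are illustrative; check the current SLA for each service. `composite_sla` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: composite SLA (%) of components in series,
# where every component must be up. Usage: composite_sla 99.95 99.99 ...
composite_sla() {
  awk 'BEGIN {
    a = 1
    for (i = 1; i < ARGC; i++) a *= ARGV[i] / 100
    printf "%.2f\n", a * 100
  }' "$@"
}

composite_sla 99.95 99.99   # web tier + database -> 99.94
```

Every added dependency on the critical path lowers the composite figure, which is why redundancy matters there.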
Mission-Critical Application:
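Mission-critical designs often run active-active across multiple regions. Two optimistic assumptions apply here: regions fail independently, and failover is instantaneous. Under those assumptions, the workload is unavailable only when every region is down, so availability is 1 - (1 - a)^n for n regions at availability a. `parallel_sla` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: availability (%) of n redundant regions in parallel,
# assuming independent failures and instant failover.
# Usage: parallel_sla <region_sla_pct> <n>
parallel_sla() {
  awk -v a="$1" -v n="$2" 'BEGIN {
    down = 1
    for (i = 0; i < n; i++) down *= (1 - a / 100)   # all regions down at once
    printf "%.4f\n", (1 - down) * 100
  }'
}

parallel_sla 99.9 2   # two regions at 99.9% each -> 99.9999
```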
Cost-Optimized Dev/Test:
The Well-Architected Framework provides a consistent approach to evaluating architectures and implementing designs that scale over time.