.github/skills/cloud-solution-architect/SKILL.md
Transform the agent into a Cloud Solution Architect following Azure Architecture Center best practices. Use when designing cloud architectures, reviewing system designs, selecting architecture styles, applying cloud design patterns, making technology choices, or conducting Well-Architected Framework reviews.
`npx skillsauth add microsoft/skills cloud-solution-architect`

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
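Design well-architected, production-grade cloud systems following Azure Architecture Center best practices. This skill provides reference material on design principles, architecture styles, cloud design patterns, technology choices, best practices, performance antipatterns, mission-critical design, and the Well-Architected Framework.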

| # | Principle | Key Tactics |
|---|-----------|-------------|
| 1 | Design for self-healing | Retry with backoff, circuit breaker, bulkhead isolation, health endpoint monitoring, graceful degradation |
| 2 | Make all things redundant | Eliminate single points of failure, use availability zones, deploy multi-region, replicate data |
| 3 | Minimize coordination | Decouple services, use async messaging, embrace eventual consistency, use domain events |
| 4 | Design to scale out | Horizontal scaling, autoscaling rules, stateless services, avoid session stickiness, partition workloads |
| 5 | Partition around limits | Data partitioning (shard/hash/range), respect compute & network limits, use CDNs for static content |
| 6 | Design for operations | Structured logging, distributed tracing, metrics & dashboards, runbook automation, infrastructure as code |
| 7 | Use managed services | Prefer PaaS over IaaS, reduce operational burden, leverage built-in HA/DR/scaling |
| 8 | Use an identity service | Microsoft Entra ID, managed identity, RBAC, avoid storing credentials, zero-trust principles |
| 9 | Design for evolution | Loose coupling, versioned APIs, backward compatibility, async messaging for integration, feature flags |
| 10 | Build for business needs | Define SLAs/SLOs, establish RTO/RPO targets, domain-driven design, cost modeling, composite SLAs |


| Style | Description | When to Use | Key Services |
|-------|-------------|-------------|--------------|
| N-tier | Horizontal layers (presentation, business, data) | Traditional enterprise apps, lift-and-shift | App Service, SQL Database, VNets |
| Web-Queue-Worker | Web frontend → message queue → backend worker | Moderate-complexity apps with long-running tasks | App Service, Service Bus, Functions |
| Microservices | Small autonomous services, bounded contexts, independent deploy | Complex domains, independent team scaling | AKS, Container Apps, API Management |
| Event-driven | Pub/sub model, event producers/consumers | Real-time processing, IoT, reactive systems | Event Hubs, Event Grid, Functions |
| Big data | Batch + stream processing pipeline | Analytics, ML pipelines, large-scale data | Synapse, Data Factory, Databricks |
| Big compute | HPC, parallel processing | Simulations, modeling, rendering, genomics | Batch, CycleCloud, HPC VMs |

44 patterns organized by primary concern. WAF pillar mapping: R=Reliability, S=Security, CO=Cost Optimization, OE=Operational Excellence, PE=Performance Efficiency.

| Pattern | Summary | Pillars |
|---------|---------|---------|
| Asynchronous Request-Reply | Decouple request/response with polling or callbacks | R, PE |
| Claim Check | Split large messages; store payload separately, pass reference | R, PE |
| Choreography | Services coordinate via events without central orchestrator | R, OE |
| Competing Consumers | Multiple consumers process messages from shared queue concurrently | R, PE |
| Messaging Bridge | Connect incompatible messaging systems | R, OE |
| Pipes and Filters | Decompose complex processing into reusable filter stages | R, OE |
| Priority Queue | Prioritize requests so higher-priority work is processed first | R, PE |
| Publisher/Subscriber | Decouple senders from receivers via topics/subscriptions | R, PE |
| Queue-Based Load Leveling | Buffer requests with a queue to smooth intermittent loads | R, PE |
| Sequential Convoy | Process related messages in order while allowing parallel groups | R, PE |

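
As a rough illustration of the Competing Consumers pattern listed above, the following sketch uses an in-process `queue.Queue` and threads as stand-ins for a real broker such as Service Bus. The function name `run_competing_consumers` and its parameters are hypothetical and are not part of this skill or of any Azure SDK.

```python
import queue
import threading
from typing import Callable, Iterable, List, TypeVar

M = TypeVar("M")
R = TypeVar("R")


def run_competing_consumers(
    messages: Iterable[M],
    worker_count: int,
    handler: Callable[[M], R],
) -> List[R]:
    """Drain a shared queue with several workers (hypothetical helper)."""
    work: "queue.Queue[M]" = queue.Queue()
    for message in messages:
        work.put(message)

    results: List[R] = []
    results_lock = threading.Lock()

    def consume() -> None:
        # Each worker pulls from the same queue until it is empty, so
        # faster workers naturally take on more of the load.
        while True:
            try:
                message = work.get_nowait()
            except queue.Empty:
                return
            outcome = handler(message)
            with results_lock:
                results.append(outcome)

    workers = [threading.Thread(target=consume) for _ in range(worker_count)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results
```

Completion order is not deterministic, so callers that need ordering within a group of related messages would look at Sequential Convoy instead.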

| Pattern | Summary | Pillars |
|---------|---------|---------|
| Bulkhead | Isolate resources per workload to prevent cascading failure | R |
| Circuit Breaker | Stop calling a failing service; fail fast to protect resources | R |
| Compensating Transaction | Undo previously committed steps when a later step fails | R |
| Health Endpoint Monitoring | Expose health checks for load balancers and orchestrators | R, OE |
| Leader Election | Coordinate distributed instances by electing a leader | R |
| Retry | Handle transient faults by retrying with exponential backoff | R |
| Saga | Manage data consistency across microservices with compensating transactions | R |
| Scheduler Agent Supervisor | Coordinate distributed actions with retry and failure handling | R |

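
The Circuit Breaker row above can be made concrete with a minimal sketch. It assumes a simple consecutive-failure threshold and a fixed reset timeout; the class name, parameters, and the injectable `clock` are illustrative choices, not an API defined by this skill. Production code would more likely use a maintained resilience library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Minimal circuit breaker sketch (hypothetical names and defaults)."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # allow a trial call through
        return "open"

    def call(self, operation: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # Trip the breaker, or re-trip it after a failed trial call.
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each downstream dependency in its own breaker instance keeps one failing service from consuming the caller's threads and connections, which also supports the Bulkhead row.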

| Pattern | Summary | Pillars |
|---------|---------|---------|
| Cache-Aside | Load data on demand into cache from data store | PE |
| CQRS | Separate read and write models for independent scaling | PE, R |
| Event Sourcing | Store state as append-only sequence of domain events | R, OE |
| Index Table | Create indexes over frequently queried fields in data stores | PE |
| Materialized View | Pre-compute views over data for efficient queries | PE |
| Sharding | Distribute data across partitions for scale and performance | PE, R |
| Static Content Hosting | Serve static content from cloud storage/CDN directly | PE, CO |
| Valet Key | Grant clients limited direct access to storage resources | S, PE |

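
To show what "load data on demand into cache" means in the Cache-Aside row, here is a small sketch. An in-memory dictionary stands in for a distributed cache such as Redis, and `load` stands in for a data store query; the `CacheAside` class and its parameters are assumptions made for illustration only.

```python
import time
from typing import Callable, Dict, Tuple, TypeVar

V = TypeVar("V")


class CacheAside:
    """Cache-aside sketch with a per-entry TTL (hypothetical helper)."""

    def __init__(
        self,
        load: Callable[[str], V],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (expiry timestamp, cached value)
        self._entries: Dict[str, Tuple[float, V]] = {}

    def get(self, key: str) -> V:
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]  # cache hit
        # Cache miss or expired entry: read from the store, then populate.
        value = self._load(key)
        self._entries[key] = (self._clock() + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Call after writing to the store so readers do not see stale data."""
        self._entries.pop(key, None)
```

The application, not the cache, owns the read-through logic here, which is the defining trait of cache-aside; the TTL bounds how stale an entry can become if an invalidation is missed.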

| Pattern | Summary | Pillars |
|---------|---------|---------|
| Ambassador | Offload cross-cutting concerns to a helper sidecar proxy | OE |
| Anti-Corruption Layer | Translate between new and legacy system models | OE, R |
| Backends for Frontends | Create separate backends per frontend type (mobile, web, etc.) | OE, PE |
| Compute Resource Consolidation | Combine multiple workloads into fewer compute instances | CO |
| External Configuration Store | Externalize configuration from deployment packages | OE |
| Sidecar | Deploy helper components alongside the main service | OE |
| Strangler Fig | Incrementally migrate legacy systems by replacing pieces | OE, R |


| Pattern | Summary | Pillars |
|---------|---------|---------|
| Federated Identity | Delegate authentication to an external identity provider | S |
| Gatekeeper | Protect services using a dedicated broker that validates requests | S |
| Quarantine | Isolate and validate external assets before allowing use | S |
| Rate Limiting | Control consumption rate of resources by consumers | R, S |
| Throttling | Control resource consumption to sustain SLAs under load | R, PE |

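
One common way to implement the Rate Limiting row is a token bucket. The source does not prescribe an algorithm, so treat the following as one possible sketch; `TokenBucket`, `rate`, and `capacity` are illustrative names.

```python
import time
from typing import Callable


class TokenBucket:
    """Token-bucket rate limiter sketch (an assumed technique, not from the source)."""

    def __init__(
        self,
        rate: float,      # tokens added per second
        capacity: float,  # maximum burst size
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A caller would typically keep one bucket per consumer or tenant and respond with HTTP 429 when `allow()` returns `False`; that per-tenant keying also helps with the Noisy Neighbor antipattern.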

| Pattern | Summary | Pillars |
|---------|---------|---------|
| Deployment Stamps | Deploy multiple independent copies of application components | R, PE |
| Edge Workload Configuration | Configure workloads differently across diverse edge devices | OE |
| Gateway Aggregation | Aggregate multiple backend calls into a single client request | PE |
| Gateway Offloading | Offload shared functionality (SSL, auth) to a gateway | OE, S |
| Gateway Routing | Route requests to multiple backends using a single endpoint | OE |
| Geode | Deploy backends to multiple regions for active-active serving | R, PE |

See Design Patterns Reference for detailed implementation guidance.
For each technology area, evaluate: requirements → constraints → tradeoffs → select.

| Area | Key Options | Selection Criteria |
|------|-------------|-------------------|
| Compute | App Service, Functions, Container Apps, AKS, VMs, Batch | Hosting model, scaling, cost, team skills |
| Storage | Blob Storage, Data Lake, Files, Disks, Managed Lustre | Access patterns, throughput, cost tier |
| Data stores | SQL Database, Cosmos DB, PostgreSQL, Redis, Table Storage | Consistency model, query patterns, scale |
| Messaging | Service Bus, Event Hubs, Event Grid, Queue Storage | Ordering, throughput, pub/sub vs queue |
| Networking | Front Door, Application Gateway, Load Balancer, Traffic Manager | Global vs regional, L4 vs L7, WAF |
| AI services | Azure OpenAI, AI Search, AI Foundry, Document Intelligence | Model needs, data grounding, orchestration |
| Containers | Container Apps, AKS, Container Instances | Operational control vs simplicity |

See Technology Choices Reference for detailed decision trees.

| Practice | Key Guidance |
|----------|-------------|
| API design | RESTful conventions, resource-oriented URIs, HATEOAS, versioning via URL path or header |
| API implementation | Async operations, pagination, idempotent PUT/DELETE, content negotiation, ETag caching |
| Autoscaling | Scale on metrics (CPU, queue depth, custom), cool-down periods, predictive scaling, scale-in protection |
| Background jobs | Use queues or scheduled triggers, idempotent processing, poison message handling, graceful shutdown |
| Caching | Cache-aside pattern, TTL policies, cache invalidation strategies, distributed cache for multi-instance |
| CDN | Static asset offloading, cache-busting with versioned URLs, geo-distribution, HTTPS enforcement |
| Data partitioning | Horizontal (sharding), vertical, functional partitioning; partition key selection for even distribution |
| Partitioning strategies | Hash-based, range-based, directory-based; rebalancing approach, cross-partition query avoidance |
| Host name preservation | Preserve original host header through proxies/gateways for cookies, redirects, auth flows |
| Message encoding | Schema evolution (Avro/Protobuf), backward/forward compatibility, schema registry |
| Monitoring & diagnostics | Structured logging, distributed tracing (W3C Trace Context), metrics, alerts, dashboards |
| Transient fault handling | Retry with exponential backoff + jitter, circuit breaker, idempotency keys, timeout budgets |

See Best Practices Reference for implementation details.
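
The transient fault handling guidance above ("retry with exponential backoff + jitter") could be sketched as follows. This version uses "full jitter", where each wait is drawn uniformly between zero and the exponential cap; the `TransientError` type, function name, and default values are hypothetical, and real code would map SDK-specific exceptions or status codes to the retryable case.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Placeholder for faults worth retrying (timeouts, throttling responses)."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Retry `operation` on TransientError with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted; surface the failure
            # The cap doubles each attempt: base, 2*base, 4*base, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter spreads retries so clients do not retry in lockstep.
            sleep(rng() * cap)
    raise ValueError("max_attempts must be at least 1")
```

The operation being retried should be idempotent (or carry an idempotency key), since a retry may repeat a request that actually succeeded on the server.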
Avoid these common patterns that degrade performance under load:

| Antipattern | Problem | Fix |
|-------------|---------|-----|
| Busy Database | Offloading too much processing to the database | Move logic to application tier, use caching |
| Busy Front End | Resource-intensive work on frontend request threads | Offload to background workers/queues |
| Chatty I/O | Many small I/O requests instead of fewer large ones | Batch requests, use bulk APIs, buffer writes |
| Extraneous Fetching | Retrieving more data than needed | Project only required fields, paginate, filter server-side |
| Improper Instantiation | Recreating expensive objects per request | Use singletons, connection pooling, HttpClientFactory |
| Monolithic Persistence | Single data store for all data types | Polyglot persistence: the right store for each workload |
| No Caching | Repeatedly fetching unchanged data | Cache-aside pattern, CDN, output caching, Redis |
| Noisy Neighbor | One tenant consuming all shared resources | Bulkhead isolation, per-tenant quotas, throttling |
| Retry Storm | Aggressive retries overwhelming a recovering service | Exponential backoff + jitter, circuit breaker, retry budgets |
| Synchronous I/O | Blocking threads on I/O operations | Async/await, non-blocking I/O, reactive streams |

For workloads targeting 99.99%+ SLO, address these design areas:

| Design Area | Key Considerations |
|-------------|-------------------|
| Application platform | Multi-region active-active, availability zones, Container Apps or AKS with zone redundancy |
| Application design | Stateless services, idempotent operations, graceful degradation, bulkhead isolation |
| Networking | Azure Front Door (global LB), DDoS Protection, private endpoints, redundant connectivity |
| Data platform | Multi-region Cosmos DB, zone-redundant SQL, async replication, conflict resolution |
| Deployment & testing | Blue-green deployments, canary releases, chaos engineering, automated rollback |
| Health modeling | Composite health scores, dependency health tracking, automated remediation, SLI dashboards |
| Security | Zero-trust, managed identity everywhere, key rotation, WAF policies, threat modeling |
| Operational procedures | Automated runbooks, incident response playbooks, game days, postmortems |

See Mission-Critical Reference for detailed guidance.
Every architecture decision should be evaluated against all five pillars:

| Pillar | Focus | Key Questions |
|--------|-------|---------------|
| Reliability | Resiliency, availability, disaster recovery | What is the RTO/RPO? How does it handle failures? Is there redundancy? |
| Security | Threat protection, identity, data protection | Is identity managed? Is data encrypted? Are there network controls? |
| Cost Optimization | Cost management, efficiency, right-sizing | Is compute right-sized? Are there reserved instances? Is there waste? |
| Operational Excellence | Monitoring, deployment, automation | Is deployment automated? Is there observability? Are there runbooks? |
| Performance Efficiency | Scaling, load testing, performance targets | Can it scale horizontally? Are there performance baselines? Is caching used? |


| Optimizing for... | May impact... |
|-------------------|---------------|
| Reliability (redundancy) | Cost (more resources) |
| Security (isolation) | Performance (added latency) |
| Cost (consolidation) | Reliability (shared failure domains) |
| Performance (caching) | Cost (cache infrastructure), Reliability (stale data) |

When reviewing or designing a system, follow this structured approach:
Functional: What must the system do?
Non-functional:
- Availability target (e.g., 99.9%, 99.99%)
- Latency requirements (p50, p95, p99)
- Throughput (requests/sec, messages/sec)
- Data residency and compliance
- Recovery targets (RTO, RPO)
- Cost constraints
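
To make the availability targets in the list above tangible, the following sketch converts a target into a yearly downtime budget and estimates a composite figure for chained or redundant components. The function names are hypothetical, a 365-day year is assumed, and the redundancy formula assumes replicas fail independently, which real deployments only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Allowed downtime per year for a given availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


def serial_availability(*component_percents: float) -> float:
    """Composite availability when every component must be up (a dependency chain)."""
    combined = 1.0
    for percent in component_percents:
        combined *= percent / 100
    return combined * 100


def redundant_availability(percent: float, replicas: int) -> float:
    """Availability of N replicas where one healthy replica is enough.

    Assumes independent failures, which is an idealization.
    """
    return (1 - (1 - percent / 100) ** replicas) * 100
```

Under these assumptions, 99.9% allows about 525.6 minutes of downtime per year and 99.99% about 52.56 minutes. Chaining a 99.95% service with a 99.99% service yields roughly 99.94%, lower than either component alone, which is why composite SLAs matter when setting targets.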
Match requirements to architecture style using the selection criteria table above.
Use the technology choices decision framework. Prefer managed services (PaaS) over IaaS.
Select relevant patterns from the 44 cloud design patterns based on identified concerns.
Review each pillar systematically. Document tradeoffs explicitly.
Use Architecture Decision Records (ADRs):
# ADR-NNN: [Decision Title]
## Status: [Proposed | Accepted | Deprecated]
## Context
[What is the issue we're addressing?]
## Decision
[What did we decide and why?]
## Consequences
[What are the positive and negative impacts?]
Content derived from the Azure Architecture Center, Microsoft's official guidance for cloud solution architecture on Azure. It covers design principles, architecture styles, cloud design patterns, technology choices, best practices, performance antipatterns, mission-critical design, and the Well-Architected Framework.