skills/k8s-clusters/SKILL.md
---
name: k8s-clusters
description: Hypera Azure AKS infrastructure reference. Use when user mentions cluster names (cafehyna, loyalty, sonora, painelclientes), needs kubeconfig paths, asks about spot tolerations, cert-manager issuers, or resource definition policies. Critical: Hub cluster Azure name differs from developer name.
---

# Kubernetes Clusters Skill

## Critical: Hub Cluster Naming
| Context | Name |
|---------|------|
| Developer/Docs | cafehyna-hub |
| Azure CLI | aks-cafehyna-default |
Always use Azure name in az commands.
Format: developer-name → Azure: azure-name, RG: resource-group, Config: kubeconfig
Cafehyna

- cafehyna-dev → Azure: aks-cafehyna-dev-hlg, RG: RS_Hypera_Cafehyna_Dev, Config: aks-rg-hypera-cafehyna-dev-config, Spot: Yes
- cafehyna-hub → Azure: aks-cafehyna-default, RG: rs_hypera_cafehyna, Config: aks-rg-hypera-cafehyna-hub-config, Spot: No
- cafehyna-prd → Azure: aks-cafehyna-prd, RG: rs_hypera_cafehyna_prd, Config: aks-rg-hypera-cafehyna-prd-config, Spot: No

Loyalty

- loyalty-dev → Azure: Loyalty_AKS-QAS, RG: RS_Hypera_Loyalty_AKS_QAS, Config: aks-rg-hypera-loyalty-dev-config, Spot: Yes
- loyalty-prd → Azure: Loyalty_AKS-PRD, RG: RS_Hypera_Loyalty_AKS_PRD, Config: aks-rg-hypera-loyalty-prd-config, Spot: No

Sonora

- sonora-dev → Azure: AKS-Hypera-Sonora-Dev-Hlg, RG: rg-hypera-sonora-dev, Config: aks-rg-hypera-sonora-dev-config, Spot: Yes
- sonora-prd → Azure: AKS-Hypera-Sonora-Prod, RG: rg-hypera-sonora-prd, Config: aks-rg-hypera-sonora-prd-config, Spot: No

Painelclientes

- painelclientes-dev → Azure: akspainelclientedev, RG: rg-hypera-painelclientes-dev, Config: aks-rg-hypera-painelclientes-dev-config, Spot: Yes, Region: East US2
- painelclientes-prd → Azure: akspainelclientesprd, RG: rg-hypera-painelclientes-prd, Config: aks-rg-hypera-painelclientes-prd-config, Spot: No, Region: East US2

All kubeconfigs at ~/.kube/<config-name>.
Required toleration for ALL pods on spot clusters:
```yaml
tolerations:
  - key: kubernetes.azure.com/scalesetpriority
    operator: Equal
    value: "spot"
    effect: NoSchedule
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: agentpool
              operator: In
              values: ["cafedevspot"]  # cafehyna-dev: only use cafedevspot, NOT cafedev
```
Important: The cafedev nodepool has CriticalAddonsOnly taint and should NOT be used for workloads. Always use the spot nodepool (e.g., cafedevspot, pcdevspot).
Without this → pods stuck Pending. Use scripts/patch-tolerations.sh to fix.
| Resource | Requirement |
|----------|-------------|
| CPU requests | ✅ Required |
| CPU limits | ❌ Forbidden (causes throttling) |
| Memory requests | ✅ Required |
| Memory limits | ✅ Required, must equal requests |
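As an illustration of the policy table, a container spec that complies might look like the sketch below. The container name, image, and the specific quantities are hypothetical placeholders; only the shape (CPU request without a CPU limit, memory limit equal to memory request) comes from the policy.

```yaml
# Hypothetical container fragment; name, image, and sizes are placeholders.
containers:
  - name: example-app
    image: example.azurecr.io/example-app:1.0.0
    resources:
      requests:
        cpu: 100m        # CPU request: required
        memory: 256Mi    # memory request: required
      limits:
        memory: 256Mi    # memory limit: required, equal to the request
        # no cpu limit: forbidden by policy (causes throttling)
```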
| Environment | Issuer |
|-------------|--------|
| prd, hub | letsencrypt-prod-cloudflare |
| dev | letsencrypt-staging-cloudflare |
❌ Never use issuers without -cloudflare suffix.
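A minimal sketch of how an Ingress on a dev cluster might select the issuer from the table. It assumes the issuers are installed as cert-manager ClusterIssuers (hence the `cert-manager.io/cluster-issuer` annotation; a namespaced Issuer would use `cert-manager.io/issuer` instead). The resource name, hostname, and backend service are hypothetical.

```yaml
# Hypothetical Ingress for a dev cluster; names and hosts are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app
  annotations:
    # dev → staging issuer; prd and hub would use letsencrypt-prod-cloudflare
    cert-manager.io/cluster-issuer: letsencrypt-staging-cloudflare
spec:
  tls:
    - hosts:
        - example-app.dev.example.com
      secretName: example-app-tls
  rules:
    - host: example-app.dev.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-app
                port:
                  number: 80
```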
MANDATORY for ALL stateful workloads across ALL clusters:
| Access Mode | StorageClass | Use Case |
|-------------|--------------|----------|
| ReadWriteOnce (RWO) | managed-premium-zrs | Databases, caches, single-pod storage |
| ReadWriteMany (RWX) | azurefile-csi-premium | Shared storage, media files, multi-pod access |
Rules:
| Rule | Requirement |
|------|-------------|
| Helm values storageClass | ✅ MUST be explicitly set (never omit or use null) |
| storageClass: null or omitted | ❌ FORBIDDEN - causes zone affinity conflicts |
| Default StorageClass reliance | ❌ FORBIDDEN - not guaranteed across clusters |
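Why Zone-Redundant Storage (ZRS)?
A non-ZRS disk is pinned to a single zone, so a pod that is rescheduled to a node in another zone fails with a volume node affinity conflict. ZRS disks combined with the WaitForFirstConsumer binding mode avoid this.
This applies to ALL stateful workloads.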
Creating managed-premium-zrs StorageClass
Run on each cluster that doesn't have it:
```bash
# Quick check and create
.claude/skills/k8s-clusters/scripts/create-storageclass.sh <cluster-name>

# Or manually:
KUBECONFIG=~/.kube/<config> kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-zrs
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_ZRS
  kind: Managed
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
```
Example Helm values pattern:
```yaml
# For databases, caches (RWO)
persistence:
  storageClass: managed-premium-zrs  # NEVER omit this

# For shared/media storage (RWX)
persistence:
  storageClass: azurefile-csi-premium
  accessMode: ReadWriteMany
```
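For workloads that are not deployed through Helm, the same rule could be expressed directly on a PersistentVolumeClaim. This is a sketch only: the claim name and size are hypothetical, and the point is simply that `storageClassName` is set explicitly rather than left to the cluster default.

```yaml
# Hypothetical PVC; name and size are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: managed-premium-zrs  # explicit, never omitted
  resources:
    requests:
      storage: 10Gi
```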
MANDATORY for ALL Robusta deployments:
Robusta requires secrets from Azure Key Vault. The CSI Secret Store driver syncs these secrets to Kubernetes Secrets that the Robusta runner pod references.
Required Azure Key Vault Secrets (must exist in each cluster's Key Vault):
| Secret Name | Description | Required By |
|-------------|-------------|-------------|
| robusta-ms-teams-webhook | MS Teams incoming webhook URL | MS Teams sink |
| robusta-ui-token | Robusta SaaS UI authentication token | Robusta UI sink |
| robusta-signing-key | Signing key for Robusta authentication | globalConfig |
| robusta-account-id | Robusta account identifier | globalConfig |
| azure-openai-key | Azure OpenAI API key | HolmesGPT |
Create missing secrets (if any are missing, pod will fail with FailedMount):
```bash
# Check existing secrets in Key Vault
az keyvault secret list --vault-name <keyvault-name> --query "[?starts_with(name,'robusta') || starts_with(name,'azure-openai')].name" -o tsv

# Create missing secrets (get values from Hub KV or Robusta SaaS)
az keyvault secret set --vault-name <keyvault-name> --name robusta-ms-teams-webhook --value "<webhook-url>"
az keyvault secret set --vault-name <keyvault-name> --name robusta-ui-token --value "<ui-token>"
az keyvault secret set --vault-name <keyvault-name> --name robusta-signing-key --value "<signing-key>"
az keyvault secret set --vault-name <keyvault-name> --name robusta-account-id --value "<account-id>"
az keyvault secret set --vault-name <keyvault-name> --name azure-openai-key --value "<openai-key>"
```
Required SecretProviderClass (secretproviderclass.yaml in each cluster's robusta directory):
```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: robusta-secrets-kv
  namespace: monitoring
spec:
  provider: azure
  secretObjects:
    - data:
        - key: ms-teams-webhook
          objectName: robusta-ms-teams-webhook
        - key: robusta-ui-token
          objectName: robusta-ui-token
        - key: azure-openai-key
          objectName: azure-openai-key
        - key: robusta-signing-key
          objectName: robusta-signing-key
        - key: robusta-account-id
          objectName: robusta-account-id
      secretName: robusta-secrets
      type: Opaque
  parameters:
    usePodIdentity: "false"
    useVMManagedIdentity: "true"
    userAssignedIdentityID: "<cluster-managed-identity>"  # From cluster lookup
    keyvaultName: "<cluster-keyvault>"  # From cluster lookup
    tenantId: "3f7a3df4-f85b-4ca8-98d0-08b1034e6567"
    objects: |
      array:
        - |
          objectName: robusta-ms-teams-webhook
          objectType: secret
        - |
          objectName: robusta-ui-token
          objectType: secret
        - |
          objectName: azure-openai-key
          objectType: secret
        - |
          objectName: robusta-signing-key
          objectType: secret
        - |
          objectName: robusta-account-id
          objectType: secret
```
Required Helm values (in values.yaml under runner: section):
```yaml
runner:
  # CSI volume mount to trigger robusta-secrets creation from Azure Key Vault
  extraVolumes:
    - name: robusta-secrets-store
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: robusta-secrets-kv
  extraVolumeMounts:
    - name: robusta-secrets-store
      mountPath: /mnt/secrets-store/robusta
      readOnly: true
```
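How it works:
Mounting the CSI volume in the runner pod triggers the CSI Secret Store driver, which reads the secrets from Azure Key Vault and syncs them to the Kubernetes Secret robusta-secrets in the monitoring namespace.

Common issues: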
| Symptom | Cause | Fix |
|---------|-------|-----|
| FailedMount with SecretNotFound | Secret missing in Key Vault | Create the missing secret with az keyvault secret set |
| Pod stuck ContainerCreating | SecretProviderClass name mismatch | Ensure secretProviderClass: robusta-secrets-kv matches metadata.name |
| Secret not created | Missing extraVolumes/extraVolumeMounts | Add CSI volume configuration to values.yaml |
| Auth error in pod events | Wrong Managed Identity ID | Check userAssignedIdentityID matches cluster's identity |
Reference: HolmesGPT Azure OpenAI Docs
HolmesGPT uses the LiteLLM API to support Azure OpenAI. Configuration is done via Helm values.
Required environment variables (in values.yaml under holmes: section):
```yaml
enableHolmesGPT: true
holmes:
  additionalEnvVars:
    - name: ROBUSTA_AI
      value: "true"
    - name: AZURE_API_KEY
      valueFrom:
        secretKeyRef:
          name: robusta-secrets
          key: azure-openai-key
    - name: MODEL
      value: "azure/<deployment-name>"  # e.g., azure/gpt-4o or azure/claude-sonnet-4-5
    - name: AZURE_API_VERSION
      value: "2024-12-01-preview"  # Use latest stable version
    - name: AZURE_API_BASE
      value: "https://<resource>.openai.azure.com/"  # Or AI Foundry endpoint
```
Advanced: Multiple models with modelList (2025 approach):
```yaml
holmes:
  additionalEnvVars:
    - name: AZURE_API_KEY
      valueFrom:
        secretKeyRef:
          name: robusta-secrets
          key: azure-openai-key
  modelList:
    azure-gpt-4o:
      api_key: "{{ env.AZURE_API_KEY }}"
      model: azure/gpt-4o
      api_base: https://your-resource.openai.azure.com/
      api_version: "2024-12-01-preview"
      temperature: 0
  config:
    model: "azure-gpt-4o"  # References key name in modelList
```
Important notes:

- MODEL value uses format azure/<deployment-name> (keep the azure/ prefix)

| Symptom | Fix |
|---------|-----|
| Pod Pending on dev | Add spot toleration + nodeAffinity to cafedevspot |
| Volume node affinity conflict | Set explicit storageClass: managed-premium-zrs, delete stuck PVC |
| PVC stuck Pending | 1) Check StorageClass exists 2) Run create-storageclass.sh 3) Delete and recreate PVC |
| StorageClass not found | Run scripts/create-storageclass.sh <cluster> |
| Certificate stuck | Change to *-cloudflare issuer |
| Connection timeout | Check VPN, run scripts/diagnose.sh |
| Auth failed | az login then re-get credentials |
| ArgoCD sync error: podReplacementPolicy: field not declared in schema | See ArgoCD SSA troubleshooting below |
| Robusta pod stuck ContainerCreating | Check SecretProviderClass name matches robusta-secrets-kv, add CSI volumes |
podReplacementPolicy / status.terminating Schema Error

Error message:

```text
ComparisonError: error calculating structured merge diff: error building typed value from live resource:
errors: .spec.podReplacementPolicy: field not declared in schema .status.terminating: field not declared in schema
```
Root Cause: ArgoCD issue #18778. Kubernetes 1.29+ Job resources have new fields (podReplacementPolicy, status.terminating) that ArgoCD's embedded schema doesn't recognize when using Server-Side Diff.
Important: ignoreDifferences does NOT work for this issue because the error occurs during schema validation before diff comparison.
Solution: Disable Server-Side Diff at the Application level:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application  # or ApplicationSet template
metadata:
  annotations:
    # Workaround for ArgoCD issue #18778
    argocd.argoproj.io/compare-options: ServerSideDiff=false
```
For ApplicationSets:
```yaml
spec:
  template:
    metadata:
      annotations:
        argocd.argoproj.io/compare-options: ServerSideDiff=false
```
Affected resources: Any application deploying Jobs, CronJobs, or Helm charts that create Jobs (e.g., DefectDojo initializer, database migrations).
```bash
# cafehyna clusters (hypera-pharma subscription)
az aks get-credentials --resource-group RS_Hypera_Cafehyna_Dev --name aks-cafehyna-dev-hlg --file ~/.kube/aks-rg-hypera-cafehyna-dev-config --overwrite-existing
az aks get-credentials --resource-group rs_hypera_cafehyna --name aks-cafehyna-default --file ~/.kube/aks-rg-hypera-cafehyna-hub-config --overwrite-existing
az aks get-credentials --resource-group rs_hypera_cafehyna_prd --name aks-cafehyna-prd --file ~/.kube/aks-rg-hypera-cafehyna-prd-config --overwrite-existing

# painelclientes (requires subscription switch)
az account set --subscription "56bb103c-1075-4536-b6fc-abf6df80b15c"  # operation-dev
az aks get-credentials --resource-group rg-hypera-painelclientes-dev --name akspainelclientedev --file ~/.kube/aks-rg-hypera-painelclientes-dev-config --overwrite-existing
az account set --subscription "1e705d23-900f-471e-b18d-7e0eb94d8c7a"  # operation
az aks get-credentials --resource-group rg-hypera-painelclientes-prd --name akspainelclientesprd --file ~/.kube/aks-rg-hypera-painelclientes-prd-config --overwrite-existing
```

```bash
# Create the managed-premium-zrs StorageClass
KUBECONFIG=~/.kube/<config> kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-zrs
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_ZRS
  kind: Managed
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
```

```bash
# Check Azure login
az account show

# Test DNS resolution (for private clusters)
nslookup <api-server-fqdn>

# Test connectivity
nc -zv <api-server-fqdn> 443

# Check RBAC
kubectl --kubeconfig ~/.kube/<config> auth can-i --list
```
For API endpoints, Key Vaults, nodepool details, and extended troubleshooting:
- cafehyna-hub (docs) vs aks-cafehyna-default (Azure). az aks commands using the developer name fail with ResourceNotFound, so always use the Azure column from the lookup table.
- The cafedev nodepool has the CriticalAddonsOnly taint and is invisible to workloads even with the spot toleration. Pin nodeAffinity to cafedevspot explicitly; without it pods land on the wrong pool or stay Pending forever.
- Check rendered manifests with helm template: resources.limits.cpu sneaks in via subcharts, and CPU limits are forbidden.
- storageClass: null triggers WaitForFirstConsumer with the default class, which is often non-ZRS. The PVC binds to a zone-pinned disk and the pod cannot reschedule cross-zone. Symptom: volume node affinity conflict after a node drain. Always set storageClass: managed-premium-zrs explicitly.
- An issuer without the -cloudflare suffix waits on the HTTP-01 challenge forever because ingress is internal. The Certificate stays Pending with no obvious error in the Certificate object; check the Challenge resource.
- A missing Key Vault secret leaves the Robusta pod stuck in ContainerCreating with FailedMount, and the only clue is in pod events. kubectl describe pod is mandatory; logs show nothing.
- az account set is required before az aks get-credentials for painelclientes: dev and prd live in different subscriptions. Using the wrong subscription silently writes a credential that can't reach the cluster.