skills/core/hpa-debug/SKILL.md
Diagnose HorizontalPodAutoscaler failures (not scaling, metrics unavailable, target mismatch). Checks HPA status, metrics-server health, and scaling events to identify why autoscaling is not working.
npx skillsauth add scitix/siclaw hpa-debug
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
When a HorizontalPodAutoscaler is not scaling as expected — stuck at min/max replicas, showing <unknown> metrics, or not responding to load — follow this flow to identify the root cause.
Scope: This skill is for diagnosis only. Once you identify the root cause, report it to the user and stop. Do NOT attempt to modify HPA settings, resource requests, or metrics configuration — that should be left to the user.
kubectl get hpa <hpa-name> -n <ns>
Note:
- The TARGETS column shows current/target values (e.g. 80%/50%). If it shows <unknown>/50%, metrics are unavailable.

kubectl describe hpa <hpa-name> -n <ns>
Focus on:
- Conditions: AbleToScale, ScalingActive, and ScalingLimited, with their status and reason

kubectl get pods -n kube-system -l k8s-app=metrics-server -o wide
If no pods are found, try:
kubectl get deployment -n kube-system metrics-server
Verify metrics-server is serving data:
kubectl top pods -n <ns>
If kubectl top returns error: Metrics API not available, metrics-server is not working.
HPA with CPU/memory percentage targets requires resources.requests to be set on the target containers.
kubectl get deployment <target-name> -n <ns> -o jsonpath='{range .spec.template.spec.containers[*]}{.name}: cpu={.resources.requests.cpu} memory={.resources.requests.memory}{"\n"}{end}'
If requests are not set, the HPA cannot calculate utilization percentages.
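The output of the query above can be scanned mechanically. The following is a minimal sketch, assuming the exact `name: cpu=... memory=...` line format produced by that jsonpath template; `missing_requests` is a hypothetical helper name, not part of kubectl.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of kubectl): reads lines shaped like
#   <container>: cpu=<value> memory=<value>
# on stdin and reports every container whose cpu or memory request is empty.
missing_requests() {
  local line name
  while IFS= read -r line || [ -n "$line" ]; do
    [ -z "$line" ] && continue
    name=${line%%:*}
    case "$line" in
      *"cpu= "*) echo "$name: cpu request not set" ;;
    esac
    case "$line" in
      *"memory=") echo "$name: memory request not set" ;;
    esac
  done
}

# Example (assumes cluster access; placeholders as in the command above):
#   kubectl get deployment <target-name> -n <ns> -o jsonpath='...' | missing_requests
```

Any line it prints names a container for which the HPA cannot compute a utilization percentage. The helper only reads text and changes nothing in the cluster, in keeping with the diagnosis-only scope.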
<unknown> metrics: Metrics not available
The HPA cannot fetch metrics for one or more targets.
Common causes:
- Missing resources.requests on the target pods (step 4)

Advise the user to check metrics-server health and ensure resource requests are set.
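As an illustration of how a single TARGETS value from `kubectl get hpa` can be read, here is a small sketch. It assumes one percentage metric in the `current/target` form (such as 80%/50%); `classify_targets` is a hypothetical name, and multi-metric TARGETS strings are only checked for `<unknown>`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classifies one TARGETS value copied from `kubectl get hpa`.
classify_targets() {
  local targets=$1 cur tgt v
  case "$targets" in
    *"<unknown>"*) echo "metrics-unavailable"; return ;;
  esac
  cur=${targets%%/*}   # text before the first slash
  tgt=${targets#*/}    # text after the first slash
  cur=${cur%\%}        # drop a trailing percent sign
  tgt=${tgt%\%}
  for v in "$cur" "$tgt"; do
    case "$v" in
      ''|*[!0-9]*) echo "unrecognized"; return ;;
    esac
  done
  if [ "$cur" -gt "$tgt" ]; then
    echo "above-target"
  else
    echo "at-or-below-target"
  fi
}
```

A result of `metrics-unavailable` corresponds to the case described in this section. The other results only tell you whether the HPA currently sees load above its target.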
ScalingActive: False (HPA cannot scale)
The HPA has been disabled or cannot function.
Check the reason in kubectl describe hpa:
- FailedGetResourceMetric: cannot fetch resource metrics (metrics-server issue)
- FailedGetExternalMetric: cannot fetch external metrics (adapter issue)
- InvalidMetricSourceType: the metric source type is not recognized

ScalingLimited: True (at min or max replicas)
The HPA wants to scale but is constrained by minReplicas or maxReplicas.
Check the reason:
- TooFewReplicas: HPA wants to scale down below minReplicas
- TooManyReplicas: HPA wants to scale up above maxReplicas
- DesiredWithinRange: current replicas are within bounds (normal)

If the HPA is stuck at maxReplicas and load is still high, advise the user to increase maxReplicas or investigate why the application needs so many replicas (possible performance issue).
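The condition reasons listed for ScalingActive and ScalingLimited can be turned into a quick lookup. This is a sketch only: `explain_condition_reason` is a hypothetical helper, and it covers just the reasons named in this skill.

```bash
#!/usr/bin/env bash
# Hypothetical helper: maps an HPA condition reason to the explanation
# given in this skill. Reasons not listed here fall through to the default.
explain_condition_reason() {
  case "$1" in
    FailedGetResourceMetric) echo "cannot fetch resource metrics (check metrics-server)" ;;
    FailedGetExternalMetric) echo "cannot fetch external metrics (check the metrics adapter)" ;;
    InvalidMetricSourceType) echo "metric source type is not recognized" ;;
    TooFewReplicas)          echo "desired count is below minReplicas" ;;
    TooManyReplicas)         echo "desired count is above maxReplicas" ;;
    DesiredWithinRange)      echo "replica count is within bounds (normal)" ;;
    *)                       echo "reason not covered by this skill: $1" ;;
  esac
}

# Example (assumes cluster access):
#   reason=$(kubectl get hpa <hpa-name> -n <ns> \
#     -o jsonpath='{.status.conditions[?(@.type=="ScalingLimited")].reason}')
#   explain_condition_reason "$reason"
```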
The HPA keeps oscillating between replica counts.
Check the events for rapid scale-up/scale-down cycles. Common causes:
The HPA has a default stabilization window (5 minutes for scale-down). Check if custom behavior is set:
kubectl get hpa <hpa-name> -n <ns> -o jsonpath='{.spec.behavior}'
Advise the user to adjust the stabilization window or the target utilization.
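To make "rapid scale-up/scale-down cycles" concrete, the sketch below counts direction reversals in a sequence of replica counts. It assumes you have already extracted the counts by hand (for example from the rescale events shown by `kubectl describe hpa`), one per line, oldest first; `count_reversals` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads replica counts on stdin (one per line, oldest
# first) and prints how many times the scaling direction flipped.
count_reversals() {
  awk '
    NR > 1 {
      dir = 0
      if ($1 > prev) dir = 1
      else if ($1 < prev) dir = -1
      if (dir != 0) {
        if (last != 0 && dir != last) n++   # direction changed since last move
        last = dir
      }
    }
    { prev = $1 }
    END { print n + 0 }
  '
}

# Example: printf '3\n6\n3\n6\n4\n' | count_reversals
```

Several reversals within a short period suggest flapping. A sequence that only grows or only shrinks yields zero.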
The HPA sees metrics below the target threshold, so it does not scale.
Verify the actual pod resource usage:
kubectl top pods -n <ns> -l <selector>
Compare with the HPA target. If actual usage is below the target even under load:
Only one HPA should target a given Deployment/StatefulSet. Multiple HPAs on the same target cause unpredictable behavior.
kubectl get hpa -n <ns>
Check if multiple HPAs reference the same scaleTargetRef. Advise the user to consolidate into a single HPA with multiple metrics.
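Spotting duplicate targets by eye gets error-prone in busy namespaces. A minimal sketch, assuming input lines of the form `<hpa-name> <kind>/<target-name>`; `duplicate_targets` is a hypothetical helper, and the jsonpath in the comment is one possible way to produce that input.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads "<hpa-name> <kind>/<target-name>" lines on stdin
# and prints each target referenced by more than one HPA, with the HPA names.
duplicate_targets() {
  awk '
    NF >= 2 {
      count[$2]++
      if (names[$2] == "") names[$2] = $1
      else names[$2] = names[$2] "," $1
    }
    END {
      for (t in count) if (count[t] > 1) print t ": " names[t]
    }
  ' | sort
}

# Example (assumes cluster access):
#   kubectl get hpa -n <ns> -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.scaleTargetRef.kind}/{.spec.scaleTargetRef.name}{"\n"}{end}' | duplicate_targets
```

Empty output means no target in the namespace is referenced by more than one HPA.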
- The HPA re-evaluates metrics on a fixed interval (15 seconds by default, set by --horizontal-pod-autoscaler-sync-period on the controller manager).
- With autoscaling/v2, multiple metrics can be specified. The HPA scales to the highest recommended replica count across all metrics.
- ... replicas field is managed externally.
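The "highest recommended replica count" rule can be worked through with the formula from the Kubernetes documentation, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), applied per metric. The sketch below is a simplification: it assumes positive integer values and ignores the controller's tolerance, the minReplicas/maxReplicas clamp, and stabilization. `desired_replicas` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: desired_replicas <currentReplicas> <current/target>...
# Computes ceil(currentReplicas * current / target) for each metric pair and
# prints the largest result. Assumes positive integers throughout.
desired_replicas() {
  local current=$1 max=0 pair metric target want
  shift
  for pair in "$@"; do
    metric=${pair%%/*}
    target=${pair#*/}
    # Integer ceiling division.
    want=$(( (current * metric + target - 1) / target ))
    if [ "$want" -gt "$max" ]; then
      max=$want
    fi
  done
  echo "$max"
}

# Example: 4 replicas, CPU at 80% against a 50% target, memory at 30% against 60%.
#   desired_replicas 4 80/50 30/60
```

In the example, CPU alone recommends 7 replicas and memory recommends 2, so the result is 7. This is why one hot metric can hold the replica count up even when the others are well under target.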