.claude/skills/karpenter-autoscaling/SKILL.md
Karpenter for intelligent Kubernetes node autoscaling on EKS. Use when configuring node provisioning, optimizing costs with Spot instances, replacing Cluster Autoscaler, implementing consolidation, or achieving 20-70% cost savings.
Intelligent, high-performance node autoscaling for Amazon EKS that provisions nodes in seconds, automatically selects optimal instance types, and reduces costs by 20-70% through Spot integration and consolidation.
Karpenter is the recommended autoscaler for production EKS workloads (2025), replacing Cluster Autoscaler with:
Real-World Results:
# Install Karpenter v1.0+ from the OCI chart registry
# (the legacy https://charts.karpenter.sh repo does not serve v1 charts)
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version 1.0.0 \
  --namespace kube-system \
  --set settings.clusterName=my-cluster \
  --set settings.interruptionQueue=my-cluster \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait
See: references/installation.md for complete setup including IRSA/Pod Identity
NodePool (defines scheduling requirements and limits):
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"] # Compute, general, memory-optimized
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"] # Gen 5+
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"
    memory: "1000Gi"
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized # v1 values: WhenEmpty, WhenEmptyOrUnderutilized
    consolidateAfter: 30s
    budgets:
      - nodes: "10%"
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2023 # Amazon Linux 2023
  amiSelectorTerms:
    - alias: al2023@latest # amiSelectorTerms is required in the v1 API
  role: KarpenterNodeRole-my-cluster
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        encrypted: true
        deleteOnTermination: true
kubectl apply -f nodepool.yaml
See: references/nodepools.md for advanced NodePool patterns
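# Deploy test workload
kubectl create deployment inflate --image=public.ecr.aws/eks-distro/kubernetes/pause:3.7 \
  --replicas=0
# Give each pod a CPU request so that scaling up actually requires new capacity
kubectl set resources deployment inflate --requests=cpu=1
# Scale up to trigger node provisioning
kubectl scale deployment inflate --replicas=10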
# Watch Karpenter provision nodes (seconds!)
kubectl logs -f -n kube-system -l app.kubernetes.io/name=karpenter -c controller
# Verify nodes
kubectl get nodes -l karpenter.sh/nodepool=default
# Scale down to trigger consolidation
kubectl scale deployment inflate --replicas=0
# Watch Karpenter consolidate (30s after scale-down)
kubectl logs -f -n kube-system -l app.kubernetes.io/name=karpenter -c controller
# Check NodePool status
kubectl get nodepools
# View disruption metrics
kubectl describe nodepool default
# Monitor provisioning decisions
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep -i "launched\|terminated"
# Cost optimization metrics
kubectl top nodes
See: references/optimization.md for cost optimization strategies
Key Resources (v1.0+):
How It Works:
Consolidation:
| Feature | Karpenter NodePool | Cluster Autoscaler |
|---------|-------------------|-------------------|
| Provisioning Speed | 30-60 seconds | 2-5 minutes |
| Instance Selection | Automatic (600+ types) | Manual (pre-defined) |
| Bin-Packing | Intelligent | Limited |
| Spot Integration | Built-in, intelligent | Requires node groups |
| Consolidation | Automatic | Manual |
| Configuration | Single NodePool | Multiple node groups |
| Cost Savings | 20-70% | 10-20% |
Use case: Production-grade installation with infrastructure as code
# Karpenter module
module "karpenter" {
  source  = "terraform-aws-modules/eks/aws//modules/karpenter"
  version = "~> 20.0"

  cluster_name           = module.eks.cluster_name
  irsa_oidc_provider_arn = module.eks.oidc_provider_arn

  # Enable Pod Identity (2025 recommended)
  enable_pod_identity = true

  # Additional IAM policies
  node_iam_role_additional_policies = {
    AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  }

  tags = {
    Environment = "production"
  }
}

# Helm release
resource "helm_release" "karpenter" {
  namespace  = "kube-system"
  name       = "karpenter"
  repository = "oci://public.ecr.aws/karpenter"
  chart      = "karpenter"
  version    = "1.0.0"

  set {
    name  = "settings.clusterName"
    value = module.eks.cluster_name
  }

  set {
    name  = "settings.interruptionQueue"
    value = module.karpenter.queue_name
  }

  set {
    name  = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = module.karpenter.iam_role_arn
  }
}
Steps:
references/installation.md
terraform apply
kubectl get pods -n kube-system -l app.kubernetes.io/name=karpenter
See: references/installation.md for complete Terraform setup
Use case: Optimize costs while maintaining availability (recommended: 30% On-Demand, 70% Spot)
Critical NodePool (On-Demand only):
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: critical
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      taints:
        - key: "critical"
          value: "true"
          effect: "NoSchedule"
  limits:
    cpu: "200"
  weight: 100 # Higher priority
Flexible NodePool (Spot preferred):
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: flexible
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "800"
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    budgets:
      - nodes: "20%"
  weight: 10 # Lower priority (use after critical)
Pod tolerations for critical workloads:
spec:
  tolerations:
    - key: "critical"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  nodeSelector:
    karpenter.sh/capacity-type: on-demand
Steps:
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep -i interrupt
See: references/nodepools.md for Spot strategies
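To check how close a running cluster is to the 30% On-Demand / 70% Spot split recommended for this pattern, the node list can be tallied by capacity type. The following is a minimal sketch, not part of Karpenter itself: `capacity_split` is a hypothetical helper name, and it assumes its input is the output of `kubectl get nodes -L karpenter.sh/capacity-type --no-headers`, where the capacity type appears as the last column. It counts nodes, not vCPUs, so it only approximates the cost split.

```shell
# capacity_split: read `kubectl get nodes -L karpenter.sh/capacity-type --no-headers`
# output on stdin and print the On-Demand vs Spot node counts.
# Hypothetical helper for illustration; nodes without the label are ignored.
capacity_split() {
  awk '
    $NF == "spot"      { spot++ }
    $NF == "on-demand" { od++ }
    END {
      total = spot + od
      if (total == 0) {
        print "no nodes with a karpenter.sh/capacity-type label found"
        exit 1
      }
      # Integer share of On-Demand nodes (truncated)
      printf "on-demand=%d spot=%d on-demand-share=%d%%\n", od, spot, (od * 100) / total
    }'
}
```

Usage, assuming `kubectl` points at the cluster: `kubectl get nodes -L karpenter.sh/capacity-type --no-headers | capacity_split`.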
Use case: Reduce costs by automatically consolidating underutilized nodes
Aggressive consolidation (development/staging):
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s # Consolidate quickly
    budgets:
      - nodes: "50%" # Allow disrupting 50% of nodes
Conservative consolidation (production):
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m # Wait 5 minutes before consolidating
    budgets:
      - nodes: "10%" # Always active; the most restrictive active budget applies
      - schedule: "0 9-17 * * MON-FRI" # Business hours (schedules are evaluated in UTC)
        duration: 1h # schedule and duration must be set together
        nodes: "20%"
      - schedule: "0 0-8,18-23 * * *" # Off-hours
        duration: 1h
        nodes: "5%"
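The arithmetic behind a percentage budget can be sketched in a few lines of shell. This is an illustration under stated assumptions, not Karpenter's code: it assumes the rule described in the Karpenter disruption documentation, where the allowance is the percentage of total nodes rounded up, minus nodes already being deleted and nodes that are not ready, and where the most restrictive active budget wins. The function names `allowed_disruptions` and `most_restrictive` are hypothetical.

```shell
# allowed_disruptions TOTAL PERCENT [DELETING] [NOT_READY]
# Assumed rule: roundup(total * percent / 100) - deleting - not_ready, floored at 0.
allowed_disruptions() {
  total=$1
  pct=$2
  deleting=${3:-0}
  not_ready=${4:-0}
  # Integer ceiling of total * pct / 100
  ceil=$(( (total * pct + 99) / 100 ))
  allowed=$(( ceil - deleting - not_ready ))
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

# most_restrictive TOTAL PERCENT...
# Prints the smallest allowance among several simultaneously active budgets.
most_restrictive() {
  total=$1
  shift
  min=""
  for pct in "$@"; do
    # The command substitution runs in a subshell, so the helper's
    # variables do not overwrite the ones used by this loop.
    n=$(allowed_disruptions "$total" "$pct")
    if [ -z "$min" ] || [ "$n" -lt "$min" ]; then
      min=$n
    fi
  done
  echo "$min"
}
```

For example, `allowed_disruptions 19 10` prints `2`, and `most_restrictive 40 10 5` prints `2`: with 40 nodes, a 5% budget that is active alongside a 10% budget caps disruption at 2 nodes.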
Pod Disruption Budget (protect critical pods):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: critical-app
Steps:
references/optimization.md
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep consolidat
Expected savings: 15-30% additional reduction beyond Spot savings
See: references/optimization.md for consolidation best practices
Use case: Upgrade from Cluster Autoscaler to Karpenter for better performance and cost savings
Migration strategy (zero-downtime):
1. Install Karpenter (runs alongside Cluster Autoscaler)
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version 1.0.0 \
  --namespace kube-system \
  --set settings.clusterName=my-cluster
2. Create NodePool with distinct labels
spec:
  template:
    metadata:
      labels:
        provisioner: karpenter
3. Migrate workloads gradually
# Add node selector to new deployments
spec:
  nodeSelector:
    provisioner: karpenter
4. Monitor both autoscalers
# Watch Karpenter
kubectl logs -f -n kube-system -l app.kubernetes.io/name=karpenter
# Watch Cluster Autoscaler
kubectl logs -f -n kube-system -l app=cluster-autoscaler
5. Gradually scale down CA node groups
# Reduce desired size of CA node groups
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name ca-nodes \
  --scaling-config desiredSize=1,minSize=0,maxSize=3
6. Remove Cluster Autoscaler tags
# Remove tags from node groups
# k8s.io/cluster-autoscaler/enabled
# k8s.io/cluster-autoscaler/<cluster-name>
7. Uninstall Cluster Autoscaler
helm uninstall cluster-autoscaler -n kube-system
Testing checklist:
Rollback plan: Keep CA node groups at min size until confident in Karpenter
Use case: Automatically provision GPU instances for ML workloads
GPU NodePool:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"] # GPU typically on-demand
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g4dn", "g5", "p3", "p4d"]
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ["0"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu
      taints:
        - key: "nvidia.com/gpu"
          value: "true"
          effect: "NoSchedule"
  limits:
    cpu: "1000"
    nvidia.com/gpu: "8"
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu
spec:
  amiFamily: AL2 # AL2 with GPU drivers
  amiSelectorTerms:
    - alias: al2@latest # Latest GPU-enabled AMI
  role: KarpenterNodeRole-my-cluster
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  userData: |
    #!/bin/bash
    # Install NVIDIA device plugin
    /etc/eks/bootstrap.sh my-cluster
GPU workload:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: cuda-container
      image: nvidia/cuda:11.8.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
See: references/nodepools.md for GPU configuration details
Prevent runaway scaling:
spec:
  limits:
    cpu: "1000" # Max 1000 CPUs across all nodes in pool
    memory: "1000Gi" # Max 1000Gi memory
    nvidia.com/gpu: "8" # Max 8 GPUs
Balance cost savings with stability:
spec:
  template:
    spec:
      # Node expiration (security patching); in the v1 API this field sits under template.spec
      expireAfter: 720h # 30 days
  disruption:
    # When to consolidate: WhenEmpty or WhenEmptyOrUnderutilized
    consolidationPolicy: WhenEmptyOrUnderutilized
    # Delay before consolidating (prevent flapping)
    consolidateAfter: 30s
    # Disruption budgets (rate limiting)
    budgets:
      - nodes: "10%" # Max 10% of nodes disrupted at once
        reasons:
          - Underutilized
          - Empty
      - schedule: "0 0-8 * * *" # Off-hours: more aggressive
        duration: 1h # schedule and duration must be set together
        nodes: "50%"
Maximize Spot availability and cost savings:
spec:
  template:
    spec:
      requirements:
        # Architecture
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"] # Include ARM for savings
        # Instance categories (c=compute, m=general, r=memory)
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        # Instance generation (5+ for best performance/cost)
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]
        # Instance size (exclude large sizes if not needed)
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values: ["metal", "32xlarge", "24xlarge"]
        # Capacity type
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
Result: Karpenter selects from 600+ instance types, maximizing Spot availability
# NodePool status
kubectl get nodepools
# NodeClaim status (pending provisions)
kubectl get nodeclaims
# Node events
kubectl get events --field-selector involvedObject.kind=Node
# Karpenter controller logs
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter -c controller --tail=100
# Filter for provisioning decisions
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep "launched instance"
# Filter for consolidation events
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep "consolidating"
# Spot interruption warnings
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep "interrupt"
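The three log filters above can be combined into a single pass that reports a count per event type. This is a minimal sketch: `summarize_karpenter_logs` is a hypothetical helper name, and it matches the same keywords the `grep` commands use, case-insensitively. The exact message wording in Karpenter's logs varies between versions, so the patterns are an assumption and may need adjusting.

```shell
# summarize_karpenter_logs: read controller log lines on stdin and count
# lines mentioning launches, consolidation, and interruptions.
# Hypothetical helper; keyword patterns are assumptions, not a stable log format.
summarize_karpenter_logs() {
  awk '
    { line = tolower($0) }
    line ~ /launched/   { launched++ }
    line ~ /consolidat/ { consolidation++ }
    line ~ /interrupt/  { interruption++ }
    END {
      printf "launched=%d consolidation=%d interruption=%d\n", launched, consolidation, interruption
    }'
}
```

Usage, assuming `kubectl` points at the cluster: `kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter -c controller --tail=1000 | summarize_karpenter_logs`.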
1. Nodes not provisioning:
# Check NodePool status
kubectl describe nodepool default
# Check for unschedulable pods
kubectl get pods -A --field-selector=status.phase=Pending
# Review Karpenter logs for errors
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep -i error
Common causes:
2. Excessive consolidation (pod restarts):
# Increase consolidateAfter delay
spec:
  disruption:
    consolidateAfter: 5m # Increase from 30s
3. Spot interruptions causing issues:
# Reduce Spot ratio
- key: karpenter.sh/capacity-type
  operator: In
  values: ["on-demand"] # Use more on-demand
Detailed Guides (load on-demand):
references/installation.md - Complete installation with Helm, Terraform, IRSA, Pod Identity
references/nodepools.md - NodePool and EC2NodeClass configuration patterns
references/optimization.md - Cost optimization, consolidation, disruption budgets
Official Resources:
Community Examples:
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version 1.0.0 \
  --namespace kube-system \
  --set settings.clusterName=my-cluster
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
kubectl logs -f -n kube-system -l app.kubernetes.io/name=karpenter
Next Steps: Install Karpenter using references/installation.md, then configure NodePools with references/nodepools.md