Verifying EKS Managed Node Group Warm Pools — Scale-Out Times by poolState and the reuseOnScaleIn Reality
Introduction
On April 8, 2026, AWS announced warm pool support for EKS managed node groups. Warm pools maintain pre-initialized EC2 instances—with OS initialization, user data execution, and software configuration already complete—that can join the cluster without a full cold-start sequence during scale-out events.
The effectiveness of warm pools varies significantly based on poolState selection. Running offers the fastest scale-out but incurs continuous instance charges, while Stopped is cost-efficient but slower to transition. This article quantifies the "speed vs. cost" tradeoff with real measurements and provides a decision framework for choosing the right poolState based on workload characteristics.
We measured scale-out times across three patterns and present the data needed to decide whether warm pools are worth adopting and which poolState to choose. For official documentation, see Warm pools with managed node groups.
How Warm Pools Work
When you enable a warm pool, EKS creates an EC2 Auto Scaling warm pool attached to the node group's Auto Scaling group. Instances wait in the warm pool in one of three states:
| poolState | State | Maintenance Cost (t3.medium/mo) | Scale-Out Behavior |
|---|---|---|---|
| Running | Running | ~$32 (instance + EBS) | Joins cluster directly |
| Stopped | Stopped | ~$2 (EBS only) | Starts → joins cluster |
| Hibernated | Hibernated | ~$2 (EBS + RAM) | Restores RAM → joins cluster |
Cost estimates based on t3.medium (gp3 20GB) pricing in ap-northeast-1. Varies by instance type and disk size.
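As a sanity check, the table's estimates can be reproduced with simple arithmetic. The rates below are illustrative placeholders, not current AWS prices — check the EC2 and EBS pricing pages for your region.

```shell
# Rough reproduction of the cost estimates above. The rates are
# illustrative placeholders, not current AWS pricing.
HOURLY_RATE=0.0416     # hypothetical t3.medium on-demand $/hr
GB_MONTH_RATE=0.096    # hypothetical gp3 $/GB-month
DISK_GB=20
HOURS_PER_MONTH=730

# Stopped/Hibernated pools pay only for the EBS volume
EBS_COST=$(awk -v r="$GB_MONTH_RATE" -v g="$DISK_GB" 'BEGIN { printf "%.2f", r * g }')
# Running pools pay for the instance continuously, plus the volume
RUNNING_COST=$(awk -v r="$HOURLY_RATE" -v h="$HOURS_PER_MONTH" -v e="$EBS_COST" \
  'BEGIN { printf "%.2f", r * h + e }')

echo "Stopped/Hibernated (EBS only): ~\$${EBS_COST}/mo"
echo "Running (instance + EBS):      ~\$${RUNNING_COST}/mo"
```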
Prerequisites:
- AWS CLI v2.34+ configured (eks:*, ec2:*, iam:*, autoscaling:* permissions)
- kubectl v1.35+
- Test region: ap-northeast-1 (Tokyo)
Skip to Verification 1 if you only want the results.
Environment Setup
EKS cluster, node role, and node group creation steps
We used an existing EKS cluster (v1.35). For Auto Mode clusters, managed node groups require vpc-cni, kube-proxy, and coredns addons.
REGION=ap-northeast-1
CLUSTER=eks-sandbox
aws eks create-addon --cluster-name $CLUSTER --addon-name vpc-cni --region $REGION
aws eks create-addon --cluster-name $CLUSTER --addon-name kube-proxy --region $REGION
aws eks create-addon --cluster-name $CLUSTER --addon-name coredns --region $REGION

Create a node role with the required policies.
cat > node-trust-policy.json << 'EOF'
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": { "Service": "ec2.amazonaws.com" },
"Action": "sts:AssumeRole"
}]
}
EOF
aws iam create-role \
--role-name eks-warmpool-verify-node-role \
--assume-role-policy-document file://node-trust-policy.json
aws iam attach-role-policy --role-name eks-warmpool-verify-node-role \
--policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
aws iam attach-role-policy --role-name eks-warmpool-verify-node-role \
--policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
aws iam attach-role-policy --role-name eks-warmpool-verify-node-role \
--policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly

Create three node groups: cold start (no warm pool), Stopped warm pool, and Running warm pool.
# Get account ID
ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
NODE_ROLE="arn:aws:iam::${ACCOUNT_ID}:role/eks-warmpool-verify-node-role"
# Get cluster's private subnets
SUBNETS=$(aws eks describe-cluster --name $CLUSTER --region $REGION \
--query 'cluster.resourcesVpcConfig.subnetIds' --output text)
# 1. Cold start (no warm pool)
aws eks create-nodegroup \
--cluster-name $CLUSTER --nodegroup-name ng-cold-start \
--node-role $NODE_ROLE --subnets $SUBNETS \
--instance-types t3.medium \
--scaling-config minSize=2,maxSize=5,desiredSize=2 \
--region $REGION
# 2. Stopped warm pool
aws eks create-nodegroup \
--cluster-name $CLUSTER --nodegroup-name ng-warm-stopped \
--node-role $NODE_ROLE --subnets $SUBNETS \
--instance-types t3.medium \
--scaling-config minSize=2,maxSize=5,desiredSize=2 \
--warm-pool-config enabled=true,maxGroupPreparedCapacity=4,minSize=1,poolState=Stopped,reuseOnScaleIn=true \
--region $REGION
# 3. Running warm pool
aws eks create-nodegroup \
--cluster-name $CLUSTER --nodegroup-name ng-warm-running \
--node-role $NODE_ROLE --subnets $SUBNETS \
--instance-types t3.medium \
--scaling-config minSize=2,maxSize=5,desiredSize=2 \
--warm-pool-config enabled=true,maxGroupPreparedCapacity=4,minSize=1,poolState=Running,reuseOnScaleIn=false \
--region $REGION

Warm pool configuration parameters:
- enabled=true — Enable the warm pool
- maxGroupPreparedCapacity=4 — Maximum total instances across the warm pool and the ASG. With desiredSize=2, the warm pool gets 4-2=2 instances
- minSize=1 — Minimum instances kept in the warm pool
- poolState=Stopped|Running — Instance state in the warm pool
- reuseOnScaleIn=true — Return instances to the warm pool on scale-in instead of terminating them
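The sizing rule can be sketched as shell arithmetic. This mirrors our reading of the EC2 Auto Scaling warm pool semantics (prepared capacity minus in-service desired capacity, floored at the pool's own minSize); it is not an official formula.

```shell
# Sketch of warm pool sizing: maxGroupPreparedCapacity minus the ASG's
# desired capacity, never below the pool's minSize. Our reading of the
# EC2 Auto Scaling docs, not an official formula.
warm_pool_size() {
  local max_prepared=$1 desired=$2 pool_min=$3
  local size=$(( max_prepared - desired ))
  [ "$size" -lt "$pool_min" ] && size=$pool_min
  echo "$size"
}

warm_pool_size 4 2 1   # this article's config: 4 - 2 = 2 warm instances
warm_pool_size 4 4 1   # fully scaled out: minSize still keeps 1 warm
```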
After node groups become ACTIVE, wait for warm pool instances to reach the target state (Stopped/Running).
ASG=$(aws eks describe-nodegroup --cluster-name $CLUSTER --nodegroup-name ng-warm-stopped \
--region $REGION --query 'nodegroup.resources.autoScalingGroups[0].name' --output text)
aws autoscaling describe-warm-pool --auto-scaling-group-name $ASG --region $REGION \
--query 'Instances[].{InstanceId:InstanceId,State:LifecycleState}' --output table

-------------------------------------------
| DescribeWarmPool |
+----------------------+------------------+
| InstanceId | State |
+----------------------+------------------+
| i-0a546e64a73986140 | Warmed:Stopped |
| i-0c0c0f4ad0652338a | Warmed:Stopped |
+----------------------+------------------+

Verification 1: Scale-Out Time by poolState
We measured scale-out time across three patterns. For each, we changed desiredSize from 2 to 3 and recorded the time until the new node showed Ready in kubectl get nodes.
We used the following polling script for measurement. Replace NODEGROUP with each node group name.
Measurement polling script
NODEGROUP="ng-cold-start" # Replace with ng-warm-stopped / ng-warm-running
START_TIME=$(date +%s)
while true; do
READY=$(kubectl get nodes --selector="eks.amazonaws.com/nodegroup=$NODEGROUP" \
--no-headers 2>/dev/null | grep " Ready " | wc -l)
TOTAL=$(kubectl get nodes --selector="eks.amazonaws.com/nodegroup=$NODEGROUP" \
--no-headers 2>/dev/null | wc -l)
ELAPSED=$(( $(date +%s) - START_TIME ))
echo "$(date +%T) [${ELAPSED}s] Nodes: $TOTAL total, $READY ready"
[ "$READY" -ge 3 ] && echo "=== New node Ready! ===" && break
sleep 5
done

(a) Cold Start (No Warm Pool)
aws eks update-nodegroup-config \
--cluster-name eks-sandbox --nodegroup-name ng-cold-start \
--scaling-config minSize=2,maxSize=5,desiredSize=3 \
--region ap-northeast-1

17:30:27 [3s] Nodes: 2 total, 2 ready
17:30:54 [30s] Nodes: 2 total, 2 ready
17:31:06 [42s] Nodes: 3 total, 2 ready ← node appeared
17:31:26 [62s] Nodes: 3 total, 3 ready ← Ready

Result: 62 seconds. Looking at the log breakdown, the node appeared at 42s (EC2 instance launch complete, kubelet startup began), then took 20 more seconds to reach Ready (CNI configuration, node registration complete).
(b) Stopped Warm Pool
aws eks update-nodegroup-config \
--cluster-name eks-sandbox --nodegroup-name ng-warm-stopped \
--scaling-config minSize=2,maxSize=5,desiredSize=3 \
--region ap-northeast-1

17:31:39 [2s] Nodes: 2 total, 2 ready
17:32:05 [28s] Nodes: 3 total, 2 ready ← node appeared
17:32:25 [48s] Nodes: 3 total, 3 ready ← Ready

Result: 48 seconds. 14 seconds (23%) faster than cold start. Since warm pool instances are stopped with OS initialization already complete, the full boot sequence is skipped, which likely accounts for the reduction.
(c) Running Warm Pool
aws eks update-nodegroup-config \
--cluster-name eks-sandbox --nodegroup-name ng-warm-running \
--scaling-config minSize=2,maxSize=5,desiredSize=3 \
--region ap-northeast-1

17:32:38 [2s] Nodes: 2 total, 2 ready
17:32:51 [15s] Nodes: 3 total, 2 ready ← node appeared
17:33:04 [28s] Nodes: 3 total, 3 ready ← Ready

Result: 28 seconds. 34 seconds (55%) faster than cold start. Since the Running instance is already up, both EC2 launch and OS initialization are skipped. The remaining 28 seconds is likely spent primarily on EKS bootstrap (cluster join + CNI setup).
Verification 2: reuseOnScaleIn Behavior and Re-Scale-Out Speed
Verification 1 established the Stopped warm pool baseline (48s). Verification 2 tests the scale-in → re-scale-out cycle with reuseOnScaleIn=true to see if reused instances are faster.
After Verification 1, ng-warm-stopped has desiredSize=3. We scale in from 3 to 2.
Scale-In: Instance Return to Warm Pool
We changed desiredSize from 3 to 2 to trigger scale-in.
aws eks update-nodegroup-config \
--cluster-name eks-sandbox --nodegroup-name ng-warm-stopped \
--scaling-config minSize=2,maxSize=5,desiredSize=2 \
--region ap-northeast-1

The scale-in process progressed through these stages. You can observe the warm pool state transitions with the following command:
# Get ASG name (skip if already obtained during setup)
ASG=$(aws eks describe-nodegroup --cluster-name eks-sandbox --nodegroup-name ng-warm-stopped \
--region ap-northeast-1 --query 'nodegroup.resources.autoScalingGroups[0].name' --output text)
# Periodically check node and warm pool status
while true; do
echo "--- $(date +%T) ---"
kubectl get nodes --selector='eks.amazonaws.com/nodegroup=ng-warm-stopped' --no-headers
aws autoscaling describe-warm-pool --auto-scaling-group-name $ASG --region ap-northeast-1 \
--query 'Instances[].{Id:InstanceId,State:LifecycleState}' --output table
sleep 15
done

17:33:43 Node: Ready,SchedulingDisabled ← drain started
17:34:35 Node: NotReady,SchedulingDisabled ← kubelet stopped
17:49:09 Warm pool: Warmed:Pending:Proceed ← lifecycle hook completed
17:49:40 Warm pool: Warmed:Stopped ← return complete

The full cycle from scale-in to warm pool return took approximately 16 minutes. EKS automatically configures a Terminate-LC-Hook on the ASG with a heartbeat timeout of 1800 seconds (30 minutes). Although the actual node drain completed in about 1 minute, the lifecycle hook processing caused the additional wait. The documentation states that warm pools should always be configured through the EKS API and that warm pool settings should not be modified directly via the EC2 Auto Scaling API, so adjusting this timeout is not recommended. The hooks can be inspected with aws autoscaling describe-lifecycle-hooks --auto-scaling-group-name $ASG:
{
"LifecycleHooks": [
{
"LifecycleHookName": "Launch-LC-Hook",
"LifecycleTransition": "autoscaling:EC2_INSTANCE_LAUNCHING",
"HeartbeatTimeout": 1800
},
{
"LifecycleHookName": "Terminate-LC-Hook",
"LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
"HeartbeatTimeout": 1800
}
]
}

Re-Scale-Out: Reused Instance Speed
We scaled back out using the instance that was returned to the warm pool.
aws eks update-nodegroup-config \
--cluster-name eks-sandbox --nodegroup-name ng-warm-stopped \
--scaling-config minSize=2,maxSize=5,desiredSize=3 \
--region ap-northeast-1

17:49:55 [2s] Nodes: 2 total, 2 ready
17:50:22 [29s] Nodes: 2 total, 2 ready
17:50:28 [35s] Nodes: 3 total, 3 ready ← Ready

Result: 35 seconds. 13 seconds faster than the initial Stopped warm pool (48s). On its first launch, kubelet must perform TLS bootstrapping (certificate issuance) and register a new Node object with the cluster. A reused instance already has its credentials and node identity from the previous join, which likely accounts for the faster time to Ready.
Verification 3: Scale-Out with Pending Pods — Time to Pod Running
Verifications 1-2 used manual desiredSize changes. Verification 3 simulates a production scenario with pending pods to measure the full cycle from scale-out to pod Running.
After Verification 2's re-scale-out, ng-warm-stopped has desiredSize=3. First, scale back to 2.
aws eks update-nodegroup-config \
--cluster-name eks-sandbox --nodegroup-name ng-warm-stopped \
--scaling-config minSize=2,maxSize=5,desiredSize=2 \
--region ap-northeast-1
# Wait for warm pool instances to return to Warmed:Stopped (~16 min)

Start the verification once desiredSize is back to 2 with 2 Stopped instances in the warm pool.
We filled the existing 2 nodes with resource-heavy pods and created a third pod that went Pending.
Pod manifest (filler-pod × 2 + pending-pod × 1)
kubectl apply -f - << 'EOF'
apiVersion: v1
kind: Pod
metadata:
name: filler-pod-1
spec:
nodeSelector:
eks.amazonaws.com/nodegroup: ng-warm-stopped
containers:
- name: nginx
image: nginx:alpine
resources:
requests:
cpu: "1500m"
memory: "2Gi"
---
apiVersion: v1
kind: Pod
metadata:
name: filler-pod-2
spec:
nodeSelector:
eks.amazonaws.com/nodegroup: ng-warm-stopped
containers:
- name: nginx
image: nginx:alpine
resources:
requests:
cpu: "1500m"
memory: "2Gi"
---
apiVersion: v1
kind: Pod
metadata:
name: pending-pod
spec:
nodeSelector:
eks.amazonaws.com/nodegroup: ng-warm-stopped
containers:
- name: nginx
image: nginx:alpine
resources:
requests:
cpu: "1500m"
memory: "2Gi"
EOF

Since t3.medium has 2 vCPU / 4 GiB with approximately 1930m allocatable CPU, and each node also runs system pods, only one pod requesting 1500m CPU fits per node. As a result, only filler-pod-1 reached Running, while filler-pod-2 and pending-pod went Pending.
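The per-node fit can be checked with integer arithmetic. The ~1930m allocatable figure is the one quoted above; the system pod overhead is a rough placeholder, not a measured value.

```shell
# How many 1500m pods fit on a t3.medium node? ALLOCATABLE_M is the
# approximate figure quoted in the text; SYSTEM_OVERHEAD_M is a rough
# placeholder for DaemonSet requests (aws-node, kube-proxy).
ALLOCATABLE_M=1930
POD_REQUEST_M=1500
SYSTEM_OVERHEAD_M=200

PODS_PER_NODE=$(( (ALLOCATABLE_M - SYSTEM_OVERHEAD_M) / POD_REQUEST_M ))
echo "1500m pods per node: $PODS_PER_NODE"
```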
NAME READY STATUS AGE
filler-pod-1 1/1 Running 6s
filler-pod-2 0/1 Pending 6s
pending-pod 0/1 Pending 6s

With 2 pods Pending, we needed 2 additional nodes. We scaled out from the Stopped warm pool by changing desiredSize from 2 to 4.
aws eks update-nodegroup-config \
--cluster-name eks-sandbox --nodegroup-name ng-warm-stopped \
--scaling-config minSize=2,maxSize=5,desiredSize=4 \
--region ap-northeast-1

17:53:29 [2s] Running: 1/3
17:53:55 [28s] filler-pod-2: ContainerCreating ← node Ready, pod scheduled
17:54:21 [54s] Running: 2/3
17:54:34 [67s] Running: 3/3 ← all pods Running

Result: 67 seconds (all pods Running). Breakdown: ~48s for node Ready + ~19s for pod scheduling and image pull. With Cluster Autoscaler, additional overhead for pending pod detection and ASG desired size update would apply (not measured in this article).
Overall Comparison: poolState × Scaling Method
| Pattern | Time to Node Ready | Maintenance Cost (1 instance/mo) | Notes |
|---|---|---|---|
| Cold start (no warm pool) | 62s | $0 | Baseline |
| Stopped warm pool (initial) | 48s (-23%) | ~$2 (EBS only) | OS restart only |
| Running warm pool (initial) | 28s (-55%) | ~$32 (instance + EBS) | Bootstrap only |
| reuseOnScaleIn reuse (Stopped) | 35s (-44%) | ~$2 (EBS only) | Faster reuse (likely bootstrap cache) |
| Stopped + Pod Running | 67s | ~$2 (EBS only) | Includes image pull |
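The reduction percentages in the table follow directly from the raw timings and can be recomputed:

```shell
# Recompute the table's reduction percentages from the measured timings.
pct_faster() {
  awk -v base="$1" -v t="$2" 'BEGIN { printf "%.0f", (base - t) / base * 100 }'
}

echo "Stopped (initial): $(pct_faster 62 48)% faster than cold start"
echo "Running (initial): $(pct_faster 62 28)% faster than cold start"
echo "Stopped (reused):  $(pct_faster 62 35)% faster than cold start"
```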
Decision Framework for poolState Selection
Based on the measurements:
- Choose Running when scale-out speed is the top priority and you need nodes within 30 seconds. Keep warm pool size minimal since instance charges continue. Best for workloads with frequent bursts where per-event latency cost is high
- Choose Stopped when cost efficiency matters and ~50 seconds of scale-out time is acceptable. At roughly 1/16 the cost of Running (EBS charges only), this suits low-to-medium burst frequency workloads
- Enable reuseOnScaleIn when scale-in/out cycles are frequent. Reused instances are 13 seconds faster than initial Stopped. However, the ~16 minute drain wait for scale-in to warm pool return means this is less effective for rapid consecutive bursts
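One way to weigh Running against Stopped is cost per saved second. A rough sketch using this article's figures, where the number of scale-out events per month is a hypothetical workload parameter:

```shell
# Rough cost-per-saved-second sketch for Running vs Stopped. The $30/mo
# delta and 20s saving come from this article's measurements; events per
# month is a hypothetical workload parameter, not a measurement.
EXTRA_COST_PER_MONTH=30     # ~$32 (Running) - ~$2 (Stopped)
SECONDS_SAVED_PER_EVENT=20  # 48s (Stopped) - 28s (Running)

cost_per_saved_second() {
  awk -v c="$EXTRA_COST_PER_MONTH" -v s="$SECONDS_SAVED_PER_EVENT" -v n="$1" \
    'BEGIN { printf "%.2f", c / (s * n) }'
}

echo "10 events/mo:  \$$(cost_per_saved_second 10) per saved second"
echo "150 events/mo: \$$(cost_per_saved_second 150) per saved second"
```

The more often you scale out, the cheaper each saved second becomes, which is why Running suits frequent-burst workloads.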
When NOT to Use Warm Pools
- For instance types with short boot times like t3.medium, the gap between cold start and warm pool is only 14-34 seconds. The benefit increases with heavy user data initialization (large package installs, data downloads)
- Warm pools support only EKS optimized AMIs; if you use a custom AMI, warm pools aren't available. Any initialization must instead be done via launch template user data
- Bottlerocket AMIs don't support Hibernated state or reuseOnScaleIn
Summary
- Running is fast but expensive — 28 seconds to add a node, but instance charges continue. Reserve for workloads with strict latency requirements
- Stopped offers the best balance — 48 seconds with EBS-only charges. Sufficient for most workloads
- reuseOnScaleIn has a 16-minute drain bottleneck — Re-scale-out is 35 seconds (fast), but returning to the warm pool took ~16 minutes. EKS's auto-configured lifecycle hook (1800s timeout) is involved, and the documentation does not recommend modifying this value directly
- Pod Running time = Node Ready + image pull — Even after a node is Ready, pods take an additional ~19 seconds to reach Running. Consider image pre-pulling and image size optimization alongside warm pool configuration
Cleanup steps
REGION=ap-northeast-1
CLUSTER=eks-sandbox
# Delete pods
kubectl delete pod filler-pod-1 filler-pod-2 pending-pod
# Delete node groups
for NG in ng-cold-start ng-warm-stopped ng-warm-running; do
aws eks delete-nodegroup --cluster-name $CLUSTER --nodegroup-name $NG --region $REGION
done
# Wait for deletion
for NG in ng-cold-start ng-warm-stopped ng-warm-running; do
aws eks wait nodegroup-deleted --cluster-name $CLUSTER --nodegroup-name $NG --region $REGION
done
# Delete addons (if added for Auto Mode cluster)
aws eks delete-addon --cluster-name $CLUSTER --addon-name vpc-cni --region $REGION
aws eks delete-addon --cluster-name $CLUSTER --addon-name kube-proxy --region $REGION
aws eks delete-addon --cluster-name $CLUSTER --addon-name coredns --region $REGION
# Delete IAM role
aws iam detach-role-policy --role-name eks-warmpool-verify-node-role \
--policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
aws iam detach-role-policy --role-name eks-warmpool-verify-node-role \
--policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
aws iam detach-role-policy --role-name eks-warmpool-verify-node-role \
--policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
aws iam delete-role-policy --role-name eks-warmpool-verify-node-role \
--policy-name ClusterAutoscalerPolicy 2>/dev/null # Only if added
aws iam delete-role --role-name eks-warmpool-verify-node-role