This guide walks you through upgrading EC2 instance types for an existing Amazon EKS cluster by creating a new managed node group, migrating workloads, and then tearing down the old group.
If you’re coming from the MilkStraw AI recommender, you’ll plug the recommended instance types into the steps below.

Before you start

Make sure:
  • You can log in to the AWS Console and have IAM permissions for EKS, EC2, and IAM.
  • kubectl is installed and configured for the correct cluster:
    aws eks update-kubeconfig --region <REGION> --name <CLUSTER_NAME>
    
  • You know whether your workloads require:
    • GPU nodes, and/or
    • Spot capacity (only use Spot if you understand interruption behavior and have workloads that tolerate it).
  • If you use public subnets for nodes:
    • MapPublicIpOnLaunch=true is set.
    • Subnets have the standard EKS tags:
      • kubernetes.io/cluster/<CLUSTER_NAME> = shared
      • kubernetes.io/role/elb = 1 (for public subnets).
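If you want to double-check those subnet settings from the CLI, a quick query along these lines should work (<SUBNET_ID> is a placeholder for one of your node subnets):
aws ec2 describe-subnets \
  --subnet-ids <SUBNET_ID> \
  --query 'Subnets[0].{AutoAssignPublicIp:MapPublicIpOnLaunch,Tags:Tags}'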

Step 1 · Create a new node group with the target instance type

You’ll first add a new node group using the new instance type(s). This lets you migrate workloads safely before deleting the old group.

1.1 Open the EKS cluster and add a node group

In the AWS Console:
  1. Go to Amazon EKS → Clusters.
  2. Select your cluster.
  3. Open the Compute tab.
  4. Choose Add node group.

1.2 Basics

Configure:
  • Name: Use something descriptive and versioned, for example:
    production-core-ng-v2
    
    A good pattern is: env-project-function-version.
  • Node IAM role: Choose the role with standard EKS worker node permissions, or create one if needed.
  • (Optional) Launch template: Use a launch template if you need:
    • Custom AMI or user data
    • Specific disk types / encryption settings
    • Extra EC2-level configuration
  • (Optional) Labels / Taints / Tags
    • Example label:
      nodepool = new
      
    • Example taint (useful for GPU-only workloads or to keep pods off until you’re ready):
      key = no-scheduling
      effect = NoSchedule
      

1.3 Compute configuration

Under Compute configuration:
  • AMI type
    • General-purpose: Amazon Linux 2 or AL2023. AL2023 is newer and has different defaults; validate compatibility for your workloads.
    • GPU: Amazon Linux 2 GPU (or a Bottlerocket GPU variant if you already use Bottlerocket).
  • Capacity type
    • On-Demand for predictable capacity.
    • Spot only if your workloads tolerate interruption and you have a proper Spot strategy.
  • Instance types
    • Example general-purpose instance:
      m6a.xlarge
      
    • If you’re using the MilkStraw AI recommender, set this to the recommended instance type(s) from your report.
  • Disk size
    • For most applications, a 50–100 GiB root volume is sufficient.
    • Increase if pods make heavy use of emptyDir volumes or local caching.
  • Scaling
    • Set Desired, Min, and Max. For example:
      Desired capacity: 2
      Min size:         1
      Max size:         10
      
    • If you run Cluster Autoscaler, ensure these values work with your expected scale range.

1.4 Networking

Under Networking:
  • Subnets
    • Select the subnets where you want worker nodes to live.
    • For public subnets, confirm:
      • MapPublicIpOnLaunch=true
      • Proper EKS tags as mentioned above.
  • (Optional) SSH key pair
    • Add an SSH key only if you need direct SSH access to nodes.
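If you prefer the CLI to the console, a roughly equivalent aws eks create-nodegroup call is sketched below. Every name, ARN, and subnet ID is a placeholder, and the AMI type, disk size, and scaling values should mirror whatever you chose in the sections above:
aws eks create-nodegroup \
  --cluster-name <CLUSTER_NAME> \
  --nodegroup-name production-core-ng-v2 \
  --node-role <NODE_ROLE_ARN> \
  --subnets <SUBNET_ID_1> <SUBNET_ID_2> \
  --instance-types m6a.xlarge \
  --ami-type AL2023_x86_64_STANDARD \
  --capacity-type ON_DEMAND \
  --disk-size 50 \
  --scaling-config minSize=1,maxSize=10,desiredSize=2 \
  --labels nodepool=new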

1.5 Create and verify node readiness

Create the node group and wait for nodes to join the cluster. Watch nodes until they are Ready:
kubectl get nodes --watch
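To confirm the new nodes registered under the new group and are running the intended instance type, filter by the managed node group label and surface the standard instance-type label as a column:
kubectl get nodes \
  -l eks.amazonaws.com/nodegroup=<NEW_NODE_GROUP_NAME> \
  -L node.kubernetes.io/instance-type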

GPU clusters

If these are GPU nodes, install the NVIDIA device plugin after the nodes are Ready. In the NVIDIA k8s-device-plugin GitHub repo, locate the latest DaemonSet manifest and apply it:
kubectl apply -f <NVIDIA_DEVICE_PLUGIN_YAML_URL>
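To confirm the plugin is running and the GPUs are advertised as allocatable, checks along these lines should work (pod names and namespace depend on the manifest you applied; the official one deploys into kube-system):
kubectl get pods -n kube-system | grep -i nvidia
kubectl describe node <GPU_NODE_NAME> | grep -i 'nvidia.com/gpu'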

Step 2 · Move workloads to the new node group

You now have both old and new node groups attached to the cluster. The goal is to:
  1. Stop new pods from landing on the old nodes.
  2. Let the scheduler and Cluster Autoscaler move workloads to the new group.
  3. Drain and empty the old nodes.
The examples below assume you’re using Managed Node Groups.

2.1 Taint the old node group (EKS-managed taint)

First, add a taint to the old node group to prevent new pods from scheduling on it while existing pods continue running:
aws eks update-nodegroup-config \
  --cluster-name <CLUSTER_NAME> \
  --nodegroup-name <OLD_NODE_GROUP_NAME> \
  --taints addOrUpdateTaints='[{key=no-scheduling,effect=NO_SCHEDULE}]'
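You can confirm the taint was applied to the node group:
aws eks describe-nodegroup \
  --cluster-name <CLUSTER_NAME> \
  --nodegroup-name <OLD_NODE_GROUP_NAME> \
  --query 'nodegroup.taints'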

2.2 Cordon the old nodes

Cordoning marks nodes as unschedulable at the Kubernetes level:
kubectl get nodes -l eks.amazonaws.com/nodegroup=<OLD_NODE_GROUP_NAME> -o name \
  | xargs -n1 kubectl cordon
New pods will no longer be placed on these nodes.
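To confirm, list the old nodes; their STATUS should now show SchedulingDisabled:
kubectl get nodes -l eks.amazonaws.com/nodegroup=<OLD_NODE_GROUP_NAME>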

2.3 Drain the old nodes

Drain nodes to evict pods safely and move them to the new node group:
kubectl drain <OLD_NODE_NAME> \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=10m
Flags:
  • --ignore-daemonsets: DaemonSet pods are not evicted by drain. They are terminated when the node is deleted.
  • --delete-emptydir-data: Only use if pods don’t rely on data inside emptyDir volumes, as that data will be lost.
Additional considerations:
  • PodDisruptionBudgets (PDBs): If PDBs are strict, kubectl drain may block until disruptions are allowed. You might temporarily increase maxUnavailable for smoother migrations.
  • Unmanaged pods: If some pods are not controlled by a Deployment, ReplicaSet, or DaemonSet, you may need --force. Use this carefully and only if you understand the impact.
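Rather than draining nodes one by one by name, you can feed every node in the old group through the same command (same flags as above; nodes are drained sequentially):
kubectl get nodes -l eks.amazonaws.com/nodegroup=<OLD_NODE_GROUP_NAME> -o name \
  | xargs -n1 kubectl drain \
      --ignore-daemonsets \
      --delete-emptydir-data \
      --grace-period=60 \
      --timeout=10m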

2.4 Watch workloads reschedule

Monitor pods as they are rescheduled onto the new node group:
kubectl get pods -A -o wide --watch
If you set a label on the new node group (for example nodepool=new), make sure workloads can land there:
  • Add a matching nodeSelector or node affinity to workloads that must move (see the patch example at the end of this step):
    spec:
      template:
        spec:
          nodeSelector:
            nodepool: new
    
  • Or skip tainting the new group so the scheduler naturally prefers it once old nodes are cordoned and drained.
If you run Cluster Autoscaler, it should automatically scale the new node group to fit the evicted pods.
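If you’d rather not edit manifests by hand, a one-off patch like the sketch below (namespace and deployment name are placeholders) adds the nodeSelector from the example above and triggers a rolling restart onto the new nodes:
kubectl -n <NAMESPACE> patch deployment <DEPLOYMENT_NAME> \
  --type merge \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"nodepool":"new"}}}}}'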

2.5 Validate migration

Confirm the migration before deleting anything:
  • All pods are Running and scheduled on new nodes:
    kubectl get pods -A -o wide
    
  • All nodes in the old node group:
    • Are SchedulingDisabled, and
    • Either have no non-DaemonSet pods left or are fully drained.
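A quick way to see what is still running on a specific old node (DaemonSet pods are expected to remain until the node is deleted):
kubectl get pods -A -o wide --field-selector spec.nodeName=<OLD_NODE_NAME>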

Step 3 · Remove the old node group

Once you are confident workloads are stable on the new instance types, remove the old group. In the AWS Console:
  1. Go to Amazon EKS → Clusters → select your cluster.
  2. Open the Compute tab → Node groups.
  3. Select the old node group.
  4. Click Delete.
  5. Type the node group name to confirm, then Delete.
If you are using Cluster Autoscaler:
  • Make sure the old node group is no longer in its configuration, or
  • Temporarily set its Desired/Min capacity to 0 before you delete it.
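If you choose to scale it down first, a call along these lines should work (maxSize must remain at least 1, so only Min and Desired go to 0):
aws eks update-nodegroup-config \
  --cluster-name <CLUSTER_NAME> \
  --nodegroup-name <OLD_NODE_GROUP_NAME> \
  --scaling-config minSize=0,maxSize=1,desiredSize=0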

Rollback · Move workloads back to the old node group

If something goes wrong after migration (for example, performance or compatibility issues), you can quickly roll workloads back to the old node group. The high-level flow:
  1. Scale the old group back up.
  2. Pause scheduling onto the new group.
  3. Re-enable scheduling on the old group.
  4. Drain the new nodes so pods move back.
You keep the new group around until you decide next steps.

Scale the old node group

In EC2 or EKS:
  • Set the old node group’s or Auto Scaling Group’s Desired/Min/Max back to healthy values.
Example (Auto Scaling Group via console or CLI):
  • Desired: previous steady-state value.
  • Min: match Desired.
  • Max: your upper bound.
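If the old group is an EKS managed node group, a scaling update along these lines restores capacity (substitute your previous steady-state numbers):
aws eks update-nodegroup-config \
  --cluster-name <CLUSTER_NAME> \
  --nodegroup-name <OLD_NODE_GROUP_NAME> \
  --scaling-config minSize=2,maxSize=10,desiredSize=2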

Pause scheduling on the new group

Add a taint to the new node group:
aws eks update-nodegroup-config \
  --cluster-name <CLUSTER_NAME> \
  --nodegroup-name <NEW_NODE_GROUP_NAME> \
  --taints addOrUpdateTaints='[{key=no-scheduling,effect=NO_SCHEDULE}]'
Cordon all new nodes:
kubectl get nodes -l eks.amazonaws.com/nodegroup=<NEW_NODE_GROUP_NAME> -o name \
  | xargs -n1 kubectl cordon

Re-enable scheduling on the old group

Remove the taint from the old node group:
aws eks update-nodegroup-config \
  --cluster-name <CLUSTER_NAME> \
  --nodegroup-name <OLD_NODE_GROUP_NAME> \
  --taints removeTaints='[{key=no-scheduling,effect=NO_SCHEDULE}]'
Uncordon old nodes so they accept new pods again:
kubectl get nodes -l eks.amazonaws.com/nodegroup=<OLD_NODE_GROUP_NAME> -o name \
  | xargs -n1 kubectl uncordon

Drain the new nodes

Now drain the new nodes so pods move back to the old group:
kubectl drain <NEW_NODE_NAME> \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=10m
Watch pods move back:
kubectl get pods -A -o wide --watch
Once you stabilize and understand the root cause, you can try another migration with updated instance types or configuration.

Extra resources