If you’re coming from the MilkStraw AI recommender, you’ll plug the recommended instance types into the steps below.
Before you start
Make sure:
- You can log in to the AWS Console and have IAM permissions for EKS, EC2, and IAM.
- `kubectl` is installed and configured for the correct cluster.
- You know whether your workloads require:
  - GPU nodes, and/or
  - Spot capacity (only use Spot if you understand interruption behavior and have workloads that tolerate it).
- If you use public subnets for nodes:
  - `MapPublicIpOnLaunch=true` is set.
  - Subnets have the standard EKS tags:
    - `kubernetes.io/cluster/<CLUSTER_NAME> = shared`
    - `kubernetes.io/role/elb = 1` (for public subnets).
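A quick way to sanity-check these prerequisites from your terminal (assumes the AWS CLI is installed and configured):

```bash
# Confirm which AWS identity you're using and that kubectl points at the
# intended cluster
aws sts get-caller-identity
kubectl config current-context
kubectl get nodes
```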
Step 1 · Create a new node group with the target instance type
You’ll first add a new node group using the new instance type(s). This lets you migrate workloads safely before deleting the old group.
1.1 Open the EKS cluster and add a node group
In the AWS Console:
- Go to Amazon EKS → Clusters.
- Select your cluster.
- Open the Compute tab.
- Choose Add node group.
1.2 Basics
Configure:
- Name
  Use something descriptive and versioned. A good pattern is `env-project-function-version`.
- Node IAM role
  Choose the role with standard EKS worker node permissions, or create one if needed.
- (Optional) Launch template
  Use a launch template if you need:
  - Custom AMI or user data
  - Specific disk types / encryption settings
  - Extra EC2-level configuration
- (Optional) Labels / Taints / Tags
  - Example label: `nodepool=new`
  - Example taint (useful for GPU-only workloads or to keep pods off until you’re ready): see the sketch after this list.
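A minimal sketch of what the label and taint could look like. The taint key/value below are placeholders; `nodepool=new` matches the label used later in this guide:

```bash
# Illustrative values only. In the console these go in the node group's
# Labels and Taints fields:
#
#   Label:  nodepool = new
#   Taint:  key=dedicated, value=new-nodes, effect=NoSchedule
#
# After the nodes join, confirm they carry the label and taint:
kubectl get nodes -l nodepool=new \
  -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```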
1.3 Compute configuration
Under Compute configuration:
- AMI type
  - General-purpose: Amazon Linux 2 or AL2023. AL2023 is newer and has different defaults; validate compatibility for your workloads.
  - GPU: Amazon Linux 2 GPU (or a Bottlerocket GPU variant if you already use Bottlerocket).
- Capacity type
  - On-Demand for predictable capacity.
  - Spot only if your workloads tolerate interruption and you have a proper Spot strategy.
- Instance types
  - Example general-purpose instance: see the sketch after this list.
  - If you’re using the MilkStraw AI recommender, set this to the recommended instance type(s) from your report.
- Disk size
  - For most applications, a 50–100 GiB root volume is sufficient.
  - Increase it if pods make heavy use of `emptyDir` volumes or local caching.
- Scaling
  - Set Desired, Min, and Max (example values in the sketch after this list).
  - If you run Cluster Autoscaler, ensure these values work with your expected scale range.
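For reference, here is a sketch of the same compute configuration expressed with the AWS CLI, including an example instance type and scaling values. Every name and number below is a placeholder; substitute your own values (or the instance types from your MilkStraw AI report):

```bash
aws eks create-nodegroup \
  --cluster-name <CLUSTER_NAME> \
  --nodegroup-name prod-api-general-v2 \
  --node-role arn:aws:iam::<ACCOUNT_ID>:role/<NODE_ROLE_NAME> \
  --subnets subnet-aaa111 subnet-bbb222 \
  --ami-type AL2023_x86_64_STANDARD \
  --capacity-type ON_DEMAND \
  --instance-types m6i.xlarge \
  --disk-size 80 \
  --scaling-config minSize=2,maxSize=6,desiredSize=3 \
  --labels nodepool=new
```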
1.4 Networking
Under Networking:
- Subnets
  - Select the subnets where you want worker nodes to live.
  - For public subnets, confirm:
    - `MapPublicIpOnLaunch=true`
    - The proper EKS tags, as mentioned above.
- (Optional) SSH key pair
  - Add an SSH key only if you need direct SSH access to nodes.
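If you want to verify a public subnet before using it (the subnet ID below is a placeholder):

```bash
# Check auto-assign public IP and the EKS tags on a candidate subnet
aws ec2 describe-subnets --subnet-ids subnet-aaa111 \
  --query 'Subnets[0].{AutoAssignPublicIp:MapPublicIpOnLaunch,Tags:Tags}'
```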
1.5 Create and verify node readiness
Create the node group and wait for nodes to join the cluster. Watch nodes until they are `Ready`:
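For example, assuming the new managed node group is named prod-api-general-v2 (placeholder):

```bash
# Watch until every new node reports Ready
kubectl get nodes -l eks.amazonaws.com/nodegroup=prod-api-general-v2 --watch
```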
GPU clusters
If these are GPU nodes, install the NVIDIA device plugin after the nodes are `Ready`.
In the NVIDIA k8s-device-plugin GitHub repo, locate the latest DaemonSet manifest and apply it:
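For example (the release tag below is illustrative; check the repo for the current one):

```bash
# Apply the device plugin DaemonSet; replace v0.14.1 with the latest release tag
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/deployments/static/nvidia-device-plugin.yml

# Verify the plugin DaemonSet is running (its exact name may differ by version)
kubectl get daemonsets -n kube-system | grep -i nvidia
```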
Step 2 · Move workloads to the new node group
You now have both old and new node groups attached to the cluster. The goal is to:
- Stop new pods from landing on the old nodes.
- Let the scheduler and Cluster Autoscaler move workloads to the new group.
- Drain and empty the old nodes.
2.1 Taint the old node group (EKS-managed taint)
First, add a taint to the old node group to prevent new pods from scheduling on it while existing pods continue running:
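One way to do this for an EKS managed node group (cluster, node group, and taint key/value are placeholders):

```bash
# EKS propagates this taint to every node in the old node group
aws eks update-nodegroup-config \
  --cluster-name <CLUSTER_NAME> \
  --nodegroup-name prod-api-general-v1 \
  --taints 'addOrUpdateTaints=[{key=migration,value=retiring,effect=NO_SCHEDULE}]'
```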
2.2 Cordon the old nodes
Cordoning marks nodes as unschedulable at the Kubernetes level:
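For example (node and node group names are placeholders):

```bash
# Cordon a single node
kubectl cordon ip-10-0-1-23.ec2.internal

# Or cordon every node in the old managed node group by label
kubectl cordon -l eks.amazonaws.com/nodegroup=prod-api-general-v1
```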
2.3 Drain the old nodes
Drain nodes to evict pods safely and move them to the new node group:
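For example, looping over every node in the old group (node group name is a placeholder); the flags are explained below:

```bash
for node in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=prod-api-general-v1 \
    -o jsonpath='{.items[*].metadata.name}'); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
done
```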
- `--ignore-daemonsets`
  DaemonSet pods are not evicted by `drain`; they are terminated when the node is deleted.
- `--delete-emptydir-data`
  Only use this if pods don’t rely on data inside `emptyDir` volumes, as that data will be lost.
- PodDisruptionBudgets (PDBs)
  If PDBs are strict, `kubectl drain` may block until disruptions are allowed. You might temporarily increase `maxUnavailable` for smoother migrations.
- Unmanaged pods
  If some pods are not controlled by a Deployment, ReplicaSet, or DaemonSet, you may need `--force`. Use this carefully and only if you understand the impact.
2.4 Watch workloads reschedule
Monitor pods as they are rescheduled onto the new node group:
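For example:

```bash
# Watch pods across all namespaces; the NODE column shows where each pod lands
kubectl get pods -A -o wide --watch
```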
If you added a label or taint to the new node group (for example, `nodepool=new`), make sure workloads can land there:
- Add a matching `nodeSelector` or node affinity to workloads that must move (see the sketch after this list).
- Or skip tainting the new group so the scheduler naturally prefers it once old nodes are cordoned and drained.
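A minimal sketch of the first option, assuming a Deployment named my-app in namespace my-namespace (both placeholders) and the `nodepool=new` label on the new group:

```bash
# Add a nodeSelector so the Deployment only schedules onto the new node group
kubectl patch deployment my-app -n my-namespace \
  --type merge \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"nodepool":"new"}}}}}'
```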
2.5 Validate migration
Confirm the migration before deleting anything:
- All pods are `Running` and scheduled on new nodes (see the checks below).
- All nodes in the old node group:
  - Are `SchedulingDisabled`, and
  - Either have no non-DaemonSet pods left or are fully drained.
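Some checks you can run (old node group and node names are placeholders):

```bash
# All pods should be Running; the NODE column shows where they are scheduled
kubectl get pods -A -o wide

# Old nodes should show Ready,SchedulingDisabled
kubectl get nodes -l eks.amazonaws.com/nodegroup=prod-api-general-v1

# Any pods still on an old node should be DaemonSet pods only
kubectl get pods -A -o wide --field-selector spec.nodeName=<OLD_NODE_NAME>
```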
Step 3 · Remove the old node group
Once you are confident workloads are stable on the new instance types, remove the old group. In the AWS Console:
- Go to Amazon EKS → Clusters → select your cluster.
- Open the Compute tab → Node groups.
- Select the old node group.
- Click Delete.
- Type the node group name to confirm, then Delete.
If you run Cluster Autoscaler, before deleting the old group either:
- Make sure the old node group is no longer in its configuration, or
- Temporarily set its Desired/Min capacity to 0 before you delete it.
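For example, to scale the old managed node group down before deleting it (names are placeholders; EKS requires Max to stay at 1 or higher):

```bash
aws eks update-nodegroup-config \
  --cluster-name <CLUSTER_NAME> \
  --nodegroup-name prod-api-general-v1 \
  --scaling-config minSize=0,maxSize=1,desiredSize=0
```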
Rollback · Move workloads back to the old node group
If something goes wrong after migration (for example, performance or compatibility issues), you can quickly roll workloads back to the old node group. The high-level flow:
- Scale the old group back up.
- Pause scheduling onto the new group.
- Re-enable scheduling on the old group.
- Drain the new nodes so pods move back.
Scale the old node group
In EC2 or EKS:
- Set the old node group’s or Auto Scaling Group’s Desired/Min/Max back to healthy values.
- Desired: previous steady-state value.
- Min: match Desired.
- Max: your upper bound.
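For an EKS managed node group, one way to do this from the CLI (names and numbers are placeholders):

```bash
aws eks update-nodegroup-config \
  --cluster-name <CLUSTER_NAME> \
  --nodegroup-name prod-api-general-v1 \
  --scaling-config minSize=3,maxSize=6,desiredSize=3
```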
Pause scheduling on the new group
Add a taint to the new node group:
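For example (names and the taint key/value are placeholders):

```bash
aws eks update-nodegroup-config \
  --cluster-name <CLUSTER_NAME> \
  --nodegroup-name prod-api-general-v2 \
  --taints 'addOrUpdateTaints=[{key=rollback,value=in-progress,effect=NO_SCHEDULE}]'
```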
Re-enable scheduling on the old group
Remove the taint from the old node group:
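For example, removing the taint added in Step 2.1 (names and key/value are placeholders):

```bash
aws eks update-nodegroup-config \
  --cluster-name <CLUSTER_NAME> \
  --nodegroup-name prod-api-general-v1 \
  --taints 'removeTaints=[{key=migration,value=retiring,effect=NO_SCHEDULE}]'

# If any of the original old nodes are still present and cordoned, make them
# schedulable again
kubectl uncordon -l eks.amazonaws.com/nodegroup=prod-api-general-v1
```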
Drain the new nodes
Now drain the new nodes so pods move back to the old group:
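For example (node group name is a placeholder):

```bash
for node in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=prod-api-general-v2 \
    -o jsonpath='{.items[*].metadata.name}'); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
done
```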
Extra resources
- Install kubectl — Kubernetes documentation: Install and Set Up kubectl
- NVIDIA device plugin for Kubernetes — NVIDIA/k8s-device-plugin on GitHub
- VPC and subnet requirements for EKS — Amazon EKS VPC and subnet requirements