AWS EKS cluster updates: Beware hidden component incompatibility
If you created your AWS EKS cluster a long time ago, you will no doubt be familiar with the “Update Now” button on your cluster’s configuration page in the AWS Console - and you will have clicked it a few times, too. But there’s more to an update than clicking this button, and not being prepared can result in your cluster breaking. Find out about self-managed add-ons in this new installment of “learn from my mistakes so you don’t make them too!”
Updating AWS EKS, add-ons, and “self-managed add-ons”
AWS EKS - and Kubernetes itself - is made up of many components, some of which run as workloads inside the cluster. Some are “core” - part of the Kubernetes control plane - and are updated when you press that “update now” button. Others are not part of the control plane and are instead deployed as workloads inside the Kubernetes cluster, but are still essential for Kubernetes to operate. These are services such as CoreDNS (which provides DNS name and service resolution inside the cluster) and kube-proxy (which maintains the network rules on each node that route Service traffic to the right pods).
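If you’re not sure which of these in-cluster components your cluster is running, you can simply list the workloads in the kube-system namespace. Here’s a minimal sketch using the official Python kubernetes client (assuming a valid kubeconfig); CoreDNS typically shows up as a Deployment, while kube-proxy and the VPC CNI (aws-node) run as DaemonSets:

```python
# List the add-on-style workloads running in kube-system, with their images.
# Assumes the official "kubernetes" Python client and a working kubeconfig.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster
apps = client.AppsV1Api()

print("DaemonSets in kube-system:")
for ds in apps.list_namespaced_daemon_set("kube-system").items:
    images = [c.image for c in ds.spec.template.spec.containers]
    print(f"  {ds.metadata.name}: {images}")   # e.g. kube-proxy, aws-node

print("Deployments in kube-system:")
for dep in apps.list_namespaced_deployment("kube-system").items:
    images = [c.image for c in dep.spec.template.spec.containers]
    print(f"  {dep.metadata.name}: {images}")  # e.g. coredns
```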
Around the time of Kubernetes 1.19, AWS introduced a managed “add-ons” service, which is available through the AWS Management Console. Here, you can clearly see the managed add-ons, and they each have another “update now” button that you can click to bring them up-to-date. Over time, more of these add-ons became “managed” and would appear in the web console.
But if your cluster is old enough to pre-date the addition of managed add-ons, you are left with “self-managed add-ons”. It is your responsibility to keep these up-to-date, without the help of the AWS console. Are you aware of your “self-managed add-ons” and are you managing them? It’s very easy to miss this fact!
For example, if you created your 1.18 or later cluster using the AWS Management Console after May 3, 2021, the vpc-cni Amazon EKS add-on is already on your cluster. But if it’s from earlier than that, you won’t have that add-on. vpc-cni will still be running in your cluster, however, as it’s a required component - but you can’t see it in the console, and you are responsible for keeping it up-to-date.
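If you want to check a specific cluster, the EKS API will tell you which add-ons are registered as managed. A quick sketch with boto3 - the cluster name is a placeholder, and anything running in the cluster but absent from the returned list is self-managed:

```python
# Check whether vpc-cni (or any other add-on) is registered as an
# AWS-managed add-on on this cluster. Assumes boto3 with valid credentials;
# "my-cluster" is a placeholder.
import boto3

eks = boto3.client("eks")
managed = eks.list_addons(clusterName="my-cluster")["addons"]

print("Managed add-ons:", managed)
if "vpc-cni" not in managed:
    print("vpc-cni is not a managed add-on here - if it's running, you own its upgrades.")
```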
This spells trouble
The big “update now” button will upgrade your Kubernetes control plane version, but not the essential services deployed inside the cluster. You can then go to the “Add-ons” tab to see your AWS-managed add-ons, and click the “update now” button for each of those. But nothing there will tell you that your “self-managed add-ons” are now out of date.
If you weren’t aware of your “self-managed add-ons”, it’s possible they have not had a single version update since the cluster was created. Every time you hit that “Update now” button, your Kubernetes version moves up, but your self-managed add-ons do not.
With every Kubernetes upgrade, API lifecycles will move on - active APIs may become deprecated, deprecated APIs will be removed. If your self-managed add-ons are using deprecated APIs, an upgrade may cause them to suddenly stop working.
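One way to spot the drift before it bites is to compare what a self-managed component is actually running against the add-on versions AWS publishes for your Kubernetes version. A rough sketch, assuming boto3, the Python kubernetes client and a kubeconfig; the Kubernetes version is a placeholder and the VPC CNI is used as the example:

```python
# Compare the image of the self-managed VPC CNI (the aws-node DaemonSet)
# with the vpc-cni add-on versions AWS publishes for a given Kubernetes
# version. "1.27" is a placeholder - use your cluster's version.
import boto3
from kubernetes import client, config

config.load_kube_config()
ds = client.AppsV1Api().read_namespaced_daemon_set("aws-node", "kube-system")
print("Running vpc-cni image:", ds.spec.template.spec.containers[0].image)

eks = boto3.client("eks")
resp = eks.describe_addon_versions(addonName="vpc-cni", kubernetesVersion="1.27")
published = [v["addonVersion"] for v in resp["addons"][0]["addonVersions"]]
print("Published vpc-cni add-on versions for 1.27:", published)
```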
In one particular case I was looking at, the AWS EBS CSI driver was self-managed, and it hit exactly this problem. The failure was not immediately obvious. Most pods didn’t need EBS volumes, so were unaffected. Pods with EBS volumes that were running before the upgrade continued to work. But newly-created pods that wanted an EBS volume hit a failure. The initial smoke-testing after the upgrade did not show a problem, so we were caught out.
Switching to managed add-ons
Hidden self-managed add-ons are obviously a user-hostile pattern: they make it appear that you are keeping everything up-to-date while actually ignoring important components. The managed add-ons feature is not perfect - there are now potentially a lot of “update now” buttons to press when stepping up a Kubernetes version - but having these components visible in the console is definitely a big improvement. Unless you actively move your self-managed add-ons to be AWS-managed, though, they will remain invisible.
The lesson I learned was to check the list of available add-ons against the pods already running in the kube-system namespace, to see whether any components we were already running could be replaced by managed add-ons. For example, if aws-node pods (the VPC CNI) are running, but the vpc-cni add-on shows as “available” rather than “installed”, that indicates a self-managed add-on that can be migrated to a managed one - a sketch of this check follows below.
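As that sketch: the following compares what is running in kube-system against the add-ons available for the cluster’s Kubernetes version and flags anything that looks self-managed. The add-on-to-workload name mapping is illustrative rather than exhaustive, and the cluster name and version are placeholders:

```python
# Flag workloads in kube-system that could be replaced by a managed add-on.
# Assumes boto3, the Python kubernetes client and a kubeconfig.
import boto3
from kubernetes import client, config

CLUSTER = "my-cluster"   # placeholder
K8S_VERSION = "1.27"     # placeholder - your cluster's version

# Typical kube-system workload name for each add-on (illustrative mapping).
WORKLOAD_FOR_ADDON = {
    "vpc-cni": "aws-node",
    "coredns": "coredns",
    "kube-proxy": "kube-proxy",
    "aws-ebs-csi-driver": "ebs-csi-controller",
}

eks = boto3.client("eks")
installed = set(eks.list_addons(clusterName=CLUSTER)["addons"])
available = {a["addonName"]
             for a in eks.describe_addon_versions(kubernetesVersion=K8S_VERSION)["addons"]}

config.load_kube_config()
apps = client.AppsV1Api()
running = {w.metadata.name for w in apps.list_namespaced_daemon_set("kube-system").items}
running |= {w.metadata.name for w in apps.list_namespaced_deployment("kube-system").items}

for addon, workload in WORKLOAD_FOR_ADDON.items():
    if workload in running and addon in available and addon not in installed:
        print(f"{addon}: running as '{workload}' but self-managed - candidate to migrate")
```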
If you identify any of these, check the AWS documentation for its installation instructions - but in every case I’ve hit so far, switching to the managed add-on was as simple as pressing its “install” button: it updated the self-managed add-on and took over management of it.
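For completeness, the rough API equivalent of that “install” button is a create-addon call. A hedged sketch with boto3 (the cluster name is a placeholder); resolveConflicts="OVERWRITE" tells EKS to overwrite the existing self-managed configuration with the managed add-on’s values:

```python
# Take over a self-managed add-on as an AWS-managed one.
# Assumes boto3; "my-cluster" is a placeholder.
import boto3

eks = boto3.client("eks")
resp = eks.create_addon(
    clusterName="my-cluster",
    addonName="vpc-cni",
    resolveConflicts="OVERWRITE",  # overwrite self-managed settings with the add-on's values
)
print(resp["addon"]["status"])  # e.g. "CREATING"
```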
Once the add-ons are AWS-managed, they are visible in the EKS console, so it’s much easier to see what version they are running and whether an upgrade is advised.
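That visibility extends to the API as well - once an add-on is managed, its installed version can be queried directly (same assumptions and placeholders as above):

```python
# Query the installed version and status of a managed add-on.
import boto3

eks = boto3.client("eks")
addon = eks.describe_addon(clusterName="my-cluster", addonName="vpc-cni")["addon"]
print(addon["addonName"], addon["addonVersion"], addon["status"])
```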