
EKS 1.34 to 1.35 Upgrade — A Best-Practices-Driven Verification


Introduction

Upgrading EKS is technically simple — run eksctl upgrade cluster and you're done. But in production, "it upgraded" isn't enough. You need to prove it upgraded safely.

AWS publishes EKS upgrade best practices covering Cluster Insights checks, deprecated API scanning, PodDisruptionBudget-backed workload protection, and more. This post documents upgrading an EKS Auto Mode cluster from 1.34 to 1.35 while methodically following each of these practices.

Starting Environment

Item              Value
Cluster name      eks-sandbox
Region            ap-northeast-1
Initial version   1.34
Mode              EKS Auto Mode
Node OS           Bottlerocket
Add-ons           aws-guardduty-agent (v1.12.1-eksbuild.2)

eksctl create cluster \
  --name eks-sandbox \
  --region ap-northeast-1 \
  --version 1.34 \
  --enable-auto-mode

Key Changes in EKS 1.35

Before upgrading, understand what's changing in the target version.

  • cgroup v1 support removed — Kubelet now refuses to start on cgroup v1 nodes by default. Bottlerocket ships with failCgroupV1: false, so no impact here
  • containerd 1.x support ends after 1.35 — Must migrate to containerd 2.0+ before upgrading to 1.36
  • In-Place Pod Resource Updates graduated to Stable — CPU/memory changes without pod restarts
  • IPVS mode deprecated — kube-proxy IPVS mode will be removed in 1.36
  • Ingress NGINX retirement notice — Upstream retirement planned for March 2026. Start planning Gateway API migration
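Of these, the IPVS deprecation is the easiest to pre-check: inspect the kube-proxy configuration for its proxy mode. A minimal sketch, run here against an embedded sample value rather than a live cluster (the ConfigMap name varies by distribution; on EKS it is typically kube-proxy-config in kube-system):

```shell
# Sketch: detect IPVS mode ahead of the 1.36 removal.
# Sample value stands in for:
#   kubectl -n kube-system get cm kube-proxy-config -o yaml
config='mode: "iptables"'

if printf '%s\n' "$config" | grep -q 'mode: "ipvs"'; then
  echo "kube-proxy uses IPVS: plan a migration before 1.36"
else
  echo "kube-proxy does not use IPVS"
fi
```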

Pre-Upgrade Verification

Backup

AWS best practices recommend taking a cluster backup with Velero before upgrading. Velero backs up Kubernetes resources and persistent volumes, providing a rollback path if the upgrade goes wrong.

I skipped this step since this was a fresh test cluster. In production — especially with custom resources or stateful workloads — this step should not be omitted. Note that Velero doesn't back up AWS resources like IAM roles; those must be managed separately.
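For reference, a pre-upgrade backup can be run imperatively (velero backup create) or declared as a Backup resource. A sketch of the declarative form; the name, namespace scope, and TTL below are placeholders, not values from this environment:

```yaml
# Illustrative pre-upgrade backup (name and TTL are placeholders)
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: pre-upgrade-eks-134
  namespace: velero
spec:
  includedNamespaces:
  - "*"            # back up everything; narrow this on large clusters
  ttl: 720h0m0s    # keep the rollback point for 30 days
```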

Cluster Insights

EKS Cluster Insights proactively detects issues that could block an upgrade.

aws eks list-insights --region ap-northeast-1 --cluster-name eks-sandbox

Result: empty — no blockers. This is expected for a fresh cluster, but production clusters can surface warnings around kubelet version skew or add-on incompatibility here. Any issues found must be resolved before proceeding.
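When insights do come back, a small gate script keeps the go/no-go decision mechanical. A sketch over sample data; in real use, feed it something like aws eks list-insights --query 'insights[].insightStatus.status' --output text (the JMESPath is my reading of the API shape):

```shell
# Sketch: count non-PASSING insights (sample stands in for the aws CLI output)
statuses="PASSING PASSING PASSING"

blockers=0
for s in $statuses; do
  [ "$s" = "PASSING" ] || blockers=$((blockers + 1))
done

echo "non-passing insights: $blockers"
if [ "$blockers" -eq 0 ]; then
  echo "safe to proceed"
else
  echo "resolve insights before upgrading"
fi
```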

Deprecated API Scanning

Two tools for thorough coverage: kubent and pluto.

kubent
# Target K8s version is 1.34.4-eks-f69f56f
# Retrieved 34 resources from collector — no deprecations found
 
pluto detect-all-in-cluster
# There were no resources found with known deprecated apiVersions.

Zero findings are expected on a fresh cluster, but production clusters often have deprecated APIs lurking in custom resources or Helm releases. kubent scans Helm v3 releases too, catching template-level issues beyond just applied resources.
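Both tools can also target the destination version and gate a CI pipeline. The flags below are my understanding of each tool's documentation (kubent's --target-version and --exit-error, pluto's --target-versions); the commands are echoed here as a dry run rather than executed:

```shell
# Dry-run: print the CI gate commands (drop the echo to run them for real)
echo "kubent --target-version 1.35 --exit-error"
echo "pluto detect-all-in-cluster --target-versions k8s=v1.35.0"
```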

Infrastructure Prerequisites

Three aspects of the underlying infrastructure to verify before upgrading.

Subnet available IPs — Node replacement temporarily increases node count, so IP headroom is needed. All 6 subnets had 8,000+ available IPs — plenty of room.
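The headroom numbers came from describe-subnets. A dry-run sketch; the tag filter assumes the subnets carry the standard kubernetes.io/cluster tag, so adjust it to however your subnets are actually tagged:

```shell
# Dry-run: list available IPs per cluster subnet (remove echo to execute)
echo "aws ec2 describe-subnets --region ap-northeast-1 \
  --filters 'Name=tag-key,Values=kubernetes.io/cluster/eks-sandbox' \
  --query 'Subnets[].{id:SubnetId,availableIPs:AvailableIpAddressCount}' \
  --output table"
```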

IAM role — Confirmed the cluster role has sts:AssumeRole and sts:TagSession for eks.amazonaws.com. A missing or misconfigured role would cause the control plane upgrade itself to fail.

Control plane logging — All log types (api, audit, authenticator, controllerManager, scheduler) were disabled. Skipped for this test environment, but production clusters should enable at least api and audit — they're essential for troubleshooting issues during upgrades.
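Enabling the recommended log types is a one-liner with eksctl. Echoed here as a dry run, since it modifies the cluster:

```shell
# Dry-run: enable api/audit control plane logs (remove echo to execute)
echo "eksctl utils update-cluster-logging \
  --cluster eks-sandbox --region ap-northeast-1 \
  --enable-types api,audit --approve"
```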

Add-on Compatibility

aws eks list-addons --cluster-name eks-sandbox
# ["aws-guardduty-agent"]
 
aws eks describe-addon-versions --kubernetes-version 1.35 \
  --addon-name aws-guardduty-agent \
  --query 'addons[0].addonVersions[0].addonVersion'
# "v1.12.1-eksbuild.2"

Current version is already 1.35-compatible. No update needed. For clusters running core add-ons like CoreDNS or kube-proxy, this is where you'd identify compatible versions and plan the update sequence.
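Had an update been needed, the step after identifying a compatible version would be update-addon. A dry-run sketch using CoreDNS as a hypothetical example (the version string is illustrative, not from this cluster):

```shell
# Dry-run: update an add-on to a 1.35-compatible version (remove echo to execute)
echo "aws eks update-addon --cluster-name eks-sandbox \
  --addon-name coredns --addon-version v1.12.1-eksbuild.3 \
  --resolve-conflicts PRESERVE"
```

PRESERVE keeps any field customizations you have made to the add-on's configuration during the update.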

Workload Preparation — PDB and topologySpreadConstraints

To verify availability during the upgrade, I deployed a sample app with deliberate resilience design:

  • 3 replicas + topologySpreadConstraints — Spread pods evenly across nodes and AZs. AWS best practices recommend configuring both kubernetes.io/hostname (node spread) and topology.kubernetes.io/zone (AZ spread)
  • PodDisruptionBudget (minAvailable: 66%) — Guarantee at least 2 of 3 pods stay Running
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: upgrade-test
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: nginx
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: nginx
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
  namespace: upgrade-test
spec:
  minAvailable: "66%"
  selector:
    matchLabels:
      app: nginx

After deployment, all 3 pods landed on 3 different nodes. PDB showed ALLOWED DISRUPTIONS: 1 (only 1 of 3 pods may be disrupted simultaneously).
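The ALLOWED DISRUPTIONS: 1 figure follows from how Kubernetes handles percentage minAvailable values: it rounds the required pod count up. In shell arithmetic:

```shell
# minAvailable 66% of 3 replicas: Kubernetes rounds up, so 2 pods must stay healthy
replicas=3
min_pct=66
min_available=$(( (replicas * min_pct + 99) / 100 ))   # ceiling of replicas * 66%
echo "must stay available: $min_available"
echo "allowed disruptions: $(( replicas - min_available ))"
```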

Upgrade Execution and Monitoring

I recorded pod and node status every 30 seconds while running the control plane upgrade.

eksctl upgrade cluster --name eks-sandbox --version 1.35 --approve

Control plane upgrade duration: ~8 minutes 20 seconds (22:09:20 → 22:17:41)
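The 30-second recordings came from a small watch loop along these lines. A sketch: the namespace and output format reflect this test setup, and it degrades gracefully when kubectl is not on the PATH:

```shell
# One timestamped snapshot; wrap in `while sleep 30; do ...; done` to record continuously
snapshot() {
  date '+---%H:%M:%S---'
  kubectl get pods -n upgrade-test -o wide --no-headers 2>/dev/null \
    || echo "(kubectl unavailable)"
  kubectl get nodes --no-headers 2>/dev/null || true
}
snapshot
```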

Timeline

Time                Event                                      Pod Status                          Node Composition
22:09:17            CP upgrade started                         3/3 Running                         v1.34 x 4 (1 empty node)
22:10:51            Empty node auto-removed                    3/3 Running                         v1.34 x 3
22:10:51–22:17:41   CP upgrading                               3/3 Running (no change)             v1.34 x 3
22:17:41            CP upgrade completed                       3/3 Running                         v1.34 x 3
22:18:57            1.35 nodes launched, pod migration began   3/3 Running (1 pod on new node)     v1.34 x 3 + v1.35 x 2
22:19:28            Pod migration in progress                  3/3 Running (2 pods on new nodes)   v1.34 x 2 + v1.35 x 3
22:21:33            Old nodes drained, all pods on new nodes   3/3 Running                         v1.35 x 4

Monitoring Log Details

During the CP upgrade (~8.5 minutes), pods were completely unaffected. Node replacement only began after the upgrade completed.

---22:09:17--- (just before CP upgrade)
nginx-67686f8c5-j6865   Running   i-0a797512240b8a78d  (v1.34)
nginx-67686f8c5-qzpqd   Running   i-0f1976563dcc3e932  (v1.34)
nginx-67686f8c5-vmqc6   Running   i-0c1a5101ad5cdfaac  (v1.34)
 
---22:18:57--- (1.35 nodes up, pod migration starting)
nginx-67686f8c5-bzv7w   Running   i-0e128843e7cdf31fa  (v1.35) ← new node
nginx-67686f8c5-qzpqd   Running   i-0f1976563dcc3e932  (v1.34)
nginx-67686f8c5-vmqc6   Running   i-0c1a5101ad5cdfaac  (v1.34)
 
---22:21:33--- (all pods migrated to 1.35 nodes)
nginx-67686f8c5-2ql5p   Running   i-01dc0d84f5885a052  (v1.35)
nginx-67686f8c5-5vd62   Running   i-092c024550f1a62d5  (v1.35)
nginx-67686f8c5-bzv7w   Running   i-0e128843e7cdf31fa  (v1.35)

Key finding: from upgrade start (22:09:17) through full pod migration to 1.35 nodes (22:21:33) — roughly 12 minutes — no pod ever entered Pending or CrashLoopBackOff. The PDB's minAvailable: 66% was respected throughout, keeping at least 2 pods Running even during node drains. The cluster settled at 4 v1.35 nodes, with Auto Mode eventually consolidating to the optimal count.

Post-Upgrade Validation

Four checks after the upgrade completed.

Cluster version: kubectl version confirmed Server Version v1.35.2-eks-f69f56f, and all nodes reported v1.35.0-eks-ac2d5a0. The Client Version was still v1.34.1; AWS best practices recommend updating kubectl to a matching version after the upgrade. A one-minor-version gap is within the Kubernetes version skew policy, but the newer client is needed to use new API features.

All pod status: kubectl get pods -A showed every pod in every namespace Running, with no increase in restart counts.

Deprecated API metrics: kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis found only endpoints v1. This is a known warning: Kubernetes 1.33 deprecated the core Endpoints API in favor of discovery.k8s.io/v1 EndpointSlice, so it calls for eventual migration but no immediate action.

Cluster Insights — Re-checked post-upgrade. All five checks returned PASSING:

  • Kubelet version skew — node kubelet versions match the control plane
  • Amazon Linux 2 compatibility — no AL2 nodes detected
  • Cluster health issues — no health problems
  • EKS add-on version compatibility — all add-ons compatible
  • kube-proxy version skew — kube-proxy versions match the control plane

Takeaways

  • Pre-upgrade checks buy confidence, not just compliance — Everything came back clean this time, but in production that's rarely the case. kubent and pluto catching issues before upgrade day makes the go/no-go decision straightforward
  • PDB is a guarantee, not a suggestion — With PDB in place, at least 2 of 3 pods stayed Running throughout node replacement. Without it, all pods on a draining node could be evicted simultaneously
  • Auto Mode node replacement is seamless — After the control plane upgrade, new 1.35 nodes launched automatically, pods migrated, then old nodes terminated. No manual node group update commands needed
  • Preparation matters more than execution — The eksctl command itself is one line, but the verification checklist spans over a dozen items. Turn best practices into a repeatable checklist you run every time

Shinya Tahara

Solutions Architect @ AWS

I'm a Solutions Architect at AWS, providing technical guidance primarily to financial industry customers. I share learnings about cloud architecture and AI/ML on this blog.
