Kubernetes Cluster Upgrades and Pod Eviction Explained

Intermediate 9 min read

What you'll learn

✓ How control plane and node upgrades are sequenced
✓ How node drain triggers pod eviction
✓ Why PodDisruptionBudgets matter for upgrades
✓ How to write a graceful shutdown handler
✓ Patterns to avoid traffic loss during rollouts

Prerequisites

• Familiar with YAML and containers

What and Why

A cluster upgrade moves the control plane and node fleet to a new Kubernetes version. It is one of the highest-risk routine operations on any platform. The mechanics are simple in principle: upgrade the control plane first, then replace or upgrade nodes one at a time (or in small batches), draining each node before touching it. The subtlety is in the eviction process and how your workloads respond to it.

Pod eviction is how Kubernetes asks a workload to relocate. Done well, traffic shifts smoothly to other replicas. Done badly, in-flight requests fail and customers see errors. The difference is almost always configuration, not luck.

Mental Model

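Picture a node drain as a polite request: “please leave this node, but at your own pace.” kubectl drain cordons the node and asks the eviction API to remove each Pod. For every evicted Pod, the kubelet starts a countdown of terminationGracePeriodSeconds, runs the preStop hook if one is defined, and then sends SIGTERM to the Pod’s containers. The preStop hook and the SIGTERM handling share that single countdown. During that window, the Pod should finish in-flight work, close connections, and exit. If it is still running when the countdown ends, the kubelet sends SIGKILL.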
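The “finish in-flight work, then exit” behaviour can be sketched in application code. The following is a minimal illustration in Python, not a definitive implementation: the class GracefulShutdown and its methods are hypothetical names, and a real service would wire begin_request and end_request into its own request handling.

```python
import signal
import time
import threading


class GracefulShutdown:
    """Hypothetical helper: tracks in-flight requests and a stop flag."""

    def __init__(self):
        self.stopping = False
        self._in_flight = 0
        self._lock = threading.Lock()

    def install(self):
        # Call from the main thread. SIGTERM reaches the container's main
        # process after any preStop hook has finished.
        signal.signal(signal.SIGTERM, self.handle_signal)

    def handle_signal(self, signum, frame):
        # Only flip a flag here; the actual cleanup happens elsewhere.
        self.stopping = True

    def begin_request(self):
        # Refuse new work once shutdown has started.
        with self._lock:
            if self.stopping:
                return False
            self._in_flight += 1
            return True

    def end_request(self):
        with self._lock:
            self._in_flight -= 1

    def wait_for_drain(self, timeout_seconds, poll_interval=0.05):
        # True if every in-flight request finished before the deadline.
        deadline = time.monotonic() + timeout_seconds
        while time.monotonic() < deadline:
            with self._lock:
                if self._in_flight == 0:
                    return True
            time.sleep(poll_interval)
        with self._lock:
            return self._in_flight == 0
```

In this sketch the process would call install() at startup and, once stopping becomes true, call wait_for_drain with a timeout that fits inside the remaining grace period before exiting.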

Above that, a PodDisruptionBudget (PDB) governs how many Pods of a workload may be voluntarily disrupted at once. The eviction API respects PDBs: when an eviction would violate the budget, the API rejects it with HTTP 429 and kubectl drain retries until the budget allows it. This is the safety net that prevents an upgrade from taking down all replicas of your service at the same time.
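The budget arithmetic is simple enough to work through by hand. The sketch below models it in Python for a minAvailable-style budget; the function names are hypothetical, and the model ignores details such as percentage values and unhealthy Pods.

```python
def allowed_disruptions(current_healthy: int, min_available: int) -> int:
    """How many more Pods may be evicted right now without breaking the budget."""
    return max(0, current_healthy - min_available)


def try_evict(current_healthy: int, min_available: int) -> tuple:
    """Return (evicted, healthy_after) for a single eviction request."""
    if allowed_disruptions(current_healthy, min_available) > 0:
        return True, current_healthy - 1
    # The real eviction API rejects the request here and the drain retries.
    return False, current_healthy


if __name__ == "__main__":
    healthy = 4
    for attempt in range(1, 3):
        evicted, healthy = try_evict(healthy, min_available=3)
        print(f"eviction {attempt}: evicted={evicted}, healthy={healthy}")
```

With four healthy replicas and minAvailable: 3, the first eviction succeeds and the second is refused until a replacement Pod becomes ready and raises the healthy count back to four.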

Hands-on Example

A Deployment with a PDB, a preStop hook, and a sensible grace period.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 4
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      terminationGracePeriodSeconds: 45  # must cover the preStop hook (10s sleep + up to 25s drain) plus SIGTERM handling
      containers:
        - name: app
          image: registry.example.com/web:5.2.0
          ports:
            - containerPort: 8080
          lifecycle:
            preStop:
              exec:
                command:
                  - sh
                  - -c
                  - "sleep 10 && /app/drain --timeout=25s"
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: 8080
            periodSeconds: 3
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: web

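The sleep 10 in the preStop hook gives the endpoint update time to reach kube-proxy and any load balancers, so the Pod is out of rotation before the app stops accepting connections. The PDB makes the eviction API refuse any voluntary eviction that would leave fewer than three of the four replicas available; it does not protect against involuntary disruptions such as a node crash.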


 kubectl drain node-1
        |
        v
 +----------------+     evict     +---------------+
 | eviction API   | ------------> | PDB check     |
 +----------------+               +---------------+
                                        |
                                allowed | (>=3 available)
                                        v
                                +-----------------+
                                | grace period:   |
                                | 1. preStop hook |
                                | 2. SIGTERM      |
                                +-----------------+
                                        |
                                        v
                                +-----------------+
                                | Pod terminates  |
                                | rescheduled     |
                                | on node-2       |
                                +-----------------+

Drain, eviction, and PDB interaction during upgrade

Common Pitfalls

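The most common failure is no PDB at all. Without one, a drain evicts every replica on the node at once, and if all replicas share that node, or several nodes drain in parallel, the whole service goes down. Even a one-line minAvailable: 1 is dramatically better than nothing, provided the workload runs at least two replicas; with a single replica, minAvailable: 1 blocks the drain indefinitely.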

Another is a terminationGracePeriodSeconds that is shorter than your real shutdown work. The countdown includes the time spent in the preStop hook, so budget for both. If your service takes 30 seconds to flush in-flight requests but the grace period is 10, you will see SIGKILL mid-request every time a node drains.

A subtle one: endpoint updates are eventually consistent. After a Pod starts terminating, there is a short window, typically a few seconds, in which kube-proxy and load balancers may still send it traffic. Without a preStop sleep, you lose those requests. The sleep adds a few seconds to shutdown, but it prevents traffic loss.
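A quick way to sanity-check a grace period is to add up every phase that runs inside it. The sketch below does that arithmetic in Python; the function name and parameters are hypothetical, and the figures come from the manifest in the Hands-on Example (45 second grace period, 10 second sleep, 25 second drain timeout).

```python
def grace_period_headroom(grace_period_s, pre_stop_sleep_s, drain_timeout_s,
                          sigterm_shutdown_s=0):
    """Seconds left in the grace period after all shutdown phases.

    A negative result means the Pod risks SIGKILL before it finishes.
    """
    used = pre_stop_sleep_s + drain_timeout_s + sigterm_shutdown_s
    return grace_period_s - used


if __name__ == "__main__":
    # Manifest from the example: 45s grace, sleep 10, drain --timeout=25s.
    print(grace_period_headroom(45, 10, 25))
    # Pitfall from the text: 30s of real shutdown work, 10s grace period.
    print(grace_period_headroom(10, 0, 0, sigterm_shutdown_s=30))
```

The example manifest leaves 10 seconds for the application to react to SIGTERM, while the pitfall case comes out 20 seconds short.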

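Cluster autoscaler interactions matter too. If a node leaves before the replacement is ready, evicted Pods can sit Pending because there is nowhere to schedule them. Pair PDBs with the node pool’s surge upgrade settings (max surge and max unavailable, where your platform offers them) so new capacity arrives before old nodes drain.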

Production Tips

Practice upgrades in a staging cluster on every minor version. Kubernetes releases break things in predictable ways, most often through removed or deprecated API versions listed in the release notes; rehearsing finds them before customers do.

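Use the flags in kubectl drain <node> --grace-period=60 --timeout=10m --delete-emptydir-data consciously. --grace-period overrides each Pod’s terminationGracePeriodSeconds, --timeout caps how long the drain waits before giving up, and --delete-emptydir-data allows evicting Pods that use emptyDir volumes, whose contents are then lost. Nodes that run DaemonSet-managed Pods also need --ignore-daemonsets. Each flag changes behavior; do not paste from Stack Overflow without reading them.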

Skew matters. The control plane must be upgraded before nodes: a kubelet must never be newer than the API server, and it may lag by only a limited number of minor versions (three in recent releases, two for kubelets older than 1.25). Plan to finish a node rollout within a week of the control plane.
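A skew check is easy to automate before a rollout. The sketch below is a hypothetical helper in Python; the default of three minor versions is an assumption based on the upstream version skew policy for recent releases, so verify max_minor_skew against the policy for the versions you run.

```python
def parse_minor(version: str) -> tuple:
    """Extract (major, minor) from strings like 'v1.29.3' or '1.29'."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def kubelet_skew_ok(apiserver: str, kubelet: str, max_minor_skew: int = 3) -> bool:
    """True if the kubelet is not newer than the API server and not too old.

    max_minor_skew is an assumption; check it against the skew policy.
    """
    api_major, api_minor = parse_minor(apiserver)
    kubelet_major, kubelet_minor = parse_minor(kubelet)
    if api_major != kubelet_major:
        return False
    skew = api_minor - kubelet_minor
    return 0 <= skew <= max_minor_skew
```

For example, kubelet_skew_ok("v1.30.1", "v1.28.5") passes under the default, while a kubelet at v1.31.0 against an API server at v1.30.0 fails because the kubelet is newer.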

Watch kube_pod_status_phase (from kube-state-metrics) and request error rates during drains. A sudden bump in 5xx during eviction is your signal that grace periods or preStop hooks need tuning.

Wrap-up

Cluster upgrades are routine when workloads cooperate with eviction. Add a PDB, a preStop sleep, and a realistic grace period to every Deployment. With those three knobs set correctly, drains become quiet events and upgrades stop being scary.