Kubernetes Node Affinity and Anti-Affinity: A Practical Guide

Intermediate 10 min read

What you'll learn

✓ The difference between nodeSelector and nodeAffinity
✓ Required vs preferred scheduling rules
✓ How podAntiAffinity spreads replicas across nodes
✓ Scheduler performance considerations at scale

Prerequisites

• Comfortable creating Deployments

What and Why

By default the Kubernetes scheduler picks any node that fits a pod’s requests. That is fine until you need GPU workloads only on GPU nodes, or you need replicas of a stateful service spread across zones so a zone outage does not take you down. Node affinity pulls pods toward certain nodes. Pod anti-affinity pushes pods away from each other.

Together they let you express placement intent declaratively, instead of pinning to specific node names or relying on luck.

Mental Model

Rules come in two strengths. requiredDuringSchedulingIgnoredDuringExecution is a hard constraint: if no node matches, the pod stays Pending. preferredDuringSchedulingIgnoredDuringExecution is a soft hint with a weight from 1 to 100; the scheduler scores nodes and favors the best match, but falls back to any other eligible node when no preferred node has room.

“IgnoredDuringExecution” means the rule is only evaluated at scheduling time. If a node label changes later, already-running pods are not moved. There is currently no RequiredDuringExecution variant, so an affinity rule never evicts a running pod.
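To make the weighting concrete, here is a minimal sketch of a pod spec fragment with two soft preferences. The disk-type label and the zone value are assumptions chosen for illustration, not labels your cluster necessarily has. For each node that passes the hard filters, the scheduler adds up the weights of the preference terms the node satisfies, and that sum feeds into the node's overall score.

```yaml
# Pod spec fragment (belongs under spec.template.spec of a Deployment).
# The label keys and values are illustrative assumptions.
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80                # valid range is 1-100
        preference:
          matchExpressions:
            - key: disk-type      # hypothetical node label
              operator: In
              values: ["ssd"]
      - weight: 20
        preference:
          matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values: ["us-east-1a"]
```

A node that satisfies both terms contributes 100 to this part of the score, a node that satisfies only the first contributes 80, and a node that satisfies neither is still eligible.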

Hands-on Example

Run a workload only on GPU nodes and spread replicas across availability zones:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference
spec:
  replicas: 3
  selector:
    matchLabels: { app: inference }
  template:
    metadata:
      labels: { app: inference }
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: hardware
                    operator: In
                    values: ["gpu"]
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: gpu-model
                    operator: In
                    values: ["a100"]
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels: { app: inference }
              topologyKey: topology.kubernetes.io/zone
      containers:
        - name: server
          image: example/inference:2.0
          resources:
            limits:
              nvidia.com/gpu: 1

Nodes:
node-a (zone=us-east-1a, hardware=gpu, gpu-model=a100)
node-b (zone=us-east-1b, hardware=gpu, gpu-model=t4)
node-c (zone=us-east-1c, hardware=gpu, gpu-model=a100)
node-d (zone=us-east-1a, hardware=cpu)

Scheduling 3 replicas:
pod-1 -> node-a   (a100 preferred, zone 1a)
pod-2 -> node-c   (a100 preferred, different zone)
pod-3 -> node-b   (zone 1b free, falls back from a100)
node-d skipped: hardware!=gpu (hard rule)

Scheduling outcome across three zones

Common Pitfalls

topologyKey must be a label key that actually exists on your nodes. A typo like topology.kubernetes.io/zones (plural) silently turns the anti-affinity rule into a no-op: no node carries that label, so the scheduler never sees two pods as sharing a domain, and replicas can end up in the same zone or even on the same node. Always verify with kubectl get nodes --show-labels.

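Hard (requiredDuringScheduling) anti-affinity on a Deployment with more replicas than failure domains leaves the extra pods Pending indefinitely. Three replicas across only two zones is a stuck rollout. Even with exactly one replica per zone, a rolling update under the default strategy can stall, because the surge pod matches the same selector and finds every zone already occupied. Use the preferred form, or switch to topologySpreadConstraints, which handles “best effort” spread more gracefully.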
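As a sketch of the topologySpreadConstraints alternative, the fragment below spreads the same app: inference pods across zones without blocking scheduling. It assumes the pod labels from the hands-on example and would replace the podAntiAffinity stanza there.

```yaml
# Pod spec fragment: sits next to `containers` under spec.template.spec.
topologySpreadConstraints:
  - maxSkew: 1                          # max allowed difference in pod count between zones
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway   # soft; DoNotSchedule would make it a hard rule
    labelSelector:
      matchLabels:
        app: inference
```

With maxSkew: 1, three replicas over two zones settle into a 2/1 split instead of leaving the third pod Pending, even if you later tighten the rule to DoNotSchedule.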

nodeSelector still works and is simpler than nodeAffinity for exact-match cases. Mixing both in one pod spec is allowed, and a node must then satisfy both, but the result is confusing to read; pick one per workload.
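For comparison, here are the two ways to express the same exact-match rule, using the hardware=gpu label from the hands-on example. Both are pod spec fragments; use one or the other.

```yaml
# Option A: nodeSelector, a plain key/value exact match.
nodeSelector:
  hardware: gpu
---
# Option B: the equivalent hard nodeAffinity rule.
# More verbose, but it can grow into multi-value or negative matches later.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: hardware
              operator: In
              values: ["gpu"]
```

Option A is enough while a single equality check covers the need; Option B becomes worthwhile once you need operators such as NotIn or Exists, or a soft preference on top.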

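Node affinity accepts the operators In, NotIn, Exists, DoesNotExist, Gt, and Lt. Label selectors inside pod affinity and anti-affinity accept only the first four. Gt and Lt take a single value that is written as a string but parsed as an integer, which surprises people the first time.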
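A short sketch of Gt in practice follows. The gpu-count label is hypothetical; it assumes your nodes carry a label such as gpu-count=4.

```yaml
# Pod spec fragment. `gpu-count` is an assumed, illustrative node label.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: gpu-count
              operator: Gt
              values: ["3"]   # exactly one value, quoted as a string, compared as an integer
```

This matches nodes whose gpu-count label parses to an integer greater than 3. A node with a non-numeric label value does not match.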

Production Tips

Prefer topologySpreadConstraints for simple spread requirements; it scales better than anti-affinity rules with many label combinations. Reserve anti-affinity for cases where you need hard guarantees (one replica per node, never two together).
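When you do need the hard guarantee of at most one replica per node, a minimal sketch looks like this. It reuses the app: inference label from the hands-on example and swaps the zone key for the per-node hostname label.

```yaml
# Pod spec fragment: at most one `app: inference` pod per node.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: inference
        topologyKey: kubernetes.io/hostname   # each node is its own domain
```

Because this is a hard rule, the Deployment needs at least as many eligible nodes as replicas, or the surplus pods stay Pending.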

Anti-affinity is expensive at scale. The scheduler must check every candidate node against every existing pod that matches the selector. In clusters with thousands of pods, this can add noticeable scheduling latency, and the upstream documentation advises against inter-pod affinity in clusters larger than several hundred nodes. Scope selectors tightly (one app, one namespace) to keep the work bounded.
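One way to keep the selector narrow is sketched below: a soft anti-affinity term that matches a single app in a single namespace. The ml-serving namespace name is an assumption for illustration; when namespaces is omitted, the term already defaults to the pod's own namespace.

```yaml
# Pod spec fragment: soft, tightly scoped anti-affinity.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: inference          # one app only, not a broad tier or team label
          namespaces: ["ml-serving"]  # hypothetical namespace; omit to use the pod's own
          topologyKey: topology.kubernetes.io/zone
```

Note that the preferred form wraps the term in podAffinityTerm and adds a weight, unlike the required form, which lists terms directly.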

Use taints and tolerations alongside node affinity. Affinity says “I want this node”; a taint says “this node rejects pods that do not tolerate it.” Affinity alone does nothing to keep other workloads out, so pair the two to stop GPU nodes from filling up with unrelated pods that carry no placement rules at all.
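A minimal sketch of the pairing follows. It assumes the GPU nodes are tainted with hardware=gpu:NoSchedule, a taint key chosen here to mirror the label in the hands-on example; your node pools may use a different key.

```yaml
# Assumed taint on each GPU Node object (usually set by node-pool config):
#   spec:
#     taints:
#       - key: hardware
#         value: gpu
#         effect: NoSchedule
#
# Pod spec fragment: the matching toleration for the inference pods.
tolerations:
  - key: hardware
    operator: Equal
    value: gpu
    effect: NoSchedule
```

The toleration only permits the pod onto tainted nodes; the required nodeAffinity rule is still what steers it there, so keep both.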

Label nodes through your node-pool config (EKS managed node groups, GKE node pools), not with kubectl label, so the labels survive node replacement.

Wrap-up

Node affinity steers pods toward the right hardware, and anti-affinity keeps replicas apart so one failure cannot take them all. Start with soft rules and topologySpreadConstraints, escalate to hard anti-affinity only where uptime requires it, and verify your topology keys before you ship.