Kubernetes Node Affinity and Anti-Affinity: A Practical Guide
Understand how nodeAffinity and podAntiAffinity steer the scheduler, with real YAML, hard vs soft rules, and the performance traps to avoid.
What you'll learn
- The difference between nodeSelector and nodeAffinity
- Required vs preferred scheduling rules
- How podAntiAffinity spreads replicas across nodes
- Scheduler performance considerations at scale
Prerequisites
- Comfortable creating Deployments
What and Why
By default the Kubernetes scheduler picks any node that fits a pod’s requests. That is fine until you need GPU workloads only on GPU nodes, or you need replicas of a stateful service spread across zones so a zone outage does not take you down. Node affinity pulls pods toward certain nodes. Pod anti-affinity pushes pods away from each other.
Together they let you express placement intent declaratively, instead of pinning to specific node names or relying on luck.
Mental Model
There are two strengths of rule. requiredDuringSchedulingIgnoredDuringExecution is a hard constraint: if no node matches, the pod stays Pending. preferredDuringSchedulingIgnoredDuringExecution is a soft hint with a weight from 1 to 100; the scheduler adds that weight to the score of every node that matches and favors the highest-scoring node, but it still places the pod on a non-matching node if no preferred node has room.
“IgnoredDuringExecution” means the rule is only evaluated at scheduling time. If a node label changes later, already-running pods are not moved. There is no RequiredDuringExecution mode.
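As a rough illustration of how soft rules combine, the fragment below lists two weighted preferences and no hard rule. The label keys gpu-model and disk are assumptions made up for this sketch; substitute whatever labels your nodes actually carry.

```yaml
# Pod spec fragment: soft node preferences only, no hard rule.
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 80                 # allowed range is 1-100
      preference:
        matchExpressions:
        - key: gpu-model         # illustrative label key
          operator: In
          values: ["a100"]
    - weight: 20
      preference:
        matchExpressions:
        - key: disk              # illustrative label key
          operator: In
          values: ["ssd"]
```

A node that matches both terms gets 100 added to its affinity score, a node that matches only the first gets 80, and a node that matches neither is still eligible. It just ranks lower.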
Hands-on Example
Run a workload only on GPU nodes and spread replicas across availability zones:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference
spec:
  replicas: 3
  selector:
    matchLabels: { app: inference }
  template:
    metadata:
      labels: { app: inference }
    spec:
      affinity:
        nodeAffinity:
          # Hard rule: only nodes labeled hardware=gpu are eligible.
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: hardware
                operator: In
                values: ["gpu"]
          # Soft rule: among eligible nodes, prefer gpu-model=a100.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: gpu-model
                operator: In
                values: ["a100"]
        podAntiAffinity:
          # Hard rule: at most one app=inference pod per zone.
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels: { app: inference }
            topologyKey: topology.kubernetes.io/zone
      containers:
      - name: server
        image: example/inference:2.0
        resources:
          limits:
            nvidia.com/gpu: 1

Nodes:
  node-a  (zone=us-east-1a, hardware=gpu, gpu-model=a100)
  node-b  (zone=us-east-1b, hardware=gpu, gpu-model=t4)
  node-c  (zone=us-east-1c, hardware=gpu, gpu-model=a100)
  node-d  (zone=us-east-1a, hardware=cpu)
Scheduling 3 replicas:
  pod-1 -> node-a  (a100 preferred, zone 1a)
  pod-2 -> node-c  (a100 preferred, different zone)
  pod-3 -> node-b  (zone 1b free, falls back from a100)
  node-d skipped: hardware!=gpu (hard rule)

Common Pitfalls
topologyKey must be a label that exists on nodes. A typo like topology.kubernetes.io/zones (plural) silently turns the anti-affinity rule into a no-op: no node carries that label, so the scheduler never finds a conflict and replicas can land in the same zone, or even on the same node. Always verify with kubectl get nodes --show-labels.
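For reference, this is roughly what the relevant part of a Node object looks like for node-a from the example, as you might see it with kubectl get node node-a -o yaml. The hardware and gpu-model labels are the custom ones used in this article; the topology.kubernetes.io and kubernetes.io/hostname keys are well-known labels that cloud providers and the kubelet normally set, though you should confirm they are present in your cluster.

```yaml
# Excerpt of a Node object; values mirror node-a from the example above.
metadata:
  name: node-a
  labels:
    kubernetes.io/hostname: node-a
    topology.kubernetes.io/region: us-east-1
    topology.kubernetes.io/zone: us-east-1a   # the key used as topologyKey
    hardware: gpu                             # custom label
    gpu-model: a100                           # custom label
```

The topologyKey in a rule has to match one of these label keys character for character.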
Strict requiredDuringScheduling anti-affinity on a Deployment with more replicas than failure domains leaves the extra replicas Pending indefinitely. Three replicas and only two zones with hard anti-affinity is a stuck rollout. Use the preferred form instead, or switch to topologySpreadConstraints, which handles “best effort” spread more gracefully.
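A minimal sketch of the soft variant, reusing the app: inference label from the example Deployment. Note that the preferred form wraps the selector and topologyKey in a podAffinityTerm and adds a weight.

```yaml
# Pod template spec fragment: best-effort zone spread.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels: { app: inference }
        topologyKey: topology.kubernetes.io/zone
```

With this rule a third replica in a two-zone cluster is scheduled into a zone that already hosts one, rather than staying Pending.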
nodeSelector still works and is simpler than nodeAffinity for exact-match cases. Mixing both in one pod spec is allowed, and a node must then satisfy both, but it is confusing; pick one per workload.
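For comparison, the hard hardware=gpu rule from the example could be written with nodeSelector as in the sketch below. It expresses only exact key/value matches, so the soft gpu-model preference has no nodeSelector equivalent.

```yaml
# Pod template spec fragment: exact-match node selection.
spec:
  nodeSelector:
    hardware: gpu
```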
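The operators allowed in node affinity are In, NotIn, Exists, DoesNotExist, Gt, and Lt. Gt and Lt take a single value and compare it with the label as an integer, even though both are stored as strings, which surprises people the first time. Pod affinity and anti-affinity label selectors support only In, NotIn, Exists, and DoesNotExist.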
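The fragment below sketches Gt and Exists together. The gpu-count label is hypothetical and is used here only to show the integer comparison.

```yaml
# Pod spec fragment: node affinity with Gt and Exists.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:        # expressions in one term are ANDed
        - key: gpu-count         # hypothetical label, e.g. gpu-count=8
          operator: Gt
          values: ["3"]          # exactly one value, an integer written as a string
        - key: hardware
          operator: Exists       # Exists and DoesNotExist take no values
```

Multiple entries under nodeSelectorTerms are ORed, while the expressions inside a single term must all match.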
Production Tips
Prefer topologySpreadConstraints for simple spread requirements; they scale better than anti-affinity rules with many label combinations. Reserve anti-affinity for cases where you need hard guarantees (one replica per node, never two together).
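A minimal sketch of a spread constraint for the example Deployment, assuming best-effort behavior is acceptable. The field sits directly under the pod template spec, next to affinity rather than inside it.

```yaml
# Pod template spec fragment: spread app=inference pods evenly across zones.
spec:
  topologySpreadConstraints:
  - maxSkew: 1                              # max difference in pod count between zones
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway       # DoNotSchedule makes it a hard rule
    labelSelector:
      matchLabels: { app: inference }
```

Unlike hard anti-affinity, this allows more than one replica per zone once every zone has one, so replica counts larger than the number of zones still schedule.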
Anti-affinity is expensive at scale. The scheduler must check every candidate node against every existing pod that matches the selector. In clusters with thousands of pods, this can add seconds to scheduling latency. Scope selectors tightly (one app, one namespace) to keep the work bounded.
Use taints and tolerations alongside node affinity. Affinity says “I want this node”; a taint says “this node rejects pods that do not tolerate it.” Affinity alone does nothing to keep other pods off a node, so together they prevent GPU nodes from being filled by unrelated pods that never asked for a GPU.
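As a sketch of the pairing, the first fragment taints a GPU node and the second lets the inference pods tolerate that taint. The taint key dedicated and its value gpu are assumptions chosen for illustration; any key works as long as the taint and the toleration agree.

```yaml
# Fragment of a Node object: repel pods that lack a matching toleration.
spec:
  taints:
  - key: dedicated        # hypothetical taint key
    value: gpu
    effect: NoSchedule
---
# Fragment of the pod template spec: tolerate the taint above.
spec:
  tolerations:
  - key: dedicated
    operator: Equal
    value: gpu
    effect: NoSchedule
```

The toleration only permits scheduling onto the tainted node; the node affinity rule is still what directs the pod there.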
Label nodes through your node-pool config (EKS managed node groups, GKE node pools), not by kubectl label, so the labels survive node replacement.
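One possible way to do this on EKS is an eksctl cluster config, sketched below. The use of eksctl, the cluster name, the node group name, and the instance type are all assumptions for illustration; other tools such as Terraform or the cloud console expose an equivalent labels setting.

```yaml
# Hypothetical eksctl config: labels are applied to every node in the group,
# including replacements created after a node is recycled.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo-cluster        # illustrative name
  region: us-east-1
managedNodeGroups:
- name: gpu-a100            # illustrative name
  instanceType: p4d.24xlarge
  desiredCapacity: 2
  labels:
    hardware: gpu
    gpu-model: a100
```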
Wrap-up
Node affinity steers pods toward the right hardware, and anti-affinity keeps replicas apart so one failure cannot take them all. Start with soft rules and topologySpreadConstraints, escalate to hard anti-affinity only where uptime requires it, and verify your topology keys before you ship.