Kubernetes Resource Requests and Limits Explained
What requests and limits really do, how they interact with the scheduler and the OOM killer, and how to set them without overpaying or getting throttled.
What you'll learn
- Requests vs limits semantics
- How the scheduler uses requests
- CPU throttling and memory OOM
- QoS classes
- Right-sizing in practice
Prerequisites
- Familiarity with terminals and YAML
What and Why
Every Kubernetes pod can declare two numbers per resource: a request and a limit. They look like one knob with two values, but they do different jobs. Requests influence scheduling and fairness. Limits cap actual consumption. Get them wrong and you either waste cluster capacity or randomly get OOM-killed in production.
This matters because requests and limits are one of the biggest levers on cluster efficiency. Teams commonly overpay by 2 to 3x because they request far more than a service uses, for example 4 CPUs and 8 GB for a service that needs 200 millicores and 256 MB.
Mental Model
- Request is the floor the scheduler promises. The scheduler will only place a pod on a node if sum(requests) + new request <= node allocatable (capacity minus what is reserved for the system). The kubelet also uses requests to share CPU during contention.
- Limit is the ceiling. CPU usage above the limit is throttled by the kernel CFS scheduler. Memory usage above the limit gets the container killed, with the reason reported as OOMKilled.
request = limit (both set, equal, for CPU and memory) -> Guaranteed
some request or limit set, but not all equal          -> Burstable
no request, no limit                                  -> BestEffort
Scheduler view: uses request
Runtime cap:    uses limit (CPU throttle, mem kill)
QoS class drives eviction order under node pressure: BestEffort goes first, then Burstable furthest over its request, then Guaranteed last.
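To make the three classes concrete, here is a minimal sketch of three pods, one per QoS class. The pod names, the pause image, and the sizes are illustrative assumptions rather than values from a real workload.

```yaml
# Illustrative only: names, image, and sizes are assumptions.
apiVersion: v1
kind: Pod
metadata: { name: qos-guaranteed }
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
      resources:
        requests: { cpu: "500m", memory: "256Mi" }
        limits:   { cpu: "500m", memory: "256Mi" }  # equal to requests -> Guaranteed
---
apiVersion: v1
kind: Pod
metadata: { name: qos-burstable }
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
      resources:
        requests: { cpu: "250m", memory: "256Mi" }
        limits:   { memory: "512Mi" }               # limit above request, no CPU limit -> Burstable
---
apiVersion: v1
kind: Pod
metadata: { name: qos-besteffort }
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9              # no resources stanza at all -> BestEffort
```

Once a pod is created, Kubernetes records the class it assigned in the pod's status.qosClass field, so you can check it with kubectl get pod qos-burstable -o jsonpath='{.status.qosClass}'.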
Hands-on Example
A typical web service deployment:
apiVersion: apps/v1
kind: Deployment
metadata: { name: api }
spec:
  replicas: 3
  selector: { matchLabels: { app: api } }
  template:
    metadata: { labels: { app: api } }
    spec:
      containers:
        - name: api
          image: example/api:1.4.2
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              memory: "512Mi"
Note there is no CPU limit. That is deliberate. CPU throttling under a limit causes more incidents than it prevents because spikes that would have completed in milliseconds now queue. Set CPU requests for fair sharing and skip the limit unless you have a noisy-neighbor problem.
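To see why a CPU limit turns short spikes into queuing, it helps to work through the arithmetic. The fragment below is a counter-example, not a recommendation: it adds a hypothetical 500m CPU limit to the api container and annotates what that means, assuming the default CFS period of 100 ms and a process that runs four busy threads during a spike.

```yaml
# Counter-example: what a CPU limit would mean for the api container.
# Assumes the default CFS period of 100 ms; the 500m limit and the
# four-thread spike are illustrative assumptions.
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"       # quota = 0.5 CPU x 100 ms = 50 ms of CPU time per period
    memory: "512Mi"
# A spike with 4 busy threads spends 4 ms of quota per 1 ms of wall time,
# so the 50 ms quota is gone after 12.5 ms. The container is then throttled
# for the remaining 87.5 ms of the period, even if the node has idle cores.
```

Under these assumptions, work that could have finished in a few tens of milliseconds is stretched across several periods, which is the latency cliff described above.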
For memory, set the request near steady-state usage and the limit near 2x the request to protect the node. Verify actual usage and node allocation with kubectl (kubectl top requires metrics-server):
kubectl top pod api-7d5
kubectl describe node ip-10-0-1-23 | grep -A5 "Allocated resources"
To right-size automatically, install the Vertical Pod Autoscaler and create a VPA object in recommendation-only mode (updateMode "Off"):
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata: { name: api-vpa }
spec:
  targetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  updatePolicy: { updateMode: "Off" }
After about a week of representative traffic, the VPA's status holds recommended requests grounded in real usage. Read them with kubectl describe vpa api-vpa.
Common Pitfalls
- CPU limits everywhere. Throttling causes p99 latency cliffs that look like bugs. Audit container_cpu_cfs_throttled_seconds_total in Prometheus; it is a counter, so if it keeps climbing for a container, that container's limit is too tight.
- Memory limit equal to request. No headroom for GC or buffer cache means random OOMKills. Give memory at least 30 percent headroom.
- No requests at all. Pods land as BestEffort and get evicted first. Always set requests in production.
- JVM ignoring cgroup limits. Old JVMs (before 8u191 and JDK 10) size the heap from the host’s memory instead of the container's limit. Use a container-aware JDK and tune -XX:MaxRAMPercentage.
- Sidecars without resources. Istio or logging sidecars consume memory too. Set their requests or they distort node packing.
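The last two pitfalls can be addressed in the pod template itself. The sketch below is a containers fragment that assumes the api image runs on a JVM and sits next to a hypothetical log-shipper sidecar; the sidecar name and image, the 75 percent figure, and the sidecar sizes are placeholders to adapt. JAVA_TOOL_OPTIONS is an environment variable the JVM reads at startup.

```yaml
# Fragment of spec.template.spec in a Deployment. Assumes a JVM workload.
containers:
  - name: api
    image: example/api:1.4.2
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-XX:MaxRAMPercentage=75.0"  # max heap = 75% of the 512Mi limit (384Mi)
    resources:
      requests: { cpu: "250m", memory: "256Mi" }
      limits:   { memory: "512Mi" }
  - name: log-shipper                       # hypothetical sidecar
    image: example/log-shipper:0.1.0        # placeholder image
    resources:
      requests: { cpu: "50m", memory: "64Mi" }
      limits:   { memory: "128Mi" }
```

Leaving roughly a quarter of the limit outside the heap gives metaspace, thread stacks, and direct buffers somewhere to live. The sidecar's requests now count toward the pod's total, so the scheduler packs nodes based on what the pod really needs.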
Production Tips
- Use Goldilocks or VPA to generate recommendations and commit them quarterly. Treat them as a starting point, not gospel.
- Reserve headroom on nodes with the kubelet's --system-reserved and --kube-reserved flags, and use PodDisruptionBudgets so eviction does not break SLOs.
- Apply a LimitRange in each namespace so teams who forget get a sane default instead of unlimited consumption.
- Pair requests with the Horizontal Pod Autoscaler keyed on CPU or a custom metric. Right-size pods first, then scale them.
- For batch and ML workloads, use PriorityClasses so latency-sensitive services preempt best-effort jobs under pressure.
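Several of these tips map directly onto small manifests. The following is a minimal sketch, not a drop-in configuration: the team-a namespace, the object names, the default sizes, the replica bounds, the 70 percent target, and the priority values are all assumptions chosen for illustration.

```yaml
# Default requests and a default memory limit for containers that set none.
apiVersion: v1
kind: LimitRange
metadata: { name: container-defaults, namespace: team-a }
spec:
  limits:
    - type: Container
      defaultRequest: { cpu: "100m", memory: "128Mi" }  # applied when requests are missing
      default: { memory: "256Mi" }                      # applied when limits are missing
---
# HPA keyed on CPU. Utilization is measured against the CPU request,
# so with a 250m request a 70% target means about 175m average per pod.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api, namespace: team-a }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 70 }
---
# Higher value wins: service pods can preempt batch pods under pressure.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata: { name: service-high }
value: 100000
description: "Latency-sensitive services"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata: { name: batch-low }
value: 1000
preemptionPolicy: Never   # batch pods never preempt others
description: "Batch and ML jobs"
```

Pods opt into a class by setting spec.priorityClassName. Because the HPA target is a percentage of the request, a request that is far too high keeps utilization low and the HPA never scales, which is why right-sizing comes first.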
Wrap-up
Requests and limits are how Kubernetes shares finite hardware between many workloads. Set requests grounded in real measurements, set memory limits with headroom, and be cautious with CPU limits. Combine those defaults with autoscaling and good observability and you get a cluster that is both cheap and predictable, instead of one that mysteriously throttles and kills containers under load.