Kubernetes Resource Requests and Limits Explained

Intermediate 10 min read

What you'll learn

✓ Requests vs limits semantics
✓ How the scheduler uses requests
✓ CPU throttling and memory OOM
✓ QoS classes
✓ Right-sizing in practice

Prerequisites

• Familiarity with terminals and YAML

What and Why

Every container in a Kubernetes pod can declare two numbers per resource: a request and a limit. They look like one knob with two values, but they do different jobs. Requests drive scheduling and fair sharing. Limits cap actual consumption. Get them wrong and you either waste cluster capacity or see containers OOM-killed in production.

This matters because resource sizing is one of the biggest levers on cluster efficiency. Teams can overpay by 2 to 3x when they request 4 CPUs and 8 GB for a service that needs 200 millicores and 256 MB.

Mental Model

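Request is the floor the scheduler promises. The scheduler places a pod on a node only if sum(existing requests) + new request <= node allocatable, where allocatable is node capacity minus system reservations. Requests also set each container's CPU weight in its cgroup, so under contention CPU time is divided in proportion to requests.
Limit is the ceiling. CPU use above the limit is throttled by the kernel's CFS bandwidth control. Memory use above the limit gets the container killed by the kernel OOM killer, which Kubernetes reports as OOMKilled.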

request = limit for CPU and memory, every container   -> Guaranteed
some request or limit set, but not Guaranteed         -> Burstable
no requests, no limits on any container               -> BestEffort

Scheduler view:   uses request
Runtime cap:      uses limit (CPU throttle, mem kill)

Requests vs limits and the QoS classes they produce

QoS class shapes eviction order under node pressure. The kubelet first evicts pods whose usage exceeds their requests, so BestEffort pods go first, then Burstable pods furthest over their requests, and Guaranteed pods last.
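As a minimal sketch of the Guaranteed case, the manifest below sets requests equal to limits for both CPU and memory. The pod name qos-guaranteed is hypothetical, and the sizes are illustrative rather than recommendations.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed          # hypothetical name, for illustration only
spec:
  containers:
    - name: app
      image: example/api:1.4.2
      resources:
        requests:
          cpu: "500m"           # scheduler needs 500m of unrequested allocatable CPU on the node
          memory: "256Mi"
        limits:
          cpu: "500m"           # equal to the request; with the default 100 ms CFS period
                                # this allows 50 ms of CPU time per period before throttling
          memory: "256Mi"       # equal to the request; exceeding it gets the container OOMKilled
```

Once the pod is running, kubectl get pod qos-guaranteed -o jsonpath='{.status.qosClass}' should report Guaranteed. Removing the limits block (or raising a limit above its request) makes the same pod Burstable, and removing the whole resources block makes it BestEffort.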

Hands-on Example

A typical web service deployment:

apiVersion: apps/v1
kind: Deployment
metadata: { name: api }
spec:
  replicas: 3
  selector: { matchLabels: { app: api } }
  template:
    metadata: { labels: { app: api } }
    spec:
      containers:
        - name: api
          image: example/api:1.4.2
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              memory: "512Mi"

Note there is no CPU limit. That is deliberate. CPU throttling under a limit often causes more incidents than it prevents, because a burst that would have finished in milliseconds is paused until the next CFS period once the quota is spent. Set CPU requests for fair sharing and skip the limit unless you have a noisy-neighbor problem.
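If a CPU limit is unavoidable, it helps to watch how often containers hit it. The Prometheus rule file below is a sketch, assuming the cluster scrapes the kubelet's cAdvisor metrics with namespace, pod, and container labels. The group name, alert name, 25 percent threshold, and 15 minute window are illustrative assumptions, not values from this article.

```yaml
groups:
  - name: cpu-throttling                      # hypothetical group name
    rules:
      - alert: ContainerCPUThrottlingHigh     # hypothetical alert name
        # Fraction of CFS periods in which the container was throttled.
        expr: |
          sum by (namespace, pod, container) (
            rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])
          )
          /
          sum by (namespace, pod, container) (
            rate(container_cpu_cfs_periods_total{container!=""}[5m])
          ) > 0.25
        for: 15m                              # assumed window; tune to your traffic
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is throttled in more than 25% of CFS periods"
```

A ratio of throttled periods to total periods is usually easier to reason about than a raw counter, because it stays comparable across containers with different limits.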

For memory, set the request near steady-state usage and the limit near 2x the request to protect the node. Verify with kubectl:

kubectl top pod api-7d5
kubectl describe node ip-10-0-1-23 | grep -A5 "Allocated resources"

To right-size automatically, install the Vertical Pod Autoscaler in recommendation mode:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata: { name: api-vpa }
spec:
  targetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  updatePolicy: { updateMode: "Off" }

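After a week or so of representative traffic, VPA's recommendations (visible with kubectl describe vpa api-vpa) are grounded in real usage and can be copied into the Deployment's requests.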
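Recommendations can also be bounded so that a quiet week or a one-off spike does not produce extreme values. The variant below is a sketch that adds a resourcePolicy to the same object; the minAllowed and maxAllowed figures are placeholder assumptions to replace with your own bounds.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"                 # recommend only; pods are never evicted or resized
  resourcePolicy:
    containerPolicies:
      - containerName: api
        controlledResources: ["cpu", "memory"]
        minAllowed:                   # assumed lower bound for recommendations
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:                   # assumed upper bound for recommendations
          cpu: "2"
          memory: "2Gi"
```

The recommendation in the object's status lists a lowerBound, target, and upperBound per container; the target is the value to compare against the current request.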

Common Pitfalls

CPU limits everywhere. Throttling causes p99 latency cliffs that look like bugs. Audit container_cpu_cfs_throttled_seconds_total in Prometheus; if it climbs steadily under normal load, your limit is too tight.
Memory limit equal to a steady-state request. No headroom for GC spikes or buffers means the container is OOMKilled at the first peak. Give memory at least 30 percent headroom above steady-state usage.
No requests at all. Pods land as BestEffort and get evicted first. Always set requests in production.
JVM ignoring cgroup limits. JVMs older than 8u191 or JDK 10 size the heap from the host's memory, not the container's limit. Use a modern JDK and tune MaxRAMPercentage.
Sidecars without resources. Istio or logging sidecars consume memory too. Set their requests or they distort node packing.
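The memory-headroom, JVM, and sidecar pitfalls above can be addressed together in one pod template. The Deployment below is a sketch: the orders service, the log-shipper sidecar, their images, and all sizes are hypothetical and only illustrate the shape.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders                            # hypothetical Java service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
    spec:
      containers:
        - name: orders
          image: example/orders:1.0.0     # hypothetical image
          env:
            # Read by the JVM at startup; caps the heap at 75% of the
            # container memory limit (about 576Mi of 768Mi), leaving the
            # rest for metaspace, thread stacks, and native buffers.
            - name: JAVA_TOOL_OPTIONS
              value: "-XX:MaxRAMPercentage=75.0"
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              memory: "768Mi"             # 50% above the request, so GC spikes have room
        - name: log-shipper               # hypothetical logging sidecar
          image: example/log-shipper:2.3.0
          resources:
            requests:                     # counted by the scheduler like any other container
              cpu: "50m"
              memory: "64Mi"
            limits:
              memory: "128Mi"
```

The scheduler sums requests across all containers in the pod, so this pod reserves 300m of CPU and 576Mi of memory on its node.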

Production Tips

Use Goldilocks or VPA to generate recommendations and commit them quarterly. Treat them as a starting point, not gospel.
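Reserve headroom on nodes with the kubelet's --system-reserved and --kube-reserved settings, and use PodDisruptionBudgets so voluntary disruptions such as node drains do not break SLOs. Kubelet node-pressure evictions do not honor PodDisruptionBudgets, which is one more reason to set requests accurately.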
Apply a LimitRange in each namespace so teams who forget get a sane default instead of unlimited consumption.
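Pair requests with the Horizontal Pod Autoscaler keyed on CPU or a custom metric. HPA measures CPU utilization as a percentage of the request, so an inflated request hides real load. Right-size pods first, then scale them.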
For batch and ML workloads, use PriorityClasses so latency-sensitive services preempt best-effort jobs under pressure.
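Two of the tips above translate directly into small manifests. The sketch below shows a LimitRange that supplies defaults for containers that declare nothing, and an HPA for the api Deployment from the hands-on example. The namespace team-a, the object names, the default sizes, the replica bounds, and the 70 percent target are illustrative assumptions.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults      # hypothetical name
  namespace: team-a             # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:           # applied when a container sets no request
        cpu: "100m"
        memory: "128Mi"
      default:                  # applied when a container sets no limit
        memory: "256Mi"         # memory only; no default CPU limit, matching the advice above
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 10               # assumed ceiling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request: 70% of 250m is 175m per pod
```

Because the HPA target is relative to the request, changing the request from 250m to 500m without touching the HPA doubles the absolute CPU usage at which scaling starts.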

Wrap-up

Requests and limits are how Kubernetes shares finite hardware between many workloads. Set requests grounded in real measurements, set memory limits with headroom, and be cautious with CPU limits. Combine those defaults with autoscaling and good observability and you get a cluster that is both cheap and predictable, instead of one that mysteriously throttles and kills containers under load.