Kubernetes Vertical Pod Autoscaler: A Practical Guide

Learn how the Vertical Pod Autoscaler right-sizes CPU and memory requests, when to use it instead of HPA, and how to deploy it safely in production.

4 min read · By Codeloom
Intermediate 10 min read

What you'll learn

  • What the VPA components do and how they interact
  • The Off, Initial, and Auto update modes and when to use each
  • Why VPA and HPA on the same metric conflict
  • How to roll VPA out without surprise restarts

Prerequisites

  • Familiarity with pod CPU and memory requests

What and Why

Every pod has CPU and memory requests. The scheduler uses them to pack nodes, and the kubelet translates them into kernel settings: CPU requests set a container's share of CPU under contention, and memory requests influence which containers are OOM-killed first when a node runs short. Most teams set those requests once, then either over-provision to avoid pages or under-provision and live with throttling. The Vertical Pod Autoscaler (VPA) observes real usage over time and updates a pod’s requests to match.

VPA is the right tool when a workload’s load profile is roughly stable but you do not know the right size, or when usage drifts over months as the codebase evolves. It is the wrong tool for spiky workloads, where adding more replicas is the better response.

Mental Model

VPA has three components. The Recommender watches metrics-server and history, computes target requests, and writes them to the VerticalPodAutoscaler status. The Updater evicts pods whose current requests are too far from the target. The Admission Controller rewrites requests on newly created pods using the recommendation.

Recommendation flows top-down through the VPA object. The Updater is the only piece that causes restarts; turn it off and VPA becomes a read-only sizing report.

Hands-on Example

Create a Deployment and a VPA in Off mode, which computes recommendations without applying them:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      containers:
        - name: api
          image: example/api:1.0
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "1"
              memory: "512Mi"
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "2"
          memory: "2Gi"
        controlledResources: ["cpu", "memory"]

After a few hours, inspect the recommendation:

kubectl describe vpa api-vpa
# Recommendation:
#   Container: api
#     Target: cpu 450m, memory 380Mi
#     Lower Bound: cpu 320m, memory 300Mi
#     Upper Bound: cpu 800m, memory 600Mi
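
The same numbers are stored on the VPA object itself, which is handy for scripts and dashboards. A minimal sketch of the relevant status excerpt, as you might see it from kubectl get vpa api-vpa -o yaml, reusing the illustrative values above (real output carries additional fields, such as an uncapped target, and may print memory in bytes):

```yaml
# Excerpt of the VerticalPodAutoscaler status written by the Recommender.
# Values mirror the illustrative recommendation above, not real measurements.
status:
  recommendation:
    containerRecommendations:
      - containerName: api
        lowerBound:        # below this, the container is considered under-provisioned
          cpu: 320m
          memory: 300Mi
        target:            # the requests VPA would set on a new pod
          cpu: 450m
          memory: 380Mi
        upperBound:        # above this, the container is considered over-provisioned
          cpu: 800m
          memory: 600Mi
```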

Once you trust the numbers, switch to Auto:

  updatePolicy:
    updateMode: "Auto"

metrics-server --usage--> [Recommender]
                               |
                               | writes
                               v
                      [VPA object status]
                        |             |
         on pod create  |             |  observes drift
                        v             v
          [Admission Webhook]     [Updater]
           rewrites requests      evicts outdated pod
                        |             |
                        +------+------+
                               |
                               v
                [new Pod with target requests]

VPA pipeline from metrics to pod restart

Common Pitfalls

VPA and HPA cannot both manage the same resource. If your HPA scales on CPU and your VPA also adjusts CPU requests, the HPA target percentage moves under it and you get oscillation. Use VPA for memory and HPA for CPU, or use a custom metric for the HPA.
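
One way to express the memory/CPU split is to restrict the VPA to memory with controlledResources and leave CPU to an HPA. The sketch below assumes the api Deployment from the hands-on example; the HPA name, replica range, and 70% utilization target are illustrative choices, not recommendations:

```yaml
# VPA manages memory only; it never touches CPU requests.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: api
        controlledResources: ["memory"]
        minAllowed:
          memory: "128Mi"
        maxAllowed:
          memory: "2Gi"
---
# HPA scales replicas on CPU utilization, measured against the
# CPU request that the VPA above leaves untouched.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3         # assumed range for illustration
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Because the CPU request stays fixed, the HPA's utilization percentage keeps a stable denominator while the VPA adjusts memory.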

Auto mode applies new requests by evicting pods. Without a PodDisruptionBudget, several replicas can be evicted at the same time. Always pair Auto mode with a PDB and run at least two replicas.
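
A minimal PodDisruptionBudget for the api Deployment might look like the following. The name api-pdb and the choice of maxUnavailable: 1 are assumptions for a three-replica service; pick a budget that matches your own availability target:

```yaml
# Limits voluntary disruptions, including VPA Updater evictions,
# to one api pod at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb          # hypothetical name
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: api           # same label the Deployment's pods carry
```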

VPA targets controllers (Deployment, StatefulSet, DaemonSet, or a custom resource that exposes the scale subresource), not individual pods. A naked Pod is ignored.

Setting minAllowed and maxAllowed is not optional in production. Without bounds, a memory leak can push the recommendation to absurd values and the Updater will happily evict pods to make them larger.

Production Tips

Roll out in three phases: Off for observation, Initial for new pods only (no eviction of running pods), then Auto once you trust the recommender. The Initial mode is great for batch workloads where pods are short-lived anyway.
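
The middle phase is a one-line change to the VPA from the hands-on example. A sketch of the updatePolicy fragment, with the rest of the object unchanged:

```yaml
# Phase 2: apply recommendations only when pods are created.
# Running pods keep their current requests until something else
# (a deploy, a node drain, a crash) replaces them.
  updatePolicy:
    updateMode: "Initial"
```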

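Decide explicitly whether VPA may touch limits. By default (controlledValues: RequestsAndLimits) VPA scales each limit in proportion to its request, preserving the original request-to-limit ratio, so a rising recommendation also raises the limit and can let a runaway container take far more of the node than you intended. Set controlledValues: RequestsOnly in the container policy to keep limits fixed at the values in your pod spec.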
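
To make the intent explicit rather than rely on defaults, the container policy can pin VPA to requests only. A sketch of the resourcePolicy fragment for the api container; the maxAllowed values here are an assumption, chosen to match the Deployment's limits (cpu 1, memory 512Mi) so that a recommended request never exceeds a fixed limit:

```yaml
# VPA adjusts requests only; limits stay as written in the pod spec.
  resourcePolicy:
    containerPolicies:
      - containerName: api
        controlledValues: RequestsOnly
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:          # assumed: capped at the Deployment's limits
          cpu: "1"
          memory: "512Mi"
```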

Exclude sidecars from VPA control with a containerPolicies entry of mode: "Off" for that container. Otherwise VPA will adjust your Istio or fluentd sidecar based on its baseline usage, which is rarely what you want.
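
As a sketch, the per-container opt-out looks like this. The container name istio-proxy is an assumption (it is the usual name of the injected Istio sidecar); substitute whatever name your sidecar actually uses:

```yaml
# VPA sizes the api container and leaves the sidecar alone.
  resourcePolicy:
    containerPolicies:
      - containerName: istio-proxy   # assumed sidecar container name
        mode: "Off"                  # no recommendations applied to this container
      - containerName: api
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "2"
          memory: "2Gi"
```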

Monitor the conditions in the VPA status. RecommendationProvided=False means the Recommender has not produced a recommendation, often because it lacks enough usage data or metrics-server is unhealthy.

Wrap-up

VPA turns CPU and memory sizing from guesswork into a feedback loop. Start in Off mode to see the numbers, set sane min and max bounds, pair it with a PodDisruptionBudget, and keep it off any resource that an HPA already scales on. The reward is smaller bills and fewer OOMKills without any code changes.