Kubernetes Vertical Pod Autoscaler: A Practical Guide
Learn how the Vertical Pod Autoscaler right-sizes CPU and memory requests, when to use it instead of HPA, and how to deploy it safely in production.
What you'll learn
- What the VPA components do and how they interact
- The three update modes and when to use each
- Why VPA and HPA on the same metric conflict
- How to roll VPA out without surprise restarts
Prerequisites
- Familiarity with pod CPU and memory requests
What and Why
Every container declares CPU and memory requests, which the scheduler uses to pack nodes, and optionally limits, which the kernel enforces by throttling CPU and OOM-killing on memory. Most teams set these values once, then either over-provision to avoid pages or under-provision and live with throttling and OOM kills. The Vertical Pod Autoscaler (VPA) observes real usage over time and updates a pod’s requests to match.
VPA is the right tool when a workload’s load profile is roughly stable but you do not know the right size, or when usage drifts over months as the codebase evolves. It is the wrong tool for spiky workloads, where adding more replicas with the Horizontal Pod Autoscaler (HPA) is the better response.
Mental Model
VPA has three components. The Recommender watches metrics-server and history, computes target requests, and writes them to the VerticalPodAutoscaler status. The Updater evicts pods whose current requests are too far from the target. The Admission Controller rewrites requests on newly created pods using the recommendation.
Recommendation flows top-down through the VPA object. The Updater is the only piece that causes restarts; turn it off and VPA becomes a read-only sizing report.
Hands-on Example
Create a Deployment and a VPA in Off mode, which publishes recommendations without changing any pods:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      containers:
        - name: api
          image: example/api:1.0
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "1"
              memory: "512Mi"
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "2"
          memory: "2Gi"
        controlledResources: ["cpu", "memory"]
After a few hours, inspect the recommendation:
kubectl describe vpa api-vpa
# Recommendation:
# Container: api
# Target: cpu 450m, memory 380Mi
# Lower Bound: cpu 320m, memory 300Mi
# Upper Bound: cpu 800m, memory 600Mi
Once you trust the numbers, switch to Auto:
updatePolicy:
  updateMode: "Auto"
metrics-server --usage--> [Recommender]
                               | writes
                               v
                      [VPA object status]
                               |
          on pod create        |        observes drift
                v                            v
      [Admission Webhook]                [Updater]
       rewrites requests            evicts outdated pod
                |                            |
                +----------> [new Pod with target requests]
Common Pitfalls
VPA and HPA cannot both manage the same resource. HPA computes CPU utilization as usage divided by the CPU request, so if your HPA scales on CPU and your VPA also adjusts CPU requests, every VPA change moves the HPA's percentage under it and you get oscillation. Use VPA for memory and HPA for CPU, or use a custom metric for the HPA.
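One way to split the two autoscalers, sketched under assumptions: the VPA below is restricted to memory through controlledResources, while a hypothetical HPA named api-hpa scales the same api Deployment on CPU utilization. The 70% target and the replica bounds are illustrative values, not recommendations.

```yaml
# VPA manages memory only; CPU requests stay as written in the Deployment.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"          # switch modes only after reviewing recommendations
  resourcePolicy:
    containerPolicies:
      - containerName: api
        controlledResources: ["memory"]
        minAllowed:
          memory: "128Mi"
        maxAllowed:
          memory: "2Gi"
---
# HPA manages replica count from CPU utilization (usage / CPU request).
# Name, replica bounds, and the 70% target are assumptions for illustration.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Because the CPU request is no longer touched by VPA, the denominator of the HPA's utilization calculation stays fixed.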
Auto mode evicts pods to apply new requests. Without a PodDisruptionBudget, you can lose multiple replicas at once. Always pair Auto VPA with a PDB and at least two replicas.
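A minimal PodDisruptionBudget for the api Deployment from the example might look like the following. The name api-pdb and the choice of maxUnavailable: 1 are assumptions; the selector reuses the app: api label from the Deployment.

```yaml
# Limits voluntary disruptions (including evictions requested through the
# eviction API) to one api pod at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb              # hypothetical name
spec:
  maxUnavailable: 1          # assumption: with 3 replicas, 2 stay available
  selector:
    matchLabels:
      app: api
```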
VPA does not work on individual pods, only on controllers (Deployment, StatefulSet, DaemonSet, custom). A naked Pod is ignored.
Setting minAllowed and maxAllowed is not optional in production. Without bounds, a memory leak can push the recommendation to absurd values and the Updater will happily evict pods to make them larger.
Production Tips
Roll out in three phases: Off for observation, Initial for new pods only (no eviction of running pods), then Auto once you trust the recommender. The Initial mode is great for batch workloads where pods are short-lived anyway.
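For the middle phase, only the update mode changes. A sketch of the same api-vpa object in Initial mode, keeping the bounds from the hands-on example:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    # Recommendations are applied when a pod is created;
    # running pods are not evicted to apply them.
    updateMode: "Initial"
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "2"
          memory: "2Gi"
```

In this mode, existing pods pick up new requests only when something else replaces them, such as a rollout or a node drain.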
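Decide explicitly how limits are handled. By default VPA controls both requests and limits (controlledValues: RequestsAndLimits) and scales each limit in proportion to its request, so a spike during recommendation can push limits up and let a runaway container eat the whole node. Set controlledValues: RequestsOnly on the container policy if you want limits to stay where you set them.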
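To keep limits fixed regardless of the default, the container policy can say so explicitly. A minimal sketch, reusing the api container from the hands-on example; capping maxAllowed at the Deployment's existing limits (cpu 1, memory 512Mi) is an assumption made here so that a recommended request never exceeds its limit.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: api
        # VPA adjusts requests only; limits stay as written in the Deployment.
        controlledValues: RequestsOnly
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        # Assumption: bounds kept at or below the fixed limits.
        maxAllowed:
          cpu: "1"
          memory: "512Mi"
```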
Exclude sidecars from VPA control with a containerPolicies entry of mode: "Off" for that container. Otherwise VPA will adjust your Istio or fluentd sidecar based on its baseline usage, which is rarely what you want.
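A sketch of such a policy, assuming the sidecar container is named istio-proxy (substitute whatever name your sidecar actually has in the pod spec):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      # The application container stays under VPA control.
      - containerName: api
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "2"
          memory: "2Gi"
      # Assumed sidecar name; mode "Off" disables autoscaling for this container.
      - containerName: istio-proxy
        mode: "Off"
```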
Monitor the conditions in the VPA object's status. RecommendationProvided=False means the recommender cannot collect enough data, often because metrics-server is unhealthy.
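For orientation, the status section printed by kubectl get vpa api-vpa -o yaml has roughly the following shape. The numbers reuse the illustrative recommendation from the hands-on example, and a real object carries additional fields (timestamps, uncappedTarget) omitted here.

```yaml
status:
  conditions:
    - type: RecommendationProvided
      status: "True"         # "False" means no recommendation is available yet
  recommendation:
    containerRecommendations:
      - containerName: api
        target:
          cpu: 450m
          memory: 380Mi
        lowerBound:
          cpu: 320m
          memory: 300Mi
        upperBound:
          cpu: 800m
          memory: 600Mi
```

Alerting on the condition rather than on the recommendation values catches the case where the numbers simply stop updating.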
Wrap-up
VPA turns CPU and memory sizing from guesswork into a feedback loop. Start in Off mode to see the numbers, set sane min and max bounds, pair with a PodDisruptionBudget, and keep it off any resource that an HPA scales on. The reward is smaller bills and fewer OOMKills without any code changes.