Kubernetes Horizontal Pod Autoscaler Explained
Understand how HPA decides when to add or remove pods, the metrics it can scale on, and the tuning knobs that prevent flapping and runaway scaling.
What you'll learn
- The control loop behind HPA
- CPU, memory, and custom metric scaling
- How stabilization windows prevent flapping
- How HPA interacts with Cluster Autoscaler
- Common failure modes and how to debug them
Prerequisites
- Familiarity with Deployments and resource requests
What and Why
The Horizontal Pod Autoscaler (HPA) automatically adjusts the number of pod replicas based on observed metrics. Traffic doubles at lunch? HPA spins up more pods. Quiet at 3 a.m.? It scales back down. Done well, it keeps latency stable without paying for peak capacity all day.
But misconfigured HPAs cause two opposite failures: under-scaling (pods saturate and 5xx errors spike) or scaling flaps (pods come and go every minute, hurting cache hit rates).
Mental Model
HPA is a control loop that runs every 15 seconds by default (configurable through the kube-controller-manager flag --horizontal-pod-autoscaler-sync-period). Each iteration:
- Reads the current metric value (CPU, memory, or a custom metric).
- Compares it to the target.
- Computes desired replicas using the ratio:
  desired = ceil(current_replicas * current / target)
- Applies the result, subject to min/max bounds and stabilization windows.
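The ratio rule above can be sketched in a few lines of Python. This is a minimal illustration, not the controller's actual code: the function name and parameters are made up for this example, and the 10 percent tolerance (the documented default, inside which the controller skips scaling) is included as an assumption about your cluster's settings.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """Replica count suggested by the ratio rule, clamped to [min, max].

    Hypothetical helper for illustration; tolerance=0.1 mirrors the
    default HPA tolerance and is an assumption about your cluster.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: leave the replica count alone.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

# 4 pods at 90% CPU against a 70% target: ceil(4 * 90 / 70) = 6
print(desired_replicas(4, 90, 70, min_replicas=2, max_replicas=20))    # 6
# 10 pods at 35% CPU: ceil(10 * 35 / 70) = 5
print(desired_replicas(10, 35, 70, min_replicas=2, max_replicas=20))   # 5
# 73% is within 10% of the target, so nothing changes
print(desired_replicas(4, 73, 70, min_replicas=2, max_replicas=20))    # 4
# A large spike is capped by max_replicas
print(desired_replicas(10, 200, 70, min_replicas=2, max_replicas=20))  # 20
```

Because of the ceiling, the rule rounds in favor of extra capacity, and the tolerance band keeps small fluctuations around the target from triggering a resize.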
metrics-server / Prometheus adapter
        |
        v
[HPA controller] (every 15s)
        |
reads metric, computes ratio
        |
        v
patch Deployment.replicas
        |
        v
[more or fewer pods]
Hands-on Example
Define resource requests on your Deployment. Utilization targets for CPU and memory are percentages of the request, so HPA cannot compute them without one:
apiVersion: apps/v1
kind: Deployment
metadata: { name: api }
spec:
  replicas: 2
  selector: { matchLabels: { app: api } }
  template:
    metadata: { labels: { app: api } }
    spec:
      containers:
        - name: api
          image: myorg/api:2.1
          resources:
            requests: { cpu: 250m, memory: 256Mi }
            limits: { cpu: 1, memory: 512Mi }
Add an HPA targeting 70 percent average CPU:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api }
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 70 }
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
The behavior block makes scaling asymmetric: scale up fast (at most doubling every 30 seconds), scale down slowly (removing at most half the pods per minute, and only after the recommendation has stayed low for 5 minutes). This is a sane default for most web apps.
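To make the asymmetry concrete, here is a rough Python sketch of how the two policies constrain a single decision. It is a simplification under stated assumptions, not the controller's implementation: the function names are hypothetical, the rounding (up for scale-up, down for scale-down) is assumed, and the current replica count stands in for the count at the start of the policy period.

```python
import math

def scale_up_ceiling(current, percent=100):
    """Most replicas a Percent scale-up policy allows within one period."""
    return math.ceil(current * (1 + percent / 100))

def scale_down_target(current, recommendations, now,
                      window_seconds=300, percent=50):
    """Replica count after applying the scale-down window and rate limit.

    recommendations is a list of (timestamp_seconds, desired_replicas)
    pairs, one per control-loop iteration (sampled here once a minute
    to keep the example short).
    """
    # Stabilization: act on the highest recommendation seen in the window.
    in_window = [r for t, r in recommendations
                 if 0 <= now - t <= window_seconds]
    floor_from_window = max(in_window)
    # Rate limit: remove at most `percent` of the pods in one period.
    floor_from_policy = math.floor(current * (1 - percent / 100))
    return min(current, max(floor_from_window, floor_from_policy))

recs = [(0, 16), (60, 6), (120, 5), (180, 5), (240, 6), (300, 5), (360, 5)]

print(scale_up_ceiling(4))                     # 8: at most double per period
print(scale_down_target(16, recs, now=300))    # 16: the old peak is still in the window
print(scale_down_target(16, recs, now=360))    # 8: window allows 6, rate limit stops at 8
```

The second call shows why a long scale-down window suppresses flapping: one high recommendation anywhere in the last five minutes is enough to hold the replica count steady.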
Check status:
kubectl get hpa api
kubectl describe hpa api
Common Pitfalls
No resource requests. Without resources.requests.cpu, the HPA cannot compute utilization and reports <unknown>. The workload will never scale.
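A small Python sketch shows why the request matters: utilization is usage expressed as a percentage of the request. The helper below is hypothetical and simplified (one container per pod, values in millicores), and returning None for a missing request is this example's stand-in for the <unknown> status.

```python
def cpu_utilization_percent(usage_millicores, request_millicores):
    """CPU in use as a percentage of what the pods requested.

    Both arguments are per-pod lists in millicores. Returns None when
    any pod has no request, since there is nothing to divide by.
    """
    if not request_millicores or any(r is None for r in request_millicores):
        return None
    return 100 * sum(usage_millicores) // sum(request_millicores)

# Two pods requesting 250m each, using 200m and 150m
print(cpu_utilization_percent([200, 150], [250, 250]))   # 70
# Usage above the request is allowed (up to the limit) and reads over 100%
print(cpu_utilization_percent([300, 250], [250, 250]))   # 110
# One pod without a request: utilization cannot be computed
print(cpu_utilization_percent([200, 150], [250, None]))  # None
```

With the 250m request from the example Deployment, a 70 percent target works out to an average of 175m of CPU per pod.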
Scaling on memory for JVM apps. JVMs grab memory and rarely give it back. Memory-based HPA on Java services often gets stuck near the high water mark. Use CPU or custom metrics (queue depth, request rate) instead.
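metrics-server not installed. On vanilla clusters, kubectl top pods fails until you install metrics-server. GKE and AKS ship it by default; self-managed clusters do not, and EKS clusters have traditionally needed it installed separately (for example as an add-on).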
Flapping. Short stabilization windows plus bursty traffic cause the replica count to bounce. Increase stabilizationWindowSeconds on scale-down.
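HPA fighting with replicas. If your CI re-applies replicas: 2 on every deploy, each deploy resets the replica count and HPA undoes it seconds later. Either remove replicas from your manifest or, with server-side apply (kubectl apply --server-side), transfer ownership of spec.replicas to the HPA so your CI's field manager no longer sets it.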
No headroom on the cluster. HPA scales pods, not nodes. If the cluster is full, new pods sit Pending. Pair HPA with the Cluster Autoscaler or Karpenter so new nodes appear automatically.
Practical Tips
Custom metrics unlock far better autoscaling for queue-based workloads. Use the Prometheus Adapter or KEDA to scale on:
- HTTP requests per second per pod
- Kafka consumer lag
- SQS queue depth
- p95 latency
Example KEDA ScaledObject for SQS:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: worker }
spec:
  scaleTargetRef: { name: worker }
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123/jobs
        queueLength: "20"
        awsRegion: us-east-1
This scales workers based on actual backlog, not CPU, which is much more accurate for batch jobs.
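As a rough guide to what the queueLength value means, the Python sketch below treats it as the number of messages one worker is expected to handle, so the replica count is approximately the backlog divided by that target. The function name is hypothetical, and the sketch ignores details such as in-flight messages and the tolerance band.

```python
import math

def workers_for_backlog(visible_messages, queue_length=20,
                        min_replicas=1, max_replicas=50):
    """Approximate worker count for a backlog, clamped to the configured bounds."""
    desired = math.ceil(visible_messages / queue_length)
    return max(min_replicas, min(max_replicas, desired))

print(workers_for_backlog(130))    # 7: ceil(130 / 20)
print(workers_for_backlog(0))      # 1: held at minReplicaCount
print(workers_for_backlog(5000))   # 50: capped at maxReplicaCount
```

A lower queueLength scales out sooner for the same backlog, so pick it from how many messages one worker can clear within your latency budget.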
Always set maxReplicas to a value you can afford. A runaway upstream service can drive your costs into the stratosphere overnight.
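Pair HPA with PodDisruptionBudgets so that node drains during maintenance cannot evict too many replicas at once. PDBs limit evictions, not the HPA's own scale-down, so keep minReplicas at or above the availability floor you need.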
Wrap-up
HPA is one of Kubernetes’ best features when configured correctly. Set resource requests, pick metrics that actually correlate with load, tune the behavior block to scale up fast and down slowly, and always combine it with a node-level autoscaler. For queue-driven workloads, go straight to KEDA — it handles the metrics adapter for you and supports dozens of event sources. Done well, autoscaling lets you stop guessing capacity and start trusting the cluster.