Kubernetes Horizontal Pod Autoscaler Explained

Intermediate 9 min read

What you'll learn

✓ The control loop behind HPA
✓ CPU, memory, and custom metric scaling
✓ How stabilization windows prevent flapping
✓ How HPA interacts with Cluster Autoscaler
✓ Common failure modes and how to debug them

Prerequisites

• Familiarity with Deployments and resource requests

What and Why

The Horizontal Pod Autoscaler (HPA) automatically adjusts the number of pod replicas based on observed metrics. Traffic doubles at lunch? HPA spins up more pods. Quiet at 3 a.m.? It scales back down. Done well, it keeps latency stable without paying for peak capacity all day.

But a misconfigured HPA fails in one of two opposite ways: under-scaling (pods saturate and 5xx errors spike) or flapping (pods come and go every minute, hurting cache hit rates).

Mental Model

HPA is a control loop that runs every 15 seconds by default (configurable through the kube-controller-manager flag --horizontal-pod-autoscaler-sync-period). Each iteration:

1. Reads the current metric value (CPU, memory, or a custom metric).
2. Compares it to the target.
3. Computes desired replicas using the ratio: desired = ceil(current_replicas * current / target)
4. Applies the result, subject to min/max bounds and stabilization windows.

  metrics-server / Prometheus adapter
            |
            v
    [HPA controller]  (every 15s)
            |
 reads metric, computes ratio
            |
            v
 patch Deployment.replicas
            |
            v
     [more or fewer pods]

HPA control loop
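
The ratio formula above can be checked with a few lines of Python. This is a minimal sketch, not the controller's actual code: the function name and parameters are hypothetical, and the 10 percent tolerance band (the controller's default, inside which it leaves the replica count alone) is an addition that the steps above do not mention.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas, tolerance=0.1):
    """Sketch of one HPA iteration for a single metric (hypothetical helper)."""
    ratio = current_value / target_value
    # Assumption: small deviations from the target are ignored to avoid churn.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # desired = ceil(current_replicas * current / target)
    desired = math.ceil(current_replicas * current_value / target_value)
    # Clamp to the minReplicas / maxReplicas bounds.
    return max(min_replicas, min(max_replicas, desired))


# 4 pods averaging 90 percent CPU against a 70 percent target
print(desired_replicas(4, 90, 70, min_replicas=2, max_replicas=20))  # 6
```

With 4 pods at 90 percent, the ratio is about 1.29, so the sketch asks for ceil(5.14) = 6 pods. At 72 percent the ratio falls inside the tolerance band and nothing changes.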

Hands-on Example

Define resource requests on your Deployment. A Utilization target is a percentage of the request, so HPA cannot compute CPU or memory utilization without one:

apiVersion: apps/v1
kind: Deployment
metadata: { name: api }
spec:
  replicas: 2
  selector: { matchLabels: { app: api } }
  template:
    metadata: { labels: { app: api } }
    spec:
      containers:
        - name: api
          image: myorg/api:2.1
          resources:
            requests: { cpu: 250m, memory: 256Mi }
            limits:   { cpu: 1,    memory: 512Mi }

Add an HPA targeting 70 percent average CPU:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api }
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 70 }
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60

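The behavior block makes scaling asymmetric. Scale-up is fast: the replica count can at most double every 30 seconds. Scale-down is slow: at most half the replicas are removed per minute, and only once the recommendation has stayed low for the full 5-minute stabilization window. This is a sane default for most web apps.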
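
To see what the Percent policies mean in practice, the sketch below walks a replica count toward a new target one policy period at a time. It is a simplification under stated assumptions: the helper names are hypothetical, each call stands for one periodSeconds interval, stabilization windows and minReplicas/maxReplicas are ignored, and the rounding may differ from what the real controller does.

```python
import math


def limit_step(current, desired, up_percent=100, down_percent=50):
    """Clamp a single scaling step to the Percent policies (hypothetical helper)."""
    if desired > current:
        # scaleUp: grow by at most up_percent of the current count per period
        ceiling = math.ceil(current * (1 + up_percent / 100))
        return min(desired, ceiling)
    # scaleDown: shrink by at most down_percent of the current count per period
    floor = math.floor(current * (1 - down_percent / 100))
    return max(desired, floor)


def trajectory(start, desired, max_periods=20):
    """Replica counts after each policy period until the target is reached."""
    path = [start]
    for _ in range(max_periods):
        if path[-1] == desired:
            break
        path.append(limit_step(path[-1], desired))
    return path


print(trajectory(4, 20))  # [4, 8, 16, 20]
```

Under these assumptions, going from 4 to 20 replicas takes three scale-up periods (about 90 seconds with periodSeconds: 30), while the same distance downward is spread over several one-minute periods.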

Check status:

kubectl get hpa api
kubectl describe hpa api

Common Pitfalls

No resource requests. Without resources.requests.cpu, the HPA cannot compute utilization and reports <unknown>. The Deployment will never scale.
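
A rough sketch of why the request matters: utilization is usage expressed as a percentage of the request, so a missing request leaves nothing to divide by. The function name and the None-for-unknown convention are hypothetical, and the real controller handles more cases (unready pods, missing samples) than this illustration does.

```python
def cpu_utilization_percent(usage_millicores, request_millicores):
    """Utilization across pods in whole percent, or None if unknown (hypothetical helper)."""
    # A pod without a CPU request makes the percentage undefined.
    if any(request is None for request in request_millicores):
        return None
    total_request = sum(request_millicores)
    if total_request == 0:
        return None
    # Total usage as a percentage of total requests.
    return 100 * sum(usage_millicores) // total_request


print(cpu_utilization_percent([200, 150], [250, 250]))   # 70
print(cpu_utilization_percent([200, 150], [250, None]))  # None
```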

Scaling on memory for JVM apps. JVMs grab memory and rarely give it back, so a memory-based HPA on a Java service often gets stuck near the high-water mark and never scales down. Use CPU or custom metrics (queue depth, request rate) instead.

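metrics-server not installed. kubectl top pods and CPU/memory-based HPAs both fail until metrics-server is running. GKE ships it by default; on many other clusters, including self-managed ones and, depending on how the cluster was created, EKS, you have to install it yourself.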

Flapping. Short stabilization windows plus bursty traffic cause the replica count to bounce. Increase stabilizationWindowSeconds on scale-down.
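
The effect of the scale-down window can be illustrated with a small simulation. This sketch assumes the documented scale-down rule (use the highest recommendation seen within the window) and leaves scale-ups undelayed. The class, the replay helper, and the sample traffic are all hypothetical.

```python
from collections import deque


class ScaleDownStabilizer:
    """Keeps recent recommendations and delays scale-down (hypothetical sketch)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.history = deque()  # (timestamp, recommended replicas)

    def recommend(self, now, current_replicas, desired):
        self.history.append((now, desired))
        # Drop recommendations that are older than the window.
        cutoff = now - self.window_seconds
        while self.history[0][0] < cutoff:
            self.history.popleft()
        if desired >= current_replicas:
            return desired  # scale-ups are not delayed in this sketch
        # Scale down only as far as the highest recent recommendation.
        return min(current_replicas, max(rec for _, rec in self.history))


def replay(window_seconds, samples, start_replicas):
    """Feed (timestamp, desired) samples through the stabilizer."""
    stabilizer = ScaleDownStabilizer(window_seconds)
    current, counts = start_replicas, []
    for now, desired in samples:
        current = stabilizer.recommend(now, current, desired)
        counts.append(current)
    return counts


bursty = [(0, 10), (15, 4), (30, 9), (45, 4), (60, 9)]
print(replay(0, bursty, start_replicas=10))    # [10, 4, 9, 4, 9]
print(replay(300, bursty, start_replicas=10))  # [10, 10, 10, 10, 10]
```

With no window, the replica count follows every dip in the bursty samples. With a 300-second window, the short dips are absorbed and the count stays at 10.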

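HPA fighting with replicas. If your CI re-applies replicas: 2 on every deploy, each deploy resets the replica count to 2 and HPA has to scale back up seconds later. Either remove replicas from your manifest or, with server-side apply (kubectl apply --server-side --field-manager=<name>), make sure your CI's field manager does not claim the replicas field, so ownership stays with the HPA.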

No headroom on the cluster. HPA scales pods, not nodes. If the cluster is full, new pods sit Pending. Pair HPA with the Cluster Autoscaler or Karpenter so new nodes appear automatically.

Practical Tips

Custom metrics unlock far better autoscaling for queue-based and request-driven workloads. Use the Prometheus Adapter or KEDA to scale on:

• HTTP requests per second per pod
• Kafka consumer lag
• SQS queue depth
• p95 latency

Example KEDA ScaledObject for SQS:

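apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: worker }
spec:
  scaleTargetRef: { name: worker }   # the Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      # AWS credentials (for example via a TriggerAuthentication) are not shown here
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123/jobs
        queueLength: "20"            # target number of messages per replica
        awsRegion: us-east-1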

This scales workers based on actual backlog, not CPU, which is much more accurate for batch jobs.
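
As a back-of-the-envelope check, the arithmetic behind a queue-length target looks roughly like the sketch below. The function name is hypothetical, the defaults mirror the ScaledObject above, and the sketch ignores in-flight messages and any KEDA-specific thresholds.

```python
import math


def workers_for_backlog(visible_messages, messages_per_worker=20,
                        min_workers=1, max_workers=50):
    """Replicas needed so each worker handles about messages_per_worker (hypothetical helper)."""
    wanted = math.ceil(visible_messages / messages_per_worker)
    # Clamp to minReplicaCount / maxReplicaCount.
    return max(min_workers, min(max_workers, wanted))


print(workers_for_backlog(130))   # 7
print(workers_for_backlog(5000))  # 50
```

A backlog of 130 messages at 20 per worker calls for 7 workers, whatever the CPU usage of the existing pods happens to be.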

Always set maxReplicas to a value you can afford. A runaway upstream service can drive your costs into the stratosphere overnight.

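Pair HPA with PodDisruptionBudgets so node drains and other voluntary disruptions during maintenance cannot take too many replicas offline at once. A PDB does not limit HPA scale-down itself; minReplicas is the floor for that.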

Wrap-up

HPA is one of Kubernetes’ best features when configured correctly. Set resource requests, pick metrics that actually correlate with load, tune the behavior block to scale up fast and down slowly, and always combine it with a node-level autoscaler. For queue-driven workloads, go straight to KEDA — it handles the metrics adapter for you and supports dozens of event sources. Done well, autoscaling lets you stop guessing capacity and start trusting the cluster.