Topic 27

Horizontal Pod Autoscaler

Autoscaling · Metrics

The Horizontal Pod Autoscaler (HPA) adjusts the number of Pod replicas to match load. It watches a metric (CPU utilization, memory, or a custom signal) and scales a workload such as a Deployment out or in to keep that metric near a target. It is how a workload absorbs traffic swings without manual intervention.

The HPA is a control loop like everything else in Kubernetes, with a few specific behaviors — a metrics dependency, a stabilization window, and a hard requirement on resource requests — that determine whether it works smoothly or thrashes.

How the HPA Computes Replicas

The HPA control loop

Read metric (avg CPU vs request) → Compare (to target, 50%) → Compute (desired replicas) → Scale (the Deployment) → Wait (stabilize, then repeat)

The HPA periodically (every 15 seconds by default) reads the current metric value across the Pods, compares it to the target, and computes the replica count that would bring the metric to target: desired = ceil(current_replicas × (current_metric ÷ target_metric)). If average CPU is 80% against a 50% target, the ratio is 1.6, so it scales out by about 60%. The result is then clamped to the configured minReplicas and maxReplicas.
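As a worked example of the replica calculation, assume a hypothetical Deployment currently running 5 replicas (the starting count is an assumption for illustration; the 80% and 50% figures come from the text above). The result is rounded up to a whole Pod:

```latex
\[
\text{desired}
  = \left\lceil \text{current\_replicas} \times \frac{\text{current\_metric}}{\text{target\_metric}} \right\rceil
  = \left\lceil 5 \times \frac{80}{50} \right\rceil
  = \lceil 8.0 \rceil
  = 8
\]

% Clamping, assuming the bounds minReplicas = 2 and maxReplicas = 20
% and a hypothetical starting point of 30 replicas at the same 80% reading:
\[
\left\lceil 30 \times \frac{80}{50} \right\rceil = 48
\quad\Longrightarrow\quad
\min\bigl(\max(48,\ 2),\ 20\bigr) = 20
\]
```

Note that the controller also applies a tolerance (10% by default) around the target: a reading of 52% against a 50% target gives a ratio of 1.04, which is inside the tolerance, so no scaling happens.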

An HPA targeting 50% CPU

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50

It Needs Requests and a Metrics Source

CPU and memory utilization are measured relative to the Pod's request: "80% CPU" means 80% of the requested CPU. An HPA on CPU therefore cannot compute utilization if a container has no CPU request, and it takes no scaling action on that metric; this is one of the most common reasons an HPA "isn't scaling." The HPA also needs a metrics source: metrics-server for CPU/memory, or an adapter exposing custom and external metrics (often backed by Prometheus) for anything else.
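A minimal sketch of a Deployment that a CPU-based HPA could act on. The image name, labels, and request sizes are assumptions chosen for illustration; the part that matters is that every container declares a CPU request.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # must match scaleTargetRef.name in the HPA
spec:
  # replicas is deliberately omitted so the HPA owns the replica count
  selector:
    matchLabels:
      app: web               # hypothetical label
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.0   # hypothetical image
          resources:
            requests:
              cpu: 250m      # baseline for the utilization percentage
              memory: 256Mi
```

With a 250m request, a 50% utilization target corresponds to an average of 125m of CPU per Pod. To check that a metrics source is actually serving data, `kubectl top pods` should return usage figures, and `kubectl describe hpa web` shows the current reading against the target.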

Stabilization and Thrashing

Figure: replica count lags the load, and the stabilization window smooths it.

Naive scaling oscillates: scale out on a spike, scale in immediately, repeat. The HPA dampens this with a stabilization window, over which it considers recent recommendations rather than only the latest one. By default the scale-down window is 300 seconds and scale-up has no window, so it scales in conservatively while scaling out quickly. Configurable scaling policies cap how fast replicas change. Tuning these is how you trade responsiveness against stability.
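The stabilization window and scaling policies live under `spec.behavior` in the autoscaling/v2 API. The fragment below is a sketch of one possible tuning, not a recommendation; the specific percentages, Pod counts, and periods are assumptions to adjust per workload.

```yaml
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # use the highest recommendation from the last 5 minutes
      policies:
        - type: Percent
          value: 50                     # remove at most 50% of current replicas...
          periodSeconds: 60             # ...per 60-second period
    scaleUp:
      stabilizationWindowSeconds: 0     # react to spikes immediately
      policies:
        - type: Pods
          value: 4                      # add up to 4 Pods per period
          periodSeconds: 60
        - type: Percent
          value: 100                    # or double the replica count per period
          periodSeconds: 60
      selectPolicy: Max                 # apply whichever policy allows the larger change
```

Lengthening the scale-down window or lowering the scale-down percentage makes the workload hold capacity longer after a burst, which costs more but reduces oscillation.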

Custom and External Metrics

CPU is a poor proxy for many workloads. Queue-based systems scale better on queue depth; web services on requests-per-second or latency. The HPA's autoscaling/v2 API supports custom metrics (from inside the cluster) and external metrics (from outside, like a cloud queue length). For event-driven scale-to-zero, KEDA extends the HPA with dozens of event sources. Choosing the metric that actually reflects load is more important than tuning the loop.
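A sketch of an HPA that scales on a per-Pod custom metric and an external metric. The metric names (`http_requests_per_second`, `queue_messages_ready`), the `queue: orders` selector, the Deployment name, and the target values are all hypothetical; real names depend on what the installed metrics adapter exposes.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker                 # hypothetical workload
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 1
  maxReplicas: 30
  metrics:
    # Custom metric, averaged across the target's Pods
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # hypothetical metric name
        target:
          type: AverageValue
          averageValue: "100"              # aim for about 100 RPS per Pod
    # External metric, sourced from outside the cluster
    - type: External
      external:
        metric:
          name: queue_messages_ready       # hypothetical metric name
          selector:
            matchLabels:
              queue: orders                # hypothetical label
        target:
          type: AverageValue
          averageValue: "30"               # aim for about 30 queued messages per Pod
```

When several metrics are listed, the HPA computes a desired replica count for each and uses the largest, so the workload scales to satisfy the most demanding signal.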

HPA vs VPA

HPA: changes the number of replicas (scale out/in). For stateless workloads that handle more load with more copies.

VPA: changes each Pod's CPU/memory requests (scale up/down). For right-sizing; conflicts with HPA on the same metric (Topic 28).

Common Mistakes

Running an HPA on CPU with no CPU request set on the container, so utilization is undefined and it never scales.
Scaling on CPU when the real load signal is queue depth or request rate.
No stabilization tuning, so replicas oscillate on bursty traffic.
Running an HPA and a VPA on the same resource metric, so they fight.
Setting maxReplicas so high that a runaway scale-out runs up a huge bill or exhausts the cluster.

Best Practices

Always set the resource request the HPA scales on — utilization is relative to it.
Scale on the metric that truly reflects load (RPS, latency, queue depth), not CPU by default.
Tune the stabilization window and policies to stop thrashing on bursty traffic.
Set sensible min and max replicas to bound both availability and cost.
Use KEDA for event-driven or scale-to-zero scenarios the core HPA doesn't cover.

Related

VPA & Cluster Autoscaler: the other two scaling axes (Topic 28)
Metrics & monitoring: where custom metrics come from (Topic 46)
KEDA / cloud autoscaling: event-driven and managed analogs

Knowledge Check

Why does an HPA targeting CPU utilization fail to scale if the container has no CPU request?

Utilization is measured relative to the request; with no request there is no baseline percentage to compute
HPAs require a LoadBalancer Service to read external traffic metrics
Without a CPU request the scheduler simply rejects the Pod outright and it never reaches a node to begin running
CPU metrics are only collected for Pods in the Guaranteed QoS class

What is the difference between the HPA and the VPA?

HPA changes the replica count; VPA changes each Pod's CPU/memory
HPA adds and removes nodes; VPA adds and removes Pods
They are one controller toggled between a horizontal and vertical mode
HPA is meant for stateful sets; VPA for stateless deployments

Why is CPU often a poor metric to autoscale a queue worker on?

Queue depth reflects backlog and load far better than CPU for such workloads
CPU usage cannot be measured for background queue workers
The HPA cannot read CPU metrics from a worker Deployment
Queue workers sit idle and never consume any measurable CPU at all, even while draining a backlog
