Horizontal Pod Autoscaler
The Horizontal Pod Autoscaler (HPA) adjusts the number of Pod replicas to match load. It watches a metric — CPU utilization, memory, or a custom signal — and scales a Deployment up or down to keep that metric near a target. It is how a workload absorbs traffic swings without manual intervention.
Like most Kubernetes controllers, the HPA is a control loop. Three specific behaviors determine whether it works smoothly or thrashes: a dependency on a metrics source, a stabilization window, and a hard requirement on resource requests when the target is a utilization percentage.
How the HPA Computes Replicas
On each sync (every 15 seconds by default), the HPA reads the current metric value across the Pods, compares it to the target, and computes the replica count that would bring the metric back to target: desired = ceil(current_replicas × (current_metric ÷ target_metric)). If average CPU is 80% against a 50% target, the ratio is 1.6 and the HPA scales out by that factor. Ratios within a small tolerance of 1.0 (10% by default) trigger no change. The result is then clamped to the configured minReplicas and maxReplicas.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
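As a worked instance of the replica formula, the arithmetic below uses the manifest's 50% target and its bounds of 2 and 20. The starting replica counts (5 and 15) and the observed 80% utilization are assumptions chosen for illustration, and the ceiling reflects that the controller rounds up to a whole Pod.

```latex
% General form, using the same names as the text above
\[
\text{desired} = \left\lceil \text{current\_replicas} \times
  \frac{\text{current\_metric}}{\text{target\_metric}} \right\rceil
\]

% Assumed case 1: 5 replicas at 80% average CPU, target 50%
\[
\left\lceil 5 \times \frac{80}{50} \right\rceil = \lceil 8 \rceil = 8,
\qquad \min(\max(8,\, 2),\, 20) = 8
\]

% Assumed case 2: 15 replicas at 80% average CPU, target 50%
\[
\left\lceil 15 \times \frac{80}{50} \right\rceil = \lceil 24 \rceil = 24,
\qquad \min(\max(24,\, 2),\, 20) = 20
\]
```

In the second case the formula asks for 24 replicas, but maxReplicas caps the result at 20, so the metric stays above target until load drops or the cap is raised.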
It Needs Requests and a Metrics Source
CPU and memory utilization are measured relative to the Pod's request: "80% CPU" means 80% of the requested CPU. If any container in the Pod lacks a CPU request, utilization is undefined and a CPU-based HPA takes no action on that metric; this is the single most common reason an HPA "isn't scaling." The HPA also needs a metrics source: metrics-server for CPU/memory, or an adapter exposing custom and external metrics (often backed by Prometheus) for anything else.
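To make the request requirement concrete, here is a minimal sketch of a Deployment that a CPU-based HPA could target. The image name, labels, and request sizes are hypothetical placeholders, not values from this lesson.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  # spec.replicas is left unset here so the HPA owns the replica count.
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: example.com/web:1.0   # hypothetical image
        resources:
          requests:
            cpu: 250m      # utilization is measured against this value
            memory: 256Mi  # needed only if the HPA also targets memory
```

With a 250m request and a 50% utilization target, the HPA aims for roughly 125m of average CPU usage per Pod. Every container in the Pod needs the request, including sidecars.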
Stabilization and Thrashing
Naive scaling oscillates: scale out on a spike, scale in immediately, repeat. The HPA dampens this in two ways. A stabilization window makes it consider recent recommendations rather than only the latest one; by default the scale-down window is 300 seconds while scale-up has no window, so replicas are added quickly and removed conservatively. Configurable scaling policies then cap how fast replicas change. Tuning these is how you trade responsiveness against stability.
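The stabilization window and scaling policies live under the behavior field of the autoscaling/v2 API. The sketch below extends the earlier web HPA; the specific numbers (a 600-second scale-down window, 2 Pods per minute, 100% growth per minute) are illustrative assumptions, not recommended values.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # react to spikes immediately
      policies:
      - type: Percent
        value: 100                     # at most double the replicas...
        periodSeconds: 60              # ...within any 60-second period
    scaleDown:
      stabilizationWindowSeconds: 600  # act on the highest recommendation of the last 10 minutes
      policies:
      - type: Pods
        value: 2                       # remove at most 2 Pods...
        periodSeconds: 60              # ...within any 60-second period
```

A longer scale-down window keeps capacity around through short lulls in bursty traffic, at the cost of paying for idle replicas a little longer.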
Custom and External Metrics
CPU is a poor proxy for many workloads. Queue-based systems scale better on queue depth; web services on requests-per-second or latency. The HPA's autoscaling/v2 API supports custom metrics (describing Kubernetes objects, such as a per-Pod request rate) and external metrics (from systems outside the cluster, like a cloud queue length). For event-driven scale-to-zero, KEDA extends the HPA with dozens of event sources. Choosing the metric that actually reflects load is more important than tuning the loop.
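The following sketch shows how custom and external metrics could appear in an autoscaling/v2 manifest. The Deployment name worker, the metric names, the label selector, and the target values are all hypothetical; the real names depend on what the installed metrics adapter exposes.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker                         # hypothetical Deployment
  minReplicas: 1
  maxReplicas: 30
  metrics:
  # Custom metric: a per-Pod value served by a custom metrics adapter.
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # hypothetical metric name
      target:
        type: AverageValue
        averageValue: "100"              # aim for about 100 requests/s per Pod
  # External metric: a value from outside the cluster, such as queue length.
  - type: External
    external:
      metric:
        name: queue_messages_ready       # hypothetical metric name
        selector:
          matchLabels:
            queue: orders                # hypothetical label
      target:
        type: AverageValue
        averageValue: "30"               # aim for about 30 queued messages per Pod
```

When several metrics are listed, the HPA computes a replica count for each and uses the largest, so the workload scales on whichever signal is under the most pressure.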
HPA vs. VPA
HPA: changes the number of replicas (scale out/in). For stateless workloads that handle more load with more copies.
VPA: changes each Pod's CPU/memory (scale up/down). For right-sizing; conflicts with HPA on the same metric (Topic 28).
Common Pitfalls
- Running an HPA on CPU with no CPU request set on the container, so utilization is undefined and it never scales.
- Scaling on CPU when the real load signal is queue depth or request rate.
- No stabilization tuning, so replicas oscillate on bursty traffic.
- Running an HPA and a VPA on the same resource metric, so they fight.
- Setting maxReplicas so high that a runaway scales into a huge bill or exhausts the cluster.
Best Practices
- Always set the resource request the HPA scales on, because utilization is relative to it.
- Scale on the metric that truly reflects load (RPS, latency, queue depth), not CPU by default.
- Tune the stabilization window and policies to stop thrashing on bursty traffic.
- Set sensible min and max replicas to bound both availability and cost.
- Use KEDA for event-driven or scale-to-zero scenarios the core HPA doesn't cover.
Knowledge Check
Why does an HPA targeting CPU utilization fail to scale if the container has no CPU request?
- Utilization is measured relative to the request; with no request there is no baseline percentage to compute
- HPAs require a LoadBalancer Service to read external traffic metrics
- Without a CPU request the scheduler simply rejects the Pod outright and it never reaches a node to begin running
- CPU metrics are only collected for Pods in the Guaranteed QoS class
What is the difference between the HPA and the VPA?
- HPA changes the replica count; VPA changes each Pod's CPU/memory
- HPA adds and removes nodes; VPA adds and removes Pods
- They are one controller toggled between a horizontal and vertical mode
- HPA is meant for stateful sets; VPA for stateless deployments
Why is CPU often a poor metric to autoscale a queue worker on?
- Queue depth reflects backlog and load far better than CPU for such workloads
- CPU usage cannot be measured for background queue workers
- The HPA cannot read CPU metrics from a worker Deployment
- Queue workers sit idle and never consume any measurable CPU at all, even while draining a backlog