Topic 29

Probes

Health · Lifecycle

Kubernetes decides whether a container is healthy and ready by running probes against it. There are three — readiness, liveness, and startup — and each gates something different. Configured well, they make deployments safe and self-healing; configured badly, they turn a healthy app into a restart loop.

Probes are deceptively simple and a frequent source of self-inflicted outages. The key is understanding precisely what each one controls, because the failure modes differ sharply.

Readiness: Gating Traffic

Readiness: fail → removed from Service endpoints (no traffic). Not restarted.
Liveness: fail → the container is restarted. Should test only the process.
Startup: runs first; suspends the other two so a slow boot isn't killed.

A readiness probe answers "can this Pod serve requests right now?" While it fails, the Pod is removed from its Service's endpoints, so no traffic is sent to it — but the Pod is not restarted. This is what makes rolling updates safe: a new Pod receives traffic only once it reports ready. Readiness is also the right tool for a Pod that is temporarily busy or waiting on a dependency — it sheds traffic without being killed.

Three probes on a container

spec:
  containers:
    - name: app
      image: my-app:1.0
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 30      # allow up to ~5 min to start
        periodSeconds: 10

Liveness: Gating Restarts

A liveness probe answers "is this container still working, or wedged?" When it fails past its threshold, the kubelet restarts the container. This recovers from deadlocks and stuck processes a crash wouldn't catch. The danger: if the liveness probe checks something it shouldn't — a slow dependency, or the same heavy logic as readiness — a transient blip restarts a perfectly healthy container, and under load every replica restarts at once. Liveness should test only that the process itself is alive.
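
As a minimal sketch of such a process-only check, the fragment below assumes the app exposes a cheap /healthz handler that returns 200 without calling any dependency. The path and port follow the example above; the timing values are illustrative assumptions, not recommendations.

```yaml
livenessProbe:
  httpGet:
    path: /healthz        # assumed handler: answers from the process itself, no database or downstream calls
    port: 8080
  periodSeconds: 10       # probe every 10 s
  timeoutSeconds: 2       # a probe that takes longer than 2 s counts as a failure
  failureThreshold: 3     # restart only after 3 consecutive failures, roughly 30 s
```

With values like these, a single slow response does not trigger a restart; the container has to fail three probes in a row.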

Startup: Protecting Slow Boots

A startup probe exists for apps that take a long time to initialize. Until it succeeds, the liveness and readiness probes are suspended, so a slow boot is not mistaken for a wedged process and killed mid-startup. Once the startup probe succeeds, the other two take over; if it never succeeds within failureThreshold × periodSeconds, the kubelet kills the container and the Pod's restartPolicy applies. Without a startup probe, you would have to loosen the liveness probe's timing for the container's whole lifetime just to accommodate startup; with one, you can keep liveness tight for the running state.
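
To illustrate the trade-off, here is a sketch of both options for an app that may need up to five minutes to boot. The container names and all timings are assumptions chosen for the example.

```yaml
spec:
  containers:
    # Option A (hypothetical): no startup probe, so liveness is loosened instead.
    - name: app-without-startup-probe
      image: my-app:1.0
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 300   # liveness stays off for 5 min, even if the app boots in 20 s
        periodSeconds: 10

    # Option B (hypothetical): a startup probe absorbs the slow boot.
    - name: app-with-startup-probe
      image: my-app:1.0
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 30       # 30 x 10 s = up to 300 s to start
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 5
        failureThreshold: 3        # once started, a wedge is caught in roughly 15 s
```

In option B the startup budget ends as soon as the first startup probe succeeds, so a fast boot hands over to liveness early instead of waiting out a fixed delay.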

Probe Types and Tuning

Probes can be HTTP GET, TCP socket, exec (run a command), or gRPC. Each takes the same tuning fields: initialDelaySeconds, periodSeconds, timeoutSeconds, successThreshold, and failureThreshold. Exec probes are the most expensive, because they fork a process on every run, so prefer HTTP or gRPC where possible. The cardinal rule is to keep liveness and readiness distinct: readiness may check dependencies, liveness must not.
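
The four mechanisms differ only in the handler block; the tuning fields sit alongside it in every case. The sketch below shows one of each. Container names, images, ports, and the /tmp/healthy file are hypothetical placeholders.

```yaml
spec:
  containers:
    - name: web                      # HTTP GET: a 2xx or 3xx response counts as success
      image: my-app:1.0
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5       # wait 5 s after container start before the first probe
        periodSeconds: 5
        timeoutSeconds: 1
        successThreshold: 1
        failureThreshold: 3

    - name: cache                    # TCP socket: success if the port accepts a connection
      image: my-cache:1.0
      readinessProbe:
        tcpSocket:
          port: 6379

    - name: rpc                      # gRPC: uses the standard gRPC health checking protocol
      image: my-rpc:1.0
      readinessProbe:
        grpc:
          port: 9090

    - name: worker                   # exec: success if the command exits with status 0
      image: my-worker:1.0
      readinessProbe:
        exec:
          command: ["cat", "/tmp/healthy"]
        periodSeconds: 30            # longer interval, since each run forks a process
```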

Liveness vs readiness vs startup

Readiness — fails → removed from Service endpoints (no traffic), not restarted. Gates traffic.

Liveness — fails → container restarted. Gates recovery from wedged processes.

Startup — runs first; suspends the other two so a slow boot isn't killed. Gates startup.

Common Mistakes

Pointing the liveness probe at a dependency, so a dependency blip restarts healthy containers.
Using the same heavy check for liveness and readiness, causing restart storms under load.
No readiness probe, so rolling updates send traffic to Pods that haven't finished starting.
No startup probe for a slow-booting app, so liveness kills it mid-initialization.
Expensive exec probes on a tight interval, adding load and false failures.

Best Practices

Keep liveness minimal — test only that the process itself is alive, never external dependencies.
Use readiness to gate traffic and to shed load while a Pod is busy or waiting on a dependency.
Add a startup probe for slow-booting apps so you can keep liveness tight for the running state.
Prefer HTTP or gRPC probes over exec; tune thresholds to the app's real behavior.
Always define readiness before relying on rolling updates to be zero-downtime.
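
As a sketch of the last point, the Deployment below pairs a readiness probe with a rolling-update strategy that never removes an old Pod before its replacement is ready. The Deployment name, labels, image tag, and replica count are assumptions for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # hypothetical name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0        # keep the full ready count throughout the rollout
      maxSurge: 1              # bring up one extra Pod at a time
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:1.1    # hypothetical new version being rolled out
          readinessProbe:      # without this, a Pod counts as ready as soon as its container starts
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
```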

Related

ReplicaSets and Deployments: readiness gates rolling updates (Topic 06)
Services: readiness controls endpoint membership (Topic 08)
Cloud LB health checks: the external analog of readiness

Knowledge Check

What happens when a readiness probe fails?

The Pod is removed from its Service's endpoints (no traffic) but not restarted
The container is restarted in place right away by the kubelet on that same node
The Pod is evicted and rescheduled onto another node that has free capacity
Nothing happens — readiness is purely advisory

Why is pointing a liveness probe at an external dependency dangerous?

A dependency blip makes liveness fail and restarts healthy containers, potentially all at once under load
Liveness probes are simply forbidden from making any outbound network call to an external dependency at all
It permanently removes the Pod from the Service's endpoint list
It silently disables the container's readiness probe

What does a startup probe protect against?

A slow-initializing app being killed by the liveness probe before it finishes booting
Traffic reaching the Pod before its Service DNS name resolves
The scheduler placing the Pod on a node that lacks capacity
The Pod exceeding its configured memory limit and being OOM-killed during a traffic spike
