Probes
Topic 29

Kubernetes decides whether a container is healthy and ready by running probes against it. There are three — readiness, liveness, and startup — and each gates something different. Configured well, they make deployments safe and self-healing; configured badly, they turn a healthy app into a restart loop.

Probes are deceptively simple and a frequent source of self-inflicted outages. The key is understanding precisely what each one controls, because the failure modes differ sharply.

Readiness: Gating Traffic

Readiness: fail → removed from Service endpoints (no traffic). Not restarted.
Liveness: fail → the container is restarted. Tests only the process.
Startup: runs first; suspends the other two so a slow boot isn't killed.

A readiness probe answers "can this Pod serve requests right now?" While it fails, the Pod is removed from its Service's endpoints, so no traffic is sent to it — but the Pod is not restarted. This is what makes rolling updates safe: a new Pod receives traffic only once it reports ready. Readiness is also the right tool for a Pod that is temporarily busy or waiting on a dependency — it sheds traffic without being killed.

Three probes on a container
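spec:
  containers:
    - name: app
      image: my-app:1.0
      readinessProbe:             # gates traffic: failing removes the Pod from Service endpoints
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
      livenessProbe:              # gates restarts: failing past the threshold restarts the container
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
      startupProbe:               # runs first: the other two wait until it succeeds
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 30      # 30 x 10 s: allow up to ~5 min to start
        periodSeconds: 10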

Liveness: Gating Restarts

A liveness probe answers "is this container still working, or wedged?" When it fails past its threshold, the kubelet restarts the container. This recovers from deadlocks and stuck processes a crash wouldn't catch. The danger: if the liveness probe checks something it shouldn't — a slow dependency, or the same heavy logic as readiness — a transient blip restarts a perfectly healthy container, and under load every replica restarts at once. Liveness should test only that the process itself is alive.
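
As a rough illustration of a cheap, dependency-free liveness check, the fragment below sketches a liveness probe with explicit timing. The /healthz path and port 8080 follow the example earlier in this topic; the timing values are assumptions for illustration, not recommendations. With these settings, a wedged container is restarted after roughly periodSeconds × failureThreshold = 10 × 3 = 30 seconds of consecutive failures.

```yaml
# Sketch only: values are illustrative assumptions; tune them to the app.
livenessProbe:
  httpGet:
    path: /healthz        # should only confirm the process can respond; no database or downstream calls
    port: 8080
  periodSeconds: 10       # probe every 10 s
  timeoutSeconds: 2       # a probe that takes longer than 2 s counts as a failure
  failureThreshold: 3     # restart after 3 consecutive failures (about 30 s)
```

A higher failureThreshold trades slower recovery for more tolerance of transient blips.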

Startup: Protecting Slow Boots

A startup probe exists for apps that take a long time to initialize. Until it succeeds, the liveness and readiness probes are suspended, so a slow boot is not mistaken for a wedged process and killed mid-startup. Once the startup probe succeeds, the other two take over; if it never succeeds within failureThreshold × periodSeconds, the kubelet kills the container and the Pod's restartPolicy applies. Without a startup probe, you would have to loosen the liveness probe's timing for the container's entire lifetime just to accommodate startup. The startup probe lets you keep liveness tight for the running state.
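
The startup budget is simple arithmetic: failureThreshold × periodSeconds. The fragment below is a sketch that assumes an app exposing /healthz on port 8080 and needing up to about five minutes to boot; the numbers are illustrative, not measured.

```yaml
# Sketch only: assumes /healthz on port 8080 and a boot time of up to ~5 minutes.
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30    # budget: 30 x 10 s = 300 s before the container is killed
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3     # stays tight: about 30 s to detect a wedged process after startup succeeds
```

Because the startup probe stops running once it succeeds, an app that boots in 20 seconds is not penalized by the 300-second budget.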

Probe Types and Tuning

Probes can be HTTP GET, TCP socket, exec (run a command inside the container), or gRPC. Each is tuned with the same fields: initialDelaySeconds, periodSeconds, timeoutSeconds, successThreshold, and failureThreshold. Exec probes are the most expensive, because they start a new process in the container on every run, so prefer HTTP or gRPC where possible. The cardinal rule is to keep liveness and readiness distinct: readiness may check dependencies, liveness must not.
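
To make the four handler types concrete, here is a sketch with one hypothetical container per handler. The container names, images, ports, paths, and the exec command are all assumptions for illustration. The gRPC handler also assumes the app implements the standard gRPC Health Checking Protocol.

```yaml
# Sketch only: one container per handler type; all names, ports, and paths are hypothetical.
spec:
  containers:
    - name: http-app
      image: my-app:1.0
      readinessProbe:
        httpGet:                  # succeeds on any status from 200 to 399
          path: /ready
          port: 8080
        initialDelaySeconds: 5    # wait before the first probe
        periodSeconds: 5          # interval between probes
        timeoutSeconds: 1         # per-probe timeout
        successThreshold: 1       # consecutive successes needed to pass
        failureThreshold: 3       # consecutive failures needed to fail
    - name: tcp-app
      image: my-tcp-app:1.0
      livenessProbe:
        tcpSocket:                # succeeds if the port accepts a connection
          port: 5432
    - name: grpc-app
      image: my-grpc-app:1.0
      livenessProbe:
        grpc:                     # calls the gRPC health service on this port
          port: 9090
    - name: exec-app
      image: my-exec-app:1.0
      livenessProbe:
        exec:                     # succeeds if the command exits with status 0
          command: ["cat", "/tmp/healthy"]
        periodSeconds: 30         # longer interval to limit the cost of running a command
```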

Liveness vs readiness vs startup

Readiness — fails → removed from Service endpoints (no traffic), not restarted. Gates traffic.

Liveness — fails → container restarted. Gates recovery from wedged processes.

Startup — runs first; suspends the other two so a slow boot isn't killed. Gates startup.

Common Mistakes
  • Pointing the liveness probe at a dependency, so a dependency blip restarts healthy containers.
  • Using the same heavy check for liveness and readiness, causing restart storms under load.
  • No readiness probe, so rolling updates send traffic to Pods that haven't finished starting.
  • No startup probe for a slow-booting app, so liveness kills it mid-initialization.
  • Expensive exec probes on a tight interval, adding load and false failures.
Best Practices
  • Keep liveness minimal — test only that the process itself is alive, never external dependencies.
  • Use readiness to gate traffic and to shed load while a Pod is busy or waiting on a dependency.
  • Add a startup probe for slow-booting apps so you can keep liveness tight for the running state.
  • Prefer HTTP or gRPC probes over exec; tune thresholds to the app's real behavior.
  • Always define readiness before relying on rolling updates to be zero-downtime.
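
The last practice can be sketched as a Deployment whose rollout waits on readiness. The names, image, replica count, and strategy values below are illustrative assumptions, and this is one possible configuration rather than a guaranteed recipe for zero downtime.

```yaml
# Sketch only: names, image, replica count, and strategy values are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0         # never remove an old Pod before a replacement is ready
      maxSurge: 1               # add one new Pod at a time
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:1.0
          readinessProbe:       # the rollout counts a new Pod as available only once this passes
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
```
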
Related
  • ReplicaSets and Deployments: readiness gates rolling updates (Topic 06)
  • Services: readiness controls endpoint membership (Topic 08)
  • Cloud LB health checks: the external analog of readiness

Knowledge Check

What happens when a readiness probe fails?

  • The Pod is removed from its Service's endpoints (no traffic) but not restarted
  • The container is restarted in place right away by the kubelet on that same node
  • The Pod is evicted and rescheduled onto another node that has free capacity
  • Nothing happens — readiness is purely advisory

Why is pointing a liveness probe at an external dependency dangerous?

  • A dependency blip makes liveness fail and restarts healthy containers, potentially all at once under load
  • Liveness probes are simply forbidden from making any outbound network call to an external dependency at all
  • It permanently removes the Pod from the Service's endpoint list
  • It silently disables the container's readiness probe

What does a startup probe protect against?

  • A slow-initializing app being killed by the liveness probe before it finishes booting
  • Traffic reaching the Pod before its Service DNS name resolves
  • The scheduler placing the Pod on a node that lacks capacity
  • The Pod exceeding its configured memory limit and being OOM-killed during a traffic spike
