Topic 66

Production-Readiness Checklist

Before a workload goes live, a handful of concrete gates separate a demo from something you can trust at 3 a.m. This topic collects them into one checklist. Each item is covered in depth in an earlier topic; here they appear together as the list to run through before shipping.

None of these is novel. The value of the list is that each item is easy to skip on its own, yet together they are decisive. A workload missing several of them is not production-ready, however well it runs in a demo.

Resources and Scheduling

Every container has resource requests set from measured usage, so scheduling and eviction behave predictably and the workload isn't BestEffort by accident (Topic 25). Memory limits reflect the real ceiling; CPU limits are used cautiously to avoid needless throttling. Critical workloads use Guaranteed QoS (requests equal to limits on every container) so they are among the last Pods evicted under node pressure. Without requests, the scheduler is guessing and your Pod is among the first to be evicted.
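As a rough illustration of how requests and limits map to a QoS class, the sketch below classifies a Pod spec given as a plain Python dict. The function name `qos_class` is hypothetical, and the logic is simplified: it compares quantity strings literally (so `"500m"` and `"0.5"` would not match) and looks only at `containers`, not init containers.

```python
def qos_class(pod_spec: dict) -> str:
    """Approximate the QoS class Kubernetes would assign to a Pod spec.

    Simplified sketch: quantities are compared as strings and only the
    `containers` list is inspected.
    """
    containers = pod_spec.get("containers", [])
    any_set = False
    guaranteed = bool(containers)

    for container in containers:
        resources = container.get("resources") or {}
        limits = resources.get("limits") or {}
        # A missing request defaults to the limit when a limit is set.
        requests = {**limits, **(resources.get("requests") or {})}

        if requests or limits:
            any_set = True

        # Guaranteed needs CPU and memory limits, each equal to its request.
        for name in ("cpu", "memory"):
            if name not in limits or requests.get(name) != limits[name]:
                guaranteed = False

    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

Running a check like this over rendered manifests in CI is one way to catch a container that slipped through with no `resources` block at all.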

Health and Availability

The workload has a readiness probe (so rollouts and traffic gate on actual readiness) and a minimal liveness probe (testing only the process itself, not its dependencies), with a startup probe if it boots slowly (Topic 29). It runs multiple replicas spread across zones via topology spread, with a PodDisruptionBudget so voluntary disruptions such as node drains and upgrades never drop it below a safe count (Topics 26, 30). A single-replica production service is a checklist failure.
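The availability gates above can be checked mechanically. The following is a minimal sketch, assuming the Deployment and PodDisruptionBudget have already been loaded into Python dicts (for example from YAML); the function name `availability_gaps` and the wording of the messages are illustrative, not part of any Kubernetes tooling.

```python
from typing import Optional


def availability_gaps(deployment: dict, pdb: Optional[dict] = None) -> list:
    """Return a list of availability gates the Deployment does not meet."""
    gaps = []
    spec = deployment.get("spec", {})
    pod_spec = spec.get("template", {}).get("spec", {})

    # spec.replicas defaults to 1 when omitted.
    replicas = spec.get("replicas", 1)
    if replicas < 2:
        gaps.append("single replica")

    for container in pod_spec.get("containers", []):
        if "readinessProbe" not in container:
            gaps.append(f"container {container.get('name', '?')}: no readinessProbe")

    if not pod_spec.get("topologySpreadConstraints"):
        gaps.append("no topologySpreadConstraints")

    if pdb is None:
        gaps.append("no PodDisruptionBudget")
    else:
        min_available = pdb.get("spec", {}).get("minAvailable")
        # An integer minAvailable >= replicas leaves zero allowed
        # disruptions, so a node drain can never evict a Pod.
        if isinstance(min_available, int) and min_available >= replicas:
            gaps.append("PDB minAvailable >= replicas: drains cannot evict any Pod")

    return gaps
```

The last check reflects a common misconfiguration: a budget that is too strict protects availability but blocks the very drains and upgrades it is meant to make safe.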

Gate	Why
Requests set (+ sensible limits)	Correct scheduling and eviction; no accidental BestEffort
Readiness + liveness probes	Zero-downtime rollouts; recover wedged processes
Multiple replicas across zones	Survive Pod, node, and zone failures
PodDisruptionBudget	Operations don't breach availability
securityContext (non-root, etc.)	Limit blast radius of a compromise

Security and Configuration

The Pod runs with a hardened securityContext — non-root, dropped capabilities, read-only root filesystem where feasible — matching at least the Baseline (ideally Restricted) Pod Security Standard (Topic 34). It uses its own least-privilege ServiceAccount (Topic 33), not the default. Config and secrets come from ConfigMaps/Secrets with encryption at rest, never baked into the image. Images are pinned by digest from a trusted source (Topic 37).
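A sketch of how these security gates might be linted against a Pod spec held as a Python dict. The helper name `security_gaps` is hypothetical, and the checks are a simplified subset inspired by the Restricted profile rather than a faithful implementation of the Pod Security Standards; unset fields are treated as gaps.

```python
def security_gaps(pod_spec: dict) -> list:
    """Return the hardening gates a Pod spec does not meet (simplified)."""
    gaps = []
    pod_sc = pod_spec.get("securityContext") or {}

    # An omitted serviceAccountName means the namespace's default account.
    if pod_spec.get("serviceAccountName", "default") == "default":
        gaps.append("uses the default ServiceAccount")

    for container in pod_spec.get("containers", []):
        name = container.get("name", "?")
        sc = container.get("securityContext") or {}

        # A container-level runAsNonRoot overrides the Pod-level value.
        if not sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot", False)):
            gaps.append(f"{name}: runAsNonRoot not set")

        # Treat an unset allowPrivilegeEscalation as a gap.
        if sc.get("allowPrivilegeEscalation", True):
            gaps.append(f"{name}: allowPrivilegeEscalation not disabled")

        dropped = (sc.get("capabilities") or {}).get("drop") or []
        if "ALL" not in dropped:
            gaps.append(f"{name}: capabilities not dropped")

        if not sc.get("readOnlyRootFilesystem", False):
            gaps.append(f"{name}: root filesystem is writable")

        # Digest-pinned references contain '@sha256:'.
        if "@sha256:" not in container.get("image", ""):
            gaps.append(f"{name}: image not pinned by digest")

    return gaps
```

In a real cluster, enforcement belongs to Pod Security admission or a policy controller; a script like this is only useful as an early warning before manifests are applied.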

Operability

The workload emits logs to stdout, exposes metrics, and propagates trace context (Chapter 9), with alerts on the symptoms users feel. It handles graceful shutdown — responding to SIGTERM, finishing in-flight work within the termination grace period — so rescheduling doesn't drop requests. And it is delivered through GitOps so what's running is what's in Git, with a tested rollback. Run this list before every go-live; the gaps are where incidents come from.
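Graceful shutdown is the one gate here that lives in application code. When a Pod is terminated, the kubelet sends SIGTERM to the container and follows up with SIGKILL once `terminationGracePeriodSeconds` (30 seconds by default) has elapsed. The sketch below shows one way a worker could react: stop taking new work on SIGTERM but let the job in progress finish. The names `install_handler` and `run_worker` are illustrative, and a real server would also need to fail its readiness probe and close listeners.

```python
import signal
import threading

# Set once SIGTERM has been received; checked between jobs.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Signal handler: record the request and return immediately."""
    shutdown_requested.set()


def install_handler():
    # signal.signal may only be called from the main thread.
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(jobs, process):
    """Process jobs until shutdown is requested.

    A job that has already started always runs to completion; only
    jobs that have not started yet are skipped.
    """
    done = []
    for job in jobs:
        if shutdown_requested.is_set():
            break  # stop accepting new work
        done.append(process(job))  # in-flight work finishes
    return done
```

The work left after SIGTERM has to fit inside the grace period; if a single job can run longer than that, either raise `terminationGracePeriodSeconds` or make the job resumable.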

Demo-ready vs production-ready

Demo-ready — it runs and serves a request. One replica, no probes, no limits — fine until anything goes wrong.

Production-ready — requests/probes/replicas/PDB/securityContext/observability/graceful-shutdown all in place. Survives failures and operations.

Common Mistakes

Shipping a single replica for a production service.
No requests or probes, so scheduling, rollouts, and recovery all misbehave.
Running as root with the default ServiceAccount and no securityContext.
No alerting on user-facing symptoms, so failures are noticed by customers first.
Ignoring graceful shutdown, so every reschedule drops in-flight requests.

Best Practices

Set requests (and sensible limits) on every container; use Guaranteed QoS for critical ones.
Define readiness + minimal liveness (+ startup) probes; run multiple replicas across zones with a PDB.
Harden with a non-root securityContext and a least-privilege ServiceAccount; pin images by digest.
Emit logs/metrics/traces and alert on symptoms; handle SIGTERM for graceful shutdown.
Deliver via GitOps with a tested rollback, and run this checklist before every go-live.

Related

Probes / PDB / QoS: the availability gates (Topics 29-30)
Pod Security / Service Accounts: the security gates (Topics 33-34)
Observability: the operability gates (Chapter 9)

Knowledge Check

Which is a minimum availability gate for a production service?

Multiple replicas spread across zones with a PodDisruptionBudget
A single replica with a generous CPU limit set high enough to absorb traffic spikes
Running the container as the root user for unrestricted host access
Using the namespace's default ServiceAccount for the Pod

Why does graceful shutdown (handling SIGTERM) matter for production readiness?

So rescheduling and rollouts finish in-flight work instead of dropping requests
It speeds up image pulls by warming the node's layer cache
It is required before the scheduler will place the Pod on a node and bind it to the kubelet
It encrypts the Pod's inbound and outbound network traffic in transit

What does setting resource requests primarily ensure?

Correct scheduling and eviction behavior, and avoiding accidental BestEffort QoS
Automatic horizontal scaling of replica count as measured load rises
That the container image is signed and verified by a policy controller at admission time
Zero-downtime rollouts during every rolling Deployment update
