Production-Readiness Checklist
Before a workload goes live, a handful of concrete gates separate a demo from something you can trust at 3 a.m. This topic collects them into one checklist. Each item was covered in depth in an earlier topic; here they appear together as the list to run through before shipping.
None of these is novel; the value is that they are easy to skip individually and collectively decisive. A workload missing several of them is not production-ready, however well it runs in a demo.
Resources and Scheduling
Every container has resource requests set from measured usage, so scheduling and eviction behave predictably and the workload isn't BestEffort by accident (Topic 25). Memory limits reflect the real ceiling; CPU limits are used cautiously to avoid needless throttling. Critical workloads use Guaranteed QoS (requests equal to limits for both CPU and memory on every container) so they are among the last Pods evicted under node pressure. Without requests, the scheduler is guessing and your Pod is among the first to be evicted.
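To make the QoS rules concrete, here is a minimal sketch that derives a Pod's QoS class from its containers' `resources` stanzas. The function name `qos_class` and the dict layout are illustrative assumptions, not a Kubernetes API. It also simplifies in two ways: it ignores init containers, and it compares quantities as plain strings, so `"500m"` and `"0.5"` would not be treated as equal.

```python
def qos_class(containers):
    """Return 'Guaranteed', 'Burstable', or 'BestEffort' for a list of container dicts.

    Hypothetical helper for illustration; quantities are compared as strings.
    """
    any_set = False
    all_guaranteed = True
    for container in containers:
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An omitted request defaults to the limit when a limit is set.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        # No container sets any request or limit.
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"
```

A container with requests but no limits comes out as Burstable, which is why a critical workload needs both set, and set equal, to reach Guaranteed.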
Health and Availability
The workload has a readiness probe (so rollouts and traffic gate on actual readiness) and a minimal liveness probe (testing only the process), with a startup probe if it boots slowly (Topic 29). It runs multiple replicas spread across zones via topology spread, with a PodDisruptionBudget so node drains and upgrades never drop it below a safe count (Topics 26, 30). A single-replica production service is a checklist failure.
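The arithmetic behind the single-replica failure can be sketched in a few lines. The helpers below are hypothetical and cover only a PodDisruptionBudget with an integer `minAvailable`. They show how many voluntary evictions the budget would permit, and how unevenly replicas are spread across zones.

```python
def allowed_disruptions(healthy_pods, min_available):
    """Voluntary evictions permitted right now under an integer minAvailable."""
    return max(0, healthy_pods - min_available)


def zone_skew(pods_per_zone):
    """Difference between the most and least populated zones (0 if no zones)."""
    counts = list(pods_per_zone.values())
    return max(counts) - min(counts) if counts else 0
```

With three healthy replicas and `minAvailable: 2`, one Pod can be evicted at a time during a drain. With one replica and `minAvailable: 1`, the result is zero, so a drain either blocks or the budget has to be loosened to the point where it protects nothing.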
| Gate | Why |
|---|---|
| Requests set (+ sensible limits) | Correct scheduling and eviction; no accidental BestEffort |
| Readiness + liveness probes | Zero-downtime rollouts; recover wedged processes |
| Multiple replicas across zones | Survive Pod, node, and zone failures |
| PodDisruptionBudget | Operations don't breach availability |
| securityContext (non-root, etc.) | Limit blast radius of a compromise |
Security and Configuration
The Pod runs with a hardened securityContext — non-root, dropped capabilities, read-only root filesystem where feasible — matching at least the Baseline (ideally Restricted) Pod Security Standard (Topic 34). It uses its own least-privilege ServiceAccount (Topic 33), not the default. Config and secrets come from ConfigMaps/Secrets with encryption at rest, never baked into the image. Images are pinned by digest from a trusted source (Topic 37).
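As a rough illustration of these gates, the sketch below lints a Pod spec held as a Python dict. The field names (`serviceAccountName`, `runAsNonRoot`, `allowPrivilegeEscalation`, `capabilities.drop`, `readOnlyRootFilesystem`) are the standard Pod spec fields; the function `security_findings` and its messages are assumptions made for this example. It treats a writable root filesystem as a finding even though the checklist says "where feasible", and it is not a substitute for Pod Security admission.

```python
def security_findings(pod_spec):
    """Return a list of hardening gaps found in a Pod spec given as a dict."""
    findings = []
    if pod_spec.get("serviceAccountName", "default") == "default":
        findings.append("Pod uses the default ServiceAccount")
    pod_sc = pod_spec.get("securityContext", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext", {})
        # A container-level runAsNonRoot overrides the Pod-level value.
        if not sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot", False)):
            findings.append(f"{name}: runAsNonRoot is not true")
        # Unset is treated as a gap so the setting has to be explicit.
        if sc.get("allowPrivilegeEscalation", True):
            findings.append(f"{name}: allowPrivilegeEscalation is not false")
        if "ALL" not in sc.get("capabilities", {}).get("drop", []):
            findings.append(f"{name}: capabilities do not drop ALL")
        if not sc.get("readOnlyRootFilesystem", False):
            findings.append(f"{name}: root filesystem is writable")
        if "@sha256:" not in container.get("image", ""):
            findings.append(f"{name}: image is not pinned by digest")
    return findings
```

An empty result means the spec passes these particular checks; it does not prove that the ServiceAccount's permissions are actually least-privilege, which still needs a review of its RBAC bindings.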
Operability
The workload emits logs to stdout, exposes metrics, and propagates trace context (Chapter 9), with alerts on the symptoms users feel. It handles graceful shutdown — responding to SIGTERM, finishing in-flight work within the termination grace period — so rescheduling doesn't drop requests. And it is delivered through GitOps so what's running is what's in Git, with a tested rollback. Run this list before every go-live; the gaps are where incidents come from.
Demo-ready: it runs and serves a request. One replica, no probes, no requests or limits; fine until anything goes wrong.
Production-ready: requests, probes, replicas, PDB, securityContext, observability, and graceful shutdown are all in place. It survives failures and routine operations.
Common Pitfalls
- Shipping a single replica for a production service.
- No requests or probes, so scheduling, rollouts, and recovery all misbehave.
- Running as root with the default ServiceAccount and no securityContext.
- No alerting on user-facing symptoms, so failures are noticed by customers first.
- Ignoring graceful shutdown, so every reschedule drops in-flight requests.
Key Takeaways
- Set requests (and sensible limits) on every container; use Guaranteed QoS for critical ones.
- Define readiness + minimal liveness (+ startup) probes; run multiple replicas across zones with a PDB.
- Harden with a non-root securityContext and a least-privilege ServiceAccount; pin images by digest.
- Emit logs/metrics/traces and alert on symptoms; handle SIGTERM for graceful shutdown.
- Deliver via GitOps with a tested rollback, and run this checklist before every go-live.
Knowledge Check
Which is a minimum availability gate for a production service?
- Multiple replicas spread across zones with a PodDisruptionBudget
- A single replica with a generous CPU limit set high enough to absorb traffic spikes
- Running the container as the root user for unrestricted host access
- Using the namespace's default ServiceAccount for the Pod
Why does graceful shutdown (handling SIGTERM) matter for production readiness?
- So rescheduling and rollouts finish in-flight work instead of dropping requests
- It speeds up image pulls by warming the node's layer cache
- It is required before the scheduler will place the Pod on a node and bind it to the kubelet
- It encrypts the Pod's inbound and outbound network traffic in transit
What does setting resource requests primarily ensure?
- Correct scheduling and eviction behavior, and avoiding accidental BestEffort QoS
- Automatic horizontal scaling of replica count as measured load rises
- That the container image is signed and verified by a policy controller at admission time
- Zero-downtime rollouts during every rolling Deployment update