Reliability and SRE Practices
Running Kubernetes reliably is as much operational discipline as configuration. This topic covers the SRE practices (service-level objectives, error budgets, and day-to-day operational habits) that keep a cluster trustworthy over time, beyond the per-workload gates of the readiness checklist.
The shift is from "is it configured right" to "how do we operate it": measuring reliability, deciding how much is enough, and building the muscle to keep it there.
SLIs, SLOs, and Error Budgets
Reliability needs a target, not a vibe. A service-level indicator (SLI) is a measured signal that users feel, such as availability, latency, or error rate. A service-level objective (SLO) is the target for that signal, for example 99.9% of requests succeed. The error budget is the complement of the SLO, the failure you are allowed (here 0.1%), and it is a tool, not just a number: while budget remains, ship features; when it is exhausted, stop and stabilize. This turns reliability from an argument into a shared, data-driven decision.
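The arithmetic behind an error budget is simple enough to sketch. The Python below is a minimal illustration, not a prescribed implementation: the `ErrorBudget` class, its field names, and the 30-day window default are assumptions introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one SLO window (illustrative names)."""

    slo: float            # target success ratio, e.g. 0.999 for 99.9%
    total_requests: int   # requests observed in the window
    failed_requests: int  # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of traffic the SLO permits to fail.
        return (1.0 - self.slo) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 = untouched budget, 0.0 = exhausted, negative = SLO violated.
        if self.allowed_failures == 0:
            # No budget to spend: acceptable only while nothing has failed.
            return 1.0 if self.failed_requests == 0 else 0.0
        return 1.0 - self.failed_requests / self.allowed_failures

    def can_ship_features(self) -> bool:
        # Policy from the text: ship while budget remains, else stabilize.
        return self.remaining_fraction > 0.0


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Time-based view of the same budget: minutes of full outage allowed."""
    return (1.0 - slo) * window_days * 24 * 60
```

With a 99.9% SLO and one million requests in the window, the budget is about 1,000 failed requests; 400 failures leaves roughly 60% of it. Expressed as time, the same SLO over 30 days allows about 43.2 minutes of full outage.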
Alert on Symptoms
Build alerting around the SLOs: page on the symptoms users feel (error rate up, latency up, budget burning fast), not on every internal cause, which produces noise and alert fatigue (Topic 46). A good alert means a human needs to act now; everything else is a dashboard or a ticket. The fastest way to make on-call ignore alerts is to page them for things that don't matter.
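One way to make "budget burning fast" concrete is a burn rate: the observed error ratio divided by the error ratio the SLO allows. The sketch below assumes a two-window check (a long window to confirm the problem is significant, a short window to confirm it is still happening). The function names and the default threshold of 14.4 are assumptions for illustration; 14.4 is a commonly cited value because at that rate about 2% of a 30-day budget is spent in one hour.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is burning.

    A burn rate of 1.0 spends the whole budget exactly over the SLO window.
    """
    if not 0.0 <= slo < 1.0:
        raise ValueError("slo must be in [0, 1)")
    return error_ratio / (1.0 - slo)


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo: float,
    threshold: float = 14.4,  # assumed default, tune per service
) -> bool:
    """Page only if the burn is both significant and still ongoing."""
    return (
        burn_rate(long_window_error_ratio, slo) >= threshold
        and burn_rate(short_window_error_ratio, slo) >= threshold
    )
```

Requiring both windows keeps the page actionable: a spike that has already recovered fails the short-window check, and a brief blip fails the long-window check. Slower burns that miss the threshold are better routed to a ticket or dashboard.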
Operational Discipline
Day-to-day reliability rests on habits: capacity planning with headroom for bursts and failures (autoscaling is not instant); safe, progressive rollouts with automated rollback (Topic 44); and incident response with runbooks, clear ownership, and blameless postmortems that turn each incident into a fix. The cluster controls from earlier — replicas, PDBs, probes, HA control plane — are the substrate; these practices are how you operate on top of them.
| Practice | Purpose |
|---|---|
| SLIs/SLOs + error budget | Define and govern 'reliable enough' |
| Symptom-based alerting | Page on what users feel, avoid fatigue |
| Capacity + headroom | Absorb bursts and failures (scaling isn't instant) |
| Progressive rollout + rollback | Limit blast radius of bad releases |
| Runbooks + blameless postmortems | Respond fast; turn incidents into fixes |
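Capacity headroom can be estimated with back-of-the-envelope arithmetic. The sketch below is one possible sizing rule, not a standard formula: the function name, the 30% default headroom, and the assumption that replicas are spread evenly across zones are all introduced here for illustration.

```python
import math


def replicas_needed(
    peak_rps: float,
    rps_per_replica: float,
    headroom: float = 0.3,       # assumed: keep 30% of each replica spare
    zones: int = 3,              # assumed: replicas spread evenly over zones
    tolerate_zone_loss: bool = True,
) -> int:
    """Replicas required to serve peak load with headroom and one zone down."""
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    # Replicas needed so that each runs at no more than (1 - headroom) load.
    base = math.ceil(peak_rps / (rps_per_replica * (1.0 - headroom)))
    if not tolerate_zone_loss or zones < 2:
        return base
    # The surviving zones must still hold `base` replicas between them.
    per_zone = math.ceil(base / (zones - 1))
    return per_zone * zones
```

For a peak of 1,000 requests per second and replicas that each handle 100, the rule gives 15 replicas with headroom alone and 24 when the service must also survive losing one of three zones. The gap between those numbers is capacity that autoscaling cannot supply instantly during a failure.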
Test Failure Deliberately
Reliability that has never been tested is a guess. Game days and chaos experiments deliberately fail components — kill Pods, drain nodes, take a zone offline — to verify the system recovers as designed and the runbooks are correct (Topic 61). It is also how you find the gaps cheaply, on your schedule, instead of expensively, during a real incident. Finally, attack toil: automate the repetitive operational work, because manual toil is both a cost and a reliability risk (humans tire and err). SRE is the discipline of making reliability measurable, then engineering to the measure.
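A game day reduces to a small loop: confirm steady state, inject a failure, and measure whether the system recovers within the time the runbook promises. The harness below is a hypothetical sketch; `run_experiment` and its parameters are invented for illustration, and the `inject_failure` and `is_healthy` callables stand in for whatever your tooling does (deleting a Pod, draining a node, probing an endpoint).

```python
import time
from typing import Callable, Optional


def run_experiment(
    inject_failure: Callable[[], None],
    is_healthy: Callable[[], bool],
    recovery_deadline_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds until recovery, or None if the deadline was missed."""
    # Never inject failure into a system that is already unhealthy.
    if not is_healthy():
        raise RuntimeError("steady state not met; aborting experiment")

    inject_failure()
    start = clock()
    # Simplification: assumes the failure takes effect before the first probe.
    while clock() - start <= recovery_deadline_s:
        if is_healthy():
            return clock() - start
        sleep(poll_interval_s)
    return None  # recovery is unproven: a gap to fix before a real incident
```

The `clock` and `sleep` parameters exist so the harness can be exercised with a fake clock in tests. A `None` result is the valuable outcome of a game day, because it exposes a recovery gap on your schedule.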
Reactive ops — fix things when they break; reliability is a vibe; alerts page on causes. Burns out teams.
SRE — SLOs and error budgets govern reliability; alert on symptoms; test failure; automate toil. Data-driven and sustainable.
Common Pitfalls
- No SLOs, so 'reliable enough' is an argument rather than a measured, shared target.
- Alerting on internal causes instead of user-facing symptoms, causing alert fatigue.
- No capacity headroom, relying on autoscaling that isn't instant to absorb bursts.
- No runbooks or blameless postmortems, so incidents recur instead of becoming fixes.
- Never testing failure, so recovery is unproven until a real incident.
Best Practices
- Define SLIs/SLOs and use the error budget to balance feature work against stability.
- Alert on symptoms users feel; keep pages actionable to avoid fatigue.
- Plan capacity with headroom; use progressive rollouts with automated rollback.
- Maintain runbooks and run blameless postmortems that produce concrete fixes.
- Run game days/chaos tests to prove recovery, and automate toil to remove a reliability risk.
Knowledge Check
What is an error budget, and how is it used?
- The allowed failure (the complement of the SLO); while budget remains you ship features, and when it is exhausted you stop and stabilize
- The money allocated each quarter to fixing reported production bugs
- The number of replicas a service may lose to node drains or evictions before it drops below its minimum available count and goes down
- A limit on how many alerts may fire during a rolling time window before alerting is throttled
What should alerts page a human about?
- Symptoms users feel (error rate, latency, fast budget burn), not every internal cause
- Every metric that crosses any statically configured threshold value
- Only sustained CPU usage that stays above the node's limit
- Internal counters and queue depths on every component regardless of whether users feel any impact
Why run game days / chaos experiments?
- To prove the system recovers as designed and runbooks are correct, finding gaps before a real incident
- To increase cluster utilization by packing more Pods onto each node and squeezing out reserved headroom
- To replace monitoring and alerting entirely with scheduled failure injection runs
- To reduce the number of replicas a service needs to run while staying available