Topic 69

Reliability and SRE Practices

Running Kubernetes reliably is as much operational discipline as configuration. This topic covers the SRE practices — service-level objectives, error budgets, and the operational habits — that keep a cluster trustworthy over time, beyond the per-workload gates of the readiness checklist.

The shift is from "is it configured right" to "how do we operate it": measuring reliability, deciding how much is enough, and building the muscle to keep it there.

SLIs, SLOs, and Error Budgets

Reliability needs a target, not a vibe. A service-level indicator (SLI) is a measured signal that users feel: availability, latency, error rate. A service-level objective (SLO) is the target for that signal, for example "99.9% of requests succeed". The error budget is the complement of the SLO, the failure you are allowed (here 0.1%), and it is a tool, not just a number: while budget remains, ship features; when it is exhausted, stop and stabilize. This turns reliability from an argument into a shared, data-driven decision.
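The arithmetic behind an error budget is small enough to sketch. The snippet below is a minimal illustration, not a prescribed implementation: the function name `error_budget_report`, the `BudgetReport` fields, and the request counts are all hypothetical, and it assumes a simple request-based availability SLI (successful requests divided by total requests).

```python
from dataclasses import dataclass


@dataclass
class BudgetReport:
    sli: float               # measured success ratio over the window
    allowed_failures: float  # failures the SLO permits for this traffic volume
    consumed: float          # fraction of the budget already spent (can exceed 1.0)
    decision: str            # "ship" or "stabilize"


def error_budget_report(total: int, failed: int, slo: float = 0.999) -> BudgetReport:
    """Summarize error-budget usage for one SLO window (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")

    sli = (total - failed) / total
    # The budget is the complement of the SLO, scaled to the traffic seen.
    allowed_failures = (1.0 - slo) * total
    consumed = failed / allowed_failures
    # Policy from the text: ship while budget remains, stabilize once it is spent.
    decision = "ship" if consumed < 1.0 else "stabilize"
    return BudgetReport(sli, allowed_failures, consumed, decision)


report = error_budget_report(total=1_000_000, failed=400)
print(f"SLI {report.sli:.4%}, budget used {report.consumed:.0%}, decision: {report.decision}")
```

With a 99.9% SLO and one million requests, roughly 1,000 failures are allowed, so 400 failures spend about 40% of the budget and the decision stays "ship".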

Alert on Symptoms

Build alerting around the SLOs: page on the symptoms users feel (error rate up, latency up, budget burning fast), not on every internal cause. Paging on causes produces noise and alert fatigue (Topic 46). A good alert means a human needs to act now; everything else is a dashboard or a ticket. The fastest way to teach on-call engineers to ignore alerts is to page them for things that don't matter.
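"Budget burning fast" can be made concrete as a burn rate: the observed error ratio divided by the budget (1 minus the SLO). A burn rate of 1 spends the budget exactly over the SLO window; higher values spend it sooner. The sketch below is illustrative only. The function names are hypothetical, and the 14.4 threshold with a long and a short window is an assumption borrowed from commonly cited multi-window burn-rate guidance (about 2% of a 30-day budget spent in one hour), not something this topic prescribes.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    budget = 1.0 - slo
    if budget <= 0.0:
        raise ValueError("slo must be below 1.0")
    return error_ratio / budget


def should_page(long_window_errors: float, short_window_errors: float,
                slo: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if both windows burn faster than the threshold.

    The long window (for example 1 hour) shows the burn is significant;
    the short window (for example 5 minutes) shows it is still happening.
    Window lengths and the threshold are assumptions for illustration.
    """
    return (burn_rate(long_window_errors, slo) >= threshold
            and burn_rate(short_window_errors, slo) >= threshold)


# 2% of requests failing in both windows against a 99.9% SLO: burn rate near 20.
print(should_page(long_window_errors=0.02, short_window_errors=0.02))
# The spike has already ended (short window is healthy), so nobody is paged.
print(should_page(long_window_errors=0.02, short_window_errors=0.0005))
```

Requiring both windows is one way to keep pages actionable: a burst that has already stopped becomes a ticket rather than a page.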

Operational Discipline

Day-to-day reliability rests on habits: capacity planning with headroom for bursts and failures (autoscaling is not instant); safe, progressive rollouts with automated rollback (Topic 44); and incident response with runbooks, clear ownership, and blameless postmortems that turn each incident into a fix. The cluster controls from earlier — replicas, PDBs, probes, HA control plane — are the substrate; these practices are how you operate on top of them.
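Headroom planning is simple arithmetic, sketched here under stated assumptions: each replica handles a known, roughly constant request rate, the burst margin is a flat percentage, and spare replicas cover the failures you want to absorb. The function name `replicas_needed`, the parameters, and the numbers are hypothetical.

```python
import math


def replicas_needed(peak_rps: float, rps_per_replica: float,
                    burst_headroom: float = 0.3, tolerated_failures: int = 1) -> int:
    """Replicas to serve peak load plus a burst margin while some replicas are down."""
    if peak_rps < 0 or rps_per_replica <= 0:
        raise ValueError("peak_rps must be >= 0 and rps_per_replica must be > 0")
    if burst_headroom < 0 or tolerated_failures < 0:
        raise ValueError("headroom and tolerated_failures must be non-negative")

    # Capacity for peak traffic plus the burst margin, rounded up to whole replicas.
    for_load = math.ceil(peak_rps * (1.0 + burst_headroom) / rps_per_replica)
    # Extra replicas so losing some (node failure, drain) does not eat the margin.
    return for_load + tolerated_failures


# 1,200 rps peak, 100 rps per replica, 30% burst margin, survive one lost replica.
print(replicas_needed(peak_rps=1200, rps_per_replica=100))
```

Setting this figure as the baseline replica count, and letting autoscaling work above it, means a burst is absorbed by capacity that already exists while new Pods are still starting.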

Practice	Purpose
SLIs/SLOs + error budget	Define and govern 'reliable enough'
Symptom-based alerting	Page on what users feel, avoid fatigue
Capacity + headroom	Absorb bursts and failures (scaling isn't instant)
Progressive rollout + rollback	Limit blast radius of bad releases
Runbooks + blameless postmortems	Respond fast; turn incidents into fixes

Test Failure Deliberately

Reliability that has never been tested is a guess. Game days and chaos experiments deliberately fail components (kill Pods, drain nodes, take a zone offline) to verify that the system recovers as designed and that the runbooks are correct (Topic 61). Testing this way also finds the gaps cheaply, on your schedule, instead of expensively, during a real incident.

Finally, attack toil: automate the repetitive operational work, because manual toil is both a cost and a reliability risk (humans tire and err). SRE is the discipline of making reliability measurable, then engineering to the measure.
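The game-day loop described in this section (confirm steady state, inject one failure, measure recovery against a deadline) can be sketched as a small harness. This is a minimal illustration, not a chaos tool: `run_experiment` and its parameters are hypothetical, and the two callables are placeholders you would supply yourself, for example one that deletes a Pod and one that checks an SLI-based health signal.

```python
import time
from typing import Callable


def run_experiment(inject_failure: Callable[[], None],
                   is_healthy: Callable[[], bool],
                   recovery_deadline_s: float = 60.0,
                   poll_interval_s: float = 1.0,
                   clock: Callable[[], float] = time.monotonic,
                   sleep: Callable[[float], None] = time.sleep) -> float:
    """Inject one failure and return the seconds until the health check passes.

    Raises RuntimeError if the system is already unhealthy (do not add
    failure on top of an incident) and TimeoutError if recovery takes
    longer than the deadline.
    """
    if not is_healthy():
        raise RuntimeError("steady state not met; aborting before injecting failure")

    inject_failure()
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= recovery_deadline_s:
            raise TimeoutError(f"no recovery within {recovery_deadline_s}s")
        sleep(poll_interval_s)


# Stand-in system for demonstration: becomes healthy on the third check after failure.
state = {"failed": False, "checks": 0}

def fake_inject() -> None:
    state["failed"] = True

def fake_health() -> bool:
    if not state["failed"]:
        return True
    state["checks"] += 1
    return state["checks"] >= 3

elapsed = run_experiment(fake_inject, fake_health,
                         recovery_deadline_s=5.0, poll_interval_s=0.01)
print(f"recovered in {elapsed:.2f}s")
```

A TimeoutError here is the useful result: it marks a gap in recovery or in the runbook, found on your schedule rather than during an incident.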

Reactive ops vs SRE

Reactive ops — fix things when they break; reliability is a vibe; alerts page on causes. Burns out teams.

SRE — SLOs and error budgets govern reliability; alert on symptoms; test failure; automate toil. Data-driven and sustainable.

Common Mistakes

No SLOs, so 'reliable enough' is an argument rather than a measured, shared target.
Alerting on internal causes instead of user-facing symptoms, causing alert fatigue.
No capacity headroom, relying on autoscaling that isn't instant to absorb bursts.
No runbooks or blameless postmortems, so incidents recur instead of becoming fixes.
Never testing failure, so recovery is unproven until a real incident.

Best Practices

Define SLIs/SLOs and use the error budget to balance feature work against stability.
Alert on symptoms users feel; keep pages actionable to avoid fatigue.
Plan capacity with headroom; use progressive rollouts with automated rollback.
Maintain runbooks and run blameless postmortems that produce concrete fixes.
Run game days/chaos tests to prove recovery, and automate toil to remove a reliability risk.

Related

Metrics and monitoring: where SLIs and alerts live (Topic 46)
High availability and DR: game days prove recovery (Topic 61)
CI/CD: progressive rollout and rollback (Topic 44)

Knowledge Check

What is an error budget, and how is it used?

The allowed failure (100% minus the SLO); while budget remains you ship features, and when it is exhausted you stop and stabilize
The money allocated each quarter to fixing reported production bugs
The number of replicas a service may lose to node drains or evictions before it drops below its minimum available count and goes down
A limit on how many alerts may fire during a rolling time window before alerting is throttled

What should alerts page a human about?

Symptoms users feel (error rate, latency, fast budget burn), not every internal cause
Every metric that crosses any statically configured threshold value
Only sustained CPU usage that stays above the node's limit
Internal counters and queue depths on every component regardless of whether users feel any impact

Why run game days / chaos experiments?

To prove the system recovers as designed and runbooks are correct, finding gaps before a real incident
To increase cluster utilization by packing more Pods onto each node and squeezing out reserved headroom
To replace monitoring and alerting entirely with scheduled failure injection runs
To reduce the number of replicas a service needs to run while staying available
