Topic 03

Monitoring, Alerting & SLI/SLO/SLA

Concept

Observability gives you signals; now someone has to watch them and react. Monitoring watches the signals automatically and alerts a human when something's wrong. And to know what "wrong" even means, teams set clear reliability targets using three linked terms: SLI, SLO, and SLA.

These three acronyms sound bureaucratic but are genuinely useful: they turn the vague question "is it working?" into something specific and measurable that a whole team can agree on.

Picture a home thermostat: it measures the temperature (the measurement), it has a target you set (the goal), and a landlord's lease might even guarantee a range (the promise). Those three layers (measure, target, promise) map exactly onto what follows.

Monitoring and Alerting

Monitoring means continuously and automatically watching the metrics and logs from the last topic. Alerting is the part that pages a human when a signal crosses a line you care about: error rate too high, responses too slow, the system down. The goal is to learn about problems from your monitoring before users do, so you're responding to an alert at 2 AM, not to angry customers in the morning.
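To make "a signal crosses a line you care about" concrete, here is a minimal sketch of a threshold check. The names (`Snapshot`, `alerts_for`) and the threshold values are assumptions made up for illustration; they don't come from any particular monitoring tool.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
MAX_ERROR_RATE = 0.05      # page if more than 5% of requests fail
MAX_P95_LATENCY_S = 1.0    # page if 95th-percentile latency exceeds 1 second


@dataclass
class Snapshot:
    """One reading of the signals the monitoring system watches."""
    is_up: bool
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p95_latency_s: float    # 95th-percentile response time in seconds


def alerts_for(snapshot: Snapshot) -> list[str]:
    """Return the reasons a human should be paged (empty list means no page)."""
    if not snapshot.is_up:
        # If the system is down, the other signals are not meaningful.
        return ["system is down"]
    reasons = []
    if snapshot.error_rate > MAX_ERROR_RATE:
        reasons.append("error rate too high")
    if snapshot.p95_latency_s > MAX_P95_LATENCY_S:
        reasons.append("response too slow")
    return reasons
```

A real system would run a check like this on a schedule and send any non-empty result to whoever is on call.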

SLI: What You Measure

An SLI (Service Level Indicator) is the actual measurement of how the service is doing — a real number, like "the percentage of requests that respond in under one second" or "the percentage of reminders that fire on time". It's the concrete, measured reality. Everything else builds on it: you can't have a target or a promise without first deciding what you're measuring.
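As a small sketch of what "a real number" means here, the function below turns a list of response times into the SLI "percentage of requests that respond in under one second". The function name and the sample data are hypothetical.

```python
def latency_sli(latencies_s: list[float], threshold_s: float = 1.0) -> float:
    """Percentage of requests that responded in under `threshold_s` seconds."""
    if not latencies_s:
        raise ValueError("no requests observed, so the SLI is undefined")
    fast = sum(1 for t in latencies_s if t < threshold_s)
    return 100.0 * fast / len(latencies_s)


# Hypothetical sample: 4 of 5 requests finished in under one second.
sample = [0.12, 0.48, 0.95, 1.70, 0.30]
print(latency_sli(sample))  # 80.0
```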

SLO: The Target

An SLO (Service Level Objective) is the internal goal for that indicator — the target you're aiming for, like "99.9% of requests under one second". It's the line that separates "healthy" from "needs attention", and it's what alerts are set against. Crucially, SLOs are deliberately not 100%, because perfect reliability is impossibly expensive; a good SLO picks a sensible bar that's worth the cost.
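One way to see why the target sits below 100% is to count how many requests are allowed to miss it. The sketch below does that arithmetic; the function name and the traffic figure of one million requests are assumptions for illustration. This allowance is sometimes called an error budget.

```python
from fractions import Fraction


def allowed_misses(total_requests: int, slo_percent: str) -> int:
    """Number of requests that may miss the target while still meeting the SLO.

    `slo_percent` is passed as a string (e.g. "99.9") so the arithmetic is exact.
    """
    miss_fraction = (100 - Fraction(slo_percent)) / 100
    return int(total_requests * miss_fraction)  # round down to whole requests


# Hypothetical traffic: one million requests in the measurement window.
for slo in ["99", "99.9", "99.99", "100"]:
    print(slo, allowed_misses(1_000_000, slo))
# 99 10000
# 99.9 1000
# 99.99 100
# 100 0
```

Each extra nine shrinks the allowance tenfold, and a 100% target leaves no room for even a single slow request.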

SLA: The Promise

An SLA (Service Level Agreement) is the formal promise made to customers — a contractual commitment, often with consequences (like refunds) if it's broken. An SLA is usually set looser than the internal SLO, so the team has a safety margin: they aim for the tougher internal target and only breach the customer promise if things go badly wrong. Indicator, objective, agreement — measure, goal, promise.
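The safety margin can be sketched as a three-way classification. The function name and the example numbers (an SLO of 99.9% and an SLA of 99.5%) are hypothetical, and the check assumes the usual case where the SLA is no stricter than the SLO.

```python
def reliability_status(sli: float, slo: float, sla: float) -> str:
    """Classify a measured SLI against the internal SLO and the customer SLA.

    All three values are percentages, e.g. 99.9.
    """
    if sla > slo:
        # Assumption for this sketch: the promise is never stricter than the goal.
        raise ValueError("expected the SLA to be no stricter than the SLO")
    if sli >= slo:
        return "healthy"
    if sli >= sla:
        # Inside the safety margin: the team should act, but no promise is broken.
        return "needs attention"
    return "SLA breached"


print(reliability_status(99.95, slo=99.9, sla=99.5))  # healthy
print(reliability_status(99.7, slo=99.9, sla=99.5))   # needs attention
print(reliability_status(99.2, slo=99.9, sla=99.5))   # SLA breached
```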

Cadence's reliability work already covers the first two layers. Their SLI is "the percentage of reminders that fire within one minute of the chosen time", measured continuously. Their SLO is "99% of reminders within one minute": the internal target their monitoring checks, paging whoever's on call if the SLI dips below it. They don't offer a contractual SLA yet, but if they sold a premium plan, that promise would sit a little looser than the 99% they aim for internally, giving them a margin.
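Here is a rough sketch of how Cadence's SLI and SLO could be checked in code. The one-minute window and the 99% target come from the description above; the function names and the data shape (pairs of scheduled and fired times) are assumptions, and reminders that never fire at all are not modelled.

```python
from datetime import datetime, timedelta

ON_TIME_WINDOW = timedelta(minutes=1)  # from the SLI definition
SLO_PERCENT = 99.0                     # the internal target


def reminder_sli(reminders: list[tuple[datetime, datetime]]) -> float:
    """Percentage of (scheduled, fired) pairs that fired within one minute."""
    if not reminders:
        raise ValueError("no reminders observed, so the SLI is undefined")
    on_time = sum(
        1 for scheduled, fired in reminders
        if abs(fired - scheduled) <= ON_TIME_WINDOW
    )
    return 100.0 * on_time / len(reminders)


def should_page_on_call(reminders: list[tuple[datetime, datetime]]) -> bool:
    """Page whoever is on call when the measured SLI dips below the SLO."""
    return reminder_sli(reminders) < SLO_PERCENT
```

With 100 reminders, one late reminder still meets the 99% target, while two late reminders would trigger a page.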

Common Confusions

  • "SLI, SLO, and SLA are interchangeable." They're a chain: the indicator is what you measure, the objective is your internal target, the agreement is the customer promise. Different roles, distinct meanings.
  • "The goal is 100% reliability." Perfect uptime is impossibly expensive, so SLOs deliberately pick a sensible bar below 100%. Chasing the last fraction of a percent rarely pays off.
  • "More alerts means a safer system." Too many alerts cause alert fatigue: people start ignoring them, including the real ones. Good alerting is selective, firing only on what genuinely matters.

Why It Matters

  • SLI, SLO, and SLA run every reliability conversation — knowing the chain lets you follow and take part in how teams set and defend their targets.
  • They turn "is it working?" into something measurable and agreed, which is what makes reliability a managed thing rather than a vague hope.
  • That 100% isn't the goal, and that too many alerts backfire, are two genuinely useful insights for anyone who operates software.

Knowledge Check

What is an SLI (Service Level Indicator)?

  • The actual measurement of how the service is doing, as a real number
  • The internal target you are aiming for, such as ninety-nine percent or so
  • The formal contractual promise that you make to your customers
  • The page or alert that is sent to a human when a target breaks

What is an SLO (Service Level Objective)?

  • The internal target you're aiming for, like 99.9% under one second
  • The actual measured number showing how the service is doing now
  • The contractual promise made to customers, with penalties if broken
  • The monitoring tool that watches the signals and sends out alerts

Why are SLOs deliberately set below 100%?

  • Perfect reliability is impossibly expensive, so a sensible bar is set
  • Because a strict rule forbids ever aiming for 100% reliability
  • Because customers genuinely prefer software that fails every now and then
  • Because monitoring tools are technically unable to measure 100%

How does an SLA differ from an SLO?

  • The SLA is the customer promise, usually looser than the internal SLO
  • They are simply two different names for the very same target
  • The SLA is the measurement itself, while the SLO is the customer promise
  • The SLA is always a much stricter target than the internal SLO is
