What to Alert On
Topic 63

Network monitoring fails in two directions, and both end the same way: nobody trusts the pager. Alert on everything and the noise trains people to ignore alerts, so the one that matters scrolls past unread. Alert on nothing useful and an outage runs silent until a customer reports it. The fix is not more alerts or fewer — it is choosing the few signals that track what users actually feel, and routing only those to a human at 3 a.m.

This topic applies SLI thinking to networks. Instead of paging on raw counters — CPU at 80%, an interface at 7 Gbps — you page on symptoms: latency users experience, packets lost, requests erroring, links running out of headroom. The counters still matter, but as the things you consult to explain a symptom, not as the things that wake you. Get that split right and the alert volume drops by an order of magnitude while the coverage of real incidents goes up.

Symptom-based: page a human
Users feel it — latency, errors, loss, traffic-dropping saturation. One alert covers a hundred root causes you never anticipated, and it correlates with what customers actually report. This is what wakes someone at 3 a.m.
Cause-based: dashboard it
A mechanism — CPU at 80%, a deep queue, a busy link. Consult these to explain a symptom after it fires; don't page on them, because a busy-but-healthy system trips them constantly and trains the team to mute the pager.

The Four Golden Signals for Networks

The SRE golden signals translate cleanly to networking. Latency is round-trip time and how it changes — the first thing users feel. Traffic is the demand on the system: requests per second, bits per second, new connections per second. Errors are the rate of things going wrong: dropped packets, interface CRC errors, failed connections, TCP resets. Saturation is how full the system is relative to capacity — the signal that predicts the other three are about to get worse.

These four cover most of what can go wrong on a network with a small, stable set of metrics. The trick is that they are rates and ratios, not absolute counters: error rate per second, not lifetime error count; utilization percentage, not raw byte total. A counter that only ever climbs tells you nothing about right now; the derived rate is what a threshold can sensibly fire on.
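As a minimal sketch of turning a cumulative counter into the rate a threshold can fire on, the Python below derives a per-second rate from two readings and converts a byte rate into link utilization. The names `CounterSample`, `rate_per_second`, and `utilization_pct` are hypothetical, and the single-wrap handling is an assumption about how the counter behaves, not a description of any particular device.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    timestamp_s: float  # when the counter was read, in seconds
    value: int          # cumulative counter value at that time


def rate_per_second(prev: CounterSample, curr: CounterSample,
                    counter_bits: int = 64) -> float:
    """Per-second rate between two readings of an ever-increasing counter."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr.value - prev.value
    if delta < 0:
        # Assumption: the fixed-width counter wrapped exactly once.
        # A device reboot that resets the counter looks the same here
        # and would need separate handling.
        delta += 2 ** counter_bits
    return delta / elapsed


def utilization_pct(bytes_per_s: float, link_bps: float) -> float:
    """Link utilization as a percentage of capacity."""
    return 100.0 * (bytes_per_s * 8) / link_bps


# Illustrative readings taken 60 seconds apart.
errors = rate_per_second(CounterSample(0, 1_000_000), CounterSample(60, 1_000_600))
octets = rate_per_second(CounterSample(0, 0), CounterSample(60, 52_500_000_000))
link_util = utilization_pct(octets, link_bps=10e9)
```

With these made-up readings, the lifetime error count of about a million says little on its own, while the derived values (errors per second, percent of a 10 Gbps link) are the kind of numbers a threshold can compare against.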

Symptom versus Cause Alerting

A symptom is something a user can feel: requests are slow, connections fail, a page errors. A cause is a mechanism that might produce a symptom: CPU is high, a queue is deep, a link is busy. The rule is to page on symptoms and dashboard the causes. CPU at 80% is not an incident if latency and error rate are fine — it is headroom being used as intended. Paging on it generates a 3 a.m. wake-up for a system that is working.

The reason symptom alerting wins is coverage. There are a hundred ways to make a service slow, and you cannot enumerate and threshold all of them; but they all converge on the same handful of symptoms — latency, errors, loss. Alert on the symptom and you catch causes you never anticipated. Then, when the symptom fires, the cause dashboards are right there to tell you which of the hundred mechanisms it was this time.
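The page-versus-dashboard rule can be written down as a routing decision. This is only an illustrative sketch: the signal names and the `route_alert` helper are hypothetical, and a real alerting system would express the same split in its own rule configuration.

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"            # wake a human
    DASHBOARD = "dashboard"  # keep for investigation


# Hypothetical signal names for things a user can feel directly.
SYMPTOM_SIGNALS = {
    "latency_p99",
    "error_rate",
    "packet_loss",
    "saturation_dropping_traffic",
}


def route_alert(signal: str) -> Route:
    """Page only on symptoms; everything else is a cause to consult later."""
    return Route.PAGE if signal in SYMPTOM_SIGNALS else Route.DASHBOARD
```

Defaulting unknown signals to the dashboard is a deliberate choice in this sketch: a new metric has to earn its way onto the pager by being shown to track user impact.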

Saturation and Headroom

Saturation is the signal that buys you time, because it leads the others — a link at 95% is still passing traffic, but the queue is filling and latency is about to spike. The catch is that the dangerous saturations are rarely raw bandwidth. The conntrack table filling on a firewall, the ephemeral-port pool exhausting on a NAT gateway, or a switch's output queue going deep will drop traffic while the interface graph still reads a comfortable 40%.

These table- and pool-fill metrics are the ones teams forget to instrument until an incident teaches them to — connection-tracking exhaustion drops new flows with the link half-idle, and the throughput graph offers no clue. Alert on percentage-of-capacity for every bounded resource, not just the obvious one: link utilization, conntrack entries against the table limit, ephemeral ports in use, queue depth. Headroom is the early warning the bandwidth graph cannot give you.
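A rough sketch of percentage-of-capacity checks across several bounded resources follows. The `BoundedResource` type, the 80% warning level, and all the numbers are assumptions chosen for illustration; how the used and limit values are collected is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BoundedResource:
    name: str
    used: float   # current consumption, in the resource's own unit
    limit: float  # hard capacity, in the same unit

    @property
    def pct_used(self) -> float:
        return 100.0 * self.used / self.limit


def over_threshold(resources: List[BoundedResource],
                   warn_pct: float = 80.0) -> List[str]:
    """Names of resources at or above warn_pct of their capacity."""
    return [r.name for r in resources if r.pct_used >= warn_pct]


# Illustrative snapshot: the link looks comfortable, the table does not.
snapshot = [
    BoundedResource("link_gbps", used=4.0, limit=10.0),
    BoundedResource("conntrack_entries", used=249_000, limit=262_144),
    BoundedResource("ephemeral_ports", used=12_000, limit=28_232),
]
nearly_full = over_threshold(snapshot)
```

In this made-up snapshot only the conntrack table crosses the warning level, which is the situation the bandwidth graph alone would miss.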

Thresholds, Percentiles, and Error Budgets

Averages lie about latency. If the mean response time is 40 ms but the p99 is 900 ms, one request in a hundred takes 900 ms or longer, and at a thousand requests per second that is ten painfully slow requests every second, completely invisible in the average. Alert on the high percentiles (p95, p99) because that is where the pain users report shows up; the average is dragged down by the fast majority and hides the tail that generates complaints.
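To make the gap between the mean and the tail concrete, here is a small sketch using a nearest-rank percentile over a synthetic sample. The latency values are invented for illustration (they do not reproduce the 40 ms and 900 ms figures exactly), and `percentile` is a hypothetical helper rather than a library function.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Synthetic window: 985 fast requests and 15 slow ones.
latencies_ms = [25.0] * 985 + [900.0] * 15

avg_ms = mean(latencies_ms)            # about 38 ms, looks healthy
p95_ms = percentile(latencies_ms, 95)  # still in the fast majority
p99_ms = percentile(latencies_ms, 99)  # lands in the slow tail
```

The mean stays under 40 ms while the p99 sits at 900 ms, so a threshold on the average would stay quiet through exactly the requests that generate complaints.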

Thresholds need damping, such as a hold duration, or they flap. A bare "fire if latency > 200 ms" alert will trigger and clear repeatedly on every transient spike, and a pager that cries wolf gets muted. Require the condition to hold for a duration (latency above 200 ms for 5 minutes) and frame budgets over a window, such as an error budget of 0.1% of requests per month, so a brief blip spends budget instead of paging and only sustained burn wakes someone. The goal is one page per real problem, not one per spike.
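The duration requirement can be sketched as a tiny state machine that only fires once a breach has lasted long enough. `SustainedThreshold` is a hypothetical class written for this example; the 200 ms and 5 minute values come from the text, and the 60-second sampling interval is an assumption.

```python
from typing import Optional


class SustainedThreshold:
    """Fires only after the value has stayed above threshold for hold_s seconds."""

    def __init__(self, threshold: float, hold_s: float) -> None:
        self.threshold = threshold
        self.hold_s = hold_s
        self._breach_started: Optional[float] = None

    def observe(self, timestamp_s: float, value: float) -> bool:
        if value <= self.threshold:
            # Any sample at or below the threshold resets the timer.
            self._breach_started = None
            return False
        if self._breach_started is None:
            self._breach_started = timestamp_s
        return timestamp_s - self._breach_started >= self.hold_s


alert = SustainedThreshold(threshold=200.0, hold_s=300.0)

# One transient spike, then a sustained breach, sampled every 60 s.
samples = [(0, 250.0), (60, 150.0),
           (120, 250.0), (180, 250.0), (240, 250.0),
           (300, 250.0), (360, 250.0), (420, 250.0)]
fired = [alert.observe(t, v) for t, v in samples]
```

In this run the isolated spike at the start never pages, and the sustained breach pages only on the last sample, five minutes after it began.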

Symptom-Based vs Cause-Based Alerting

Symptom-based alerts fire on user-visible impact — latency, errors, loss, saturation that is dropping traffic. Page on these, because they cover causes you never anticipated and they correlate with what users actually report. One symptom alert can catch a hundred different root causes.

Cause-based alerts fire on a mechanism — CPU at 80%, a deep queue, a busy link. Dashboard these and investigate with them after a symptom fires; do not page on them, because a busy-but-healthy system trips them constantly and trains the team to ignore the pager.

Common Mistakes
  • Alerting on every interface counter and CPU gauge. The team gets paged for a router at 80% CPU that is serving traffic perfectly, learns the alerts are noise, and starts swiping them away — so the one real outage gets dismissed with the rest.
  • Never alerting on conntrack or ephemeral-port saturation. The table fills, new connections are dropped while the bandwidth graph reads 40%, and the first signal you get is users reporting failures, because nothing was watching the resource that actually ran out.
  • Setting bare thresholds with no duration requirement. A "latency > 200 ms" alert fires and clears on every normal spike, the pager flaps a dozen times an hour, and people mute it — so it is silent during the sustained event that mattered.
  • Monitoring averages and trusting them. A 40 ms mean hides a 900 ms p99, so one request in a hundred is miserable and the dashboard looks green; the tail that generates every complaint is exactly what the average erases.
  • Paging on causes instead of symptoms. A high-CPU page wakes someone for a system with fine latency and zero errors, burning trust and on-call sleep on a non-incident while teaching the team that pages do not mean real problems.
Best Practices
  • Page on the four golden signals — latency, traffic, errors, saturation — expressed as rates and ratios, and route everything else to dashboards you consult after a symptom fires.
  • Alert on p95 and p99 latency rather than the average, because the tail is where users feel pain and the mean is dragged down by the fast majority into looking healthy.
  • Instrument percentage-of-capacity for every bounded resource — link utilization, conntrack entries, ephemeral ports, queue depth — so a table or pool filling pages you before it drops traffic at low bandwidth.
  • Require conditions to hold for a duration and frame thresholds as error budgets over a window, so transient spikes spend budget quietly and only sustained burn pages a human.
  • Treat symptoms as the page and causes as the investigation, so a single latency or error alert covers root causes you never enumerated and the cause dashboards explain which one fired this time.
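The error-budget framing comes down to simple arithmetic, sketched below. The function names are hypothetical, and the traffic level (1,000 requests per second over a 30-day window) and the 1% observed error ratio are assumptions for illustration; only the 0.1% budget comes from the text.

```python
def allowed_errors(total_requests: float, budget_fraction: float) -> float:
    """How many failed requests the budget permits over the window."""
    return total_requests * budget_fraction


def burn_rate(observed_error_ratio: float, budget_fraction: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return observed_error_ratio / budget_fraction


def days_until_exhausted(window_days: float, rate: float) -> float:
    """Days a full budget lasts if the current burn rate continues."""
    return window_days / rate


WINDOW_DAYS = 30
requests_in_window = 1_000 * WINDOW_DAYS * 24 * 3600  # 1,000 requests per second

budget = allowed_errors(requests_in_window, 0.001)  # 0.1% of requests
rate = burn_rate(observed_error_ratio=0.01, budget_fraction=0.001)
days_left = days_until_exhausted(WINDOW_DAYS, rate)
```

Under these assumptions the budget is roughly 2.6 million failed requests per window, and a sustained 1% error ratio burns it ten times too fast, leaving about three days. A short blip barely moves that total, which is why it can spend budget without paging.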
Comparable concepts: SRE golden signals (the source); SLO / error budgets (the framework)

Knowledge Check

Why does the SRE practice favor paging on symptoms over paging on causes like high CPU?

  • A few symptom alerts cover causes you never anticipated, while cause alerts fire on healthy-but-busy systems
  • Symptom alerts identify the exact root cause automatically, so no follow-up investigation is ever needed at all
  • Cause metrics such as device CPU are far more expensive and slower to collect than user-facing latency is
  • High CPU is always an outage, so it is the most reliable thing to page on

Why alert on p99 latency rather than the average?

  • The average is dragged down by fast requests and hides a slow tail that p99 exposes
  • The p99 is much cheaper and faster to compute than a plain average over the same window of requests
  • The p99 smooths transient spikes out of the signal so that the latency alert never flaps on them
  • The p99 measures how many total requests arrived in the window, which a simple average cannot do

A firewall's conntrack table fills and new connections start failing, but the interface graph reads 40% utilization. What does this teach about saturation alerting?

  • Bounded resources like the conntrack table need their own capacity alerts, since bandwidth graphs miss them
  • The interface utilization counter was simply reading wrong and the device graph should be recalibrated
  • Raising the link-utilization alert threshold a little higher would have caught this particular failure in time to act
  • Packet capture is the only way to detect this class of failure
