Topic 29

On-Call and Incidents

Concept

An alert is only useful if a human is there to answer it. So picture the alert from the last topic firing at 3am: Pageturn's error rate has crossed the line, and a notification goes out. Someone has to wake up and deal with it. That readiness — a person on the hook to respond when something breaks — has a name, and so does the kind of problem that wakes them.

Being ready to respond to alerts is called being on call. A serious problem in live software (Pageturn down, or failing for many readers) is an incident. How a team handles an incident, and especially what it does afterward, is one of the most telling parts of DevOps culture: fix it fast, then fix the cause without blaming a person.

From a 3am alert to a lesson learned — the shape of handling an incident

Alert fires (something's wrong) → On-call responds (someone takes it) → Restore service (users first) → Blameless postmortem (study why, not who)

What does "on call" mean?

Being on call means a team member is the designated person to respond if an alert fires — including outside normal hours. It's a rota: this week it's Sam, next week someone else, so the responsibility is shared and nobody carries it alone forever. When the alert goes off, the on-call person is the one expected to pick it up.
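The weekly rota described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular paging tool works: only Sam comes from the text, while the other names, the start date, and the function `on_call_for` are assumptions made up for the example.

```python
from datetime import date

# Hypothetical rota: only "Sam" appears in the text, the other names are invented.
ROTA = ["Sam", "Priya", "Lee"]
# Assumed start of the rota; 1 January 2024 was a Monday.
ROTA_START = date(2024, 1, 1)


def on_call_for(day: date, rota=ROTA, start=ROTA_START) -> str:
    """Return who is on call on `day`, handing over to the next person each week."""
    weeks_elapsed = (day - start).days // 7
    # The modulo wraps back to the first person once everyone has had a turn.
    return rota[weeks_elapsed % len(rota)]


print(on_call_for(date(2024, 1, 3)))  # first week of the rota
print(on_call_for(date(2024, 1, 8)))  # second week of the rota
```

Because the rota wraps around, the duty returns to Sam after every other person has taken a week, which is the sharing the paragraph describes.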

On call does not mean working around the clock. Most of the time nothing fires and the on-call person lives their normal life; they're simply reachable if something does break. It's like a fire department on a quiet night — not fighting fires every hour, just ready to roll the moment the bell rings. (We'll keep the fire department for one more idea, then drop it.)

What is an incident?

An incident is a real problem in production that needs action now — Pageturn is down, or sign-ups are failing for a lot of readers, or the site has slowed to a crawl. The word draws a line: not every small hiccup is an incident, but anything genuinely hurting users in the live system is. It's the difference between a smoke detector chirping and an actual fire.

Naming something an incident matters because it changes what happens next. It says: this is worth interrupting people for, worth dropping other work, worth a coordinated response right now. That shared signal is how a team moves fast together instead of each person quietly hoping someone else has noticed.

Responding: stop the bleeding first

The response to an incident runs in a rough order: detect it, respond, restore service, and communicate along the way. The first priority is almost always to get Pageturn working for readers again — even with a quick, temporary fix — before chasing down the deep root cause. Stop the bleeding now; understand the wound afterward.

This is where rollback from Chapter 8 earns its keep: if a recent deploy caused the incident, the fastest cure is often to roll straight back to the last good version, restoring service in seconds while the real fix waits for daylight. Communicating matters too — telling teammates and users what's happening keeps a tense situation from turning into a confused one.
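The ordering above (communicate, restore service, and only then chase the root cause) can be sketched as a small planning function. Everything here is an assumption for illustration: the `Incident` class, the `plan_response` function, and the two-hour window for what counts as a "recent" deploy are not from the text, and a real team would decide these case by case.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed cut-off for treating a deploy as a likely cause; purely illustrative.
RECENT_DEPLOY_WINDOW_MIN = 120


@dataclass
class Incident:
    summary: str
    # Minutes since the last deploy, or None if no recent deploy is known.
    minutes_since_last_deploy: Optional[float] = None


def plan_response(incident: Incident) -> List[str]:
    """Return response steps in order: restore service first, root cause last."""
    steps = ["tell teammates and users what is happening"]
    since = incident.minutes_since_last_deploy
    if since is not None and since <= RECENT_DEPLOY_WINDOW_MIN:
        # A recent deploy is the likely trigger, so rollback is the quickest cure.
        steps.append("roll back to the last good version")
    else:
        steps.append("apply a temporary fix to restore service")
    steps.append("confirm readers can use the site again")
    steps.append("investigate the root cause in a blameless postmortem")
    return steps


for step in plan_response(Incident("Pageturn is down", minutes_since_last_deploy=60)):
    print(step)
```

Whichever branch is taken, the root-cause step always comes last, after service is confirmed restored.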

The blameless postmortem: study why, not who

Once service is restored, the most important part begins. A postmortem is a write-up after an incident that asks what happened and why, and how to stop it from happening again. The crucial word is blameless: it studies the cause, not the culprit. The question is "why did the system let this happen?" — never "whose fault was it?"

That distinction is the heart of it. When people fear blame, they hide mistakes, and hidden mistakes can't be fixed; when a team studies causes without pointing fingers, problems get surfaced and prevented. This blameless habit is so central to how DevOps teams actually work that Chapter 12 returns to it as a pillar of the culture. And the watching side — reading the logs, metrics, and alerts that reveal what really happened during an incident — is the craft the Observability Deep Dive takes much further.
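One way to make the blameless idea concrete is to look at what a postmortem record holds. The sketch below is an assumed structure, not a standard template: the `Postmortem` class, its field names, and the `render` helper are hypothetical. The point it illustrates is that the record has a place for what happened, why the system allowed it, and how to prevent it, and no field for who was involved.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Postmortem:
    """A blameless write-up: it records what and why, with no field for 'who'."""
    what_happened: str
    why_system_allowed_it: str
    prevention_actions: List[str] = field(default_factory=list)


def render(pm: Postmortem) -> str:
    """Format the postmortem as plain text for sharing with the team."""
    lines = [
        "What happened: " + pm.what_happened,
        "Why the system allowed it: " + pm.why_system_allowed_it,
        "How to prevent it:",
    ]
    lines += ["  - " + action for action in pm.prevention_actions]
    return "\n".join(lines)


# Example content is invented for illustration.
pm = Postmortem(
    what_happened="Sign-ups failed for readers after a deploy",
    why_system_allowed_it="No automated check exercised the sign-up path",
    prevention_actions=["Add a sign-up test to the deploy pipeline"],
)
print(render(pm))
```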

Common Confusions

"On-call means working 24 hours a day." It means being reachable and ready if an alert fires, not actively working the whole time. Most shifts are quiet; it's usually shared on a rota so no one carries it alone.
"An incident review is about finding whose fault it was." A blameless postmortem studies why the system allowed the problem and how to prevent it — not who to blame. Hunting for a culprit just makes people hide the next mistake.
"Once the site is back up, the work is finished." Restoring service is only half of it. The learning afterward — finding and fixing the root cause so it can't recur — is the part that actually makes the system better.
"During an incident you should find the root cause before doing anything else." The first priority is usually to restore service for users — even with a temporary fix or a rollback — and dig into the deep cause once the bleeding has stopped.

Why It Matters

On-call and incident response are where "you run it" gets real — the team that built Pageturn is the team that gets paged when it breaks.
The blameless postmortem is a defining habit of DevOps culture, and it's exactly where this chapter hands off to Chapter 12.
Restoring service before chasing root cause — and reaching for rollback to do it — is how real teams keep an outage short instead of long.
Reading logs, metrics, and alerts to reconstruct an incident is a craft in its own right, and the doorway to the Observability Deep Dive.

Knowledge Check

What does it mean for someone on the team to be "on call"?

They're the designated person ready to respond if an alert fires
They have to actively work all twenty-four hours of every day without any rest
They write the report explaining why a past incident happened
They watch every chart continuously so no alert is ever needed

Pageturn goes down because of a deploy an hour ago. What should the on-call person prioritize first?

Get the site working for readers again, even with a quick fix or rollback
Fully investigate and document the deep root cause before touching the live site at all
Work out which engineer is at fault before doing anything else
Wait for readers to complain before deciding whether to act

What makes a postmortem "blameless," and why does that matter?

It studies why it happened, not who did it, so problems get surfaced
It means the team skips the review entirely to spare feelings
It carefully identifies exactly which individual person should be held responsible
It is the quick fix that brings the site back during the incident
