Topic 04

Incidents, On-Call, Postmortems & Maintenance

Concept

Things break in production — that's normal, not a sign of failure. What separates good teams is how they respond: someone is on call, they follow an incident process, and afterward they run a blameless postmortem to learn. And all of this lives inside the longest, least-glamorous phase of all: maintenance.

An incident is a serious problem in production that needs urgent attention — the app is down, data isn't saving, reminders have stopped. How a team handles incidents says a lot about its maturity.

A fire department is the model: there's always a crew on duty (on-call), a drilled response for when the alarm sounds (incident response), and a no-blame review after every call to get better (the postmortem).

Handling an incident, from alarm to lesson
Detect → Respond → Mitigate → Resolve → Postmortem

On-Call

On-call is a rotation that ensures someone is always ready to respond if production breaks, day or night. Team members take turns carrying the responsibility (and the pager) for a stretch — a week, say — so the load is shared and there's always a known person to react. On-call is what makes "we'll fix it fast" a real promise instead of a hope that someone happens to be awake.
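The weekly rotation described above can be sketched in a few lines. This is a minimal illustration only: the roster names, the start date, and the `on_call_for` function are hypothetical, and real teams usually rely on a paging tool that also handles swaps, holidays, and escalation.

```python
from datetime import date

# Hypothetical roster; the names are placeholders, not from the course.
ROSTER = ["Asha", "Ben", "Chen"]
# Assumed start of the first shift (a Monday).
ROTATION_START = date(2024, 1, 1)
# One-week shifts, matching the "a week, say" example in the text.
SHIFT_DAYS = 7


def on_call_for(day: date) -> str:
    """Return the person carrying the pager on the given day."""
    if day < ROTATION_START:
        raise ValueError("the rotation had not started on that date")
    shifts_elapsed = (day - ROTATION_START).days // SHIFT_DAYS
    # Wrap around the roster so the load is shared evenly.
    return ROSTER[shifts_elapsed % len(ROSTER)]


print(on_call_for(date(2024, 1, 3)))   # falls in the first week: Asha
print(on_call_for(date(2024, 1, 8)))   # falls in the second week: Ben
```

Because the answer depends only on the date, anyone can look up the known person to contact without asking around.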

Incident Response

Incident response is the practised process for handling a live problem: detect it (usually via an alert), mitigate it to stop the bleeding, communicate clearly to those affected, and then resolve the underlying cause. The first priority is almost always to stop the harm — get the system working again — even with a temporary fix, before hunting down the real root cause. Calm, rehearsed response beats panic every time.
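One way to picture the ordering above is as a tiny state tracker that refuses to skip mitigation. This is a sketch under assumptions: the `Stage` and `Incident` names are invented for illustration and do not describe any particular incident tool.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    DETECTED = 1
    MITIGATED = 2   # harm stopped, possibly with a temporary fix
    RESOLVED = 3    # underlying cause fixed


@dataclass
class Incident:
    summary: str
    stage: Stage = Stage.DETECTED
    updates: list[str] = field(default_factory=list)

    def communicate(self, message: str) -> None:
        """Record a status update for the people affected."""
        self.updates.append(f"[{self.stage.name}] {message}")

    def advance(self, new_stage: Stage) -> None:
        """Move to the next stage; skipping a stage is rejected."""
        if new_stage.value != self.stage.value + 1:
            raise ValueError(
                f"cannot go from {self.stage.name} to {new_stage.name}"
            )
        self.stage = new_stage


incident = Incident("Reminders are not being sent")
incident.communicate("Investigating; reminders are delayed.")
incident.advance(Stage.MITIGATED)
incident.communicate("Temporary fix applied; harm stopped.")
incident.advance(Stage.RESOLVED)
incident.communicate("Underlying cause fixed.")
print("\n".join(incident.updates))
```

The check in `advance` encodes the priority from the text: an incident cannot be marked resolved until the harm has been stopped first.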

Blameless Postmortems

After an incident, good teams hold a blameless postmortem — a write-up and discussion of what went wrong, why, and how to prevent it. The crucial word is blameless: the focus is on the system that allowed the failure, not on punishing a person. When people aren't afraid of blame, they share honestly, the real causes surface, and the system genuinely improves. Blaming individuals just teaches everyone to hide problems — the opposite of what you want.

Maintenance

Surrounding all of this is maintenance — the long tail of fixing bugs, patching security holes, updating dependencies, and making small improvements for years. It's the least glamorous phase and, as Chapter 2 noted, where most of a system's life and budget actually go. Far from leftover work, maintenance is the job for most of a product's existence, and it loops straight back to planning the next change.

When Cadence's reminders go down at 2 AM, the on-call engineer is paged, logs in, and mitigates fast by switching reminders off via the feature flag to stop the harm. The next day the team holds a blameless postmortem and finds the underlying gap: a missing alert that would have warned them hours earlier. Nobody is blamed; the fix is to add that alert. Then Cadence settles back into ongoing maintenance, a little better than before, because the failure taught them something.
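The feature-flag mitigation in the Cadence story might look roughly like the sketch below. Everything here is assumed for illustration: the `FLAGS` dictionary, the `reminders_enabled` flag name, and the `send_reminder` function are hypothetical, and a real system would more likely read flags from a config service so they can change without a deploy.

```python
# Hypothetical in-memory flag store (an assumption for this sketch).
FLAGS = {"reminders_enabled": True}


def send_reminder(user: str, message: str, outbox: list[str]) -> bool:
    """Queue a reminder unless the flag is off. Returns True if queued."""
    # A missing flag is treated as "off", which is the safer default.
    if not FLAGS.get("reminders_enabled", False):
        return False
    outbox.append(f"to {user}: {message}")
    return True


outbox: list[str] = []
send_reminder("sam", "Stand-up at 9:00", outbox)   # queued
FLAGS["reminders_enabled"] = False                  # on-call engineer mitigates
send_reminder("sam", "Review at 14:00", outbox)    # skipped, nothing is sent
print(outbox)  # ['to sam: Stand-up at 9:00']
```

Flipping one value stops the harm in seconds, which buys the team time to look for the underlying problem without pressure.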

Common Confusions
  • "An incident means someone messed up and should be blamed." Blameless culture targets the system that allowed the failure, not the person. Things break; learning from it beats punishing someone.
  • "Maintenance is the boring leftover work after the real job." It's the majority of the work and budget over a product's life — for most of a system's existence, maintenance is the real job.
  • "A postmortem is about finding who is at fault." It's about finding what to fix: the system gap, the missing alert, the unclear process. Naming a culprit teaches people to hide problems.

Why It Matters
  • This closes the lifecycle honestly: software is run, not just built, and breaking-then-recovering is a normal, managed part of that.
  • The blameless culture introduced here is one of the strongest predictors of healthy teams — and it returns in the final chapter.
  • "On-call", "incident", and "postmortem" are everyday operations words, and seeing maintenance as most of the job gives a realistic picture of a developer's career.

Knowledge Check

What is "on-call"?

  • A rotation so someone is always ready to respond if production breaks
  • A single planning meeting where the whole project is scheduled out in advance
  • A technique for writing the application's code entirely by phone
  • The rule that all deployments must happen at night-time only

What is usually the first priority in incident response?

  • Stop the harm and get the system working, even with a temporary fix
  • Immediately track down and punish whoever caused the problem
  • Completely understand the root cause before taking any action at all
  • Write detailed documentation about the incident before fixing anything

What makes a postmortem "blameless"?

  • It focuses on the system that allowed the failure, not on blaming a person
  • It carefully identifies which person was at fault and holds them to it
  • It is conducted without recording anything at all about what actually happened
  • It is skipped completely so that no difficult conversations are needed

How does the topic describe the maintenance phase?

  • The majority of the work and budget over a product's life, not leftovers
  • A brief final phase that finishes very quickly soon after the launch date
  • The early phase where the original application code is first written
  • Optional busywork that experienced teams simply skip to save time

Why does blameless culture lead to safer systems?

  • When people aren't afraid of blame, they share honestly and causes surface
  • Because blaming individuals makes everyone work much harder out of sheer fear
  • Because punishing people for every mistake reliably prevents all bugs
  • Because a blame-free team simply makes the software run much faster
