Chapter Eleven

Keeping It Alive: Running Software in Production

Shipping is the middle of the story, not the end. Once software is live, someone has to keep it running, watch how it behaves, and respond when it breaks. This chapter covers the half of the job nobody pictures: deploying and hosting; observing software through logs, metrics, and traces; setting reliability targets; and handling incidents, along with the long, quiet phase of maintenance.

Back in Chapter 2 we said most of a system's life happens after launch. This is the chapter about that life — the unglamorous, essential work of running software in production, where real users meet it and real problems appear.

Four topics. First, deployment and where software actually runs. Then observability: how you see inside a running system through logs, metrics, and traces. Then how teams set and measure reliability targets, with monitoring and the vocabulary of service level indicators, objectives, and agreements (SLIs, SLOs, and SLAs). And finally, how teams respond when things break (incidents, on-call, blameless postmortems) and the long maintenance phase that surrounds it all.

The work of keeping software alive

Deploy it somewhere: a server, a container, the cloud

Watch how it behaves: logs, metrics, and traces

Set and measure targets: monitoring, alerting, SLOs

Respond and improve: incidents, postmortems, maintenance