Chapter Twelve

Production Operations

Keeping a live Terraform estate fast, recoverable, and current: large-state performance, state surgery, version upgrades, zero-downtime replacement, state disaster recovery, and debugging when the engine misbehaves.

6 topics

Everything up to this point has assumed that Terraform behaves. In production it sometimes does not: a plan that took ten seconds now takes ten minutes, a refactor needs to move a resource between states by hand, a provider upgrade rewrites half the plan, a replacement drops traffic, a state file gets corrupted, or a resource shows a diff that never goes away no matter how many times you apply.

This chapter is the operational half of running Terraform at scale. It covers why large state is slow and how to split it, the sharp state-surgery commands and when each is the right one, upgrading core and providers without breakage, replacing serving resources without downtime, recovering state after a corrupted write, and reading the engine's own logs when nothing else explains what it is doing.

Topics in This Chapter