Chapter Twelve
Production Operations
Keeping a live Terraform estate fast, recoverable, and current — large-state performance, state surgery, version upgrades, zero-downtime replacement, state disaster recovery, and debugging when the engine misbehaves.
Everything to this point assumed Terraform behaves. In production it sometimes does not: a plan that took ten seconds now takes ten minutes, a refactor needs to move a resource between states by hand, a provider upgrade rewrites half the plan, a replacement drops traffic, a state file gets corrupted, or a resource shows a diff that never goes away no matter how many times you apply.
This chapter is the operational half of running Terraform at scale. It covers why large state is slow and how to split it, the sharp state-surgery commands and when each is the right one, upgrading core and providers without breakage, replacing serving resources without downtime, recovering state after a corrupted write, and reading the engine's own logs when nothing else explains what it is doing.
Topics in This Chapter
-target, -refresh=false, and -parallelism are surgical tools rather than routine speedups.
state rm to forget without destroying, state mv versus a moved block, -replace as the taint successor, and inspecting with state list before you cut.
create_before_destroy, load-balancer draining, ASG instance refresh, and why stateful resources need a different strategy.
TF_LOG levels and TF_LOG_PATH, reading the provider's API calls, diagnosing a perpetual diff, finding where an apply hangs, and filing a useful crash report.