Chapter Twelve

Production Operations

Keeping a live Terraform estate fast, recoverable, and current — large-state performance, state surgery, version upgrades, zero-downtime replacement, state disaster recovery, and debugging when the engine misbehaves.

6 topics

Everything to this point assumed Terraform behaves. In production it sometimes does not: a plan that took ten seconds now takes ten minutes, a refactor needs to move a resource between states by hand, a provider upgrade rewrites half the plan, a replacement drops traffic, a state file gets corrupted, or a resource shows a diff that never goes away no matter how many times you apply.

This chapter is the operational half of running Terraform at scale. It covers why large state is slow and how to split it, the sharp state-surgery commands and when each is the right one, upgrading core and providers without breakage, replacing serving resources without downtime, recovering state after a corrupted write, and reading the engine's own logs when nothing else explains what it is doing.

Topics in This Chapter

Large State and Performance

Why per-resource refresh makes plans slow as resource count grows, splitting one large state along blast-radius boundaries, and why -target, -refresh=false, and -parallelism are surgical tools rather than routine speedups.

Operations, State

State Surgery

Operating on state directly: state rm to forget without destroying, state mv versus a moved block, -replace as the taint successor, and inspecting with state list before you cut.

State, Operations

Upgrading Providers and Versions

Staying current without a giant deferred migration: reading upgrade guides, bumping one major version at a time, the one-way state-format upgrade on core, and coordinating a team on a single version.

Tooling, Operations

Zero-Downtime Resource Replacement

Why a default replacement causes an outage, and the recipe to avoid it: create_before_destroy, load-balancer draining, ASG instance refresh, and why stateful resources need a different strategy.

Operations, Compute

Disaster Recovery for State

The failure modes that lose or corrupt state, S3 bucket versioning as the safety net, the restore-and-reconcile procedure, and import-based reconstruction as the last resort you never want.

Debugging Terraform

Seeing what Terraform actually does: TF_LOG levels and TF_LOG_PATH, reading the provider's API calls, diagnosing a perpetual diff, finding where an apply hangs, and filing a useful crash report.

Operations, Tooling