Chapter 12: Production Operations
Topic 69

State Surgery

Sometimes you must operate on state directly: remove a resource Terraform should forget but not destroy (state rm), rename or move a resource's address (state mv), or force a resource to be recreated on the next apply (-replace). These are the sharp tools in the kit. A wrong state rm orphans a live resource that keeps billing; a wrong state mv corrupts the mapping between your code and reality. Using them deliberately — and knowing when a declarative alternative is better — is a defining production skill.

The governing rule for all of them: look before you cut. Every surgery command takes a resource address, and the difference between fixing a problem and creating one is usually whether you confirmed the exact address first.

Removing from State

terraform state rm ADDRESS tells Terraform to forget a resource without touching the real object. The resource stays alive in AWS; Terraform simply stops managing it. This is the right tool when you are handing a resource off to another state or another team, or cleaning up after a bad import that pulled in something you did not mean to manage.

state rm — forget, do not destroy
# Terraform stops tracking the bucket; the bucket still exists in AWS
terraform state rm aws_s3_bucket.legacy_logs

The orphan risk is the whole danger. After state rm, nothing in Terraform knows the resource exists, so nothing will ever clean it up: it keeps running and keeps billing, invisible to every future plan. state rm is for handing off management on purpose, never for "getting rid of" a resource; for that you want terraform destroy, or deleting the resource block and applying.
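One way to keep a state rm deliberate is to preview it first. The sketch below assumes a POSIX shell and a hypothetical helper name, forget_resource. It relies on the -dry-run flag of terraform state rm, which reports the matching resource instances without forgetting any of them.

```shell
# Hypothetical helper, not part of Terraform: preview, confirm, then forget.
forget_resource() {
  addr="$1"

  # -dry-run lists what would be removed from state and changes nothing.
  terraform state rm -dry-run "$addr" || return 1

  printf 'Forget %s? The real object will NOT be destroyed. [y/N] ' "$addr"
  read -r answer
  if [ "$answer" != "y" ]; then
    echo "aborted; state unchanged" >&2
    return 1
  fi

  terraform state rm "$addr"
}

# Usage (quote indexed addresses so the shell leaves the brackets alone):
#   forget_resource aws_s3_bucket.legacy_logs
#   forget_resource 'aws_instance.web[0]'
```

The prompt is the point of the design: the operator reads the dry-run output, then makes an explicit choice before state changes.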

Moving an Address

terraform state mv OLD NEW changes a resource's address in state without destroying and recreating the object. It is the imperative way to record a rename or a move into a module. It works, but it is a one-off command run by hand: it has to be repeated against every state that holds the resource (each environment, each workspace, any local copy), it has to be timed to match the code change or the next plan proposes a destroy and create, and none of it shows up in a plan or a review.

state mv — imperative rename in place
# rename the address; the underlying instance is untouched
terraform state mv aws_instance.web aws_instance.api

# move a resource into a module
terraform state mv aws_instance.api module.compute.aws_instance.this

For anything in shared state, a moved block in your configuration is the better tool — it is in version control, shows up in the plan, and applies once for everyone including CI. Reserve state mv for genuine one-off surgery where no declarative path fits.

state mv vs a moved block
state mv
Imperative. Run by hand against each state that holds the resource. Bypasses the plan and review entirely.
moved block
Declarative. Lives in your code, shows up in the plan, gets reviewed, and applies once for everyone including CI.
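As a minimal sketch of the declarative route, the snippet below writes a moved block that records the same rename as the first state mv example. The file name moves.tf is an assumption; the block can live in any .tf file of the module that declares the resource. The sketch also assumes the resource block itself has already been renamed to aws_instance.api.

```shell
# Write the move into configuration so it is committed and reviewed.
# moves.tf is a hypothetical file name.
cat > moves.tf <<'EOF'
moved {
  from = aws_instance.web
  to   = aws_instance.api
}
EOF

# Next step, run by you and by CI alike:
#   terraform plan    # should report the move, with no destroy or create for it
```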

Forcing Recreation with -replace

When a resource is in a bad runtime state that its config does not capture — a corrupted instance, a stuck cache node — you force Terraform to destroy and recreate it on the next apply with -replace=ADDRESS. This is the successor to the old taint workflow: where terraform taint mutated state out of band so the next plan recreated the resource, -replace is a plan-time flag that shows the replacement in the plan you review before it happens.

-replace — superseded the taint workflow
# show the destroy/recreate in the plan, then apply it
terraform plan -replace=aws_instance.api
terraform apply -replace=aws_instance.api

# the old way — deprecated, mutated state with no plan preview
# terraform taint aws_instance.api

-replace is strictly better because it is reviewable: the destroy-then-create appears as -/+ in the plan output, so you see exactly what is about to happen instead of discovering it after a separate taint command already changed state. Prefer it over taint on Terraform v0.15.2 and later, where -replace is available.
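To make sure the replacement you reviewed is the one that runs, the plan can be saved and then applied as a file. This is a sketch with hypothetical function names and an assumed plan file name, replace.tfplan; -replace and -out are standard terraform plan options.

```shell
# Hypothetical helpers: plan the forced replacement, save it, apply exactly that plan.
plan_replace() {
  addr="$1"
  plan_file="${2:-replace.tfplan}"   # assumed file name
  # The plan output typically marks the resource -/+ (destroy, then create).
  terraform plan -replace="$addr" -out="$plan_file"
}

apply_reviewed_plan() {
  plan_file="${1:-replace.tfplan}"
  # Applying a saved plan runs what was reviewed, without planning again.
  terraform apply "$plan_file"
}

# Usage:
#   plan_replace aws_instance.api
#   apply_reviewed_plan
#   plan_replace 'aws_instance.web["api"]' web-api.tfplan   # quote indexed addresses
```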

Inspecting Before Surgery

No surgery command should run before state list and state show. state list prints every address in the state so you can confirm the exact target — including the indexed forms (aws_instance.web[0], aws_instance.web["api"]) that are easy to get wrong from memory. state show ADDRESS dumps a resource's recorded attributes so you can verify you are operating on the object you think you are.

look before you cut
# list every address in state, filter to what you care about
terraform state list | grep instance

# confirm the exact resource and its attributes
terraform state show aws_instance.api

A targeting mistake on a surgery command is a self-inflicted incident: state rm on the wrong address orphans the wrong resource, state mv to the wrong target corrupts the mapping. The thirty seconds spent confirming the address is the cheapest insurance in the chapter.
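The inspection step can be wrapped so that a surgery target is only accepted when it matches an address in state exactly. This is a sketch under assumptions: confirm_address is a hypothetical helper name, and the matching is done with grep rather than any Terraform feature.

```shell
# Hypothetical helper: show a resource only if the address is an exact match.
confirm_address() {
  addr="$1"

  # -x requires a whole-line match; -F treats brackets and quotes literally.
  if terraform state list | grep -qxF -- "$addr"; then
    terraform state show "$addr"
  else
    echo "no exact match for '$addr'; candidates:" >&2
    # Strip any [index] suffix and list addresses sharing the same base.
    terraform state list | grep -F -- "${addr%%\[*}" >&2
    return 1
  fi
}

# Usage:
#   confirm_address aws_instance.api
#   confirm_address 'aws_instance.web["api"]'
```

With this check, a bare aws_instance.web is rejected when state only holds aws_instance.web[0] and aws_instance.web[1], which forces the operator to name the instance they mean.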

Backups and Recovery

Every state subcommand that modifies state (rm, mv, and the like) writes a local backup file before it changes anything, named with a timestamp so you can restore the prior state if a command went wrong; read-only subcommands such as list and show write nothing. That local backup is a convenience, not the real safety net: it lives on the operator's machine and is easy to lose.

The dependable recovery mechanism is versioning on the remote state bucket. With S3 bucket versioning enabled, every state write — including the ones your surgery commands make — is a recoverable object version, so a botched state rm is a restore of the previous version rather than a reconstruction project. Rely on bucket versioning as the backstop and treat the local backup as the quick first option.
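Checking the backstop before surgery can be sketched with the AWS CLI. The helper names are hypothetical, and the bucket and key arguments are placeholders for whatever your S3 backend configuration uses; the s3api subcommands themselves are standard.

```shell
# Hypothetical helpers built on the AWS CLI. Bucket and key are placeholders
# for the values in your S3 backend configuration.

# Fail unless versioning is enabled, then list the stored versions of the state object.
list_state_versions() {
  bucket="$1"
  key="$2"
  status=$(aws s3api get-bucket-versioning --bucket "$bucket" \
    --query Status --output text) || return 1
  if [ "$status" != "Enabled" ]; then
    echo "versioning is not enabled on bucket: $bucket" >&2
    return 1
  fi
  aws s3api list-object-versions --bucket "$bucket" --prefix "$key" \
    --query 'Versions[].[VersionId,LastModified,IsLatest]' --output text
}

# Download one earlier version to a local file for inspection before restoring it.
fetch_state_version() {
  bucket="$1"
  key="$2"
  version_id="$3"
  out_file="${4:-restored.tfstate}"   # assumed file name
  aws s3api get-object --bucket "$bucket" --key "$key" \
    --version-id "$version_id" "$out_file"
}
```

How the fetched file goes back is a separate decision. Copying it over the current object and terraform state push are both options, and state push may refuse a file whose serial is older than the current state.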

state mv vs moved block

state mv: an imperative, one-off command run by hand against each state that holds the resource, bypassing review entirely. Reserve it for genuine one-off surgery where no declarative path fits, such as fixing one operator's diverged state or an awkward move no moved block expresses cleanly.

moved block: a declarative statement in your configuration that is in version control, visible in the plan, reviewed like any change, and applied exactly once for everyone including CI. Use it for any shared refactor, such as a rename or a module extraction that every state needs to follow.

Common Mistakes
  • Running state rm and forgetting the resource still exists in AWS, orphaning it so it keeps running and billing with nothing tracking it.
  • Using state mv for a refactor everyone needs instead of a moved block, leaving every other environment's state out of step with the code until someone runs the command against it by hand.
  • Operating on state without state list or state show first and targeting the wrong address — orphaning or corrupting a resource you never meant to touch.
  • Still reaching for terraform taint to force recreation instead of -replace, giving up the plan preview that shows the destroy/recreate before it happens.
  • Trusting the local timestamped backup alone when surgery goes wrong, instead of S3 bucket versioning that survives losing the operator's machine.

Best Practices
  • Inspect with state list and state show before any surgery to confirm the exact target address, including indexed forms.
  • Use moved blocks for shared refactors; reserve state mv and state rm for true one-off surgery.
  • Prefer -replace over the deprecated taint workflow so the destroy/recreate shows up in a plan you review first.
  • Remember state rm forgets without destroying — use destroy when you actually want the resource gone.
  • Rely on S3 bucket versioning as the recovery mechanism and verify a known-good version exists before risky surgery.

Comparable tools
CloudFormation
Limited equivalents via resource import and removal.
Pulumi
state delete, state rename, and --replace.
Ansible
No direct equivalent; it keeps no state to operate on.

Knowledge Check

What does terraform state rm do to the real resource?

  • Nothing — the resource keeps running in AWS; Terraform just stops tracking it, risking an orphan
  • Destroys it immediately in AWS, exactly the same as running terraform destroy against that resource
  • Schedules the underlying resource for deletion on the very next apply
  • Moves it into a separate quarantine state file until you choose to restore it

Why is a moved block preferred over state mv for a shared refactor?

  • It is in version control, shows up in the plan, and applies once for everyone including CI
  • It is faster because it edits the state directly without making any AWS API call
  • It can move resources between two completely separate state files in one step, which state mv cannot
  • It encrypts the moved resource's attributes inside the state file

What did -replace supersede, and why is it better?

  • It superseded taint; the destroy/recreate shows up in the plan you review before it happens
  • It superseded state mv; it moves the resource to a new address instead of recreating it from scratch
  • It superseded destroy; it removes the resource without a confirmation prompt
  • It superseded import; it brings an existing resource under management faster

A state rm went wrong and you need to recover. What is the dependable mechanism?

  • Restore the prior object version from the versioned S3 state bucket
  • Re-run the same command, which automatically reverses itself
  • Run terraform refresh, which rebuilds the removed entry from AWS
  • Delete the state file so Terraform regenerates it from the configuration
