Topic 72

Disaster Recovery for State


State is the most critical and most fragile thing Terraform owns. Lose it or corrupt it and Terraform no longer knows what it manages — even though every resource it created is still running. The infrastructure is fine; the map to it is gone. The next plan, with an empty or wrong state, proposes to create everything from scratch, which collides with the resources that already exist.

Disaster recovery for state means versioning, backups, and a procedure you have actually rehearsed. Done right, a corrupted write or a fat-fingered state rm is a two-minute rollback to a prior version; done wrong, it is a reconstruction project where you rebuild the map by hand against live infrastructure you are afraid to touch.

The Failure Modes

State fails in a handful of recognizable ways. A corrupted write leaves the file truncated or malformed, usually from an interrupted apply. An accidental state rm removes entries that should have stayed. A deleted state object (someone cleaned up the wrong S3 key) leaves Terraform with nothing. And concurrent writes against a backend with no effective locking, whether an unlocked remote backend or a local file shared by hand across machines, interleave two applies into a file that reflects neither correctly.

Every one of these shares the same shape: the infrastructure is untouched, but Terraform's record of it is wrong or missing. That is what makes state DR tractable — you are recovering a record, not rebuilding servers. The whole game is having a known-good version of that record to restore.
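The concurrent-write failure mode is the one you can prevent at the backend rather than recover from. A minimal sketch of an S3 backend with locking turned on follows; the bucket and key match the examples used elsewhere in this topic, while the region and the lock table name are assumptions for illustration.

```hcl
terraform {
  backend "s3" {
    bucket  = "acme-tfstate"
    key     = "apps/checkout/terraform.tfstate"
    region  = "us-east-1" # assumed region
    encrypt = true

    # Hypothetical DynamoDB table name; the table must already exist
    # with a string partition key named "LockID".
    dynamodb_table = "acme-tfstate-locks"

    # On Terraform 1.10 and later, S3-native locking is available instead:
    # use_lockfile = true
  }
}
```

With a lock in place, a second apply against the same state waits or fails instead of interleaving its write with the first.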

S3 Versioning as the Safety Net

Bucket versioning on the state bucket is the primary recovery mechanism, and the single most important thing in this topic. With it enabled, every write to the state object — every apply, every surgery command — creates a new object version, and the previous versions stay retrievable. Recovery from almost any failure becomes restoring the object version from just before the bad write.

Recovering a broken state: corrupted / lost state → restore prior S3 version → reconcile drift → plan to confirm

versioning on the state bucket — turn it on first

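# Keep every prior version of the state object so a bad write can be rolled back.
# aws_s3_bucket.tfstate is the state bucket, defined elsewhere.
resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id
  versioning_configuration {
    status = "Enabled"
  }
}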

A state bucket without versioning is an accident waiting to happen, because a corrupted write overwrites the only copy and there is no prior version to go back to. Enable versioning before the bucket holds anything that matters; it is the difference between a quick rollback and a reconstruction.

Backups

Terraform's state commands write an automatic local backup before they mutate state, timestamped so you can restore the prior file. That backup is real but limited: it lives on the machine of whoever ran the command, covers only commands run from that machine, and is easy to lose. It is a convenience, not a disaster-recovery plan.

For critical states, add deliberate periodic backups on top of versioning — a scheduled copy of the state object to a separate bucket or account, so a failure that takes out the primary bucket itself still leaves a recovery point. Versioning protects against bad writes; an out-of-band backup protects against losing the bucket.
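One way to get an out-of-band copy without running a scheduler is S3 replication. It is continuous rather than the scheduled copy described above, so treat the following as a sketch of a substitute technique and not the only approach. The backup bucket, its versioning resource, and the IAM role are hypothetical names assumed to be defined elsewhere. A destination in another account needs extra settings not shown here.

```hcl
# Assumed to exist elsewhere (hypothetical names):
#   aws_s3_bucket.tfstate_backup            - destination bucket
#   aws_s3_bucket_versioning.tfstate_backup - versioning enabled on the destination
#   aws_iam_role.tfstate_replication        - role S3 assumes to replicate objects

resource "aws_s3_bucket_replication_configuration" "tfstate" {
  # Replication requires versioning on the source bucket first.
  depends_on = [aws_s3_bucket_versioning.tfstate]

  bucket = aws_s3_bucket.tfstate.id
  role   = aws_iam_role.tfstate_replication.arn

  rule {
    id     = "tfstate-backup"
    status = "Enabled"

    # Empty filter: replicate every object in the state bucket.
    filter {}

    # Do not copy delete markers, so deleting a state object in the
    # primary bucket does not hide it in the backup.
    delete_marker_replication {
      status = "Disabled"
    }

    destination {
      bucket = aws_s3_bucket.tfstate_backup.arn
    }
  }
}
```

A corrupted write is replicated too, but it arrives in the backup bucket as a new version, so the earlier good versions remain restorable there.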

Recovery Procedure

Recovery is restore, then reconcile. Restore the last known-good object version, replacing the corrupted or missing state. Then run a plan — not an apply — to compare the restored state against current reality, because anything that changed in AWS between that version and now shows up here as drift the restored state does not know about. Reconcile that drift before applying anything.

restore a prior version, then reconcile

# list versions and note the VersionId of the last good one
aws s3api list-object-versions --bucket acme-tfstate \
  --prefix apps/checkout/terraform.tfstate

# copy that version over the same key so it becomes the current version
# (GOOD_ID is a placeholder; quote the source so the shell leaves the "?" alone)
aws s3api copy-object --bucket acme-tfstate \
  --key apps/checkout/terraform.tfstate \
  --copy-source "acme-tfstate/apps/checkout/terraform.tfstate?versionId=GOOD_ID"

# reconcile: plan against reality before applying anything
terraform plan

Skipping the reconcile step is the trap. Restoring an old version and immediately applying treats a stale snapshot as truth, so anything that changed since that version gets reverted or recreated. The plan after restore is what catches the gap; read it and reconcile before any apply.

Reconstruction as Last Resort

When there is no backup and no prior version — versioning was never on, or every copy is gone — the only path left is rebuilding state by importing each resource one at a time. This is slow, error-prone, and exactly the situation versioning exists to prevent. For an estate of any size it is days of careful work matching live resources to config addresses, with a destructive plan waiting on every resource you import imperfectly.

You never want to be here. The entire point of versioning, backups, and a rehearsed procedure is that reconstruction stays hypothetical. If you find yourself importing a production estate from scratch, the failure was not the lost state — it was never enabling the versioning that would have made the loss a rollback.
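For a sense of what that reconstruction looks like per resource, here is a minimal sketch using an import block, which is available in Terraform 1.5 and later. The resource address and bucket name are hypothetical; every real resource needs its own pairing of config address and provider-specific ID.

```hcl
# Hypothetical example: re-adopt one existing S3 bucket into state.
import {
  to = aws_s3_bucket.assets    # address in your configuration
  id = "acme-checkout-assets"  # ID of the live resource (the bucket name for S3)
}

# The resource block must describe the live bucket closely enough
# that the plan shows an import and no destructive change.
resource "aws_s3_bucket" "assets" {
  bucket = "acme-checkout-assets"
}
```

Run a plan after adding each block and read it before applying. A plan that proposes to replace or destroy the resource means the configuration does not yet match what is running.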

Common Mistakes

Running the state backend without S3 bucket versioning, so a corrupted write overwrites the only copy and there is no version to restore.
Deleting or overwriting state with no backup, leaving import-based reconstruction as the only path back.
Restoring an old state version but applying immediately without reconciling, so drift since that version is reverted or recreated.
Trusting the local timestamped backup as the whole plan, when it lives on one machine and covers only locally-run commands.
Never rehearsing the restore procedure, so it is improvised under pressure during a real incident instead of being a known drill.

Best Practices

Enable S3 bucket versioning on every state bucket and treat it as the primary recovery mechanism.
Add deliberate periodic backups to a separate bucket or account for critical states, so losing the primary bucket is still recoverable.
Document and rehearse the restore-and-reconcile procedure before you need it, so it is a drill rather than improvisation.
After restoring a version, run a plan and reconcile any drift since that version before applying anything.
Keep states isolated per environment so a recovery in one cannot touch another.

Comparable tools

CloudFormation: no direct equivalent. AWS manages state, so no user DR is needed.
Pulumi: state export, import, and backend versioning.
Ansible: no direct equivalent. It keeps no state to recover.

Knowledge Check

What is the consequence of losing the state file while the infrastructure still exists?

The resources keep running, but Terraform no longer knows what it manages and the next plan tries to recreate everything
Every managed resource is immediately destroyed in AWS the very moment the state object goes missing from the remote backend
Terraform automatically rebuilds the state by scanning the account on the very next plan
Nothing happens — state is a disposable cache that Terraform regenerates from scratch every run

What is the primary safety net for state disaster recovery?

S3 bucket versioning, so every write is a recoverable object version to restore from
The DynamoDB lock table, which retains a timestamped copy of every state write for recovery
The local .terraform directory, which caches a full state backup after each apply
Committing the state file to a git repository so its commit history is the backup

Why must you reconcile after restoring an old state version?

Anything that changed in AWS since that version is drift the restored state cannot see, so applying blindly reverts or recreates it
The restored version is still encrypted at rest in the S3 bucket and must be decrypted by a full plan run before Terraform can use it
Terraform refuses to apply a restored state file until you pass an explicit reconcile flag
Reconciling is the step that re-enables backend locking after a restore completes

When is import-based reconstruction your only option?

When there is no backup and no prior version — versioning was never enabled — so you rebuild state by importing each resource
Whenever a single resource shows drift after an apply and the plan proposes a small change
Every time you migrate state between two remote backends, since none of the existing entries carry over and each must be rebuilt
After any state mv command, to rebuild the entry that was moved to its new address
