Drift Detection and Reconciliation
Drift is the gap between what Terraform's state says exists and what is actually in AWS. It opens when someone changes a resource in the console, an autoscaler adjusts a desired count, or another tool edits a tag — anything that touches a Terraform-managed resource outside Terraform. Detecting drift early and reconciling it deliberately is core to operating Terraform-managed infrastructure; ignore it and the next unrelated apply produces a surprising change nobody planned.
The trap is that the same apply that "fixes" drift can also undo a legitimate change. Reconciliation is a decision, not a reflex: sometimes the right move is to revert reality back to the code, and sometimes it is to accept reality into state. Getting that call wrong either re-breaks something someone fixed or bakes a console mistake into your configuration.
What Causes Drift
Most drift comes from people: a console change made during an incident, a quick fix applied by hand and never codified. The rest comes from systems — an autoscaler that owns the desired_capacity of an ASG and moves it independently, another tool that writes tags, or AWS itself applying a default to an attribute after creation. All of it is reality diverging from what Terraform last recorded, and Terraform has no idea until it next refreshes.
Detecting Drift
You detect drift by refreshing state against reality and seeing whether they disagree. The clean way to do this without touching anything is a plan -refresh-only run on a schedule in CI — nightly, say — that alerts when the refresh shows a difference. A plain scheduled plan works too, but -refresh-only isolates "reality changed" from "the config changed," which is exactly the signal you want for drift.
# scheduled nightly in CI
terraform plan -refresh-only -detailed-exitcode
# exit 0 = no drift, 2 = drift detected, 1 = error
# exit 2 (the "changes" code) fires the alert
-refresh-only updates Terraform's view of reality and reports what differs from state, changing nothing. With -detailed-exitcode, an exit code of 2 means a difference was found — the pipeline turns that into an alert. This is the difference between learning about drift on your schedule and learning about it mid-incident when an unrelated apply suddenly wants to revert a console hotfix.
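As a rough sketch of how a CI job might turn that exit code into an alert, the wrapper below runs the refresh-only plan and branches on the result. The function names check_drift and notify_drift are hypothetical, and notify_drift is only a placeholder for whatever alerting the pipeline already uses. It assumes terraform is on PATH and the working directory has already been initialised.

```shell
#!/usr/bin/env bash
# Sketch of a nightly drift check. Assumes terraform is on PATH and
# `terraform init` has already been run in the working directory.

# Hypothetical alert hook: replace the body with a real notification.
notify_drift() {
  echo "DRIFT: $1" >&2
}

# Runs a refresh-only plan and maps Terraform's detailed exit code to an outcome.
# Returns 0 when state matches reality, 2 on drift, 1 on any error.
check_drift() {
  local rc=0
  terraform plan -refresh-only -detailed-exitcode -input=false -no-color || rc=$?
  case "$rc" in
    0) echo "no drift" ;;
    2) notify_drift "refresh-only plan found differences" ;;
    *) echo "terraform plan failed (exit $rc)" >&2; rc=1 ;;
  esac
  return "$rc"
}
```

A scheduled job could call check_drift and treat a return of 2 as "alert, but do not apply", keeping the decision about how to reconcile with a person.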
Reconciliation Options
Once you have found drift, there are three ways to reconcile it. Revert: run a normal apply so reality goes back to the declared config. Accept: record reality in state with a -refresh-only apply, or, if the attribute will keep changing, stop Terraform managing it with ignore_changes. Import: if the resource was created out of band entirely, import it so Terraform starts managing it. Which one is right depends on whether the out-of-band change should have happened.
The refresh-only Plan
A -refresh-only apply is the safe reconciliation primitive: it updates state to match reality without changing any infrastructure. It is how you accept a legitimate external change: the autoscaler moved the count, and you want state to record the real value. It touches nothing in AWS; it only rewrites Terraform's bookkeeping to agree with what is already there. If the configuration still declares a different value for that attribute, the next normal apply will still plan to change it back, so pair the refresh with a config update or a scoped ignore_changes.
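One way to make that acceptance reviewable is to save the refresh-only plan, inspect it, and apply exactly that file. The sketch below assumes an initialised working directory; the function names plan_refresh and apply_refresh and the file name refresh.tfplan are illustrative, not part of Terraform.

```shell
#!/usr/bin/env bash
# Sketch: accept external changes into state through a saved refresh-only plan.
# Function names and the default plan file name are hypothetical.

# Step 1: record what the refresh found and print it for review.
plan_refresh() {
  local plan_file="${1:-refresh.tfplan}"
  terraform plan -refresh-only -input=false -out="$plan_file" || return 1
  terraform show "$plan_file"
}

# Step 2: after review, apply the saved plan. Because it was created with
# -refresh-only, applying it updates state only; no resources are changed.
apply_refresh() {
  local plan_file="${1:-refresh.tfplan}"
  terraform apply -input=false "$plan_file"
}
```

Running the two steps separately, for example as two pipeline stages with a manual approval between them, keeps a person in the loop before state is rewritten.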
Preventing Drift
The cheapest drift to handle is the drift that never happens. Reduce console write access on Terraform-managed resources so the path of least resistance is a code change, not a click. When you find people repeatedly changing the same thing by hand, that is a signal to codify it — make the console the harder option. Drift is rarely malice; it is usually convenience, and you fix it by making the convenient path go through Terraform.
Revert (a normal apply) — puts reality back to the declared config, undoing the out-of-band change. Right when the console change was unintended or unauthorized — a manual tweak that should never have happened and the code is the truth.
Accept (refresh-only apply or ignore_changes) — updates state to record reality, or tells Terraform to stop managing that attribute. Right when the external change is legitimate — an autoscaler's count, a tag another system owns — and forcing it back would re-break something working.
- Discovering drift only when an unrelated apply suddenly wants to revert a console hotfix someone made during an incident.
- Blindly applying to "fix" drift and reverting a legitimate out-of-band change — an autoscaler's count — that should have been accepted instead.
- Never running drift detection at all, so state and reality silently diverge for months until a surprise apply surfaces it.
- Leaving broad console write access on Terraform-managed resources, inviting constant drift from quick manual fixes.
- Reaching for ignore_changes = all to silence drift noise, then missing a real divergence the config should have caught.
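The last pitfall is easy to check for mechanically. The following is a minimal sketch of a lint step that searches Terraform files for a blanket ignore_changes = all. The function name find_blanket_ignores is hypothetical, the check is a plain text match rather than an HCL parse (so it will also flag commented-out lines), and it assumes a grep that supports --include, as GNU and BSD grep do.

```shell
#!/usr/bin/env bash
# Sketch: flag blanket `ignore_changes = all` in *.tf files under a directory.
# Plain text matching only; it does not understand HCL structure.

find_blanket_ignores() {
  local dir="${1:-.}"
  # -r recurse, -n print line numbers, -E extended regex.
  # The trailing group stops the pattern matching longer words that start with "all".
  grep -rnE --include='*.tf' \
    'ignore_changes[[:space:]]*=[[:space:]]*all([^[:alnum:]_]|$)' "$dir"
}
```

Like grep itself, the function returns 0 when it finds a match, so a CI step would fail the build when the call succeeds, for example with `if find_blanket_ignores .; then exit 1; fi`.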
- Run scheduled drift detection (a plan -refresh-only -detailed-exitcode in CI) and alert on divergence rather than waiting for a surprise apply.
- Decide per case whether to revert with a normal apply or accept with a -refresh-only apply or ignore_changes, based on whether the change was legitimate.
- Use a -refresh-only apply to record a legitimate external change into state without touching infrastructure.
- Restrict console write access on Terraform-managed resources so a code change is the path of least resistance.
- Codify the things people keep changing by hand, scoping any ignore_changes to the specific attribute rather than using all.
pulumi refresh: reconciles state with reality
driftctl: a dedicated drift-detection tool (now archived)
Knowledge Check
What does a terraform plan -refresh-only run do?
- Updates Terraform's view of reality and reports what differs from state, without changing any infrastructure
- Reverts every out-of-band change made in the AWS console back to the declared configuration in a single pass
- Rewrites your .tf configuration files so that they match the current live state of the infrastructure
- Destroys and recreates any resource whose live attributes have drifted from state
An autoscaler legitimately changed an ASG's desired count outside Terraform. How should you reconcile that drift?
- Accept it, with a -refresh-only apply or ignore_changes on the count, so Terraform stops dragging it back
- Revert it with a normal apply, forcing the desired count back down to the original value committed in the code
- Delete the ASG from state with terraform state rm so Terraform forgets it exists
- Destroy and recreate the whole ASG to reset its count back to the declared value
Why run drift detection on a schedule instead of waiting for the next apply?
- So you learn about drift on your own schedule, not mid-incident when an unrelated apply suddenly wants to revert a console hotfix
- Because Terraform is structurally unable to detect any drift at all during a normal apply, so a separate scheduled job is the only way
- Because scheduled runs apply the reconciling changes automatically, which removes the need for any human review
- Because drift is fundamentally invisible to plan and only a dedicated scheduled job can ever see it
When is reverting drift with a normal apply the right reconciliation choice?
- When the out-of-band change was unintended or unauthorized and the code is the source of truth
- Whenever any drift at all is detected, regardless of whether the underlying change was legitimate or not
- When an autoscaler has legitimately adjusted a Terraform-managed attribute
- Only when the resource itself was created entirely outside of Terraform