Topic 61

Drift Detection and Reconciliation

Operations · State

Drift is the gap between what Terraform's state says exists and what is actually in AWS. It opens when someone changes a resource in the console, an autoscaler adjusts a desired count, or another tool edits a tag — anything that touches a Terraform-managed resource outside Terraform. Detecting drift early and reconciling it deliberately is core to operating Terraform-managed infrastructure; ignore it and the next unrelated apply produces a surprising change nobody planned.

The trap is that the same apply that "fixes" drift can also undo a legitimate change. Reconciliation is a decision, not a reflex: sometimes the right move is to revert reality back to the code, and sometimes it is to accept reality into state. Getting that call wrong either re-breaks something someone fixed or bakes a console mistake into your configuration.

Detect, then reconcile

scheduled plan -refresh-only → drift detected → reconcile (revert or accept)

What Causes Drift

Most drift comes from people: a console change made during an incident, a quick fix applied by hand and never codified. The rest comes from systems — an autoscaler that owns the desired_capacity of an ASG and moves it independently, another tool that writes tags, or AWS itself applying a default to an attribute after creation. All of it is reality diverging from what Terraform last recorded, and Terraform has no idea until it next refreshes.

Detecting Drift

You detect drift by refreshing state against reality and seeing whether they disagree. The clean way to do this without touching anything is a plan -refresh-only run on a schedule in CI — nightly, say — that alerts when the refresh shows a difference. A plain scheduled plan works too, but -refresh-only isolates "reality changed" from "the config changed," which is exactly the signal you want for drift.

drift check — refresh-only plan, alert on any difference

# scheduled nightly in CI
terraform plan -refresh-only -detailed-exitcode

# exit 0 = no drift, 1 = error, 2 = drift detected
# exit 2 fires the drift alert; exit 1 means the check itself failed

-refresh-only updates Terraform's view of reality and reports what differs from state, changing nothing. With -detailed-exitcode, an exit code of 2 means a difference was found — the pipeline turns that into an alert. This is the difference between learning about drift on your schedule and learning about it mid-incident when an unrelated apply suddenly wants to revert a console hotfix.
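The exit-code handling can be sketched as a small wrapper for the CI job. This is a minimal sketch, not a definitive pipeline step: the function names classify_plan_exit and check_drift are hypothetical, and it assumes bash with terraform on the PATH and an already-initialized working directory.

```shell
#!/usr/bin/env bash

# Map a `terraform plan -detailed-exitcode` exit status to a verdict.
# 0 = no differences, 2 = differences found, anything else = error.
# (classify_plan_exit is a hypothetical helper name for this sketch.)
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Run the refresh-only plan quietly, print the verdict, and pass the
# original exit status through so the CI job can branch on it.
# (check_drift is a hypothetical helper name for this sketch.)
check_drift() {
  local rc=0
  terraform plan -refresh-only -detailed-exitcode -input=false -no-color >/dev/null || rc=$?
  classify_plan_exit "$rc"
  return "$rc"
}
```

A nightly job could call check_drift and send an alert when the printed verdict is drift, while treating error as a failed check rather than as drift. Keeping the two apart avoids paging someone about drift when the real problem is expired credentials or a locked state.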

Reconciliation Options

Once you have found drift, there are three ways to reconcile. Let Terraform revert the change with a normal apply, putting reality back to the declared config. Accept reality into state with a -refresh-only apply, or stop fighting the attribute with ignore_changes if it will keep changing. Or, if the resource was created out of band entirely, import it so Terraform starts managing it. Which one is right depends on whether the out-of-band change should have happened.
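As a rough illustration, the three options map onto three commands. The helper below is only a sketch: the decision labels revert, accept, and adopt and the function name reconcile_command are hypothetical, and the import line leaves the resource address and ID as placeholders to fill in for the resource in question.

```shell
#!/usr/bin/env bash

# Print the Terraform command that matches a reconciliation decision.
# The labels (revert, accept, adopt) are hypothetical names for this sketch.
reconcile_command() {
  case "$1" in
    revert) echo "terraform apply" ;;                  # put reality back to the config
    accept) echo "terraform apply -refresh-only" ;;    # record reality in state
    adopt)  echo "terraform import <address> <id>" ;;  # start managing an unmanaged resource
    *)      echo "unknown decision: $1" >&2; return 1 ;;
  esac
}
```

The helper only prints the command. The decision about which label applies still belongs to whoever knows whether the out-of-band change should have happened.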

The refresh-only Apply

A -refresh-only apply is the safe reconciliation primitive: it updates state to match reality without changing any infrastructure. It is how you accept a legitimate external change — the autoscaler moved the count, and you want state to record the real value rather than have the next apply drag it back. It touches nothing in AWS; it only rewrites Terraform's bookkeeping to agree with what is already there.
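One way to accept drift with a review step in between is to save the refresh-only plan, inspect it, and then apply exactly that saved plan. The sketch below assumes bash and an initialized working directory. The function names preview_accept and apply_accept and the default file name refresh.tfplan are hypothetical.

```shell
#!/usr/bin/env bash

# Save a refresh-only plan and print it for review.
# (preview_accept and the default file name are hypothetical.)
preview_accept() {
  local plan_file="${1:-refresh.tfplan}"
  terraform plan -refresh-only -input=false -out="$plan_file" || return 1
  terraform show -no-color "$plan_file"
}

# Apply the saved refresh-only plan. This updates state only and
# changes no infrastructure.
apply_accept() {
  local plan_file="${1:-refresh.tfplan}"
  terraform apply -input=false "$plan_file"
}
```

Running preview_accept first, reading the output, and only then running apply_accept keeps a human in the loop. Applying the saved file also means state records the same differences that were reviewed, not whatever reality looks like a few minutes later.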

Preventing Drift

The cheapest drift to handle is the drift that never happens. Reduce console write access on Terraform-managed resources so the path of least resistance is a code change, not a click. When you find people repeatedly changing the same thing by hand, that is a signal to codify it — make the console the harder option. Drift is rarely malice; it is usually convenience, and you fix it by making the convenient path go through Terraform.

Revert vs Accept Drift

Revert (a normal apply): puts reality back to the declared config, undoing the out-of-band change. Right when the console change was unintended or unauthorized, such as a manual tweak that should never have happened, and the code remains the source of truth.

Accept (refresh-only apply or ignore_changes) — updates state to record reality, or tells Terraform to stop managing that attribute. Right when the external change is legitimate — an autoscaler's count, a tag another system owns — and forcing it back would re-break something working.

Common Mistakes

Discovering drift only when an unrelated apply suddenly wants to revert a console hotfix someone made during an incident.
Blindly applying to "fix" drift and reverting a legitimate out-of-band change — an autoscaler's count — that should have been accepted instead.
Never running drift detection at all, so state and reality silently diverge for months until a surprise apply surfaces it.
Leaving broad console write access on Terraform-managed resources, inviting constant drift from quick manual fixes.
Reaching for ignore_changes = all to silence drift noise, then missing a real divergence the config should have caught.

Best Practices

Run scheduled drift detection — a plan -refresh-only -detailed-exitcode in CI — and alert on divergence rather than waiting for a surprise apply.
Decide per case whether to revert with a normal apply or accept with a -refresh-only apply or ignore_changes, based on whether the change was legitimate.
Use a -refresh-only apply to record a legitimate external change into state without touching infrastructure.
Restrict console write access on Terraform-managed resources so a code change is the path of least resistance.
Codify the things people keep changing by hand, scoping any ignore_changes to the specific attribute rather than using all.

Comparable tools

CloudFormation: native drift detection on stacks
Pulumi: pulumi refresh reconciles state with reality
driftctl: a dedicated drift-detection tool (now archived)

Knowledge Check

What does a terraform plan -refresh-only run do?

Updates Terraform's view of reality and reports what differs from state, without changing any infrastructure
Reverts every out-of-band change made in the AWS console back to the declared configuration in a single pass
Rewrites your .tf configuration files so that they match the current live state of the infrastructure
Destroys and recreates any resource whose live attributes have drifted from state

An autoscaler legitimately changed an ASG's desired count outside Terraform. How should you reconcile that drift?

Accept it — a -refresh-only apply or ignore_changes on the count — so Terraform stops dragging it back
Revert it with a normal apply, forcing the desired count back down to the original value committed in the code
Delete the ASG from state with terraform state rm so Terraform forgets it exists
Destroy and recreate the whole ASG to reset its count back to the declared value

Why run drift detection on a schedule instead of waiting for the next apply?

So you learn about drift on your own schedule, not mid-incident when an unrelated apply suddenly wants to revert a console hotfix
Because Terraform is structurally unable to detect any drift at all during a normal apply, so a separate scheduled job is the only way
Because scheduled runs apply the reconciling changes automatically, which removes the need for any human review
Because drift is fundamentally invisible to plan and only a dedicated scheduled job can ever see it

When is reverting drift with a normal apply the right reconciliation choice?

When the out-of-band change was unintended or unauthorized and the code is the source of truth
Whenever any drift at all is detected, regardless of whether the underlying change was legitimate or not
When an autoscaler has legitimately adjusted a Terraform-managed attribute
Only when the resource itself was created entirely outside of Terraform
