Chapter 10: Collaboration and Automation
Topic 59

Plan/Apply Approval Workflows


The control that makes automated infrastructure safe is a human approving the plan before apply. A Terraform plan is a precise, reviewable artifact: it lists every resource being created, changed in place, destroyed, or destroyed and recreated, so reviewing it is much like reviewing a code diff. The approval gate is where a reviewer catches the -/+ replacement of a production database before it happens instead of explaining the outage afterward.

Designing the gate well is mostly about three things: what to actually read in the plan, how short to keep the window between approval and apply, and how much stricter production should be than a sandbox. A gate that rubber-stamps summary counts is worse than no gate, because it manufactures false confidence.

The approval gate: plan artifact → human reviews destroy/replace lines → approve → apply

The Plan as a Review Artifact

Read a plan the way you read a code diff: line by line, looking for the dangerous changes. The symbols carry the meaning — + creates, ~ updates in place, - destroys, and -/+ destroys and recreates. The -/+ lines and the - lines are where outages live. A plan that says "Plan: 4 to add, 1 to change, 2 to destroy" is hiding the two things you most need to look at behind a one-line summary.

terraform plan — the line that needs a second look
# aws_db_instance.main must be replaced
-/+ resource "aws_db_instance" "main" {
    ~ instance_class = "db.t3.medium" -> "db.m5.large"
    ~ character_set_name = "AL32UTF8" -> "WE8ISO8859P1" # forces replacement
      # (12 unchanged attributes hidden)
  }

Plan: 1 to add, 0 to change, 1 to destroy.

The summary reads "1 to add, 1 to destroy" — innocuous until you notice the added and destroyed resource is the same database, replaced because character_set_name is a force-new argument: RDS cannot change a database's character set in place, so Terraform must destroy the instance and its data and recreate it. (An engine_version bump, by contrast, is an in-place upgrade.) Approve on the summary alone and you have signed off on destroying production data. The forces replacement annotation is the warning, and it only appears in the body of the plan.
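The same reading can be partly mechanized. The sketch below assumes the plan has been exported in Terraform's JSON form (for example with `terraform plan -out=tfplan` followed by `terraform show -json tfplan`) and lists every resource whose actions include a delete, using the plan's own symbols. The function name and the sample data are illustrative, not real tool output, and a listing like this supplements reading the plan body rather than replacing it.

```python
# Map the "actions" list in Terraform's JSON plan to the symbols used in the
# human-readable plan.
SYMBOLS = {
    ("create",): "+",
    ("update",): "~",
    ("delete",): "-",
    ("delete", "create"): "-/+",  # destroy, then recreate
    ("create", "delete"): "+/-",  # create_before_destroy replacement
}


def destructive_changes(plan):
    """Return (symbol, address) for every destroy or replace in the plan."""
    found = []
    for rc in plan.get("resource_changes", []):
        actions = tuple(rc["change"]["actions"])
        if "delete" in actions:
            found.append((SYMBOLS.get(actions, "?"), rc["address"]))
    return found


# Hand-written sample shaped like the plan above (hypothetical data).
sample_plan = {
    "resource_changes": [
        {
            "address": "aws_db_instance.main",
            "type": "aws_db_instance",
            "change": {"actions": ["delete", "create"]},
        },
        {
            "address": "aws_s3_bucket.logs",  # hypothetical, unchanged resource
            "type": "aws_s3_bucket",
            "change": {"actions": ["no-op"]},
        },
    ]
}

for symbol, address in destructive_changes(sample_plan):
    print(f"{symbol} {address}")
```

Against a real export, the dictionary would come from `json.load` on the saved JSON file. The point of the listing is that the "1 to add" and "1 to destroy" in the summary resolve to a single address.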

Approval Gates

The gate is built from required reviewers on the pull request and an environment protection rule that holds the apply until a named approver signs off. On GitHub, a protected environment with required reviewers pauses the apply job until someone approves it in the UI; GitLab and others have equivalents. Production gets a manual apply approval on top of code review — two distinct sign-offs, because the code being correct and the plan being safe to apply right now are different judgments.

What to Scrutinize

Spend the review budget where the damage is. Replacements (-/+) of stateful resources — databases, volumes, anything holding data — are the highest-stakes lines. Large destroy counts deserve a pause to confirm they are intended, not the result of a botched refactor that dropped a moved block. IAM and security-group changes are the ones that quietly widen access or lock you out. A plan that only adds resources is low-risk; a plan that destroys, replaces, or rewrites permissions is where a reviewer earns their keep.
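One way to make that prioritization explicit is a small triage function over a resource's type and planned actions. This is a minimal sketch: the prefix lists are illustrative assumptions and far from exhaustive, the labels are invented for this example, and the action names follow Terraform's JSON plan format ("create", "update", "delete", "no-op").

```python
# Illustrative, incomplete prefix lists (assumptions for this sketch).
STATEFUL_PREFIXES = (
    "aws_db_",
    "aws_rds_",
    "aws_ebs_volume",
    "aws_s3_bucket",
    "aws_dynamodb_table",
)
ACCESS_PREFIXES = ("aws_iam_", "aws_security_group")


def review_priority(resource_type, actions):
    """Rank one planned change by how much reviewer attention it deserves."""
    destroys = "delete" in actions  # true for both "-" and "-/+"
    if destroys and resource_type.startswith(STATEFUL_PREFIXES):
        return "high: stateful resource destroyed or replaced"
    if resource_type.startswith(ACCESS_PREFIXES) and actions != ["no-op"]:
        return "high: access change"
    if destroys:
        return "medium: destroy, confirm intent"
    if actions == ["create"]:
        return "low: create only"
    return "normal"


print(review_priority("aws_db_instance", ["delete", "create"]))
print(review_priority("aws_security_group_rule", ["update"]))
print(review_priority("aws_instance", ["create"]))
```

A function like this only orders the reading; the high-priority lines still need a human to decide whether the change is intended.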

Drift and Stale Plans

An approved plan describes the world as it was when the plan ran. Let it sit for hours and reality drifts (someone else applies, a console change lands, an autoscaler moves a count) and the saved plan now describes a world that no longer exists. If the Terraform state itself has changed since the plan was created, Terraform rejects the saved plan as stale. Changes made outside Terraform are the worse case: the state has not recorded them, so the apply can proceed and change something the plan no longer accurately represents. Keep the plan-to-apply window short, and re-plan if the approval has gone stale rather than applying an old artifact against new reality.
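A pipeline can enforce the window with a simple age check before the apply step. The sketch below assumes the pipeline records when the plan was created. The 30-minute limit and the function name are hypothetical choices for illustration, not a Terraform feature.

```python
from datetime import datetime, timedelta, timezone

# Assumed policy for this sketch: plans older than 30 minutes must be re-run.
MAX_PLAN_AGE = timedelta(minutes=30)


def needs_replan(planned_at, now, max_age=MAX_PLAN_AGE):
    """Return True when the approved plan is too old to apply as-is."""
    return now - planned_at > max_age


planned_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# Approved and applied ten minutes after the plan ran: still fresh.
print(needs_replan(planned_at, planned_at + timedelta(minutes=10)))

# Approval arrived three hours later: re-plan instead of applying.
print(needs_replan(planned_at, planned_at + timedelta(hours=3)))
```

When the check fails, the safer path is to run a fresh plan and have the reviewer confirm it matches what was approved, rather than extending the limit.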

Segregating Production

Production should not run on the same loose process as a sandbox. Give it stricter approval — more required reviewers, a named on-call approver, a manual gate — and separate credentials, so the role CI assumes for prod is distinct from the one it uses for dev and can be revoked independently. A sandbox apply going wrong costs an afternoon; a production apply going wrong costs an incident, and the gate should reflect that asymmetry.
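The asymmetry can be written down as data so that it is reviewed like any other change. The following is a sketch of one possible shape: the environment names, approval counts, and role names are all hypothetical, and in practice the enforcement would live in the CI platform's environment protection settings rather than in a script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatePolicy:
    required_approvals: int  # distinct reviewers needed before apply
    manual_apply_gate: bool  # extra sign-off on the apply itself
    ci_role: str             # hypothetical role name, one per environment


# Hypothetical policies: production is stricter and uses its own role.
POLICIES = {
    "sandbox": GatePolicy(1, False, "ci-sandbox-apply"),
    "production": GatePolicy(2, True, "ci-production-apply"),
}


def may_apply(env, approvers, manual_gate_passed):
    """Check whether an apply is allowed under the environment's policy."""
    policy = POLICIES[env]
    if len(set(approvers)) < policy.required_approvals:
        return False
    return manual_gate_passed or not policy.manual_apply_gate


print(may_apply("sandbox", ["alice"], manual_gate_passed=False))
print(may_apply("production", ["alice"], manual_gate_passed=True))
print(may_apply("production", ["alice", "bob"], manual_gate_passed=True))
```

Keeping a separate role per environment means the production role can be revoked without touching lower environments.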

Common Mistakes
  • Rubber-stamping plans without reading the destroy and replace lines, approving a change that recreates a production database.
  • Letting an approved plan sit for hours while reality drifts, then applying a now-inaccurate plan against a changed world.
  • Applying the same loose approval process to production as to a sandbox, with no extra reviewers or separate credentials.
  • Reviewing only the summary counts (2 to destroy) instead of the specific resources changing and why they change.
  • Missing a forces replacement annotation buried among unchanged attributes, signing off on a destroy you did not see.

Best Practices
  • Require human approval on production applies and read the full plan body, not just the summary counts.
  • Scrutinize every -/+ replacement of a stateful resource and any IAM or security-group change specifically.
  • Keep the plan-to-apply window short and re-plan if the approval has gone stale before applying.
  • Apply stricter gates and separate, independently revocable credentials to production than to lower environments.
  • Treat large destroy counts as a stop signal to confirm intent before approving, since a missing moved block looks just like an intentional teardown.

Comparable tools
  • Atlantis: formalizes the plan-comment and apply-approval gate
  • HCP Terraform: built-in run approval UI
  • CloudFormation: change sets serve as the review artifact

Knowledge Check

Why is reading only the plan's summary counts a dangerous review habit?

  • A summary like "1 to add, 1 to destroy" can hide a stateful resource being replaced, with the forces replacement reason only in the body
  • The summary counts are randomly ordered and therefore cannot be trusted as an accurate tally
  • Summary counts silently exclude any resources that are only being created and not modified
  • The summary is only computed after the apply finishes running, so it reflects nothing at all about the plan that the reviewer is actually looking at

Which plan line deserves the most scrutiny in a production review?

  • A -/+ replacement of a stateful resource like a database or volume
  • A + create of a brand-new, empty resource that has no dependents yet
  • A ~ in-place update of a resource's cost-allocation tag value
  • A no-op line where the resource is left entirely unchanged

What is the risk of letting an approved plan sit for hours before applying it?

  • Reality can drift in the meantime, so the saved plan no longer accurately describes the world it will apply against
  • The saved plan file expires after exactly one hour, after which Terraform flatly refuses to apply it
  • The approval is automatically revoked by Terraform itself, requiring a completely fresh review from scratch every single time
  • Nothing at all — a saved plan is fully frozen, so the delay has no effect whatsoever on its correctness

How should the approval process for production differ from a sandbox?

  • Stricter gates — more reviewers, a manual apply approval — and separate, independently revocable credentials
  • Production should skip review entirely so that urgent hotfixes can be applied faster during incidents
  • Production should reuse the exact same credentials as dev so that key rotation stays simpler to manage
  • There should be no difference at all — a single shared process for every environment is by far the safest option
