Chapter 12: Production Operations
Topic 68

Large State and Performance

A state file that grows to thousands of resources makes every plan slow and every apply a high-blast-radius event. Terraform refreshes each resource against the provider API on every run, so a state holding 2,000 objects means at least 2,000 read calls before it can even show you a diff. A plan that took seconds at fifty resources takes minutes at two thousand. And because plan and apply operate on the whole state, every run puts the entire estate within reach of a single mistake.

Managing large state is about splitting it along sensible boundaries, controlling refresh and parallelism deliberately, and treating targeting as surgery rather than a habit. The fix for a slow plan is almost never a flag — it is a smaller state.

Why Large State Is Slow

Plan time is dominated by refresh, and refresh is per-resource. Before computing a diff, Terraform asks the provider for the current state of every managed object so the plan reflects reality rather than a stale snapshot. That is one or more API calls per resource, walked across the dependency graph with a default concurrency of ten. The work scales linearly with resource count, so the cost is not the size of the state file on disk; it is the number of objects Terraform must read back from AWS on every single run.
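The linear scaling can be put into a rough back-of-envelope model. This is an idealized lower bound, not a measurement: N is the resource count, P the concurrency (ten by default), and c (average read calls per resource) and t (average API latency) are illustrative assumptions chosen only to show the shape of the curve.

```latex
T_{\text{refresh}} \approx \frac{N \cdot c \cdot t}{P}

% Assumed values: c = 2 calls per resource, t = 0.3 s per call, P = 10
N = 50:   \quad \frac{50 \times 2 \times 0.3\,\text{s}}{10} = 3\,\text{s}
N = 2000: \quad \frac{2000 \times 2 \times 0.3\,\text{s}}{10} = 120\,\text{s}
```

Real runs are usually slower than this estimate, because dependency ordering keeps some of the ten slots idle and throttling adds retries. The point is only that N is the term that grows.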

This is why a sprawling monolithic state degrades steadily rather than suddenly. Nothing breaks; plans just creep from ten seconds to ten minutes as the estate grows, until a routine change costs a coffee break and people start avoiding Terraform to dodge the wait.

Splitting State

The real fix is decomposition: break one large state into several smaller ones aligned to blast-radius and stability boundaries. A common split is a foundational layer (VPC, subnets, IAM, shared DNS) that changes rarely, separated from per-application layers (an app's compute, queues, and databases) that change daily. Each layer gets its own state and backend key, so a plan refreshes only its own slice instead of the whole estate.

The win is two-sided. Plans get fast again because each state holds tens or low-hundreds of resources, not thousands. And applies get contained — a botched change to one app's state cannot touch the shared network or another team's resources, because they live in separate states entirely. Cross-layer values flow through published outputs or SSM parameters rather than through one giant shared state.
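Moving a resource out of the monolith without destroying it can be sketched with configuration blocks alone. This is a minimal sketch, assuming Terraform 1.7 or later (for removed blocks; import blocks need 1.5 or later). The file paths, the resource name checkout_assets, and the bucket name acme-checkout-assets are hypothetical and stand in for whatever is being moved.

```hcl
# monolith/main.tf (hypothetical path)
# Delete the original resource block and add this instead, so the
# monolith forgets the bucket without destroying it.
removed {
  from = aws_s3_bucket.checkout_assets

  lifecycle {
    destroy = false
  }
}

# app-checkout/main.tf (hypothetical path)
# Adopt the same, already-existing bucket into the checkout layer's state.
import {
  to = aws_s3_bucket.checkout_assets
  id = "acme-checkout-assets" # S3 buckets are imported by bucket name
}

resource "aws_s3_bucket" "checkout_assets" {
  bucket = "acme-checkout-assets"
}
```

Plan both layers before applying either: the monolith plan should report the bucket as forgotten rather than destroyed, and the checkout plan should show an import with no create. Once both applies succeed, the removed and import blocks can be deleted.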

One huge state vs split state
  • One huge state: thousands of resources are refreshed on every run, so plans creep from seconds to minutes, and one mistake reaches the entire estate.
  • Split state: each layer holds tens or low-hundreds of resources, so plans stay fast and a botched apply is contained to its own slice.
backend.tf — one state key per layer
# foundation/backend.tf
terraform {
  backend "s3" {
    bucket = "acme-tfstate"
    key    = "foundation/terraform.tfstate"
    region = "us-east-1"
  }
}

# app-checkout/backend.tf — separate state, separate blast radius
terraform {
  backend "s3" {
    bucket = "acme-tfstate"
    key    = "apps/checkout/terraform.tfstate"
    region = "us-east-1"
  }
}

Same bucket, distinct keys: two independent states. The foundation layer exposes its VPC and subnet IDs as outputs and writes them to SSM Parameter Store; the checkout app reads them with a parameter lookup instead of managing them, so the two layers never share a refresh or an apply.
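The hand-off between the two layers might look like the following. It is a sketch under stated assumptions: the file names, the parameter path /acme/foundation/vpc_id, the CIDR block, and the security group are all hypothetical. Only the aws_ssm_parameter resource and data source themselves are standard AWS provider features.

```hcl
# foundation/network.tf (hypothetical): own the VPC and publish its ID
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16" # illustrative range
}

output "vpc_id" {
  value = aws_vpc.main.id
}

resource "aws_ssm_parameter" "vpc_id" {
  name  = "/acme/foundation/vpc_id" # hypothetical parameter path
  type  = "String"
  value = aws_vpc.main.id
}

# app-checkout/network.tf (hypothetical): read the ID, never manage the VPC
data "aws_ssm_parameter" "vpc_id" {
  name = "/acme/foundation/vpc_id"
}

resource "aws_security_group" "checkout" {
  name   = "checkout"
  vpc_id = data.aws_ssm_parameter.vpc_id.value
}
```

The checkout layer's refresh costs one extra parameter read instead of a walk over every foundation resource. The provider treats the parameter's value as sensitive, so expect plan output to mask it.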

Controlling Refresh

Two flags scope a run when you genuinely need to. -target=ADDRESS restricts the operation to one resource and its dependencies, and -refresh=false skips the refresh phase entirely and plans against stored state. Both make a run faster by doing less work, and both are surgical tools, not routine settings.

scoping a run — surgical, not routine
# refresh and plan only this resource and its dependencies
terraform plan -target=aws_instance.api

# skip refresh entirely — plan against stored state, no API reads
terraform plan -refresh=false

The trap with -refresh=false is that it plans against a possibly stale view of reality: anything that drifted in AWS since the last run is invisible, so the plan can be wrong about what it will change. Reach for it only when you know nothing has drifted and you are deliberately trading accuracy for speed during an incident.

Parallelism

Terraform walks the graph with ten concurrent operations by default, and -parallelism=N changes that number. Raising it to push past a slow plan is tempting and usually backfires: more concurrent API calls mean hitting AWS request-rate limits sooner, and a throttled provider retries with backoff, making the run slower and flakier than it was at ten. The default is a reasonable balance for most accounts, and lowering it is the more common useful adjustment: drop to two or three when an account is already throttling.

If a plan is slow, parallelism is not the lever. A higher number cannot fix the fact that the state holds two thousand resources; it just changes how aggressively Terraform hits the same wall. Slow plans are a state-size problem.

Targeting Responsibly

-target exists for incident surgery: applying one fix without waiting on a full refresh of an estate that is currently on fire. Used that way it is a precise instrument. Used routinely it is a liability — every targeted run skips the full plan, so unrelated drift accumulates unseen and the next untargeted apply surfaces a pile of surprise changes nobody reviewed.

Treat a recurring urge to -target as a signal, not a workflow. If the only way to get a tolerable plan is to scope it down every time, the state is too large and the answer is to split it, not to keep narrowing the view.

Common Mistakes
  • Letting one state grow to thousands of resources, so every plan takes minutes from per-resource refresh and every apply risks the whole estate at once.
  • Using -target routinely to dodge slow plans, hiding the full diff and letting unrelated drift pile up until an untargeted apply surfaces it all.
  • Cranking -parallelism high to speed up a slow plan and triggering AWS API throttling, so retries with backoff make the run slower and flakier.
  • Setting -refresh=false as a habit and planning against a stale view, so drift in AWS is invisible and the plan is wrong about what it will change.
  • Treating a slow plan as a flag-tuning problem instead of a state-size problem, and never splitting the state that is the actual cause.
Best Practices
  • Split large state along stability and blast-radius boundaries — a rarely-changing foundation layer separate from per-app layers — so plans stay fast and applies stay contained.
  • Reserve -target and -refresh=false for surgical, one-off incident work, never as routine speedups.
  • Leave -parallelism at its default of ten unless you have measured a reason, and lower it rather than raise it when an account is throttling.
  • Pass cross-layer values through published outputs or SSM Parameter Store instead of one shared monolithic state.
  • Treat a slow plan as the signal to split state, not as a reason to disable refresh.
Comparable tools
  • CloudFormation: splits scale via nested and separate stacks
  • Pulumi: splits via stacks and micro-stacks
  • Ansible: no direct equivalent, because it keeps no state to refresh

Knowledge Check

Why does a state file with thousands of resources make every plan slow?

  • Terraform refreshes each resource against the API on every run, so refresh time scales with resource count
  • The state file becomes too large to download from the S3 backend quickly over the network
  • HCL parsing time grows quadratically with the number of resource blocks in the configuration
  • Acquiring the DynamoDB state lock takes proportionally longer as the number of tracked resources climbs into the thousands

What is the right fix when a monolithic state has made plans take many minutes?

  • Split the state into smaller states aligned to blast-radius and stability boundaries
  • Raise the -parallelism flag well above the default of ten so every refresh runs faster
  • Set -refresh=false permanently in the CI pipeline to skip the slow step
  • Use a -target flag on every run so each individual plan stays small

Why is routine use of -target dangerous?

  • It skips the full plan, so unrelated drift accumulates unseen until a later untargeted apply surfaces it all
  • It corrupts the state by writing only the targeted part of the resource graph back to the remote backend on apply
  • It permanently removes every untargeted resource from the state on the next apply
  • It silently disables locking, allowing two concurrent applies to interleave their writes

What usually happens if you raise -parallelism far above the default to speed up a plan?

  • You hit AWS API rate limits, and throttled retries with backoff make the run slower and flakier
  • Terraform refuses to run at all, because any parallelism value above ten is rejected as invalid
  • The plan quietly becomes inaccurate because refresh is skipped once concurrency goes high
  • State locking is automatically disabled to allow the parallel operations to proceed
