Chapter 12: Production Operations
Topic 73

Debugging


When Terraform behaves inexplicably — a diff that never goes away, a provider error with no obvious cause, an apply that hangs for ten minutes — guessing is the slow path. The fast path is to see what the engine is actually doing, and TF_LOG is the single most useful tool for that. It exposes Terraform's internal logs, including the exact AWS API calls the provider makes, which is usually where the real answer is hiding.

This topic covers reading those logs and the systematic approach to the failures Terraform users actually hit: the perpetual diff, the hang, and the opaque provider error. The technique underneath all of them is the same — stop theorizing about what Terraform might be doing and read what it is doing.

TF_LOG Levels

TF_LOG is an environment variable that turns on logging at a chosen verbosity: TRACE, DEBUG, INFO, WARN, or ERROR, from most to least verbose. TRACE is everything, including the internals of the graph walk; DEBUG is the practical default for most investigations. TF_LOG_PROVIDER sets the level for provider logs only, so setting it while leaving TF_LOG unset limits the output to the provider, and TF_LOG_PATH writes the logs to a file instead of flooding your terminal.

scope the level, capture to a file
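# DEBUG is usually the right level; TRACE is a firehose
export TF_LOG=DEBUG

# or log only the provider's activity: leave TF_LOG unset,
# because TF_LOG also turns on Terraform core logging
unset TF_LOG
export TF_LOG_PROVIDER=DEBUG

# write the logs to a file instead of the terminal
export TF_LOG_PATH=./tf-debug.log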

terraform plan

Scope the level to the question. Running TF_LOG=TRACE for a provider problem buries the relevant API call under the engine's graph-walk noise; TF_LOG_PROVIDER=DEBUG isolates exactly the provider activity you care about. Capture to TF_LOG_PATH so you can search the log rather than scroll it, then unset the variables when you are done — leaving TRACE on makes every subsequent run unreadable.
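
One way to avoid forgetting the unset step is to set the variables for a single command only. The following is a minimal sketch, not a built-in Terraform feature: tf_debug, TF_BIN, and TF_DEBUG_LOG are hypothetical names introduced here, while TF_LOG_PROVIDER and TF_LOG_PATH are the variables described above.

```shell
# Run one Terraform command with provider-only DEBUG logging.
# The variables are set for this invocation only, so there is
# nothing to unset afterwards.
# tf_debug, TF_BIN and TF_DEBUG_LOG are hypothetical names for this sketch.
tf_debug() {
  local log_file="${TF_DEBUG_LOG:-./tf-debug.log}"
  TF_LOG_PROVIDER=DEBUG TF_LOG_PATH="$log_file" "${TF_BIN:-terraform}" "$@"
}
```

With this in your shell, tf_debug plan runs a logged plan, and a plain terraform plan afterwards stays quiet.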

Reading Provider API Calls

The highest-value thing in the logs is the provider's API traffic. At DEBUG, the AWS provider logs each request it sends and each response it gets back, so an opaque "error creating resource" becomes a specific API error with a specific reason — an AccessDenied on a particular action, a ValidationException naming the bad parameter, a throttling response. The error Terraform prints is a summary; the log is the actual exchange with AWS.

This is the universal move. Almost every confusing Terraform failure resolves to one API call going wrong, and the log shows you which call, with what arguments, and what AWS said back. Once you can read that exchange, most opaque errors stop being opaque.
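
Once the log is in a file, searching for error codes is usually faster than reading it top to bottom. The sketch below assumes the log was captured to ./tf-debug.log; find_api_errors is a hypothetical helper, and the pattern list is only a starting set of common AWS error codes, not an exhaustive one.

```shell
# Print log lines, with line numbers, that mention common AWS error codes.
# The pattern list is an assumption for this sketch; extend it for the
# services you use. find_api_errors is a hypothetical helper name.
find_api_errors() {
  grep -nE 'AccessDenied|UnauthorizedOperation|ValidationException|ValidationError|Throttling|RequestLimitExceeded' \
    "${1:-./tf-debug.log}"
}
```

Each match is prefixed with its line number, so you can open the log at that line and read the request and response around it.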

The Perpetual Diff

A perpetual diff is a resource that shows a change on every plan even though you keep applying it: apply succeeds, the next plan proposes the same change again, and the cycle never ends. There are three usual causes: the provider normalizes a value differently from how you wrote it (a JSON policy reordered, a string recased), another system changes an attribute outside Terraform, or the provider has a genuine bug. The fix depends on which one applies.

Diagnose it by comparing what the provider reads back against what your config declares — the DEBUG log shows both. If the difference is cosmetic normalization, match your config to the canonical form the provider expects; if another system legitimately owns the attribute, ignore_changes on that specific attribute stops the fight. Letting a perpetual diff sit for months means every plan is noise and real changes hide inside it.
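
Before digging into the log, it helps to confirm that the diff really survives an apply. The sketch below relies on the -detailed-exitcode flag of terraform plan, which exits 0 when there are no changes, 1 on error, and 2 when changes are present. has_lingering_diff and TF_BIN are hypothetical names introduced for illustration.

```shell
# Report whether a plan still proposes changes.
# `terraform plan -detailed-exitcode` exits with:
#   0 = no changes, 1 = error, 2 = changes present.
# has_lingering_diff and TF_BIN are hypothetical names for this sketch.
has_lingering_diff() {
  local rc=0
  "${TF_BIN:-terraform}" plan -detailed-exitcode -input=false >/dev/null 2>&1 || rc=$?
  case "$rc" in
    0) echo "clean: no changes proposed" ;;
    2) echo "diff: plan still proposes changes" ;;
    *) echo "error: plan failed (exit $rc)"; return 1 ;;
  esac
}
```

Run it immediately after a successful apply. A "diff" result at that point is the perpetual diff, and the next step is a logged plan to compare what the provider reads back with what the config declares.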

Hangs and Timeouts

An apply that hangs is usually waiting on something, and the logs reveal what. API throttling shows up as repeated retry-with-backoff entries — the provider is being rate-limited and waiting between attempts. A dependency wait shows up as Terraform sitting on a resource whose prerequisite has not converged. A resource-level timeout shows up as the provider polling for a state that never arrives, like an RDS instance stuck in modifying.

Without logs, a hang is indistinguishable from a crash, and people kill the run — sometimes mid-write, corrupting state. With DEBUG on, you can see whether Terraform is throttled, blocked, or genuinely stuck, and decide whether to wait it out, lower parallelism, or fix the underlying resource. Reading the log is how you tell a slow apply from a dead one.
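
A rough way to tell a throttled apply from a stalled one is to count throttling and retry lines in the log from a second terminal while the apply runs. This is a sketch under assumptions: count_throttle_lines is a hypothetical helper, and the search pattern is a guess at typical wording, since the exact log text varies by provider and version.

```shell
# Count log lines that look like throttling or retries.
# The pattern is an assumption; exact wording varies by provider version.
# count_throttle_lines is a hypothetical helper name.
count_throttle_lines() {
  # grep -c prints 0 and exits non-zero when nothing matches,
  # so the exit status is ignored here.
  grep -ciE 'throttl|retry|RequestLimitExceeded|rate exceeded' "${1:-./tf-debug.log}" || true
}
```

If the count keeps climbing between checks, the provider is likely being rate-limited, and rerunning with a lower -parallelism value (the default is 10) is one option. If the count is flat and the log is still growing, look instead for the resource the provider keeps polling.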

Crash Logs and Reporting

When Terraform itself panics, it writes a crash.log with the stack trace and the recent log output. That file plus a minimal reproduction — the smallest config that triggers the crash — is what makes a provider bug report actionable. A report that says "it crashed" with no logs and no repro is unactionable; maintainers cannot fix what they cannot reproduce.

Before filing, isolate the problem to the smallest config that still fails, and attach the scoped DEBUG log and the crash.log. The effort of building a minimal reproduction often surfaces the cause yourself — and when it does not, it is exactly what a maintainer needs to fix it fast.
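
Gathering the attachments can be scripted. The sketch below assumes the minimal reproduction lives in the current directory and that the log was captured to ./tf-debug.log; collect_bug_report, TF_BIN, and the tf-bug-report directory are hypothetical names. Debug logs may contain sensitive values, so review the copied files before attaching them.

```shell
# Copy the minimal config, the scoped log, and crash.log (if present)
# into one directory, and record the Terraform version next to them.
# collect_bug_report and TF_BIN are hypothetical names for this sketch.
collect_bug_report() {
  local out_dir="${1:-./tf-bug-report}"
  local f
  mkdir -p "$out_dir" || return 1
  for f in ./*.tf ./tf-debug.log ./crash.log; do
    # skip anything that does not exist, including an unmatched glob
    if [ -f "$f" ]; then cp "$f" "$out_dir/"; fi
  done
  "${TF_BIN:-terraform}" version > "$out_dir/version.txt" 2>&1
}
```

Calling collect_bug_report with no argument writes to ./tf-bug-report; pass a path to choose another directory.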

Common Mistakes
  • Guessing at the cause of a perpetual diff or an opaque error instead of turning on TF_LOG to see the actual API calls behind it.
  • Running TF_LOG=TRACE and drowning in graph-walk noise when DEBUG or TF_LOG_PROVIDER would have isolated the issue.
  • Killing a hanging apply mid-write instead of reading the log to see it is throttled and waiting, sometimes corrupting state in the process.
  • Filing a provider bug with no logs and no minimal reproduction, making it unactionable for the maintainers.
  • Ignoring a perpetual diff for months instead of diagnosing the normalization or ignore_changes fix, so every plan is noise that hides real changes.
Best Practices
  • Reach for TF_LOG and TF_LOG_PATH early when behavior is inexplicable, scoping the level to DEBUG to keep the log readable.
  • Use TF_LOG_PROVIDER to isolate the provider's exact API request and response behind a failure.
  • Diagnose a perpetual diff by comparing what the provider reads back against what the config declares, then normalize the config or scope ignore_changes to that attribute.
  • Read the log before killing a hanging apply, to tell a throttled wait from a genuine stall.
  • When reporting a bug, attach the scoped log and crash.log with the smallest config that reproduces it.
Comparable tools
  • Pulumi: verbose logging via --logtostderr and -v levels
  • CloudFormation: stack events, plus CloudTrail for the underlying API calls
  • Ansible: -vvvv verbose output showing each module's API activity

Knowledge Check

What does TF_LOG expose that the normal error output does not?

  • Terraform's internal logs, including the exact AWS API requests and responses the provider makes
  • A rendered graphical visualization of the dependency graph, drawn in DOT and exported to an SVG file
  • The decrypted plaintext contents of the state file's sensitive values, pulled straight from the S3 backend
  • A predicted monthly cost estimate for every resource in the planned changes

You are debugging a provider error. How should you scope the log verbosity?

  • Use TF_LOG_PROVIDER=DEBUG to isolate the provider's API activity rather than TF_LOG=TRACE, which buries it in noise
  • Always use TF_LOG=TRACE with logs sent to a file, since the most detail is always the safest choice for any problem
  • Use TF_LOG=ERROR, which shows only the API calls that failed and keeps the output most concise
  • Disable logging entirely and rely on the plan summary's change counts instead

A resource shows the same change on every plan no matter how many times you apply. What is a common cause?

  • The provider normalizes the value differently from how you wrote it, or another system changes the attribute out of band
  • The state file is missing its entry for the resource, so every plan recreates it from scratch
  • Locking is disabled on the backend, so the apply reports success in the output but never actually writes the change back to the remote state
  • The resource has create_before_destroy set in its lifecycle block, which forces a fresh re-plan

What makes a provider bug report actionable for maintainers?

  • A minimal reproduction config plus the scoped DEBUG log and crash.log
  • A screenshot of the terminal showing the red error summary line and the exit code
  • The full unredacted state file so maintainers can inspect every managed resource
  • A description of the symptom with the Terraform version number only
