Chapter 10: Collaboration and Automation
Topic 57

Team Backends


Moving from solo to team Terraform comes down to one thing: shared, locked, durable remote state. On AWS that means an S3 backend — a bucket holding the state object, encrypted, versioned, and locked so two writers cannot clobber each other. Locking was historically a separate DynamoDB table; since Terraform 1.10 the S3 backend can lock natively with a conditional-write lock object, and DynamoDB is now the legacy path.

There is a catch the first time you set this up: the backend has to exist before Terraform can use it, and Terraform is the thing you would use to create it. Two things separate a backend that works from one that leaks or corrupts: solving that chicken-and-egg problem, and scoping access tightly enough that the plaintext secrets in state stay private.

The S3 backend, shared by team and CI:
  • S3 bucket: holds the state object. It is versioned for a recovery point, KMS-encrypted because state stores secrets in plaintext, and has public access blocked.
  • Lock: use_lockfile (native S3 locking since Terraform 1.10) or a legacy DynamoDB table, so two writers cannot clobber each other.
  • Shared access: one backend the whole team and CI point at, scoped by IAM to the exact roles they assume, so every plan sees the same state.

The Full S3 Backend

A production state bucket is not just a bucket. It needs versioning so a bad write has a recovery point, default encryption (KMS) so the secrets in state are encrypted at rest, and public access blocked so a misconfigured policy cannot expose it. The backend "s3" block then points Terraform at the bucket and key. Backend configuration is read before the rest of the config, so it cannot use variables — the values are literal or supplied at init time.

backend.tf — the S3 backend with native locking
terraform {
  backend "s3" {
    bucket       = "acme-tfstate-prod"
    key          = "network/vpc/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    kms_key_id   = "alias/tfstate"
    use_lockfile = true
  }
}

encrypt = true with a kms_key_id encrypts the state object with a customer-managed key rather than the default S3 key, so access to state requires access to the KMS key too. use_lockfile = true turns on native S3 locking. There is no DynamoDB table in this block and no separate lock infrastructure to provision.
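Because the backend block cannot take variables, one common pattern is a partial configuration. You leave the per-environment values out of the block and pass them at init time from a small file. The following is a minimal sketch. The prod.s3.tfbackend file name is a hypothetical choice, and the bucket, region, and key alias are reused from the block above. A staging file would hold its own values in the same shape.

backend.tf: per-environment values left out (partial configuration)
terraform {
  backend "s3" {
    key          = "network/vpc/terraform.tfstate"
    encrypt      = true
    use_lockfile = true
    # bucket, region, and kms_key_id are supplied with -backend-config at init time
  }
}

prod.s3.tfbackend: hypothetical per-environment settings file
# Used as: terraform init -backend-config=prod.s3.tfbackend
bucket     = "acme-tfstate-prod"
region     = "us-east-1"
kms_key_id = "alias/tfstate"

The *.s3.tfbackend suffix is the naming pattern the Terraform documentation suggests, not a requirement. To point an already-initialized directory at a different environment's file, you generally re-run init with -reconfigure, or with -migrate-state if the state should move.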

Locking Choices

Native S3 locking (Terraform 1.10 and later) writes a small lock object next to the state. It relies on S3 conditional writes so that only one writer can hold that object at a time, and it needs no extra resource. The legacy approach is a DynamoDB table with a LockID partition key, named via dynamodb_table in the backend block. It predates native locking and is the right choice only when you are stuck on an older Terraform or already run such a table. For any new backend, use_lockfile is the recommendation. Existing DynamoDB setups keep working and can migrate deliberately later.
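For the legacy path, the following sketch shows the two pieces involved. The first is the lock table, which would be created alongside the bucket, for example in the bootstrap config. The second is the backend block that names the table. The table name acme-tfstate-locks is hypothetical. The one fixed requirement is a string partition key named LockID.

lock-table.tf: legacy DynamoDB lock table (hypothetical name)
resource "aws_dynamodb_table" "tfstate_locks" {
  name         = "acme-tfstate-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the S3 backend writes its lock entries under this key

  attribute {
    name = "LockID"
    type = "S"
  }
}

backend.tf: legacy locking through dynamodb_table
terraform {
  backend "s3" {
    bucket         = "acme-tfstate-prod"
    key            = "network/vpc/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    kms_key_id     = "alias/tfstate"
    dynamodb_table = "acme-tfstate-locks" # replaces use_lockfile on the legacy path
  }
}

Recent Terraform versions accept use_lockfile and dynamodb_table together, which is one way to run a deliberate migration. Check the S3 backend documentation for your version before relying on that combination.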

Bootstrapping the Backend

The bucket and KMS key that store and encrypt state cannot be created by a config that already uses that backend, because the backend does not exist yet. The clean answer is a tiny dedicated bootstrap config that uses the default local backend. It creates the bucket (with versioning, encryption, and a public-access block) and the KMS key, and you apply it once. After that, every real config points its backend "s3" at the bootstrapped bucket. Some teams create the bucket by hand or with a one-off script instead. Either way it is a documented one-time step, not part of the day-to-day workflow.

bootstrap/main.tf — create the state bucket itself (local backend)
resource "aws_s3_bucket" "state" {
  bucket = "acme-tfstate-prod"
}

resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id
  versioning_configuration { status = "Enabled" }
}

resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

This config runs against local state because it is building the remote backend everything else will use. You apply it once, then leave it alone — the bucket it creates outlives every stack that stores state in it.
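The bootstrap above stops at versioning and the public-access block. The KMS key and default encryption described earlier would live in the same config. The following sketch assumes the alias/tfstate name that backend.tf references. Turning on key rotation is a choice made here, not a requirement.

bootstrap/kms.tf: the state key and default bucket encryption
resource "aws_kms_key" "state" {
  description         = "Encrypts Terraform state objects"
  enable_key_rotation = true
}

# Alias matched by kms_key_id = "alias/tfstate" in backend.tf
resource "aws_kms_alias" "state" {
  name          = "alias/tfstate"
  target_key_id = aws_kms_key.state.key_id
}

# Default SSE-KMS for every object written to the state bucket
resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.state.arn
    }
  }
}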

Access Control

State holds secrets in plaintext (RDS passwords, generated keys, tokens), so read access to the state bucket is read access to those secrets. Lock the bucket down with an IAM policy that grants access only to the specific roles your engineers and CI assume. Those roles need:
  • s3:ListBucket on the bucket.
  • s3:GetObject and s3:PutObject on the state prefix, plus s3:DeleteObject on the lock objects when use_lockfile is on.
  • kms:Encrypt, kms:Decrypt, and kms:GenerateDataKey on the KMS key.
A bucket that is durable and versioned but readable by the whole account is a secrets leak waiting to be noticed.
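The following sketch covers the grant side for one role, written as an aws_iam_policy_document. The role name terraform-ci, the network/ prefix, and the tfstate_kms_key_arn variable are all hypothetical. The permission list is a reading of what the S3 backend needs for KMS-encrypted state with use_lockfile, so check it against the backend documentation for your version. This policy only grants access. Keeping everyone else out also depends on nothing broader, such as an admin policy or a permissive bucket policy, granting read elsewhere.

iam.tf: scoped access to one state prefix (hypothetical names)
variable "tfstate_kms_key_arn" {
  type        = string
  description = "ARN of the KMS key that encrypts state (hypothetical input)"
}

data "aws_iam_policy_document" "tfstate_network" {
  statement {
    sid       = "ListStateBucket"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::acme-tfstate-prod"]
  }

  # Covers both the state objects and the .tflock objects under the prefix
  statement {
    sid       = "ReadWriteState"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::acme-tfstate-prod/network/*"]
  }

  # Releasing a native lock deletes its .tflock object; state itself is never deleted
  statement {
    sid       = "ReleaseLocks"
    actions   = ["s3:DeleteObject"]
    resources = ["arn:aws:s3:::acme-tfstate-prod/network/*.tflock"]
  }

  statement {
    sid       = "UseStateKey"
    actions   = ["kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey"]
    resources = [var.tfstate_kms_key_arn]
  }
}

resource "aws_iam_role_policy" "ci_tfstate_network" {
  name   = "tfstate-network"
  role   = "terraform-ci" # hypothetical CI role name
  policy = data.aws_iam_policy_document.tfstate_network.json
}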

One State per Boundary

Putting every environment and component under one state key couples their blast radius and their locks. A lock held by a staging apply blocks a prod apply, and a corrupt write takes everything with it. Map a clear key or prefix scheme instead, with one key per environment-and-component boundary. For example, inside the acme-tfstate-prod bucket, network/vpc/terraform.tfstate and app/api/terraform.tfstate each get their own key, and staging lives in its own bucket or under its own prefix. Separate keys mean separate locks, independent failure, and a smaller blast radius when something goes wrong in one of them.
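Sketched as config, a second stack in the same bucket differs only in its key. With use_lockfile, the lock object is named after the state key with a .tflock suffix, so each key gets its own lock. The app/api path follows the example naming in the paragraph above.

app/api/backend.tf: same bucket, its own key and therefore its own lock
terraform {
  backend "s3" {
    bucket       = "acme-tfstate-prod"
    key          = "app/api/terraform.tfstate" # network/vpc keeps network/vpc/terraform.tfstate
    region       = "us-east-1"
    encrypt      = true
    kms_key_id   = "alias/tfstate"
    use_lockfile = true # lock object: app/api/terraform.tfstate.tflock
  }
}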

Common Mistakes
  • Standing up an S3 backend with no bucket versioning, so a bad apply that corrupts state leaves no prior version to roll back to.
  • Granting broad read access to the state bucket, exposing every secret in state to anyone with bucket read across the account.
  • Forgetting locking entirely — no use_lockfile and no DynamoDB table — so two CI jobs writing at once silently corrupt shared state.
  • Storing all environments under one state key, coupling their locks and blast radius so a staging apply blocks prod and one bad write loses everything.
  • Hardcoding backend values you then need to vary per environment and discovering the backend block cannot use variables, forcing a copy-paste of nearly identical configs.
Best Practices
  • Provision the state bucket with versioning, KMS encryption, and public access blocked, then restrict it to the specific IAM roles your team and CI assume.
  • Use native S3 locking (use_lockfile = true) on every new backend; keep DynamoDB only for existing setups until you migrate them deliberately.
  • Bootstrap the backend with a small dedicated config on the local backend, apply it once, and document the one-time setup.
  • Map a clear key or prefix per environment and component so each state has its own lock and an isolated blast radius.
  • Supply per-environment backend settings with -backend-config files at init time rather than duplicating the whole config, since the block itself cannot take variables.
Comparable tools
  • CloudFormation: needs no backend; state lives in AWS.
  • HCP Terraform: managed remote state with locking built in.
  • Pulumi: its own service, or an S3/Azure Blob backend.

Knowledge Check

What does enabling versioning on the S3 state bucket give you?

  • A recovery point — a prior state version to roll back to after a bad write corrupts the current one
  • Automatic state locking, so a separate DynamoDB lock table mechanism is no longer needed alongside it
  • Encryption at rest of the secret values stored inside the state object
  • Safe parallel applies, by giving each concurrent writer its own object version

For a brand-new S3 backend on Terraform 1.10+, what is the recommended way to lock state?

  • Native S3 locking with use_lockfile = true, which needs no extra resource
  • A DynamoDB table with a LockID partition key, the current default
  • No locking is needed — the S3 backend serializes writes automatically without configuration
  • A separate Redis lock service running alongside the bucket

Why does the bootstrap config that creates the state bucket use the local backend?

  • The S3 backend it is building does not exist yet, so it cannot store its own state there
  • Local state on the laptop is inherently more secure for buckets that hold secrets
  • The S3 backend is unable to manage aws_s3_bucket resources, only other resource types
  • Bootstrap configs are never actually applied, so the backend choice for them is irrelevant

Why give each environment and component its own state key instead of one shared key?

  • Separate keys mean separate locks and an isolated blast radius, so one stack's lock or corruption does not affect the others
  • A single shared state key simply cannot be encrypted at rest with KMS, whereas multiple separate keys each can be encrypted fine
  • Terraform flatly refuses to run with more than one resource stored under a single shared state key
  • One state key per component is strictly required for the native S3 locking mechanism to function
