Chapter 2: The Core Workflow
Topic 10

Data Sources

A data source reads information about infrastructure Terraform does not manage — an existing AMI, the default VPC, the AWS account ID you are running in — so your configuration can reference real values without baking them in as constants. Where a resource creates and owns an object, a data block only looks one up. It never creates, changes, or destroys anything.

The payoff is a configuration that adapts to its environment instead of carrying brittle hardcoded IDs. The same code, run in a different account or region, resolves a different account ID and a different latest AMI on its own — no edit required.

Three ways to get a value into your config
  • Data source: looks up a live value on every run (account ID, region, AZs, latest AMI), so it stays correct as the world changes.
  • Hardcoded value: a literal ID pasted in. Fast and dependency-free, but it rots: the AMI is deprecated, the VPC is replaced, launches fail.
  • terraform_remote_state: reads another stack's outputs directly. Convenient, but couples two configs tightly to each other's internal layout.

Resource vs Data Source

A resource block manages a lifecycle: Terraform creates the object, tracks it in state, updates it when the config changes, and destroys it on terraform destroy. A data block does none of that. It performs a read against the provider's API and exposes the result as attributes you can reference. Removing a data block from your config removes the lookup; it never touches a real object, because the data source never owned one in the first place.
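
As a minimal sketch of that contrast, assuming a configured AWS provider and an account that still has its default VPC, the fragment below manages a subnet but only looks up the VPC that contains it. The local names, the tag, and the chosen subnet range are hypothetical.

```hcl
# Lookup only: reads the account's default VPC. Terraform never creates,
# changes, or destroys this VPC.
data "aws_vpc" "default" {
  default = true
}

# Managed: Terraform creates this subnet, tracks it in state, and removes it
# on terraform destroy.
resource "aws_subnet" "example" {
  vpc_id = data.aws_vpc.default.id

  # Derive a /24 from the looked-up VPC range instead of hardcoding it.
  # Assumes the VPC is a /16 and that this slice does not overlap an
  # existing subnet (hypothetical choice of index 200).
  cidr_block = cidrsubnet(data.aws_vpc.default.cidr_block, 8, 200)

  tags = {
    Name = "example" # hypothetical
  }
}
```

Destroying this configuration removes the subnet and leaves the VPC alone; deleting the data block only removes the lookup.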

Common AWS Data Sources

A handful of data sources show up in almost every AWS configuration. aws_ami finds an image by filter — usually the latest one matching a name pattern. aws_availability_zones lists the AZs available in the current region. aws_caller_identity returns the account ID and ARN you are authenticated as. aws_region gives the active region, and aws_vpc looks up an existing VPC by tag or ID. The block below resolves the most recent Amazon Linux 2023 image owned by Amazon and feeds its ID straight into an instance.

Look up the latest AMI and use it
data "aws_ami" "al2023" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.al2023.id
  instance_type = "t3.micro"
}

Read the address carefully: a data source is referenced as data.<type>.<name>.<attribute> — here data.aws_ami.al2023.id — with the data. prefix that distinguishes it from a managed resource. That prefix is the only thing separating a lookup from a thing you own.
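
The same addressing pattern applies to the other data sources named earlier. The sketch below, which assumes a configured AWS provider, reads the caller's identity and the region's availability zones and exposes them as outputs. The local names current and available and the output names are illustrative choices, not requirements.

```hcl
# Who is Terraform authenticated as? This data source takes no arguments.
data "aws_caller_identity" "current" {}

# Which AZs are usable in the provider's region?
data "aws_availability_zones" "available" {
  state = "available"
}

output "account_id" {
  value = data.aws_caller_identity.current.account_id
}

output "caller_arn" {
  value = data.aws_caller_identity.current.arn
}

output "az_names" {
  value = data.aws_availability_zones.available.names
}
```

Run in a different account or region, the same code resolves different values with no edit.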

Filtering and Selecting

A data source that matches more than one object is a problem, because most expect exactly one result. For aws_ami, most_recent = true resolves the ambiguity by taking the newest match, and owners restricts the search to a trusted publisher so you do not accidentally pick up a community image with a colliding name. The filter blocks narrow by attribute, and on data sources that accept a tags argument, such as aws_vpc, tags pin down a specific object. Without most_recent, an aws_ami lookup that matches several images fails with an error instead of choosing one for you.
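
One way to narrow a lookup is sketched below. It assumes the standard EC2 image filter names, assumes that a VPC tagged Name = shared-network exists (a hypothetical tag), and assumes current AL2023 naming, in which the minimal images carry an al2023-ami-minimal- prefix that the tighter pattern excludes.

```hcl
data "aws_ami" "al2023_strict" {
  most_recent = true       # take the newest of whatever still matches
  owners      = ["amazon"] # only images published by Amazon

  # Multiple filter blocks are combined: an image must satisfy all of them.
  filter {
    name = "name"
    # Assumption: standard images are named al2023-ami-2023.<...>-x86_64,
    # so this pattern skips the al2023-ami-minimal-* variants.
    values = ["al2023-ami-2023.*-x86_64"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  filter {
    name   = "root-device-type"
    values = ["ebs"]
  }
}

# aws_vpc accepts a tags argument; the lookup must match exactly one VPC.
data "aws_vpc" "shared" {
  tags = {
    Name = "shared-network" # hypothetical
  }
}
```

Each added filter shrinks the candidate set before most_recent picks from it, so the result depends less on whatever else happens to share the name pattern.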

When Data Sources Read

Most data sources are read during the plan, so their values are available to it and you see real IDs in the diff. The exception is a data source whose arguments depend on a value that is not known until apply, typically an attribute of a resource that does not exist yet. Its read is deferred to apply, after that resource is created, and its attributes show as known after apply in the plan. When you have a genuine ordering requirement that Terraform cannot infer from references, add depends_on to the data block so it is read only after the managed resource has been applied; otherwise the data source may be read too early, and the plan fails or works from a stale result.
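
Both cases can be sketched in one fragment. It assumes a configured AWS provider; the CIDR ranges and the Tier tag are hypothetical. The data block references the VPC directly, so Terraform sees that dependency on its own, but nothing in its arguments mentions the subnet, so that ordering has to be declared.

```hcl
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16" # hypothetical
}

resource "aws_subnet" "app" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.10.0/24" # hypothetical

  tags = {
    Tier = "app" # hypothetical
  }
}

data "aws_subnets" "app" {
  # Implicit dependency: this reference tells Terraform to wait for the VPC,
  # and defers the read to apply while the VPC ID is still unknown.
  filter {
    name   = "vpc-id"
    values = [aws_vpc.main.id]
  }

  filter {
    name   = "tag:Tier"
    values = ["app"]
  }

  # Hidden dependency: no argument refers to aws_subnet.app, so without this
  # the read could run before the subnet exists and miss it.
  depends_on = [aws_subnet.app]
}

output "app_subnet_ids" {
  value = data.aws_subnets.app.ids
}
```

The trade-off is that depends_on defers the read to apply whenever the listed resource has pending changes, so the data source's attributes show as known after apply in those plans. It is best reserved for dependencies that references cannot express.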

Outputs from Other Configurations

The terraform_remote_state data source reads another configuration's outputs — a way to consume a network stack's VPC ID in an application stack. It works, but it couples the two configurations tightly: the consumer breaks if the producer's state moves, its outputs change shape, or its backend is reorganized. For cross-stack values, publishing to SSM Parameter Store and reading it back with the aws_ssm_parameter data source keeps the coupling loose — the consumer depends on a stable parameter name, not on another stack's internal state layout.
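
A minimal sketch of the Parameter Store approach follows, assuming a configured AWS provider in both stacks. The parameter name /network/vpc_id, the CIDR range, and the resource names are hypothetical. The two halves belong to two separate configurations and are shown together only for comparison.

```hcl
# --- Producer: network stack (its own configuration and state) ---
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16" # hypothetical
}

resource "aws_ssm_parameter" "vpc_id" {
  name  = "/network/vpc_id" # hypothetical; treat the name as a stable contract
  type  = "String"
  value = aws_vpc.main.id
}

# --- Consumer: application stack (a separate configuration) ---
data "aws_ssm_parameter" "vpc_id" {
  name = "/network/vpc_id"
}

resource "aws_security_group" "app" {
  name   = "app" # hypothetical
  vpc_id = data.aws_ssm_parameter.vpc_id.value
}
```

The consumer knows only the parameter name, so the producer can change its backend or restructure its outputs without breaking it. The AWS provider treats value on this data source as sensitive, so expect it to be redacted in plan output.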

Data source vs Hardcoded value vs Remote state

Data source — looks up a live value on every run, so it stays correct as the world changes, at the cost of one extra API call. Use it for ambient facts like the account ID, region, AZs, or the latest AMI.

Hardcoded value — a literal ID pasted into the config. Fast and dependency-free, but it rots: the AMI gets deprecated, the VPC gets replaced, and launches fail months later. Acceptable only for genuinely permanent constants.

terraform_remote_state — reads another stack's outputs directly. Convenient but couples two configs tightly. Prefer published outputs or SSM Parameter Store for loose cross-stack coupling.

Common Mistakes
  • Hardcoding an AMI ID that AWS later deprecates, so launches fail months down the line — an aws_ami data source with a filter avoids the rot entirely.
  • Writing an aws_ami filter that matches several images without most_recent = true, so the lookup fails as soon as a second matching image exists.
  • Omitting owners on an aws_ami lookup, so a community image with a colliding name can be selected — a real supply-chain risk.
  • Expecting a data source to wait for a resource it depends on only indirectly, with no reference between them, then hitting a failed or stale read because it ran too early; the fix is an explicit depends_on.
  • Coupling stacks with terraform_remote_state and creating a hidden ordering dependency that breaks the day the upstream state is reorganized.

Best Practices
  • Use data sources for ambient facts — account ID, region, AZs, latest AMI — instead of committing them as constants.
  • Constrain every aws_ami lookup with owners and most_recent = true so exactly one trusted image resolves deterministically.
  • Prefer SSM Parameter Store or published module outputs over terraform_remote_state for cross-stack values to keep coupling loose.
  • Add depends_on to a data source that must read only after a managed resource exists.
  • Reference a data source as data.<type>.<name>.<attribute>, keeping the data. prefix that marks it as a lookup, not something you own.

Comparable tools
  • CloudFormation parameters and Fn::ImportValue cover part of this.
  • Pulumi uses get and lookup functions.
  • Ansible reads facts via gather_facts and lookup plugins.

Knowledge Check

What is the defining difference between a resource and a data block?

  • A resource manages an object's lifecycle; a data block only reads and never creates, changes, or destroys anything
  • A data block creates resources faster because it skips the plan phase and writes them directly to state during apply
  • A resource is read-only while a data block performs the actual create, update, and destroy writes
  • Data blocks are stored in state but resources are not

Why is an aws_ami lookup with a name filter but no most_recent and no owners dangerous?

  • More than one image can match, which makes the lookup fail, and without owners an untrusted publisher's image could be the one selected
  • It always returns the single oldest matching image, which ships without any of the latest operating-system security patches applied
  • It forces a full replacement of every instance referencing it on each apply
  • It cannot be referenced by a resource without a depends_on

When does a typical data source read its value?

  • At plan time, unless its arguments depend on a resource that does not exist yet, which defers it to apply
  • Always at apply time, strictly after every managed resource in the configuration has finished being created and recorded
  • Only when you explicitly run a dedicated terraform refresh command to repopulate it
  • Once at init time and then cached for the life of the state file

Why prefer SSM Parameter Store over terraform_remote_state for sharing a VPC ID between stacks?

  • The consumer depends on a stable parameter name rather than the producer stack's internal state layout, keeping coupling loose
  • terraform_remote_state is fundamentally unable to read any of the named outputs from a state file stored on a remote S3 backend
  • SSM Parameter Store is the only data source capable of reading its value during the plan phase
  • Remote state simply cannot be read across two separate AWS accounts under any configuration
