Topic 48

Compute and Auto Scaling


Running compute on AWS through Terraform means launch templates, auto scaling groups, and load balancers — not a fleet of hand-managed aws_instance resources. The launch template is the blueprint (AMI, instance type, user-data, IAM profile), the ASG is the thing that keeps the right number of instances alive across AZs, and the load balancer spreads traffic over whatever the ASG currently has running.

This is also where Terraform's model meets AWS's and has to give ground. An ASG manages its own instance count through scaling policies, so the live desired_capacity belongs to the autoscaler, not to your config. If Terraform owns that number, every apply drags it back to whatever you wrote — undoing the scale-out that just happened to absorb traffic. The fix is to let Terraform set the floor and ceiling and stop watching the middle.

How a fleet is wired

launch template → auto scaling group (across subnets) → target group → application load balancer

Launch Templates

An aws_launch_template captures how to build one instance: which AMI, which instance type, the user-data run at boot, the security groups, and the IAM instance profile that grants the instance its permissions. It is versioned — each change creates a new version — and the ASG points at a version, which is the seam that makes rolling updates possible. Read the AMI from a data source so the latest patched image resolves at plan time rather than rotting as a hardcoded ID.

launch-template.tf — the instance blueprint

resource "aws_launch_template" "web" {
  name_prefix   = "web-"
  image_id      = data.aws_ami.al2023.id
  instance_type = "t3.micro"
  user_data     = base64encode(file("${path.module}/cloud-init.yaml"))

  iam_instance_profile { arn = aws_iam_instance_profile.web.arn }
  vpc_security_group_ids = [aws_security_group.web.id]

  lifecycle { create_before_destroy = true }
}

The user_data is base64-encoded cloud-init, not a provisioner — boot-time configuration that the instance applies to itself declaratively. create_before_destroy on the template means a replacement is built before the old one is torn down, which the ASG needs for a clean rollout.
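The data.aws_ami.al2023 reference in the template has to be declared somewhere. A minimal sketch follows, assuming the Amazon-owned Amazon Linux 2023 x86_64 image is the one you want. The name pattern is an assumption about Amazon's naming and should be checked against the images visible in your region.

```hcl
# Hypothetical lookup for the AMI that the launch template reads.
data "aws_ami" "al2023" {
  most_recent = true
  owners      = ["amazon"]

  # Assumed name pattern for standard AL2023 x86_64 images.
  filter {
    name   = "name"
    values = ["al2023-ami-2023.*-x86_64"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}
```

Because most_recent resolves at plan time, a newly published image shows up as a launch template change on the next plan. Tighten the filter if that is more eager than you want.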

Auto Scaling Groups and Who Owns desired_capacity

The aws_autoscaling_group ties the launch template to a set of subnets and a target group, and declares the bounds: min_size and max_size are yours to set, but desired_capacity is the live number the autoscaler moves up and down. Set it once to seed the group, then hand it over with ignore_changes so Terraform stops reconciling it. Without that, an apply run an hour after a traffic spike scaled the group to 9 instances will quietly plan it back to 3.

asg.tf — ignore_changes hands desired_capacity to the autoscaler

resource "aws_autoscaling_group" "web" {
  name_prefix         = "web-"
  min_size            = 3
  max_size            = 12
  desired_capacity    = 3
  vpc_zone_identifier = [for s in aws_subnet.private : s.id]
  target_group_arns   = [aws_lb_target_group.web.arn]
  health_check_type   = "ELB"

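  launch_template {
    id      = aws_launch_template.web.id
    # A literal "$Latest" never changes in the plan, so it would not start an instance refresh
    version = aws_launch_template.web.latest_version
  }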

  instance_refresh {
    strategy = "Rolling"
    preferences { min_healthy_percentage = 90 }
  }

  lifecycle {
    create_before_destroy = true
    # the autoscaler owns the live count; don't let apply revert it
    ignore_changes        = [desired_capacity]
  }
}

ignore_changes = [desired_capacity] is scoped to exactly the one attribute that legitimately changes out of band. Never use all, which would also hide real drift in min_size, the template, or the subnets. health_check_type = "ELB" tells the ASG to honor the load balancer's health checks in addition to the EC2 status checks, so an instance that fails HTTP checks gets replaced even though its EC2 status check passes.
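The scaling policies that move desired_capacity are separate resources attached to the group. As a rough sketch, assuming CPU-based target tracking suits the workload (the policy name and the 50% target are illustrative values, not part of the original configuration):

```hcl
# Sketch of a target tracking policy; name and target value are assumptions.
resource "aws_autoscaling_policy" "cpu" {
  name                   = "web-cpu-target"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }

    # The autoscaler adds or removes instances to hold average CPU near this value.
    target_value = 50.0
  }
}
```

With a policy like this in place, the group moves between min_size and max_size on its own, which is exactly the out-of-band change that ignore_changes is there to tolerate.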

The Load Balancer in Front

An Application Load Balancer sits across the public subnets, a target group holds the registered instances, and a listener forwards traffic. The ASG never lists instances directly — it registers them into the target group by ARN as it launches them, and deregisters them as it terminates them. That indirection is what lets the count change underneath without rewiring anything: traffic always flows to whatever is currently healthy in the target group.

alb.tf — target group and listener

resource "aws_lb_target_group" "web" {
  port     = 8080
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id
  health_check { path = "/healthz" }
}

resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.web.arn
  port              = 80
  protocol          = "HTTP"
  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.web.arn
  }
}
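The listener refers to aws_lb.web, which is not shown. One possible definition, assuming a set of public subnets and a dedicated security group for the load balancer exist under the hypothetical names aws_subnet.public and aws_security_group.alb:

```hcl
# Hypothetical definition of the load balancer that the listener attaches to.
resource "aws_lb" "web" {
  name_prefix        = "web-"
  load_balancer_type = "application"
  internal           = false

  # Assumed names: public subnets and an ALB security group defined elsewhere.
  subnets         = [for s in aws_subnet.public : s.id]
  security_groups = [aws_security_group.alb.id]
}
```

The instance security group would then need to allow port 8080 from the load balancer's security group so health checks and forwarded traffic can reach the targets.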

Zero-Downtime Instance Refresh

When you change the launch template — a new AMI, a different instance type — existing instances do not roll automatically. Two mechanisms make the replacement graceful. create_before_destroy ensures the ASG and template aren't destroyed before their replacements exist, and the instance_refresh block tells AWS to replace running instances in batches, keeping min_healthy_percentage of capacity serving traffic throughout. Set that to 90% and the refresh swaps roughly one instance at a time on a 10-instance group, never dropping below 9 healthy.

Why ASGs Beat Individual Instances

Managing a workload as separate aws_instance resources throws away the two things that make a fleet survivable: self-healing and scaling. An ASG replaces an instance that fails its health check without a human or a plan; it adds and removes instances as load moves; and it spreads them across AZs so a zone outage costs you a fraction, not the whole tier. Reach for individual instances only for genuine pets — a bastion host, a one-off — and for everything that scales, use the launch-template-plus-ASG shape.

Common Mistakes

Managing desired_capacity in Terraform without ignore_changes, so every apply reverts the autoscaler's current count and undoes a scale-out mid-incident.
Updating a launch template and expecting running instances to pick it up, without an instance_refresh block — the change applies only to instances launched afterward.
Replacing instances without create_before_destroy, so the old capacity is torn down before the new is up and the service loses headroom during the swap.
Using ignore_changes = all to silence the desired_capacity noise, which also hides real drift in the AMI, instance type, and subnets.
Running a scalable workload as individual aws_instance resources, losing self-healing, multi-AZ spread, and autoscaling in one stroke.

Best Practices

Run scalable workloads as a launch template plus an ASG, not as hand-managed instances.
Add ignore_changes = [desired_capacity] — scoped to that one attribute — so Terraform and the autoscaler stop fighting.
Use create_before_destroy plus an instance_refresh block with min_healthy_percentage for zero-downtime rollouts.
Bake AMIs with Packer and keep boot-time user-data minimal, so instances launch fast and predictably under load.
Set health_check_type = "ELB" so the ASG replaces instances that fail application health checks, not only EC2 status checks.

Comparable tools

CloudFormation ASGs with UpdatePolicy for rolling updates
Pulumi awsx for higher-level compute components
Packer bakes the AMIs the launch template references

Knowledge Check

Why add ignore_changes = [desired_capacity] to an ASG?

The autoscaler owns the live count, so without it every apply reverts a scale-out the autoscaler made
It lets the ASG launch instances beyond max_size when a scaling policy fires during traffic spikes
It prevents Terraform from ever destroying or replacing the ASG on a future apply, no matter the change
It is the required wiring that lets the launch template's $Latest version attach to the group

You change the launch template's AMI. What happens to the already-running instances?

Nothing, unless an instance_refresh block is configured — only new launches use the new AMI
All running instances are terminated and replaced immediately on the next apply that touches the template
The ASG halves its running capacity and replaces instances one batch at a time until every one is on the new AMI
Terraform refuses the apply because a launch template's image_id is immutable once set

Why prefer an ASG over individual aws_instance resources for a scalable workload?

It self-heals failed instances, scales with load, and spreads instances across AZs automatically
It is the only resource that can attach an IAM instance profile to grant instances permissions
Individual aws_instance resources cannot be registered in a target group behind a load balancer
An ASG removes the need for a launch template by defining the instance blueprint itself

What does min_healthy_percentage = 90 in an instance refresh guarantee?

At least 90% of capacity stays healthy and serving traffic while instances are replaced in batches
The refresh aborts and rolls back if fewer than 90% of the instances are already running the new AMI
Only 90% of the group's instances are ever replaced, leaving the remaining 10% on the old template version
Average CPU utilization across the group must stay under 90% for the refresh to proceed
