Compute and Auto Scaling
Running compute on AWS through Terraform means launch templates, auto scaling groups, and load balancers — not a fleet of hand-managed aws_instance resources. The launch template is the blueprint (AMI, instance type, user-data, IAM profile), the ASG is the thing that keeps the right number of instances alive across AZs, and the load balancer spreads traffic over whatever the ASG currently has running.
This is also where Terraform's model meets AWS's and has to give ground. An ASG manages its own instance count through scaling policies, so the live desired_capacity belongs to the autoscaler, not to your config. If Terraform owns that number, every apply drags it back to whatever you wrote — undoing the scale-out that just happened to absorb traffic. The fix is to let Terraform set the floor and ceiling and stop watching the middle.
Launch Templates
An aws_launch_template captures how to build one instance: which AMI, which instance type, the user-data run at boot, the security groups, and the IAM instance profile that grants the instance its permissions. It is versioned — each change creates a new version — and the ASG points at a version, which is the seam that makes rolling updates possible. Read the AMI from a data source so the latest patched image resolves at plan time rather than rotting as a hardcoded ID.
resource "aws_launch_template" "web" {
  name_prefix   = "web-"
  image_id      = data.aws_ami.al2023.id
  instance_type = "t3.micro"
  user_data     = base64encode(file("${path.module}/cloud-init.yaml"))

  iam_instance_profile {
    arn = aws_iam_instance_profile.web.arn
  }

  vpc_security_group_ids = [aws_security_group.web.id]

  lifecycle {
    create_before_destroy = true
  }
}
The user_data is base64-encoded cloud-init, not a provisioner — boot-time configuration that the instance applies to itself declaratively. create_before_destroy on the template means a replacement is built before the old one is torn down, which the ASG needs for a clean rollout.
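The launch template above reads data.aws_ami.al2023.id without showing the data source behind it. A minimal sketch of what that lookup might look like follows; the name filter pattern is an assumption about how Amazon Linux 2023 images are named, so check it against the images available in your region before relying on it.

```hcl
# Hypothetical lookup for the newest Amazon Linux 2023 x86_64 image.
# The name pattern below is an assumption, not taken from the lesson.
data "aws_ami" "al2023" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-2023.*-x86_64"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}
```

Because most_recent resolves at plan time, a newly published image shows up as a change to image_id on the next plan, which in turn produces a new launch template version.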
Auto Scaling Groups and Who Owns desired_capacity
The aws_autoscaling_group ties the launch template to a set of subnets and a target group, and declares the bounds: min_size and max_size are yours to set, but desired_capacity is the live number the autoscaler moves up and down. Set it once to seed the group, then hand it over with ignore_changes so Terraform stops reconciling it. Without that, an apply run an hour after a traffic spike scaled the group to 9 instances will quietly plan it back to 3.
resource "aws_autoscaling_group" "web" {
  name_prefix         = "web-"
  min_size            = 3
  max_size            = 12
  desired_capacity    = 3
  vpc_zone_identifier = [for s in aws_subnet.private : s.id]
  target_group_arns   = [aws_lb_target_group.web.arn]
  health_check_type   = "ELB"

  launch_template {
    id = aws_launch_template.web.id
    # Reference latest_version rather than "$Latest": Terraform only sees a
    # diff, and so only starts an instance refresh, when this value changes.
    version = aws_launch_template.web.latest_version
  }

  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 90
    }
  }

  lifecycle {
    create_before_destroy = true
    # the autoscaler owns the live count; don't let apply revert it
    ignore_changes = [desired_capacity]
  }
}
ignore_changes = [desired_capacity] is scoped to exactly the one attribute that legitimately changes out of band — never all, which would also hide real drift in min_size, the template, or the subnets. health_check_type = "ELB" tells the ASG to trust the load balancer's health checks, so an instance that fails HTTP checks gets replaced even though the EC2 status check passes.
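To make "the autoscaler owns the live count" concrete, here is a sketch of one kind of scaling policy that would move desired_capacity between min_size and max_size. The resource name cpu_target, the policy name, and the 60% CPU target are illustrative assumptions, not values from this lesson.

```hcl
# Hypothetical target-tracking policy: AWS adjusts desired_capacity on its
# own to hold average CPU across the group near the target value.
resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "web-cpu-target"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 60
  }
}
```

With a policy like this attached, the count Terraform wrote at creation time is only a seed. Every later change to it comes from AWS, which is exactly the out-of-band change that ignore_changes = [desired_capacity] is meant to tolerate.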
The Load Balancer in Front
An Application Load Balancer sits across the public subnets, a target group holds the registered instances, and a listener forwards traffic. The ASG never lists instances directly — it registers them into the target group by ARN as it launches them, and deregisters them as it terminates them. That indirection is what lets the count change underneath without rewiring anything: traffic always flows to whatever is currently healthy in the target group.
resource "aws_lb_target_group" "web" {
  port     = 8080
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    path = "/healthz"
  }
}

resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.web.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.web.arn
  }
}
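The listener refers to aws_lb.web, which the lesson does not show. A minimal sketch of that load balancer could look like the following; aws_subnet.public and aws_security_group.alb are hypothetical names assumed to exist elsewhere in the configuration.

```hcl
# Hypothetical internet-facing ALB spanning the public subnets.
# aws_subnet.public and aws_security_group.alb are assumed, not from the lesson.
resource "aws_lb" "web" {
  name_prefix        = "web-"
  load_balancer_type = "application"
  internal           = false
  subnets            = [for s in aws_subnet.public : s.id]
  security_groups    = [aws_security_group.alb.id]
}
```

The instances themselves stay in the private subnets named by the ASG's vpc_zone_identifier; only the load balancer is exposed publicly.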
Zero-Downtime Instance Refresh
When you change the launch template — a new AMI, a different instance type — existing instances do not roll automatically. Two mechanisms make the replacement graceful. create_before_destroy ensures the ASG and template aren't destroyed before their replacements exist, and the instance_refresh block tells AWS to replace running instances in batches, keeping min_healthy_percentage of capacity serving traffic throughout. Set that to 90% and the refresh swaps roughly one instance at a time on a 10-instance group, never dropping below 9 healthy.
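The arithmetic behind "roughly one instance at a time" can be checked in terraform console with a few locals. This is only an approximation of the batching: the names are hypothetical, and rounding up with ceil is an assumption about how the healthy floor is computed, not a statement of how AWS rounds internally.

```hcl
# Rough worked example for a 10-instance group at 90% minimum healthy.
locals {
  group_size          = 10
  min_healthy_percent = 90

  # Instances that must keep serving during the refresh: ceil(10 * 90 / 100) = 9
  min_healthy = ceil(local.group_size * local.min_healthy_percent / 100)

  # Instances that can be out of service at once: 10 - 9 = 1
  max_batch = local.group_size - local.min_healthy
}

output "refresh_batch_size" {
  value = local.max_batch
}
```

On a smaller group the same percentage leaves less room to manoeuvre: with 3 instances, a 90% floor rounds up to all 3, so the exact replacement behavior depends on how AWS handles that edge and is worth testing before a production rollout.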
Why ASGs Beat Individual Instances
Managing a workload as separate aws_instance resources throws away the two things that make a fleet survivable: self-healing and scaling. An ASG replaces an instance that fails its health check without a human or a plan; it adds and removes instances as load moves; and it spreads them across AZs so a zone outage costs you a fraction, not the whole tier. Reach for individual instances only for genuine pets — a bastion host, a one-off — and for everything that scales, use the launch-template-plus-ASG shape.
Common Mistakes
- Managing desired_capacity in Terraform without ignore_changes, so every apply reverts the autoscaler's current count and undoes a scale-out mid-incident.
- Updating a launch template and expecting running instances to pick it up, without an instance_refresh block; the change applies only to instances launched afterward.
- Replacing instances without create_before_destroy, so the old capacity is torn down before the new is up and the service loses headroom during the swap.
- Using ignore_changes = all to silence the desired_capacity noise, which also hides real drift in the AMI, instance type, and subnets.
- Running a scalable workload as individual aws_instance resources, losing self-healing, multi-AZ spread, and autoscaling in one stroke.
Best Practices
- Run scalable workloads as a launch template plus an ASG, not as hand-managed instances.
- Add ignore_changes = [desired_capacity], scoped to that one attribute, so Terraform and the autoscaler stop fighting.
- Use create_before_destroy plus an instance_refresh block with min_healthy_percentage for zero-downtime rollouts.
- Bake AMIs with Packer and keep boot-time user-data minimal, so instances launch fast and predictably under load.
- Set health_check_type = "ELB" so the ASG replaces instances that fail application health checks, not only EC2 status checks.
Knowledge Check
Why add ignore_changes = [desired_capacity] to an ASG?
- The autoscaler owns the live count, so without it every apply reverts a scale-out the autoscaler made
- It lets the ASG launch instances beyond max_size when a scaling policy fires during traffic spikes
- It prevents Terraform from ever destroying or replacing the ASG on a future apply, no matter the change
- It is the required wiring that lets the launch template's $Latest version attach to the group
You change the launch template's AMI. What happens to the already-running instances?
- Nothing, unless an instance_refresh block is configured; only new launches use the new AMI
- All running instances are terminated and replaced immediately on the next apply that touches the template
- The ASG halves its running capacity and replaces instances one batch at a time until every one is on the new AMI
- Terraform refuses the apply because a launch template's image_id is immutable once set
Why prefer an ASG over individual aws_instance resources for a scalable workload?
- It self-heals failed instances, scales with load, and spreads instances across AZs automatically
- It is the only resource that can attach an IAM instance profile to grant instances permissions
- Individual aws_instance resources cannot be registered in a target group behind a load balancer
- An ASG removes the need for a launch template by defining the instance blueprint itself
What does min_healthy_percentage = 90 in an instance refresh guarantee?
- At least 90% of capacity stays healthy and serving traffic while instances are replaced in batches
- The refresh aborts and rolls back if fewer than 90% of the instances are already running the new AMI
- Only 90% of the group's instances are ever replaced, leaving the remaining 10% on the old template version
- Average CPU utilization across the group must stay under 90% for the refresh to proceed