VM Scale Sets


Compute · Autoscale

A Virtual Machine Scale Set manages a group of identical VMs as a single elastic, self-healing resource. You define the VM configuration once; the scale set adds and removes instances to follow your autoscale rules and, when automatic instance repairs are enabled, replaces any instance that fails a health check. It is how you run VM-based capacity that grows and shrinks with load.

Scale sets are also the substrate underneath higher-level services — AKS node pools are scale sets, and so is much of Azure's own fleet management. If your workload runs on VMs but needs to scale, you almost never manage individual VMs; you manage a scale set.

Orchestration Modes

A scale set runs in one of two orchestration modes, fixed at creation. Flexible orchestration is the modern default: it spreads instances across Availability Zones and fault domains, supports mixing VM sizes, and treats each VM as a standard, individually addressable resource. Uniform orchestration is the older model, optimized for large fleets of strictly identical instances managed as a single unit.

For new workloads, Flexible is the right choice in almost all cases — it gives zone resilience and the flexibility to mix instance types, and it is where Azure is investing. Uniform remains only for very large, homogeneous fleets that depend on its specific behavior.

Autoscale

Autoscale rules add and remove instances based on metrics — CPU, memory, queue depth, or a custom metric — or on a schedule. A rule has a trigger, a magnitude (add two instances, or increase by 20%), and a cooldown that prevents the set from thrashing while a scale action takes effect.
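The three parts of a rule can be illustrated with a minimal sketch. This is not the Azure autoscale engine or its SDK; the `ScaleRule` class, the `evaluate` function, and the numbers used are hypothetical names and values chosen for illustration, and only a scale-out rule is modeled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ScaleRule:
    """Hypothetical scale-out rule: trigger, magnitude, cooldown."""
    threshold: float   # trigger: fire when the metric rises above this value
    change: int        # magnitude: number of instances to add
    cooldown_s: int    # seconds to wait after a scale action before acting again


def evaluate(rule: ScaleRule, metric: float, count: int, now_s: int,
             last_action_s: Optional[int], minimum: int,
             maximum: int) -> Tuple[int, Optional[int]]:
    """Return (instance count, time of last scale action) after one evaluation."""
    in_cooldown = (last_action_s is not None
                   and now_s - last_action_s < rule.cooldown_s)
    if in_cooldown or metric <= rule.threshold:
        return count, last_action_s          # nothing to do this round
    new_count = max(minimum, min(maximum, count + rule.change))
    if new_count == count:
        return count, last_action_s          # already at the bound
    return new_count, now_s                  # scale action taken, cooldown restarts


rule = ScaleRule(threshold=70.0, change=2, cooldown_s=300)
count, last = evaluate(rule, 85.0, 4, 0, None, 2, 10)        # fires: 4 -> 6
count, last = evaluate(rule, 90.0, count, 120, last, 2, 10)  # in cooldown: stays 6
count, last = evaluate(rule, 90.0, count, 300, last, 2, 10)  # cooldown over: 6 -> 8
```

The second call shows what the cooldown is for: the metric is still high, but the instances added by the first action have not had time to absorb load, so the rule holds instead of adding more.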

Schedule-based scaling handles predictable load: scaling out every weekday morning before traffic arrives beats waiting for a CPU threshold to trip after users are already queued. Metric and schedule rules combine — a baseline schedule plus a metric rule for the unexpected spike.
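One way to picture the combination is a schedule that raises the floor during business hours while the metric rule is free to go above it. The sketch below is a simplification under stated assumptions: the constants, the 07:00 to 19:00 weekday window, and the function names are all hypothetical, and it does not reproduce how Azure autoscale profiles are actually evaluated.

```python
from datetime import datetime

# Hypothetical bounds chosen for illustration only.
BUSINESS_MIN = 6    # weekday floor, in place before traffic arrives
OFF_HOURS_MIN = 2   # floor for nights and weekends
MAXIMUM = 20        # hard ceiling for the scale set


def scheduled_minimum(now: datetime) -> int:
    """Schedule rule: return the instance floor for this point in the week."""
    is_weekday = now.weekday() < 5          # Monday is 0, Friday is 4
    in_window = 7 <= now.hour < 19
    return BUSINESS_MIN if is_weekday and in_window else OFF_HOURS_MIN


def target_count(now: datetime, metric_driven_count: int) -> int:
    """Combine both rules: the metric decides, within the scheduled floor and the ceiling."""
    floor = scheduled_minimum(now)
    return max(floor, min(MAXIMUM, metric_driven_count))


monday_morning = datetime(2024, 1, 8, 8, 0)
quiet = target_count(monday_morning, 3)    # schedule holds the floor at 6
spike = target_count(monday_morning, 12)   # metric rule lifts capacity to 12
```

The schedule covers the load you can predict, and the metric-driven count only matters when demand exceeds that baseline.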

Autoscale — Instance Count Follows Traffic, With a Lag
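[Figure: traffic and instance count plotted over time. The instance count is stepped, lags the traffic curve, and stays between the min and max bounds.]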
Autoscale adds instances after a metric crosses its threshold and a cooldown elapses, so capacity steps up behind the traffic curve and steps back down as it falls — bounded by the minimum and maximum you set.
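The lag can be reproduced with a small tick-based simulation. Everything here is an assumption made for illustration: the `simulate` function, the per-instance capacity of 100 requests per tick, the 70% and 30% utilization thresholds, and the two-tick cooldown are not Azure defaults.

```python
from typing import List


def simulate(traffic: List[int], per_instance: int = 100, minimum: int = 2,
             maximum: int = 8, step: int = 2, cooldown: int = 2) -> List[int]:
    """Return the instance count serving each tick of the traffic series."""
    count = minimum
    ticks_since_action = cooldown        # start outside any cooldown
    history = []
    for load in traffic:
        history.append(count)            # capacity available during this tick
        utilization = load / (count * per_instance)
        ticks_since_action += 1
        if ticks_since_action <= cooldown:
            continue                     # last scale action is still settling
        if utilization > 0.7 and count < maximum:
            count = min(maximum, count + step)
            ticks_since_action = 0
        elif utilization < 0.3 and count > minimum:
            count = max(minimum, count - step)
            ticks_since_action = 0
    return history


traffic = [100, 300, 600, 600, 600, 600, 100, 100, 100, 100]
print(simulate(traffic))  # [2, 2, 4, 4, 4, 6, 6, 6, 4, 4]
```

Traffic jumps at the second tick, but capacity only reaches six instances four ticks later, and it is still at six for two ticks after traffic has fallen. That is the stepped, trailing shape the caption describes.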

Upgrade Policies

When you change the model (a new image or a new configuration), the upgrade policy decides how instances move to it. Manual leaves existing instances untouched until you act. Automatic upgrades them as soon as the model changes, in no guaranteed order. Rolling upgrades them in batches, checking instance health between batches, with an optional surge of extra instances so capacity does not drop while a batch is being replaced.

Rolling upgrades with a health probe are the safe default for production: a batch is replaced, the probe confirms the new instances are healthy, and only then does the next batch proceed. Without a health probe, a rolling upgrade will happily replace every instance with a broken build.
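The batch-then-verify loop can be sketched in a few lines. This is an illustration of the control flow only, not the platform's implementation; `rolling_upgrade`, `upgrade`, and `is_healthy` are hypothetical names, with `is_healthy` standing in for the health probe.

```python
from typing import Callable, List


def rolling_upgrade(instances: List[str], batch_size: int,
                    upgrade: Callable[[str], None],
                    is_healthy: Callable[[str], bool]) -> List[str]:
    """Upgrade in batches and halt after the first unhealthy batch.

    Returns the instances that were upgraded before the rollout stopped.
    """
    upgraded = []
    for start in range(0, len(instances), batch_size):
        batch = instances[start:start + batch_size]
        for name in batch:
            upgrade(name)
        upgraded.extend(batch)
        # The probe gates the next batch: one failure stops the rollout.
        if not all(is_healthy(name) for name in batch):
            break
    return upgraded


fleet = [f"vm-{i}" for i in range(6)]

# A broken build: every upgraded instance fails the probe.
touched = rolling_upgrade(fleet, 2, upgrade=lambda n: None,
                          is_healthy=lambda n: False)
print(touched)  # ['vm-0', 'vm-1']
```

With the probe in place, the broken build reaches two instances and stops. Replace `is_healthy` with a function that always returns `True`, which is effectively what running without a probe means, and the same loop upgrades all six.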

Zones and Fault Domains

A Flexible scale set spreads instances across the Availability Zones you select and across fault domains within each zone. This is the mechanism that turns a fleet of VMs into a resilient one: a scale set spread evenly across three zones keeps serving when one zone fails, because the two surviving zones still hold roughly two-thirds of the capacity.
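The arithmetic behind zone resilience is simple enough to check directly. The sketch below assumes an even spread; `spread` and `surviving_fraction` are hypothetical helpers, not part of any Azure API, and real placement is decided by the platform.

```python
from typing import Dict, List


def spread(count: int, zones: List[str]) -> Dict[str, int]:
    """Distribute instances as evenly as possible across the given zones."""
    base, extra = divmod(count, len(zones))
    return {zone: base + (1 if i < extra else 0)
            for i, zone in enumerate(zones)}


def surviving_fraction(placement: Dict[str, int], failed_zone: str) -> float:
    """Fraction of total capacity still running after one zone is lost."""
    total = sum(placement.values())
    return (total - placement[failed_zone]) / total


three_zones = spread(9, ["1", "2", "3"])      # {'1': 3, '2': 3, '3': 3}
one_zone = spread(9, ["1"])                   # {'1': 9}

print(surviving_fraction(three_zones, "1"))   # two-thirds of capacity remains
print(surviving_fraction(one_zone, "1"))      # 0.0, nothing remains
```

If the workload needs full capacity through a zone failure, the minimum instance count has to be sized so that the surviving zones alone can carry peak load.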

Flexible vs Uniform Orchestration

Flexible — The current default. Zone-aware, supports mixed VM sizes, each instance is a standard VM resource. Choose it for essentially all new workloads.

Uniform — The legacy mode for very large fleets of identical instances managed as one unit. Choose it only when an existing workload depends on its specific behavior.

Common Mistakes
  • Creating a Uniform scale set for a new workload — Flexible is the default for a reason, and orchestration mode cannot be changed after creation.
  • Autoscaling on CPU when the real bottleneck is memory or a downstream queue — the set scales on the wrong signal and never relieves the actual pressure.
  • Running rolling upgrades with no health probe — a bad image rolls out to the entire fleet because nothing checks whether the new instances work.
  • Setting an aggressive scale-in rule with no connection draining — instances are removed mid-request, dropping in-flight work.
  • Deploying a single-zone scale set and calling it highly available — all the elasticity in the world does not survive that one zone failing.
  • Forgetting a cooldown period, so the set adds and removes instances in a thrashing loop while each scale action is still settling.
Best Practices
  • Use Flexible orchestration for all new scale sets and select multiple Availability Zones.
  • Attach a health probe and use a rolling upgrade policy so bad deployments are caught after the first batch.
  • Combine a schedule rule for predictable load with a metric rule for spikes, and set a cooldown to prevent thrashing.
  • Scale on the metric that reflects the real bottleneck — queue depth or memory, not CPU by reflex.
  • Enable connection draining (via the load balancer) so scale-in waits for in-flight requests to finish.
  • Bake an immutable image and publish it through Azure Compute Gallery so every instance the set creates is identical and fast to start.
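
The connection-draining practice in the list above comes down to a wait loop with a timeout. The following sketch models it with simulated ticks. The `drain` function and its inputs are hypothetical and stand in for whatever the load balancer reports; this is not a load balancer API.

```python
from typing import Iterable, Tuple


def drain(in_flight_per_tick: Iterable[int], timeout_ticks: int) -> Tuple[int, int]:
    """Wait for in-flight requests to finish before removing an instance.

    in_flight_per_tick: request counts observed on each tick after the
    instance has stopped receiving new connections.
    Returns (ticks waited, requests still in flight at removal).
    """
    waited = 0
    remaining = 0
    for remaining in in_flight_per_tick:
        if remaining == 0 or waited >= timeout_ticks:
            break                      # drained, or out of patience
        waited += 1
    return waited, remaining


observed = [5, 3, 1, 0]
print(drain(observed, timeout_ticks=10))  # (3, 0): waited, nothing dropped
print(drain(observed, timeout_ticks=0))   # (0, 5): no draining, 5 requests dropped
```

The timeout is the trade-off to tune: too short and long requests are still cut off, too long and scale-in is slow to release capacity.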
Comparable services: AWS EC2 Auto Scaling Groups, GCP Managed Instance Groups

Knowledge Check

Why is Flexible the recommended orchestration mode for a new scale set?

  • It is zone-aware, supports mixed VM sizes, and treats each instance as a standard VM resource
  • It is the only orchestration mode that supports metric-based autoscaling
  • It is cheaper per instance to run than the Uniform mode
  • It alone allows switching the orchestration mode later, after the scale set has already been created

What does a health probe protect against during a rolling upgrade?

  • Rolling a broken image out to the entire fleet — the upgrade halts if a new batch is unhealthy
  • Instances being mistakenly placed in the same Availability Zone
  • Autoscale adding far too many new instances at once
  • A region-wide outage taking down every Availability Zone at once during the rolling upgrade window

A scale-in rule removes instances the moment CPU drops. In-flight requests are being dropped. What is missing?

  • Connection draining, so scale-in waits for active requests to complete
  • A larger cooldown on the scale-out rule
  • A second Availability Zone
  • A switch from Flexible to Uniform orchestration mode on the whole scale set
