Topic 26

The Scheduler

Scheduling · Placement

The scheduler decides which node each new Pod runs on. It works in two phases — filter the nodes that could host the Pod, then score the survivors and pick the best — and you steer it with affinity, taints, tolerations, and spread constraints.

Most of the time the scheduler just works and you never think about it. When you need replicas spread across zones, Pods kept off certain nodes, or workloads pinned to specialized hardware, these are the controls.

Filter, Then Score

How the scheduler picks a node

All nodes → Filter (drop nodes that can't fit: resources, taints, selectors) → Feasible nodes → Score (rank by spread, balance, affinity) → Bind (the highest scorer)

Scheduling runs in two passes. Filtering eliminates nodes that cannot run the Pod: too little unreserved allocatable capacity to cover the Pod's resource requests, a taint the Pod doesn't tolerate, or a node selector that doesn't match. Scoring ranks the remaining feasible nodes by spreading, resource balance, and affinity preferences, and the Pod is bound to the highest scorer. If filtering leaves no node, the Pod stays Pending, which, with the Cluster Autoscaler, is the signal to add a node (Topic 28).
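As a minimal sketch of what the filter pass checks for resource fit, consider the hypothetical Pod below (the name, image, and sizes are illustrative and not part of the original lesson). The scheduler compares the container's requests, not its limits, against what each node still has available.

```yaml
# Hypothetical Pod; name, image, and request sizes are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: web-example
spec:
  containers:
    - name: web
      image: nginx:1.27
      resources:
        requests:
          cpu: "500m"      # a node needs at least this much unreserved CPU to pass filtering
          memory: "256Mi"  # and at least this much unreserved memory
```

If no node can cover both requests, the Pod stays Pending until capacity frees up or a node is added.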

Node Affinity and Selectors

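nodeSelector is the simple form: run only on nodes that carry these labels. Node affinity is the expressive form, with requiredDuringSchedulingIgnoredDuringExecution (a hard filter) and preferredDuringSchedulingIgnoredDuringExecution (a soft preference that influences scoring). As the "IgnoredDuringExecution" suffix indicates, both apply only at scheduling time; a Pod that is already running is not evicted if the node's labels change later. Use these to pin GPU workloads to GPU nodes, keep workloads in a region, or prefer cheaper node types when available.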
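The two forms of node affinity can be combined in one Pod spec. The fragment below is a sketch under assumptions: the label key example.com/accelerator and the instance type value are hypothetical placeholders, so substitute the labels your nodes actually carry.

```yaml
spec:
  affinity:
    nodeAffinity:
      # Hard filter: only nodes with this label are feasible.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: example.com/accelerator   # hypothetical label key
                operator: In
                values: ["gpu"]
      # Soft preference: matching nodes score higher but are not mandatory.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50                           # 1-100; higher means stronger preference
          preference:
            matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values: ["example-cheap-type"] # hypothetical instance type
```

The required block alone is roughly equivalent to a nodeSelector with example.com/accelerator: gpu; the affinity form is worth the extra lines when you need operators such as In or NotIn, or a soft preference.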

Pod Affinity and Anti-Affinity

Where node affinity is about node labels, pod affinity is about other Pods. Pod affinity co-locates Pods (place this cache near that app); pod anti-affinity spreads them apart (never put two replicas of this database on the same node, or in the same zone). Anti-affinity is the classic tool for availability — but required anti-affinity at large scale is computationally expensive, so prefer the soft (preferred) form or topology spread for big deployments.

Spread replicas across nodes with anti-affinity

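spec:
  affinity:
    podAntiAffinity:
      # Hard rule: if no node satisfies it, the Pod stays Pending.
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: web   # avoid nodes already running a Pod with this label
          # The node hostname is the topology domain, so the rule applies per node.
          topologyKey: kubernetes.io/hostname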
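For larger deployments, the soft form mentioned above might look like the following sketch. It reuses the app: web label from the example and switches the topology key to the zone label as an illustration; the weight of 100 is an arbitrary choice.

```yaml
spec:
  affinity:
    podAntiAffinity:
      # Soft rule: the scheduler tries to honor it but will still place the Pod if it cannot.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100                  # 1-100; illustrative value
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: web
            topologyKey: topology.kubernetes.io/zone   # prefer a zone with no matching Pod
```

Because the rule only affects scoring, an extra replica is never left Pending just because every zone already hosts one.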

Taints, Tolerations, and Spread

Taints repel Pods from a node unless the Pod carries a matching toleration — the inverse of affinity. Control-plane nodes are tainted so ordinary workloads stay off; GPU nodes are often tainted so only GPU workloads land there. Separately, topology spread constraints are the modern, declarative way to distribute Pods evenly across zones or nodes, with a maxSkew that bounds imbalance — cleaner than hand-rolled anti-affinity for the common "spread my replicas" case.
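A topology spread constraint for the "spread my replicas" case can be sketched as follows. The app: web label is carried over from the anti-affinity example as an assumption about how the replicas are labeled.

```yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1                                # zones may differ by at most one matching Pod
      topologyKey: topology.kubernetes.io/zone  # spread across zones; use kubernetes.io/hostname for nodes
      whenUnsatisfiable: DoNotSchedule          # hard; ScheduleAnyway makes it a soft preference
      labelSelector:
        matchLabels:
          app: web                              # which Pods are counted toward the skew
```

With DoNotSchedule the constraint acts in the filter pass, so an unsatisfiable constraint leaves the Pod Pending; ScheduleAnyway moves it to the scoring pass instead.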

Affinity vs taints/tolerations

Affinity / selectors — Pods are attracted to nodes (or other Pods) that match. The Pod opts in.

Taints / tolerations — Nodes repel Pods that don't tolerate the taint. The node opts out unless allowed.
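The two sides of a taint and its toleration can be sketched as a pair of fragments. The key example.com/gpu is a hypothetical name chosen for illustration; the first fragment is part of a Node object and the second is part of a Pod spec.

```yaml
# Node side (fragment of a Node object): repel Pods that lack a matching toleration.
spec:
  taints:
    - key: example.com/gpu      # hypothetical taint key
      value: "true"
      effect: NoSchedule
---
# Pod side (fragment of a Pod spec): permit scheduling onto nodes with that taint.
spec:
  tolerations:
    - key: example.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
```

A toleration only allows the Pod onto the tainted node; it does not attract it there. To pin a workload to those nodes, pair the toleration with a nodeSelector or required node affinity.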

Common Mistakes

Over-constraining with required rules so Pods can't be scheduled and sit Pending.
Using required pod anti-affinity at large scale, where it becomes expensive — prefer preferred or topology spread.
Tainting nodes without giving the intended workloads tolerations, stranding them.
Ignoring topology spread, so all replicas land in one zone and a zone failure takes the service down.
Confusing affinity (attract) with taints (repel) and applying the wrong one.

Best Practices

Use topology spread constraints for the common "distribute replicas across zones/nodes" need.
Prefer soft (preferred) affinity rules unless a constraint is truly mandatory.
Taint specialized nodes (GPU, control-plane) and add matching tolerations to the workloads that belong there.
Treat a persistently Pending Pod as either over-constrained scheduling or a signal to scale nodes.
Reserve required anti-affinity for small, critical deployments where the cost is acceptable.

Related

Requests and limits: what filtering checks for fit (Topic 25)
Cluster Autoscaler: reacts to Pending Pods by adding nodes (Topic 28)
DaemonSets: tolerations let them reach tainted nodes (Topic 13)

Knowledge Check

What are the two phases of scheduling?

Filtering (which nodes can host the Pod) then scoring (which feasible node is best)
Requesting (reserve resources) then limiting (cap them)
Tainting candidate nodes first, then matching tolerations on each Pod before placement
Binding the Pod then mounting its volumes

What is the difference between node affinity and a taint?

Affinity attracts a Pod to matching nodes; a taint repels Pods unless they carry a matching toleration
They are simply two names for one mechanism, and the scheduler honors whichever syntax you happen to write
Affinity repels Pods from nodes while a taint attracts them
Node affinity targets Pods while taints are applied to Services

What is the cleanest way to spread replicas evenly across zones?

Topology spread constraints with a maxSkew
Required pod affinity keyed on the zone label
Tainting every node in each of the zones
A nodeSelector pinning each replica to a zone
