Topic 71

Overlays and SDN — VXLAN

Overlays

Modern datacenters and clouds run on overlay networks: virtual L2 or L3 networks built on top of the physical fabric by encapsulation. The physical network — the underlay — just moves IP packets between switches; the overlay wraps each tenant's frames in an outer header so two virtual machines believe they share a LAN even when they sit in different racks, or different rooms. The dominant encoding is VXLAN, which tunnels an entire Ethernet frame inside a UDP datagram so a layer-2 segment can stretch across a routed layer-3 backbone.

Software-Defined Networking is the broader shift that makes overlays manageable: separate the control plane (the decisions about where traffic goes) from the data plane that actually forwards it, and let a central controller program the whole fabric. The cloud VPC from the previous topic is itself an overlay of this kind that you never see, whether the provider uses VXLAN or its own encapsulation; the provider's SDN controller pushes the tunnels and routes that make your 10.0.0.0/16 feel like a private wire. The cost of all this indirection is an encapsulation tax: extra header bytes, an MTU squeeze, and packet captures that show you the wrapper instead of the contents.

The Case for Overlays

The physical network cannot give you two things modern infrastructure demands: multi-tenancy and mobility. A flat VLAN-based fabric maxes out at 4094 segments (the VLAN ID is 12 bits), which a single large cloud region blows through in one building — and it pins a workload's L2 identity to the physical switch port it plugs into, so live-migrating a VM to another rack would change its network. Overlays cut that knot by making the virtual network independent of the physical one underneath.

Because the overlay rides on top of plain IP routing, a tenant's segment can span any underlay the packets can reach, and a VM keeps its IP and MAC when it moves because that identity lives in the encapsulated inner frame, not in any physical switch's table. The underlay only has to route the outer packets between the hosts running the tunnel endpoints. This decoupling is the whole point: the physical network becomes a dumb, fast transport, and all the per-tenant complexity moves into software at the edges.

Figure: VXLAN nests the inner frame inside outer headers (outermost first)

Outer Ethernet + IP: underlay delivery between VTEPs
UDP, dst port 4789: VXLAN rides UDP over routed L3
VXLAN header, VNI 5000: 24-bit segment id, 16M domains
Inner Ethernet frame: the original tenant payload, MTU now 1450

VXLAN — L2 over UDP

VXLAN wraps an inner Ethernet frame in a chain of outer headers: an 8-byte VXLAN header, a UDP header (destination port 4789), an outer IP header, and an outer Ethernet header — roughly 50 bytes of overhead on every packet. The VXLAN header carries a 24-bit VNI (VXLAN Network Identifier), which is the tenant or segment tag. The endpoints that wrap and unwrap are VTEPs (VXLAN Tunnel Endpoints), usually a virtual switch in each hypervisor or node.
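As a rough check on the byte counts above, the sketch below adds up the outer headers and prints the 8-byte VXLAN header for a given VNI as hex. It assumes an untagged outer Ethernet header and an IPv4 outer header without options; the function names are illustrative only, and the header layout (flags byte 0x08, three reserved bytes, the 24-bit VNI, one reserved byte) is the one defined in RFC 7348.

```shell
#!/bin/sh
# Sketch only: sizes assume untagged outer Ethernet and IPv4 without options.

# Sum the outer headers VXLAN places in front of the inner frame.
vxlan_overhead() {
    outer_eth=14   # outer Ethernet header
    outer_ip=20    # outer IPv4 header (an IPv6 underlay would use 40)
    udp=8          # UDP header, destination port 4789
    vxlan=8        # VXLAN header
    echo $((outer_eth + outer_ip + udp + vxlan))
}

# Print the 8-byte VXLAN header for a VNI as 16 hex digits:
# flags 0x08 (VNI present), 3 reserved bytes, 3-byte VNI, 1 reserved byte.
vxlan_header_hex() {
    printf '08000000%06x00\n' "$1"
}

vxlan_overhead          # 50
vxlan_header_hex 5000   # 0800000000138800
```

VNI 5000 is 0x001388, which is why those six hex digits appear just before the final reserved byte.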

That 24-bit VNI is the headline number: it gives 16,777,216 segments versus the VLAN's 4094, roughly 4,000 times as many isolation domains, which is what a multi-tenant cloud actually needs. A frame from VM-A on VNI 5000 reaches VM-B only if VM-B's VTEP is also on VNI 5000; the VNI is the boundary. You can watch a tunnel on a Linux host with ip -d link show, which prints the VNI, the UDP port, and the local VTEP address.

# inspect a VXLAN interface: VNI 5000, UDP/4789, this host's VTEP
ip -d link show vxlan0
# vxlan0: <BROADCAST,MULTICAST,UP> mtu 1450 ...        <- note: 1450, not 1500
#   vxlan id 5000 local 10.0.3.11 dev eth0 srcport 0 0 dstport 4789
# the 50-byte encap is why mtu shows 1450 over a 1500-byte underlay
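The segment counts quoted for VLAN and VXLAN follow directly from the identifier widths. A small arithmetic sketch, with illustrative variable names, makes the comparison concrete:

```shell
#!/bin/sh
# Segment counts implied by the identifier widths.
vlan_segments=$(( (1 << 12) - 2 ))   # 12-bit VLAN ID; IDs 0 and 4095 are reserved
vni_segments=$(( 1 << 24 ))          # 24-bit VNI

echo "VLAN segments:  $vlan_segments"
echo "VXLAN segments: $vni_segments"
echo "ratio (integer division): $(( vni_segments / vlan_segments ))"
```

The ratio comes out at 4098, so the jump from a 12-bit tag to a 24-bit VNI is a factor of about four thousand.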

Control Plane versus Data Plane

A VTEP that wants to forward a frame on VNI 5000 needs to know which remote VTEP holds the destination MAC. Answering that question is the control plane; actually encapsulating and shipping the packet is the data plane. Early VXLAN flooded unknown destinations over IP multicast and learned MAC-to-VTEP mappings from the flood — a data-plane-learning scheme that scales badly. The SDN approach replaces it with a controller (or a BGP EVPN control plane) that distributes the full MAC-to-VTEP map directly, so no flooding is needed.
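To make the MAC-to-VTEP map concrete, here is a minimal sketch of the lookup a VTEP performs once a control plane has handed it the table. The MAC addresses, remote VTEP IPs, interface name vxlan0, and function names are all hypothetical. The second function only prints the bridge fdb commands an agent might run on a Linux VTEP; it does not execute them.

```shell
#!/bin/sh
# Hypothetical MAC-to-VTEP map for VNI 5000, as a control plane might distribute it.
# One entry per line: <inner destination MAC> <remote VTEP IP>. All values are made up.
FDB='52:54:00:aa:bb:01 10.0.3.12
52:54:00:aa:bb:02 10.0.3.13'

# Data-plane lookup: print the remote VTEP for a destination MAC, or "unknown"
# when no mapping exists (the case flood-and-learn would handle by flooding).
lookup_vtep() {
    printf '%s\n' "$FDB" | awk -v mac="$1" '
        $1 == mac { print $2; found = 1; exit }
        END       { if (!found) print "unknown" }'
}

# Print (not run) the static entries an agent could install on a Linux VTEP.
# Assumes a VXLAN interface named vxlan0; applying them would need root.
print_fdb_commands() {
    printf '%s\n' "$FDB" | while read -r mac vtep; do
        echo "bridge fdb add $mac dev vxlan0 dst $vtep"
    done
}

lookup_vtep 52:54:00:aa:bb:02   # 10.0.3.13
lookup_vtep 52:54:00:ff:ff:ff   # unknown
print_fdb_commands
```

With the table pushed ahead of time, the "unknown" branch becomes the exception rather than the normal way of learning, which is the scaling difference the paragraph above describes.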

This separation is the core idea of software-defined networking, and it is exactly the split you saw with the routing protocols in chapter 4: the control plane computes the forwarding state, the data plane applies it at line rate. Centralizing the control plane in a controller buys you one place to program policy across thousands of nodes — and creates one thing whose failure or partition you must engineer around. A controller outage usually leaves existing flows forwarding (the data plane keeps its programmed state) but freezes new mappings, so the danger is a controller that is a single point of failure for change, not for steady-state traffic.

The Encapsulation Tax

Every overlay charges rent in three currencies. The first is MTU: those ~50 bytes of VXLAN headers come out of the payload, so an underlay with a 1500-byte MTU leaves only 1450 bytes for the inner frame. If a tenant VM still believes it has a full 1500-byte MTU and emits a 1500-byte packet with DF set, the VTEP cannot fit the encapsulated result into one underlay packet. If PMTUD is also broken by a firewall eating the ICMP "fragmentation needed" message (the black hole from chapters 2 and 12), the packet vanishes silently: small connections work, large transfers hang. The fix is to either lower the inner MTU to 1450 or raise the underlay to jumbo frames (9000 bytes).
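The MTU arithmetic can be sketched in a few lines. The example below assumes the ~50-byte overhead of an IPv4 underlay; the function names and the peer address 10.0.0.20 are placeholders. The last function only prints a DF-set ping for probing the path from inside a tenant VM; it uses the Linux iputils -M do option and may need adapting on other systems.

```shell
#!/bin/sh
# Sketch only: assumes the ~50-byte VXLAN overhead of an IPv4 underlay.
VXLAN_OVERHEAD=50

# Largest inner MTU that still fits once the VTEP adds its headers.
inner_mtu() {
    echo $(( $1 - VXLAN_OVERHEAD ))
}

# Report whether an inner packet of a given size survives encapsulation.
# Usage: fits <inner packet bytes> <underlay MTU>
fits() {
    if [ "$1" -le "$(inner_mtu "$2")" ]; then
        echo "fits"
    else
        echo "too big"
    fi
}

# Print a DF-set ping that probes an inner MTU from inside a tenant VM.
# ICMP payload = MTU - 20 (IPv4 header) - 8 (ICMP header); the peer is a placeholder.
probe_command() {
    echo "ping -M do -c 3 -s $(( $1 - 28 )) $2"
}

inner_mtu 1500                 # 1450
inner_mtu 9000                 # 8950
fits 1500 1500                 # too big
fits 1500 9000                 # fits
probe_command 1450 10.0.0.20   # ping -M do -c 3 -s 1422 10.0.0.20
```

If the printed probe succeeds at 1422 bytes of payload but fails one byte higher, the path is behaving like a 1450-byte inner MTU.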

The second cost is visibility: a packet capture on the underlay sees the outer VXLAN/UDP headers and the VNI, not the inner conversation, so your old tcpdump filters match nothing useful until you decode the overlay. The third is a misplaced trust in reliability — the overlay is only as available as the underlay carrying it, and an underlay routing flap or congestion event shows up as inexplicable overlay packet loss. Always diagnose the underlay before blaming the virtual network sitting on top of it.
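One way to regain some visibility on the underlay is to filter on the VXLAN header itself. The sketch below builds a tcpdump/libpcap filter expression that matches a single VNI. It assumes an IPv4 underlay (libpcap's udp[...] indexing is documented for IPv4) and the standard port 4789; the function name and the interface eth0 are placeholders.

```shell
#!/bin/sh
# Build a tcpdump/libpcap filter that matches VXLAN traffic for one VNI.
# Sketch only: assumes an IPv4 underlay and the standard port 4789.
# udp[12:4] reads the 4 bytes at offset 12 from the start of the UDP header:
# the 3-byte VNI followed by 1 reserved byte, so the mask drops the low byte.
vni_filter() {
    printf 'udp dst port 4789 and udp[12:4] & 0xffffff00 = 0x%06x00\n' "$1"
}

vni_filter 5000
# udp dst port 4789 and udp[12:4] & 0xffffff00 = 0x00138800

# Example use on the underlay interface (eth0 is a placeholder; needs root):
#   tcpdump -nn -i eth0 "$(vni_filter 5000)"
```

This narrows the capture to one tenant segment; reading the inner conversation still depends on the capture tool decoding the VXLAN payload, which tcpdump builds that recognize port 4789 generally do.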

VLAN vs VXLAN

VLAN — a 12-bit tag in the Ethernet header, 4094 usable segments, carried on the physical L2 wire with no encapsulation and no MTU cost. Choose it for a single-tenant fabric within one switched domain where a few thousand segments is plenty and you want zero overhead.

VXLAN — a 24-bit VNI, 16,777,216 segments, tunneling L2 frames inside UDP over a routed L3 underlay, at roughly 50 bytes of per-packet overhead. Choose it for multi-tenant scale, for stretching a segment across racks or regions, and for VM mobility the physical network can't provide — accepting the MTU and visibility tax.

Common Mistakes

Leaving the inner MTU at 1500 over a 1500-byte underlay. A full-size packet plus the ~50-byte VXLAN overhead no longer fits, and with PMTUD blocked (chapters 2 and 12) large packets black-hole: small requests succeed while bulk transfers hang, the worst failure to diagnose.
Trying to capture inner traffic with old L2 filters on the underlay. tcpdump there sees only outer UDP/4789 and the VNI; you must capture at the VTEP or decode VXLAN to see the conversation inside, or you conclude "no traffic" when there is plenty.
Trusting the overlay's apparent health without checking the underlay. Overlay packet loss is frequently an underlay routing flap or congestion event; debugging the virtual network while the physical one is dropping packets wastes hours.
Treating the SDN controller as free of failure modes. A controller that goes down or partitions freezes new MAC/route programming across the whole fabric; existing flows keep forwarding, so the outage hides until something needs to change.
Assuming a VNI provides security isolation by itself. The VNI separates segments, but anyone who can inject onto the underlay or reach a VTEP can cross tenants — isolation still needs underlay access control and, for real confidentiality, encryption.

Best Practices

Set the underlay to jumbo frames (9000 bytes) wherever you control the fabric, so the full 1500-byte inner MTU survives encapsulation with room to spare and you never fight the overlay MTU squeeze.
Run a BGP EVPN or controller-based control plane instead of multicast flood-and-learn, so MAC-to-VTEP mappings are distributed directly and the fabric scales past the point where flooding melts down.
Capture at the VTEP and teach your tools to decode VXLAN before an incident, so when you need to see inner traffic you are reading the conversation, not staring at outer UDP headers.
Monitor underlay health — link utilization, routing convergence, loss — as a first-class signal, because overlay symptoms almost always trace to the physical network carrying the tunnels.
Design the controller for redundancy and verify the data plane keeps forwarding through a control-plane outage, so a controller failure degrades change management rather than dropping live traffic.

Comparable concepts

GRE / Geneve encapsulation
Cloud VPC (an overlay you don't see)
BGP EVPN control plane

Knowledge Check

What does VXLAN's 24-bit VNI buy you over a VLAN's 12-bit tag?

About 16 million segments instead of 4094, by tunneling L2 frames over a routed L3 underlay
Lower per-packet latency, because UDP encapsulation forwards measurably faster than ordinary tagged frames
Built-in payload encryption for every tenant, since the frame is wrapped in UDP
Zero encapsulation overhead while still spanning a routed backbone

A controller goes down in an SDN fabric. What is the likely immediate effect?

Existing flows keep forwarding on programmed state, but new mappings and policy changes freeze
All traffic stops immediately, because the data plane has to query the controller for every single packet
Each packet is rerouted through the controller, adding a round-trip of latency
The underlay MTU shrinks, so large overlay packets begin to fragment

Large transfers hang on a VXLAN overlay while small requests work. Most likely cause?

The inner MTU was left at 1500, so encapsulated full-size packets black-hole with PMTUD broken
The two endpoints are configured on different VNIs, so the overlay segment never actually connects them
The UDP port 4789 used by VXLAN is being rate-limited by the underlay
The SDN controller is down, so only short-lived flows get programmed
