Path MTU Discovery Failures
Topic 68

This is the cruelest bug in networking: small requests work, large ones hang forever. SSH connects and then freezes the moment output scrolls. An HTTPS handshake completes and the page never loads. A ping with default size succeeds while a ping with a big payload gets nothing. The cause is almost always MTU — a packet too big for some link on the path — combined with a broken Path MTU Discovery, because a firewall somewhere is silently dropping the one ICMP message PMTUD depends on.

What makes it brutal is that the connection establishes. The SYN, the TLS handshake, the login: all small packets, all under the MTU, all fine. The instant the application sends a full-size segment, that packet hits a link too small to carry it, gets dropped, and no error comes back. The sender retransmits the same oversized packet until the connection times out, and the session black-holes. Everything looks healthy except that nothing large gets through.

The Symptom — Small Works, Large Hangs

The signature is unmistakable once you have seen it. Interactive login succeeds; bulk transfer stalls. A typing-only SSH session is fine until a command produces a screenful of output, then it freezes. curl to an API returns small JSON but hangs on a large response body. The connection is up — handshakes use small packets — but anything that fills a segment dies.

# prove it: a small ping works, a full-size one with DF set does not
ping -c2 -s 1400 -M do api.example.com
# 1408 bytes from 93.184.216.34: icmp_seq=1 ... time=19 ms   OK
ping -c2 -s 1472 -M do api.example.com
# 1472 + 8 (ICMP) + 20 (IPv4) = 1500 bytes on the wire, a full Ethernet-size packet
# silent loss, 0 received -> a link on the path can't carry it and no ICMP came back
# (with PMTUD healthy, ping would instead report "Frag needed and DF set")

The diagnostic is to binary-search the payload size with the don't-fragment bit set: find the largest payload that gets through, add 28 bytes of IPv4 and ICMP headers, and you have the path MTU. If 1400-byte payloads pass but 1472-byte payloads vanish, some link on the path has an MTU between 1428 and 1499, and PMTUD is not negotiating it down for you, which means the ICMP feedback is being lost. A 1473-byte payload proves nothing on standard Ethernet: it exceeds the local 1500-byte MTU and fails with "message too long" before it ever leaves the host.
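The binary search can be scripted. The following is a minimal sketch, assuming Linux iputils ping (for the -M do and -W flags) and assuming that success is monotonic in size, meaning every payload at or below the limit passes. The TARGET variable and the function names probe and find_max_payload are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# probe SIZE: succeeds if one DF-bit ping with SIZE payload bytes gets a reply.
# TARGET is a hypothetical variable holding the host under test.
probe() {
  ping -c1 -W1 -M do -s "$1" "$TARGET" >/dev/null 2>&1
}

# find_max_payload LO HI: print the largest payload size in [LO, HI]
# for which probe succeeds. Requires that LO itself gets through.
find_max_payload() {
  local lo=$1 hi=$2 mid
  probe "$lo" || { echo "payload of $lo bytes already fails" >&2; return 1; }
  while (( lo < hi )); do
    mid=$(( (lo + hi + 1) / 2 ))      # round up so the loop always makes progress
    if probe "$mid"; then
      lo=$mid                         # mid fits: the answer is mid or larger
    else
      hi=$(( mid - 1 ))               # mid is too big: the answer is smaller
    fi
  done
  echo "$lo"
}

# Usage (IPv4): path MTU = largest payload + 8 (ICMP) + 20 (IPv4)
#   TARGET=api.example.com
#   max=$(find_max_payload 1000 1472) && echo "path MTU: $(( max + 28 ))"
```

A single lost reply is treated as "too big" here, so on a lossy path it may be worth raising the ping count inside probe before trusting the result.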

Path MTU Discovery

PMTUD is the mechanism that should make this self-correcting. IPv4 hosts set the don't-fragment (DF) bit on packets, telling routers "do not fragment this — if it's too big, drop it and tell me." When a packet hits a link with a smaller MTU, the router drops it and sends back an ICMP type 3, code 4 message — "fragmentation needed" — carrying the MTU of the constricted link. The sender reads that, lowers its packet size for that destination, and retransmits.

When it works, you never notice — the sender silently adapts to the smallest link on the path within one round trip. The whole design hinges on that one ICMP message getting back to the sender. PMTUD does not probe or guess; it relies entirely on routers reporting the limit and that report reaching the source. Remove the report and the mechanism does not degrade gracefully — it fails completely and silently.

The ICMP Black Hole

The failure mode is the ICMP black hole: a firewall or router on the path blocks ICMP, including the "fragmentation needed" message PMTUD needs. Someone, reasoning that "ICMP is a security risk," blanket-drops all of it. Now the oversized packet still gets dropped at the small link, but the ICMP telling the sender to shrink never arrives. The sender, deaf to the feedback, keeps retransmitting the same too-large packet — a silent loop with no error anywhere.
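One way to check whether the feedback is reaching the sender is to capture on the sending host while reproducing the hang. This is a rough sketch using tcpdump with a standard pcap filter for ICMP type 3, code 4; the function names and the interface name eth0 are assumptions for illustration.

```shell
#!/usr/bin/env bash
# pcap filter: ICMP type 3 (destination unreachable), code 4 (fragmentation needed)
frag_needed_filter() {
  echo 'icmp[icmptype] = icmp-unreach and icmp[icmpcode] = 4'
}

# watch_frag_needed IFACE: print matching packets as they arrive (needs root).
watch_frag_needed() {
  tcpdump -ni "$1" "$(frag_needed_filter)"
}

# Usage: run this on the sender, then repeat the large transfer that hangs.
#   watch_frag_needed eth0
```

If the transfer stalls and the capture stays empty, the message is being dropped somewhere between the constricted link and the sender, which is consistent with a black hole.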

This is why blanket-blocking ICMP is the number-one cause of mysterious large-transfer hangs. The connection works for everything small and dies for everything large, and because no error surfaces, it gets misdiagnosed as an application bug, a server problem, or "the network being flaky." Tunnels make it worse: a VPN or overlay adds 50-plus bytes of outer header, pushing previously-fine packets over the limit, and if PMTUD is black-holed the tunnel works for logins and hangs on real traffic.

The Fixes — MSS Clamping and ICMP

The fix that survives a broken PMTUD is MSS clamping. During the TCP handshake each side advertises its maximum segment size; a router in the path can rewrite that advertised MSS downward so both ends agree, from the start, never to send a segment larger than the smallest link can carry. This sidesteps PMTUD entirely — no ICMP is needed, because the segment size is negotiated correctly at connection setup rather than discovered after a drop.

# clamp TCP MSS to the path MTU on a tunnel/gateway (iptables)
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
  -j TCPMSS --clamp-mss-to-pmtu
# or pin it explicitly for a known tunnel MTU of 1400:
# ... -j TCPMSS --set-mss 1360   (1400 - 40 bytes TCP/IP header)
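The 1360 figure above comes from simple header arithmetic, sketched here as a small helper. The function name mss_for_mtu is hypothetical, and the calculation assumes no IP options and counts only the 20-byte base TCP header; IPv6 uses a 40-byte IP header instead of 20.

```shell
#!/usr/bin/env bash
# mss_for_mtu MTU [FAMILY]: TCP MSS that fits in MTU, assuming no IP options.
# FAMILY is 4 (default, 20-byte IPv4 header) or 6 (40-byte IPv6 header).
mss_for_mtu() {
  local mtu=$1 family=${2:-4} ip_hdr
  case "$family" in
    4) ip_hdr=20 ;;
    6) ip_hdr=40 ;;
    *) echo "family must be 4 or 6" >&2; return 1 ;;
  esac
  echo $(( mtu - ip_hdr - 20 ))   # subtract the 20-byte base TCP header
}

# Usage:
#   mss_for_mtu 1400      # IPv4 over a 1400-byte tunnel MTU
#   mss_for_mtu 1400 6    # the same tunnel carrying IPv6
```

TCP options such as timestamps are carried inside the MSS rather than added on top of it, so the advertised value is computed from the base headers alone.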

The other fixes are to stop breaking PMTUD in the first place and to lower the MTU where you know it is small. Allow ICMP type 3, code 4 through every firewall — it is not the attack vector people imagine, and blocking it breaks large transfers across the board. On tunnels, set the interface MTU to account for the encapsulation overhead so the host knows the real limit. MSS clamping is the belt; allowing the ICMP and setting the tunnel MTU are the suspenders — use both, because clamping only covers TCP and ICMP-dependent PMTUD still serves UDP and everything else.
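The two non-clamping fixes might look like the sketch below. It assumes a Linux gateway with iproute2 and iptables; the function names are hypothetical, and the overhead value must match the actual encapsulation (50 bytes is the figure for VXLAN over IPv4). The rules are inserted with -I so they sit ahead of any existing drop rule, which is an assumption about how the ruleset is laid out.

```shell
#!/usr/bin/env bash
# tunnel_mtu UNDERLAY_MTU OVERHEAD: largest inner packet the tunnel can carry.
tunnel_mtu() {
  echo $(( $1 - $2 ))
}

# apply_tunnel_fixes DEV UNDERLAY_MTU OVERHEAD (needs root; defined, not invoked)
apply_tunnel_fixes() {
  local dev=$1 mtu
  mtu=$(tunnel_mtu "$2" "$3")
  # tell the host the real limit of the tunnel interface
  ip link set dev "$dev" mtu "$mtu"
  # let ICMP type 3, code 4 through for local and forwarded traffic
  iptables -I INPUT   -p icmp --icmp-type fragmentation-needed -j ACCEPT
  iptables -I FORWARD -p icmp --icmp-type fragmentation-needed -j ACCEPT
}

# Usage: a VXLAN interface over a 1500-byte underlay
#   apply_tunnel_fixes vxlan0 1500 50
```

Keeping the arithmetic in its own function makes it easy to sanity-check the MTU before touching a live interface.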

Figure: PMTUD working, and the black hole when the ICMP is dropped. A 1500 B packet with DF set hits a 1400 MTU link; the router returns ICMP "frag needed" and the sender resizes to 1400. If that ICMP is dropped, the sender never resizes and the result is a black hole.

Fragmentation vs PMTUD vs MSS Clamping

Fragmentation is fragment-and-hope: a router splits an oversized packet into pieces the small link can carry. It works without DF set, but it is slow, the reassembling host carries the cost, and IPv6 forbids in-path fragmentation entirely — so it is the legacy fallback, not a strategy.

PMTUD is discover-the-limit: the sender sets DF and learns the smallest link from ICMP feedback, then shrinks. Elegant when the ICMP flows, but it black-holes silently the moment a firewall drops the "fragmentation needed" message it depends on.

MSS clamping is negotiate-down-at-handshake: a gateway rewrites the advertised TCP MSS so both ends agree on a safe segment size up front, needing no ICMP at all. It is the dependable tunnel fix — but it only covers TCP, so PMTUD and proper ICMP still matter for everything else.

Common Mistakes
  • Blanket-blocking ICMP on a firewall. Dropping all ICMP kills the "fragmentation needed" message PMTUD relies on, so large transfers black-hole while small packets pass — this is the single most common cause of the bug.
  • Adding a tunnel without accounting for overhead. A VPN or VXLAN header adds 50-plus bytes; without lowering the MTU or clamping MSS, previously-fine packets now exceed the path limit and the tunnel hangs on real traffic.
  • Diagnosing an MTU hang as an application bug. Because no error surfaces, the silent retransmit loop gets blamed on the server or the app, sending you debugging code when a DF-bit ping would have named MTU in seconds.
  • Testing only with small payloads. A login that succeeds and a small ping that returns prove nothing about MTU; the failure only appears once a packet fills a full segment, so small tests give false confidence.
  • Assuming both directions share one MTU. A path can be asymmetric, with a smaller link in one direction; clamping or lowering MTU on only one side leaves the other direction still black-holing large packets.

Best Practices
  • Allow ICMP type 3, code 4 through every firewall, because that message is what PMTUD depends on and blocking it breaks large transfers far more often than it stops any attack.
  • Clamp TCP MSS to the path MTU on tunnel and gateway devices with --clamp-mss-to-pmtu, so segment size is negotiated correctly at handshake and the connection never relies on ICMP that may be filtered.
  • Lower the interface MTU on tunnels to account for encapsulation overhead, so the host knows the real limit instead of discovering it through a dropped packet.
  • Confirm an MTU hang with a DF-bit ping sweep — ping -M do -s SIZE across increasing sizes — to find the exact path MTU before changing anything, rather than guessing at the application layer.
  • Test large payloads explicitly when validating a path, since logins and small pings pass regardless of MTU and only a full-size segment exposes the black hole.

Comparable concepts
  • MTU / jumbo frames (Topic 10)
  • IPsec / VXLAN tunnel overhead (Topics 51/71)
  • PLPMTUD (probe-based discovery)

Knowledge Check

SSH connects fine but freezes the moment a command prints a screenful of output. What is the most likely cause?

  • An MTU mismatch with PMTUD broken, so full-size packets are silently dropped
  • SSH authentication is failing intermittently after login
  • General packet loss on the path is randomly dropping segments, which only becomes noticeable once a command produces a burst of output
  • DNS is resolving the host to a slow secondary address

Why does blanket-blocking all ICMP on a firewall cause large transfers to hang?

  • It drops the 'fragmentation needed' message PMTUD needs, so the sender never learns to shrink
  • It blocks the ICMP echo replies that TCP relies on to continuously pace its sending rate against the measured round trip
  • It removes the ICMP wrapper that large TCP segments are carried inside
  • It prevents the TCP handshake from completing for any new connection

On a VPN gateway where you cannot guarantee ICMP flows end to end, what is the most dependable fix for the small-works/large-hangs failure?

  • Clamp the TCP MSS to the path MTU, so segment size is agreed at handshake without ICMP
  • Increase the tunnel interface MTU so that the larger full-size packets are finally able to fit through the constricted link
  • Turn on router fragmentation so oversized packets are split in transit
  • Clear the don't-fragment bit on all packets leaving the gateway
