Topic 69

The Bandwidth-Delay Product

Throughput

A single TCP flow on a long-distance link often uses a small fraction of the available bandwidth — and the reason is almost never that the link is slow. It is that the window is too small for the bandwidth-delay product. To keep a pipe full, you must have enough unacknowledged data in flight to cover the entire round trip; if the window holds less than that, the sender runs out of data to send, stalls waiting for acknowledgments, and the link sits idle. The math is exact, and it explains nearly every "why is my transcontinental transfer so slow" ticket.

The bandwidth-delay product is the number of bytes that fit "on the wire" at once: bandwidth times round-trip time. On a 1 Gbps link with a 100 ms RTT, that is 12.5 megabytes — meaning you need 12.5 MB in flight before the first acknowledgment returns, or the sender starves. Default socket buffers were sized for an era of low-latency LANs and are nowhere near that. This is the long-fat-network problem, and it is a tuning problem, not a capacity problem.

The Bandwidth-Delay Product

BDP is bandwidth × RTT, and it is the amount of data that must be in flight to saturate a path. The intuition: signals take time to cross the link, so at any instant a full pipe holds one RTT's worth of bytes that have been sent but not yet acknowledged. If your window (the cap on unacknowledged data) is smaller than the BDP, you cannot fill the pipe, full stop.

# BDP for a 1 Gbps path with 100 ms round-trip time
#   1,000,000,000 bits/s  /  8  =  125,000,000 bytes/s
#   125,000,000  x  0.100 s     =  12,500,000 bytes  (~12.5 MB)
# you must keep ~12.5 MB in flight to keep the link full
# default rcv buffer of 256 KB caps you at 256 KB / 0.1 s ≈ 2.56 MB/s ≈ 20 Mbps

That last line is the whole problem in one calculation. With a 256 KB receive buffer on a 100 ms path, throughput tops out around 20 Mbps even though the link is rated at 1000 Mbps: you are using about 2% of it. The fix is not more bandwidth; it is a window large enough to hold the BDP. Buy a faster link and the transfer does not speed up at all, because the buffer, not the pipe, is the ceiling.
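The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a measurement tool: the function names `bdp_bytes` and `pipe_fill_fraction` are hypothetical, and decimal units (1 KB = 1,000 bytes) are assumed to match the worked numbers.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product in bytes: (bits per second / 8) * RTT in seconds."""
    return bandwidth_bps / 8 * rtt_s


def pipe_fill_fraction(window_bytes: float, bandwidth_bps: float, rtt_s: float) -> float:
    """Fraction of the pipe a given window can keep full, capped at 1.0."""
    return min(1.0, window_bytes / bdp_bytes(bandwidth_bps, rtt_s))


if __name__ == "__main__":
    # 1 Gbps path, 100 ms RTT, 256 KB window (decimal units, an assumption)
    bdp = bdp_bytes(1e9, 0.100)
    fill = pipe_fill_fraction(256_000, 1e9, 0.100)
    print(f"BDP: {bdp / 1e6:.1f} MB, window fills {fill:.1%} of the pipe")
```

Plugging in your own link rate and a measured RTT (from ping, for example) gives the number to compare against the buffer sizes on the host.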

Window Size as the Throughput Ceiling

The governing equation for a single flow is throughput ≈ window ÷ RTT. The window is the smaller of the receiver's advertised window and the sender's congestion window, both bounded by the socket buffers. Because RTT is in the denominator, the same window yields less throughput the farther away the peer is — a 1 MB window gives 80 Mbps at 100 ms RTT but only 27 Mbps at 300 ms. Distance does not just add latency; it lowers your throughput ceiling for a fixed window.
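A short sketch of the window ÷ RTT relationship makes the distance effect concrete. The helper name `window_limited_mbps` is hypothetical, the model ignores loss and slow start, and decimal megabytes are assumed.

```python
from typing import Optional


def window_limited_mbps(window_bytes: float, rtt_s: float,
                        link_mbps: Optional[float] = None) -> float:
    """Idealized single-flow ceiling: window / RTT, optionally capped by the link rate."""
    mbps = window_bytes * 8 / rtt_s / 1e6
    return mbps if link_mbps is None else min(mbps, link_mbps)


if __name__ == "__main__":
    # Same 1 MB window, increasing round-trip time
    for rtt_ms in (10, 100, 300):
        rate = window_limited_mbps(1_000_000, rtt_ms / 1000, link_mbps=1000)
        print(f"RTT {rtt_ms:>3} ms -> about {rate:.0f} Mbps")
```

The `link_mbps` cap is there because a window larger than the BDP cannot push a flow past the link rate; below the BDP, the window term is what binds.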

The original TCP window field is only 16 bits, capping the window at 64 KB, which is useless on any fat path. Window scaling (a TCP option negotiated in the handshake) applies a left shift of up to 14 bits to that field, raising the ceiling to about a gigabyte, and it is what makes long-fat-network throughput possible at all. If a middlebox strips the window-scaling option during the handshake, both ends fall back to 64 KB and a transcontinental transfer crawls at a few megabits, no matter how the buffers are tuned.
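To see how much scaling a given path needs, the sketch below finds the smallest shift count whose scaled 16-bit window covers a target size. It assumes the RFC 7323 limit of 14 for the shift; the function name `min_window_scale` is hypothetical, and real kernels pick the scale from their buffer limits rather than from a BDP you hand them.

```python
MAX_UNSCALED_WINDOW = 65_535  # largest value the 16-bit window field can carry
MAX_SHIFT = 14                # largest window-scale shift allowed by RFC 7323


def min_window_scale(target_window_bytes: int) -> int:
    """Smallest shift such that (65535 << shift) covers the target window."""
    for shift in range(MAX_SHIFT + 1):
        if (MAX_UNSCALED_WINDOW << shift) >= target_window_bytes:
            return shift
    raise ValueError("target exceeds the maximum scaled TCP window")


if __name__ == "__main__":
    # 12.5 MB BDP from the 1 Gbps x 100 ms example
    print("shift needed:", min_window_scale(12_500_000))
    print("largest possible window:", MAX_UNSCALED_WINDOW << MAX_SHIFT, "bytes")
```

With no scaling (shift 0) the window stays at 65,535 bytes, which at 100 ms RTT works out to roughly 5 Mbps, the "few megabits" crawl described above.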

Socket Buffer Sizing and Autotuning

Modern Linux autotunes receive and send buffers, growing them toward the measured BDP up to a configured maximum. For most paths this just works and you never think about it. The catch is the maximum: net.ipv4.tcp_rmem and tcp_wmem set the ceiling autotuning may grow to, and on a true long-fat network the default ceiling is often below the BDP, so autotuning grows right up to a limit that still starves the flow.

# the autotuning ceilings: min  default  max (bytes)
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
# net.ipv4.tcp_rmem = 4096   131072   6291456   (~6 MB max)
# 6 MB max < 12.5 MB BDP -> autotuning caps below what the path needs
# raise the max so autotuning can reach the BDP:
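sysctl -w net.ipv4.tcp_rmem="4096 131072 33554432"
# do the same for the send side, keeping its min and default as they were:
sysctl -w net.ipv4.tcp_wmem="4096 16384 33554432"
# verify the live window with: ss -tin  (look at cwnd and rcv_space)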

So the tuning move on a long-fat network is to raise the autotuning maximum above the BDP, not to pin a fixed buffer size. Pinning defeats autotuning's ability to shrink for nearby peers and wastes memory on every socket. Compute the BDP, set the tcp_rmem and tcp_wmem maxima comfortably above it, and let autotuning grow into the headroom. Confirm the live window with ss -tin, which prints the actual congestion and receive windows in use.
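A quick check of whether an autotuning ceiling clears the BDP might look like the sketch below. The helper names `parse_tcp_mem` and `ceiling_covers_bdp` are hypothetical, and the default `headroom` factor of 2.0 is an assumption standing in for "comfortably above", not a kernel rule.

```python
from typing import Tuple


def parse_tcp_mem(value: str) -> Tuple[int, int, int]:
    """Split a 'min default max' sysctl string (spaces or tabs) into integers."""
    minimum, default, maximum = (int(field) for field in value.split())
    return minimum, default, maximum


def ceiling_covers_bdp(sysctl_value: str, bdp: float, headroom: float = 2.0) -> bool:
    """True if the autotuning max is at least headroom * BDP (headroom is assumed)."""
    _, _, maximum = parse_tcp_mem(sysctl_value)
    return maximum >= bdp * headroom


if __name__ == "__main__":
    bdp = 12_500_000  # 1 Gbps x 100 ms
    for setting in ("4096 131072 6291456", "4096 131072 33554432"):
        verdict = "ok" if ceiling_covers_bdp(setting, bdp) else "too small"
        print(f"{setting!r}: {verdict}")
```

On a Linux host the input string can be read from /proc/sys/net/ipv4/tcp_rmem or /proc/sys/net/ipv4/tcp_wmem instead of being typed in.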

Parallelism versus Bigger Windows

When you cannot tune the endpoints — a managed service, a host you do not control — the workaround is multiple parallel streams. N connections each get their own window, so the aggregate in-flight data is N times one window, and tools like iperf3 -P, GridFTP, and parallel-segment downloaders exploit exactly this. Four streams of 3 MB each fill the same pipe one well-tuned 12 MB stream would, without touching a sysctl.

The caveat is that parallelism is brute force masking an untuned flow, and it has costs. Many streams are unfair to other traffic — they claim N times the share of a shared bottleneck — and they multiply connection overhead and complicate retransmission behavior. The clean answer is one correctly-sized window; parallel streams are the pragmatic answer when you cannot reach the tuning knobs. Use them knowingly, not as a substitute for understanding why the single flow was slow.
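The aggregate-window arithmetic behind parallel streams can be sketched as follows. This is an idealized model under stated assumptions: every stream holds an equal, fully used window, there is no loss, and units are decimal megabytes. The names `aggregate_mbps` and `streams_to_fill` are hypothetical.

```python
import math


def aggregate_mbps(streams: int, window_bytes: float, rtt_s: float,
                   link_mbps: float) -> float:
    """Idealized aggregate rate of N equal-window streams, capped by the link rate."""
    return min(link_mbps, streams * window_bytes * 8 / rtt_s / 1e6)


def streams_to_fill(bdp_bytes: float, window_bytes: float) -> int:
    """Smallest stream count whose combined windows reach the BDP."""
    return math.ceil(bdp_bytes / window_bytes)


if __name__ == "__main__":
    # 1 Gbps x 100 ms path, 3 MB window per stream
    print("4 streams:", aggregate_mbps(4, 3_000_000, 0.100, 1000), "Mbps")
    print("streams to reach the BDP:", streams_to_fill(12_500_000, 3_000_000))
```

Under this model four 3 MB streams reach about 96% of the 1 Gbps link, and a fifth covers the remainder; real streams sharing a bottleneck rarely split it this evenly.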

Figure: Window vs BDP on a 1 Gbps × 100 ms pipe (BDP = 12.5 MB). Bars on a 0 to 12.5 MB axis: BDP needed, 12.5 MB; tuned window, 12.5 MB; default window, ~0.2 MB.

A default 200 KB socket buffer fills barely 2% of the pipe — the sender starves waiting for ACKs. Throughput tops out at window ÷ RTT, so you must size the window to bandwidth × RTT.

Bigger Window vs More Parallel Streams

A bigger window on one flow is the correct fix: size the socket buffers to the BDP and a single connection fills the pipe efficiently and fairly. Choose it when you control the endpoints, because it is the tuning-correct answer with no per-stream overhead and no unfairness to neighbors.

More parallel streams split the path across N connections so the aggregate window reaches the BDP without tuning either end. Choose it when you cannot change the endpoints' buffers — a managed service, a black-box host — but know it is brute force: it is unfair on a shared bottleneck and adds connection overhead, masking rather than fixing the untuned single flow.

Common Mistakes

Blaming bandwidth for a window-limited transfer. A 1 Gbps link delivering 20 Mbps on a 100 ms path is not short of capacity — the window is below the BDP, and buying more bandwidth changes nothing because the buffer is the ceiling.
Leaving tiny socket buffers on a long-fat network. A 256 KB buffer caps a 100 ms path at about 20 Mbps regardless of link speed; the default ceilings were sized for LANs and starve any high-latency high-bandwidth flow.
Ignoring a middlebox that strips window scaling. If the scaling option is dropped in the handshake, both ends fall back to a 64 KB window and a transcontinental transfer crawls, no matter how high you set the buffer maxima.
Adding parallel streams to mask an untuned single flow. More connections hide the symptom but inherit the unfairness and overhead of N windows; you have papered over a one-line buffer fix with brute force you will carry forever.
Pinning a fixed buffer instead of raising the autotuning ceiling. A hard-coded size defeats autotuning's ability to shrink for nearby peers, wasting memory on every socket while still being wrong for some paths.

Best Practices

Compute the BDP as bandwidth × RTT before tuning anything, so you size the window to the path you actually have instead of guessing at a buffer number.
Raise the tcp_rmem and tcp_wmem maxima above the BDP and let autotuning grow into the headroom, rather than pinning a fixed buffer that breaks short-RTT efficiency.
Confirm window scaling is active in the handshake with a capture before blaming buffers, since a middlebox stripping the option caps the window at 64 KB no matter how the maxima are set.
Reach for parallel streams only when you cannot tune the endpoints, and treat them as a stopgap for a black-box host rather than the default answer to a slow transfer.
Read the live congestion and receive windows with ss -tin while a transfer runs, so you tune against the window the kernel is actually using instead of the value you hoped you set.

Comparable concepts: Window scaling (Topic 28); BBR / CUBIC congestion control (Topic 29); Little's Law (queueing analog)

Knowledge Check

A single TCP transfer over a 1 Gbps, 100 ms RTT path sustains only 20 Mbps. The link is otherwise idle and lossless. What is the bottleneck?

The socket buffer is smaller than the 12.5 MB BDP, so the window caps throughput
Packet loss is repeatedly collapsing the congestion window, cutting it on every drop and keeping the average rate far below the link
The link is saturated and 20 Mbps is its real capacity
DNS lookups are inserting delay into each segment

For a single flow, how does the achievable throughput relate to window size and RTT?

Throughput is roughly window divided by RTT, so the same window does less at higher RTT
Throughput rises with RTT because more data is buffered in flight
Throughput depends only on link bandwidth, independent of window size
Throughput is the window multiplied by the RTT, so it grows steadily with distance as more data accumulates on the wire

You cannot change the buffer settings on a managed endpoint, but a transfer to it is window-limited. What is the pragmatic workaround, and its main downside?

Open several parallel streams so the aggregate window fills the pipe, at the cost of unfairness
Provision a faster, higher-bandwidth link to the managed endpoint and simply accept the substantially higher recurring monthly cost
Reduce the MTU so more packets fit in the window
Disable window scaling so the connection negotiates faster
