Topic 28

Flow Control and the Sliding Window

Flow Control

Flow control answers one narrow question: how does a fast sender avoid drowning a slow receiver? A 10 Gbps server streaming to a phone that can only process data so quickly will overflow the phone's buffer and force drops unless something paces it. TCP's answer is the receive window — the receiver advertises, in every ACK, exactly how much buffer space it has free, and the sender is forbidden from having more unacknowledged data in flight than that.

The crucial thing to keep straight is what flow control is not. It protects the receiver, not the network. A separate mechanism, congestion control (the next topic), protects the shared links in between. The amount of data a sender may have in flight is bounded by the smaller of two limits: the receive window (rwnd) and the congestion window (cwnd). Confusing the two leads you to tune the wrong knob when throughput disappoints.

Flow control (rwnd)

Protects the receiver. The receiver advertises its free buffer in every ACK; the sender never has more unacknowledged data in flight than that window. End-to-end pacing, decided by how fast the application drains the socket.

Congestion control (cwnd)

Protects the network. The sender estimates the shared path's capacity and caps in-flight data to avoid overrunning the links between. In-flight data is bounded by the smaller of cwnd and rwnd.

The Receive Window

The receive window (rwnd) is a 16-bit field in every TCP header carrying the receiver's currently free buffer space in bytes. As the application drains data out of the socket, the window opens; as the buffer fills because the application is slow to read, the window shrinks. The sender reads this number on each ACK and never lets its in-flight unacknowledged bytes exceed it — pure end-to-end pacing, decided entirely by how fast the receiver consumes.

Being 16 bits, the raw field maxes out at 65535 bytes. On a fast, high-latency link that ceiling is brutal: with a 64 KB cap and a 100 ms round trip, a single connection tops out near 5 Mbps no matter how much bandwidth the path actually has, because the sender must stop and wait for an ACK after every 64 KB. That limit is the entire reason window scaling exists.
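The ceiling quoted above follows from simple arithmetic: at most one window of data can be delivered per round trip. A minimal sketch of that calculation, where the function name `window_limited_throughput_bps` is a hypothetical helper for illustration and not part of any library:

```python
def window_limited_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput when at most one window is delivered per round trip."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes * 8 / rtt_seconds


# The unscaled 16-bit maximum over a 100 ms round trip.
ceiling = window_limited_throughput_bps(65535, 0.100)
print(f"{ceiling / 1e6:.2f} Mbps")  # roughly 5.24 Mbps
```

The same function shows why latency matters as much as window size: halving the round trip doubles the ceiling without touching the window.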

The Sliding Window

The window slides: as ACKs confirm older bytes, the left edge advances, and as the receiver frees buffer, the right edge advances, so the sender continuously injects new data without waiting for each segment to be confirmed individually. At any instant the unacknowledged data in flight is bounded by the window size, but within that bound the sender keeps the pipe full — this pipelining is what makes TCP fast despite acknowledging everything.
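The pipelining described above can be illustrated with a toy simulation. This is a sketch under strong simplifying assumptions (a fixed window, no loss, one in-order ACK per segment), not a model of a real TCP stack; `simulate_sliding_window` and its parameters are hypothetical names.

```python
def simulate_sliding_window(total_bytes: int, window: int, segment: int) -> tuple[int, int]:
    """Return (peak bytes in flight, number of ACKs) for a loss-free transfer."""
    if window < segment:
        raise ValueError("window must hold at least one segment")
    left_edge = 0   # oldest unacknowledged byte
    next_seq = 0    # next byte the sender will transmit
    peak_in_flight = 0
    acks = 0
    while left_edge < total_bytes:
        # Fill the window: keep sending while in-flight data stays within the limit.
        while next_seq < total_bytes:
            size = min(segment, total_bytes - next_seq)
            if (next_seq - left_edge) + size > window:
                break
            next_seq += size
        peak_in_flight = max(peak_in_flight, next_seq - left_edge)
        # One ACK arrives for the oldest segment, sliding the left edge forward.
        left_edge += min(segment, next_seq - left_edge)
        acks += 1
    return peak_in_flight, acks


peak, acks = simulate_sliding_window(total_bytes=10_000, window=4_000, segment=1_000)
print(peak, acks)  # 4000 10
```

In-flight data never exceeds the window, yet after the initial burst every ACK immediately releases another segment, so the sender does not sit idle between confirmations.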

The actual limit on in-flight data is the minimum of cwnd and rwnd. If the receiver advertises 256 KB but congestion control has the cwnd at 64 KB, the sender may have only 64 KB outstanding — the network is the binding constraint. If the cwnd has grown to 1 MB but the receiver advertises 64 KB, the receiver is the binding constraint. Reading which of the two is smaller tells you immediately whether to fix the receiver's buffers or the path.

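# window-scale factors, congestion window, and receive-buffer estimate per socket
ss -tie
# ... wscale:7,7 ... cwnd:128 ... rcv_space:262144 ...
#   wscale    = window-scale shifts negotiated for the connection
#   cwnd      = congestion window, counted in segments rather than bytes (network limit)
#   rcv_space = kernel's receive-buffer autotuning estimate, in bytes (receiver side)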
# the autotuning ceilings that cap how large rwnd can grow
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
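The minimum rule discussed above can be sketched in a few lines. The helper `send_allowance` is hypothetical and takes both windows in bytes; a cwnd reported in segments would first need to be multiplied by the segment size, which is left out here.

```python
def send_allowance(cwnd_bytes: int, rwnd_bytes: int, in_flight_bytes: int) -> tuple[int, str]:
    """Return (bytes the sender may still transmit, which window is binding)."""
    limit = min(cwnd_bytes, rwnd_bytes)
    # On a tie the receiver is reported as binding; that choice is arbitrary.
    binding = "cwnd (network)" if cwnd_bytes < rwnd_bytes else "rwnd (receiver)"
    return max(0, limit - in_flight_bytes), binding


KB = 1024
# Receiver advertises 256 KB, congestion control allows 64 KB.
print(send_allowance(64 * KB, 256 * KB, 0))    # (65536, 'cwnd (network)')
# cwnd has grown to 1 MB, receiver advertises 64 KB.
print(send_allowance(1024 * KB, 64 * KB, 0))   # (65536, 'rwnd (receiver)')
```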

Window Scaling

Window scaling is the TCP option that lifts the 64 KB ceiling. It is negotiated once, during the handshake: each side announces its shift count in its SYN or SYN-ACK, and scaling takes effect only if both sides send the option. The option defines a left-shift applied to the advertised window: a scale of 7 multiplies it by 128, lifting the effective maximum to roughly 8 MB, and the protocol's largest legal shift of 14 reaches roughly 1 GB. Without it, no amount of buffer tuning helps, because the 16-bit field simply cannot express a larger number.
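The shift arithmetic is easy to check directly. The sketch below uses hypothetical helper names (`effective_window`, `min_shift_for`); the only facts it relies on are the 16-bit field and the maximum shift of 14 stated above.

```python
MAX_RAW_WINDOW = 0xFFFF  # largest value the 16-bit window field can carry
MAX_SHIFT = 14           # largest shift the window-scale option allows


def effective_window(raw_window: int, shift: int) -> int:
    """Window in bytes after applying the negotiated left-shift."""
    if not 0 <= raw_window <= MAX_RAW_WINDOW:
        raise ValueError("raw window must fit in 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift must be between 0 and 14")
    return raw_window << shift


def min_shift_for(target_bytes: int) -> int:
    """Smallest shift whose maximum window can express target_bytes."""
    for shift in range(MAX_SHIFT + 1):
        if effective_window(MAX_RAW_WINDOW, shift) >= target_bytes:
            return shift
    raise ValueError("target exceeds the largest scalable window")


print(effective_window(MAX_RAW_WINDOW, 7))   # 8388480, roughly 8 MB
print(effective_window(MAX_RAW_WINDOW, 14))  # 1073725440, roughly 1 GB
print(min_shift_for(10_000_000))             # 8
```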

This matters on long-fat networks — paths with high bandwidth and high latency, where the bandwidth-delay product (the data that fits "in flight") exceeds 64 KB. A 1 Gbps transcontinental link with an 80 ms RTT has a BDP near 10 MB; capped at 64 KB it would run at under 1% of capacity. Because scaling is set only in the handshake, a connection that negotiated it badly cannot be fixed mid-stream — the fix is correct buffer ceilings before the connection opens.
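The bandwidth-delay product figures above can be reproduced with a short calculation. The function names here are illustrative assumptions, and the model ignores protocol overhead and loss.

```python
def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product: bytes that fit in flight on the path."""
    return bandwidth_bps * rtt_seconds / 8


def utilization(window_bytes: float, bandwidth_bps: float, rtt_seconds: float) -> float:
    """Fraction of link capacity a window-limited sender can use (capped at 1.0)."""
    return min(1.0, window_bytes / bdp_bytes(bandwidth_bps, rtt_seconds))


# 1 Gbps path with an 80 ms round trip.
print(f"BDP: {bdp_bytes(1e9, 0.080) / 1e6:.1f} MB")                 # BDP: 10.0 MB
print(f"64 KB window: {utilization(65535, 1e9, 0.080):.2%} of capacity")  # about 0.66%
```

The same `bdp_bytes` value is a reasonable lower bound when choosing the buffer ceilings for a path like this one.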

Zero-Window and Silly-Window

When a receiver's buffer fills completely, it advertises a zero window — "send nothing more." The sender stops and periodically probes with a one-byte segment until the receiver reopens the window. A zero window is a clear signal that the receiving application is not reading fast enough; it is not loss, not congestion, and chasing it as a network problem wastes time. Read the window field and the cause is unambiguous.
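A toy model makes the cause of a zero window concrete: the advertised window is simply the buffer capacity minus whatever the application has not yet read. The sketch below is an assumption-laden simplification (discrete ticks, a sender that always honors the window, no probes), and `advertised_windows` is a hypothetical name.

```python
def advertised_windows(buffer_size: int, arrive_per_tick: int,
                       read_per_tick: int, ticks: int) -> list[int]:
    """Window the receiver would advertise at the end of each tick."""
    buffered = 0
    windows = []
    for _ in range(ticks):
        window = buffer_size - buffered
        # The sender never delivers more than the currently open window.
        buffered += min(arrive_per_tick, window)
        # The application drains what it can from the socket.
        buffered -= min(read_per_tick, buffered)
        windows.append(buffer_size - buffered)
    return windows


# A stalled application: nothing is read, so the window closes completely.
print(advertised_windows(4000, 2000, 0, 3))    # [2000, 0, 0]
# A slow but active application: the window shrinks but stays open.
print(advertised_windows(4000, 2000, 500, 4))  # [2500, 1000, 500, 500]
```

In both runs the network is identical; only the application's read rate changes, which is why the window field points at the consumer rather than the path.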

The silly-window syndrome is the pathological case where the window reopens in tiny increments — a few bytes at a time — and the sender responds with tiny segments, spending more on headers than payload. TCP defends against it on both ends: the receiver delays advertising a larger window until it can offer a meaningful chunk, and the sender (via Nagle's algorithm) coalesces small writes. The two together keep a slow-draining receiver from degenerating into header-dominated dribble.
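The receiver-side defense can be sketched as a single threshold check. This follows the rule of thumb in RFC 1122 (reopen the window only once the increase reaches the smaller of one full segment or half the buffer); the function and parameter names are hypothetical, and real stacks differ in detail.

```python
def window_to_advertise(free_space: int, offered_window: int,
                        buffer_size: int, mss: int) -> int:
    """Receiver-side silly-window avoidance: suppress tiny window increases."""
    threshold = min(buffer_size // 2, mss)
    if free_space - offered_window >= threshold:
        return free_space       # the increase is worth announcing
    return offered_window       # keep advertising the old, smaller window


# Window is closed and the application has freed only 100 bytes: stay closed.
print(window_to_advertise(free_space=100, offered_window=0, buffer_size=65536, mss=1460))   # 0
# A full segment's worth is now free: reopen.
print(window_to_advertise(free_space=1460, offered_window=0, buffer_size=65536, mss=1460))  # 1460
```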

Flow Control vs Congestion Control

Flow control protects the receiver. Its limit is the receive window (rwnd), advertised by the receiver based on its free buffer space, and it prevents a fast sender from overrunning a slow consumer. It says nothing about the links in between.

Congestion control protects the network. Its limit is the congestion window (cwnd), computed by the sender from loss and delay signals, and it prevents a flow from overwhelming shared links. The sender may keep in flight no more than the smaller of cwnd and rwnd, so when throughput is low, the first diagnostic is which of the two is the binding limit.

Common Mistakes

Leaving window scaling off — or letting a middlebox strip the option — on a long-fat network. The connection is pinned to the 64 KB ceiling and runs at a fraction of the path's capacity no matter how fast the link is.
Capping socket buffers too small for the path's BDP. If the buffer is smaller than bandwidth times RTT, the receive window can never open wide enough to fill the pipe, and a high-bandwidth transfer stalls below its potential.
Reading a zero-window stall as network loss. A zero window means the receiving application is not draining its socket fast enough; debugging the wire instead of the consumer wastes the investigation entirely.
Manually pinning huge static buffers everywhere. Oversized fixed buffers waste kernel memory across thousands of mostly-idle connections, where autotuning would have sized each one to its actual BDP.
Blaming congestion control when the receiver is the limit. If rwnd is the smaller of the two windows, changing the congestion algorithm does nothing — the fix is the receiver's buffer, not the sender's CC.

Best Practices

Confirm window scaling is negotiated on any high-BDP path by checking wscale in ss -tie, since a stripped option silently caps the connection at 64 KB.
Size tcp_rmem and tcp_wmem ceilings to at least the path's bandwidth-delay product so the window can open wide enough to keep a long-fat link full.
Lean on kernel autotuning rather than pinning static buffers, letting each connection grow its window to its own BDP instead of wasting memory on idle sockets.
Diagnose low throughput by comparing cwnd and rwnd directly, fixing the receiver's buffers when rwnd is smaller and the path or algorithm when cwnd is.
Treat a persistent zero window as an application backpressure signal and speed up the consumer's reads, rather than retuning the network underneath it.
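Several of these checks can be scripted. The sketch below pulls the fields shown in the earlier `ss -tie` sample out of a line of text; it assumes the `wscale:a,b`, `cwnd:n`, and `rcv_space:n` spellings from that sample, and real output may vary between tool versions. `parse_ss_info` is a hypothetical helper, not part of any library.

```python
import re


def parse_ss_info(line: str) -> dict:
    """Extract wscale, cwnd, and rcv_space from one line of ss-style socket info."""
    fields = {}
    match = re.search(r"\bwscale:(\d+),(\d+)", line)
    # A missing wscale field suggests window scaling was not negotiated.
    fields["wscale"] = (int(match.group(1)), int(match.group(2))) if match else None
    for name in ("cwnd", "rcv_space"):
        match = re.search(rf"\b{name}:(\d+)", line)
        fields[name] = int(match.group(1)) if match else None
    return fields


info = parse_ss_info("wscale:7,7 cwnd:128 rcv_space:262144")
print(info)  # {'wscale': (7, 7), 'cwnd': 128, 'rcv_space': 262144}
if info["wscale"] is None:
    print("window scaling not in effect: connection capped at 64 KB")
```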

Comparable concepts: Bandwidth-delay product sizing, Application-level backpressure

Knowledge Check

A connection between two high-bandwidth hosts 80 ms apart runs far below the link's capacity. What is the most likely cause?

Window scaling is not in effect, pinning the window at 64 KB
The TCP checksum is far too weak for a link this fast, forcing the receiver into constant revalidation of every single segment
Nagle's algorithm is batching the data and holding back the bulk of the transfer
The path MTU is small, so each segment carries too little payload to fill the link

How does flow control differ from congestion control?

Flow control runs only during the handshake, while congestion control runs during data transfer
Flow control protects the receiver via rwnd; congestion control protects the network via cwnd
Flow control protects the network links, while congestion control protects the receiver's buffer
They are two names for the same sliding window, used interchangeably by the stack

A sender stalls and the receiver is advertising a zero window. What does that indicate?

Heavy packet loss on the path has forced the sender to pause and retransmit
Congestion control has collapsed the sender's window down to nothing
The receiving application is not reading fast enough, so its buffer filled
The three-way handshake quietly failed partway through, leaving the connection stuck half-open and completely unusable for data
