Topic 29

Congestion Control


Congestion control is the reason the internet does not collapse under its own load. Every TCP sender independently probes for how much bandwidth the path can give it and backs off the moment the network shows strain, so a single flow plays nicely with the millions of others sharing the same links. There is no central coordinator — just every endpoint running the same restraint algorithm and trusting the others to do likewise.

The control variable is the congestion window (cwnd): the sender's own estimate of how much unacknowledged data the network can absorb. It grows when delivery succeeds and shrinks on a congestion signal. The signal differs by algorithm — Reno and CUBIC treat packet loss as the cue, while BBR builds a model of the path's bandwidth and round-trip time — and that choice decides how aggressively a flow fills the pipe and how it shares it with neighbors.

The cwnd cycle: ramp fast, probe gently, back off, recover

Slow start (cwnd doubles / RTT) → Congestion avoidance (+1 segment / RTT) → Loss (congestion signal) → Recovery (cut cwnd, climb again)

Slow Start and Congestion Avoidance

A new connection has no idea what the path can carry, so it begins in slow start: cwnd starts small — around 10 segments on modern Linux — and roughly doubles every round trip. Despite the name, this is exponential and fast, designed to find the path's capacity in a handful of RTTs rather than crawling up to it. Slow start is why a connection's throughput ramps over its first moments rather than hitting full speed instantly.
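To make the "handful of RTTs" concrete, here is a minimal sketch of the doubling. The function name, the 1000-segment target, and the assumption of a loss-free path are illustrative choices, not part of any real TCP stack.

```python
def slow_start_ramp(initial_cwnd=10, target_cwnd=1000):
    """Return cwnd (in segments) at each RTT until it reaches target_cwnd.

    Idealized: no loss, no receiver-window limit, exact doubling per RTT.
    """
    cwnd = initial_cwnd
    history = [cwnd]
    while cwnd < target_cwnd:
        cwnd = min(cwnd * 2, target_cwnd)  # exponential growth, capped at the target
        history.append(cwnd)
    return history


ramp = slow_start_ramp()
print(f"{len(ramp) - 1} RTTs: {ramp}")
# 7 RTTs: [10, 20, 40, 80, 160, 320, 640, 1000]
```

Under these assumptions a hundredfold increase in window takes only seven round trips, which is why the phase is exponential in spite of its name.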

When cwnd reaches the slow-start threshold (ssthresh) or the first loss appears, the sender switches to congestion avoidance, where cwnd grows linearly, about one segment per RTT, gently probing for a little more headroom. On a loss it cuts cwnd multiplicatively (Reno halves it) and resumes the cautious climb. That linear climb plus multiplicative cut is the classic AIMD pattern (additive increase, multiplicative decrease). Slow start's exponential ramp comes first to get cwnd into the right range quickly; AIMD then inches upward so the sender does not overshoot and trigger the very congestion it is probing for.
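The resulting sawtooth can be sketched with a toy Reno-style model. Everything here is a simplifying assumption: loss is modeled as occurring whenever cwnd reaches a fixed capacity (`loss_at_cwnd`), the slow-start threshold is called `ssthresh`, and fast retransmit, timeouts, and ACK clocking are ignored.

```python
def aimd_trace(rtts, ssthresh=64, loss_at_cwnd=100, initial_cwnd=10):
    """Return cwnd (in segments) observed at each RTT of a toy Reno-style sender."""
    cwnd = initial_cwnd
    trace = []
    for _ in range(rtts):
        trace.append(cwnd)
        if cwnd >= loss_at_cwnd:
            # Multiplicative decrease: halve the window on the congestion signal.
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Slow start: double per RTT, but do not overshoot ssthresh.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: additive increase of one segment per RTT.
            cwnd += 1
    return trace


trace = aimd_trace(60)
print(trace[:6])     # [10, 20, 40, 64, 65, 66]
print(trace[38:42])  # [99, 100, 50, 51]
```

The first few entries show the exponential ramp handing over to the linear climb; the later slice shows the halving at the modeled loss and the climb starting again.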

Loss-Based versus Model-Based Control

The deep divide between algorithms is what counts as the signal to back off. Loss-based control (Reno, CUBIC) keeps pushing until a packet is dropped, takes the loss as proof the queue overflowed, then cuts cwnd sharply: Reno halves it, and CUBIC reduces it to about 70% of its previous value. It is simple and has run the internet for decades, but it deliberately fills buffers until they overflow, which inflates latency (bufferbloat), and it misreads random loss on an otherwise uncongested wireless link as congestion.
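How badly random loss hurts a loss-based sender can be estimated with the well-known Mathis approximation for Reno-style AIMD, throughput ≈ (MSS / RTT) · sqrt(3/2) / sqrt(p). This formula is not from the text above and is only a rough steady-state model; CUBIC's response differs in detail. The function name and the sample numbers are illustrative.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Approximate steady-state throughput (bits/s) of a Reno-style flow.

    Mathis approximation: rate ~ (MSS / RTT) * sqrt(3/2) / sqrt(p).
    Assumes random, independent loss and a long-lived flow.
    """
    return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5) / math.sqrt(loss_rate)


# 1460-byte segments, 100 ms RTT, 1% random loss
rate = mathis_throughput_bps(1460, 0.100, 0.01)
print(f"{rate / 1e6:.2f} Mbit/s")
```

With these inputs the estimate comes out near 1.4 Mbit/s regardless of how fast the underlying link is, and quadrupling the loss rate halves it again.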

BBR takes a different premise: instead of waiting for loss, it continuously estimates the path's bottleneck bandwidth and minimum round-trip time, then paces its sending at close to that rate, enough to fill the pipe but not the buffers. Because it does not need loss as a signal, BBR keeps throughput high on lossy paths where CUBIC would repeatedly cut its window, and it keeps queues short rather than driving them full.

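# which algorithm is in use, and which are available
sysctl net.ipv4.tcp_congestion_control
# net.ipv4.tcp_congestion_control = cubic
sysctl net.ipv4.tcp_available_congestion_control
# net.ipv4.tcp_available_congestion_control = reno cubic bbr
# load the BBR module if bbr is not listed above (needs root)
sudo modprobe tcp_bbr
# switch the default to BBR for new connections (needs root; does not persist across reboots)
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr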
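The sysctl changes the system-wide default. On Linux an application can also request an algorithm for a single socket through the `TCP_CONGESTION` socket option, which makes it possible to try BBR on one service without touching everything else. The sketch below assumes Linux and a Python build that exposes `socket.TCP_CONGESTION`; the helper name is hypothetical, and it falls back to the existing setting when the request is refused.

```python
import socket


def try_set_congestion_control(sock, name):
    """Ask for a congestion-control algorithm on one socket.

    Returns the algorithm name in effect afterwards, or None when the
    platform does not expose TCP_CONGESTION or the value cannot be read.
    """
    opt = getattr(socket, "TCP_CONGESTION", None)
    if opt is None:
        return None  # not Linux, or this Python build lacks the constant
    try:
        sock.setsockopt(socket.IPPROTO_TCP, opt, name.encode())
    except OSError:
        pass  # algorithm not loaded or not permitted; keep the current one
    try:
        raw = sock.getsockopt(socket.IPPROTO_TCP, opt, 16)
    except OSError:
        return None
    return raw.split(b"\x00", 1)[0].decode()


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    print(try_set_congestion_control(s, "bbr"))
```

What this prints depends on the host: the requested name if the kernel accepted it, otherwise whatever algorithm the socket already had.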

CUBIC

CUBIC is the Linux default and the workhorse of most of the internet. It is loss-based but replaces Reno's linear growth with a cubic function of the time since the last loss: after a cutback it regains ground quickly, flattens out as it approaches the window size where the last loss occurred, then accelerates again if no loss recurs. That curve lets it recover and probe far faster than Reno on high-bandwidth-delay paths, which is exactly where plain Reno's one-segment-per-RTT growth was hopelessly slow.
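The shape of that curve can be sketched from the window function given in the CUBIC specification (RFC 8312 and its successor RFC 9438), W(t) = C·(t − K)³ + W_max with K = ∛(W_max·(1 − β)/C). The constants C = 0.4 and β = 0.7 are the values from that specification; the function names are illustrative, and this ignores the Reno-friendly region and every other detail of a real implementation.

```python
C = 0.4     # scaling constant from the CUBIC specification
BETA = 0.7  # window is reduced to BETA * w_max after a loss


def cubic_k(w_max):
    """Seconds needed to climb back to w_max after a loss."""
    return (w_max * (1 - BETA) / C) ** (1 / 3)


def cubic_window(t, w_max):
    """Target window (segments) t seconds after the last loss."""
    return C * (t - cubic_k(w_max)) ** 3 + w_max


w_max = 100
k = cubic_k(w_max)
for t in (0, k / 2, k, 1.5 * k, 2 * k):
    print(f"t={t:5.2f}s  cwnd={cubic_window(t, w_max):6.1f}")
```

At t = 0 the window is 70% of the old peak, at t = K it is back at the peak where the curve is flattest, and beyond K it grows at an increasing rate to probe for new capacity.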

CUBIC's strength and weakness are two sides of the same coin. Because it still waits for loss, it fills the bottleneck buffer until it overflows, which is fine for throughput but adds queuing latency and makes it stumble on paths where loss is random rather than congestion-driven. On a clean, high-BDP link it scales well; on a lossy mobile link, every random drop makes it back off when nothing was actually congested.

BBR

BBR, from Google, models the path rather than reacting to loss. By tracking the maximum delivery rate and the minimum RTT it has seen recently, it computes the bandwidth-delay product and paces output to match it, aiming to keep the pipe full and the queue near empty. On long, lossy paths it can deliver multiples of CUBIC's throughput, because a few percent random loss no longer forces a window collapse. This is why large content providers adopted it for their edge.
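The core of that model is two windowed filters: a running maximum over delivery-rate samples and a running minimum over RTT samples. The sketch below illustrates only that idea. The class name and the sample-count windows are assumptions made for brevity (real BBR windows are time-based), and pacing gains, probing phases, and everything else BBR does are omitted.

```python
from collections import deque


class PathModel:
    """Toy max-bandwidth / min-RTT estimator in the spirit of BBR."""

    def __init__(self, bw_window=10, rtt_window=100):
        # Old samples fall out of the deques automatically.
        self.rates = deque(maxlen=bw_window)
        self.rtts = deque(maxlen=rtt_window)

    def on_ack(self, delivered_bytes, interval_s, rtt_s):
        """Record one delivery-rate sample and one RTT sample."""
        self.rates.append(delivered_bytes / interval_s)
        self.rtts.append(rtt_s)

    @property
    def bottleneck_bw(self):
        return max(self.rates)  # bytes per second

    @property
    def min_rtt(self):
        return min(self.rtts)   # seconds

    def bdp_bytes(self):
        """Data the path can hold in flight without building a queue."""
        return self.bottleneck_bw * self.min_rtt


model = PathModel()
model.on_ack(125_000, 0.1, 0.050)  # 1.25 MB/s, 50 ms
model.on_ack(100_000, 0.1, 0.080)  # slower sample with a queued RTT
model.on_ack(125_000, 0.1, 0.040)  # 40 ms: closest to the propagation delay
print(round(model.bdp_bytes()))    # 50000
```

Taking the maximum rate and the minimum RTT means a sample inflated by queuing or slowed by a lost packet does not drag the estimate down, which is the property that lets a model-based sender ride through random loss.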

BBR is not free of controversy. Early versions could be unfair to loss-based flows sharing a bottleneck: BBR would hold its modeled rate while CUBIC, seeing loss, kept yielding, so BBR took more than its share. BBRv2 and later versions aim to coexist better, but the lesson stands: mixing BBR and CUBIC on the same congested link can produce uneven sharing, and blaming the wrong flow for the imbalance is a common diagnostic error.

CUBIC vs BBR

CUBIC is loss-based: it grows cwnd along a cubic curve and backs off when a packet drops, treating loss as the congestion signal. It is the Linux default, scales well on clean high-BDP links, but fills buffers and misreads random loss on lossy paths as congestion.

BBR is model-based: it measures bottleneck bandwidth and minimum RTT and paces to fill the pipe without filling the queue. It wins big on long, lossy paths and keeps latency low, but early versions could grab more than their share when sharing a bottleneck with loss-based flows — the coexistence debate that still surrounds it.

Common Mistakes

Reading one lossy link as "broken" when CUBIC is simply backing off. A few percent random loss makes a loss-based algorithm cut its window repeatedly, so throughput drops even though the path is up and forwarding.
Mixing BBR and CUBIC on the same bottleneck and blaming the wrong flow. BBR can hold its rate while CUBIC yields to loss, producing uneven sharing that looks like a CUBIC bug but is really a product of how the two algorithms interact.
Running tiny router or host buffers that trigger premature loss. Buffers smaller than the path's BDP drop packets before the pipe is full, signaling congestion to a loss-based sender that has not actually saturated the link.
Switching the default to BBR everywhere without testing fairness. BBR helps on lossy long-fat paths but can be aggressive against neighboring loss-based flows on a shared link, so a blanket change can degrade others' throughput.
Expecting slow start to reach full speed instantly. A fresh connection ramps cwnd over several RTTs, so a short transfer may finish before it ever uses the path's full capacity — which no algorithm choice fixes.
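The last point is easy to check with arithmetic. The sketch below counts the round trips a short transfer needs and the largest window it ever uses, assuming an idealized loss-free slow start from 10 segments on a path that could carry 1000; the function name and both figures are illustrative, and connection setup is not counted.

```python
def rtts_to_send(total_segments, initial_cwnd=10, path_cwnd=1000):
    """Return (round trips needed, largest cwnd actually used) for a transfer."""
    cwnd = initial_cwnd
    sent = 0
    rtts = 0
    peak = 0
    while sent < total_segments:
        peak = cwnd                       # window used in this round trip
        sent += cwnd
        rtts += 1
        cwnd = min(cwnd * 2, path_cwnd)   # slow-start doubling, capped by the path
    return rtts, peak


print(rtts_to_send(100))  # (4, 80)
```

A 100-segment transfer (roughly 146 KB at 1460 bytes per segment) finishes in four round trips with a peak window of 80 segments, far below the 1000 the path could have carried.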

Best Practices

Reach for BBR on long, lossy paths — cross-continent or mobile edges — where loss-based CUBIC keeps collapsing its window on random drops it misreads as congestion.
Keep CUBIC as the default on clean, well-buffered networks, where its loss-based growth is well understood and shares bottlenecks fairly with the rest of the internet.
Size buffers to roughly the path's bandwidth-delay product, avoiding tiny buffers that trigger premature loss and oversized ones that inflate latency through bufferbloat.
Test fairness before rolling BBR out broadly, measuring how it shares a congested link with the loss-based flows already on it rather than assuming coexistence.
Diagnose low throughput by checking the congestion algorithm and the loss rate together with ss -ti, so you can tell a backing-off CUBIC from a genuinely failing path.
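For the buffer-sizing practice, the bandwidth-delay product is simply bandwidth multiplied by round-trip time. A minimal worked example follows; the 1 Gbit/s and 80 ms figures are made up for illustration.

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product in bytes: bits/s times seconds, divided by 8."""
    return bandwidth_bps * rtt_s / 8


bdp = bdp_bytes(1_000_000_000, 0.080)  # 1 Gbit/s path with an 80 ms RTT
print(f"{bdp / 1e6:.0f} MB in flight, about {bdp / 1460:.0f} segments of 1460 bytes")
```

Here the path holds about 10 MB in flight, so a bottleneck buffer far smaller than that risks the premature loss described above, while one many times larger invites bufferbloat.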

Comparable concepts: ECN (explicit congestion signaling); QUIC (pluggable user-space CC)

Knowledge Check

How does slow start differ from congestion avoidance in how the congestion window grows?

Slow start doubles cwnd each RTT; congestion avoidance adds roughly one segment per RTT
Slow start grows cwnd strictly linearly, while congestion avoidance is the phase that doubles it on every single round trip
Slow start holds cwnd fixed, while congestion avoidance grows it only after a loss
Slow start halves cwnd each round trip, while congestion avoidance resets it to zero per ACK

On a long path with a few percent random packet loss, why does BBR often outperform CUBIC?

BBR retransmits each of the lost packets considerably faster, so the random losses are quietly repaired before they ever matter
BBR paces to its bandwidth model and ignores random loss, avoiding the window collapses CUBIC suffers
BBR is allowed a larger maximum window than CUBIC, so it ignores the receiver's limit
BBR adds forward error correction so lost packets never need retransmission

Two flows share a congested bottleneck and split it unevenly. Which interaction is a known cause?

Two CUBIC flows compete and one starves the other because they share the loss signal
The two flows negotiated different MTUs, so one fits more data per packet than the other
A BBR flow holds its modeled rate while a CUBIC flow yields on loss, so BBR takes more than its share
One flow advertised a zero window, forcing the shared bottleneck to hand over the entirety of its bandwidth to the other
