TCP Tuning and Connection Lifecycle
Topic 30



The TCP problems that actually bite in production are rarely about congestion algorithms. They are about the connection lifecycle: sockets that pile up in TIME_WAIT, sockets that leak in CLOSE_WAIT, buffers sized wrong for the path, and connection churn that throws away the handshake on every request. These show up as port exhaustion, file-descriptor leaks, and latency long before any exotic tuning is warranted.

This is the operator's view of TCP — what you read off ss at 2 a.m. when a service is failing to connect. The single most useful skill is telling a normal state from a bug: thousands of TIME_WAIT entries on a busy client is expected and benign, while a growing pile of CLOSE_WAIT is almost always your own code forgetting to close sockets. Misreading which is which sends you tuning the kernel when you should be fixing the application.

The connection lifecycle states you read off ss
LISTEN: awaiting SYN
SYN-RCVD: half-open
ESTABLISHED: data flowing
FIN-WAIT: teardown started
TIME_WAIT: 2×MSL hold

TIME_WAIT

TIME_WAIT is the state the active closer — the side that sent the first FIN — sits in after a connection closes, for twice the maximum segment lifetime (2×MSL), typically 60 seconds on Linux. It exists for a concrete reason: to hold the 5-tuple long enough that any delayed segment from the just-closed connection cannot be mistaken for data on a new connection reusing the same ports. It is correctness machinery, not waste.

It becomes a problem on a busy active closer (a load generator, or a forward proxy opening short connections to one backend), where TIME_WAIT entries accumulate and consume ephemeral ports until new connections fail. The fix is usually not to disable it. Enabling net.ipv4.tcp_tw_reuse lets the kernel reuse a TIME_WAIT tuple for a new outbound connection when TCP timestamps show it is safe to do so, and connection pooling avoids the churn entirely. The older tcp_tw_recycle should never be touched: it breaks clients behind NAT and was removed in Linux 4.12.

# count sockets by state: TIME_WAIT and CLOSE_WAIT at a glance
# (NR>1 skips the header row so "State" is not counted as a state)
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c
#   8423 TIME-WAIT   <- normal on a busy active closer
#    312 CLOSE-WAIT  <- suspicious: your app isn't calling close()

# safe reuse of TIME_WAIT tuples for new outbound connections (needs root)
sysctl -w net.ipv4.tcp_tw_reuse=1
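
To see how quickly TIME_WAIT can eat the ephemeral range, it helps to turn the port range into a connection-rate budget. The sketch below is only an estimate under stated assumptions: every connection goes to the same destination address and port, the local side closes first, each closed connection holds its port for the full 60 seconds, and tcp_tw_reuse is off. The helper name max_conn_rate is made up for illustration, and the range shown is the commonly seen Linux default, not necessarily yours.

```shell
# Hypothetical helper: rough ceiling on new connections per second to ONE
# destination before ephemeral ports run out.
#   usable ports / seconds each port is held in TIME_WAIT
max_conn_rate() {
  lo=$1; hi=$2; hold_secs=$3
  echo $(( (hi - lo + 1) / hold_secs ))
}

# Commonly seen default range; read the real one with:
#   sysctl net.ipv4.ip_local_port_range
max_conn_rate 32768 60999 60     # prints 470
```

A client that sustains more than a few hundred short connections per second to a single backend is therefore already near the limit, which is where pooling or tcp_tw_reuse starts to matter.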

Keepalives

An idle TCP connection gives neither end any sign that the peer is still there: if the peer vanishes, neither side knows until it tries to send. Keepalives are periodic empty probes that detect a dead peer, but they are off unless the application sets SO_KEEPALIVE on the socket, and Linux's default timers are nearly useless for this: the first probe waits tcp_keepalive_time of 7200 seconds (two hours) before it is even sent. A connection to a crashed peer can sit ESTABLISHED for hours.

The sharper problem is a dead NAT mapping. A firewall or NAT between you and the peer silently drops its state entry for an idle connection after a few minutes; your two-hour keepalive never fires in time to keep it alive, so the next send hits a black hole. For any long-lived connection through a NAT, lower tcp_keepalive_time to a few minutes so probes refresh the mapping and detect death long before the default would.
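
The arithmetic behind those timers is worth making explicit. As a rough sketch, the time to declare an idle peer dead is the idle wait plus one probe interval per unanswered probe. The helper keepalive_detect_secs is a made-up name, the tuned values (300/30/5) are illustrative assumptions rather than recommendations, and all of this applies only to sockets that have SO_KEEPALIVE enabled.

```shell
# Hypothetical helper: approximate seconds from the last traffic until the
# kernel gives up on a silent peer.
#   idle wait + (probe interval * number of unanswered probes)
keepalive_detect_secs() {
  idle=$1; intvl=$2; probes=$3
  echo $(( idle + intvl * probes ))
}

# Usual Linux defaults: time=7200, intvl=75, probes=9
keepalive_detect_secs 7200 75 9     # prints 7875 (over two hours)

# Example tuned values; keep the idle time below the NAT's idle timeout
keepalive_detect_secs 300 30 5      # prints 450

# To apply the example values system-wide (needs root):
#   sysctl -w net.ipv4.tcp_keepalive_time=300
#   sysctl -w net.ipv4.tcp_keepalive_intvl=30
#   sysctl -w net.ipv4.tcp_keepalive_probes=5
```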

Socket Buffers and Autotuning

The receive and send buffers — governed by net.ipv4.tcp_rmem and tcp_wmem — cap how large the windows can grow, and therefore the maximum throughput on a high-BDP path. Modern Linux autotunes them: each connection grows its buffer toward its measured bandwidth-delay product, up to the configured ceiling, so most workloads need no manual sizing at all.

Override only when you have measured a reason. The honest case is a long-fat path where the autotuning ceiling is below the BDP — raise the maximum so the window can open wide enough to fill the pipe. The dishonest case is cranking buffers globally "for performance," which wastes kernel memory across thousands of mostly-idle connections and can worsen latency by encouraging deeper queues. Measure the BDP, set the ceiling to match, and let autotuning do the rest.
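
"Measure the BDP" is a one-line calculation once you know the path's bandwidth and round-trip time. The sketch below uses a made-up helper, bdp_bytes, and an example path of 1 Gbit/s at 80 ms. The sysctl values in the comments are illustrative assumptions, not tuned recommendations, and the kernel reserves part of each buffer for bookkeeping, so treat the BDP as a lower bound for the ceiling.

```shell
# Hypothetical helper: bandwidth-delay product in bytes.
#   Mbit/s * 1,000,000 / 8 bits per byte * ms / 1000  ==  mbps * rtt_ms * 125
bdp_bytes() {
  mbps=$1; rtt_ms=$2
  echo $(( mbps * rtt_ms * 125 ))
}

bdp_bytes 1000 80     # 1 Gbit/s at 80 ms RTT: prints 10000000 (about 10 MB)

# Compare with the third field (the autotuning ceiling) of:
#   sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
# If the ceiling is below the BDP, raise only the maximum (needs root; the
# first two fields below are assumed values, keep whatever you already have):
#   sysctl -w net.ipv4.tcp_rmem="4096 131072 16777216"
#   sysctl -w net.ipv4.tcp_wmem="4096 16384 16777216"
```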

Connection Reuse

Every new connection pays the handshake's one round trip — plus one or two more for TLS — before a single byte of payload moves. Connection reuse amortizes that cost: HTTP keep-alive holds a connection open for subsequent requests, and a connection pool keeps a set of warm connections that requests borrow and return. A request served on a pooled connection skips the handshake entirely.

Reuse also cures the lifecycle problems above at the source. A pooled connection stays open across many requests, so it holds one ephemeral port for its whole life instead of burning a new port, and leaving a new TIME_WAIT entry, on every request. That is why pooling, not tcp_tw_reuse, is the real fix for port exhaustion on a busy client. The whole cluster of TCP performance complaints (handshake latency, TIME_WAIT pileup, ephemeral exhaustion) collapses to one practice: stop opening a fresh connection per request.
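
A back-of-the-envelope comparison shows how much setup time reuse removes. This is a sketch under simple assumptions: each new connection costs one round trip for TCP plus a fixed number of extra round trips for TLS, and nothing else is counted. The helper setup_cost_ms and the workload numbers (1000 requests, 20 ms RTT, a pool of 4) are invented for illustration.

```shell
# Hypothetical helper: total milliseconds spent on connection setup.
#   connections * RTT * (1 TCP round trip + extra TLS round trips)
setup_cost_ms() {
  connections=$1; rtt_ms=$2; tls_rtts=$3
  echo $(( connections * rtt_ms * (1 + tls_rtts) ))
}

# 1000 requests at 20 ms RTT, one extra round trip for TLS:
setup_cost_ms 1000 20 1   # fresh connection per request: prints 40000
setup_cost_ms 4 20 1      # pool of 4 warm connections:    prints 160
```

The same ratio applies to ports: the per-request client uses up to 1000 ephemeral ports for that workload, while the pooled client uses 4.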

TIME_WAIT vs CLOSE_WAIT

TIME_WAIT is on the side that closed first (the active closer) and is normal: the kernel holds the 5-tuple for 2×MSL so stale segments cannot pollute a new connection. Thousands of them on a busy client are expected; the fix, if ports run short, is pooling or tcp_tw_reuse, not panic.

CLOSE_WAIT is on the side whose peer closed first and which has not yet called close() itself. A growing pile of CLOSE_WAIT is almost always a bug in your application — it is leaking sockets it should have closed — and no kernel setting fixes it. The cure is in your code, not in sysctl.
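
Because CLOSE_WAIT points at a specific program, the useful next step is finding out which one. The sketch below assumes that ss, when run with -p, prints the owning process in the form users:(("name",pid=N,fd=M)), and that root is available so other users' processes are visible. The function name close_wait_by_process is made up, and it only parses text on stdin, so it can be tried against saved output.

```shell
# Hypothetical helper: read `ss -p` output on stdin and count sockets per
# owning process, busiest first. Output columns: count, name, pid.
close_wait_by_process() {
  grep -o 'users:(("[^"]*",pid=[0-9]*' |
    sed 's/users:(("//; s/",pid=/ /' |
    sort | uniq -c | sort -rn
}

# Typical use (as root; -H drops the header on recent iproute2 versions):
#   ss -Htanp state close-wait | close_wait_by_process
```

The process at the top of the list is the one to inspect for a code path that reads end-of-stream from the peer and never calls close().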

Common Mistakes
  • Blaming the network for a CLOSE_WAIT pile-up. CLOSE_WAIT means your application received the peer's FIN and never called close(); it is a socket leak in your code that will exhaust file descriptors, and no sysctl repairs it.
  • Opening a fresh connection per request on a busy client. Without reuse, every request pays the handshake RTT and leaves a TIME_WAIT, draining ephemeral ports until connects fail with EADDRNOTAVAIL.
  • Cranking socket buffers without measuring the BDP. Oversized fixed buffers waste kernel memory across idle connections and can deepen queues, adding latency, where autotuning would have sized each connection correctly.
  • Leaving keepalive timers at the two-hour default behind a NAT. The NAT drops its idle mapping in minutes while your first probe waits 7200 seconds, so the connection silently black-holes long before TCP notices the peer is gone.
  • Enabling tcp_tw_recycle to clear TIME_WAIT. It breaks connections from clients behind NAT by rejecting their timestamps, causing intermittent failures that are brutal to diagnose; it was removed from the kernel for exactly this reason.
Best Practices
  • Pool and reuse connections on any client that talks repeatedly to a backend, amortizing the handshake and eliminating the TIME_WAIT churn that exhausts ports.
  • Treat a rising CLOSE_WAIT count as an application bug and fix the missing close() in your code, rather than reaching for any kernel tunable.
  • Lower tcp_keepalive_time to a few minutes on long-lived connections through a NAT, so probes refresh the mapping and detect a dead peer before the two-hour default ever fires.
  • Set tcp_tw_reuse on a busy active closer to safely recycle TIME_WAIT tuples, but never tcp_tw_recycle, which breaks clients behind NAT.
  • Size buffer ceilings to the measured bandwidth-delay product and let autotuning fill in, instead of pinning large static buffers that waste memory and inflate latency.
Comparable concepts: connection pooling (app/proxy layer), conntrack table limits

Knowledge Check

A service accumulates a growing number of sockets in CLOSE_WAIT. Where is the fix?

  • In the application code, which got the peer's FIN but never called close()
  • In the kernel configuration, by lowering the TIME_WAIT timer so the stuck sockets are reclaimed sooner
  • In the network, since CLOSE_WAIT means the peer's close packets are being dropped
  • In a sysctl, by turning on tcp_tw_reuse so the CLOSE_WAIT entries are recycled

Why does TIME_WAIT exist on the side that closes a connection first?

  • To finish flushing any unsent application data before the socket is released
  • To hold the 5-tuple for 2×MSL so stale segments cannot pollute a new connection
  • To wait for the remote peer to confirm that it is safe to reuse this connection's ports
  • To keep probing the peer with keepalives until it acknowledges the close

A busy client opening a new connection per request to one backend starts failing to connect. What addresses the root cause?

  • Enlarging the listen backlog queue so the client can hold more pending half-open connections
  • Enabling tcp_tw_recycle so TIME_WAIT entries are cleared aggressively and immediately
  • Connection pooling, so warm connections are reused instead of churned
  • Increasing the socket send and receive buffers so each connection carries more data
