Topic 75

Common Anti-Patterns

Most network outages are not exotic. They are the same handful of mistakes repeated across teams and stacks: the flat network with no segmentation, the single DNS resolver, the firewall rule that defaults to allow, the subnet sized with no room to grow, the ICMP someone blocked "for security." Each is cheap to avoid at design time and expensive to suffer in production — and each maps cleanly back to a mechanism covered earlier in this course.

This topic is the field guide. The four sections below group the recurring anti-patterns by family — addressing and routing, security, reliability, and the silent killers — and for each one names the specific consequence and the fix, pointing at the chapter that explains it in depth. Read it as a checklist of things to recognize on a whiteboard before they become a 3 a.m. page.

Addressing and Routing Anti-Patterns

The most expensive addressing mistake is the overlapping CIDR. Two VPCs independently carved from 10.0.0.0/16 cannot be peered, and an on-prem network that reuses the same range cannot be routed to without renumbering one side, which is a project rather than a config change (CIDR planning, Topic 13). The fix is a single authoritative IP plan allocated top-down before the first subnet is created.
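An overlap check is easy to automate against a draft IP plan. The following is a minimal sketch in POSIX shell, assuming IPv4 only and well-formed input; the function names (ip_to_int, cidr_range, cidrs_overlap) and the example ranges are illustrative, not part of any standard tool.

```shell
#!/bin/sh
# Sketch: flag overlapping IPv4 CIDR blocks in a draft IP plan.
# Function names are illustrative; they do not come from any standard tool.

# ip_to_int A.B.C.D -> 32-bit integer
ip_to_int() {
    set -- $(echo "$1" | tr '.' ' ')
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# cidr_range A.B.C.D/P -> "first last" (integer bounds of the block)
cidr_range() {
    ip=$(ip_to_int "${1%/*}")
    prefix=${1#*/}
    size=$(( 1 << (32 - prefix) ))
    first=$(( ip - ip % size ))        # align down to the network address
    echo "$first $(( first + size - 1 ))"
}

# cidrs_overlap CIDR_A CIDR_B -> exit status 0 if the blocks share any address
cidrs_overlap() {
    set -- $(cidr_range "$1") $(cidr_range "$2")
    [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}

# Hypothetical plan: one VPC range checked against two candidate ranges.
for candidate in 10.0.128.0/20 10.1.0.0/16; do
    if cidrs_overlap 10.0.0.0/16 "$candidate"; then
        echo "OVERLAP 10.0.0.0/16 $candidate"
    else
        echo "ok      10.0.0.0/16 $candidate"
    fi
done
```

Run against every pair in the plan before a range is handed out, a check like this catches the collision while it is still a one-line change to a document.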

Two more belong here. No growth headroom — sizing a subnet to today's host count so it fills and forces a renumber (Topic 13). And the flat layer-2 network, one giant broadcast domain where every host can reach every other: broadcast traffic scales with host count, a single misbehaving NIC floods everyone, and there is no segmentation boundary to contain a compromise (Topic 53). The fix for both is to subnet deliberately, with room to grow and a routing boundary between tiers.
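The headroom arithmetic can be made explicit. This is a rough sketch, assuming IPv4 and the classic two reserved addresses per subnet (network and broadcast); some cloud providers reserve more, so the reserved count is a parameter. The helper names usable_hosts and prefix_for are hypothetical.

```shell
#!/bin/sh
# Sketch: pick a subnet size that leaves growth headroom (IPv4 only).

# usable_hosts PREFIX [RESERVED] -> addresses left after reserved ones
# RESERVED defaults to 2 (network + broadcast); this default is an assumption.
usable_hosts() {
    reserved=${2:-2}
    echo $(( (1 << (32 - $1)) - reserved ))
}

# prefix_for HOSTS GROWTH_PERCENT [RESERVED] -> longest prefix that still fits
prefix_for() {
    reserved=${3:-2}
    need=$(( $1 + ($1 * $2 + 99) / 100 ))   # hosts plus headroom, rounded up
    prefix=30
    while [ "$prefix" -gt 0 ] &&
          [ "$(usable_hosts "$prefix" "$reserved")" -lt "$need" ]; do
        prefix=$(( prefix - 1 ))             # one bit shorter doubles the subnet
    done
    echo "$prefix"
}

echo "200 hosts, no headroom:   /$(prefix_for 200 0)"
echo "200 hosts, 100% headroom: /$(prefix_for 200 100)"
```

The difference between the two answers is a single prefix bit, which is the whole cost of avoiding a renumber later.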

Security Anti-Patterns

Security anti-patterns share a theme: trusting by default. Default-allow egress lets a compromised host phone home and exfiltrate freely, because nobody filtered outbound — the inbound rules were the only ones anyone wrote (default-deny, Topic 49). Perimeter-only trust assumes anything inside the network is friendly, so a single breached host has lateral run of a flat internal network with no further checks.

Public-by-default is the cloud version: instances assigned public IPs and databases reachable from the internet because that was the path of least resistance, leaving admin ports exposed to the entire address space. Each of these is closed by the same discipline — default-deny in both directions, private-by-default addressing, and identity-based authentication between services rather than implicit network trust (the hardening checklist, Topic 76).
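As one possible illustration of default-deny in both directions, the sketch below prints a ruleset in iptables-restore format rather than applying it. Every address and port in it is an assumption made up for the example: 10.0.0.2 as the internal resolver, 192.0.2.10 as the single permitted HTTPS destination, and 443 as the only inbound service.

```shell
#!/bin/sh
# Sketch: emit (not apply) a default-deny ruleset in iptables-restore format.
# All addresses and ports below are hypothetical placeholders.

emit_default_deny() {
    cat <<'EOF'
*filter
# Default policy: drop in every direction, then open only what is needed.
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT DROP [0:0]
-A INPUT -i lo -j ACCEPT
-A OUTPUT -o lo -j ACCEPT
-A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
-A OUTPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# Keep Path MTU Discovery working (ICMP type 3, code 4).
-A INPUT -p icmp --icmp-type fragmentation-needed -j ACCEPT
# Inbound: the one service this tier exposes.
-A INPUT -p tcp --dport 443 -j ACCEPT
# Egress: DNS to the internal resolver, HTTPS to one known destination.
-A OUTPUT -p udp -d 10.0.0.2 --dport 53 -j ACCEPT
-A OUTPUT -p tcp -d 192.0.2.10 --dport 443 -j ACCEPT
COMMIT
EOF
}

emit_default_deny
```

The output can be saved to a file and checked with iptables-restore --test before it is loaded. Note that this covers only the packet-filter part of the discipline; identity-based authentication between services is a separate layer.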

Reliability Anti-Patterns

The reliability anti-patterns are the single-of-everything family. A single DNS provider is a single point of failure (SPOF) that no amount of multi-region backend redundancy can compensate for: when it goes down, every region behind it goes dark together (designing for failure, Topic 74). A single upstream transit is the same trap one layer down.

Untested failover is the one that bites hardest — a standby that has never carried live traffic is a hypothesis, and the config has usually drifted or the DNS TTL is too long to fail over in time. No connection draining rounds out the family: removing an instance without draining resets every in-flight request, so each deploy and each health-check eviction becomes a visible blip. Run a second DNS provider, exercise failover on a schedule, and drain on every removal.
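Whether a TTL is "too long to fail over in time" can be estimated before the drill. The sketch below uses a deliberately simplified model, stated here as an assumption: worst-case client impact is the time to detect the failure plus one full TTL. It ignores resolvers and clients that cache beyond the TTL, and all the numbers are hypothetical.

```shell
#!/bin/sh
# Sketch: rough worst-case time for a DNS-based failover to reach clients.
# Simplified model (an assumption): detection time plus one full TTL.

# worst_case_failover INTERVAL_S FAILURES_TO_EVICT TTL_S -> seconds
worst_case_failover() {
    echo $(( $1 * $2 + $3 ))
}

# within_budget INTERVAL_S FAILURES_TO_EVICT TTL_S BUDGET_S -> exit 0 if it fits
within_budget() {
    [ "$(worst_case_failover "$1" "$2" "$3")" -le "$4" ]
}

# Hypothetical numbers: 10 s health checks, 3 failures to evict, 120 s budget.
for ttl in 300 60; do
    if within_budget 10 3 "$ttl" 120; then
        verdict="fits"
    else
        verdict="too slow"
    fi
    echo "TTL ${ttl}s: worst case $(worst_case_failover 10 3 "$ttl")s, $verdict"
done
```

A scheduled failover exercise is still the only real proof; the arithmetic just tells you in advance whether the drill has any chance of passing.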

The Silent Killers

The worst anti-patterns produce no error message — they degrade silently and intermittently. Blocked ICMP is the canonical one: a firewall that drops ICMP "type 3, code 4" (fragmentation needed) breaks Path MTU Discovery, so large packets vanish into a black hole while small ones sail through, and the connection hangs only on big payloads (MTU/PMTUD, Topic 68). The fix is one line: allow PMTUD ICMP through every firewall.
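A suspected PMTUD black hole can be probed by hand with Don't-Fragment pings of known size. The sketch below only computes the sizes and prints the probe command; it assumes IPv4 headers without options and the Linux iputils ping, and host.example is a placeholder.

```shell
#!/bin/sh
# Sketch: sizes for probing Path MTU by hand (IPv4, no IP options assumed).

IPV4_HEADER=20   # bytes, without options
ICMP_HEADER=8    # bytes, echo request header
TCP_HEADER=20    # bytes, without options

# Largest ICMP echo payload that fits in one unfragmented packet at this MTU.
icmp_payload_for_mtu() { echo $(( $1 - IPV4_HEADER - ICMP_HEADER )); }

# TCP MSS that corresponds to this MTU.
tcp_mss_for_mtu() { echo $(( $1 - IPV4_HEADER - TCP_HEADER )); }

# Print (not run) a Linux iputils probe. -M do sets Don't Fragment, so an
# oversized probe must be rejected by the path instead of being fragmented.
probe_command() {   # args: HOST MTU
    echo "ping -M do -c 3 -s $(icmp_payload_for_mtu "$2") $1"
}

# host.example is a placeholder; substitute a real destination.
probe_command host.example 1500
probe_command host.example 1400
```

If the larger probe simply times out with no ICMP error coming back while the smaller one is answered, a filtered fragmentation-needed message somewhere on the path is a likely suspect.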

The other silent killers are saturation you are not watching. An unmonitored conntrack table fills under a connection flood and starts dropping new flows; the kernel log notes "nf_conntrack: table full, dropping packet", but nothing appears on the application side (Topic 49). Ephemeral-port exhaustion on a busy NAT or proxy caps new outbound connections the same invisible way. And ignored TTLs, meaning DNS records whose TTLs are too long for resolvers to converge, mean a failover or correction takes effect long after you made it. Each is invisible until you put a graph and an alert on it.

# the silent killers, made visible — watch these before they drop traffic
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
# count / max — if count nears max, new flows are dropped silently
sysctl net.ipv4.ip_local_port_range   # ephemeral pool for outbound conns
# confirm PMTUD ICMP is allowed, not dropped at the firewall
iptables -L INPUT -n | grep -i icmp
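The raw counters above become useful once they are turned into a ratio with a threshold. This is a minimal sketch, assuming a Linux host with the nf_conntrack module loaded; the 80% warning level is an arbitrary assumption, and pct_used and port_pool_size are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: turn the raw counters into numbers worth alerting on.

WARN_PCT=80   # assumed threshold; tune to your own traffic

# pct_used COUNT MAX -> integer percentage (rounded down)
pct_used() { echo $(( $1 * 100 / $2 )); }

# port_pool_size LOW HIGH -> number of ephemeral ports in the range
port_pool_size() { echo $(( $2 - $1 + 1 )); }

ct=/proc/sys/net/netfilter
if [ -r "$ct/nf_conntrack_count" ] && [ -r "$ct/nf_conntrack_max" ]; then
    count=$(cat "$ct/nf_conntrack_count")
    max=$(cat "$ct/nf_conntrack_max")
    if [ "$max" -gt 0 ]; then
        pct=$(pct_used "$count" "$max")
        echo "conntrack: $count / $max (${pct}%)"
        if [ "$pct" -ge "$WARN_PCT" ]; then
            echo "WARN: conntrack table above ${WARN_PCT}%"
        fi
    fi
fi

range=/proc/sys/net/ipv4/ip_local_port_range
if [ -r "$range" ]; then
    read -r low high < "$range"
    echo "ephemeral ports: $(port_pool_size "$low" "$high") ($low-$high)"
fi
```

In practice these values would be exported to whatever monitoring system is already in place rather than printed, so that the trend is graphed and the alert fires before the table is full.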

The four anti-pattern families

Anti-patterns: cheap to avoid at design, expensive in prod
Addressing & routing: overlapping CIDRs, no growth headroom, flat L2
Security: default-allow egress, perimeter-only trust, public-by-default
Reliability: single DNS, single transit, untested failover, no draining
Silent killers: blocked ICMP, full conntrack, port exhaustion, long TTLs

"Works in Dev" vs "Survives Prod"

Works in dev rests on four quiet assumptions: one host so no segmentation is needed, small payloads that never trip an MTU limit, idle links that never queue, and a single region so DNS and failover are never exercised. Every one of them holds right up to the moment real traffic arrives.

Survives prod is what is left when those assumptions break: many hosts demand a real IP plan and segmentation, large payloads demand working PMTUD, busy links demand conntrack and bandwidth headroom, and multiple regions demand a second DNS provider and tested failover. The anti-patterns in this topic are precisely the places where a dev-grade assumption silently becomes a prod-grade outage.

Common Mistakes

Overlapping CIDR ranges across VPCs or with on-prem (Topic 13) — peering and routing refuse to establish, and there is no fix short of renumbering one side after it is already in production.
Default-allow egress (Topic 49) — a compromised host exfiltrates and phones home freely because only inbound rules were ever written, making the breach an open outbound channel.
A single DNS provider as the global SPOF (Topic 74) — its outage darkens every region at once, and multi-region backend redundancy behind it buys nothing.
Blocked PMTUD ICMP (Topic 68) — dropping "fragmentation needed" black-holes large packets while small ones pass, so connections hang only on big payloads with no error logged.
Unmonitored conntrack or ephemeral-port saturation (Topic 49) — the table or port pool fills under load and silently drops new connections, surfacing as intermittent failures with nothing in the app logs.

Best Practices

Allocate one authoritative IP plan top-down before any subnet exists, leaving growth headroom, so overlaps and renumbers (Topic 13) never happen.
Default-deny in both directions and address privately by default (Topic 49), then open only the specific ports and egress destinations each tier actually needs.
Run a second DNS provider and a second transit upstream (Topic 74), and exercise failover on a schedule so the standby is proven and the TTLs are short enough to converge.
Allow PMTUD ICMP "type 3, code 4" through every firewall (Topic 68), and prefer it over clamping MSS so large payloads are never silently black-holed.
Graph and alert on conntrack count, ephemeral-port usage, and link utilization (Topic 49) so the silent saturation killers become visible long before they drop traffic.

Comparable concepts

CIDR planning (Topic 13)
Default-deny (Topic 49)
PMTUD / MTU (Topic 68)
Conntrack saturation (Topic 49)

Knowledge Check

A connection works for small requests but hangs whenever the payload is large, with nothing logged. Which anti-pattern is the most likely cause?

A firewall dropping PMTUD ICMP, so oversized packets are black-holed
Overlapping CIDR ranges between the client network and the server network breaking the route entirely
A default-allow egress rule letting the large packets leak out
A saturated conntrack table dropping the bigger flows first

Why does a single DNS provider undermine an otherwise multi-region, multi-replica deployment?

It is one failure domain across all regions, so its outage takes every region down at once
It slows DNS resolution everywhere because every one of the regions has to query the same shared set of nameservers
It forces the regions to share overlapping CIDR ranges
It gives every region anycast resilience automatically

A design assumes a single host, small payloads, idle links, and one region. Which prod reality breaks the "idle links" assumption specifically?

Busy links queue and saturate state tables, so conntrack and bandwidth headroom now matter
Large payloads trip an MTU limit somewhere on the path and get silently black-holed by a firewall that blocks PMTUD
Multiple regions require a second DNS provider and tested failover
Many hosts demand segmentation and a real IP plan
