Topic 64

The Layer-by-Layer Method

Method

Network troubleshooting without a method is guessing, and guessing under pressure is how a small outage becomes a long one. The reliable approach walks the stack one layer at a time — is the link up? does the host have an IP? can it route off the subnet? does DNS resolve? does the target port answer? does the application respond? — and stops the moment a layer fails the test. Each question is cheap to answer and rules out everything below it, so you converge on the broken layer in a handful of steps instead of randomly poking at config.

The payoff is that "the network is down" almost always resolves to one specific layer, and the method finds which one without you having to know the answer in advance. A senior engineer is not faster because they know more commands; they are faster because they isolate before they touch. The two failure modes that wreck a diagnosis are skipping a layer you assumed was fine and changing several things at once so you can never tell what helped. Both are failures of discipline, not of knowledge.

Bottom-Up versus Top-Down

Bottom-up starts at layer 1 and climbs: confirm the cable or link is up, then the IP and subnet mask, then the default gateway, then a route off-subnet, then DNS, then the port, then the application. It never assumes a foundation; it proves each one. This is the right default when nothing is known — a host that "can't reach anything," a freshly provisioned box, a site that just lost connectivity — because the foundation is exactly where those failures live.

Top-down starts at the symptom and works down only as far as needed. If a single API call returns a TLS error while everything else on the host works, you do not re-verify the cable — you start near the top and descend. Top-down is faster when most of the stack is provably healthy and the symptom is specific; bottom-up is faster when the symptom is broad and the foundation is in doubt. Picking the wrong direction wastes time but never gives a wrong answer — the method is self-correcting.
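The top-down walk can be sketched in POSIX shell. This is a minimal illustration under stated assumptions: `top_down` is a hypothetical helper, not a standard tool, and each argument is assumed to be the name of a command or shell function that exits 0 when its layer is healthy. It also leans on the premise that a passing test rules out everything below it.

```sh
# Walk from the symptom downward.
# Arguments are check names ordered from the top of the stack
# (application) toward the bottom (link). The walk stops at the first
# check that passes and reports the lowest check that failed.
top_down() {
    lowest_failed=""
    for check in "$@"; do
        if "$check" >/dev/null 2>&1; then
            break                   # this layer works, so stop descending
        fi
        lowest_failed=$check        # still failing: keep going down
    done
    if [ -n "$lowest_failed" ]; then
        echo "$lowest_failed"       # the layer to investigate first
        return 1
    fi
    echo "no failure found"
    return 0
}
```

Called with hypothetical checks such as `top_down check_app check_tls check_port check_link`, it would print `check_tls` when the application and TLS checks fail but the port check passes, without ever running the link check.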

The Isolation Checklist

The checklist is the same eight steps every time, in order, each with one command and one yes/no answer. Link up. Has an IP. Can reach the gateway. Has a route to the destination. DNS resolves the name. The destination port answers. TLS negotiates. The application replies correctly. The first no is your layer — everything above it is irrelevant until you fix it, and everything below it is already proven.

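# walk the layers bottom-up, running the commands in order; stop at the first failure
ip link show eth0                # L1/L2: state UP, not NO-CARRIER?
ip addr show eth0                # L3: an address with the right /prefix?
ping -c1 10.0.0.1                # gateway reachable on the local subnet?
ip route get 93.184.216.34       # a route off-subnet, via which gateway?
getent hosts api.example.com     # does the name resolve the way the app sees it?
nc -vz api.example.com 443       # does the port actually answer?
openssl s_client -connect api.example.com:443 -servername api.example.com </dev/null   # does TLS negotiate?
curl -sS -o /dev/null -w '%{http_code}\n' https://api.example.com/   # does the app reply? (URL path is a placeholder)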

The order is not arbitrary: each step depends on the one before it. There is no point testing DNS on a host with no route, and no point testing a port on a name that does not resolve. Run the commands in the order listed and the failure announces itself: the command that returns the first no names the layer, and you have spent maybe ninety seconds.
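The stop-at-the-first-no rule can be sketched as a small POSIX shell function. This is an illustration, not a standard tool: `first_failure` and the `check_*` functions are hypothetical names, and the interface (`eth0`), gateway (`10.0.0.1`), and host (`api.example.com`) are placeholders carried over from the commands above. The `ip` and `ping -W` usage assumes a Linux host.

```sh
# Run each check in order and stop at the first one that fails.
# Each argument is the name of a command or shell function that
# exits 0 for "yes" and non-zero for "no".
first_failure() {
    for check in "$@"; do
        if ! "$check" >/dev/null 2>&1; then
            echo "$check"           # the first "no" names the layer
            return 1
        fi
    done
    echo "all checks passed"
    return 0
}

# Hypothetical checks, one per layer. Defining them runs nothing.
check_link()    { ip link show eth0 | grep -q 'state UP'; }
check_address() { ip -4 addr show eth0 | grep -q 'inet '; }
check_gateway() { ping -c1 -W2 10.0.0.1; }
check_route()   { ip route get 93.184.216.34; }
check_dns()     { getent hosts api.example.com; }
check_port()    { nc -z -w2 api.example.com 443; }
```

A run would look like `first_failure check_link check_address check_gateway check_route check_dns check_port`. Because the loop returns at the first failing check, later checks are never executed, which mirrors the rule that everything above a broken layer is irrelevant until that layer is fixed.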

Bisection

When the path is long — client, local switch, firewall, WAN, remote load balancer, server — testing only from the client tells you it fails, not where. Bisection tests from both ends and the middle to cut the search space in half each time. Ping the destination from the client; if that fails, ping it from a host on the same segment as the server. If the second works, the problem is between the two test points, not at the server — you have eliminated half the path with one probe.

The same logic applies to layers, not just geography. If a curl to the service fails, try the raw TCP connect with nc; if that succeeds, the network path to that port is fine and the problem is in TLS or the application above it. Each bisecting test answers "is the fault above or below, nearer or farther than here," and a path of ten hops collapses to the guilty one in three or four probes. The trap is testing only from where you happen to be sitting: a single vantage point tells you that something fails, not where.
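The halving logic can be sketched as a binary search over an ordered list of test points. This is a minimal sketch under stated assumptions: `bisect_path` and `nth` are hypothetical helpers, the hops are listed nearest-first, the last hop has already been seen to fail, and the probe succeeds for every hop before the fault and fails for every hop at or beyond it. Real paths can violate that last assumption, for example when an intermediate device drops ICMP. The probe itself is left to the caller; it might ping the hop from the client or run a test from that hop over SSH.

```sh
# Print the Nth of the remaining arguments (1-based).
nth() { n=$1; shift; shift "$((n - 1))"; echo "$1"; }

# Print the first hop that the probe cannot reach.
# Usage: bisect_path PROBE HOP...   (hops ordered nearest-first)
bisect_path() {
    probe=$1; shift
    lo=1            # lowest index that could be the first failing hop
    hi=$#           # highest index; assumed to fail (that is the symptom)
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        hop=$(nth "$mid" "$@")
        if "$probe" "$hop" >/dev/null 2>&1; then
            lo=$((mid + 1))     # reachable: the fault is farther away
        else
            hi=$mid             # unreachable: the fault is here or nearer
        fi
    done
    nth "$lo" "$@"
}
```

With six hops such as `client switch firewall wan lb server`, the loop needs at most three probes to name the first unreachable hop, because each probe discards half of the remaining candidates.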

Changing One Thing at a Time

Once you suspect a cause, the discipline that keeps diagnosis from becoming damage is changing exactly one thing, then re-testing. Flip a firewall rule, lower an MTU, swap a DNS server — one change, observe, decide. Change three things and the symptom clears, and you have learned nothing: you cannot say which fix mattered, you cannot safely revert, and you have likely introduced a second problem masked by the first.

Write down the baseline before you touch anything and revert each change that does not help before trying the next. This feels slow and is faster, because every change you leave in place is a variable in the next test. The engineers who resolve incidents cleanly are not the ones who try the most things — they are the ones who try one thing, prove it, and keep a trail they can walk back.
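The change, re-test, revert loop can be sketched as a shell function. This is an illustrative sketch, not an established tool: `try_change` is a hypothetical helper, and its three arguments are assumed to be names of commands or shell functions supplied by the caller, one that applies a single change, one that undoes it, and one that re-runs the failing test.

```sh
# Apply one change, re-test, and revert the change if the test still fails.
# Usage: try_change APPLY REVERT CHECK
try_change() {
    apply=$1; revert=$2; check=$3
    "$apply" || { echo "apply failed"; return 2; }
    if "$check" >/dev/null 2>&1; then
        echo "kept"         # symptom cleared with exactly one change in place
        return 0
    fi
    "$revert"               # did not help: restore the baseline before the next try
    echo "reverted"
    return 1
}
```

As a hypothetical use, the apply step could wrap `ip link set dev eth0 mtu 1400`, the revert step could restore the MTU value recorded in the baseline, and the check could be the curl or nc test that was failing. Because an unhelpful change is always undone, each new attempt starts from the recorded baseline with a single unknown.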

The isolation checklist: the first “no” names your layer

Link up? → Has IP? → Route? → DNS resolves? → Port open? → App responds?

Bottom-Up vs Top-Down Troubleshooting

Bottom-up verifies the foundation first (link, IP, gateway, route) before looking at names or apps. It assumes nothing and is the right default when the symptom is broad ("can't reach anything") or the host is freshly provisioned, because the foundation is exactly where such failures live.

Top-down starts at the symptom and descends only as far as the evidence forces. It is faster when most of the stack is provably healthy and the failure is narrow — one TLS error, one slow endpoint — so re-checking the cable would be wasted motion. Pick by how much of the stack is already in doubt.

Common Mistakes

Declaring "the network is down" without isolating a layer. Until you can name the failing step — no link, no route, no DNS, no port — you have a symptom, not a diagnosis, and you will hand the wrong team a ticket they cannot act on.
Changing several things at once. The symptom clears and you cannot tell which change fixed it, cannot revert cleanly, and have probably added a latent second fault that surfaces the next time.
Skipping DNS as a suspect. Name resolution causes a disproportionate share of "it's broken" incidents — a stale record, a wrong search domain, a dead resolver — yet it is the step people assume works and never test.
Testing from only one vantage point. A single failing probe from your laptop cannot tell you whether the fault is at the client, in the middle, or at the server; without a test from the far end you are guessing at location.
Trusting an assumption instead of a test. "The firewall is fine, nobody changed it" is a belief; an nc -vz to the port is evidence, and the belief is wrong often enough to cost you an hour.

Best Practices

Walk the checklist in order — link, address, gateway, route, DNS, port, TLS, app — and stop at the first failure, because everything above a broken layer is noise until that layer is fixed.
Bisect a long path by testing from both ends and the middle, so each probe halves the search space and a ten-hop path collapses to the guilty hop in three or four tests.
Change one variable, re-test, and revert it if it did not help before trying the next, so every test has exactly one unknown and you keep a trail you can walk back.
Suspect DNS early with getent hosts or dig, since name resolution causes far more outages than its share of the stack would suggest and is the cheapest layer to rule out.
Record the baseline state before touching anything — current routes, rules, and timings — so you can prove what you changed and restore it exactly when the incident is over.

Comparable concepts: OSI model as a checklist (Topic 02); USE method (utilization/saturation/errors); RED method (rate/errors/duration)

Knowledge Check

A freshly provisioned host reports it "can't reach anything." Which approach localizes the fault fastest, and why?

Bottom-up, because a broad symptom on a new box usually lives in the foundation — link, IP, or route
Top-down, starting at the application and the symptom and then descending one layer at a time only after each upper layer is explicitly disproven by a test
Change several likely settings at once to clear it quickly, then narrow down
Reboot the host first, since a fresh boot resets every layer at once

Why is changing one thing at a time worth the apparent slowness during an incident?

Each test then has a single unknown, so the result is attributable and cleanly reversible
A single change can never break anything else on the system
Because the very first change an experienced engineer reaches for is statistically the one most likely to resolve the underlying incident
It is always faster in raw wall-clock time than batching changes

A curl to a service times out from your laptop. Pinging the server from a host on the server's own subnet succeeds. What has bisection told you?

The fault is somewhere between your laptop and the server's segment, not at the server
The server's application port is confirmed open and healthy
DNS resolution is conclusively proven to be the single broken layer responsible for the curl timeout you observed
The entire path end to end is proven healthy
