Topic 72

Network Troubleshooting

Foundations

Network troubleshooting on Linux is the discipline of walking a connection failure down the stack — from "is the link up" to "did the application's TCP handshake complete" — using a small set of tools that each answer one layer's question. The modern Debian and Ubuntu baseline is the iproute2 suite (ip, ss), not the deprecated net-tools commands (ifconfig, netstat, route) that older guides still reach for. The old commands are not even installed on a fresh Ubuntu Server image; ip and ss always are.

The operational cost of a bad mental model is wasted downtime. An engineer who pings a host, sees no reply, and declares "the network is down" has skipped four other explanations: a firewall dropping ICMP, an MTU mismatch black-holing large packets, a DNS answer pointing at the wrong address, or a service that simply is not listening on the port. Each layer has its own tool, and naming the layer that actually failed turns a two-hour outage into a five-minute fix.

The Layered Method

Work bottom-up and stop at the first layer that is broken. Layer 1 and 2: is the link up and is there a MAC neighbor? Layer 3: does the host have an IP, a route to the destination, and can it reach the gateway? Layer 4: is a process listening locally and is the remote port open? Layer 7: does the application protocol — HTTP, TLS, DNS — actually complete? Reading TLS errors from curl is a waste of time when the default route is missing, because the connection never left the box.

On Debian and Ubuntu, ip answers the layer-2 and layer-3 questions and ss answers the layer-4 ones. The two together replace the net-tools commands. ss reads socket state from the kernel over netlink instead of parsing text files under /proc/net, so it stays fast on machines with tens of thousands of sockets, and both tools report state (link flags, route metrics, socket timers, retransmit counts) that ifconfig and netstat expose only partly or not at all.

# Layer 1-3: link, addresses, routes, neighbor (ARP/NDP) table
ip -br link            # brief view: UP/DOWN state of each interface
ip -br addr            # brief view: addresses per interface
ip route                # routing table; check for "default via"
ip neigh                # neighbor cache; FAILED here = no L2 reachability

# Layer 4: who is listening, who is connected
ss -tlnp              # TCP, Listening, Numeric, with Process (needs root)
ss -tn state established # live established TCP flows
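The "stop at the first broken layer" rule can be scripted for the most common layer-3 failure, a missing default route. This is a minimal sketch, assuming the usual iproute2 text output in which the default route line begins with the word default; the helper name has_default_route is hypothetical.

```shell
# has_default_route: read `ip route` output on stdin and succeed only if
# a default route is present. Hypothetical helper; assumes the usual
# iproute2 format, where the line starts with "default".
has_default_route() {
    grep -q '^default '
}

# Usage (not run here):
#   ip route | has_default_route || echo "no default route: fix layer 3 first"
```

Reading from stdin keeps the check usable against saved output from an incident, not only against the live routing table.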

Reachability: ping, traceroute, MTR

ping tests one thing — round-trip ICMP echo to one address — and a missing reply has several causes, not one. Many hosts and almost every cloud or corporate firewall drop ICMP by policy, so silence does not prove the host is down; it proves only that ICMP echo did not return. The signal worth reading is in the numbers: ping reports per-packet round-trip time and a packet-loss percentage, and steady loss above zero on an otherwise-up path points at congestion or a flapping link, not a hard failure.

traceroute maps the path hop by hop by sending packets with increasing TTL and reading the ICMP "time exceeded" each router returns. A row of asterisks usually means a hop that refuses to answer, not a break — the path can still work end to end. mtr combines ping and traceroute into one continuously-updating screen, which is the right tool for intermittent loss: it shows you which hop starts dropping packets over hundreds of probes instead of the single snapshot a one-shot traceroute gives.

# Debian/Ubuntu: install if missing
apt install traceroute mtr-tiny

ping -c 5 10.0.0.1        # 5 probes, then a loss/RTT summary
traceroute -T -p 443 host  # TCP SYN to 443 — gets through ICMP-blocking firewalls
mtr -rwc 100 host          # report mode, 100 cycles: per-hop loss over time
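Since the loss percentage is the number worth reading, it can be pulled out of the ping summary for logging or alerting. The sketch below assumes the iputils summary line format shown in the comment; ping_loss is a hypothetical helper name.

```shell
# ping_loss: read ping output on stdin and print the packet-loss percentage.
# Assumes the iputils summary line, for example:
#   5 packets transmitted, 4 received, 20% packet loss, time 4005ms
ping_loss() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Usage (not run here):
#   ping -c 5 10.0.0.1 | ping_loss
```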

DNS Resolution

DNS is the single most common cause of failures that look like a network outage, because a wrong or stale answer sends a perfectly healthy connection to the wrong IP. The right diagnostic is dig, which queries a resolver directly and prints exactly what came back — the answer record, the TTL, and the response code (NXDOMAIN means the name does not exist; SERVFAIL means the resolver itself failed). Do not debug DNS with ping, which hides whether the failure was the lookup or the connection.

On Ubuntu the resolver path has a twist: systemd-resolved runs a stub listener on 127.0.0.53, and /etc/resolv.conf is a symlink to a generated file that names only that stub rather than your real upstream servers. So dig example.com with no server argument queries the stub, while resolvectl status shows the actual upstream DNS each link is using. When an application resolves a name differently from dig, the cause is almost always /etc/nsswitch.conf ordering (for example an /etc/hosts entry consulted first) or a cached answer in the stub; flush the cache with resolvectl flush-caches.

dig example.com +short           # just the answer addresses
dig @1.1.1.1 example.com         # bypass the local resolver, ask Cloudflare directly
resolvectl status               # Ubuntu: real upstream DNS per link
resolvectl query example.com    # resolve through systemd-resolved exactly as apps do
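The response code is the first thing to read in a failed lookup. The sketch below extracts it from full dig output, assuming the standard header line that contains "status: NAME,"; dig_status is a hypothetical helper name. It needs the full output, because +short suppresses the header.

```shell
# dig_status: read full dig output on stdin and print the response code
# (NOERROR, NXDOMAIN, SERVFAIL, ...). Assumes the standard header line:
#   ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 41523
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Usage (not run here):
#   dig example.com | dig_status
#   dig @1.1.1.1 example.com | dig_status
```

Comparing the status from the local stub with the status from an upstream server shows which resolver is the one failing.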

Sockets and Listeners with ss

When a client cannot connect, the first question is whether anything is listening on the server, and ss -tlnp answers it precisely. A common false alarm is a service bound to 127.0.0.1 instead of 0.0.0.0 — it shows up as listening, accepts local connections, and refuses every remote one, because it never bound to a routable address. The Local Address:Port column tells you which: 127.0.0.1:5432 is loopback-only, 0.0.0.0:5432 or *:5432 is all interfaces.
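Reading the Local Address:Port column can be reduced to a small classifier. This is a sketch under the assumption that the value is passed in exactly as ss -tln prints it; bind_scope and its three output labels are hypothetical names chosen for illustration.

```shell
# bind_scope: classify one "Local Address:Port" value from `ss -tln`.
# Hypothetical helper; the labels it prints are illustrative.
bind_scope() {
    local addr=${1%:*}                 # strip the trailing :port
    case "$addr" in
        127.*|'[::1]')      echo "loopback-only" ;;
        0.0.0.0|'*'|'[::]') echo "all-interfaces" ;;
        *)                  echo "specific-address" ;;
    esac
}

# Usage (not run here):
#   bind_scope 127.0.0.1:5432
```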

ss also exposes the TCP state machine, which decodes the symptom. A connection stuck in SYN-SENT means your SYN left but no SYN-ACK came back: a firewall is dropping it silently, or the far host is down or unreachable. (A reachable host with nothing listening answers with a RST, so the connect fails at once with "connection refused" instead of lingering in SYN-SENT.) A pile of TIME-WAIT sockets is normal after a busy short-lived-connection workload and rarely a problem. A pile of CLOSE-WAIT, by contrast, is an application bug: the peer closed but your process never called close() on its file descriptor.

State	What it means	What to suspect
`LISTEN`	Socket waiting for connections	Healthy — check the bound address
`SYN-SENT`	SYN sent, no SYN-ACK back	Firewall drop, or peer down or unreachable
`ESTAB`	Connection established	Healthy
`TIME-WAIT`	Local close, waiting out 2×MSL	Normal after high connection churn
`CLOSE-WAIT`	Peer closed, you have not	Application leaking sockets
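A per-state tally makes a CLOSE-WAIT or TIME-WAIT pile obvious at a glance. The sketch below assumes ss -tan output with a single header line and the state in the first column; count_states is a hypothetical helper name.

```shell
# count_states: read `ss -tan` output on stdin and print "STATE count"
# for each TCP state, sorted by state name. Assumes one header line and
# the state in column 1.
count_states() {
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage (not run here):
#   ss -tan | count_states
```

Running it a few minutes apart shows whether a CLOSE-WAIT count is stable or growing, which is what separates a leak from a momentary backlog.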

Packet Capture with tcpdump

When the higher tools disagree with reality, tcpdump settles the argument by showing the packets actually on the wire. It is the tool for "the client says it sent the request and the server says it never arrived" — capture on both ends and one of them is wrong. Always pin the capture with a filter and an interface, because an unfiltered capture on a busy host floods the terminal and can drop packets it fails to write fast enough.

The two flags that make a capture readable are -n (do not resolve names, so the capture does not generate its own DNS traffic and stall) and -w file.pcap (write raw packets to a file for later analysis in Wireshark). For a stuck TCP connection, the telltale pattern is a SYN with no SYN-ACK (the far side is dropping), or repeated retransmissions of the same sequence number (the path is losing packets). Reading a three-packet handshake in tcpdump is faster than guessing across three other tools.

# Capture HTTPS to/from one host, numeric, on one interface
tcpdump -ni eth0 host 10.0.0.5 and port 443

# Capture to a file for Wireshark; -c caps the packet count
tcpdump -ni eth0 -w /tmp/cap.pcap -c 2000 port 5432

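# Watch TCP handshakes: matches the SYN and the SYN-ACK (both carry the SYN flag).
# Testing (tcp-syn|tcp-ack) instead would match nearly every TCP packet.
tcpdump -ni eth0 'tcp[tcpflags] & tcp-syn != 0 and host db01'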

The Firewall as a Suspect

A connection that fails only between two hosts but works locally is almost always a firewall, and on Debian/Ubuntu that means three layers to check in order. ufw is the high-level front end most Ubuntu servers use; ufw status verbose shows its rules. Underneath, ufw programs nftables (or iptables on older systems), and nft list ruleset shows the real rules the kernel enforces — including ones a cloud-init script or another tool added behind ufw's back. Above both sits the network: a cloud security group or an upstream router ACL drops packets before they ever reach the host.

The diagnostic that distinguishes "firewall dropping" from "nothing listening" is the connection's behavior. A host with nothing listening on the port answers with a TCP RST, and a REJECT rule also sends back an error (an ICMP port-unreachable by default, or a TCP RST with --reject-with tcp-reset), so in both cases the client fails fast with "connection refused." A DROP rule says nothing, so the client hangs until it times out, the same symptom as a packet lost in transit. On Red Hat systems the equivalent front end is firewalld (firewall-cmd --list-all), which also sits on top of nftables.

ufw status verbose          # Ubuntu high-level rules
nft list ruleset            # the actual kernel ruleset (nftables)
iptables -L -n -v          # legacy view; -v shows packet/byte counters

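# Test a TCP port without a client: /dev/tcp is a bash redirection feature, not a real file.
# Open and close only; reading with cat would block on a silent server until the timeout.
timeout 3 bash -c ': </dev/tcp/10.0.0.5/443' && echo OPEN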
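The one-line probe can be extended to report which of the three outcomes occurred. This is a sketch, assuming GNU timeout (which exits with status 124 when it has to kill the command) and a bash built with /dev/tcp support; classify_probe, probe_port, and the labels are hypothetical names. The "refused" label also covers other fast failures such as no route to host or an unresolvable name.

```shell
# classify_probe: map the exit status of a timed /dev/tcp connect to a
# likely cause. 124 is what timeout(1) returns when it killed the command.
classify_probe() {
    case "$1" in
        0)   echo "open" ;;
        124) echo "filtered" ;;  # no answer at all: DROP rule or loss in transit
        *)   echo "refused" ;;   # fast failure: a RST or ICMP error came back
    esac
}

# probe_port HOST PORT: attempt the connection and print the verdict.
probe_port() {
    timeout 3 bash -c ': </dev/tcp/$1/$2' _ "$1" "$2" 2>/dev/null
    classify_probe "$?"
}

# Usage (not run here):
#   probe_port 10.0.0.5 443
```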

DNS vs Routing vs Firewall

DNS problem — the name resolves to the wrong address or fails outright, while the IP is perfectly reachable. The test that isolates it: dig host +short returns nothing or the wrong record, but connecting to the known-good IP directly works. Fix the resolver, not the network.

Routing problem — even the raw IP is unreachable because no path exists or the gateway is wrong. The test: ip route get IP shows the route the kernel would use, and traceroute or mtr dies at the first hop or the gateway. Connecting by name versus by IP makes no difference, because both fail the same way.

Firewall problem — the route exists and DNS is correct, but a specific port is blocked. The test: ping and traceroute reach the host, yet a TCP connection to the port hangs (DROP) or is refused (REJECT) while other ports work. The discriminator is per-port reachability, not host reachability.
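The three tests above can be combined into one decision, checked bottom-up so that a routing failure is never misread as DNS or firewall. This is an illustrative sketch only: the function name diagnose, its yes/no and open/hang/refused inputs, and the labels it prints are all hypothetical, and the caller is assumed to have gathered the three observations by hand or with the commands named above.

```shell
# diagnose NAME_RESOLVES IP_REACHABLE PORT_RESULT
#   NAME_RESOLVES: yes|no             dig returned the expected address
#   IP_REACHABLE:  yes|no             ip route get / mtr reach the raw IP
#   PORT_RESULT:   open|hang|refused  outcome of a TCP connect to the port
diagnose() {
    if   [ "$2" = no ];      then echo "routing"
    elif [ "$1" = no ];      then echo "dns"
    elif [ "$3" = hang ];    then echo "firewall-drop"
    elif [ "$3" = refused ]; then echo "reject-or-no-listener"
    else                          echo "higher-layer"
    fi
}

# Usage:
#   diagnose yes yes hang
```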

Common Mistakes

Concluding "the host is down" from an unanswered ping when the firewall simply drops ICMP echo — the host is up and serving traffic on its real ports, and you have wasted an hour on the wrong diagnosis.
Binding a service to 127.0.0.1 and expecting remote clients to reach it. ss -tlnp shows it listening, local curl works, and every connection from another machine is refused because the socket never bound to a routable address.
Debugging DNS on Ubuntu by reading /etc/resolv.conf and finding only 127.0.0.53. That is the systemd-resolved stub; the real upstream servers are listed by resolvectl status, so stopping at resolv.conf leads to the wrong conclusion about which resolver is failing.
Reaching for ifconfig, netstat, and route on a fresh Ubuntu Server — they are not installed by default, and the muscle memory costs you minutes during an incident when ip and ss were there the whole time.
Confusing a DROP firewall rule with a REJECT one. A hang-until-timeout looks identical to packet loss in transit, while a fast "connection refused" tells you something answered: either a REJECT rule or a host with no listener on that port. Reading the wrong symptom sends you to the wrong layer.
Running tcpdump with no filter on a busy production host, flooding the terminal and forcing the kernel to drop the very packets you needed to see because the capture could not write them fast enough.
Ignoring a growing pile of CLOSE-WAIT sockets in ss as if it were harmless like TIME-WAIT — it is an application that never closed its file descriptors, and it leaks until the process runs out of them.

Best Practices

Troubleshoot bottom-up: confirm ip route has a default gateway and ip neigh resolves it before you ever run curl or read a TLS error.
Use ss -tlnp as the first check for any "cannot connect" report, and read the Local Address column to catch services bound to loopback instead of 0.0.0.0.
Test DNS with dig +short and confirm the resolver with resolvectl status on Ubuntu — never infer DNS health from a ping that hides which step failed.
Reach for mtr -rwc 100 for intermittent loss instead of a single traceroute; it pinpoints the exact hop that starts dropping over many probes.
Always filter tcpdump by host and port and pin it to one interface with -ni, and write to a .pcap with -w for anything you want to inspect in Wireshark later.
Check the firewall in all three layers — ufw status, then nft list ruleset for what the kernel actually enforces, then the upstream security group or router ACL — before assuming the application is at fault.
Keep traceroute, mtr-tiny, and tcpdump pre-installed in your base image; installing diagnostic tools mid-incident on a box with broken networking is exactly when apt cannot reach the mirror.

Comparable tools

Windows: ipconfig, Test-NetConnection, and pktmon/Wireshark for the ip/ss/tcpdump roles
macOS / BSD: BSD ifconfig, netstat, route, and tcpdump; no iproute2, so the old syntax persists
Wireshark / tshark: GUI and CLI analysis of the .pcap files tcpdump writes

Knowledge Check

A client cannot reach a service. ss -tlnp on the server shows it listening on 127.0.0.1:8080. What is the problem?

The service is bound to loopback only, so it accepts local connections but refuses every remote one — it must bind to 0.0.0.0 to be reachable
Port 8080 falls in the privileged range, so the service needs root capabilities before the kernel will let it accept any external traffic on that port
The firewall is dropping the inbound packets at the FORWARD chain before they ever reach the listening socket on the server
The listener itself is perfectly healthy, so the failure must be DNS resolving the server's hostname to the wrong address and sending the client to a dead endpoint

A connection to a remote port hangs until it times out rather than failing immediately. Which is the most likely cause?

A firewall DROP rule (or lost packets) is silently discarding the SYN, so no RST comes back and the client waits out its timeout
A firewall REJECT rule is sending an RST back to the client, and that returned reset is what always forces the connection into a slow timeout instead of a fast failure
The destination server returned an NXDOMAIN response for the connection request, leaving the client to retry until it gives up
The outbound socket is stuck in TIME-WAIT, which blocks every new connection to that same port until the old entry finally clears

On Ubuntu, /etc/resolv.conf points only at 127.0.0.53. What does that tell you?

It is the systemd-resolved stub listener; the real upstream servers are shown by resolvectl status
DNS is misconfigured and pointing at the loopback address by mistake — it should instead list the public upstream resolvers directly
The host has no working DNS configured at all, so every name lookup on the system will fail outright
A local caching server like dnsmasq is the only supported resolver on Ubuntu

Why is mtr -rwc 100 a better choice than a single traceroute for diagnosing intermittent packet loss?

It probes every hop continuously over many cycles and reports per-hop loss percentages, so it pinpoints which hop drops packets instead of one snapshot
It probes with TCP segments instead of ICMP echo packets, and TCP is the only transport protocol that can actually expose per-hop packet loss along the route
It bypasses the intermediate firewalls that a plain traceroute can never cross, so it reaches hops the snapshot tool simply cannot see
It resolves every hop's hostname through reverse DNS, which a single traceroute run never bothers to do on its own
