Network Troubleshooting
Network troubleshooting on Linux is the discipline of walking a connection failure down the stack — from "is the link up" to "did the application's TCP handshake complete" — using a small set of tools that each answer one layer's question. The modern Debian and Ubuntu baseline is the iproute2 suite (ip, ss), not the deprecated net-tools commands (ifconfig, netstat, route) that older guides still reach for. The old commands are not even installed on a fresh Ubuntu Server image; ip and ss always are.
The operational cost of a bad mental model is wasted downtime. An engineer who pings a host, sees no reply, and declares "the network is down" has skipped four other explanations: a firewall dropping ICMP, an MTU mismatch black-holing large packets, a DNS answer pointing at the wrong address, or a service that simply is not listening on the port. Each layer has its own tool, and naming the layer that actually failed turns a two-hour outage into a five-minute fix.
The Layered Method
Work bottom-up and stop at the first layer that is broken. Layer 1 and 2: is the link up and is there a MAC neighbor? Layer 3: does the host have an IP, a route to the destination, and can it reach the gateway? Layer 4: is a process listening locally and is the remote port open? Layer 7: does the application protocol — HTTP, TLS, DNS — actually complete? Reading TLS errors from curl is a waste of time when the default route is missing, because the connection never left the box.
On Debian and Ubuntu, ip answers the link and layer-3 questions and ss answers the layer-4 questions. Together they replace the whole net-tools package. ss reads socket data from the kernel over netlink instead of parsing /proc/net/tcp the way netstat does, which keeps it fast on machines with tens of thousands of sockets, and both tools report state (link flags, route metrics, socket timers, retransmit counts) that ifconfig and netstat never exposed.
    # Layer 1-3: link, addresses, routes, neighbor (ARP/NDP) table
    ip -br link              # brief view: UP/DOWN state of each interface
    ip -br addr              # brief view: addresses per interface
    ip route                 # routing table; check for "default via"
    ip neigh                 # neighbor cache; FAILED here = no L2 reachability
    # Layer 4: who is listening, who is connected
    ss -tlnp                 # TCP, Listening, Numeric, with Process (needs root)
    ss -tn state established # live established TCP flows
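The "stop at the first broken layer" rule can be scripted for the layer-3 step. The following is a minimal sketch, assuming the usual ip route output in which the default entry begins with "default via"; the function name default_gateway is hypothetical, not part of iproute2.

```bash
# Print the gateway of the first default route found on stdin
# (expects `ip route` output). Exits non-zero if no "default via" line exists.
default_gateway() {
    awk '$1 == "default" && $2 == "via" { print $3; found = 1; exit }
         END { exit found ? 0 : 1 }'
}

# Usage (not run here):
#   gw=$(ip route | default_gateway) || echo "no default route: stop at layer 3"
#   ping -c 3 "$gw"
```

A point-to-point link can carry a default route with no "via" hop, so treat a non-zero exit as a prompt to read the routing table by hand, not as proof of a fault.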
Reachability: ping, traceroute, MTR
ping tests one thing — round-trip ICMP echo to one address — and a missing reply has several causes, not one. Many hosts and almost every cloud or corporate firewall drop ICMP by policy, so silence does not prove the host is down; it proves only that ICMP echo did not return. The signal worth reading is in the numbers: ping reports per-packet round-trip time and a packet-loss percentage, and steady loss above zero on an otherwise-up path points at congestion or a flapping link, not a hard failure.
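Reading the loss figure can be automated when you want to log it or alert on it. This sketch assumes the iputils summary line format ("N packets transmitted, M received, X% packet loss"); ping_loss is a hypothetical helper name.

```bash
# Extract the packet-loss percentage from ping output on stdin.
# Assumes the iputils summary line, for example:
#   5 packets transmitted, 4 received, 20% packet loss, time 4005ms
ping_loss() {
    grep -oE '[0-9.]+% packet loss' | head -n 1 | cut -d'%' -f1
}

# Usage (not run here):
#   loss=$(ping -c 5 10.0.0.1 | ping_loss)
#   echo "loss: ${loss}%"
```

An empty result means no summary line was found, which is different from 0% loss.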
traceroute maps the path hop by hop by sending packets with increasing TTL and reading the ICMP "time exceeded" each router returns. A row of asterisks usually means a hop that refuses to answer, not a break — the path can still work end to end. mtr combines ping and traceroute into one continuously-updating screen, which is the right tool for intermittent loss: it shows you which hop starts dropping packets over hundreds of probes instead of the single snapshot a one-shot traceroute gives.
    # Debian/Ubuntu: install if missing
    apt install traceroute mtr-tiny
    ping -c 5 10.0.0.1         # 5 probes, then a loss/RTT summary
    traceroute -T -p 443 host  # TCP SYN to 443 (needs root); passes firewalls that block ICMP or UDP probes
    mtr -rwc 100 host          # report mode, 100 cycles: per-hop loss over time
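To pull the first lossy hop out of an mtr report, a small awk filter is enough. This is a sketch that assumes the default mtr -rw column order (hop number, host, Loss%, then the timing columns); first_lossy_hop is a hypothetical name.

```bash
# Print "hop host loss%" for the first hop reporting non-zero loss.
# Reads `mtr -rw` report text on stdin; assumes the default column order.
first_lossy_hop() {
    awk '$1 ~ /^[0-9]+[.][|]--$/ {
             loss = $3; sub(/%/, "", loss)
             if (loss + 0 > 0) { sub(/[.].*/, "", $1); print $1, $2, $3; exit }
         }'
}

# Usage (not run here):
#   mtr -rwc 100 host | first_lossy_hop
```

Loss that appears at one intermediate hop but not at the hops after it is often a router rate-limiting its ICMP replies, so confirm that the loss carries through to the final hop before blaming the hop this prints.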
DNS Resolution
DNS is the single most common cause of failures that look like a network outage, because a wrong or stale answer sends a perfectly healthy connection to the wrong IP. The right diagnostic is dig, which queries a resolver directly and prints exactly what came back — the answer record, the TTL, and the response code (NXDOMAIN means the name does not exist; SERVFAIL means the resolver itself failed). Do not debug DNS with ping, which hides whether the failure was the lookup or the connection.
On Ubuntu the resolver path has a twist: systemd-resolved runs a stub listener on 127.0.0.53, and /etc/resolv.conf is a symlink to a generated file that lists only that stub address rather than your real upstream servers. So dig example.com with no server argument queries the stub, while resolvectl status shows the actual upstream DNS each link is using. When an application resolves a name differently from dig, the cause is almost always /etc/nsswitch.conf ordering or a cached answer in the stub; flush the cache with resolvectl flush-caches.
    dig example.com +short        # just the answer addresses
    dig @1.1.1.1 example.com      # bypass the local resolver, ask Cloudflare directly
    resolvectl status             # Ubuntu: real upstream DNS per link
    resolvectl query example.com  # resolve through systemd-resolved exactly as apps do
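The response code sits in the header line of dig's full output (it is not printed with +short). The sketch below extracts it and maps the codes named above to a next step; dig_status and explain_status are hypothetical helper names, and the header format is assumed to be dig's usual ";; ->>HEADER<<- ... status: CODE, ..." line.

```bash
# Print the DNS response code (NOERROR, NXDOMAIN, SERVFAIL, ...) found in
# full dig output on stdin. Prints nothing if no header line is present.
dig_status() {
    sed -n 's/.*->>HEADER<<-.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Turn a response code into a one-line hint.
explain_status() {
    case "$1" in
        NOERROR)  echo "lookup succeeded: check the answer is the address you expect" ;;
        NXDOMAIN) echo "the name does not exist" ;;
        SERVFAIL) echo "the resolver itself failed" ;;
        *)        echo "unrecognized or missing status: $1" ;;
    esac
}

# Usage (not run here): compare the stub with a direct upstream query
#   explain_status "$(dig example.com | dig_status)"
#   explain_status "$(dig @1.1.1.1 example.com | dig_status)"
```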
Sockets and Listeners with ss
When a client cannot connect, the first question is whether anything is listening on the server, and ss -tlnp answers it precisely. A common false alarm is a service bound to 127.0.0.1 instead of 0.0.0.0 — it shows up as listening, accepts local connections, and refuses every remote one, because it never bound to a routable address. The Local Address:Port column tells you which: 127.0.0.1:5432 is loopback-only, 0.0.0.0:5432 or *:5432 is all interfaces.
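The loopback check can be expressed as a small classifier over the Local Address:Port value. This is a sketch: bind_scope is a hypothetical name, and the IPv6 forms ([::1] and [::]) are an addition beyond the IPv4 examples in the text.

```bash
# Classify one Local Address:Port value taken from `ss -tlnp` output.
bind_scope() {
    local addr="${1%:*}"    # strip the trailing :port
    case "$addr" in
        127.*|"[::1]")      echo "loopback-only" ;;
        0.0.0.0|"*"|"[::]") echo "all-interfaces" ;;
        *)                  echo "specific-address" ;;
    esac
}

# Usage:
#   bind_scope 127.0.0.1:5432   # prints loopback-only
#   bind_scope 0.0.0.0:5432     # prints all-interfaces
```

A "specific-address" result is reachable only by clients that can route to that one address, which is worth checking against the interface the client traffic actually arrives on.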
ss also exposes the TCP state machine, which decodes the symptom. A connection stuck in SYN-SENT means your SYN left but no SYN-ACK came back: a firewall is dropping it silently, the peer host is down, or the path is losing packets. (A reachable host with nothing listening on the port answers with a RST, so the client sees an immediate "connection refused" rather than a stuck SYN-SENT.) A pile of TIME-WAIT sockets is normal after a busy short-lived-connection workload and rarely a problem. A pile of CLOSE-WAIT, by contrast, is an application bug: the peer closed but your process never called close() on its file descriptor.
| State | What it means | What to suspect |
|---|---|---|
| LISTEN | Socket waiting for connections | Healthy; check the bound address |
| SYN-SENT | SYN sent, no SYN-ACK back | Firewall drop, or peer down or unreachable |
| ESTAB | Connection established | Healthy |
| TIME-WAIT | Local close, waiting out 2×MSL | Normal after high connection churn |
| CLOSE-WAIT | Peer closed, you have not | Application leaking sockets |
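To see at a glance which state is piling up, count sockets per state. A minimal sketch, assuming ss -tan prints one header line followed by one socket per line with the state in the first column; count_states is a hypothetical name.

```bash
# Count sockets per TCP state from `ss -tan` output on stdin,
# most common state first.
count_states() {
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print n[s], s }' | sort -rn
}

# Usage (not run here):
#   ss -tan | count_states
```

Run it a few minutes apart: a CLOSE-WAIT count that only ever grows is the leak signature described above, while a large but stable TIME-WAIT count is usually just churn.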
Packet Capture with tcpdump
When the higher tools disagree with reality, tcpdump settles the argument by showing the packets actually on the wire. It is the tool for "the client says it sent the request and the server says it never arrived" — capture on both ends and one of them is wrong. Always pin the capture with a filter and an interface, because an unfiltered capture on a busy host floods the terminal and can drop packets it fails to write fast enough.
The two flags that make a capture readable are -n (do not resolve names, so the capture does not generate its own DNS traffic and stall) and -w file.pcap (write raw packets to a file for later analysis in Wireshark). For a stuck TCP connection, the telltale pattern is a SYN with no SYN-ACK (the far side is dropping), or repeated retransmissions of the same sequence number (the path is losing packets). Reading a three-packet handshake in tcpdump is faster than guessing across three other tools.
    # Capture HTTPS to/from one host, numeric, on one interface
    tcpdump -ni eth0 host 10.0.0.5 and port 443
    # Capture to a file for Wireshark; -c caps the packet count
    tcpdump -ni eth0 -w /tmp/cap.pcap -c 2000 port 5432
    # Watch connection setup: matches SYN and SYN-ACK only (the final bare ACK is not matched)
    tcpdump -ni eth0 'tcp[tcpflags] & tcp-syn != 0 and host db01'
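The "SYN with no SYN-ACK" pattern can be spotted by counting the two packet types in tcpdump's text output. This is a sketch that assumes tcpdump's usual flag notation, "Flags [S]" for a SYN and "Flags [S.]" for a SYN-ACK; handshake_summary is a hypothetical name, and SYNs carrying extra flags (such as ECN bits) are not counted.

```bash
# Count SYN and SYN-ACK packets in tcpdump text output on stdin.
handshake_summary() {
    awk '/Flags \[S\],/   { syn++ }
         /Flags \[S\.\],/ { synack++ }
         END { printf "syn=%d synack=%d\n", syn, synack }'
}

# Usage (not run here; stops after 50 matching packets):
#   tcpdump -ni eth0 -c 50 'tcp[tcpflags] & tcp-syn != 0 and host 10.0.0.5 and port 443' \
#       | handshake_summary
```

Several SYNs and zero SYN-ACKs on the client side mean nothing is answering. The same capture on the server then shows whether the SYNs arrived at all.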
The Firewall as a Suspect
A connection that fails only between two hosts but works locally is almost always a firewall, and on Debian/Ubuntu that means three layers to check in order. ufw is the high-level front end most Ubuntu servers use; ufw status verbose shows its rules. Underneath, ufw programs nftables (or iptables on older systems), and nft list ruleset shows the real rules the kernel enforces — including ones a cloud-init script or another tool added behind ufw's back. Above both sits the network: a cloud security group or an upstream router ACL drops packets before they ever reach the host.
The diagnostic that distinguishes "firewall dropping" from "nothing listening" is the connection's behavior. A REJECT rule sends back an error — an ICMP port-unreachable by default, or a TCP RST with --reject-with tcp-reset — so the client fails fast with "connection refused." A DROP rule says nothing, so the client hangs until it times out — the same symptom as a packet lost in transit. On Red Hat systems the equivalent front end is firewalld (firewall-cmd --list-all), which also sits on top of nftables.
    ufw status verbose   # Ubuntu high-level rules
    nft list ruleset     # the actual kernel ruleset (nftables)
    iptables -L -n -v    # legacy view; -v shows packet/byte counters
    # Test a TCP port without a client: /dev/tcp is a bash redirection feature, not a real device.
    # Only open the socket; reading from it (cat) would block on servers that wait for the client to speak.
    timeout 3 bash -c 'exec 3<>/dev/tcp/10.0.0.5/443' && echo OPEN
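The fail-fast versus hang distinction maps onto the exit status of a timed connect attempt. The sketch below assumes GNU timeout, which exits with 124 when it has to kill the command; classify_probe and probe_tcp are hypothetical helper names.

```bash
# Map the exit status of a timed /dev/tcp connect to a verdict.
# 0 = connected, 124 = timeout killed the attempt, anything else = fast failure.
classify_probe() {
    case "$1" in
        0)   echo "open" ;;
        124) echo "timeout" ;;
        *)   echo "refused" ;;
    esac
}

# probe_tcp HOST PORT [SECONDS]: try a TCP connect and print the verdict.
probe_tcp() {
    timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
    classify_probe "$?"
}

# Usage (not run here):
#   probe_tcp 10.0.0.5 443      # open | timeout | refused
```

Read "timeout" as a DROP rule or lost packets and "refused" as a REJECT rule or a port with no listener. Other fast failures, such as a name that does not resolve, also land in "refused", so probe by IP address when the distinction matters.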
DNS problem — the name resolves to the wrong address or fails outright, while the IP is perfectly reachable. The test that isolates it: dig host +short returns nothing or the wrong record, but connecting to the known-good IP directly works. Fix the resolver, not the network.
Routing problem — even the raw IP is unreachable because no path exists or the gateway is wrong. The test: ip route get IP shows the route the kernel would use, and traceroute or mtr dies at the first hop or the gateway. Connecting by name versus by IP makes no difference, because both fail the same way.
Firewall problem — the route exists and DNS is correct, but a specific port is blocked. The test: ping and traceroute reach the host, yet a TCP connection to the port hangs (DROP) or is refused (REJECT) while other ports work. The discriminator is per-port reachability, not host reachability.
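The three tests above reduce to a short decision order: routing first, then DNS, then the port. A sketch of that ordering follows; triage is a hypothetical function, and its three yes/no inputs stand for the results of the checks described in the text, not for anything a real tool emits.

```bash
# triage IP_REACHABLE NAME_RESOLVES_CORRECTLY PORT_CONNECTS
# Each argument is "yes" or "no". Prints the first layer to suspect.
triage() {
    if   [ "$1" = no ]; then echo "routing"    # even the raw IP fails
    elif [ "$2" = no ]; then echo "dns"        # IP works, name is wrong or missing
    elif [ "$3" = no ]; then echo "firewall"   # host reachable, this port is not
    else                     echo "none"
    fi
}

# Usage:
#   triage yes no yes    # prints dns
```

A "firewall" verdict still needs ss -tlnp on the server to rule out a service that is simply not listening on that port.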
- Concluding "the host is down" from an unanswered ping when the firewall simply drops ICMP echo: the host is up and serving traffic on its real ports, and you have wasted an hour on the wrong diagnosis.
- Binding a service to 127.0.0.1 and expecting remote clients to reach it. ss -tlnp shows it listening, local curl works, and every connection from another machine is refused because the socket never bound to a routable address.
- Debugging DNS on Ubuntu by querying /etc/resolv.conf directly and finding only 127.0.0.53. That is the systemd-resolved stub; the real upstream servers live in resolvectl status, so you draw the wrong conclusion about which resolver is failing.
- Reaching for ifconfig, netstat, and route on a fresh Ubuntu Server. They are not installed by default, and the muscle memory costs you minutes during an incident when ip and ss were there the whole time.
- Confusing a DROP firewall rule with a REJECT one. A hang-until-timeout looks identical to packet loss in transit, while a fast "connection refused" tells you a rule actively rejected you; reading the wrong symptom sends you to the wrong layer.
- Running tcpdump with no filter on a busy production host, flooding the terminal and forcing the kernel to drop the very packets you needed to see because the capture could not write them fast enough.
- Ignoring a growing pile of CLOSE-WAIT sockets in ss as if it were harmless like TIME-WAIT. It is an application that never closed its file descriptors, and it leaks until the process runs out of them.
- Troubleshoot bottom-up: confirm ip route has a default gateway and ip neigh resolves it before you ever run curl or read a TLS error.
- Use ss -tlnp as the first check for any "cannot connect" report, and read the Local Address column to catch services bound to loopback instead of 0.0.0.0.
- Test DNS with dig +short and confirm the resolver with resolvectl status on Ubuntu; never infer DNS health from a ping that hides which step failed.
- Reach for mtr -rwc 100 for intermittent loss instead of a single traceroute; it pinpoints the exact hop that starts dropping over many probes.
- Always filter tcpdump by host and port and pin it to one interface with -ni, and write to a .pcap with -w for anything you want to inspect in Wireshark later.
- Check the firewall in all three layers (ufw status, then nft list ruleset for what the kernel actually enforces, then the upstream security group or router ACL) before assuming the application is at fault.
- Keep traceroute, mtr-tiny, and tcpdump pre-installed in your base image; installing diagnostic tools mid-incident on a box with broken networking is exactly when apt cannot reach the mirror.
- ipconfig, Test-NetConnection, and pktmon/Wireshark for the ip/ss/tcpdump roles
- macOS / BSD: BSD ifconfig, netstat, route, and tcpdump; no iproute2, so the old syntax persists
- Wireshark / tshark: GUI and CLI analysis of the .pcap files tcpdump writes
Knowledge Check
A client cannot reach a service. ss -tlnp on the server shows it listening on 127.0.0.1:8080. What is the problem?
- The service is bound to loopback only, so it accepts local connections but refuses every remote one; it must bind to 0.0.0.0 to be reachable
- Port 8080 falls in the privileged range, so the service needs root capabilities before the kernel will let it accept any external traffic on that port
- The firewall is dropping the inbound packets at the FORWARD chain before they ever reach the listening socket on the server
- The listener itself is perfectly healthy, so the failure must be DNS resolving the server's hostname to the wrong address and sending the client to a dead endpoint
A connection to a remote port hangs until it times out rather than failing immediately. Which is the most likely cause?
- A firewall DROP rule (or lost packets) is silently discarding the SYN, so no RST comes back and the client waits out its timeout
- A firewall REJECT rule is sending an RST back to the client, and that returned reset is what always forces the connection into a slow timeout instead of a fast failure
- The destination server returned an NXDOMAIN response for the connection request, leaving the client to retry until it gives up
- The outbound socket is stuck in TIME-WAIT, which blocks every new connection to that same port until the old entry finally clears
On Ubuntu, /etc/resolv.conf points only at 127.0.0.53. What does that tell you?
- It is the systemd-resolved stub listener; the real upstream servers are shown by resolvectl status
- DNS is misconfigured and pointing at the loopback address by mistake; it should instead list the public upstream resolvers directly
- The host has no working DNS configured at all, so every name lookup on the system will fail outright
- A local caching server like dnsmasq is the only supported resolver on Ubuntu
Why is mtr -rwc 100 a better choice than a single traceroute for diagnosing intermittent packet loss?
- It probes every hop continuously over many cycles and reports per-hop loss percentages, so it pinpoints which hop drops packets instead of one snapshot
- It probes with TCP segments instead of ICMP echo packets, and TCP is the only transport protocol that can actually expose per-hop packet loss along the route
- It bypasses the intermediate firewalls that a plain traceroute can never cross, so it reaches hops the snapshot tool simply cannot see
- It resolves every hop's hostname through reverse DNS, which a single traceroute run never bothers to do on its own