Topic 66

DNS and Socket Tools: dig, ss, lsof

Tools

Past raw reachability, the questions get specific: is it DNS, is anything listening, and what process owns this connection? Three tools answer them. dig interrogates DNS precisely — which server, which record type, what TTL — without the indirection your application's resolver adds. ss (and the deprecated netstat it replaced) shows socket state: who is listening, who is connected, how many sockets are stuck in each state. lsof closes the loop by mapping a socket back to the process holding it.

These are the three tools you reach for the moment ping succeeds but the application still fails. A successful ping proves the IP is reachable; it says nothing about whether the name resolved to that IP, whether a process is bound to the port, or whether a firewall sits between the listener and the client. Most "the service is up but I can't connect" tickets are exactly one of those three, and these tools separate them in under a minute.

dig — Precise DNS Queries

dig queries DNS directly and lets you control every variable: which server with @server, which record type, and how much output. The +short form gives you just the answer; the full form shows the answer, authority, and additional sections plus the TTL, which is what tells you whether you are looking at a fresh record or a cached one about to expire. Crucially, dig bypasses your host's NSS stack, so it talks to DNS even when /etc/hosts would override the application.

# ask a specific resolver, see the TTL in the answer section
dig @1.1.1.1 api.example.com A +noall +answer
# api.example.com.  300  IN  A  93.184.216.34   (300s TTL)
# follow the delegation from the root down — bypasses caches
dig api.example.com +trace
# .            NS a.root-servers.net.
# com.         NS a.gtld-servers.net.
# example.com. NS ns1.example.com.    <- authoritative answer here

The +trace mode is the one that settles arguments. It walks the delegation from the root servers down to the authoritative nameserver, ignoring every cache on the way, so it shows you the real current record rather than whatever your resolver cached an hour ago. When the app sees one address and you "know" the record changed, +trace proves whether the change has actually propagated to authority or is still stuck behind a TTL somewhere.
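One way to check the same thing without reading a full trace is to ask a caching resolver and an authoritative server the same question and compare the answers. A minimal sketch, assuming a hypothetical helper named dns_compare and that you already know one authoritative nameserver for the zone (for example from the NS lines in the trace output):

```shell
# Hypothetical helper: compare a caching resolver's answer with an
# authoritative server's answer for the same A record.
# usage: dns_compare NAME RESOLVER_IP AUTH_SERVER
dns_compare() {
    local name=$1 resolver=$2 auth=$3 cached authoritative
    # What the caching resolver currently hands out.
    cached=$(dig +short "@$resolver" "$name" A | sort)
    # What the authority itself says; +norecurse asks it not to look elsewhere.
    authoritative=$(dig +short +norecurse "@$auth" "$name" A | sort)
    if [ -z "$authoritative" ]; then
        echo "no-answer-from-authority"
    elif [ "$cached" = "$authoritative" ]; then
        echo "in-sync"
    else
        echo "stale-or-divergent"
    fi
}
```

For example, dns_compare api.example.com 1.1.1.1 ns1.example.com would print stale-or-divergent while the resolver is still serving the old address. A mismatch can also come from a zone that deliberately returns different answers to different clients, so treat the result as a prompt to look closer rather than as proof.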

ss and netstat — Socket State

ss reads socket state straight from the kernel and answers two questions fast: what is listening, and what connections exist in which state. ss -tlnp lists TCP listeners with their owning process; ss -tn state established shows live connections. The state column is where the diagnosis lives — a pile of sockets in one state usually names the bug.

# what is listening, on which address, owned by which pid
ss -tlnp
# State  Local Address:Port   Process
# LISTEN 0.0.0.0:443          users:(("nginx",pid=812))
# LISTEN 127.0.0.1:5432       users:(("postgres",pid=901))  <- localhost only!
# count connection states at a glance (NR > 1 skips the header row)
ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c
#   2 LISTEN
# 184 TIME-WAIT   <- normal churn from a busy client
# 137 CLOSE-WAIT  <- the app is not close()-ing sockets

Two state counts carry meaning. A large TIME-WAIT pile is normal on a busy client that opens many short connections — it is the kernel holding the socket for twice the maximum segment lifetime to absorb stray packets, and it clears itself. A growing CLOSE-WAIT pile is an application bug, not a network one: the peer closed, but your app never called close(), so the socket is stuck waiting on the app. Reading a CLOSE-WAIT pileup as a network fault sends you debugging the wrong machine entirely.
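Because the signal is a count that keeps growing rather than a single large number, it helps to sample it more than once. A small sketch, assuming hypothetical helpers named count_state and close_wait_growth that work on ss output:

```shell
# Hypothetical helper: count sockets in one state from `ss -tan` output.
# The header row is harmless because "State" never equals a state name.
# usage: ss -tan | count_state CLOSE-WAIT
count_state() {
    awk -v want="$1" '$1 == want { n++ } END { print n + 0 }'
}

# Sample twice and report the difference. The 60 second default gap is an
# arbitrary choice for illustration.
# usage: close_wait_growth [SECONDS]
close_wait_growth() {
    local gap=${1:-60} before after
    before=$(ss -tan | count_state CLOSE-WAIT)
    sleep "$gap"
    after=$(ss -tan | count_state CLOSE-WAIT)
    echo "CLOSE-WAIT: $before -> $after (delta $((after - before)))"
}
```

A delta that stays positive across several samples points at the application. To see which process holds the stuck sockets, ss -tnp state close-wait lists them with their owner, given enough privilege to read other users' processes.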

netstat answers the same questions but parses /proc the slow way and is deprecated on modern Linux; ss answers them faster through a kernel-native interface. The only reason to reach for netstat is a host too old to ship ss.

lsof — Socket to Process

lsof maps a port or connection back to the exact process and file descriptor that owns it — the answer to "what is holding port 8080" or "which process opened this connection to the database." On a host where you already have the listener address from ss, lsof -i tells you the binary, its pid, and the user, which is everything you need to restart the right thing or find the leak.

# who owns the listener on 8080, and as which user
lsof -nP -iTCP:8080 -sTCP:LISTEN
# COMMAND   PID   USER   FD   TYPE  NODE NAME
# java     4471   app   77u  IPv4  TCP  *:8080 (LISTEN)
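When the lookup feeds a script rather than a person, the terse form is easier to consume. A sketch, assuming a hypothetical function named port_owner; it relies on lsof -t, which prints only PIDs, and on ps to recover the command name:

```shell
# Hypothetical helper: print "PID COMMAND" for every process listening on a TCP port.
# usage: port_owner 8080
port_owner() {
    local port=$1 pid
    # -t makes lsof print bare PIDs, one per line, with no header.
    for pid in $(lsof -nP -iTCP:"$port" -sTCP:LISTEN -t 2>/dev/null); do
        printf '%s %s\n' "$pid" "$(ps -o comm= -p "$pid")"
    done
}
```

Run it with enough privilege to see other users' processes; otherwise lsof may print nothing for a port that is in fact bound.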

Port Not Listening versus Firewalled versus DNS Wrong

The payoff is telling apart three failures that all present as "can't connect." Port not listening: ss -tlnp on the server shows nothing bound to the port, or shows the process bound to 127.0.0.1 instead of 0.0.0.0, a one-line config bug that looks like a network outage. In both cases a remote connect is refused immediately, because the kernel answers the SYN with a RST. Listening but firewalled: ss shows the listener present and healthy, but a connect from the client times out; the listener is fine and a firewall in between is dropping the SYN.

DNS wrong: the name resolves to the wrong IP, so you are connecting to a host that was never going to answer. Comparing the dig +short output against the expected address, plus a getent hosts to see what the app actually resolves, separates this in one step. The diagnostic flow is mechanical: dig the name to confirm the IP, ss on the server to confirm the listener, then a client-side connect to test the path between. Whichever step fails names the layer.
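The flow above can be written down as a small decision function. This is only a sketch: the name name_layer and its argument vocabulary are hypothetical, and the three observations still have to be gathered by hand with dig or getent, ss, and a client-side connect.

```shell
# Hypothetical triage helper: given the observations from the three checks,
# name the layer to investigate.
#   $1 resolved IP (from dig +short or getent hosts)
#   $2 expected IP
#   $3 listener seen by ss on the server: none | loopback | external
#   $4 client connect result: ok | refused | timeout
name_layer() {
    local resolved=$1 expected=$2 listener=$3 connect=$4
    if [ "$resolved" != "$expected" ]; then
        echo "dns: name resolves to $resolved, expected $expected"
    elif [ "$listener" = none ]; then
        echo "service: nothing bound to the port"
    elif [ "$listener" = loopback ]; then
        echo "service: bound to loopback only"
    elif [ "$connect" != ok ]; then
        echo "path: listener healthy, connect result was $connect"
    else
        echo "ok: all three checks pass"
    fi
}
```

The checks run in the same order as the prose: a wrong address makes the later observations meaningless, so DNS is ruled out first, then the listener, then the path.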

Is it DNS? Which server returns what record, with what TTL? → dig

Is anything actually listening on the port, in which socket state? → ss

What process owns this socket or port number? → lsof

ss vs netstat

ss queries socket state directly through the kernel's netlink interface, so it is fast even with hundreds of thousands of sockets and supports rich state and address filters. It is the answer on any current Linux — listeners, connection states, owning process, all in one command.

netstat answers the same questions but scrapes /proc/net line by line, which crawls on a busy host, and it is deprecated and often not installed by default. Use it only on a legacy box too old to ship ss; otherwise ss is a drop-in replacement with the same intent and better performance.

Common Mistakes

Querying the system resolver when you meant to test a specific server. A bare dig name hits whatever your resolver caches; without @server you cannot tell a propagated change from a stale cache, and you "confirm" the wrong answer.
Reading a CLOSE-WAIT pileup as a network issue. CLOSE-WAIT means the peer closed and your app never called close() — it is an application file-descriptor leak, and debugging the network wastes hours on the wrong host.
Assuming a listening port is reachable. ss showing LISTEN proves only that the local process is bound; a firewall between client and server can still drop every SYN, so the listener is healthy and the connection still times out.
Missing a service bound to 127.0.0.1 instead of 0.0.0.0. The process is up and ss shows it listening, but only on localhost, so every remote client is refused — a one-line bind-address bug that mimics an outage.
Confusing the host resolver cache with authoritative data. The cached answer can be hours stale; only dig +trace or a query straight to the authoritative nameserver shows the record that actually exists right now.
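The bind-address mistake in particular is easy to check mechanically. A sketch, assuming a hypothetical helper named listener_scope that reads ss -tln output and reports whether a port is not bound at all, bound to loopback only, or bound to an address remote clients could reach:

```shell
# Hypothetical helper: classify how a TCP port is bound.
# usage: ss -tln | listener_scope PORT    -> prints none | loopback | external
listener_scope() {
    awk -v port="$1" '
        BEGIN { scope = "none" }
        {
            for (i = 1; i <= NF; i++) {
                # The first field ending in :digits is the local address:port.
                if ($i ~ /:[0-9]+$/) {
                    n = split($i, parts, ":")
                    if (parts[n] == port) {
                        addr = substr($i, 1, length($i) - length(parts[n]) - 1)
                        if (addr ~ /^127\./ || addr == "[::1]") {
                            if (scope == "none") scope = "loopback"
                        } else {
                            scope = "external"
                        }
                    }
                    break
                }
            }
        }
        END { print scope }'
}
```

Note that external here only means the socket is bound to a non-loopback or wildcard address; whether a client can actually reach it still depends on the path and any firewall in between.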

Best Practices

Pin the resolver with dig @server when testing a DNS change, so you are reading a known server's answer rather than an opaque cache that may lag the real record.
Confirm propagation with dig +trace, which walks the delegation from the root and bypasses every cache, before declaring a record updated or broken.
List listeners with ss -tlnp on the server first, so you separate "nothing is bound" from "bound to localhost only" from "listening fine but unreachable" before touching the network.
Map a port to its owner with lsof -i or ss -p before restarting anything, so you act on the actual process holding the socket instead of guessing.
Treat a growing CLOSE-WAIT count as an application bug to fix in code, not a network fault, and a large TIME-WAIT count as normal churn you can usually ignore.
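When pinning a resolver, the TTL itself hints at whether the answer came from a cache: an authoritative server returns the full configured TTL on every query, while a cache returns a value that counts down. A sketch, assuming a hypothetical helper named min_ttl that reads dig answer-section lines on standard input:

```shell
# Hypothetical helper: print the smallest TTL found in dig answer lines.
# Expects lines shaped like: name. TTL IN TYPE data
# usage:
#   dig @1.1.1.1 api.example.com A +noall +answer | min_ttl
#   sleep 5
#   dig @1.1.1.1 api.example.com A +noall +answer | min_ttl   # a lower value suggests a cached answer
min_ttl() {
    awk '$3 == "IN" && $2 ~ /^[0-9]+$/ {
             t = $2 + 0
             if (!seen || t < min) { min = t; seen = 1 }
         }
         END { if (seen) print min }'
}
```

Large public resolvers may spread queries over several backends with separate caches, so two samples can disagree for reasons other than the countdown; read the trend over a few queries rather than a single pair.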

Comparable concepts: nslookup (legacy DNS tool), netcat (port reachability probe), fuser (port-to-process, like lsof)

Knowledge Check

On a server, ss -tlnp shows the service in LISTEN, yet remote clients time out connecting. What does this point to?

A firewall in the path dropping the SYN while the listener itself is healthy
The service process has crashed and is no longer running, which is why connections to its advertised port are not completing
DNS is returning the wrong IP for the service name
The port is closed and the kernel is sending RST to each client

A host accumulates thousands of sockets in CLOSE-WAIT over hours. Where is the bug?

In the application, which is failing to call close() on sockets the peer already closed
In the network path, which is intermittently dropping the connection-close packets exchanged when each peer tears down its side
In the kernel, which is holding sockets too long after a normal close
In DNS, which is resolving the peer to a stale address

You changed a DNS record an hour ago, but the app still uses the old IP. Which command shows whether the change actually reached authority?

dig name +trace, which walks from the root and skips all caches
dig name aimed at your host's default configured resolver
ss -tan to inspect the open sockets
getent hosts name run directly on the affected host to read the value the application sees
