Operating DNS
Topic 36

Running DNS for a real service is more than typing records into a panel. You replicate a zone across multiple servers so a single failure does not make you unresolvable, you serve different answers to internal and external clients with split-horizon, and you steer users to the nearest region with GeoDNS. At that point DNS has stopped being a lookup table and become a traffic-management layer — the first decision point in every request, before a single packet reaches your application.

The lever and its limit are the same thing: DNS decides where a client goes, but only at resolution time, and the answer is then cached for a TTL. Every technique here — failover, latency steering, weighted load balancing — is bounded by that caching. DNS is a coarse, sticky tool for traffic control, excellent for "send most users to the closer region" and wrong for "fail over in under a second."

[Figure: How an edit reaches the secondaries: edit primary → bump SOA serial → secondaries poll → AXFR / IXFR → in sync]

Primary, Secondary, and the SOA Serial

A zone is typically served by one primary (where you edit records) and several secondaries that copy the zone from it. The copy mechanism is a zone transfer: AXFR pulls the whole zone, IXFR pulls only the changes since the secondary's last version. The secondaries are what give you resilience — clients can resolve from any of them, and the published NS records list them all.

The trigger for a transfer is the SOA serial number. Every zone has one SOA record carrying a serial that you must increment on every change; secondaries poll the primary's SOA, compare serials, and transfer only when the primary's is higher. Forget to bump the serial and the secondaries see no change and never update — the single most common reason an edit "didn't take" on some servers but not others.

# the SOA carries the serial — bump it on every edit (YYYYMMDDnn convention)
example.com.  IN  SOA  ns1.example.com. admin.example.com. (
        2026060401   ; serial — secondaries transfer only if this rises
        7200         ; refresh — how often secondaries poll the SOA
        3600         ; retry
        1209600      ; expire
        300 )        ; minimum — negative-cache TTL
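The serial rule above can be sketched in a few lines of Python. This is a minimal illustration, not a real server's code: the function names are hypothetical, the comparison follows RFC 1982 serial arithmetic (which is how a 32-bit serial is allowed to wrap), and the bump helper assumes the YYYYMMDDnn convention with nn starting at 01.

```python
from datetime import date

SERIAL_MODULUS = 2 ** 32        # SOA serials are 32-bit unsigned values
HALF_RANGE = 2 ** 31


def serial_gt(a: int, b: int) -> bool:
    """True if serial a counts as newer than b under RFC 1982 arithmetic."""
    return a != b and ((a - b) % SERIAL_MODULUS) < HALF_RANGE


def secondary_should_transfer(primary_serial: int, secondary_serial: int) -> bool:
    """A secondary transfers only when the primary's serial is newer."""
    return serial_gt(primary_serial, secondary_serial)


def next_serial(current: int, today: date) -> int:
    """Next YYYYMMDDnn serial (hypothetical helper).

    Uses today's date with nn=01, or current + 1 when the zone has
    already been edited today, so the result always rises.
    """
    base = int(today.strftime("%Y%m%d")) * 100 + 1
    return base if base > current else current + 1
```

With the serial from the zone above, a second edit on 4 June 2026 yields 2026060402, and a secondary still holding 2026060401 would then transfer. If the serial is left unchanged, `secondary_should_transfer` returns False and the edit stays on the primary.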

Split-Horizon DNS

Split-horizon (or split-view) DNS returns a different answer for the same name depending on who asks. An internal client querying db.example.com gets a private 10.x address; an external client gets a public one, or NXDOMAIN, or a different host entirely. The server picks the view by the source of the query — typically internal resolver versus external — and serves the matching zone.
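As a rough sketch of view selection, the Python below picks a view from the query's source address and answers from that view's records. The network ranges, the view names, and the record data are illustrative assumptions, and a missing name returns None as a stand-in for NXDOMAIN.

```python
import ipaddress

# Assumed internal ranges; a real deployment lists its own networks.
INTERNAL_NETS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.0.0/16"),
]

# Hypothetical per-view zone data for the same name.
VIEWS = {
    "internal": {"db.example.com": "10.0.5.20"},
    "external": {},  # the name is not published externally
}


def pick_view(source_ip: str) -> str:
    """Choose the view from the source address of the query."""
    addr = ipaddress.ip_address(source_ip)
    if any(addr in net for net in INTERNAL_NETS):
        return "internal"
    return "external"


def resolve(name: str, source_ip: str):
    """Answer from the matching view; None stands in for NXDOMAIN."""
    return VIEWS[pick_view(source_ip)].get(name)
```

Here `resolve("db.example.com", "10.1.2.3")` returns the private address, while the same query from `203.0.113.7` returns None.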

It is how internal service names stay reachable inside a network without exposing topology outside. The failure mode is a leak: a misconfigured view, or an internal name accidentally published in the external zone, hands an attacker a map of your private hostnames and addresses. Split-horizon is only as good as the boundary between the two views, and that boundary is easy to get subtly wrong.
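One way to catch this kind of leak is to scan the external view for addresses in private ranges. The sketch below assumes the external zone has already been loaded into a plain name-to-value mapping (a hypothetical shape, not any particular tool's format) and checks only the RFC 1918 IPv4 ranges.

```python
import ipaddress

# RFC 1918 private IPv4 ranges.
RFC1918_NETS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def leaked_records(external_zone: dict) -> list:
    """Return (name, value) pairs in the external view that point at private space."""
    leaks = []
    for name, value in external_zone.items():
        try:
            addr = ipaddress.ip_address(value)
        except ValueError:
            continue  # not an IP address, e.g. a CNAME target
        if any(addr in net for net in RFC1918_NETS):
            leaks.append((name, value))
    return leaks
```

A check like this only finds leaked addresses. An internal hostname that points at a public address would need a separate comparison against the internal view's names.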

GeoDNS and Latency Steering

GeoDNS answers the same query with different records based on where the client appears to be — usually inferred from the resolver's IP. A user in Frankfurt resolving www.example.com gets the European region's address; a user in São Paulo gets the South American one. The decision is made once, at resolution time, to put each user on the closest healthy endpoint and shave the round-trip latency that dominates page load.

The structural weakness is that GeoDNS sees the resolver's location, not the user's. A client using a distant public resolver, or a centralized corporate one, gets steered by the resolver's geography and may land in the wrong region — which is the gap that EDNS Client Subnet partly closes and that anycast routing sidesteps entirely by deciding per packet.
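The resolver-versus-user gap can be illustrated with a toy GeoDNS lookup. Everything here is assumed for the example: the prefix-to-region table, the region addresses (documentation ranges), and the function names. When the query carries an EDNS Client Subnet, the sketch locates that subnet instead of the resolver.

```python
import ipaddress

# Hypothetical prefix-to-region table; real providers use a geolocation database.
GEO_TABLE = [
    (ipaddress.ip_network("198.51.100.0/24"), "eu"),
    (ipaddress.ip_network("203.0.113.0/24"), "sa"),
]
REGION_ADDR = {"eu": "192.0.2.10", "sa": "192.0.2.20"}
DEFAULT_REGION = "eu"


def region_for(ip: str) -> str:
    """Map an address to a region, falling back to a default."""
    addr = ipaddress.ip_address(ip)
    for net, region in GEO_TABLE:
        if addr in net:
            return region
    return DEFAULT_REGION


def geo_answer(resolver_ip: str, client_subnet=None) -> str:
    """Pick a regional address from the client subnet if present, else the resolver IP."""
    if client_subnet is not None:
        net = ipaddress.ip_network(client_subnet, strict=False)
        located = str(net.network_address)
    else:
        located = resolver_ip
    return REGION_ADDR[region_for(located)]
```

A user in the "sa" prefix who queries through a resolver in the "eu" prefix gets the European address from `geo_answer("198.51.100.53")`, but the South American one from `geo_answer("198.51.100.53", "203.0.113.0/24")`.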

DNS-Based Load Balancing and Failover

Health-checked record sets turn DNS into a load balancer and a failover mechanism: the provider probes each endpoint and returns only the addresses that are passing, dropping a dead host from the answer set. Combined with weights, it can split traffic across regions or shift it gradually. It works at the resolution layer, so it needs no inline device in the request path.
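A health-checked, weighted record set reduces to two steps: filter out failing endpoints, then choose among the rest in proportion to their weights. The sketch below assumes a simple list-of-dicts shape for the endpoints and a precomputed health map. Both are illustrative and do not reflect a specific provider's API.

```python
import random


def answer_set(endpoints: list, health: dict) -> list:
    """Keep only endpoints whose most recent health check passed."""
    return [e for e in endpoints if health.get(e["addr"], False)]


def pick_weighted(endpoints: list, rng=random):
    """Pick one address in proportion to its weight; None if nothing is healthy."""
    if not endpoints:
        return None
    weights = [e["weight"] for e in endpoints]
    return rng.choices(endpoints, weights=weights, k=1)[0]["addr"]


# Example: a 90/10 split across two regions, with the second host failing.
ENDPOINTS = [
    {"addr": "192.0.2.10", "weight": 90},
    {"addr": "192.0.2.20", "weight": 10},
]
HEALTH = {"192.0.2.10": True, "192.0.2.20": False}
```

With the example data, `answer_set(ENDPOINTS, HEALTH)` drops the failing host, so every answer is `192.0.2.10` until the second host passes its checks again.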

The caveat is the one that recurs through this chapter — TTL stickiness. A client that resolved just before a host failed keeps the dead address cached for the full TTL, so DNS failover is measured in the TTL's seconds-to-minutes, never sub-second. For fast failover you need a layer below DNS: an anycast address or an inline load balancer that reroutes without waiting for a cache to expire. DNS gets you coarse, eventually-correct steering, and that is its ceiling.
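The failover window can be estimated with simple arithmetic. The model below is an assumption for illustration: the provider marks a host down after a fixed number of consecutive failed checks, and a client may receive the dead address right up until that moment, then cache it for the full TTL. It ignores resolvers that clamp or extend TTLs.

```python
def worst_case_failover_seconds(ttl: int, check_interval: int, failures_to_mark_down: int) -> int:
    """Upper bound on how long a client may keep using a dead address.

    detection: time for the health checker to mark the host down
    ttl:       how long an answer handed out just before that stays cached
    """
    detection = check_interval * failures_to_mark_down
    return detection + ttl
```

Under this model, a 1-hour TTL with 30-second checks and three failures to mark down gives 3690 seconds. Dropping the TTL to 60 seconds with 10-second checks gives 90 seconds, which is still far from sub-second.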

GeoDNS vs Anycast

GeoDNS steers at the DNS layer: it returns a region-specific address at resolution time based on the resolver's location, then that answer is cached for a TTL. Use it to put users on a nearby endpoint — but expect the TTL-stickiness gap and the resolver-not-user location error.

Anycast steers at the routing layer: one address is announced from many locations and the network delivers each packet to the instance that is nearest in routing terms, which is not always the geographically closest one. The decision is made per packet, with no caching. Use it when you need fast, fine-grained failover that does not wait for a TTL, at the cost of running BGP and identical infrastructure everywhere.

Common Mistakes
  • Running DNS-based load balancing with a high TTL. Clients pin the address they first resolved, so when a host dies they keep hitting it for the full TTL, and the load balancer ends up sending users to a dead server.
  • Leaking internal names through a misconfigured split-horizon. An internal hostname accidentally published in the external view hands an attacker a map of your private topology and addresses you meant to hide.
  • Forgetting to bump the SOA serial after an edit. Secondaries compare serials and transfer only when the primary's rises, so an unchanged serial means the change lives only on the primary and never reaches the secondaries.
  • Relying on DNS failover for sub-second recovery. TTL caching means failover takes the TTL's seconds to minutes, so a service expecting instant failover keeps sending traffic to the dead endpoint until caches expire.
  • Assuming GeoDNS sees the user's location. It sees the resolver's IP, so a client on a distant or centralized resolver gets steered to the wrong region, and the latency win you designed for evaporates.

Best Practices
  • Increment the SOA serial on every zone edit (a YYYYMMDDnn convention works), so secondaries detect the change and transfer the new version instead of silently staying stale.
  • Keep TTLs low (60–300s) on any record set used for DNS load balancing or failover, so a dead host clears from caches in minutes rather than pinning clients to it.
  • Run at least two secondaries on separate networks and providers, so a primary outage or one provider's failure does not make your whole zone unresolvable.
  • Reach for anycast or an inline load balancer when you need sub-second failover, and use DNS-based steering only for coarse, latency-level routing where a TTL delay is acceptable.
  • Audit the external view of a split-horizon zone regularly for leaked internal names, so a misconfiguration does not quietly expose private hostnames and addresses.
Comparable concepts
  • Anycast (routing-layer steering)
  • Global load balancers (managed GeoDNS)

Knowledge Check

You edit a record on the primary, but the secondaries still serve the old value. The most likely cause?

  • You changed the record but did not bump the SOA serial
  • The record's TTL is too high for the secondaries to refresh
  • DNSSEC signing must complete before any transfer can run
  • The primary failed to push the change to each secondary

A service runs DNS-based failover with a 1-hour TTL. A host dies. What happens to clients that resolved just before?

  • They keep hitting the dead host for up to the full hour, until the cache expires
  • The resolver instantly invalidates every cached copy of the address the moment the health check fails
  • The network reroutes their packets to a healthy host immediately
  • They retry against a fresh answer within about five minutes

You need sub-second failover between regions. Why is GeoDNS the wrong layer, and what fits better?

  • GeoDNS is cached for a TTL; anycast decides per packet and fails over without waiting for caches to expire
  • GeoDNS lookups are too slow per query; anycast resolves names faster
  • GeoDNS cannot detect a dead region at all, whereas anycast is the one that adds working health checks
  • GeoDNS runs on one server; anycast simply adds a second server
