Topic 72

Container and Kubernetes Networking

Containers

Containers turned networking inside out. Instead of a handful of long-lived VMs each with one IP, a Kubernetes node runs dozens of pods that each get their own IP and churn constantly — created, killed, and rescheduled in seconds. A node with 110 pods is handing out 110 routable addresses that change identity all day, and a CNI (Container Network Interface) plugin is what wires each new pod into the network the instant it starts. None of the underlying ideas are new: it is IPs, routing, NAT, DNS, and load balancing from the earlier chapters, recomposed at container scale and container lifecycle.

Kubernetes mandates one flat model: every pod can reach every other pod by IP, across nodes, with no NAT in between. On top of that flat fabric it layers three abstractions that make the churn survivable — Services give a stable virtual IP in front of a set of pods whose real IPs come and go, cluster DNS turns service names into those virtual IPs, and NetworkPolicy segments what would otherwise be a fully-open network where any pod can talk to any other. Get those three and the rest of cluster networking is detail.

Figure: from node down to containers, with the Service in front
Node: node-a, a /24 slice of the pod CIDR
Pod: 10.244.1.23, its own IP and network namespace
Containers: share localhost and must coordinate ports
Service: 10.96.0.42, a stable VIP over churning pods

The Pod Network Namespace

A pod is a group of containers that share one Linux network namespace — one IP, one routing table, one set of ports, which is why containers in the same pod talk over localhost and must not collide on ports. The CNI plugin builds that namespace by creating a veth pair: a virtual cable with one end inside the pod's namespace (its eth0) and the other end plugged into a bridge or routing construct in the node's root namespace. Traffic leaving the pod crosses the veth into the node, which then forwards it on.

Every pod's IP comes from a per-node slice of the cluster's pod CIDR — a large block like 10.244.0.0/16 carved into a /24 per node. This is where the chapter-3 overlap trap reappears with a vengeance: if the pod CIDR overlaps the VPC's CIDR, or two clusters you later want to connect share a pod range, routing becomes ambiguous and traffic blackholes — and just like VPC peering, the only fix is renumbering the cluster, which means rebuilding it. Plan the pod, service, and node CIDRs to be disjoint from each other and from every network you will ever route to.
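The carving and the overlap check described above can be sketched with Python's standard ipaddress module. This is a planning aid under stated assumptions, not something the cluster runs: the Service range 10.96.0.0/12 and the VPC range 10.240.0.0/12 are hypothetical values chosen for illustration, and the helper names are invented here.

```python
import ipaddress

def node_slices(pod_cidr, new_prefix=24):
    """Return the per-node subnets carved out of the cluster pod CIDR."""
    return list(ipaddress.ip_network(pod_cidr).subnets(new_prefix=new_prefix))

def find_overlaps(named_cidrs):
    """Return every pair of names whose CIDRs overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in named_cidrs.items()}
    names = sorted(nets)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if nets[a].overlaps(nets[b])
    ]

slices = node_slices("10.244.0.0/16")
print(len(slices), slices[1])          # 256 10.244.1.0/24

# Hypothetical address plan: the VPC range is an assumption for illustration.
plan = {
    "pod": "10.244.0.0/16",
    "service": "10.96.0.0/12",
    "vpc": "10.240.0.0/12",
}
print(find_overlaps(plan))             # [('pod', 'vpc')]
```

Running a check like this over every range you might ever route to, before the cluster exists, is far cheaper than discovering the overlap after pods are already addressed.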

CNI — Overlay versus Routed

CNI is a contract, not a product: Kubernetes calls a plugin binary with "set up networking for this pod" and the plugin — Calico, Cilium, Flannel, the cloud's own — decides how. The decision that matters most is overlay versus routed. An overlay CNI encapsulates pod-to-pod traffic in VXLAN or Geneve (the previous topic, now per-pod), so it works on any underlay that can carry UDP between nodes, at the cost of the same ~50-byte MTU tax and reduced visibility. A routed CNI instead programs real routes — often via BGP, as Calico does — so pod packets travel native with no encapsulation, faster and fully visible, but only if the underlying fabric or VPC will accept routes to pod CIDRs.

Cilium pushes the data plane further by using eBPF in the kernel instead of iptables, which matters once a node has thousands of services and the linear iptables ruleset becomes a bottleneck. The choice is the classic portability-versus-performance call: overlay runs anywhere and is the safe default; routed is faster and observable but needs the fabric's cooperation, which on a cloud VPC means the CNI integrating with the provider's routing so pod IPs are first-class VPC addresses.
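The ~50-byte overlay tax is plain header arithmetic, which the sketch below works through. The header sizes are the standard ones for IPv4, IPv6, UDP, VXLAN, and Ethernet; the function name and the tunnel_options parameter (for Geneve options, which vary by deployment) are assumptions made for this illustration.

```python
# Header sizes in bytes for the encapsulation layers.
OUTER_IPV4 = 20
OUTER_IPV6 = 40
UDP = 8
TUNNEL = 8           # VXLAN header; Geneve's base header is also 8 bytes
INNER_ETHERNET = 14  # the pod's Ethernet frame rides inside the tunnel

def inner_mtu(underlay_mtu, outer_ip=OUTER_IPV4, tunnel_options=0):
    """Largest pod-side IP packet that still fits in one underlay packet."""
    overhead = outer_ip + UDP + TUNNEL + INNER_ETHERNET + tunnel_options
    return underlay_mtu - overhead

print(inner_mtu(1500))                       # 1450
print(inner_mtu(9000))                       # 8950
print(inner_mtu(1500, outer_ip=OUTER_IPV6))  # 1430
```

The same arithmetic explains both remedies: lower the pod-side MTU to the computed value, or raise the underlay MTU until full-size pod packets fit.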

# a node's pods each hold an IP from this node's slice of the pod CIDR
kubectl get pods -o wide
# NAME        READY   STATUS    IP            NODE
# web-7d9f    1/1     Running   10.244.1.23   node-a    <- per-pod IP
# web-8c2a    1/1     Running   10.244.2.11   node-b    <- different node slice
# api-5b1c    1/1     Running   10.244.1.24   node-a

Services and kube-proxy

Pod IPs are disposable, so you never target them directly. A Service is a stable virtual IP — a ClusterIP — that fronts the current set of pods matching a label selector; the set of real pod IPs behind it (the endpoints) updates automatically as pods come and go. A client connects to the ClusterIP, and the node's data plane rewrites the destination to one of the live pods. This is load balancing and a layer of NAT, applied to a target set that changes every few seconds.

The component doing that rewrite has historically been kube-proxy, which programs iptables (or IPVS) rules on every node so that traffic to a ClusterIP is DNAT'd to a backend pod. The catch is that a ClusterIP is purely virtual: no process listens on it, and it exists only as a forwarding rule. A packet capture taken on the wire between nodes and filtered on the ClusterIP therefore shows nothing, because by the time the packet leaves the node its destination has already been rewritten to a real pod IP. Other data planes (Cilium's eBPF, kube-proxy's IPVS mode) replace the linear iptables chains with hash lookups that scale to thousands of services.

# the ClusterIP is virtual; EXTERNAL-IP <none> means in-cluster only
kubectl get svc web
# NAME   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)
# web    ClusterIP   10.96.0.42      <none>        80/TCP
# traffic to 10.96.0.42:80 is DNAT'd to a live endpoint pod by the node
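One way to see how a chain of iptables rules can still spread connections evenly: iptables-mode kube-proxy is commonly described as giving rule i of n a random-match probability of 1/(n - i), with the last rule matching unconditionally. The sketch below only checks that arithmetic; it is not kube-proxy's code, the function names are invented here, and the third endpoint address is hypothetical.

```python
from fractions import Fraction

def cascade_probabilities(n):
    """Per-rule match probability for n backends: 1/n, 1/(n-1), ..., 1."""
    return [Fraction(1, n - i) for i in range(n)]

def effective_share(rule_probs):
    """Chance that each rule is the one that finally matches a new connection."""
    shares = []
    remaining = Fraction(1)          # probability of reaching this rule at all
    for p in rule_probs:
        shares.append(remaining * p)
        remaining *= 1 - p           # fell through to the next rule
    return shares

# The first two addresses come from the pod listing; the third is hypothetical.
endpoints = ["10.244.1.23:80", "10.244.2.11:80", "10.244.2.12:80"]
probs = cascade_probabilities(len(endpoints))
shares = effective_share(probs)

print([str(p) for p in probs])    # ['1/3', '1/2', '1']
print([str(s) for s in shares])   # ['1/3', '1/3', '1/3']
```

Each new connection walks the rules in order, which is also why the chain gets slower as Services and endpoints multiply.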

Cluster DNS and Network Policy

A Service also gets a DNS name, web.default.svc.cluster.local, served by the in-cluster DNS (CoreDNS), so application code in the same namespace connects to web, and code in another namespace to web.default, instead of memorizing a virtual IP. This is the chapter-5 DNS pattern scoped to the cluster: names resolve to ClusterIPs, ClusterIPs forward to pods, and the whole churn underneath stays invisible to the caller. The kubelet wires each pod's /etc/resolv.conf to CoreDNS automatically, including the search domains that make the short names work.
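How a short name becomes a full Service name can be sketched as a search-list expansion. This is a simplified model, assuming the typical pod resolv.conf layout (search domains for the namespace, svc, and cluster domain, plus options ndots:5); real pods may carry extra search domains from the node, and the function name is invented for this illustration.

```python
def search_candidates(name, namespace, cluster_domain="cluster.local", ndots=5):
    """Return the fully qualified names a resolver would try, in order."""
    if name.endswith("."):
        return [name]                # already absolute: no search list applied
    search = [
        f"{namespace}.svc.{cluster_domain}",
        f"svc.{cluster_domain}",
        cluster_domain,
    ]
    expanded = [f"{name}.{suffix}." for suffix in search]
    absolute = f"{name}."
    if name.count(".") < ndots:
        return expanded + [absolute]  # few dots: search list first
    return [absolute] + expanded      # enough dots: try the name as given first

for candidate in search_candidates("web", "default"):
    print(candidate)
# web.default.svc.cluster.local.
# web.svc.cluster.local.
# web.cluster.local.
# web.
```

The first candidate is the Service's real name, so a lookup for web from the default namespace succeeds on the first query. Under the same model, an external name with fewer than five dots walks the whole search list before it is tried as given.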

By default that flat pod network is fully open: any pod can connect to any other pod and any Service, across namespaces, with nothing in the way. A NetworkPolicy is the segmentation you add back: a namespaced object that selects pods by label and declares which ingress and egress is allowed. The CNI enforces it, and only a CNI that implements policy, such as Calico or Cilium, will honor it; Flannel on its own does not. The default-deny semantics are subtle: a pod is unrestricted until some policy selects it, after which only explicitly allowed traffic flows in that policy's direction. The foundational policy is therefore a default-deny that selects every pod and allows nothing, so that every connection has to be justified by a further policy.

# default-deny ingress for a namespace: selects all pods, allows nothing
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}        # {} = every pod in the namespace
  policyTypes: [Ingress]   # no ingress rules listed = deny all in
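The selection rule behind that manifest (unrestricted until selected, then allow-list only, with allows from multiple policies added together) can be modeled in a few lines. This is a deliberately reduced sketch: it covers ingress in a single namespace and ignores ports, namespace selectors, ipBlock, and egress. The IngressPolicy class and the app labels are hypothetical, not the Kubernetes API.

```python
from dataclasses import dataclass, field

def matches(selector, labels):
    """An empty selector {} matches every pod, like podSelector: {}."""
    return all(labels.get(key) == value for key, value in selector.items())

@dataclass
class IngressPolicy:
    pod_selector: dict                               # which pods the policy selects
    allow_from: list = field(default_factory=list)   # source label selectors

def ingress_allowed(src_labels, dst_labels, policies):
    selecting = [p for p in policies if matches(p.pod_selector, dst_labels)]
    if not selecting:
        return True    # no policy selects the destination: unrestricted
    # Policies are additive: any one allow rule is enough.
    return any(matches(sel, src_labels) for p in selecting for sel in p.allow_from)

web, api, db = {"app": "web"}, {"app": "api"}, {"app": "db"}
default_deny = IngressPolicy(pod_selector={})
api_to_db = IngressPolicy(pod_selector={"app": "db"}, allow_from=[{"app": "api"}])

print(ingress_allowed(web, db, []))                           # True
print(ingress_allowed(web, db, [default_deny]))               # False
print(ingress_allowed(api, db, [default_deny, api_to_db]))    # True
print(ingress_allowed(web, db, [default_deny, api_to_db]))    # False
```

The second and third results show the working pattern: the default-deny closes everything, and each narrow policy reopens exactly one flow.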

Overlay vs Routed CNI

Overlay CNI — encapsulates pod traffic in VXLAN or Geneve, so it runs on any underlay that carries UDP between nodes without the fabric knowing pod CIDRs exist. Choose it as the safe default, on networks you don't control, and accept the ~50-byte MTU tax and the loss of native visibility.

Routed CNI — programs real routes (Calico via BGP, or the cloud's own integration) so pod packets travel native with no encapsulation: faster, fully visible to captures, pod IPs first-class in the VPC. Choose it when you control the fabric and can let it learn routes to the pod CIDRs.

Common Mistakes

Choosing a pod CIDR that overlaps the VPC range or another cluster's pod range. Routing goes ambiguous and traffic blackholes; like VPC peering (chapter 3) the only fix is renumbering, which for a cluster means rebuilding it.
Running an overlay CNI with the inner MTU left at 1500. A full-size inner packet plus the ~50 bytes of VXLAN/Geneve overhead no longer fits in a 1500-byte underlay packet and, with PMTUD broken (chapter 12's small-works/large-hangs failure), pod-to-pod bulk transfers stall while health checks pass.
Hardcoding or caching pod IPs. They are ephemeral — a rescheduled pod gets a new IP — so anything pointing at a pod IP instead of a Service breaks the next time the pod restarts.
Shipping a cluster with no NetworkPolicy. The pod network is flat and fully open by default, so one compromised pod can reach every database and Service in every namespace — lateral movement with nothing to stop it.
Expecting a NetworkPolicy to be enforced under a CNI that ignores it. Flannel does not implement policy; the object applies cleanly and silently does nothing, leaving you believing you are segmented when you are wide open.

Best Practices

Plan pod, Service, and node CIDRs to be mutually disjoint and disjoint from every VPC and on-prem range before creating the cluster, because renumbering a live cluster is a rebuild, not a setting change.
Always address workloads through Services and DNS names, never pod IPs, so the stable ClusterIP absorbs the pod churn and your config survives every reschedule.
Apply a default-deny NetworkPolicy per namespace and open only the specific flows each workload needs, turning the flat open network into one where every connection is justified — and confirm your CNI actually enforces policy.
Match the inner MTU to the overlay's encapsulation (typically 1450, or raise the underlay to jumbo frames), so pod-to-pod bulk transfers don't black-hole on full-size packets.
Use an eBPF or IPVS data plane (Cilium, kube-proxy IPVS) once a cluster runs thousands of Services, so service lookup is a hash rather than a linear iptables chain that adds latency per rule.

Comparable concepts: Docker bridge / overlay (single-host); Service mesh (the L7 layer); CNI plugins: Calico / Cilium

Knowledge Check

Why do Kubernetes Services exist instead of just connecting to pod IPs?

Pod IPs are ephemeral, so a Service gives a stable virtual IP that tracks the live pods behind it
A Service transparently encrypts the pod-to-pod traffic that would otherwise travel across the cluster in plaintext
Pods cannot be reached by IP at all, so a Service is the only way to address them
A Service assigns each pod a public IP so external clients can reach it

What is the tradeoff between an overlay CNI and a routed CNI?

Overlay runs on any underlay but taxes MTU and visibility; routed is faster but needs fabric cooperation
Routed mode encapsulates every packet in an outer header, while overlay mode is the one that sends pod traffic natively with no added headers at all
Only overlay CNIs can enforce NetworkPolicy; routed CNIs cannot segment traffic
Overlay raises the usable MTU, whereas routed mode forces fragmentation of large packets

A cluster ships with no NetworkPolicy. What is the security consequence?

The pod network is flat and open, so one compromised pod can reach everything across namespaces
Pods are isolated by default, so no pod can talk to any other pod until a policy explicitly allows that flow
Cluster DNS stops resolving, because NetworkPolicy is required for CoreDNS to work
Services stop load-balancing, since policy is what distributes traffic to endpoints
