Topic 02

Containers and Runtimes


Kubernetes runs containers, so it is worth being precise about what a container actually is. It is not a small virtual machine. It is an ordinary process on the host, isolated by features the Linux kernel already provides, started from an image that bundles the process and everything it needs to run.

Between Kubernetes and that running process is a chain of components: the kubelet, a standard interface called the CRI, a high-level runtime such as containerd, and a low-level runtime such as runc. Knowing the chain explains why "Docker was removed from Kubernetes" did not break anyone's images, and why you can swap runtimes without touching your workloads.

Images and Layers

An image is a read-only template: a stack of filesystem layers plus metadata that says which process to start, what environment to set, and which ports it expects. You build it once and run it anywhere a compatible runtime exists — the laptop, the test cluster, and production all see the same bytes. That reproducibility is what ended the era of "it works on my machine."
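To make the "metadata" half of an image concrete, here is a minimal sketch in Python. It assumes a configuration shaped like the `config` object of an OCI image; the specific values (`/app/server`, `APP_MODE`, port 8080) and the helper names are invented for illustration and do not come from a real image.

```python
import json

# Hypothetical image configuration, shaped like the "config" object of an
# OCI image. All values are invented for illustration.
IMAGE_CONFIG = json.loads("""
{
  "architecture": "amd64",
  "os": "linux",
  "config": {
    "User": "10001",
    "Env": ["PATH=/usr/local/bin:/usr/bin:/bin", "APP_MODE=production"],
    "Entrypoint": ["/app/server"],
    "Cmd": ["--listen", ":8080"],
    "ExposedPorts": {"8080/tcp": {}}
  }
}
""")


def start_command(image_config):
    """Return the argv a runtime would start: Entrypoint followed by Cmd."""
    cfg = image_config.get("config", {})
    return list(cfg.get("Entrypoint") or []) + list(cfg.get("Cmd") or [])


def environment(image_config):
    """Turn the KEY=VALUE list into a dictionary."""
    pairs = image_config.get("config", {}).get("Env") or []
    return dict(item.split("=", 1) for item in pairs)


def exposed_ports(image_config):
    """List the ports the image declares, for example '8080/tcp'."""
    ports = image_config.get("config", {}).get("ExposedPorts") or {}
    return sorted(ports.keys())


if __name__ == "__main__":
    print(start_command(IMAGE_CONFIG))
    print(environment(IMAGE_CONFIG)["APP_MODE"])
    print(exposed_ports(IMAGE_CONFIG))
```

The point of the sketch is that the start command, environment, and declared ports travel inside the image, so every runtime that reads the same image derives the same process.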

Layers are cached and shared. A base layer like a minimal Linux userland is downloaded once and reused across many images, so only your application layer moves on a rebuild. Images live in a registry — a content-addressable store — and are pulled to a node by digest or tag. The format is standardized by the Open Container Initiative (OCI), which is why images are portable across runtimes.
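The sharing described above follows from content addressing: a layer is stored under the hash of its own bytes, so two images that contain the same base layer refer to the same blob. The sketch below is a toy model under that assumption. The `LayerStore` class and the layer contents are hypothetical; a real registry stores compressed tar archives and resolves them through manifests.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content address of a blob, in the familiar 'sha256:<hex>' form."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


class LayerStore:
    """Toy content-addressable store: blobs are keyed by their own digest."""

    def __init__(self):
        self.blobs = {}
        self.downloads = 0

    def pull(self, layers):
        """Store each layer unless a blob with the same digest is present."""
        refs = []
        for data in layers:
            key = digest(data)
            if key not in self.blobs:
                self.blobs[key] = data
                self.downloads += 1  # only unseen layers are transferred
            refs.append(key)
        return refs


# Two builds of the same application on top of one shared base layer.
BASE = b"minimal linux userland"
store = LayerStore()
image_v1 = store.pull([BASE, b"app build 1"])
image_v2 = store.pull([BASE, b"app build 2"])

if __name__ == "__main__":
    print("layers referenced:", len(image_v1) + len(image_v2))
    print("layers downloaded:", store.downloads)
```

Four layers are referenced across the two images, but the base layer is transferred only once, which is the behaviour a rebuild relies on.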

What the Kernel Actually Does

A container is isolation assembled mainly from two kernel features. Namespaces give a process its own view of the world (its own process tree, network interfaces, mounts, and hostname), so it cannot see the rest of the host. cgroups limit and account for what it can consume: CPU, memory, I/O. Put a process in its own namespaces and a cgroup and you have a container. There is no hypervisor and no guest kernel involved.
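Both features are visible from an ordinary process on a Linux host. The following sketch reads the namespace links under `/proc/<pid>/ns` and parses a cgroup v2 `memory.max` value. The function names are hypothetical, the code only inspects and does not create any isolation, and on a system without `/proc` the namespace lookup simply returns an empty dictionary.

```python
import os
import re

# A namespace link target looks like 'pid:[4026531836]'.
_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'pid:[4026531836]' into (type, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid="self"):
    """Return {entry name: namespace inode} for a process, or {} off Linux."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    if not os.path.isdir(ns_dir):
        return result
    for name in sorted(os.listdir(ns_dir)):
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # unreadable or unexpected entry, skip it
        result[name] = inode
    return result


def parse_memory_max(text):
    """Parse cgroup v2 memory.max: 'max' means no limit, otherwise bytes."""
    value = text.strip()
    return None if value == "max" else int(value)


if __name__ == "__main__":
    for name, inode in namespaces_of().items():
        print(f"{name:20s} {inode}")
```

Two processes that print the same inode for an entry share that namespace; a containerized process normally shows different inodes from a process running directly on the host.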

This is the key difference from a virtual machine. A VM virtualizes hardware and runs a full guest operating system; a container shares the host kernel and isolates at the process level. Containers are far lighter and typically start in well under a second, but the shared kernel means the isolation boundary is weaker: a kernel vulnerability is a shared risk. When you need a VM-grade boundary for untrusted code, you reach for a sandboxed runtime, covered below.

The Container Runtime Interface

The kubelet — the agent on every node — does not run containers itself. It speaks the Container Runtime Interface (CRI), a gRPC API, to whatever runtime is installed. This indirection is deliberate: Kubernetes defined a stable contract so any conformant runtime can plug in without changes to Kubernetes core.

It helps to separate two layers. A high-level (CRI) runtime like containerd or CRI-O manages images, pulls, and the container lifecycle, and implements the CRI. A low-level (OCI) runtime like runc is the small tool that actually creates the namespaces and cgroups and starts the process. The high-level runtime calls the low-level one. You configure the high-level runtime on the node; runc is usually just there underneath.

From kubelet to a running process

kubelet → CRI → containerd / CRI-O → OCI → runc → your process (namespaces + cgroups)
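The layering can be modelled in a few lines of Python. This is a sketch of the shape of the chain only: the class and method names (`Kubelet.sync_pod`, `run_container`, `create_and_start`) are invented for illustration and do not mirror the real CRI gRPC calls or the runc command line.

```python
from typing import List, Protocol


class OCIRuntime(Protocol):
    """Low-level runtime: creates the isolation and starts the process."""

    def create_and_start(self, argv: List[str]) -> str: ...


class Runc:
    """Stand-in for runc; it only describes what the real tool would do."""

    def create_and_start(self, argv: List[str]) -> str:
        return "runc: namespaces + cgroups, exec " + " ".join(argv)


class HighLevelRuntime:
    """Stand-in for containerd or CRI-O: images and lifecycle live here."""

    def __init__(self, name: str, oci: OCIRuntime) -> None:
        self.name = name
        self.oci = oci
        self.pulled: List[str] = []

    def run_container(self, image: str, argv: List[str]) -> str:
        if image not in self.pulled:
            self.pulled.append(image)  # pretend to pull the image once
        return self.oci.create_and_start(argv)


class Kubelet:
    """Knows only the high-level interface, never the low-level runtime."""

    def __init__(self, cri: HighLevelRuntime) -> None:
        self.cri = cri

    def sync_pod(self, image: str, argv: List[str]) -> str:
        return self.cri.run_container(image, argv)


if __name__ == "__main__":
    for name in ("containerd", "CRI-O"):
        kubelet = Kubelet(HighLevelRuntime(name, Runc()))
        print(name, "->", kubelet.sync_pod("registry.example.test/app:1.0", ["/app/server"]))
```

Swapping the high-level runtime changes nothing the kubelet or the workload can observe, which is the property the stable interface is meant to provide.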

Choosing a Runtime

containerd and CRI-O are the two common CRI runtimes. containerd is a general-purpose runtime (it also powers Docker) and is the default on most managed clusters. CRI-O is purpose-built for Kubernetes and ships in OpenShift. For ordinary workloads the choice rarely matters; both run OCI images through runc.

The choice that does matter is the isolation boundary. runc shares the host kernel. When you run untrusted or multi-tenant code and want a stronger boundary, a sandboxed runtime swaps in at the OCI layer: gVisor intercepts syscalls in a userspace kernel, and Kata Containers wraps each container in a lightweight VM. Both trade some performance and density for isolation.

Runtime	Isolation	Cost	Use when
runc	Shared host kernel	Lowest overhead	Trusted workloads (the default)
gVisor	Userspace kernel intercepting syscalls	Some CPU/syscall overhead	Untrusted code, stronger boundary
Kata Containers	Lightweight VM per container	VM-level overhead	Hard multi-tenant isolation
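In Kubernetes, the per-workload choice from the table is expressed with a RuntimeClass resource, which a Pod references through `runtimeClassName`. The sketch below builds both manifests as Python dictionaries. The class names (`gvisor`, `kata`), the handler names (`runsc`, `kata`), and the trust levels are assumptions for illustration; real handler names depend on how the node's CRI runtime is configured.

```python
import json

# Assumed RuntimeClass name -> handler configured in the node's CRI runtime.
RUNTIME_CLASS_HANDLERS = {"gvisor": "runsc", "kata": "kata"}

# Assumed policy: which RuntimeClass each trust level gets (None = default runc).
TRUST_TO_RUNTIME_CLASS = {
    "trusted": None,
    "untrusted": "gvisor",
    "multi-tenant": "kata",
}


def runtime_class_manifest(name):
    """Build a RuntimeClass manifest for one of the assumed class names."""
    return {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        "handler": RUNTIME_CLASS_HANDLERS[name],
    }


def pod_manifest(name, image, trust):
    """Build a Pod manifest, adding runtimeClassName only when needed."""
    spec = {"containers": [{"name": name, "image": image}]}
    runtime_class = TRUST_TO_RUNTIME_CLASS[trust]
    if runtime_class is not None:
        spec["runtimeClassName"] = runtime_class
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": spec,
    }


if __name__ == "__main__":
    print(json.dumps(runtime_class_manifest("gvisor"), indent=2))
    print(json.dumps(pod_manifest("plugin", "registry.example.test/plugin:1.0", "untrusted"), indent=2))
```

The image and the container definition are identical across trust levels; only the one `runtimeClassName` field decides which low-level runtime starts the process.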

The Dockershim Removal

Docker Engine predates the CRI and does not speak it. Early Kubernetes bridged the gap with an adapter called the dockershim. Maintaining a special case for one runtime was a burden, and the shim was removed in Kubernetes 1.24. Headlines read "Kubernetes drops Docker," which scared people unnecessarily.

Nothing happened to your images. Docker builds OCI images, and OCI images run fine on containerd and CRI-O — the same images you already had. What changed is the node-level component the kubelet talks to. Cluster operators moved from dockershim to containerd; application teams did nothing. The lesson worth keeping: the runtime under the kubelet is an operational detail, decoupled from the images you ship.

containerd vs CRI-O vs Docker Engine

containerd — general-purpose CRI runtime, default on most managed clusters; also the engine inside Docker.

CRI-O — a lean CRI runtime built specifically for Kubernetes; the default in OpenShift.

Docker Engine — a developer toolchain built on containerd. Great for building images locally; not what the kubelet talks to since the dockershim was removed.

Common Mistakes

Believing "Docker was removed" means your Docker-built images stopped working — they are OCI images and run unchanged.
Running containers as root because the image does, instead of setting a non-root user and dropping capabilities.
Shipping fat images with build tools and a full OS, inflating pull time and attack surface — prefer minimal or distroless bases.
Treating a container as a security boundary equivalent to a VM; with runc the host kernel is shared.
Confusing the image (the template) with the container (a running instance of it) when reasoning about state.

Best Practices

Build minimal images with multi-stage builds; ship only the runtime artifact, not the toolchain.
Run as a non-root user, set a read-only root filesystem where possible, and drop unneeded Linux capabilities.
Pin images by digest, not a moving tag like latest, so a node always pulls the exact bytes you tested.
Match the runtime to the trust level — runc for your own code, gVisor or Kata for untrusted or hard multi-tenant workloads.
Keep the node's CRI runtime (containerd/CRI-O) an operations concern, decoupled from how application teams build images.
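Several of these practices can be checked mechanically. The following sketch lints a container definition, written as a Python dictionary using the Kubernetes `securityContext` field names, for digest pinning, non-root execution, a read-only root filesystem, and dropped capabilities. The `lint_container` function, the messages, and the example image references are hypothetical; a real cluster would more likely enforce this through an admission policy.

```python
import re

# An image reference pinned by digest ends in '@sha256:<64 hex characters>'.
_DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")


def lint_container(container):
    """Return a list of findings for one container definition."""
    findings = []
    if not _DIGEST.search(container.get("image", "")):
        findings.append("image is not pinned by digest")
    security = container.get("securityContext") or {}
    if security.get("runAsNonRoot") is not True:
        findings.append("runAsNonRoot is not set")
    if security.get("readOnlyRootFilesystem") is not True:
        findings.append("root filesystem is writable")
    dropped = (security.get("capabilities") or {}).get("drop") or []
    if "ALL" not in dropped:
        findings.append("capabilities are not dropped")
    return findings


# Hypothetical examples: the digest is a placeholder, not a real image.
GOOD = {
    "name": "app",
    "image": "registry.example.test/app@sha256:" + "a" * 64,
    "securityContext": {
        "runAsNonRoot": True,
        "readOnlyRootFilesystem": True,
        "capabilities": {"drop": ["ALL"]},
    },
}
BAD = {"name": "app", "image": "registry.example.test/app:latest"}

if __name__ == "__main__":
    print(lint_container(GOOD))
    for finding in lint_container(BAD):
        print("-", finding)
```

Dropping `ALL` is stricter than "drop unneeded capabilities"; a workload that needs a specific capability would add it back explicitly, and the check would need to be relaxed accordingly.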

Related

Docker: builds the OCI images and provides the local dev runtime
OCI: the image and runtime standards that make images portable
Kata / Firecracker: VM-grade isolation when a shared kernel is not enough

Knowledge Check

What does the Container Runtime Interface (CRI) provide?

A stable API the kubelet uses to drive any conformant runtime, so runtimes can be swapped without changing Kubernetes
A faster on-disk image format that replaces OCI and compresses layers more aggressively
A built-in image registry that every node in the cluster pulls images from directly
A sandbox layer that isolates every container from the host kernel by running its own intercepting userspace kernel

Why did removing the dockershim in Kubernetes 1.24 not break existing images?

Docker builds OCI images, which run unchanged on containerd and CRI-O — only the kubelet's runtime changed
Kubernetes automatically rebuilt every running image into a new non-OCI format during the node upgrade to 1.24
Docker-built images were never actually used by Kubernetes in production clusters
The dockershim adapter was quietly re-added in a later 1.24 patch release

You need to run untrusted third-party code with a stronger boundary than runc gives. What fits?

A sandboxed runtime like gVisor or Kata Containers, accepting some performance cost
A larger CPU and memory limit on the Pod to contain any misbehaving process
Switching the high-level runtime from containerd to CRI-O to get a tighter host kernel boundary
Running the container as root so it gets its own fully isolated kernel namespace
