Topic 63

Data and ML Platform


The second case is a data and machine-learning platform: batch processing, GPU training jobs, large datasets, and pipeline orchestration. It pushes Kubernetes in directions the web-tier pattern never does — expensive specialized hardware, heavy state, and workflows rather than long-running services.

It is also a case where Kubernetes both fits well and strains, and being honest about the seams matters more than forcing everything onto the platform.

The Workload Shape

This platform is mostly Jobs, not services: batch ETL, model training runs, and scheduled pipelines, plus some long-running inference services that do look like the web pattern. Much of it is GPU work, and GPUs are costly and scarce, so they need to be packed tightly. Datasets are large, so data locality and throughput matter. Because run-to-completion batch and steady inference share the platform, two scheduling patterns coexist in one cluster.

GPU Scheduling

GPUs are exposed to Kubernetes through device plugins that advertise them as schedulable extended resources; Pods request a GPU much as they request CPU or memory, except that GPUs are requested in whole units and cannot be overcommitted. The hard part is utilization: GPUs are expensive and easily left idle, so sharing techniques such as time-slicing or MIG (NVIDIA's Multi-Instance GPU, which partitions one physical GPU into isolated instances) and careful bin-packing matter. GPU nodes are tainted so that only workloads tolerating the taint land on them, and spot/preemptible GPU capacity cuts cost for restartable training.
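
As a minimal sketch of the scheduling pieces above, the following Python builds a Job manifest that requests one GPU and tolerates a GPU-node taint. The resource name `nvidia.com/gpu` is the one NVIDIA's device plugin advertises; the taint key, the image, the Job name, and the helper `gpu_training_job` are hypothetical and depend on how a given cluster is set up.

```python
import json


def gpu_training_job(name, image, gpus=1):
    """Build a batch/v1 Job manifest that requests whole GPUs."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 3,  # retry a failed training Pod a few times
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    # Assumed taint key; use whatever taint your GPU nodes carry.
                    "tolerations": [{
                        "key": "nvidia.com/gpu",
                        "operator": "Exists",
                        "effect": "NoSchedule",
                    }],
                    "containers": [{
                        "name": "train",
                        "image": image,  # hypothetical image
                        # Extended resources go under limits, in whole units.
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }],
                }
            },
        },
    }


manifest = gpu_training_job("train-demo", "registry.example.com/train:latest")
print(json.dumps(manifest, indent=2))
```

The toleration only permits the Pod onto tainted nodes; it is the GPU resource limit that restricts placement to nodes actually advertising GPUs. Since `kubectl apply -f` accepts JSON as well as YAML, the printed output could be applied directly.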

Pipelines and Storage

Orchestration uses workflow engines built on Kubernetes, such as Argo Workflows for DAGs of containerized steps or ML-specific platforms like Kubeflow, rather than hand-wired CronJobs, because real pipelines have dependencies, retries, and fan-out. Storage is the other strain: large datasets need high-throughput volumes provisioned through CSI drivers and, critically, attention to data locality. Moving terabytes to compute is slow and expensive, so compute often goes to the data, not the reverse.
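
To make "dependencies, retries, and fan-out" concrete, here is a hedged sketch of an Argo Workflows DAG expressed as a Python dictionary. The step names, the image, and the helpers `step` and `execution_order` are illustrative assumptions; the manifest fields (`dag.tasks`, `dependencies`, `retryStrategy.limit`) follow the Argo Workflows spec. The standard library's `graphlib` is used only to check that the declared dependencies yield a valid run order.

```python
from graphlib import TopologicalSorter


def step(name, deps=()):
    """One DAG task; every task reuses a single hypothetical template."""
    task = {"name": name, "template": "run-step"}
    if deps:
        task["dependencies"] = list(deps)
    return task


workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "etl-train-"},
    "spec": {
        "entrypoint": "pipeline",
        "templates": [
            {
                "name": "run-step",
                "retryStrategy": {"limit": 2},  # retry each failed step twice
                "container": {"image": "registry.example.com/step:latest"},
            },
            {
                "name": "pipeline",
                "dag": {"tasks": [
                    step("extract"),
                    step("features-a", ["extract"]),  # fan-out
                    step("features-b", ["extract"]),  # fan-out
                    step("train", ["features-a", "features-b"]),  # fan-in
                ]},
            },
        ],
    },
}


def execution_order(wf):
    """Return one valid run order implied by the DAG's dependencies."""
    tasks = wf["spec"]["templates"][1]["dag"]["tasks"]
    graph = {t["name"]: set(t.get("dependencies", [])) for t in tasks}
    return list(TopologicalSorter(graph).static_order())


order = execution_order(workflow)
print(order)
```

A set of CronJobs can only approximate this with staggered schedules; here `train` cannot start until both feature steps have succeeded, and each step retries on its own.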

Component	Choice
Batch / training	Jobs, GPU via device plugins, spot capacity
Inference	Deployments + HPA (the web pattern)
Pipelines	Argo Workflows / Kubeflow, not raw CronJobs
Data	High-throughput CSI volumes; mind data locality
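
To put a rough number on the data-locality row, the arithmetic below estimates how long it takes to move a dataset over a network link. The dataset size, link speed, and `efficiency` factor are illustrative assumptions, not measurements from this platform, and `transfer_hours` is a hypothetical helper.

```python
def transfer_hours(dataset_tb, link_gbit_per_s, efficiency=1.0):
    """Hours to move dataset_tb terabytes over a link of the given speed.

    Uses decimal units (1 TB = 1e12 bytes, 1 Gbit = 1e9 bits).
    efficiency is the assumed fraction of line rate actually achieved.
    """
    bits = dataset_tb * 1e12 * 8
    seconds = bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 3600


# 10 TB over a 10 Gbit/s link at full line rate: 8000 s, about 2.2 hours.
print(f"{transfer_hours(10, 10):.1f} h")

# The same transfer at an assumed 50% of line rate takes twice as long.
print(f"{transfer_hours(10, 10, efficiency=0.5):.1f} h")
```

Even under ideal conditions a job that re-reads such a dataset from a distant store pays hours per pass, which is why scheduling the compute next to the data is usually the cheaper move.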

Where Kubernetes Strains

To be honest about the seams: Kubernetes is an excellent scheduler and packager for these workloads, but it is not a data warehouse, a feature store, or a distributed training framework. The platform shines at placing GPU jobs, scaling inference, and running pipelines; it leans on managed data services and specialized frameworks for the data and ML internals.

Alternatives considered: managed data/ML platforms (EMR, Dataproc, SageMaker, Vertex AI) do much of this with less operational burden. The team chose Kubernetes for unification across teams and portability, accepting more operational work in exchange. A smaller shop should weigh the managed route seriously.

Kubernetes for data/ML vs managed platforms

Kubernetes: one platform for batch, GPU, inference, and pipelines; portable and unified, but more to operate.

Managed (EMR/Dataproc, SageMaker/Vertex AI): less operational burden and tuned for ML, at the cost of portability and unification.

Common Mistakes

Leaving expensive GPUs idle through poor bin-packing or no time-slicing/MIG.
Treating Kubernetes as a data warehouse instead of a scheduler that leans on data services.
Ignoring data locality and shuttling terabytes to compute instead of compute to data.
Hand-wiring pipelines with CronJobs instead of a workflow engine with dependencies and retries.
Running fault-tolerant training on full-price on-demand GPUs instead of spot.
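
The first mistake in the list can be illustrated with a toy packing calculation. This is a first-fit-decreasing heuristic, not how kube-scheduler or any MIG tooling actually places work, and it ignores real MIG profile and placement rules. The default of seven slices per GPU mirrors the up-to-seven MIG instances on GPUs such as the A100; the request sizes are hypothetical.

```python
def pack_first_fit_decreasing(requests, slices_per_gpu=7):
    """Return how many GPUs a first-fit-decreasing packing uses.

    requests: GPU slices each job needs (hypothetical sizes).
    """
    free = []  # free slices remaining on each GPU already in use
    for need in sorted(requests, reverse=True):
        if need > slices_per_gpu:
            raise ValueError("request exceeds one GPU")
        for i, room in enumerate(free):
            if need <= room:
                free[i] -= need  # place the job on the first GPU with room
                break
        else:
            free.append(slices_per_gpu - need)  # open a new GPU
    return len(free)


requests = [3, 2, 4, 1, 2, 2]  # six jobs, 14 slices in total

print(pack_first_fit_decreasing(requests))  # packed onto shared GPUs: 2
print(len(requests))                        # one whole GPU per job: 6
```

In this example, giving each job a whole GPU uses three times the hardware that a packed layout needs, which is the idle capacity partitioning and bin-packing are meant to recover.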

Best Practices

Schedule GPUs via device plugins, taint GPU nodes, and pack them tightly (time-slicing/MIG).
Use spot/preemptible capacity for restartable training to cut GPU cost.
Orchestrate pipelines with Argo Workflows or Kubeflow, not raw CronJobs.
Design for data locality; bring compute to the data for large datasets.
Weigh managed data/ML platforms honestly — they may beat self-running for smaller teams.

Alternatives

Managed data platforms: EMR / Dataproc as the without-Kubernetes batch option
Managed ML platforms: SageMaker / Vertex AI for training/serving
Argo Workflows / Kubeflow: the pipeline layer on Kubernetes

Knowledge Check

How are GPUs made schedulable in Kubernetes?

Device plugins advertise them as resources Pods request like CPU/memory
They are mounted as PersistentVolumes backed by a CSI driver claim
The scheduler detects each node's GPUs automatically with no plugin installed
A NetworkPolicy grants Pods access to the GPU hardware

What is the honest limit of Kubernetes for a data/ML platform?

A great scheduler/packager but not a warehouse or training framework — it leans on specialized systems
It cannot run GPU workloads at all, even with a device plugin advertising the accelerator cards to the scheduler
It cannot run batch Jobs to completion for the long-running distributed training runs this platform needs
It replaces the need for any external data store, warehouse, or object storage by holding all of that state itself

Why bring compute to the data for large datasets?

Data gravity — moving terabytes to compute is slow and expensive, so locality matters
Compute simply cannot read storage that lives on a remote node
Kubernetes forbids mounting remote or networked volumes in a Pod
It improves GPU scheduling and accelerator placement on tainted nodes specifically
