Data and ML Platform
The second case is a data and machine-learning platform: batch processing, GPU training jobs, large datasets, and pipeline orchestration. It pushes Kubernetes in directions the web-tier pattern never does — expensive specialized hardware, heavy state, and workflows rather than long-running services.
It is also a case where Kubernetes both fits well and strains, and being honest about the seams matters more than forcing everything onto the platform.
The Workload Shape
This platform is mostly Jobs, not services: batch ETL, model training runs, and scheduled pipelines, plus some long-running inference services that do look like the web pattern. Much of the work runs on GPUs, which are costly and scarce and therefore need to be packed tightly. Datasets are large, so data locality and throughput matter. Because run-to-completion batch sits alongside steady inference, two scheduling patterns coexist in one cluster.
GPU Scheduling
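GPUs are exposed to Kubernetes through device plugins that advertise them as schedulable extended resources; a Pod asks for a GPU in its resource limits much as it asks for CPU or memory, except that GPUs are requested in whole-number units and cannot be overcommitted. The hard part is utilization: GPUs are expensive and easily left idle, so careful bin-packing and sharing techniques such as time-slicing or MIG (partitioning one physical GPU into isolated instances) matter. GPU nodes are tainted so that only Pods tolerating the taint, meaning the GPU workloads, land on them, and spot/preemptible GPU capacity cuts cost for restartable training.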
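As a rough illustration of the request-plus-toleration pattern described above, the sketch below builds a `batch/v1` Job manifest as a Python dictionary. The resource name `nvidia.com/gpu` is the one the NVIDIA device plugin advertises; other vendors use different names. The taint key, the function name `gpu_training_job`, the Job name, and the image are assumptions made for the example, not details from this platform.

```python
import json


def gpu_training_job(name: str, image: str, gpus: int = 1) -> dict:
    """Build a batch/v1 Job manifest that asks for whole GPUs.

    Assumes GPU nodes carry a NoSchedule taint with key "nvidia.com/gpu";
    substitute whatever taint key your cluster actually uses.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 3,  # retry a failed Pod a few times before giving up
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    # Tolerate the (assumed) GPU-node taint so the Pod may land there.
                    "tolerations": [
                        {
                            "key": "nvidia.com/gpu",
                            "operator": "Exists",
                            "effect": "NoSchedule",
                        }
                    ],
                    "containers": [
                        {
                            "name": "train",
                            "image": image,
                            # Extended resources go in limits; the request defaults to the limit.
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical name and image, printed as JSON (which kubectl also accepts).
    manifest = gpu_training_job("train-demo", "registry.example.com/train:latest")
    print(json.dumps(manifest, indent=2))
```

The toleration only permits placement on tainted nodes. It is the GPU limit that forces the Pod onto a node advertising the resource.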
Pipelines and Storage
Orchestration uses workflow engines built on Kubernetes — Argo Workflows for DAGs of Jobs, or ML-specific platforms like Kubeflow — rather than hand-wired CronJobs, because real pipelines have dependencies, retries, and fan-out. Storage is the other strain: large datasets need high-throughput volumes (CSI) and, critically, attention to data locality — moving terabytes to compute is slow and expensive, so compute often goes to the data, not the reverse.
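To make the "dependencies, retries, and fan-out" point concrete, here is a toy runner in plain Python. It is not how Argo Workflows or Kubeflow are implemented; it only sketches, under simplifying assumptions (sequential execution, in-process callables, hypothetical task names), the bookkeeping a workflow engine takes off your hands and that independent CronJobs do not provide.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, deps, max_retries=2):
    """Run callables in dependency order, retrying each failed task.

    tasks: task name -> zero-argument callable
    deps:  task name -> set of task names that must finish first
    Returns the task names in the order they completed.
    """
    completed = []
    # static_order() yields every task only after all of its dependencies.
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == max_retries:
                    raise  # retries exhausted: fail the whole pipeline
        completed.append(name)
    return completed


# Hypothetical pipeline: one extract step fans out to two feature steps,
# which both feed a training step that fails once before succeeding.
attempts = {"train": 0}


def flaky_train():
    attempts["train"] += 1
    if attempts["train"] < 2:
        raise RuntimeError("simulated preemption")


tasks = {
    "extract": lambda: None,
    "features_a": lambda: None,
    "features_b": lambda: None,
    "train": flaky_train,
}
deps = {
    "extract": set(),
    "features_a": {"extract"},
    "features_b": {"extract"},
    "train": {"features_a", "features_b"},
}

order = run_pipeline(tasks, deps)
print(order[0], order[-1], attempts["train"])
```

A real engine additionally runs the fan-out steps in parallel as separate Pods and persists state between steps, which is where the hand-wired approach usually breaks down.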
| Component | Choice |
|---|---|
| Batch / training | Jobs, GPU via device plugins, spot capacity |
| Inference | Deployments + HPA (the web pattern) |
| Pipelines | Argo Workflows / Kubeflow, not raw CronJobs |
| Data | High-throughput CSI volumes; mind data locality |
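To put a rough number on the data-locality concern, the calculation below estimates how long a bulk copy takes. The dataset sizes and link speed are illustrative assumptions, not measurements from this platform, and the helper `transfer_hours` is hypothetical.

```python
def transfer_hours(dataset_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move dataset_tb terabytes over a link_gbps link.

    Uses decimal units (1 TB = 1e12 bytes, 1 Gbit/s = 1e9 bits/s).
    efficiency is the fraction of the nominal link rate actually achieved.
    """
    bits = dataset_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # 10 TB over an ideal 10 Gbit/s link: 8000 s, a little over 2.2 hours.
    print(round(transfer_hours(10, 10), 2))
    # The same link at 50% efficiency with a 100 TB dataset: about 44 hours.
    print(round(transfer_hours(100, 10, efficiency=0.5), 1))
```

Even under ideal assumptions the copy takes hours per run, which is why scheduling compute next to the data is usually the cheaper move.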
Where Kubernetes Strains
Honesty about the seams: Kubernetes is an excellent scheduler and packager for these workloads, but it is not a data warehouse, a feature store, or a distributed training framework. The platform shines at placing GPU jobs, scaling inference, and running pipelines; it leans on managed data services and specialized frameworks for the data and ML internals.
Alternatives considered: managed data/ML platforms (EMR, Dataproc, SageMaker, Vertex AI) do much of this with less operational burden. The team chose Kubernetes for unification across teams and portability, accepting more operational work in exchange. A smaller shop should weigh the managed route seriously.
Kubernetes: one platform for batch, GPU, inference, and pipelines; portable and unified, at the cost of more to operate.
Managed (EMR/Dataproc, SageMaker/Vertex): less operational burden and ML-tuned, at the cost of portability and unification.
Common Pitfalls
- Leaving expensive GPUs idle through poor bin-packing or no time-slicing/MIG.
- Treating Kubernetes as a data warehouse instead of a scheduler that leans on data services.
- Ignoring data locality and shuttling terabytes to compute instead of compute to data.
- Hand-wiring pipelines with CronJobs instead of a workflow engine with dependencies and retries.
- Running fault-tolerant training on full-price on-demand GPUs instead of spot.
Key Takeaways
- Schedule GPUs via device plugins, taint GPU nodes, and pack them tightly (time-slicing/MIG).
- Use spot/preemptible capacity for restartable training to cut GPU cost.
- Orchestrate pipelines with Argo Workflows or Kubeflow, not raw CronJobs.
- Design for data locality; bring compute to the data for large datasets.
- Weigh managed data/ML platforms honestly; they may beat self-running for smaller teams.
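The spot-capacity advice only holds if training really is restartable. One common way to get there is periodic checkpointing, sketched below with a toy loop. Everything here is illustrative: the `train` function, the JSON checkpoint format, and the `stop_after` knob that simulates a preemption are assumptions, and a real job would save model and optimizer state to durable storage rather than a step counter to a local file.

```python
import json
import os
import tempfile
from typing import Optional


def train(total_steps: int, ckpt_path: str, stop_after: Optional[int] = None) -> int:
    """Toy training loop that resumes from ckpt_path if it exists.

    stop_after simulates a spot preemption after that many steps in this run.
    Returns the last completed step.
    """
    step = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            step = json.load(f)["step"]  # pick up where the last run stopped

    done_this_run = 0
    while step < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            break  # simulated preemption
        step += 1  # stand-in for one real optimisation step
        done_this_run += 1
        # Write to a temp file and rename, so a kill mid-write cannot
        # leave a corrupt checkpoint behind.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"step": step}, f)
        os.replace(tmp_path, ckpt_path)
    return step


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    print(train(10, ckpt, stop_after=4))  # first run is cut short at step 4
    print(train(10, ckpt))                # restarted run resumes and reaches 10
```

With this shape, a Job whose Pod is evicted from a spot node can simply be rescheduled, and it loses at most the work since the last checkpoint.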
Knowledge Check
How are GPUs made schedulable in Kubernetes?
- Device plugins advertise them as resources Pods request like CPU/memory
- They are mounted as PersistentVolumes backed by a CSI driver claim
- The scheduler detects each node's GPUs automatically with no plugin installed
- A NetworkPolicy grants Pods access to the GPU hardware
What is the honest limit of Kubernetes for a data/ML platform?
- A great scheduler/packager but not a warehouse or training framework — it leans on specialized systems
- It cannot run GPU workloads at all, even with a device plugin advertising the accelerator cards to the scheduler
- It cannot run batch Jobs to completion for the long-running distributed training runs this platform needs
- It replaces the need for any external data store, warehouse, or object storage by holding all of that state itself
Why bring compute to the data for large datasets?
- Data gravity — moving terabytes to compute is slow and expensive, so locality matters
- Compute simply cannot read storage that lives on a remote node
- Kubernetes forbids mounting remote or networked volumes in a Pod
- It improves GPU scheduling and accelerator placement on tainted nodes specifically