· AI Infrastructure · 12 min read
The Kubernetes for AI Paradigm
Native K8s orchestration is evolving to handle GPU scheduling, checkpointing, and live migration at the scale that AI demands.

- Kubernetes was not built for GPU workloads: The default scheduler has no concept of GPU topology, NVLink bandwidth, or NUMA affinity, and naively placing GPUs from different racks into the same job destroys throughput.
- Native K8s is catching up fast: GPU topology-aware scheduling, MIG (Multi-Instance GPU) management, checkpoint and resume support, and live migration are no longer vendor extensions. They are becoming first-class K8s primitives.
- The orchestration layer matters more than you think: The difference between a well-orchestrated AI cluster and a poorly orchestrated one is not hardware. It is the cluster scheduler and the checkpointing layer, and both are now built on top of Kubernetes itself.
I often see my teams help other teams debug a problem that turns out to be a scheduling issue.
In one case, the team had a cluster of 384 H200 GPUs, the most expensive decision anyone in the organization had ever had to defend. They had a training job that could have been efficient. Instead, it wasted roughly 18 percent of its compute cycles sitting idle, waiting for data that was slow to arrive because the scheduler had placed its ranks in separate racks with no fast interconnect between them.
The hardware was perfect. The model was correct. The data pipeline was optimized. The scheduler had done something dumb.
This is why Kubernetes, the platform everyone assumed was a solved problem, is suddenly the most important piece of infrastructure in the AI era.
Why the Default Scheduler Breaks AI
Kubernetes was designed for web services. A web service is a small container that serves HTTP requests. It needs CPU, a bit of memory, and maybe some disk I/O. You can place a web service on any node, and it will work. The scheduler’s default algorithm — place the next pending pod on the node that is least loaded — works perfectly for this.
GPU workloads are not web services. A single distributed training job is not a single unit of work. It is, say, sixty-four separate processes that have to stay in lockstep with one another. If two of those processes are placed on GPUs that are not connected by NVLink, or on nodes in racks that can only reach each other through the spine switches, the collective communication algorithms slow down dramatically.
The NCCL library that most deep learning frameworks use for distributed communication makes assumptions about the physical topology of the GPUs in your cluster. Those assumptions go out the window when the scheduler places your ranks across racks.
The result is a training run that finishes 20 percent slower than it should, and nobody knows why because the metrics all look green.
GPU Topology-Aware Scheduling
The fix is called GPU topology-aware scheduling, and it is the first thing that any serious AI cluster needs to implement on top of base Kubernetes.
The idea is straightforward. Before the scheduler places a pod, it queries the GPU topology of each node. It checks which GPUs are connected to the same NVLink domain. It checks whether the node’s network interface is on the same NUMA node as the GPU. It checks the PCIe switch that connects the GPUs to the CPU.
When scheduling a multi-GPU job, the scheduler places its pods on nodes that share a fast interconnect, rather than scattering them across the cluster.
This is not a new idea. The NVIDIA device plugin for Kubernetes has supported topology-aware GPU allocation for a while. NVIDIA’s GPU Feature Discovery component detects the GPU hardware on each node and exposes it as node labels. The scheduler reads those labels and makes placement decisions accordingly.
But here is the thing that surprises people. Just installing these tools does not fix the problem. You have to understand your cluster topology and configure the scheduling predicates correctly, and that is where most teams stumble.
A typical datacenter has layers of topology. Within a node, you have GPUs connected via NVLink (the fastest interconnect, roughly 900 GB/s bidirectional between any two GPUs in an 8-GPU node). Between nodes in the same rack, you have PCIe switching or direct rack-level Ethernet (typically 100 to 400 Gb/s per link, which is 12.5 to 50 GB/s). Between racks, you have spine switches with high-bandwidth but higher-latency interconnects (InfiniBand or RoCE).
The scheduler needs to know about all three layers. And it needs to make placement decisions that respect those layers.
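A minimal sketch of a placement decision that respects those layers, assuming each node carries a rack label and a count of free GPUs. The names `Node` and `pick_nodes`, the one-pod-per-node layout, and the fewest-racks-first policy are all illustrative assumptions, not the API of any real scheduler plugin.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    rack: str        # hypothetical rack identifier, e.g. taken from a node label
    free_gpus: int   # whole GPUs currently unallocated on the node


def pick_nodes(nodes, gpus_needed, gpus_per_pod=8):
    """Choose one node per pod, spreading the job over as few racks as possible.

    Each pod takes gpus_per_pod GPUs on a single node, so traffic inside a
    pod stays on NVLink. Returns a list of node names, or None if the job
    cannot be placed.
    """
    pods_needed = -(-gpus_needed // gpus_per_pod)  # ceiling division
    by_rack = defaultdict(list)
    for node in nodes:
        if node.free_gpus >= gpus_per_pod:
            by_rack[node.rack].append(node)

    # Visit racks with the most eligible nodes first to minimize rack count.
    racks = sorted(by_rack.values(), key=len, reverse=True)
    chosen = []
    for rack_nodes in racks:
        for node in sorted(rack_nodes, key=lambda n: n.name):
            if len(chosen) == pods_needed:
                break
            chosen.append(node.name)
    return chosen if len(chosen) == pods_needed else None
```

With two free 8-GPU nodes in one rack and a single free node in another, a 16-GPU job lands entirely in the first rack instead of straddling both.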
The difference is critical for understanding distributed training performance. When a topology-aware scheduler places a 64-GPU training job, all the GPUs in each node stay connected via NVLink, the fastest available interconnect, and the nodes themselves sit as close together in the network as possible. The default scheduler might place rank 0 in Rack A and rank 1 in Rack C, forcing NCCL’s ring all-reduce algorithm to traverse spine switches on every single communication step. That can add hundreds of microseconds of latency to every collective operation.
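To see why one slow hop hurts so much, here is a rough estimate using the standard latency-plus-bandwidth model of ring all-reduce. The function name and every number in the usage lines are illustrative assumptions, not measurements from a real cluster.

```python
def ring_allreduce_seconds(num_ranks, message_bytes, link_bytes_per_s, link_latency_s):
    """Latency-plus-bandwidth estimate for one ring all-reduce.

    A ring all-reduce runs 2 * (num_ranks - 1) steps (reduce-scatter, then
    all-gather). Each step moves message_bytes / num_ranks over every link,
    so the slowest link in the ring sets the pace for the whole collective.
    """
    steps = 2 * (num_ranks - 1)
    chunk = message_bytes / num_ranks
    return steps * (link_latency_s + chunk / link_bytes_per_s)


# Assumed figures: 64 ranks, a 1 GB gradient buffer, and two candidate
# "slowest links" (a well-placed ring versus one that crosses racks).
well_placed = ring_allreduce_seconds(64, 1e9, 50e9, 5e-6)
cross_rack = ring_allreduce_seconds(64, 1e9, 12.5e9, 50e-6)
print(f"well placed: {well_placed * 1e3:.1f} ms, cross rack: {cross_rack * 1e3:.1f} ms")
```

Because the per-step latency is multiplied by 2 * (num_ranks - 1), a penalty that looks negligible on a single hop is paid over a hundred times per collective at 64 ranks.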
MIG: Slicing the GPU When You Cannot Buy More
Not every workload needs a full H200. Many inference workloads run comfortably on a fraction of a GPU’s compute and memory. Multi-Instance GPU (MIG) is NVIDIA’s technology for slicing a single GPU into smaller, isolated instances that can be scheduled independently.
A single H200 can be partitioned into up to seven MIG instances, each with its own dedicated memory, compute, and cache. This means you can fit up to seven times as many small inference workloads on the same physical GPU.
But MIG is a scheduling problem as much as it is a hardware problem. The Kubernetes scheduler has to know that a node offering MIG slices can fulfill a request for “0.14 of an H200” and not just the all-or-nothing single-GPU granularity that the device plugin originally exposed.
NVIDIA’s MIG Manager and device plugin solve this by exposing each slice as a separate allocatable resource. A pod requesting one MIG slice gets scheduled onto a node that has a free slice. Depending on the configured MIG strategy, the request either still looks like nvidia.com/gpu: "1", where each advertised unit is a slice rather than a full GPU, or it names the slice profile explicitly with a resource of the form nvidia.com/mig-<profile>.
This is critical for inference workloads. When you have a mix of small, medium, and large models running simultaneously on the same cluster, MIG lets you pack them efficiently. A small classification model gets 0.14 GPU. A medium one gets 0.5 GPU. A large one gets a full GPU. No waste.
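The packing idea can be sketched as a first-fit-decreasing bin pack over compute slices. This is a simplification under stated assumptions: seven slices per GPU, requests expressed as whole slice counts, and none of the placement rules that real MIG profiles impose. The function name `pack_mig_requests` is hypothetical.

```python
def pack_mig_requests(requests, slices_per_gpu=7):
    """First-fit-decreasing packing of slice requests onto GPUs.

    requests maps a workload name to the number of compute slices it needs.
    Returns a list of GPUs, each a dict of workload name -> slices.
    Simplification: real MIG profiles have placement rules this ignores.
    """
    gpus = []
    # Largest requests first; ties broken by name for a deterministic result.
    for name, need in sorted(requests.items(), key=lambda kv: (-kv[1], kv[0])):
        if need > slices_per_gpu:
            raise ValueError(f"{name} needs more than one GPU")
        for gpu in gpus:
            if sum(gpu.values()) + need <= slices_per_gpu:
                gpu[name] = need
                break
        else:
            gpus.append({name: need})  # no existing GPU had room
    return gpus


layout = pack_mig_requests(
    {"large": 7, "medium": 3, "small-a": 1, "small-b": 1, "small-c": 1}
)
```

Here the large model takes a whole GPU and the medium model shares a second GPU with the three small ones, so five workloads fit on two GPUs instead of five.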
Checkpoint and Resume: The GPU That Doesn’t Forget
Training a large model takes weeks. The chance of a GPU failing, a node losing power, or a network cable being unplugged during that window is not a question of if. It is a question of when.
The standard Kubernetes restart policy is to kill the pod and start it fresh. That is fine for a web service that does not have state. It is catastrophic for a model that has trained for twelve days.
Checkpoint and resume changes this. The training job periodically saves the full optimizer state, gradients, model weights, and the random number generator state to persistent storage. If the job gets killed, it loads the checkpoint and continues training from where it left off.
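A minimal sketch of the save and restore halves, using only the Python standard library. Real training jobs would use their framework's own serialization; the function names, the dictionary layout, and the use of `pickle` are assumptions made for illustration. The one detail worth copying is the atomic rename, which keeps a crash during the save from corrupting the last good checkpoint.

```python
import os
import pickle
import random
import tempfile


def save_checkpoint(path, step, weights, optimizer_state):
    """Write a checkpoint atomically so a crash never leaves a partial file."""
    state = {
        "step": step,
        "weights": weights,
        "optimizer_state": optimizer_state,
        "rng_state": random.getstate(),
    }
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state and restore the RNG, or None if nothing is saved."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        state = pickle.load(f)
    random.setstate(state["rng_state"])
    return state
```

A training loop would call `load_checkpoint` once at startup, resume from `state["step"]` if it returns something, and call `save_checkpoint` every N steps.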
The problem is that Kubernetes, in its default configuration, has no concept of checkpointing. It does not know how to trigger a checkpoint. It does not know when to save it. It does not know how to restore it.
This is where the orchestration layer becomes part of your infrastructure stack. You need a Kubernetes operator that watches your training pods, periodically snapshots the state to distributed storage, and when the pod is recreated, loads the last good checkpoint before resuming.
The Kubernetes-native checkpoint solution typically involves a sidecar container that runs alongside the training process. The sidecar intercepts the checkpoint signals, coordinates the save operation with the running process (so the model training can pause for a few seconds while the state is serialized and written), and then signals the training process to resume.
When the pod is rescheduled, whether because the node failed or because the pod was evicted to free up resources, the sidecar intercepts the startup, finds the most recent checkpoint on the persistent volume, and loads it before the training process begins its first forward pass.
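The "find the most recent checkpoint" step can be sketched as below, assuming checkpoints are written with a hypothetical `step-<number>.ckpt` naming scheme. Both the naming scheme and the function name are illustrative assumptions.

```python
import re

# Hypothetical naming scheme: step-<number>.ckpt
_CKPT_NAME = re.compile(r"^step-(\d+)\.ckpt$")


def latest_checkpoint_name(file_names):
    """Return the checkpoint file name with the highest step, or None.

    Names that do not match the scheme exactly (temporary files, partial
    writes with another suffix, unrelated files) are ignored.
    """
    best_step, best_name = -1, None
    for name in file_names:
        match = _CKPT_NAME.match(name)
        if match and int(match.group(1)) > best_step:
            best_step, best_name = int(match.group(1)), name
    return best_name
```

A sidecar would pass in `os.listdir()` of the checkpoint mount. Comparing the parsed integer rather than the raw string matters: sorted as text, `step-9.ckpt` would come after `step-12.ckpt`.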
Live Migration of GPU Workloads
This is the edge case, but it is becoming mainstream.
Live migration moves a running workload from one node to another with minimal interruption. For CPU-only virtual machines, this has been standard for years, and checkpoint/restore tools such as CRIU bring the same idea to containers. The workload’s memory state gets copied over the network to the target node, the process continues running on the new node, and the DNS or service mesh routing switches transparently.
For GPU workloads, live migration has been impossible because a GPU is not a piece of RAM that can be serialized. The GPU state includes the contents of VRAM, the status of the streaming multiprocessors, the configuration of the interconnect engines, and dozens of other hardware registers.
Well, it is not impossible anymore.
The new generation of GPU live migration uses a technique called GPU memory pre-copy. The GPU workload is paused briefly. The entire VRAM contents are copied to a buffer (this is fast because it is intra-node memory-to-memory). The buffer is then streamed over the network to the target node. The target node loads the buffer into its GPU VRAM. The workload resumes.
The pause duration ranges from milliseconds to a few seconds, depending on the GPU memory size. An H200 with 141 GB of VRAM needs about 1.4 seconds of raw transfer time over a 100 GB/s interconnect, so roughly two to three seconds once copy and restore overhead are included.
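The arithmetic behind that estimate is simple enough to write down. This is a back-of-the-envelope helper, not a model of any real migration tool; the `overhead_s` parameter is an assumed catch-all for staging, serialization, and restore time.

```python
def migration_pause_seconds(vram_gb, link_gb_per_s, overhead_s=0.0):
    """Lower bound on pause time: bytes to move divided by link bandwidth,
    plus an assumed fixed overhead."""
    return vram_gb / link_gb_per_s + overhead_s


raw = migration_pause_seconds(141, 100)             # wire time only
with_overhead = migration_pause_seconds(141, 100, overhead_s=1.0)  # assumed 1 s overhead
print(f"raw: {raw:.2f} s, with overhead: {with_overhead:.2f} s")
```

The raw transfer alone comes to 141 / 100 = 1.41 seconds, so any figure above that reflects overhead beyond the wire time.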
Live migration is not yet part of standard Kubernetes. It requires a GPU driver and a migration operator that both the source and target nodes support. But it is being built into the stack. AMD, NVIDIA, and Intel are all shipping GPU migration support in their latest drivers, and multiple Kubernetes operators now provide the orchestration layer.
Why does this matter? Because it means you can drain a node for maintenance without stopping a training or inference job. You can rebalance the cluster without killing active workloads. You can perform rolling upgrades on GPU nodes while the models keep running.
The Operator Ecosystem
This is where the Kubernetes story for AI gets interesting.
You cannot build and maintain this orchestration layer yourself. The problem space is too large, the failure modes too numerous, and the hardware too fragmented. So the community has built operators that handle the complexity.
Kueue is the Kubernetes-native job queue manager. It controls admission of workloads based on resource availability, cluster policies, and priority. Instead of every training job competing for GPUs directly, they submit jobs to Kueue, which holds them in a queue until the necessary GPU resources are available. This prevents resource contention and ensures that priority workloads get what they need.
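The admission idea can be illustrated with a small priority queue. This is a toy model of quota-based admission, not Kueue's API or its actual algorithm; the class and method names are hypothetical.

```python
import heapq
import itertools


class AdmissionQueue:
    """Hold jobs until enough GPUs are free, highest priority first."""

    def __init__(self, total_gpus):
        self.free_gpus = total_gpus
        self._heap = []
        self._order = itertools.count()  # FIFO tie-break within a priority

    def submit(self, name, gpus, priority=0):
        # heapq is a min-heap, so negate priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._order), name, gpus))

    def admit(self):
        """Admit queued jobs in priority order while the head job fits."""
        admitted = []
        while self._heap and self._heap[0][3] <= self.free_gpus:
            _, _, name, gpus = heapq.heappop(self._heap)
            self.free_gpus -= gpus
            admitted.append(name)
        return admitted

    def release(self, gpus):
        """Return GPUs to the pool when a job finishes."""
        self.free_gpus += gpus
```

This sketch is strictly head-of-line: a small low-priority job never jumps ahead of a large high-priority job that is still waiting. That avoids starving big training runs at the cost of leaving some GPUs idle.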
Volcano is a batch scheduling operator that understands the concept of a job as a collection of pods that must be scheduled together. For multi-node, multi-GPU training jobs, this is essential. You cannot schedule pod A on node 1 and pod B on node 2 and hope NCCL will figure it out. Volcano schedules all pods of a distributed job atomically, either all at once or none at all.
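The all-or-nothing behavior, often called gang scheduling, can be sketched as a trial placement that is committed only if every pod fits. This is an illustrative first-fit model, not Volcano's implementation; the function name and data shapes are assumptions.

```python
def gang_schedule(pod_gpu_requests, free_gpus_by_node):
    """Place every pod of a job or none of them.

    Returns {pod_index: node_name} and updates free_gpus_by_node in place,
    or returns None and leaves it untouched if any pod does not fit.
    """
    trial = dict(free_gpus_by_node)  # work on a copy so failure has no side effects
    placement = {}
    for index, need in enumerate(pod_gpu_requests):
        # First fit over nodes in name order; a real scheduler would score nodes.
        node = next((n for n in sorted(trial) if trial[n] >= need), None)
        if node is None:
            return None
        trial[node] -= need
        placement[index] = node
    free_gpus_by_node.update(trial)  # commit only after every pod fit
    return placement
```

The important property is what happens on failure: no GPUs are held by a half-placed job, so two large jobs cannot deadlock by each grabbing part of the cluster and waiting for the rest.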
Kubeflow and Ray on Kubernetes provide higher-level abstractions. Kubeflow wraps the ML workflow lifecycle (data prep, training, hyperparameter tuning, deployment). Ray on K8s provides its own distributed execution engine that speaks Kubernetes for resource management. Both are built on top of K8s and extend its primitives for AI workloads.
What the Operator Story Means
The Kubernetes for AI paradigm means one thing: Kubernetes is no longer just a container orchestrator. It has become the operating system for AI infrastructure, and the operators are its system services, in the way that daemons are the system services of a conventional operating system.
When you deploy AI workloads on Kubernetes in 2026, you are not just deploying containers. You are deploying a platform that schedules jobs with GPU topology awareness, manages MIG slices, checkpoints training state, migrates workloads live, queues batches by priority, and integrates with the broader ML lifecycle.
This is powerful. It means you can manage your inference and training workloads with the same tooling you use for microservices. One platform. One scheduling model. One observability stack.
But it also means you need to understand how all these pieces fit together. A GPU topology misconfiguration will destroy your distributed training performance faster than any hardware limitation. A checkpointing gap will turn a transient node failure into a week of wasted compute. A poor MIG configuration will leave you paying for full GPUs to run workloads that fit in slices.
None of this is solved by buying better hardware. It is solved by getting the orchestration layer right.



