· Rajat Pandit · AI Infrastructure · 6 min read
KV Cache Offloading in K8s: The Stateless Truce
Your beloved stateless Kubernetes architecture is fundamentally at war with the massive, stateful memory requirements of long-context LLM inference. We need a truce.

There is a quiet, brutal war happening in modern cloud infrastructure.
In one corner, we have Kubernetes. The absolute champion of the stateless era. K8s demands that applications be ephemeral. It expects to mercilessly slaughter pods, spin up replicas, and route traffic dynamically across instances without skipping a beat. It operates on the core assumption that your application holds no state. If a user hits Node A on their first request, they should be perfectly happy hitting Node B on their second.
In the other corner, we have Large Language Models interacting with massive, million-token context windows. These are the heaviest, most intensely stateful workloads we have ever put into a modern datacenter.
These two paradigms hate each other. And if you attempt to shove a long-context LLM workload into a standard K8s autoscaling group without fundamentally altering how you handle memory, the system will utterly rip itself apart.
Let’s talk about the KV Cache, and why navigating memory bottlenecks across stateless nodes is the hardest architectural challenge in AI infrastructure today.
The Physics of the KV Cache
We have to understand what the KV Cache actually is. When an LLM generates text, it auto-regressively predicts the next word based on every single word that came before it.
If I upload a 150-page PDF to Gemini 2.5 Pro and ask it a question about page 3, the model reads the entire document. During that process, it calculates intermediate mathematical representations—Keys and Values—for every single token in that PDF.
If I follow up ten seconds later and ask a second question about the same PDF, the model shouldn’t have to re-read the entire document from scratch. The initial processing was incredibly expensive. Instead, we want to skip the “reading” phase and jump straight to generating the answer.
We do this by storing those intermediate mathematical representations. This is the KV Cache. It is the model’s short-term memory.
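To make the mechanics concrete, here is a toy sketch (not a real inference engine; the shapes and string placeholders are purely illustrative) of how a serving loop splits work into an expensive one-time prefill and cheap per-token decode steps, with both phases appending to the same cache:

```python
# Toy illustration: each processed token contributes one Key and one Value
# (per layer, in a real engine). The prompt only needs to be processed once,
# because its K/V entries persist in the cache across decode steps.

class ToyKVCache:
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

def prefill(cache, prompt_tokens):
    # Expensive phase: every prompt token is processed exactly once.
    for tok in prompt_tokens:
        cache.append(f"K({tok})", f"V({tok})")

def decode_step(cache, new_token):
    # Cheap phase: attend over the existing cache, append a single entry.
    cache.append(f"K({new_token})", f"V({new_token})")
    return len(cache)

cache = ToyKVCache()
prefill(cache, ["the", "150", "page", "pdf"])  # done once per document
decode_step(cache, "answer")                   # follow-ups reuse the cache
```

A follow-up question only pays for its own new tokens; everything the model already "read" is sitting in the cache.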
But this memory is alarmingly heavy. As context windows scale from 32k to 1M to 2M tokens, the KV cache grows linearly with them. A massive context window doesn’t just ask the GPU to “think harder”; it demands a staggering amount of physical High Bandwidth Memory (HBM). Storing the KV cache for a single user interacting with a massive document can quickly consume dozens of gigabytes of VRAM.
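You can sanity-check that claim with back-of-envelope arithmetic. The cache stores one Key and one Value vector per token, per layer. The numbers below are an assumed 70B-class configuration with grouped-query attention (80 layers, 8 KV heads, head dimension 128, fp16), not any specific model's published specs:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # 2x for Keys and Values; fp16/bf16 elements are 2 bytes each.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

# Illustrative 70B-class config (assumed numbers, not a vendor spec):
gb = kv_cache_bytes(seq_len=100_000, n_layers=80, n_kv_heads=8, head_dim=128) / 1e9
print(f"{gb:.1f} GB")  # roughly 32.8 GB for a single 100k-token session
```

Roughly 320 KB per token — so a single 100k-token session eats ~33 GB of HBM, and that is with grouped-query attention already shrinking the cache.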
The Routing Nightmare
Now, drop this reality into a Kubernetes cluster.
User 1 uploads their massive PDF. The ingress controller routes the request to Pod A on an NVIDIA A100 GPU node on Google Cloud. Pod A spends real wall-clock time computing the KV cache, consuming 30GB of VRAM in the process, and generates the response.
A minute later, User 1 sends a follow-up question.
Because K8s is inherently stateless and traffic routing is probabilistic (perhaps relying on a standard round-robin ingress), User 1’s second request gets routed to Pod B.
Pod B doesn’t have the KV cache. Pod B has no idea what the PDF was. To answer the follow-up, Pod B must re-read the entire massive document from scratch. You just burned expensive GPU cycles. You crippled the user’s tail latency. And you defeated the entire purpose of the cache.
The obvious, naive solution is to use “Sticky Sessions.” You configure your ingress controller to ensure User 1 always routes to Pod A.
But sticky sessions break Kubernetes scaling. What happens when Pod A hits 100% memory utilization? It can’t accept new requests, but User 1 is hard-pinned to it. What happens when Pod A’s node gets preempted, or the pod is evicted during a rollout? The cache vanishes. Sticky routing forces your elastic fleet into a brittle, fractured mess.
The Architecture of Offloading
We have to find a truce. We need the system to scale fluidly like stateless microservices, but we must protect the heavy state of the KV cache.
The mechanism to achieve this is KV Cache Offloading.
Instead of isolating the KV cache inside the VRAM of one specific GPU, we treat the cache as a portable, serializable asset.
Here’s the architecture:
- The Generation Phase: User 1’s initial request hits Pod A. Pod A computes the KV cache and generates the answer.
- The Offload Phase: As soon as generation is complete, Pod A does not hold the cache hostage in its precious HBM. It immediately serializes the KV tensors and streams them off the GPU: over PCIe down to the local host’s CPU memory (RAM), to a fast local NVMe storage tier, or, if the architecture is sophisticated enough, to a centralized, high-throughput memory store (such as a dedicated Redis cluster) living within the same rack.
- The Resumption Phase: User 1 sends a follow-up. Based on load, the ingress routes it to Pod F, on the other side of the instance group entirely.
- The Prefetch Phase: Before Pod F begins generating a response, its serving engine intercepts the request, notes the session ID, reaches into the centralized fast-storage tier, and yanks the serialized KV cache across the networking fabric straight into its own GPU VRAM.
Pod F instantly knows exactly what Pod A knew.
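The offload and prefetch phases can be sketched in a few lines. This is a minimal illustration, not a production path: a plain dict stands in for the rack-local Redis/memory-store tier, and the session IDs and tensor payloads are made up. A real deployment would use an actual client (e.g. redis-py) and a tensor-aware serializer rather than `pickle`:

```python
import pickle

# Stand-in for the centralized fast-storage tier (a dict instead of Redis,
# purely for illustration).
cache_store = {}

def offload(session_id, kv_tensors):
    # Offload phase: serialize the KV tensors and push them off the GPU
    # into the shared tier, keyed by session.
    cache_store[session_id] = pickle.dumps(kv_tensors)

def prefetch(session_id):
    # Prefetch phase: any pod can pull the cache back before decoding.
    blob = cache_store.get(session_id)
    return pickle.loads(blob) if blob is not None else None

# Pod A finishes generation and offloads its state:
offload("user-1", {"keys": [0.1, 0.2], "values": [0.3, 0.4]})

# Pod F receives the follow-up and resumes with Pod A's state:
kv = prefetch("user-1")
assert kv is not None  # Pod F now "knows" what Pod A knew
```

The design choice that matters is the key: as long as the cache is addressed by session (or, better, by a hash of the token prefix), *which* pod serves the follow-up stops mattering.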
The Latency Math
Moving gigabytes of raw tensor data across a network sounds slow. Why would we introduce this massive I/O bottleneck?
Because shipping the bytes over the network is faster than recomputing them through matrix multiplication.
If Pod F has to recompute a 100k token context window, it might take 4 seconds of pure compute time. But if Pod F pulls a 4GB serialized cache across a high-speed Google Cloud VPC (which easily supports 50-100 Gbps), the transfer might take 500 milliseconds.
You trade network I/O for compute cycles. In modern AI infrastructure, compute cycles are the most expensive resource on the planet. I/O is cheap.
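The arithmetic behind that trade is simple; the 64 Gbps figure below is an assumed effective link rate within the 50–100 Gbps range mentioned above, ignoring protocol overhead:

```python
def transfer_seconds(size_gb, link_gbps):
    # Bits to move divided by link rate (protocol overhead ignored).
    return (size_gb * 8) / link_gbps

t = transfer_seconds(size_gb=4, link_gbps=64)
print(f"{t * 1000:.0f} ms")  # 500 ms at 64 Gbps, vs ~4 s of recompute
```

An 8x latency win, and the GPU spends those reclaimed seconds serving someone else.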
You implement robust tiered caching: hot caches stay in VRAM, warm caches drop to host RAM over PCIe, and cold caches spill to local, ultra-fast NVMe.
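A tiered policy can be as simple as demoting caches by idle time. The sketch below is illustrative only: the tier names, thresholds, and entry layout are assumptions, and a real engine would also account for capacity pressure, not just age:

```python
import time

# Minimal tiered-cache demotion sketch. Thresholds are made-up values:
# entries idle past a threshold are demoted vram -> host_ram -> nvme.
TIERS = ["vram", "host_ram", "nvme"]
DEMOTE_AFTER_S = {"vram": 30, "host_ram": 300}  # assumed idle limits

def demote_idle(entries, now=None):
    now = time.time() if now is None else now
    for entry in entries:
        idle = now - entry["last_used"]
        tier = entry["tier"]
        if tier in DEMOTE_AFTER_S and idle > DEMOTE_AFTER_S[tier]:
            # Move one tier colder; "nvme" is terminal and never demoted.
            entry["tier"] = TIERS[TIERS.index(tier) + 1]
    return entries

sessions = [
    {"id": "hot",  "tier": "vram", "last_used": 100.0},
    {"id": "warm", "tier": "vram", "last_used": 0.0},
]
demote_idle(sessions, now=100.0)
# "hot" (just used) stays in VRAM; "warm" (idle 100 s > 30 s) drops to host RAM
```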
The Bottom Line
You cannot treat Large Language Models like standard web APIs. They are heavy, volatile engines with insatiable memory demands.
If you are orchestrating LLMs in a cloud-native K8s environment, your orchestration logic must become memory-aware. You must decouple the state (the cache) from the compute (the GPU). Until you implement robust KV cache offloading, every scaling event in your cluster is just a latency bomb waiting to detonate.



