
Rajat Pandit · AI Infrastructure · 6 min read

Continuous Batching in vLLM: Killing the Hardware Idle

If your GPUs are idling at 40% utilization during inference, you are burning capital on memory bottlenecks, not computation.

[Hero image: a 1980s cinematic sci-fi rendering of metallic server banks glowing blue, captioned "CONTINUOUS BATCHING IN vLLM"]

If your high-end inference GPUs are idling at 40% utilization, you are bleeding capital. You aren’t waiting on the processor to do math. You are waiting on memory.

Let’s talk about the physical realities of self-hosting Large Language Models. When teams transition from consuming managed APIs (like Vertex AI or OpenAI) to running open-weights models on their own infrastructure, the immediate shock isn’t the difficulty of deploying the containers. It’s the unit economics.

You boot up an 8x H100 node on Google Cloud (an a3-highgpu-8g instance). You deploy a 70B-parameter model. You send it a few concurrent requests. Then you open nvidia-smi and look at the numbers.

The utilization metrics are atrocious. The Volatile GPU-Util reading sits somewhere between 30% and 50%. You are essentially renting a Ferrari to sit in traffic. To understand why this happens, and how to fix it, we have to look closely at how LLM inference fundamentally breaks down under traditional batching paradigms, and why vLLM and continuous batching are non-negotiable for production workloads.

The Illusion of Static Batching

In classical deep learning (think ResNet image classification), batching was beautifully simple. You wait for 32 images to arrive in a queue. You pack them into a single tensor. You send that tensor through the GPU matrix multiplication cores simultaneously. Every image takes exactly the same amount of computation. Every result exits the GPU at exactly the same time. The GPU utilization stays pinned near 100%.

Large Language Models do not work this way. LLMs generate text auto-regressively, one token at a time. This introduces a fatal flaw for static batching: varying generation lengths.

Imagine you batch four requests together.

  • Request A asks for a simple “Yes/No” classification (generates 2 tokens).
  • Request B asks for a short summary (generates 50 tokens).
  • Request C asks for a thorough code review (generates 500 tokens).
  • Request D asks to write a full blog post (generates 1500 tokens).

Under a static batching paradigm, the GPU groups these four requests. It computes the first token for all four. Then the second token.

After token 2, Request A is finished. But it cannot leave the batch. The GPU memory allocated to Request A is trapped. It must wait until Request D finishes generating its 1500th token before the entire batch is released and a new set of requests can be ingested.

For 1,498 iteration steps the GPU keeps holding memory for a finished Request A, and after step 500 it is only doing meaningful work for Request D while still holding the slots for A, B, and C. This is catastrophic for throughput. It means your massively parallel hardware is effectively operating serially.
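The waste is easy to quantify. Here is a toy Python sketch (an illustration of the scheduling problem, not vLLM code) that counts how many batch-slot steps actually produce a token under static batching:

```python
# Toy model of static batching: every slot stays occupied until the
# longest request in the batch finishes generating.
def static_batch_waste(gen_lengths):
    """Return (useful_steps, occupied_steps) for one static batch."""
    steps = max(gen_lengths)          # batch runs until the longest request ends
    useful = sum(gen_lengths)         # slot-steps that produced a token
    occupied = steps * len(gen_lengths)  # slot-steps the batch held the GPU
    return useful, occupied

useful, occupied = static_batch_waste([2, 50, 500, 1500])
print(f"useful: {useful}, occupied: {occupied}, efficiency: {useful/occupied:.1%}")
# useful: 2052, occupied: 6000, efficiency: 34.2%
```

With the four requests above, roughly two-thirds of the batch's slot-steps produce nothing at all.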

The Memory Bottleneck: The KV Cache

To make matters worse, LLM inference is wildly memory-bound due to the Key-Value (KV) cache.

Every time the model generates a new token, it doesn't just look at that single token; it needs the entire context of everything generated so far. Recomputing the attention inputs from scratch at every step would be prohibitively expensive. Instead, we cache the intermediate "Key" and "Value" tensors from previous steps in the GPU's High Bandwidth Memory (HBM).

Because we don’t know in advance how many tokens a user will generate (will they generate 10 tokens or 1000?), traditional ML serving systems like Triton had to pre-allocate the maximum possible memory for the KV cache of every single request.

If your max context length is 8k tokens, the system pre-allocates contiguous memory for 8k tokens, even if the user only generates 5. This leads to massive internal fragmentation. In practice, researchers found that up to 60-80% of GPU memory was simply wasted—pre-allocated but entirely unused.
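A back-of-the-envelope calculation shows the scale. The figures below are illustrative assumptions for a 70B-class model with grouped-query attention (80 layers, 8 KV heads, head dimension 128, fp16), not measurements of any specific deployment:

```python
# KV cache sizing under static pre-allocation (illustrative model shape).
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 80, 8, 128, 2   # fp16 = 2 bytes/element

def kv_bytes(tokens):
    # 2x for the separate Key and Value tensors at every layer
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES * tokens

prealloc = kv_bytes(8192)   # static serving reserves the full 8k context
actual = kv_bytes(5)        # ...even if the user only generates 5 tokens
print(f"reserved: {prealloc / 2**30:.2f} GiB, used: {actual / 2**20:.2f} MiB")
# reserved: 2.50 GiB, used: 1.56 MiB
print(f"wasted: {1 - actual / prealloc:.2%}")
```

Under these assumptions, a single short request pins 2.5 GiB of HBM while touching less than 2 MiB of it.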

You run out of memory long before you run out of compute power.

Enter vLLM and PagedAttention

This is exactly the problem that vLLM was built to solve. Developed initially at UC Berkeley, vLLM introduces an architecture that manages GPU memory much the way a traditional operating system manages virtual memory: in pages.

They call it PagedAttention.

Instead of demanding large, contiguous blocks of pre-allocated memory for the KV cache, PagedAttention breaks the cache up into smaller, fixed-size blocks (pages). Each block might hold the keys and values for just 16 tokens.

These blocks do not need to be contiguous in physical memory. The vLLM engine maintains a block table—a virtual memory map. As a request generates more tokens, vLLM dynamically allocates new blocks on the fly.

This all but eliminates memory waste. You only consume the HBM required for the tokens you actually generated, plus at most one partially filled block per request. Suddenly, the memory bottleneck recedes. Because you aren't wasting the bulk of your VRAM on pre-allocation, you can pack dramatically more concurrent requests onto the exact same piece of silicon.
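The bookkeeping can be sketched in a few lines. This is a simplified illustration of the idea, not vLLM's actual block manager:

```python
# Sketch of PagedAttention-style bookkeeping: the KV cache is carved into
# fixed-size blocks, and each request holds a block table mapping its
# logical blocks to arbitrary (non-contiguous) physical block ids.
BLOCK_SIZE = 16  # tokens per block

class BlockManager:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # physical block ids available
        self.tables = {}                      # req_id -> [physical block ids]
        self.lengths = {}                     # req_id -> tokens stored

    def append_token(self, req_id):
        n = self.lengths.get(req_id, 0)
        if n % BLOCK_SIZE == 0:               # current block full: allocate on the fly
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        # Finished request: its blocks return to the pool immediately
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

mgr = BlockManager(num_blocks=64)
for _ in range(40):                           # a request generates 40 tokens
    mgr.append_token("A")
print(len(mgr.tables["A"]))                   # ceil(40 / 16) = 3 blocks, not 8k worth
mgr.release("A")
print(len(mgr.free))                          # all 64 blocks free again
```

Note the contrast with static pre-allocation: the 40-token request above ever holds only three 16-token blocks, and they are recycled the instant it finishes.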

The Final Piece: Continuous Batching (Iteration-Level Scheduling)

PagedAttention solves the memory fragmentation, but what about the static batching problem where Request A waits for Request D?

Because vLLM manages memory dynamically, it enables Continuous Batching.

Continuous Batching operates at the iteration level instead of the request level. The engine evaluates the pipeline after every single token is generated. When Request A hits its final token, the system instantly snips it out of the batch. It frees up Request A’s memory blocks and immediately injects a new pending request from the queue into the active batch.

The batch is fluid. Requests are constantly entering and exiting the stream asynchronously.

The GPU never halts for stragglers. The matrix multiplication cores stay saturated. It’s a beautifully violent process of constant context switching, but done so efficiently that the overhead is negligible compared to the massive gains in throughput.
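The scheduling loop can be sketched as a toy simulation. The request lengths and batch size here are arbitrary, and this is not vLLM's scheduler, just the iteration-level idea:

```python
import random
from collections import deque

random.seed(0)
lengths = [random.randint(2, 30) for _ in range(12)]  # tokens each request generates
waiting = deque(enumerate(lengths))                   # (req_id, tokens to generate)
active = {}                                           # req_id -> tokens remaining
MAX_BATCH = 4
steps = 0

while waiting or active:
    # Iteration-level scheduling: admit new requests at EVERY step,
    # not only when the whole batch drains.
    while waiting and len(active) < MAX_BATCH:
        req_id, remaining = waiting.popleft()
        active[req_id] = remaining
    for req_id in list(active):                       # one decode iteration
        active[req_id] -= 1
        if active[req_id] == 0:                       # finished: exit the batch now
            del active[req_id]
    steps += 1

print(f"{sum(lengths)} tokens generated in {steps} batched steps")
```

Every step runs a full batch whenever work is queued, so short requests never pad out the tail of a long one.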

The Real-World Impact on TCO

I have watched engineering teams tear their hair out arguing over quantization levels (FP8 vs INT4) trying to squeeze an extra 10% performance out of their models. They are focusing on the wrong lever.

Quantization affects the math. But inference is memory-bound.

By migrating from standard HuggingFace pipelines to a dedicated, highly optimized serving engine with continuous batching (like vLLM or TensorRT-LLM), the difference isn't 10%. It's often a 3x to 5x increase in aggregate throughput on the exact same hardware.

You take standard GPU usage from that volatile 40% and pin it aggressively near 85%+.

When you sit down to calculate the Total Cost of Ownership (TCO) for a self-hosted enterprise deployment, throughput is the denominator. If you can serve 300 requests per second instead of 80 requests per second on your GCP a3-highgpu-8g instances, you just drastically slashed the per-token cost of your operation.
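The arithmetic is straightforward. The hourly rate below is a placeholder for illustration, not a quoted GCP price; substitute your actual a3-highgpu-8g rate:

```python
# Per-request cost is just (hourly rate) / (requests served per hour).
HOURLY_RATE = 90.0  # $/hour, assumed for illustration only

def cost_per_million_requests(req_per_sec):
    req_per_hour = req_per_sec * 3600
    return HOURLY_RATE / req_per_hour * 1_000_000

before = cost_per_million_requests(80)    # static batching throughput
after = cost_per_million_requests(300)    # continuous batching throughput
print(f"${before:.2f} vs ${after:.2f} per million requests")
print(f"cost reduction: {1 - after / before:.0%}")
# cost reduction: 73%
```

Same node, same model, same rate card; the throughput denominator does all the work.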

The software layer optimizing the silicon is just as critical as the silicon itself. If you are deploying LLMs in production without continuous batching, you are essentially buying a supercomputer and using it as a pocket calculator.
