AI Infrastructure · 6 min read
Serverless Inference: Conquering the 5-Second Cold Start
The infrastructure hacks required to make scale-to-zero LLM inference viable for production latency.

- Traditional serverless architectures fail for LLMs because pulling tens of gigabytes of model weights from cloud storage during a cold start takes tens of seconds, not milliseconds.
- To bring cold starts down to a few seconds, you must stop downloading weights over the network on every start and rely on local caching, memory mapping (mmap), and weight streaming.
- Pre-warmed snapshotting allows you to freeze a GPU state in memory and clone it across workers, bypassing the initialization overhead of frameworks like PyTorch.
- True scale-to-zero economics are finally becoming possible for massive models, fundamentally changing the unit economics of AI startups.
The promise of serverless infrastructure is simple: you only pay for what you use. When a request comes in, a container spins up, processes the request, and spins down. If there is no traffic, your bill is zero. This economic model powered the explosion of Web 2.0 startups because it eliminated the need to provision and pay for idle servers.
But when the generative AI boom hit, we slammed into a massive wall. You cannot simply drop a Large Language Model into an AWS Lambda function.
A modern LLM, even a relatively small 8-billion-parameter model, requires roughly 16 GB of weights at 16-bit precision to be loaded into VRAM before it can generate a single token. If you scale a GPU node to zero, the next time a user sends a prompt, the node has to boot up, download the massive weight file from object storage, load it into system RAM, and then transfer it over the PCIe bus to the GPU.
This process, known as a cold start, can easily take 30 to 60 seconds. In the world of consumer applications, a 30-second wait time is an eternity. Users will simply close the tab. As a result, AI companies have been forced to leave their expensive GPU instances running 24/7, burning cash on idle compute just to guarantee low latency.
Conquering the 5-second cold start is the holy grail of AI infrastructure. Let's look at the weight streaming and snapshotting techniques that make serverless LLM inference actually work in production.
The Physics of the Cold Start
To fix the cold start, we have to identify exactly where the time is being spent. When a cold request hits an idle GPU node, the timeline looks roughly like this:
- Provisioning: The cloud provider allocates the VM and GPU (1-3 seconds).
- Container Pull: Downloading the Docker image containing your inference server code (2-5 seconds).
- Weight Download: Pulling the 20GB+ Safetensors or PyTorch .pt file from S3 or GCS (15-40 seconds, highly dependent on network bandwidth).
- Model Initialization: PyTorch allocating memory and moving the weights from system RAM to VRAM (5-10 seconds).
The biggest bottleneck is the third step, the weight download: dragging massive files across the network. If we want serverless inference, we have to eliminate the network download.
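To make the bottleneck concrete, here is a minimal sketch that adds up the stage ranges from the timeline above. The numbers are the article's rough estimates, not measurements, and the function names are just illustrative.

```python
# Stage ranges in seconds, copied from the timeline above (rough estimates).
STAGES = {
    "provisioning": (1, 3),
    "container_pull": (2, 5),
    "weight_download": (15, 40),
    "model_initialization": (5, 10),
}


def cold_start_budget(stages):
    """Return (best_case, worst_case) total seconds across all stages."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


def worst_case_share(stages, name):
    """Fraction of the worst-case total that a single stage accounts for."""
    _, worst = cold_start_budget(stages)
    return stages[name][1] / worst


best, worst = cold_start_budget(STAGES)
share = worst_case_share(STAGES, "weight_download")
print(f"cold start: {best}-{worst} s, weight download share: {share:.0%}")
```

With these ranges the total comes out to 23-58 seconds, and the weight download alone accounts for roughly two thirds of the worst case.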
Technique 1: Memory Mapping (mmap) and Local Caching
The first breakthrough in reducing cold starts is realizing that you do not need to download the entire model to start executing it, and you certainly don’t need to download it over the network every time if you can cache it on the host node.
Modern serverless GPU providers (like Modal, Baseten, or RunPod) use massive, shared NVMe drives attached directly to the host machines. When your container spins up, the model weights are already sitting on the host’s local SSD.
But reading a 20GB file from an SSD into RAM still takes time. This is where mmap (memory mapping) comes in. Formats like Safetensors are designed to be memory-mapped. Instead of actively reading the file into RAM, the OS maps the file directly into the application’s virtual address space. When PyTorch accesses a tensor, the OS fetches the pages backing that tensor from the SSD on demand. This avoids the long up-front load and lets data start flowing to the GPU almost immediately.
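The idea is easier to see in a toy example. The sketch below uses only Python's standard `mmap` module and a made-up flat file layout (fixed-size float32 tensors written back to back). That layout is an assumption for illustration only; the real Safetensors format stores a header describing each tensor's offsets, and a production loader would use the library rather than hand-rolled parsing.

```python
import mmap
import os
import struct
import tempfile

FLOATS_PER_TENSOR = 4                     # toy size; real tensors hold millions of values
BYTES_PER_TENSOR = FLOATS_PER_TENSOR * 4  # float32 is 4 bytes
TENSOR_FORMAT = f"<{FLOATS_PER_TENSOR}f"


def write_toy_weights(path, num_tensors):
    """Write num_tensors back-to-back float32 tensors (hypothetical layout)."""
    with open(path, "wb") as f:
        for i in range(num_tensors):
            f.write(struct.pack(TENSOR_FORMAT, *([float(i)] * FLOATS_PER_TENSOR)))


def load_one(path, index):
    """Map the whole file, but decode only the tensor at `index`."""
    with open(path, "rb") as f:
        # Mapping reserves address space; it does not read the file up front.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Only the pages backing this slice are faulted in from disk.
            return struct.unpack_from(TENSOR_FORMAT, mapped, index * BYTES_PER_TENSOR)


with tempfile.TemporaryDirectory() as tmp:
    weights_path = os.path.join(tmp, "toy_weights.bin")
    write_toy_weights(weights_path, num_tensors=8)
    tensor_5 = load_one(weights_path, index=5)

print(tensor_5)
```

The point of the design is that `load_one` costs about the same whether the file holds eight tensors or eight million: the work is proportional to what you touch, not to the size of the file.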
Technique 2: Chunked Weight Streaming
Even with local NVMe caching, transferring 20GB over the PCIe bus to the GPU takes a few seconds. If you wait for the entire model to load before processing the prompt, you are still facing a noticeable cold start.
The solution is chunked weight streaming. Because LLMs execute sequentially layer by layer, you do not need Layer 32 in VRAM to compute Layer 1.
Advanced inference engines stream the weights directly into the GPU just in time for execution. While the GPU is calculating the attention scores for Layer 1, the PCIe bus is simultaneously transferring the weights for Layer 2. By overlapping the I/O transfer with the compute, you can start processing the prefill phase of the prompt shortly after the container boots and hide most of the transfer latency, provided each layer's compute takes at least as long as the next layer's transfer.
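The overlap pattern itself is just double buffering. Here is a minimal CPU-only sketch of it, assuming a background thread stands in for the PCIe transfer; `load_layer` and `compute_layer` are hypothetical stand-ins, not part of any real inference engine, which would typically use separate GPU streams and pinned host memory instead.

```python
from concurrent.futures import ThreadPoolExecutor


def load_layer(index):
    """Stand-in for copying one layer's weights to the GPU (hypothetical)."""
    return {"index": index, "scale": index + 1}


def compute_layer(weights, activations):
    """Stand-in for one layer's forward pass (hypothetical)."""
    return [x * weights["scale"] for x in activations]


def streamed_forward(num_layers, activations, events=None):
    """Run layers in order while prefetching the next layer's weights."""
    events = [] if events is None else events
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = loader.submit(load_layer, 0)
        for i in range(num_layers):
            weights = pending.result()  # block only until layer i has arrived
            if i + 1 < num_layers:
                # Kick off the next transfer; it runs while we compute below.
                pending = loader.submit(load_layer, i + 1)
                events.append(f"prefetch {i + 1}")
            activations = compute_layer(weights, activations)
            events.append(f"compute {i}")
    return activations


log = []
result = streamed_forward(3, [1.0, 2.0], log)
print(result, log)
```

Note that the loop never waits for more than one layer at a time. Whether the transfer is fully hidden depends on the ratio of per-layer compute to per-layer transfer time, which this toy does not model.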
Explainer Diagram: A timeline comparison. The top track shows a traditional sequential cold start (Download -> Load to RAM -> Transfer to VRAM -> Execute). The bottom track shows the serverless approach using mmap and weight streaming, where execution begins almost immediately as layers are streamed to the GPU just-in-time.
Technique 3: Pre-Warmed Snapshotting
The final frontier of cold-start optimization is addressing the fourth step, model initialization. PyTorch is a fantastic framework, but instantiating a massive model object and running the initial graph compilations takes significant CPU time.
To bypass this, providers are utilizing snapshotting technologies like CRIU (Checkpoint/Restore In Userspace).
Instead of booting the container and running the Python initialization script from scratch, the provider boots the model once, loads it into memory, and then freezes the entire process state (memory, file descriptors, GPU context) into a snapshot.
When a user request comes in, the system restores the snapshot. The container wakes up exactly at the point where it is ready to accept a prompt, completely bypassing the heavy initialization phase. It is the equivalent of closing your laptop lid and opening it again, rather than performing a full hard reboot.
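CRIU operates on whole processes from outside the application, so it cannot be shown in a few lines of application code. As a loose, object-level analogy only, the sketch below captures the same "initialize once, restore many times" pattern using `pickle`. `ToyModel` and both helper functions are hypothetical, and nothing here touches GPU state.

```python
import pickle


class ToyModel:
    """Stand-in for a model whose construction is expensive (hypothetical)."""

    init_calls = 0  # counts how many times the slow path actually ran

    def __init__(self, num_layers):
        ToyModel.init_calls += 1
        # Imagine framework setup and graph compilation happening here.
        self.scales = [i + 1 for i in range(num_layers)]

    def generate(self, x):
        for scale in self.scales:
            x *= scale
        return x


def take_snapshot(model):
    """Serialize the fully initialized state to bytes (the 'image')."""
    return pickle.dumps(model.__dict__)


def restore_snapshot(snapshot):
    """Rebuild a ready-to-serve model from the image without running __init__."""
    clone = ToyModel.__new__(ToyModel)
    clone.__dict__.update(pickle.loads(snapshot))
    return clone


model = ToyModel(num_layers=3)      # pay the initialization cost once
snapshot = take_snapshot(model)
workers = [restore_snapshot(snapshot) for _ in range(4)]

print(ToyModel.init_calls, workers[0].generate(2))
```

Four workers come up ready to serve, yet the expensive constructor ran only once. A process-level snapshot gives the same effect for everything the process holds, not just one Python object.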
The Economic Reality
Combining local NVMe caching, memory-mapped layer streaming, and process snapshotting brings the cold start down from 40 seconds to under 3 seconds.
For the first time, you can realistically scale a massive LLM cluster to zero during off-peak hours and rely on cold starts to handle the initial traffic spikes without infuriating your users.
This fundamentally alters the unit economics of building AI applications. You no longer need millions in venture capital just to keep an idle cluster of A100s warm. You can build, deploy, and pay strictly for the milliseconds of compute your users actually consume. The infrastructure has finally caught up with the models.
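A back-of-the-envelope comparison shows why this matters. The sketch below uses entirely hypothetical inputs (the hourly rate, GPU count, and busy fraction are placeholders, not real prices) and ignores any premium that per-second billing may carry over reserved instances.

```python
HOURS_PER_MONTH = 730  # approximate average month


def always_on_cost(hourly_rate, gpus):
    """Monthly cost when every GPU stays warm around the clock."""
    return hourly_rate * gpus * HOURS_PER_MONTH


def scale_to_zero_cost(hourly_rate, gpus, busy_fraction):
    """Monthly cost when GPUs are billed only while serving traffic."""
    return hourly_rate * gpus * HOURS_PER_MONTH * busy_fraction


# Hypothetical inputs for illustration only.
rate, gpus, busy = 2.0, 8, 0.25

warm = always_on_cost(rate, gpus)
elastic = scale_to_zero_cost(rate, gpus, busy)
print(f"always on: ${warm:,.0f}/month, scale to zero: ${elastic:,.0f}/month")
```

Under these assumptions the bill tracks utilization directly: a cluster that is busy a quarter of the time costs a quarter as much.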



