LLM GPU Cost Estimator
Estimate hardware needs and costs for serving LLMs at scale.
How the Math Works
1. Model Weights (The Brain)
To run an LLM, you must first load all of its parameters (the model's learned weights) into the GPU's memory.
- FP16 (High Quality): 2 bytes per parameter. A 70B model takes ~140 GB.
- INT4 (Compressed): 0.5 bytes per parameter. The same 70B model shrinks to ~35 GB.
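The bullet math above can be sketched in one line: parameter count in billions times bytes per parameter gives gigabytes directly (using decimal GB, as the estimates above do).

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """VRAM for model weights alone: billions of params x bytes each = GB."""
    return params_billions * bytes_per_param

print(weight_memory_gb(70, 2.0))   # FP16 -> 140.0 GB
print(weight_memory_gb(70, 0.5))   # INT4 -> 35.0 GB
```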
2. KV Cache (Short-Term Memory)
As the AI generates responses, it remembers the conversation history in the KV Cache.
- It scales linearly with the total token count (input + output) and the number of concurrent users.
- Modern models use Grouped Query Attention (GQA) to keep this memory pool small, but it can still grow to tens of gigabytes under high load.
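As a sketch, the KV cache stores one K and one V vector per layer per token. The configuration below (80 layers, 8 KV heads via GQA, head dimension 128) is an assumed Llama-2-70B-style layout used for illustration, not a universal constant:

```python
def kv_cache_gb(tokens_per_user: int, concurrent_users: int,
                layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) per layer, per token, in FP16."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens_per_user * concurrent_users * per_token / 1e9

# 4096-token conversations, 32 concurrent users, GQA with 8 KV heads:
print(round(kv_cache_gb(4096, 32), 1))  # ~42.9 GB
```

Without GQA (e.g. 64 KV heads instead of 8), the same load would need 8x the cache, which is why modern architectures shrink the KV head count.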
3. Optimization Variables
Advanced serving techniques alter the VRAM footprint:
- Speculative Decoding: Uses a small assistant model to draft answers. It speeds up serving but requires loading both models into memory.
- Continuous Batching: Overlaps requests to ensure the GPU is never idle, making KV cache utilization more efficient without increasing max VRAM.
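The speculative-decoding memory cost from the first bullet can be sketched as follows; the 7B draft model size is a hypothetical choice for illustration:

```python
def speculative_weights_gb(target_b: float, draft_b: float,
                           bytes_per_param: float = 2.0) -> float:
    """Both the target and the draft model weights must sit in VRAM at once."""
    return (target_b + draft_b) * bytes_per_param

print(speculative_weights_gb(70, 7))  # 154.0 GB at FP16, vs 140.0 without a draft
```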
Pricing Captured: April 2026.
Sources: Hardware estimates and costs based on typical dedicated cloud rates (e.g., Lambda Labs, RunPod).
Disclaimer: GPU spot and on-demand pricing changes frequently based on availability. Please double-check the latest rates on the provider's website before making final architectural decisions.
Frequently Asked Questions
Why do I need more VRAM than the model size?
Loading the model is only step one. To actually process requests, the GPU needs free memory for activation tensors and the KV cache (short-term memory). If you run out of VRAM during generation, the system will crash or slow down significantly.
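A rough budgeting rule that captures this answer (the 10% overhead factor for activations and fragmentation is an illustrative assumption, not a measured constant):

```python
def total_vram_gb(weights_gb: float, kv_gb: float,
                  overhead_frac: float = 0.10) -> float:
    """Weights + KV cache, plus headroom for activations and fragmentation."""
    return (weights_gb + kv_gb) * (1 + overhead_frac)

# INT4 70B weights (~35 GB) plus a 20 GB KV cache pool:
print(round(total_vram_gb(35, 20), 1))  # 60.5 GB
```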
What is the difference between Gemma 3 and Gemma 4?
Gemma 3 (2025) introduced native multimodality and large context windows. Gemma 4 (2026) focuses on agentic workflows, featuring a highly optimized 26B Mixture of Experts (MoE) model and a powerful 31B dense model that rivals closed-source models.
Is INT4 quantization good enough for production?
Modern techniques like AWQ or GPTQ allow INT4 to retain very high accuracy (often within 1-2% of FP16) while using a fraction of the memory. For most business applications, INT4 or INT8 is the recommended starting point to save costs.
What is PagedAttention?
PagedAttention (used in vLLM) manages KV cache memory in small blocks, similar to virtual memory in operating systems. It eliminates external fragmentation and allows sharing of memory between requests, drastically increasing throughput.
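The block bookkeeping is simple: a sequence occupies just enough fixed-size blocks to cover its tokens, so wasted memory is bounded by one partially filled block per sequence (vLLM's default block size is 16 tokens):

```python
def kv_blocks(tokens: int, block_size: int = 16) -> int:
    """Number of fixed-size KV cache blocks one sequence occupies."""
    return -(-tokens // block_size)  # ceiling division

print(kv_blocks(1000))  # 63 blocks; only the last one is partially used
```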
How does continuous batching work?
Traditional batching waits for all requests in a batch to finish before starting new ones. Continuous batching inserts new requests as soon as any request in the batch finishes, maximizing GPU utilization.
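A toy simulator makes the difference concrete: each loop iteration is one decode step for the whole batch, and queued requests are admitted the instant a slot frees up (this is a pedagogical sketch, not how a real scheduler is implemented):

```python
from collections import deque

def decode_steps_continuous(lengths, max_batch=2):
    """Count decode steps to drain `lengths` (remaining tokens per request)
    when new requests are admitted as soon as any slot opens."""
    queue, active, steps = deque(lengths), [], 0
    while queue or active:
        while queue and len(active) < max_batch:
            active.append(queue.popleft())      # admit immediately
        active = [n - 1 for n in active]        # one token generated per request
        active = [n for n in active if n > 0]   # finished requests leave at once
        steps += 1
    return steps

# Static batching of [3, 1] then [1, 1] would take 3 + 1 = 4 steps:
print(decode_steps_continuous([3, 1, 1, 1]))  # 3 steps
```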
What is the impact of MoE (Mixture of Experts) on VRAM?
MoE models (like Mixtral or Gemma 4 26B) have massive total parameters but only activate a subset for each token. However, you still need to load the *entire* model into VRAM. So you need the VRAM for the full model size, even if compute is faster.
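In other words, VRAM is sized by total parameters, not the active subset. Using Mixtral 8x7B's published figures (~46.7B total, ~12.9B active per token) as the example:

```python
def moe_vram_gb(total_params_b: float, bytes_per_param: float = 2.0) -> float:
    """MoE weight memory is driven by TOTAL params, not active params."""
    return total_params_b * bytes_per_param

# Mixtral 8x7B activates only ~12.9B params per token, yet still needs:
print(round(moe_vram_gb(46.7), 1))  # ~93.4 GB at FP16
```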
What is the difference between A100 and H100 for serving?
H100 is significantly faster for inference due to architecture improvements (Transformer Engine) and higher memory bandwidth, often offering 2-3x performance for LLMs compared to A100, reducing latency and increasing throughput.
How does context length affect memory?
Memory grows linearly with context length in standard attention. Long contexts require huge KV cache pools, which can sometimes exceed the memory required for the model weights themselves.
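The linear growth is easy to tabulate. Using the same assumed 70B-class layout as above (80 layers, 8 GQA KV heads, head dimension 128, FP16):

```python
# Bytes cached per token: 2 tensors (K and V) x layers x KV heads x head dim x 2 bytes
per_token = 2 * 80 * 8 * 128 * 2  # = 327,680 bytes (~0.33 MB per token)

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {ctx * per_token / 1e9:.1f} GB")
```

At 131k tokens the cache alone (~42.9 GB) already exceeds the ~35 GB of INT4 weights for the same model.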
Can I run a 70B model on a single GPU?
Not at FP16 precision (that requires ~140 GB). Quantized to INT4 (~35 GB), a 70B model fits comfortably on a single 80 GB A100 or H100; INT8 (~70 GB) also fits, but leaves little headroom for the KV cache.
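A quick fit check captures this answer; the 10 GB KV-cache headroom allowance is an illustrative assumption:

```python
def fits_on_gpu(params_b: float, bytes_per_param: float,
                gpu_gb: float = 80, kv_headroom_gb: float = 10) -> bool:
    """Weights plus a KV-cache headroom allowance must fit in GPU memory."""
    return params_b * bytes_per_param + kv_headroom_gb <= gpu_gb

print(fits_on_gpu(70, 2.0))  # False: FP16 needs ~140 GB
print(fits_on_gpu(70, 0.5))  # True:  INT4 ~35 GB, plenty of headroom
print(fits_on_gpu(70, 1.0))  # True:  INT8 ~70 GB, but only ~10 GB spare
```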
What serving framework should I use?
vLLM is highly recommended for high throughput via PagedAttention. TGI (Text Generation Inference) is also popular and robust. Choice depends on specific model support and infrastructure preferences.