LLM GPU Cost Estimator

Estimate hardware needs and costs for serving LLMs at scale.


VRAM Estimates

Model Weights: 140 GB
KV Cache (Memory): 9.4 GB
Draft Model (Speculative): 0 GB
Total VRAM: 149.4 GB

Recommended Hardware

2x NVIDIA A100 (80GB)

Est. Cost: ~$4.00 / hour

How the Math Works

1. Model Weights (The Brain)

To run an AI, you must load its parameters (the knobs it learned during training) into the GPU's memory. At FP16, each parameter takes 2 bytes, so a 70B-parameter model needs roughly 140 GB for the weights alone.
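
Under the hood this is simple multiplication. A minimal sketch, assuming FP16 (2 bytes per parameter) and 1 GB = 10^9 bytes:

```python
def weights_vram_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """VRAM for weights alone: parameter count x bytes per parameter."""
    return params_billion * bytes_per_param  # billions of params * bytes = GB

# A 70B model at FP16 matches the 140 GB shown above.
print(weights_vram_gb(70))        # 140.0 GB
print(weights_vram_gb(70, 0.5))   # 35.0 GB at INT4 (0.5 bytes per param)
```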

2. KV Cache (Short-Term Memory)

As the AI generates responses, it stores the attention keys and values for the conversation history in the KV cache. This "short-term memory" grows with both batch size and context length.
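
The cache size is governed by the model's attention shapes. A sketch using assumed Llama-70B-like dimensions (80 layers, 8 grouped-query KV heads, head dimension 128, FP16); with a batch of 10 sequences at 3,072 tokens each, it lands on the 9.4 GB figure above:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_val=2):
    # factor of 2 = one key tensor + one value tensor per layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_val

size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                      seq_len=3072, batch=10)
print(f"{size / 2**30:.1f} GiB")  # 9.4
```

The exact shapes here are assumptions for illustration; plug in your target model's config values.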

3. Optimization Variables

Advanced serving techniques alter the VRAM footprint: quantization shrinks the weights (INT8 halves them, INT4 quarters them), while speculative decoding adds a small draft model that must also fit in memory.
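
These levers can be folded into one total. A hedged sketch (the precision table and draft-model handling are assumptions, not the calculator's exact internals):

```python
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def total_vram_gb(params_b, precision="FP16", kv_gb=0.0, draft_params_b=0.0):
    """Weights + KV cache + optional speculative-decoding draft model."""
    weights = params_b * BYTES_PER_PARAM[precision]
    draft = draft_params_b * BYTES_PER_PARAM[precision]
    return weights + kv_gb + draft

# 70B at FP16 with a 9.4 GB KV cache and no draft model:
print(total_vram_gb(70, "FP16", kv_gb=9.4))  # 149.4 GB -> 2x A100 80GB
```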

Pricing Captured: April 2026.
Sources: Hardware estimates and costs based on typical dedicated cloud rates (e.g., Lambda Labs, RunPod).
Disclaimer: GPU spot and on-demand pricing changes frequently based on availability. Please double-check the latest rates on the provider's website before making final architectural decisions.

Frequently Asked Questions

Why do I need more VRAM than the model size?

Loading the model is only step one. To actually process requests, the GPU needs free memory for activation tensors and the KV cache (short-term memory). If you run out of VRAM during generation, the system will crash or slow down significantly.

What is the difference between Gemma 3 and Gemma 4?

Gemma 3 (2025) introduced native multimodality and large context windows. Gemma 4 (2026) focuses on agentic workflows, featuring a highly optimized 26B Mixture of Experts (MoE) model and a powerful 31B dense model that rivals closed-source models.

Is INT4 quantization good enough for production?

Modern techniques like AWQ or GPTQ allow INT4 to retain very high accuracy (often within 1-2% of FP16) while using a fraction of the memory. For most business applications, INT4 or INT8 is the recommended starting point to save costs.
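
The memory savings follow directly from bits per weight. For an assumed 70B-parameter model:

```python
def quantized_gb(params_billion: float, bits: int) -> float:
    return params_billion * bits / 8  # bits per weight -> bytes per weight

print(quantized_gb(70, 16))  # 140.0 GB at FP16
print(quantized_gb(70, 8))   # 70.0 GB at INT8
print(quantized_gb(70, 4))   # 35.0 GB at INT4 -- fits a single 80 GB GPU
```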

What is PagedAttention?

PagedAttention (used in vLLM) manages KV cache memory in small blocks, similar to virtual memory in operating systems. It eliminates external fragmentation and allows sharing of memory between requests, drastically increasing throughput.
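
The core idea can be sketched as a toy block allocator (the block size and API below are illustrative, not vLLM's actual interface):

```python
class PagedKVAllocator:
    """Toy paged KV allocator: fixed-size physical blocks handed out on demand."""
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # request id -> list of physical block ids
        self.seq_lens = {}       # request id -> tokens written so far

    def append_token(self, req_id: str) -> None:
        n = self.seq_lens.get(req_id, 0)
        if n % self.block_size == 0:  # last block is full -> grab a new one
            self.block_tables.setdefault(req_id, []).append(self.free_blocks.pop())
        self.seq_lens[req_id] = n + 1

    def release(self, req_id: str) -> None:
        # blocks return to the pool instantly, with no fragmentation
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.seq_lens.pop(req_id, None)

alloc = PagedKVAllocator(num_blocks=8)
for _ in range(17):                      # 17 tokens -> exactly 2 blocks of 16
    alloc.append_token("req-1")
print(len(alloc.block_tables["req-1"]))  # 2
```

Because memory is reserved block by block instead of as one contiguous slab per request, no VRAM is wasted on worst-case sequence lengths.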

How does continuous batching work?

Traditional batching waits for all requests in a batch to finish before starting new ones. Continuous batching inserts new requests as soon as any request in the batch finishes, maximizing GPU utilization.
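
A toy simulation (decode-step granularity, hypothetical request tuples) shows the difference:

```python
from collections import deque

def simulate(requests, max_batch=2, continuous=True):
    """Each request is (id, decode_steps). Returns total steps taken."""
    waiting, running, steps = deque(requests), [], 0
    while waiting or running:
        # continuous batching refills free slots every step;
        # traditional batching only refills once the whole batch drains
        if continuous or not running:
            while waiting and len(running) < max_batch:
                running.append(list(waiting.popleft()))
        steps += 1
        for r in running:
            r[1] -= 1                       # one decode step per active request
        running = [r for r in running if r[1] > 0]
    return steps

jobs = [("a", 3), ("b", 1), ("c", 2)]
print(simulate(jobs, continuous=False))  # 5 steps: batch {a,b} must drain first
print(simulate(jobs, continuous=True))   # 3 steps: c slots in when b finishes
```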

What is the impact of MoE (Mixture of Experts) on VRAM?

MoE models (like Mixtral or Gemma 4 26B) have massive total parameters but only activate a subset for each token. However, you still need to load the *entire* model into VRAM. So you need the VRAM for the full model size, even if compute is faster.
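
A quick arithmetic sketch makes the point, using commonly cited figures for Mixtral 8x7B (roughly 47B total parameters, roughly 13B active per token):

```python
# VRAM is set by TOTAL parameters; per-token compute is set by ACTIVE parameters.
def moe_weights_gb(total_params_b: float, bytes_per_param: float = 2.0) -> float:
    return total_params_b * bytes_per_param

print(moe_weights_gb(47))  # 94.0 GB at FP16 -- despite only ~13B params active,
                           # far more than a dense "13B" model's ~26 GB
```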

What is the difference between A100 and H100 for serving?

H100 is significantly faster for inference due to architecture improvements (Transformer Engine) and higher memory bandwidth, often offering 2-3x performance for LLMs compared to A100, reducing latency and increasing throughput.

How does context length affect memory?

Memory grows linearly with context length in standard attention. Long contexts require huge KV cache pools, which can sometimes exceed the memory required for the model weights themselves.
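
Because the per-token cache cost is constant, the scaling is linear. Using an assumed per-token figure for a Llama-70B-like model at FP16 (2 x 80 layers x 8 KV heads x 128 head dim x 2 bytes = 327,680 bytes):

```python
KV_BYTES_PER_TOKEN = 327_680  # assumed Llama-70B-like shapes, FP16

for ctx in (4_096, 32_768, 131_072):
    gb = KV_BYTES_PER_TOKEN * ctx / 1e9
    print(f"{ctx:>7} tokens -> {gb:5.1f} GB per sequence")
# 131,072 tokens costs ~42.9 GB per sequence, so a few concurrent users
# at that length already rival the ~140 GB of FP16 weights.
```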

Can I run a 70B model on a single GPU?

Not at FP16 precision (requires ~140 GB). You can run it on a single 80 GB GPU if you quantize to INT4 (~35 GB) or INT8 (~70 GB), which fits comfortably on a single A100 or H100 80GB.

What serving framework should I use?

vLLM is highly recommended for high throughput via PagedAttention. TGI (Text Generation Inference) is also popular and robust. Choice depends on specific model support and infrastructure preferences.