Build vs Buy LLM Calculator

Compare API costs with self-hosting costs, applying real-world optimization levers.

Example scenario: 150M tokens per month

Optimization levers:

- Quantization (INT4): reduces the GPU count needed by 50%
- Optimized serving (vLLM): increases throughput by ~3x

Cost Comparison

- Commercial API cost: $1,200
- Self-hosting cost: $2,000
- GPUs needed: 2
- Net savings: -$800

Recommendation: Stick with APIs for now.

How the Math Works

1. API Cost Calculation

API costs are calculated by multiplying your token volume by the provider's rates. We assume a standard split of 70% input tokens and 30% output tokens.
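A minimal sketch of that calculation. The per-million-token rates below are hypothetical placeholders, not any provider's actual pricing:

```python
def api_cost(total_tokens, input_rate_per_m, output_rate_per_m,
             input_share=0.70):
    """Monthly API cost from total token volume and per-million-token rates,
    assuming a 70/30 input/output split by default."""
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens * (1 - input_share)
    return (input_tokens / 1e6) * input_rate_per_m \
         + (output_tokens / 1e6) * output_rate_per_m

# Hypothetical rates: $2.50/M input, $10.00/M output, 150M tokens/month
print(round(api_cost(150e6, 2.50, 10.00), 2))  # ≈ 712.5
```

Swap in your provider's current rates; the 70/30 split is only a default and varies a lot by workload (RAG pipelines are input-heavy, generation-heavy apps are the opposite).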

2. Self-Hosting Cost Calculation (Levers)

Self-hosting costs are driven by the number of GPUs needed to serve your volume, multiplied by the monthly rate per GPU. The optimization levers reduce that GPU count: INT4 quantization roughly halves the GPUs needed, and an optimized serving stack like vLLM raises per-GPU throughput by roughly 3x.
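A sketch of how the lever math can be modeled. The baseline throughput (25M tokens per GPU per month) and the $1,000/month GPU rate are illustrative assumptions, not the calculator's actual constants:

```python
import math

def gpus_needed(tokens_per_month, base_tokens_per_gpu_month,
                quantized=False, vllm=False):
    """GPUs required after applying optimization levers."""
    capacity = base_tokens_per_gpu_month
    if quantized:
        capacity *= 2   # INT4 halves the GPU count needed
    if vllm:
        capacity *= 3   # continuous batching: ~3x throughput
    return math.ceil(tokens_per_month / capacity)

def hosting_cost(tokens_per_month, base_tokens_per_gpu_month,
                 gpu_month_rate, **levers):
    """Monthly self-hosting cost: GPU count times monthly GPU rate."""
    return gpus_needed(tokens_per_month, base_tokens_per_gpu_month,
                       **levers) * gpu_month_rate

# Assumed baseline: one GPU serves 25M tokens/month at $1,000/month
print(hosting_cost(150e6, 25e6, 1000))                            # no levers
print(hosting_cost(150e6, 25e6, 1000, quantized=True, vllm=True)) # both levers
```

The `math.ceil` matters: GPUs are rented in whole units, so cost moves in steps rather than scaling smoothly with volume.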

Pricing Captured: April 2026.
Sources: API pricing pulled from OpenAI and Anthropic official docs. Hosting costs based on typical dedicated cloud provider rates (Lambda Labs, RunPod).
Disclaimer: AI infrastructure and API pricing changes rapidly. Please double-check the latest rates before making final architectural or financial decisions.

Frequently Asked Questions

At what scale does self-hosting make sense?

For smaller models (8B class), the break-even point is often around 500M tokens per month. For larger models (70B class), you typically need to be consuming several billion tokens per month before dedicated GPU hardware becomes cheaper than pay-as-you-go APIs, unless you apply optimizations like quantization and vLLM.
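A rough way to see where a figure like that comes from, treating one GPU as a fixed monthly cost and the API as a blended per-million-token rate (both numbers below are assumptions for illustration):

```python
def break_even_volume_m(gpu_month_rate, blended_api_rate_per_m):
    """Monthly volume (in millions of tokens) where one GPU's fixed cost
    equals what the API would charge for the same traffic."""
    return gpu_month_rate / blended_api_rate_per_m

# Assumed: a dedicated GPU at $1,000/month vs a blended API rate of $2/M
print(break_even_volume_m(1000, 2.0))  # 500.0 -> ~500M tokens/month
```

Note this simple ratio ignores whether one GPU can actually serve that volume; for larger models needing multiple GPUs, the fixed cost (numerator) grows and the break-even point moves into the billions of tokens.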

What are the hidden costs of self-hosting?

This calculator covers hardware costs. Hidden costs include engineering time to maintain the serving stack, monitoring, downtime risks, and the cost of underutilized hardware when traffic is low.

What is continuous batching?

Continuous batching (used in vLLM) allows the serving engine to insert new requests into the batch as soon as any request finishes, rather than waiting for the whole batch to complete. This drastically increases throughput and lowers cost per token.
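A toy simulation of the difference, counting decode steps for a batch engine with a fixed number of slots (request lengths and slot count are made up for illustration):

```python
def static_batch_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes,
    so short requests sit idle waiting for long ones."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batch_steps(lengths, batch_size):
    """Continuous batching: a finished request's slot is refilled from the
    queue immediately, so slots rarely sit idle."""
    queue = list(lengths)
    active = []
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1
        # decrement every active request; drop those that just finished
        active = [r - 1 for r in active if r > 1]
    return steps

lengths = [10, 200, 15, 180, 12, 190, 20, 170]  # tokens per request
print(static_batch_steps(lengths, 4))       # padded to the longest request
print(continuous_batch_steps(lengths, 4))   # fewer total steps
```

Mixing short and long requests is exactly where continuous batching wins: in static batching every 10-token request in a batch pays for the 200-token straggler.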

Does open-source match commercial quality?

Yes, models like Llama 3 and Mixtral have closed the gap significantly. For specialized tasks where you can fine-tune a smaller model on your specific data, self-hosting a smaller model often outperforms a generic large commercial model.

How do I calculate token usage?

Estimate based on average prompt and response length. A typical interaction might be 500 input tokens and 200 output tokens. Multiply by your expected monthly volume to get total tokens.
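As arithmetic, using the example interaction size above and an assumed request volume:

```python
def monthly_tokens(requests_per_month, avg_input=500, avg_output=200):
    """Total monthly tokens from request volume and average lengths."""
    return requests_per_month * (avg_input + avg_output)

# e.g. 200,000 interactions/month at 500 input + 200 output tokens each
print(monthly_tokens(200_000))  # 140000000 -> 140M tokens/month
```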

What about fine-tuning costs?

Fine-tuning requires compute for training, which is a one-time or periodic cost. This calculator focuses on serving (inference) costs. Fine-tuning can reduce the model size needed, lowering hosting costs.

Is privacy a good reason to self-host?

Yes, many enterprises self-host to ensure data never leaves their network, complying with strict privacy regulations (GDPR, HIPAA) that commercial APIs might not guarantee without expensive enterprise agreements.

What is the maintenance overhead?

Significant. You need to manage updates, scaling, monitoring, and hardware failures. This is why many start with APIs and migrate only at scale.

Can I use hybrid approaches?

Yes. Many teams use a small local model for easy tasks (routing, summarization) and call commercial APIs only for complex reasoning, optimizing both cost and quality.
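A toy router sketching that pattern; the task categories and model names are placeholders, not real endpoints:

```python
def route(task_type, prompt):
    """Send routine tasks to a self-hosted small model and everything
    else to a commercial API. Model names here are placeholders."""
    LOCAL_TASKS = {"routing", "summarization", "classification"}
    if task_type in LOCAL_TASKS:
        return ("local-8b", prompt)        # cheap, self-hosted
    return ("commercial-api", prompt)      # pay-per-token frontier model

print(route("summarization", "Summarize this ticket...")[0])   # local-8b
print(route("planning", "Plan a data migration...")[0])        # commercial-api
```

In practice the routing decision itself is often made by the small local model (or a classifier), which is why routing appears in the "easy tasks" list.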

What about SLA guarantees?

Commercial APIs offer SLAs (often with enterprise tiers). Self-hosting puts the SLA responsibility on your own engineering team.