AI Infrastructure · 6 min read
TTFT vs ITL: The Two Metrics Defining Inference Performance
Average latency is a lie that hides tail-end failures. To truly optimize AI inference in 2026, you must separate your Time To First Token from your Inter-Token Latency.

Key Takeaways
- Relying on “average latency” for LLM applications is a dangerous anti-pattern that obscures critical performance bottlenecks.
- Inference performance must be broken down into two distinct phases: Time To First Token (TTFT) and Inter-Token Latency (ITL).
- TTFT is heavily dependent on compute availability and the prefill phase, while ITL is bottlenecked by memory bandwidth during the decode phase.
- Optimizing for one often requires trading off the other; architecture choices must align with the specific user experience requirements.
If you are building LLM applications in production, and your primary performance metric is “average latency,” you are flying blind. I see this constantly. Teams launch a new feature powered by Gemini 2.5, look at their Datadog dashboard, see an average response time of 2.5 seconds, and call it a success.
Then the complaints start rolling in. Users are abandoning the interface. The application feels sluggish. Why? Because averages lie.
In distributed systems, the tail is where the pain lives. That 2.5-second average is hiding the fact that 10% of your users are staring at a blank screen for 8 seconds before anything happens. In the world of generative interfaces, a blank screen for 8 seconds is an eternity. It breaks the illusion of intelligence. It destroys the user experience.
To actually understand and optimize inference performance, you have to throw away the average. You must break the generation process down into its fundamental physical phases. You need to measure, monitor, and optimize two very different metrics: Time To First Token (TTFT) and Inter-Token Latency (ITL).
The Anatomy of a Generation
When you send a prompt to a model, the inference engine does not process the request in a single, monolithic chunk. It happens in two distinct phases, each stressing the underlying hardware in completely different ways.
First comes the Prefill Phase. This is where the model ingests your entire prompt, processes it, and computes the key-value (KV) cache for the attention mechanism. This phase is extremely compute-intensive. It needs raw FLOPS. It wants to saturate the tensor cores on your TPUs or GPUs.
The metric that captures the prefill phase is Time To First Token (TTFT). This is the time from the exact moment the user hits “submit” to the moment the very first word appears on their screen.
Once the prefill is complete, the engine shifts into the Decode Phase. This is the autoregressive loop. The model generates one token, appends it to the context, and runs the whole thing again to generate the next token. Unlike prefill, decode is incredibly memory-bandwidth bound. The compute units are mostly sitting idle, waiting for the massive weights of the model and the ever-growing KV cache to be loaded from HBM (High Bandwidth Memory) into the processor’s registers for every single token.
The metric that captures the decode phase is Inter-Token Latency (ITL). This is the time it takes to generate each subsequent token after the first one.
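Both metrics fall straight out of the token timestamps of a streamed response. Here is a minimal sketch: `fake_stream` is a stand-in for the SSE/gRPC stream from your inference endpoint, with sleeps simulating the prefill and decode phases.

```python
import time

def measure_stream(stream, t_submit):
    """Compute TTFT and mean ITL from a token stream and the submit time."""
    token_times = [time.monotonic() for _ in stream]
    ttft = token_times[0] - t_submit                              # queueing + prefill
    itls = [b - a for a, b in zip(token_times, token_times[1:])]  # decode steps
    avg_itl = sum(itls) / len(itls) if itls else 0.0
    return ttft, avg_itl

def fake_stream(n=5, prefill_s=0.05, decode_s=0.01):
    """Stub token stream: one long pause up front, then steady decoding."""
    time.sleep(prefill_s)            # simulated prefill phase
    for i in range(n):
        if i:
            time.sleep(decode_s)     # simulated per-token decode
        yield f"tok{i}"

t_submit = time.monotonic()
ttft, avg_itl = measure_stream(fake_stream(), t_submit)
```

In a real client you would record `time.monotonic()` as each chunk arrives rather than materializing the stream, but the two numbers you log are the same.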
Why the Distinction Matters
You cannot optimize a system if you don’t understand where the bottleneck is. If your overall response time is slow, you need to know if the model is struggling to digest the prompt (high TTFT) or if it’s struggling to spit out the answer (high ITL).
Let’s look at how these metrics impact the user experience, and more importantly, how you fix them.
The TTFT Problem: The Blank Screen of Death
A high TTFT is usually the most jarring experience for a user. They ask a question, and nothing happens. It feels broken.
If your TTFT is spiking, you are likely dealing with one of two issues:
- Queueing Delay: Your inference server is overloaded. Requests are sitting in a queue waiting for a free worker. If you are running your own GKE cluster for inference, this means you don’t have enough replicas, or your load balancing is inefficient.
- Massive Contexts: You are shoving a 500k token document into the prompt. The prefill phase simply takes a long time to compute the KV cache for that much text.
How to optimize TTFT:
- Scale out: Add more compute. Spin up more TPU nodes in your Vertex AI endpoint to handle the concurrency.
- Semantic Caching: If users are asking similar questions or processing the same documents, cache the KV cache. Don’t recompute the prefill if you don’t have to.
- Chunked Prefill: We’ll dive deeper into this in a later post, but essentially you break the massive prefill calculation into smaller chunks so it doesn’t block other requests in a continuous batching system.
The ITL Problem: The Sluggish Typist
A high ITL feels different. The first word appears quickly, but then the model “types” out the rest of the answer agonizingly slowly. It feels like watching someone hunt-and-peck on a keyboard.
If your ITL is high, you are hitting the memory wall. The hardware cannot move data from HBM to the compute units fast enough.
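You can put a hard floor under ITL with back-of-envelope arithmetic: if every generated token must stream all model weights from HBM, per-token time cannot beat weight bytes divided by memory bandwidth. The numbers below are illustrative (a 70B-parameter model, ~3.35 TB/s of HBM bandwidth), not a benchmark, and they ignore the KV cache, which only makes the real floor higher.

```python
def itl_floor_ms(params_billion, bytes_per_param, hbm_tb_per_s):
    """Lower bound on per-token decode latency for a bandwidth-bound model."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    bandwidth = hbm_tb_per_s * 1e12           # bytes per second
    return weight_bytes / bandwidth * 1e3     # seconds -> milliseconds

fp16_floor = itl_floor_ms(70, 2, 3.35)  # ~41.8 ms/token at FP16
int8_floor = itl_floor_ms(70, 1, 3.35)  # ~20.9 ms/token: halving bytes halves the floor
```

This is also the cleanest way to see why quantization and tensor parallelism work: the first shrinks the numerator, the second grows the denominator.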
How to optimize ITL:
- Model Quantization: Shrink the weights. Moving from FP16 to INT8 or FP8 drastically reduces the amount of data that needs to be shuttled around memory, directly improving ITL.
- Tensor Parallelism: Split the model across multiple chips. This increases the total aggregate memory bandwidth available for the decode phase.
- Speculative Decoding: Use a smaller, faster “draft” model to guess the next few tokens, and use the large model to verify them in parallel. This can significantly reduce the effective ITL.
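To make the speculative decoding bullet concrete, here is a toy greedy version. Both “models” are deterministic stub functions, not real LLMs: the draft proposes `k` tokens, the target verifies them, and the accepted prefix plus one corrected token advance the sequence. The key property, which the sketch preserves, is that the output is token-for-token identical to what the target model would produce alone; in a real engine the verification is a single batched forward pass, which is where the speedup comes from.

```python
def target_next(ctx):
    """Stub for the large, slow model's greedy next token."""
    return (sum(ctx) * 31 + len(ctx)) % 100

def draft_next(ctx):
    """Stub for the small, fast model: agrees except every 4th position."""
    t = target_next(ctx)
    return t if len(ctx) % 4 else (t + 1) % 100

def speculative_generate(prompt, n_tokens, k=4):
    ctx, out = list(prompt), []
    while len(out) < n_tokens:
        # Draft proposes k tokens autoregressively (cheap).
        proposal, d_ctx = [], list(ctx)
        for _ in range(k):
            t = draft_next(d_ctx)
            proposal.append(t)
            d_ctx.append(t)
        # Target verifies; really one batched pass over all k positions.
        for t in proposal:
            expected = target_next(ctx)
            if t == expected and len(out) < n_tokens:
                out.append(t)        # draft token accepted
                ctx.append(t)
            else:
                break
        else:
            continue                 # whole proposal accepted, draft again
        if len(out) < n_tokens:
            out.append(expected)     # first mismatch: take the target's token
            ctx.append(expected)
    return out
```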
The Trade-off Matrix
Here is the uncomfortable truth: optimizing for TTFT often degrades ITL, and vice versa.
In a continuous batching system (like vLLM or standard Vertex AI endpoints), the engine constantly tries to maximize throughput by grouping requests together.
If the engine prioritizes new requests (optimizing TTFT), it forces the decode phase of existing requests to wait, spiking their ITL. If the engine prioritizes finishing current requests (optimizing ITL), new requests sit in the queue longer, spiking their TTFT.
This is where you, as the AI engineer, have to make architectural decisions based on the product requirements.
If you are building a real-time voice agent, TTFT is everything. A one-second delay before the agent starts speaking destroys the conversational flow. You will gladly sacrifice throughput and accept a slightly slower ITL to guarantee a sub-300ms TTFT.
If you are building an offline document summarizer, TTFT doesn’t matter at all. The user isn’t staring at the screen. You want to maximize throughput. You will configure your inference engine to aggressively batch requests, allowing TTFT to drift into the tens of seconds in exchange for churning through documents as efficiently as possible.
Instrumenting the Reality
Stop looking at the average. Stop looking at the P99 of total request time.
You need to instrument your application to log TTFT and ITL for every single interaction. When you look at a latency dashboard, you should see two distinct distributions.
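A minimal version of that dashboard logic, using hypothetical sample values: report the two metrics as separate percentile distributions, never as one blended number. Note how the mean TTFT looks tolerable while the tail is catastrophic, which is exactly the failure mode averages hide.

```python
def pct(values, p):
    """Simple nearest-rank percentile (no interpolation)."""
    s = sorted(values)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]

# Hypothetical per-request samples, in milliseconds, from your logging pipeline.
ttft_ms = [180, 210, 250, 260, 300, 340, 900, 4200]
itl_ms = [18, 19, 20, 21, 22, 24, 30, 35]

mean_ttft = sum(ttft_ms) / len(ttft_ms)  # 830.0 ms: the "fine-looking" average
report = {
    "ttft_p50": pct(ttft_ms, 50),  # 300 ms: typical prefill experience
    "ttft_p99": pct(ttft_ms, 99),  # 4200 ms: the blank-screen tail
    "itl_p50": pct(itl_ms, 50),
    "itl_p99": pct(itl_ms, 99),
}
```

In production you would use a proper histogram or your metrics backend’s percentile aggregation rather than sorting raw samples, but the shape of the report is the point: two distributions, two tails.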
When you make a change—when you switch from Gemini 1.5 to Gemini 2.5, when you change your quantization strategy, when you alter your prompt length—you need to see exactly how it impacts the prefill phase versus the decode phase.
Only then are you actually engineering the system. Everything else is just guessing.



