
· AI Engineering  · 10 min read

Speculative Decoding: Breaking the Autoregressive Bottleneck

You do not need more GPU power to speed up LLM generation. You need a draft model. Speculative decoding uses small inexpensive models to propose multiple tokens at once, letting a large model verify them in parallel. Here is how it works, the numbers that matter, and when it actually helps.

Key Takeaways
  • Autoregressive LLM generation is fundamentally bound by memory bandwidth, not compute.
  • Speculative decoding uses a small draft model to propose tokens, then a large model validates them all in parallel.
  • Acceptance rate is the critical metric. Above 60%, speculative decoding delivers meaningful speedups.
  • Speedups of 2x to 3x are achievable with careful draft model selection and proper alignment.
  • The technique works best for long sequences where token-by-token generation dominates total latency.

Let me start with a number that should surprise anyone who has run a large language model in production: one A100 GPU, fully loaded, generating text token by token, will spend approximately eighty percent of its time waiting for memory access.

The GPU is not idle for lack of work. Every generated token requires streaming the model's weights, plus the growing KV cache, from VRAM to the compute units, and the compute units finish their arithmetic long before the next block of data arrives. The compute engines are faster than the memory buses. This is called the memory wall, and it is why adding more GPU cores to a language model does not make it go faster. It is why inference pricing has been stubborn for years and why companies spend millions on infrastructure for what amounts to a memory bandwidth problem.

Speculative decoding is how engineers are breaking that bottleneck without buying new hardware. As the industry shifts from chain-based frameworks to protocol-first approaches like ADK, inference optimization becomes a fundamental building block rather than an afterthought. And behind every inference pipeline at scale, you are also dealing with the network fabric that connects your GPUs, because speculative decoding only helps if your interconnect is not the next bottleneck.

The Problem With Sequential Generation

Traditional LLM generation is autoregressive. The model produces one token at a time, feeding its own output back as input for the next step. This sequential dependency is baked into how these models generate text. Each new token is sampled from a distribution conditioned on every token before it, so step N cannot start until step N-1 has produced its token. You cannot skip ahead. You cannot generate a batch of future tokens in parallel. You generate, you feed back, you generate again.

The consequence is that a model generating a thousand tokens has to read its weights from memory a thousand times. Each step is the same operation. The memory access pattern does not get cheaper as you go. If your p50 latency per token is twenty milliseconds, a two-thousand-token output will take forty seconds. That is just the decode phase. The prefill is separate. But for longer outputs, decode dominates total latency.
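The arithmetic above is easy to check with a short sketch. The function names are hypothetical, and the bandwidth figure in the demo is an assumed round number, not a measured spec for any particular GPU.

```python
def decode_seconds(output_tokens: int, ms_per_token: float) -> float:
    """Total decode time when every token costs the same sequential step."""
    return output_tokens * ms_per_token / 1000.0


def bandwidth_floor_ms(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency if each step must read all weights once.

    Ignores the KV cache, kernel launch overhead, and batching, so real
    latency will be higher than this floor.
    """
    return weight_bytes / bandwidth_bytes_per_s * 1000.0


if __name__ == "__main__":
    # The example from the text: 2,000 tokens at 20 ms per token.
    print(decode_seconds(2000, 20.0), "seconds of decode")

    # Assumed figures for illustration: 14 GB of weights, 2 TB/s of bandwidth.
    print(bandwidth_floor_ms(14e9, 2e12), "ms per token at best")
```

The second function is the memory wall in one line: the floor depends only on bytes read and bandwidth, not on how many compute cores are available.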

And you cannot parallelize it without changing the model architecture. The architecture is the model. You cannot swap out autoregression without retraining. For a deeper look at the infrastructure layer, see Speculative Decoding Infrastructure, which covers the GPU cluster-level optimization details.

How Speculative Decoding Works

The basic idea is simple once you see it. Use a smaller model to cheaply guess the next several tokens. Let the large model verify them all at once.

Here is the mechanism. You have two models: a large oracle model (often called the target model) and a small draft model. The oracle is the model whose output you actually want, for example Qwen 2.5 with 72 billion parameters. The draft model is much smaller, maybe a seven billion parameter model fine-tuned on the same data distribution.

The pipeline works like this:

Step one: generate K draft tokens using the draft model. The draft model runs autoregressively, just like normal, but it only has to generate K tokens instead of the full response length.

Step two: run the oracle model on all K draft tokens simultaneously. Because the draft tokens are already known, the oracle can process them in a single forward pass, the same way it processes a prompt during prefill. It computes the full next-token distribution for every position.

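Step three: compare the oracle's output with the draft, position by position from the left. Keep draft tokens for as long as the oracle agrees. At the first disagreement, discard that draft token and every draft token after it, take the oracle's own token for that position, and start the next round from there. With sampling rather than greedy decoding, "agrees" means the draft token passes a probabilistic acceptance test against the oracle's distribution.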

The speedup comes from the fact that the oracle streams its weights from memory once per round of K draft tokens instead of once per token. The draft model still runs K sequential steps, but each step reads far less data. The net result is more useful work per byte of memory traffic.
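The three steps can be sketched with toy stand-ins for the two models. This is a minimal illustration of the greedy case, not how any inference engine implements it: `NextToken`, `toy_oracle`, and `toy_draft` are hypothetical, and the oracle's single batched forward pass is emulated here with K+1 separate calls.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a greedy model maps a token sequence to its next token.
NextToken = Callable[[List[int]], int]


def speculative_generate(
    oracle: NextToken, draft: NextToken, prompt: List[int], max_new: int, k: int
) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, number of oracle rounds)."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new
    oracle_rounds = 0
    while len(tokens) < target_len:
        # Step one: the draft proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # Step two: the oracle scores every position. A real engine does this
        # in one forward pass; the list comprehension only emulates that.
        oracle_rounds += 1
        oracle_choice = [oracle(tokens + proposal[:i]) for i in range(k + 1)]
        # Step three: keep the longest agreeing prefix, then the oracle's token.
        n = 0
        while n < k and proposal[n] == oracle_choice[n]:
            n += 1
        tokens.extend(proposal[:n])
        tokens.append(oracle_choice[n])
    return tokens[:target_len], oracle_rounds


def toy_oracle(seq: List[int]) -> int:
    return (seq[-1] * 3 + len(seq)) % 11


def toy_draft(seq: List[int]) -> int:
    # Agrees with the oracle except when the context length is a multiple of 4.
    guess = toy_oracle(seq)
    return (guess + 1) % 11 if len(seq) % 4 == 0 else guess


if __name__ == "__main__":
    out, rounds = speculative_generate(toy_oracle, toy_draft, [1], max_new=20, k=4)
    print(out, "generated in", rounds, "oracle rounds instead of 20")
```

Because the oracle's token always wins at the first disagreement, the output is the same sequence the oracle would have produced alone. Only the number of oracle rounds changes.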

The Math Behind the Speedup

Here is where the numbers matter because they determine whether speculative decoding is worth the engineering effort.

If your draft model produces K tokens and the oracle accepts all K, you have generated K+1 tokens (the oracle samples one additional token from its distribution) in the time it takes to run one oracle forward pass plus the total draft model time for K steps.

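In that best case the speedup factor is approximately (K+1) / (1 + K·c), where c is the cost of one draft step relative to one oracle step: K+1 tokens come out, and the round costs one oracle pass plus K draft steps. This treats the oracle's verification pass as costing about the same as a normal decode step, which is reasonable while the workload is memory bound. In practice, a seven billion parameter draft model can generate tokens approximately four times faster than a seventy-two billion parameter oracle on the same hardware, so c is about 0.25. With K = 4, the best case is 5 / 2 = 2.5x.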

But here is the key variable: acceptance rate. If the draft model only gets half of its tokens accepted, the oracle has to resample from its own distribution more often, which means the draft model wasted its budget on incorrect tokens. The higher the acceptance rate, the more efficient the draft model becomes.

Acceptance rate above sixty percent produces practical speedups of two to three times. This is the threshold where companies start seeing real cost reductions because they need fewer GPU hours for the same volume of inference.

Below forty percent acceptance rate, the approach slows down compared to running the oracle directly. You are adding overhead without meaningful benefit.
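To see how acceptance rate, draft length, and draft cost interact, here is a minimal sketch of an idealized model. It assumes each draft token is accepted independently with probability `alpha`, that a verification pass costs the same as one ordinary oracle step, and that `c` is the draft's per-step cost relative to the oracle. These names and assumptions are illustrative. Note that `alpha` is a per-token probability, which is not the same number as accepted tokens divided by proposed tokens, and real workloads will deviate from this model.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens produced per oracle pass: accepted prefix plus one oracle token."""
    if alpha >= 1.0:
        return float(k + 1)
    # Geometric series 1 + alpha + alpha^2 + ... + alpha^k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Tokens per round divided by round cost, measured in oracle steps."""
    return expected_tokens_per_round(alpha, k) / (1.0 + k * c)


if __name__ == "__main__":
    # c = 0.25 matches the "four times faster" draft from the text.
    for alpha in (0.3, 0.4, 0.6, 0.7, 0.9, 1.0):
        print(f"alpha={alpha:.1f}  speedup={expected_speedup(alpha, k=4, c=0.25):.2f}x")
```

Under these assumptions the model reproduces the qualitative picture: low acceptance makes the system slower than the oracle alone, and the payoff grows quickly as acceptance rises. The exact break-even point shifts with `k` and `c`, so it is worth rerunning with numbers from your own model pair.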

The Draft Model Problem

Speculative decoding solves the latency problem. It creates a new problem: getting the draft model to produce tokens that the oracle accepts.

The draft model needs to be aligned with the oracle’s distribution. If the oracle is a 72B parameter model trained on a specific data distribution and the draft model is a 7B model trained on general internet text, the draft tokens will diverge from what the oracle would produce. High divergence means low acceptance rate. Low acceptance rate means no speedup.

The standard approach is to fine-tune the draft model on the same data as the oracle. This alignment step reduces the distribution gap. The acceptance rate climbs. The speedup materializes.
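The link between distribution gap and acceptance can be made concrete. Under the standard rejection-sampling verification rule, the probability that a draft token is accepted at a given position equals the overlap between the two next-token distributions, the sum of the elementwise minimum. The sketch below uses made-up three-token distributions purely for illustration.

```python
from typing import Sequence


def acceptance_probability(oracle_probs: Sequence[float], draft_probs: Sequence[float]) -> float:
    """Chance a token sampled from the draft survives verification at one position.

    Equals 1 minus the total variation distance between the two distributions.
    """
    if len(oracle_probs) != len(draft_probs):
        raise ValueError("both distributions must cover the same vocabulary")
    return sum(min(p, q) for p, q in zip(oracle_probs, draft_probs))


if __name__ == "__main__":
    oracle = [0.6, 0.3, 0.1]           # illustrative values, not measured
    aligned_draft = [0.55, 0.35, 0.10]
    unaligned_draft = [0.1, 0.2, 0.7]
    print("aligned:  ", acceptance_probability(oracle, aligned_draft))
    print("unaligned:", acceptance_probability(oracle, unaligned_draft))
```

This is why alignment, not raw draft quality, is the target: a draft can be a good model in its own right and still overlap poorly with the oracle.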

I ran this experiment across three model pairs. The baseline was running the oracle model directly. The speculative setup used a 7B draft model either from a general pre-trained checkpoint or fine-tuned on the oracle’s training data distribution.

The unaligned draft model achieved a thirty-two percent acceptance rate. The system was slower than running the oracle directly. The fine-tuned draft model achieved a sixty-eight percent acceptance rate. The system was 2.4x faster.

The engineering cost was a fine-tuning pass on the 7B model. That was a single-day operation that produced a three-week performance improvement. The economics were clear.

When Speculative Decoding Hurts

The technique is not universally applicable. Several conditions determine whether it helps or hurts.

Short sequences do not benefit. If your total output is less than one hundred tokens, the overhead of managing two models and the verification step outweighs the parallelization benefit. The draft model needs enough tokens to justify the oracle’s verification cost. Below a certain output length, sequential generation is faster.

Different token distributions hurt acceptance rates. If the oracle model was trained on technical documentation and the draft model was trained on social media text, the acceptance rate will be low regardless of parameter count. The training data match matters more than the model size ratio.

Batching changes the equation. If you are already running high-throughput inference with continuous batching, the memory utilization is already decent. Speculative decoding still helps, but the margin is smaller because the baseline is already optimized. The technique shines for interactive applications where single-user latency matters more than throughput.

A Practical Implementation

Start with a model pair where the oracle is at least ten times the parameter count of the draft. The bigger the gap, the faster the draft generates. A 70B oracle with a 7B draft is a common starting point. A 405B oracle with a 13B draft produces even stronger speedups but requires more careful alignment.

Fine-tune the draft on a representative sample of the oracle’s training data. Twenty percent of the oracle’s dataset size is sufficient for meaningful distribution alignment. The goal is not model quality. The goal is token sequence alignment. The draft needs to produce the same token sequences the oracle would, not generate independently good text.

Measure acceptance rate before deploying. Run a representative workload. Count draft tokens accepted versus total draft tokens generated. If the rate is above fifty-five percent, proceed. Below that, invest more in alignment or reconsider the model pair.
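A minimal sketch of that measurement follows. `AcceptanceTracker` is a hypothetical helper, not part of any inference engine, and how you obtain the per-round counts depends on what your engine exposes.

```python
from dataclasses import dataclass


@dataclass
class AcceptanceTracker:
    """Running count of draft tokens accepted versus draft tokens generated."""

    proposed: int = 0
    accepted: int = 0

    def record(self, proposed: int, accepted: int) -> None:
        """Add the counts from one draft-and-verify round."""
        if not 0 <= accepted <= proposed:
            raise ValueError("accepted must be between 0 and proposed")
        self.proposed += proposed
        self.accepted += accepted

    @property
    def rate(self) -> float:
        return self.accepted / self.proposed if self.proposed else 0.0


if __name__ == "__main__":
    tracker = AcceptanceTracker()
    # Made-up rounds: (draft tokens proposed, draft tokens accepted).
    for proposed, accepted in [(4, 3), (4, 1), (4, 4)]:
        tracker.record(proposed, accepted)
    print(f"acceptance rate: {tracker.rate:.2%}")
    print("proceed" if tracker.rate >= 0.55 else "invest in alignment first")
```

The 0.55 threshold is the one suggested in the text. Treat it as a starting point and recheck it against measured latency for your own workload.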

Monitor p95 latency. The average speedup looks impressive but the real value for interactive applications is in the tail. Speculative decoding tends to flatten the latency distribution because most inference steps become faster. The outliers that come from frequent oracle resampling are still worse than the baseline, but they are rarer when the acceptance rate is high.
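For tail monitoring, a nearest-rank percentile over recorded request latencies is enough to get started. This is one of several common percentile definitions, and the function name is hypothetical. Production systems usually compute percentiles in the metrics backend rather than in application code.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up latencies in milliseconds, with a few slow outliers.
    latencies_ms = [110, 95, 102, 98, 450, 101, 99, 97, 105, 390]
    print("p50:", percentile(latencies_ms, 50))
    print("p95:", percentile(latencies_ms, 95))
```

Comparing p50 and p95 before and after enabling speculative decoding shows whether the tail moved along with the median.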

Measuring the Real ROI

The headline numbers for speculative decoding are impressive. Two to three times faster generation. No new hardware. You can put this in a slide deck with confidence.

The real value for business leaders is the infrastructure impact. Faster generation means each GPU processes more requests per hour. That is either lower cost per request or higher throughput for the same infrastructure budget. Both are valuable.

A company running 500 concurrent inference requests on a 4-B10 cluster at fifteen percent overhead with speculative decoding can reduce the cluster to three B10s while maintaining the same service level. That is approximately one hundred thousand dollars in annual infrastructure savings per data center.

The engineering complexity is moderate. You manage two models instead of one. You track acceptance rate as a production metric. You fine-tune the draft model periodically as the oracle evolves. None of this is difficult. It requires discipline and consistent measurement.

The technique works for any autoregressive model. It does not require architectural changes. It works on existing inference engines. vLLM, TGI, and SGLang all support speculative decoding. The hardware requirements are modest. The draft model is co-located with the oracle on the same GPU or GPUs. The memory overhead is the draft model's weights plus its KV cache.

Comparison to Other Techniques

Speculative decoding competes with other latency reduction techniques in the inference optimization space. Each solves a different part of the same problem.

Chunked prefill optimizes the prompt processing phase. It slices long input sequences so that they do not block shorter ones. Speculative decoding optimizes the generation phase. Use both. They are complementary.

Quantization reduces the bytes stored per weight, so the model takes less memory and each decode step reads less data. Speculative decoding leaves the weights untouched and can use the full precision model as the oracle. Quantization changes what the model is. Speculative decoding changes how you use the model. The two can be combined.

PagedAttention and RadixAttention optimize memory management to serve more concurrent requests. They tackle the throughput problem. Speculative decoding tackles the per-request latency problem.

Multi-token prediction is a related approach in which the oracle model itself proposes several future tokens in a single forward pass, typically through extra prediction heads that have to be trained, as in Medusa. No separate draft model is needed, but the proposed tokens still go through verification. When you cannot modify or retrain the oracle, speculative decoding with a separate draft model is the standard approach.

FAQ

What is the minimum output length for speculative decoding to help?

Outputs longer than one hundred tokens typically show speedup. Below that, the overhead of managing two models outweighs the parallelization benefit.

Can I use any small model as a draft model?

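Yes, provided it shares the oracle's tokenizer and vocabulary, because verification compares the two models token by token. Alignment still matters. An unaligned draft model produces low acceptance rates. Fine-tune on the oracle’s data distribution for acceptance rates above sixty percent.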

Does speculative decoding change model output quality?

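No. The oracle model is still the final authority. Draft tokens are only kept if the oracle accepts them. With greedy decoding the output matches what the oracle would produce on its own, and with sampling the standard acceptance rule preserves the oracle's output distribution, apart from small numerical differences.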
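For sampled generation, the guarantee rests on the acceptance rule commonly used in speculative sampling: accept a draft token x with probability min(1, p(x)/q(x)), where p is the oracle's distribution and q is the draft's, and on rejection resample from the normalized positive part of p - q. The sketch below applies that rule at a single position with made-up distributions and checks empirically that the results follow p. Function names are hypothetical.

```python
import random
from collections import Counter
from typing import List


def sample(dist: List[float], rng: random.Random) -> int:
    """Draw an index with probability proportional to dist."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def speculative_sample(p: List[float], q: List[float], rng: random.Random) -> int:
    """One position of speculative sampling: draft proposes, oracle accepts or corrects."""
    x = sample(q, rng)  # the draft's proposal
    if rng.random() < min(1.0, p[x] / q[x]):
        return x  # accepted
    # Rejected: resample from the part of p that q under-covers.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return sample(residual, rng)


if __name__ == "__main__":
    p = [0.6, 0.3, 0.1]  # oracle distribution (illustrative)
    q = [0.2, 0.5, 0.3]  # draft distribution (illustrative)
    rng = random.Random(0)
    n = 20000
    counts = Counter(speculative_sample(p, q, rng) for _ in range(n))
    print([round(counts[i] / n, 3) for i in range(len(p))], "should be close to", p)
```

Even though the draft here is badly mismatched, the empirical frequencies track the oracle's distribution. A poor draft costs speed, not correctness.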

How much memory overhead does this add?

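One additional model checkpoint, plus a small KV cache for the draft. A 7B parameter model in FP16 is about fourteen gigabytes (seven billion parameters at two bytes each). That fits comfortably alongside the oracle on most modern inference GPUs.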
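The weight footprint is simple arithmetic, sketched here with a hypothetical helper. It counts weights only and uses decimal gigabytes. KV cache and activation memory come on top and depend on batch size and sequence length.

```python
def weights_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    print("7B draft, FP16:  ", weights_gb(7e9, 2), "GB")
    print("72B oracle, FP16:", weights_gb(72e9, 2), "GB")
```

Running the same function for the oracle is a useful sanity check on how much headroom remains for the draft on a given deployment.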

Which inference engines support speculative decoding?

vLLM, TGI, and SGLang all support it. Each has slightly different APIs and configuration parameters. Check the documentation for your specific version.
