

Speculative Decoding: Breaking the Autoregressive Bottleneck
You do not need more GPU power to speed up LLM generation. You need a draft model. Speculative decoding uses a small, inexpensive model to propose several tokens ahead, then lets the large model verify them all in a single parallel forward pass. Here is how it works, the numbers that matter, and when it actually helps.