Speculative decoding speedup
A cheap draft model proposes tokens; the target verifies them in one pass. How much speedup that buys depends on the acceptance rate, the draft length, and the cost ratio, and there is an optimal draft length.
Sources: Making LLMs faster · Code
How much speedup does speculation actually buy?
A cheap draft proposes K tokens; the target verifies them in one pass and accepts a prefix. Each accepted token is one you did not pay full price for, but a longer draft costs more per step, so there is an optimal K. And counterintuitively, sampling is slower than greedy even at higher acceptance, because rejection and resampling add overhead the arithmetic does not see.
How this is computed
From the derivation, with per-token acceptance rate α, draft length K, and draft-to-target cost ratio c: expected tokens per iteration = (1 − α^(K+1)) / (1 − α); speedup = tokens / (1 + K·c).
- Greedy accepts when the argmax agrees, no per-token draw overhead.
- Sampling also needs the random draws to align and resamples from the residual distribution on rejection, so each draft token carries extra cost, modeled here as a higher effective c. This is why greedy at 62.5% acceptance beat sampling at 64.3% (1.10x vs 0.85x) in the post.
- EAGLE / Medusa raise acceptance and cut draft cost (feature-level drafting, multiple heads), which is how the full technique set reaches 2-3x.
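
The two formulas above can be sketched in a few lines of Python to see where the optimal K comes from and how a higher effective c drags the speedup below 1x. This is a minimal sketch, not the code behind the post: the function names (`expected_tokens`, `speedup`, `best_draft_length`), the `max_k` search bound, and every numeric input below are illustrative assumptions, not measured values.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens per iteration: (1 - alpha^(K+1)) / (1 - alpha)."""
    if alpha >= 1.0:
        # Limit as alpha -> 1: all K draft tokens accepted plus the target's own token.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha: float, k: int, c: float) -> float:
    """Speedup over plain decoding: tokens / (1 + K*c).

    c is the cost of one draft token relative to one target pass.
    """
    return expected_tokens(alpha, k) / (1.0 + k * c)


def best_draft_length(alpha: float, c: float, max_k: int = 16) -> int:
    """Brute-force search for the K in [0, max_k] with the highest speedup.

    max_k is an arbitrary search bound chosen for this sketch.
    """
    return max(range(max_k + 1), key=lambda k: speedup(alpha, k, c))


if __name__ == "__main__":
    # Illustrative inputs only (assumed, not taken from the post).
    alpha, c = 0.8, 0.1
    k_star = best_draft_length(alpha, c)
    print(f"alpha={alpha}, c={c}: best K={k_star}, speedup={speedup(alpha, k_star, c):.2f}x")

    # Same acceptance rate and K, but a higher effective c (as when modeling sampling):
    for cost in (0.1, 0.5):
        print(f"alpha=0.64, K=4, c={cost}: speedup={speedup(0.64, 4, cost):.2f}x")
```

With K = 0 the formula returns exactly 1x, the no-speculation baseline. As K grows, the numerator saturates at 1 / (1 − α) while the denominator keeps growing linearly, so the speedup peaks and then falls. Raising c moves the whole curve down, which is how a configuration with slightly higher acceptance can still end up slower than plain decoding.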