Instrument · Inference

Speculative decoding speedup

A cheap draft model proposes tokens; the target verifies them in one pass. The speedup depends on the acceptance rate, the draft length, and the draft-to-target cost ratio, and it peaks at an optimal draft length.

Sources: Making LLMs faster · Code

Bet 02 / Speculative decoding

How much speedup does speculation actually buy?

A cheap draft proposes K tokens; the target verifies them in one pass and accepts a prefix. Each accepted token is one you did not pay full price for, but a longer draft costs more per step, so there is an optimal K. Counterintuitively, sampling is slower than greedy even at a higher acceptance rate, because rejection and resampling add overhead that the acceptance arithmetic alone does not capture.

[Interactive readout: speedup (this mode), tokens per iteration, optimal draft length K*. Chart legend: greedy, sampling, optimal K*, break-even.]

How this is computed

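From the derivation, with per-token acceptance rate α, draft length K, and cost ratio c (one draft step relative to one target pass), and assuming each draft token is accepted independently with probability α: expected tokens per iteration = 1 + α + … + α^K = (1 − α^(K+1)) / (1 − α); speedup = tokens / (1 + K·c). K = 0 means no speculation and gives exactly 1x, the break-even line.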
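
The two formulas above translate into a few lines of Python. This is a minimal sketch, not the widget's actual code: the function names, the brute-force search, the `k_max` cap, and the example values α = 0.8 and c = 0.1 are all assumptions chosen for illustration.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens per iteration: (1 - alpha^(k+1)) / (1 - alpha)."""
    if alpha >= 1.0:
        # Limit case: every draft token is accepted, so the sum has k + 1 terms.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha: float, k: int, c: float) -> float:
    """Tokens per iteration divided by iteration cost (1 target pass + k draft steps)."""
    return expected_tokens(alpha, k) / (1.0 + k * c)


def optimal_k(alpha: float, c: float, k_max: int = 16) -> int:
    """Brute-force the draft length in [0, k_max] with the highest speedup."""
    return max(range(k_max + 1), key=lambda k: speedup(alpha, k, c))


# Hypothetical inputs, not taken from the post.
alpha, c = 0.8, 0.1
k_star = optimal_k(alpha, c)
print(f"K* = {k_star}, speedup = {speedup(alpha, k_star, c):.2f}x")
```

The search starts at K = 0 so that "do not speculate" is a valid answer: when α is low or c is high, every K ≥ 1 falls below the 1x break-even and `optimal_k` returns 0.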

  • Greedy accepts a draft token when it matches the target's argmax, with no per-token draw overhead.
  • Sampling also needs the random draws to align, and on rejection it resamples from the residual distribution, so each draft token carries extra cost, modeled here as a higher effective c. This is why greedy at 62.5% acceptance beat sampling at 64.3% (1.10x vs 0.85x) in the post.
  • EAGLE / Medusa raise acceptance and cut draft cost (feature-level drafting, multiple heads), which is how the full technique set reaches 2-3x.
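
To see how a higher effective c can outweigh a higher acceptance rate, the same formula can be evaluated for both modes. This is a rough sketch under stated assumptions: the acceptance rates come from the post, but K = 4 and the two c values are hypothetical, picked here only so the results land near the post's 1.10x and 0.85x. The post's actual K and c are not given in this section.

```python
def speedup(alpha: float, k: int, c: float) -> float:
    """speedup = expected tokens per iteration / (1 + k * c)."""
    tokens = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
    return tokens / (1.0 + k * c)


K = 4              # hypothetical draft length
C_GREEDY = 0.30    # hypothetical cost ratio for greedy verification
C_SAMPLING = 0.48  # hypothetical higher effective c (draws + residual resampling)

greedy = speedup(0.625, K, C_GREEDY)      # 62.5% acceptance, from the post
sampling = speedup(0.643, K, C_SAMPLING)  # 64.3% acceptance, from the post

print(f"greedy {greedy:.2f}x, sampling {sampling:.2f}x")
```

With these inputs sampling accepts slightly more tokens per iteration yet ends up below break-even, because the denominator 1 + K·c grows faster than the numerator.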

Subhadip Mitra