Speculative decoding speedup

Bet 02 / Speculative decoding

How much speedup does speculation actually buy?

A cheap draft model proposes K tokens; the target model verifies them in one pass and accepts a prefix. Each accepted token is one you did not pay full price for, but a longer draft costs more per step, so there is an optimal K. Counterintuitively, sampling can run slower than greedy even at higher acceptance, because rejection and resampling add overhead that the speedup arithmetic does not capture.
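The propose-then-verify loop can be sketched for the greedy case as follows. This is a minimal illustration, not the implementation behind the widget: `draft_next` and `target_next` are hypothetical callables that return the greedy next token for a given sequence, and the target is called once per position for readability, whereas a real system scores all positions in a single batched forward pass.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: maps a token sequence to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Run one greedy speculation iteration and return the tokens to append."""
    # 1. The draft proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2. The target checks each proposed position in order.
    #    (Assumption: shown as a loop; in practice this is one forward pass.)
    accepted: List[int] = []
    for tok in proposal:
        target_tok = target_next(prefix + accepted)
        if target_tok != tok:
            # First disagreement: keep the target's own token and stop.
            accepted.append(target_tok)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the same pass yields one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted
```

Each iteration therefore emits between 1 and K + 1 tokens, and the output is the same sequence the target would have produced on its own with greedy decoding.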

Preset: jump to a known configuration (the theory example, measured GPT-2 greedy/sampling runs, or EAGLE and Medusa methods).

Decoding mode: Greedy accepts when the argmax agrees. Sampling also needs random draws to align and must resample on rejection, adding per-token overhead.

Acceptance rate α: How often the target accepts a draft token. It equals one minus the total-variation distance between the draft and target distributions.
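The link between acceptance and total-variation distance can be checked numerically. The sketch below assumes the standard speculative sampling rule (accept a draft token with probability min(1, p/q)) and looks at a single context; the α on the slider would be this quantity averaged over contexts. The two distributions are made-up illustration values.

```python
from typing import Sequence


def acceptance_probability(p: Sequence[float], q: Sequence[float]) -> float:
    """Chance that a token drawn from draft q is accepted against target p."""
    # Summing q_i * min(1, p_i / q_i) over tokens simplifies to sum(min(p_i, q_i)).
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total-variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Illustrative next-token distributions over a 3-token vocabulary.
target_dist = [0.5, 0.3, 0.2]
draft_dist = [0.7, 0.2, 0.1]

alpha = acceptance_probability(target_dist, draft_dist)
tv = total_variation(target_dist, draft_dist)
print(f"acceptance = {alpha:.3f}, 1 - TV = {1 - tv:.3f}")
```

The two printed values agree, which is the identity stated above: the closer the draft distribution is to the target, the higher the acceptance.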

Draft length K: How many tokens the draft proposes before the target checks them all in a single verification pass.

Draft / target cost c: The draft's cost per token as a fraction of the target's. A cheaper draft makes longer proposals worthwhile.

Outputs: verdict, speedup in the selected mode (×), tokens per iteration, and optimal draft length K*.

Chart legend: greedy, sampling, optimal K*, break-even.

How this is computed

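From the derivation: expected tokens per iteration = (1 − α^(K+1)) / (1 − α), and speedup = tokens per iteration / (1 + K·c). The denominator is the cost of one iteration in units of a target pass: one verification pass plus K draft steps at cost c each.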
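The two formulas translate directly into code, and the optimal draft length can be found by scanning K. This is a sketch under the same assumptions as the formulas (a constant acceptance rate α that is independent across positions); the function names and the `k_max` search bound are illustrative choices, not part of the widget.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens per iteration: (1 - alpha^(k+1)) / (1 - alpha)."""
    if alpha >= 1.0:
        # Limit case: every draft token is accepted, plus the bonus token.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha: float, k: int, c: float) -> float:
    """Speedup over plain decoding: tokens per iteration / (1 + k * c)."""
    return expected_tokens(alpha, k) / (1.0 + k * c)


def optimal_k(alpha: float, c: float, k_max: int = 16) -> int:
    """Draft length in 1..k_max that maximizes the modeled speedup."""
    return max(range(1, k_max + 1), key=lambda k: speedup(alpha, k, c))


if __name__ == "__main__":
    alpha, c = 0.8, 0.1  # illustrative values, not measurements
    for k in range(1, 9):
        print(k, round(speedup(alpha, k, c), 3))
    print("K* =", optimal_k(alpha, c))
```

With the illustrative values α = 0.8 and c = 0.1, the modeled speedup peaks at K = 6 and declines after that, because each extra draft token adds a fixed cost c while its chance of being accepted shrinks geometrically. At K = 1 the break-even condition reduces to α > c.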

Greedy accepts when the argmax agrees, with no per-token draw overhead.
Sampling also needs the random draws to align and resamples from the residual distribution on rejection, so each draft token carries extra cost, modeled here as a higher effective c. This is why greedy at 62.5% acceptance beat sampling at 64.3% (1.10x vs 0.85x) in the post.
EAGLE / Medusa raise acceptance and cut draft cost (feature-level drafting, multiple heads), which is how the full technique set reaches 2-3x.
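The extra work that sampling does per draft token can be made concrete. The sketch below assumes the standard speculative sampling rule: accept a draft token with probability min(1, p/q), and on rejection draw a replacement from the residual distribution max(0, p − q), renormalized. The function name and the explicit `rng` argument are illustrative.

```python
import random
from typing import Sequence, Tuple


def verify_sampled_token(token: int, p: Sequence[float], q: Sequence[float],
                         rng: random.Random) -> Tuple[bool, int]:
    """Check `token` (drawn from draft q) against target p.

    Returns (accepted, emitted_token).
    """
    # Random draw number one: accept with probability min(1, p / q).
    # q[token] > 0 because the token was sampled from q.
    if rng.random() < min(1.0, p[token] / q[token]):
        return True, token

    # Rejected: build the residual max(0, p - q) and resample from it.
    # A rejection implies p[token] < q[token], so the residual is not all zero.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    replacement = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return False, replacement
```

Compared with the greedy check, which is a single argmax comparison, every draft token here needs a random draw and a probability ratio, and every rejection needs a pass over the vocabulary to build the residual. That is the per-token overhead the model above folds into a higher effective c. The payoff is that the emitted tokens follow the target distribution exactly.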
