Type to search posts and projects to navigate

Instrument · Inference

Memory-bandwidth roofline

Is a decode workload memory-bound or compute-bound? Plug in the model, batch size, precision, and GPU, and watch the operating point cross the ridge.

Bet 02 / Memory bandwidth

Memory-bound, or compute-bound?

Autoregressive decode reads the whole model from memory for every token but does almost no arithmetic. Below the ridge point, the GPU sits idle waiting on memory. Here is exactly where your configuration lands.

State
Arithmetic intensity
FLOP/byte
Ridge point
FLOP/byte
GPU compute idle
%
Time / token · memory
Time / token · compute
Effective time / token
Throughput
tok/s
operating point roofline log–log axes
How this is computed

Weight-bound decode of one token at batch B, weights read once per token:

  • Arithmetic intensity = 2B / bytes-per-param (so it equals B at FP16, and rises as precision drops)
  • Ridge point = peak FLOP/s / memory bandwidth. You are memory-bound while intensity is left of it.
  • Memory time = (params × bytes-per-param) / bandwidth; compute time = 2 × params × B / peak FLOP/s.

Peak figures are approximate dense tensor-core throughput and HBM bandwidth per vendor spec, scaled by precision. KV-cache traffic is omitted; it deepens the memory-bound conclusion rather than softening it.

Subhadip Mitra