Instrument · Inference
Memory-bandwidth roofline
Is a decode workload memory-bound or compute-bound? Plug in the model, batch size, precision, and GPU, and watch the operating point cross the ridge.
Bet 02 / Memory bandwidth
Memory-bound, or compute-bound?
Autoregressive decode reads the whole model from memory for every token but does almost no arithmetic. Below the ridge point, the GPU sits idle waiting on memory. Here is exactly where your configuration lands.
State
—
Arithmetic intensity
— FLOP/byte
Ridge point
— FLOP/byte
GPU compute idle
—%
Time / token · memory
—
Time / token · compute
—
Effective time / token
—
Throughput
— tok/s
How this is computed
Weight-bound decode of one token at batch B, weights read once per token:
- Arithmetic intensity
= 2B / bytes-per-param(so it equalsBat FP16, and rises as precision drops) - Ridge point
= peak FLOP/s / memory bandwidth. You are memory-bound while intensity is left of it. - Memory time
= (params × bytes-per-param) / bandwidth; compute time= 2 × params × B / peak FLOP/s.
Peak figures are approximate dense tensor-core throughput and HBM bandwidth per vendor spec, scaled by precision. KV-cache traffic is omitted; it deepens the memory-bound conclusion rather than softening it.