r/LocalLLaMA · May 21, 2026 · 1 min read

Interesting paper advocates for quantized prefilling and precise decoding



From other people's tests, NVFP4 decoding hasn't really let anyone hit higher peaks (say, 85-90% memory bandwidth utilization) than other approaches. Development is leaning toward a different class of optimization, like parallel decoding. Measurement is also harder in the MoE era, since an MoE model takes a token generation (tg) speed penalty compared with a dense model of the same active size. We may get a prefill speedup, but tg performance is not mind-bendingly good, and there are quality losses that depend on how the quantization was done.
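For anyone who wants to check a bandwidth utilization figure themselves, here is a rough back-of-the-envelope sketch. It assumes decoding reads every active weight once per generated token and ignores KV cache and activation traffic, so it underestimates the real traffic. The function name and all numbers are made up for illustration, not taken from the paper or from any benchmark.

```python
def decode_bandwidth_utilization(tokens_per_s, active_params, bits_per_weight, peak_bandwidth):
    """Fraction of peak memory bandwidth implied by a measured decode speed.

    Assumption: every active weight is read exactly once per generated token.
    """
    bytes_per_token = active_params * bits_per_weight / 8.0
    return tokens_per_s * bytes_per_token / peak_bandwidth


# Hypothetical setup: 8B active parameters, an assumed ~4.5 bits per weight
# (4-bit values plus block scales), 1 TB/s peak bandwidth, 100 tok/s measured.
util = decode_bandwidth_utilization(100.0, 8e9, 4.5, 1e12)
print(f"{util:.0%} of peak bandwidth")
```

With these made-up numbers the run sits at 45% of peak, well short of the 85-90% range mentioned above.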

This paper proposes something simple: use W4A4 for prefill to get the (theoretical 4x) speedup, and don't use W4A4 for decoding, since errors accumulate more there. Interesting. Maybe some inference engines have already applied this idea.

- https://arxiv.org/abs/2605.20315

"Prefilling and decoding exhibit distinct computational bottlenecks and quantization redundancy behaviors. Prefilling processes a fixed input sequence in parallel and is suited to aggressive quantization: quantization errors do not recursively affect future inputs within the same prefill pass, and long agentic contexts often contain substantial redundancy. In contrast, decoding is much more error-sensitive, as each sampled token affects the generation process."
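The "distinct computational bottlenecks" part can be illustrated with a simple roofline check. This is only a sketch: it counts weight traffic alone (no activations, no KV cache), and the accelerator numbers and the `bound_regime` helper are hypothetical.

```python
def bound_regime(tokens_in_flight, bytes_per_weight, peak_flops, peak_bandwidth):
    """Classify a weight matmul as compute-bound or memory-bound (simple roofline)."""
    # Each weight is read once and does one multiply-add (2 FLOPs) per token in flight.
    intensity = 2.0 * tokens_in_flight / bytes_per_weight  # FLOPs per byte of weights
    ridge = peak_flops / peak_bandwidth                     # FLOPs per byte at the roofline knee
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical accelerator: 400 TFLOP/s and 1 TB/s, so the knee is at 400 FLOPs/byte.
PEAK_FLOPS = 400e12
PEAK_BANDWIDTH = 1e12

prefill_regime = bound_regime(4096, 2.0, PEAK_FLOPS, PEAK_BANDWIDTH)  # 4096-token prompt, 16-bit weights
decode_regime = bound_regime(1, 2.0, PEAK_FLOPS, PEAK_BANDWIDTH)      # one token per step
print(prefill_regime, decode_regime)
```

Under these assumptions a long prompt lands far on the compute side, which is where faster low-precision math helps, while single-token decoding stays limited by how fast weights can be read.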

"Weight-and-activation quantization can accelerate compute-bound prefilling, but applying aggressive W4A4 quantization to the full autoregressive process is brittle, as activation errors may perturb token choices and accumulate over generation [5, 37, 46]. Mix-Quant therefore quantizes only context encoding while keeping decoding on the original high-precision path."
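To make the split concrete, here is a toy sketch of phase-dependent precision. It is not the paper's Mix-Quant code: the int4 scheme (symmetric, one scale per vector), the `quantize_int4` and `linear` helpers, and the tiny matrices are all assumptions for illustration. It only simulates the rounding error in plain Python, whereas a real engine would get the speedup from low-precision matmul kernels.

```python
def quantize_int4(values):
    """Symmetric 4-bit fake quantization: snap to integers in [-7, 7], then rescale."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / 7.0
    return [max(-7, min(7, round(v / scale))) * scale for v in values]


def linear(x, weight, phase):
    """y = W @ x, with weights and activations quantized (W4A4) only during prefill."""
    if phase == "prefill":
        x = quantize_int4(x)
        weight = [quantize_int4(row) for row in weight]
    return [sum(w * a for w, a in zip(row, x)) for row in weight]


weight = [[0.9, -0.31, 0.05], [0.12, 0.44, -0.7]]  # hypothetical 2x3 layer
x = [1.0, 0.53, -0.2]                              # hypothetical activation vector

exact = [sum(w * a for w, a in zip(row, x)) for row in weight]
prefill_out = linear(x, weight, "prefill")  # low-precision path
decode_out = linear(x, weight, "decode")    # original-precision path

prefill_err = [abs(p - e) for p, e in zip(prefill_out, exact)]
print("prefill error:", prefill_err)
print("decode matches exact:", decode_out == exact)
```

The prefill path picks up a small rounding error on each output, while the decode path reproduces the unquantized result exactly. That is the trade the quote describes: accept the error where the input is fixed, avoid it where each output feeds the next step.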

Beyond NVFP4, the general idea seems important. Low-precision compute is useful for crunching through the prompt, where it is less lossy than when streaming out tokens.

submitted by /u/Aaaaaaaaaeeeee