Interesting paper advocates for quantized prefilling and precise decoding
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
From other people's tests, NVFP4 decoding speed hasn't really let people hit higher peaks (say, 85-90% memory bandwidth utilization) than other approaches. Development effort is leaning toward a different class of optimization, such as parallel decoding. Measurement is also harder in the MoE era, since a MoE model takes a token generation (tg) speed penalty versus a dense model with the same active parameter count. We may get a prefill speedup, but tg performance is not mind-bendingly good, and there are losses depending on how the quantization is done.

The paper's proposal is simple: use W4A4 for prefill to get the (theoretical 4x) gain, and do not use W4A4 for decoding, since errors accumulate there. Interesting. Maybe some inference engines have applied this idea already.

- https://arxiv.org/abs/2605.20315

"Prefilling and decoding exhibit distinct computational bottlenecks and quantization redundancy behaviors. Prefilling processes a fixed input sequence in parallel and is suited to aggressive quantization: quantization errors do not recursively affect future inputs within the same prefill pass, and long agentic contexts often contain substantial redundancy. In contrast, decoding is much more error-sensitive, as each sampled token affects the generation process."

"Weight-and-activation quantization can accelerate compute-bound prefilling, but applying aggressive W4A4 quantization to the full autoregressive process is brittle, as activation errors may perturb token choices and accumulate over generation [5, 37, 46]. Mix-Quant therefore quantizes only context encoding while keeping decoding on the original high-precision path."

Beyond NVFP4, the general idea seems important. Low-precision crunching is useful, and less lossy than streaming.
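To make the split concrete, here is a minimal NumPy sketch of a phase-aware linear layer: the prefill path fake-quantizes both weights and activations to 4 bits (W4A4), while the decode path stays at full precision. This is not the paper's code. The names `fake_quant_int4` and `linear` are hypothetical, and symmetric per-tensor int4 is an assumption standing in for a real 4-bit format such as NVFP4, which uses FP4 values with block-wise scales.

```python
import numpy as np


def fake_quant_int4(x: np.ndarray) -> np.ndarray:
    """Quantize to a symmetric 4-bit integer grid, then dequantize.

    Assumption: per-tensor int4 is a stand-in for a real 4-bit format.
    """
    max_abs = float(np.max(np.abs(x)))
    if max_abs == 0.0:
        return x.copy()
    scale = max_abs / 7.0  # map the largest magnitude to level 7
    levels = np.clip(np.round(x / scale), -7, 7)
    return levels * scale


def linear(x: np.ndarray, w: np.ndarray, phase: str) -> np.ndarray:
    """Hypothetical phase-aware linear layer (illustration only)."""
    if phase == "prefill":
        # W4A4: both activations and weights are quantized.
        return fake_quant_int4(x) @ fake_quant_int4(w)
    if phase == "decode":
        # Original high-precision path.
        return x @ w
    raise ValueError(f"unknown phase: {phase!r}")


rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
prompt = rng.standard_normal((128, 64))  # many tokens processed in parallel
token = rng.standard_normal((1, 64))     # one token per decode step

prefill_out = linear(prompt, w, "prefill")
decode_out = linear(token, w, "decode")

exact_prefill = prompt @ w
prefill_rel_err = (
    np.linalg.norm(prefill_out - exact_prefill) / np.linalg.norm(exact_prefill)
)
print(f"prefill relative error: {prefill_rel_err:.3f}")
print("decode matches full precision:", np.allclose(decode_out, token @ w))
```

This only simulates the numerics. Fake quantization in floating point gives no speedup by itself; any real prefill gain would depend on hardware 4-bit matmul kernels in the inference engine.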