2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50) Playing with a hypothesis like speculative decoding.. but instead of an additional side model, exploiting that I can run multiple computations side-by-side AS IF I had Qwen3.6-27B loaded twice in memory - small quants don't use all…
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Forgive the Claude-written summary in the README, but the base works. I'm still working on the HIP kernel and on combining it with MTP. I hope to get up near 80 tk/s.

It all started when I realized every Q8 (INT8 or F8) calculation was going through f32 compute and using only a quarter of each 32-bit value, so for each value loaded we can run 4 operations. Then there's the idea behind speculative decoding, where a smaller side model runs predictions that a bigger model accepts or vetoes: why not have THE SAME MODEL make those determinations? The KV cache adds a little overhead, but it's tiny. See the tables in the README, and some SVGs as well. The benchmark is from my single MI50.

Because this exploits the nature of the smaller quants (Q8 or less), it really only works with those.
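To make the "4 operations per value loaded" arithmetic concrete, here is a minimal Python sketch of the packing idea only. It is not the HIP kernel from the repo, and the helper names (`pack4_int8`, `unpack4_int8`, `dot4_packed`) are made up for illustration. It assumes signed 8-bit weights and activations, and shows that four of them fit in one 32-bit word, so a single 32-bit load can feed a four-lane dot product, similar in spirit to the packed int8 dot-product instructions some GPUs expose.

```python
import struct


def pack4_int8(values):
    """Pack four signed 8-bit integers into one 32-bit word (little-endian)."""
    if len(values) != 4:
        raise ValueError("expected exactly four values")
    # "<4b" = four signed bytes, "<I" = one unsigned 32-bit integer.
    return struct.unpack("<I", struct.pack("<4b", *values))[0]


def unpack4_int8(word):
    """Recover the four signed 8-bit lanes from a packed 32-bit word."""
    return list(struct.unpack("<4b", struct.pack("<I", word)))


def dot4_packed(word_a, word_b, acc=0):
    """Four-lane int8 dot product of two packed words, added to an accumulator.

    Hypothetical software stand-in for a hardware packed dot product:
    one 32-bit operand per side yields four multiply-adds.
    """
    lanes_a = unpack4_int8(word_a)
    lanes_b = unpack4_int8(word_b)
    return acc + sum(x * y for x, y in zip(lanes_a, lanes_b))


# Illustrative values only, not taken from any real model.
weights = pack4_int8([12, -7, 100, -128])
acts = pack4_int8([3, 5, -2, 1])
print(dot4_packed(weights, acts))  # 36 - 35 - 200 - 128 = -327
```

Pushing a single 8-bit value through a 32-bit lane would leave the other three byte slots idle, which is the unused capacity the post is pointing at. How the actual kernel schedules that spare capacity across the parallel passes is not shown here.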