r/LocalLLaMA · June 9, 2026 · 1 min read


Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

2X tk/s (19.4 -> 38.1 tk/s on 1x MI50). I'm playing with a hypothesis similar to speculative decoding, but instead of using an additional side model, I'm exploiting the fact that I can run multiple computations side by side AS IF I had Qwen3.6-27B loaded twice in memory, because small quants don't use all the available compute.

Forgive the Claude summary in the readme, but the base works. I'm still working on the HIP kernel and on combining it with MTP. I hope to get up near 80 tk/s.

It all started because I realized every Q8 (INT8 or F8) calculation was using an f32's worth of compute and only using 1/4 of the available width... so for each value loaded we can run 4 operations. Then there's the idea of speculative decoding, which has a smaller model on the side running predictions that a bigger model votes on/vetoes. Why not just have THE SAME MODEL make those determinations?
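For anyone who hasn't seen the vote/veto step written out, here is a minimal sketch of the generic greedy accept rule used in speculative decoding. It is not taken from the repo: `accept_draft`, `toy_next_token`, and the toy "model" are hypothetical names made up for illustration, and how the drafts get produced on the spare lanes is not described in this post. A real implementation would score all draft positions in one batched forward pass; the loop below calls the verifier one step at a time only to keep the logic readable.

```python
from typing import Callable, List, Sequence


def accept_draft(
    context: Sequence[int],
    draft: Sequence[int],
    next_token: Callable[[Sequence[int]], int],
) -> List[int]:
    """Greedy verification: keep the draft prefix the verifier agrees with.

    Returns the tokens to append to the context. On the first disagreement
    the verifier's own token replaces the rejected draft token, so at least
    one token is always produced.
    """
    ctx = list(context)
    out: List[int] = []
    for tok in draft:
        expected = next_token(ctx)
        if expected != tok:
            out.append(expected)  # veto: use the verifier's token and stop
            return out
        out.append(tok)  # vote passed: the draft token is accepted
        ctx.append(tok)
    # Every draft token was accepted, so the verifier adds one more token.
    out.append(next_token(ctx))
    return out


def toy_next_token(ctx: Sequence[int]) -> int:
    """Hypothetical stand-in for a model: next token is (last + 1) mod 10."""
    return (ctx[-1] + 1) % 10


if __name__ == "__main__":
    # The first two draft tokens match, the third is vetoed and corrected.
    print(accept_draft([1, 2], [3, 4, 9], toy_next_token))
```

The speedup comes from how many draft tokens survive verification per pass, which is why the same-model idea is attractive: the drafter and the verifier should agree often.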

KV cache adds a little overhead, but it's tiny. See the tables in the readme, and some SVGs as well.

Benchmark is my single MI50.

Because we're exploiting the nature of the smaller quants (Q8 or less), this really only works with those.
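The "four operations per loaded value" idea can be pictured with a small packing example. This is only an illustration of why 8-bit (or smaller) values are the requirement, assuming a 32-bit word as the unit of work. It is not the HIP kernel from the repo, and the helper names `pack4_int8`, `unpack4_int8`, and `dot4` are hypothetical.

```python
import struct
from typing import List, Sequence


def pack4_int8(values: Sequence[int]) -> int:
    """Pack four signed 8-bit integers into one 32-bit word (little-endian)."""
    if len(values) != 4:
        raise ValueError("expected exactly four values")
    # struct raises struct.error if any value is outside -128..127.
    return struct.unpack("<I", struct.pack("<4b", *values))[0]


def unpack4_int8(word: int) -> List[int]:
    """Recover the four signed 8-bit integers from a 32-bit word."""
    return list(struct.unpack("<4b", struct.pack("<I", word)))


def dot4(word_a: int, word_b: int) -> int:
    """Dot product of two packed words: four multiply-adds per word pair."""
    a = unpack4_int8(word_a)
    b = unpack4_int8(word_b)
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    wa = pack4_int8([1, -2, 3, -4])
    wb = pack4_int8([5, 6, 7, 8])
    print(hex(wa))       # one 32-bit word carrying four weights
    print(dot4(wa, wb))  # 1*5 - 2*6 + 3*7 - 4*8
```

A 16-bit value would only fit twice in the same word and a 32-bit value once, which is the sense in which the trick depends on Q8-or-smaller quants.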

submitted by /u/bigattichouse