Used over a million tokens in three separate sessions to test Qwen 3.6 35b (new Multi-token Prediction version)
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
In my opinion, MTP models are 100% game changer for local LLMs.
In terms of speed, I was getting around 1.5x the tok/sec of previous tests.
The project was a test - building a full iterative step-by-step pygame; a small mystery dungeon-style game. At first I set 100-200k context and raised it to 300k. This is at KV Q8_0 quant. Edit: I was wrong, I had mistakenly left it at q4_0. I will redo tests tomorrow with Q8.
I use VSCodium and Roo. The idea was to see how far I can push the context window and measure (by feel) if a large context window with a multi-file project slows it down too much to be effective.
Model used: Qwen3.6-35B-A3B-UD-Q5_K_S (MTP version) - link
OS/Software: Ubuntu 24.04 - Vulkan - To use MTP I had to use a docker version of the MTP prototype of llama.cpp server (image: havenoammo/llama:vulkan-server)
My current window is 300k context but I feel like I can go even higher as my VRAM used is 28.3gb / 32gb. Likely 400k is viable (with the 35B MoE model that is).
GPU: Asus Radeon R9700 AI Pro card (32gb RDNA 4 card)
Just want to shoot my appreciation for the local LLM community and everyone responsible for enabling us to run these kinds of powerful models at home. Amazing when I think where we were just a year ago. I am having a blast exploring all this tech and every day that I learn something new, it just leaves me astounded.
EDIT: Switched to the Qwen 3.6 27b model (non-MoE) as I was running into issues with the MoE model when deep in context sessoin (200k ish). Will update results.
[link] [comments]
More from r/LocalLLaMA
-
Why Dario is on fire: lesson from dotcom bubble.
Jun 30
-
Been running Qwen3.6-27B through a 3-critic harness. The harness matters more than I thought
Jun 30
-
I Hate Dario Amodei, and everything he stands for.
Jun 29
-
Introducing LongCat-2.0 - , a large-scale MoE language model with 1.6 trillion total parameters and ~48 billion activated per token. This was the stealth model that was on Openrouter under the name 'owl-alpha'.
Jun 29
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.