Used over a million tokens in three separate sessions to test Qwen 3.6 35b (new Multi-token Prediction version)
In my opinion, MTP models are 100% a game changer for local LLMs.
In terms of speed, I was getting around 1.5x the tok/sec of my previous tests.
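(If you want a harder number than "by feel": llama-server reports its own timings in the /completion response, so a quick check looks roughly like this. The port and payload are assumptions from a default setup, not my exact config.)

```bash
# Hedged sketch: request a short completion and read the server's own
# timing report (timings.predicted_per_second = generation tok/s).
curl -s http://localhost:8080/completion \
  -d '{"prompt": "Hello", "n_predict": 128}' \
  | jq '.timings.predicted_per_second'
```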
The project was a test: building a full pygame project iteratively, step by step; a small mystery-dungeon-style game. At first I set 100-200k context and later raised it to 300k, with the KV cache at Q8_0 quant. Edit: I was wrong; I had mistakenly left it at q4_0. I will redo the tests tomorrow with Q8.
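For reference, the relevant llama-server flags look roughly like this (a sketch, not my exact command; `-ctk`/`-ctv` pick the K/V cache quant type, which is where my q4_0 mix-up happened):

```bash
# Sketch of the context/KV-cache flags (model filename from the post,
# everything else assumed). -c is the context window in tokens;
# -ctk/-ctv set the K and V cache quant types (q8_0 here, not q4_0).
llama-server \
  --model Qwen3.6-35B-A3B-UD-Q5_K_S.gguf \
  -c 300000 \
  -ctk q8_0 -ctv q8_0
```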
I use VSCodium and Roo. The idea was to see how far I could push the context window and gauge (by feel) whether a large context window with a multi-file project slows things down too much to be effective.
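Roo just talks to an OpenAI-compatible endpoint, so before wiring it up it's worth sanity-checking the server directly. A minimal check, assuming the default host/port:

```bash
# Hedged: verify the OpenAI-compatible endpoint responds before pointing
# Roo at it (llama-server exposes /v1/chat/completions; host/port assumed).
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"ping"}],"max_tokens":16}'
```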
Model used: Qwen3.6-35B-A3B-UD-Q5_K_S (MTP version) - link
OS/Software: Ubuntu 24.04 - Vulkan. To use MTP I had to run a Docker build of the llama.cpp server's MTP prototype (image: havenoammo/llama:vulkan-server).
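The launch looked roughly like this (image name as above; everything else here is an assumption for an AMD Vulkan setup, so adjust to taste). Tack on the context/KV flags from the sketch earlier:

```bash
# Rough container launch (hedged: assumes the image's entrypoint is
# llama-server and that /dev/dri is enough to expose the GPU for Vulkan;
# the models path and port mapping are placeholders).
docker run --rm -it \
  --device /dev/dri \
  -v "$HOME/models:/models" \
  -p 8080:8080 \
  havenoammo/llama:vulkan-server \
  --model /models/Qwen3.6-35B-A3B-UD-Q5_K_S.gguf \
  --host 0.0.0.0 --port 8080
```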
My current window is 300k context, but I feel like I can go even higher, as VRAM usage sits at 28.3 GB / 32 GB. 400k is likely viable (with the 35B MoE model, that is).
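(If you want to watch headroom while creeping the context up, here are two ways on an AMD card; rocm-smi needs the ROCm tools installed, and the sysfs card index may differ on your system:)

```bash
# Two ways to watch VRAM on an amdgpu card (assumptions: ROCm tools are
# present for the first; card0 is the right device for the second).
rocm-smi --showmeminfo vram
watch -n1 'cat /sys/class/drm/card0/device/mem_info_vram_used'
```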
GPU: Asus Radeon AI Pro R9700 (32 GB, RDNA 4)
Just want to shout out my appreciation for the local LLM community and everyone responsible for enabling us to run these kinds of powerful models at home. Amazing when I think about where we were just a year ago. I am having a blast exploring all this tech, and every day I learn something new that leaves me astounded.
EDIT: Switched to the Qwen 3.6 27B model (non-MoE) as I was running into issues with the MoE model deep into a session (200k-ish context). Will update results.