r/LocalLLaMA · 1 min read

Could someone please help explain these results?


I'm running Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf on 12 GB VRAM and 32 GB RAM via the TurboQuant variant of llama.cpp. I increased the --n-cpu-moe value from 8 to 30, and my inference rate doubled, from 17 to 34 tok/s! Shouldn't it have slowed down, since the CPU now has to do so much more work? Here is the command I'm using:

llama-cli -m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf -ngl 999 --n-cpu-moe 30 -fa on --cache-type-k turbo4 --cache-type-v turbo3 -c 262144 -t 6 -b 2048 -ub 512 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --no-mmap

Increasing it further to 41 didn't touch the inference rate. What's going on?

And if you're feeling charitable, could you also tell me how I might squeeze a little more speed out of this setup, if possible?

Edit: I increased it further from 41 to 256, and if anything, inference sped up even more, and VRAM usage stayed the same. I'm flummoxed, I tell you. Flummoxed.
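
One way to narrow this down is to rerun the exact same command while changing only --n-cpu-moe, noting tok/s and VRAM usage for each value. Below is a minimal sketch, assuming Python 3.8+, that only builds and prints the command lines for the values tried above; it does not run llama-cli or measure anything. The helper name sweep_commands is hypothetical, and every flag is copied from the command in this post rather than checked against the TurboQuant build.

```python
import shlex

# Flags copied verbatim from the command in the post.
# Only --n-cpu-moe is varied; it is appended per run below.
BASE_ARGS = [
    "llama-cli",
    "-m", "Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf",
    "-ngl", "999",
    "-fa", "on",
    "--cache-type-k", "turbo4",
    "--cache-type-v", "turbo3",
    "-c", "262144",
    "-t", "6",
    "-b", "2048",
    "-ub", "512",
    "--temp", "0.6",
    "--top-k", "20",
    "--top-p", "0.95",
    "--min-p", "0",
    "--no-mmap",
]


def sweep_commands(values):
    """Return one shell-quoted command string per --n-cpu-moe value.

    Hypothetical helper for illustration: it builds strings only and
    never launches a process.
    """
    return [
        shlex.join(BASE_ARGS + ["--n-cpu-moe", str(n)])
        for n in values
    ]


if __name__ == "__main__":
    # The four values mentioned in the post.
    for command in sweep_commands([8, 30, 41, 256]):
        print(command)
```

Running each printed command with the same prompt, and writing down the reported tok/s next to the VRAM in use, would show whether the rate really plateaus past a certain value.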

submitted by /u/MackTuesday