Could someone please help explain these results?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I'm running Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf on 12 GB VRAM and 32 GB RAM via the TurboQuant variant of llama.cpp. I increased the --n-cpu-moe value from 8 to 30, and my inference rate doubled (17 to 34 tok/s)! Shouldn't it have slowed down, since the CPU has to do so much more work? Here is the command I'm using:
llama-cli -m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf -ngl 999 --n-cpu-moe 30 -fa on --cache-type-k turbo4 --cache-type-v turbo3 -c 262144 -t 6 -b 2048 -ub 512 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --no-mmap
Increasing it further to 41 didn't touch the inference rate. What's going on?
And if you're feeling charitable, could you also tell me how I might squeeze a little more speed out of this setup, if possible?
Edit: I increased it further from 41 to 256, and if anything, inference sped up even more, and VRAM usage stayed the same. I'm flummoxed, I tell you. Flummoxed.
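In case anyone wants to reproduce the comparison, here is a minimal sketch that generates the same command for each --n-cpu-moe value I tried. It only builds and prints the command lines; it does not run them or measure anything. Every flag is copied from the command above, while the helper name build_command and the SWEEP list are just illustrative choices.

```python
import shlex

# Flags copied verbatim from the command in the post; only --n-cpu-moe varies.
BASE_ARGS = [
    "llama-cli", "-m", "Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf",
    "-ngl", "999", "-fa", "on",
    "--cache-type-k", "turbo4", "--cache-type-v", "turbo3",
    "-c", "262144", "-t", "6", "-b", "2048", "-ub", "512",
    "--temp", "0.6", "--top-k", "20", "--top-p", "0.95", "--min-p", "0",
    "--no-mmap",
]

# The values mentioned in the post (hypothetical sweep list, adjust as needed).
SWEEP = [8, 30, 41, 256]


def build_command(n_cpu_moe: int) -> list[str]:
    """Return the full argument list with --n-cpu-moe set to n_cpu_moe."""
    if n_cpu_moe < 0:
        raise ValueError("n_cpu_moe must be non-negative")
    return BASE_ARGS + ["--n-cpu-moe", str(n_cpu_moe)]


if __name__ == "__main__":
    # Print one shell-quoted command per setting so each can be run by hand.
    for n in SWEEP:
        print(shlex.join(build_command(n)))
```

Running each printed command with the same prompt and noting the reported tok/s and VRAM usage should make the comparison between settings easier to repeat.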