r/LocalLLaMA · · 4 min read

Reviewing speed optimizations on llamacpp for large MoE models on multiGPU rigs? (fitparams vs -ngl/-ncmoe vs other flags, P2P, overclocking)

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

In anticipation of MiniMax's reported upcoming open-weight release of M3, I wanted to do a comprehensive review of what I'm aware of regarding speed optimizations. Hopefully it can be a helpful reference for some people too. I've outlined my understanding of the currently available speed optimizations; what feedback can I get on my understanding, and what big gaps am I missing?

(By the way, is any of this considered remotely valuable information? I've been half-considering a career change and wondering if the time I invest in all this stuff is even valuable enough knowledge to be hirable in the tech field. My dream would be to work at a frontier lab one day. But my understanding is that something like that would require much more technical expertise, like manipulating kernels themselves for speed optimizations, or next-level knowledge of effectively applying agentic workflows in the B2C domain.)

For llama-server arguments:
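-ngl 999 : set higher than the model's layer count so that every layer is offloaded to the GPUs
-ncmoe ? : keeps the MoE expert weights of the first N layers on the CPU; start at maybe a quarter to half of the total number of layers, then keep decreasing for as long as it all still fits in VRAM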
-t 12 : my cpu has 12 physical cores and 24 total threads, so it's been suggested to me by chatbots to set this to 12. Tbh I don't see any difference with this
-fa on : chatbots suggest setting this manually; this seems unnecessary to me because it defaults to on anyway
-fitt 256,256,256,256,256 : one value per GPU, for the 5x GPUs in my rig. my understanding is that this makes llamacpp use more of the available VRAM instead of leaving the default 1024 MiB free on each GPU, which has helped squeeze out a bit of performance gain
-ub 8192 : my understanding is that a larger batch size speeds up prompt processing. This takes me from 50tps to 120tps in pp speed, for a slight decrease in token generation speed from 12tps to 11tps, which I suspect is due to the large hit on VRAM that takes away VRAM available for attention.
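
To put the -ub trade-off in numbers, here is a rough sketch of total request time using the speeds quoted above. The 8000-token prompt and 500-token reply are made-up request sizes for illustration only, and the model ignores everything except pp and tg speed.

```python
# Speeds reported in the post: baseline vs. -ub 8192 (tokens per second).
BASELINE = {"pp_tps": 50.0, "tg_tps": 12.0}
LARGE_UB = {"pp_tps": 120.0, "tg_tps": 11.0}


def total_seconds(prompt_tokens, output_tokens, pp_tps, tg_tps):
    """Wall-clock time for one request: prompt processing plus generation."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps


def break_even_output_tokens(prompt_tokens):
    """Reply length at which both settings take the same total time."""
    # Seconds saved on the prompt by the faster pp speed.
    pp_saved = prompt_tokens / BASELINE["pp_tps"] - prompt_tokens / LARGE_UB["pp_tps"]
    # Extra seconds paid per generated token because of the slower tg speed.
    tg_cost_per_token = 1 / LARGE_UB["tg_tps"] - 1 / BASELINE["tg_tps"]
    return pp_saved / tg_cost_per_token


# Hypothetical request shape, chosen only for illustration.
prompt, output = 8000, 500
print("baseline seconds:", round(total_seconds(prompt, output, **BASELINE), 1))
print("-ub 8192 seconds:", round(total_seconds(prompt, output, **LARGE_UB), 1))
print("break-even reply length:", round(break_even_output_tokens(prompt)))
```

With these speeds the larger micro-batch comes out ahead whenever the reply is shorter than roughly 1.54x the prompt length, so the 1tps loss in generation only hurts for short prompts with very long replies.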

(TBH llamacpp's default fitparams have worked well for me. However, cloud chatbots and Reddit always seem to suggest manually tuning -ngl and -ncmoe to optimize performance. But to be honest, manual tuning has never been any better than the standard fitparams for me; am I utilizing these arguments correctly?)
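
For the manual route, the "keep decreasing -ncmoe until it just fits" procedure can be sketched as a binary search. This is only a sketch under assumptions: `toy_fits` is a made-up memory model with invented sizes (60 layers, 2 GiB of experts per layer, 70 GiB of usable VRAM), and `model.gguf` is a placeholder path. In practice the check would be launching llama-server and seeing whether it loads without running out of memory.

```python
import shlex


def smallest_fitting_ncmoe(n_layers, fits):
    """Return the smallest -ncmoe value for which fits(n) is True.

    Assumes fitting is monotonic: keeping more layers' experts on the CPU
    never makes the model stop fitting in VRAM.
    Returns None if even n_layers (all experts on CPU) does not fit.
    """
    if not fits(n_layers):
        return None
    lo, hi = 0, n_layers  # invariant: fits(hi) is True
    while lo < hi:
        mid = (lo + hi) // 2
        if fits(mid):
            hi = mid       # mid fits, try keeping fewer layers on CPU
        else:
            lo = mid + 1   # mid does not fit, need more layers on CPU
    return hi


def toy_fits(n_cpu_moe, n_layers=60, expert_gib_per_layer=2.0,
             other_gib=10.0, vram_budget_gib=70.0):
    """Hypothetical memory model; every number here is invented."""
    on_gpu = (n_layers - n_cpu_moe) * expert_gib_per_layer + other_gib
    return on_gpu <= vram_budget_gib


n = smallest_fitting_ncmoe(60, toy_fits)

# Assemble the flags from the list above; the model path is a placeholder.
argv = ["llama-server", "-m", "model.gguf", "-ngl", "999",
        "-ncmoe", str(n), "-fa", "on", "-ub", "8192"]
command = shlex.join(argv)
print(command)
```

Each probe of a real setup costs a full model load, so a binary search needs about 6 loads for 60 layers instead of stepping down one layer at a time.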

P2P : I recently tried setting this up and I think I did get a decent speed boost. Unfortunately, I wasn't good about my documentation, so I couldn't do a quantitative before-and-after comparison. I had to deactivate this recently after making hardware adjustments and trying to get my machine to boot, but I think I'll have to go back and set this up again. https://github.com/aikitoria/open-gpu-kernel-modules
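
For next time, a before/after comparison does not need much: a few repeated speed measurements per configuration and a percent change. A minimal sketch follows; the numbers in it are placeholders, not measurements from my rig.

```python
from statistics import mean, stdev


def compare(before_tps, after_tps):
    """Summarize repeated tokens-per-second measurements for two configs."""
    b, a = mean(before_tps), mean(after_tps)
    return {
        "before_mean": b,
        "after_mean": a,
        "before_stdev": stdev(before_tps),
        "after_stdev": stdev(after_tps),
        "change_pct": (a - b) / b * 100,
    }


# Placeholder values only; replace with real runs (same model, prompt, flags).
p2p_off = [10.0, 10.2, 9.8]
p2p_on = [11.0, 11.2, 10.8]

result = compare(p2p_off, p2p_on)
print(f"change: {result['change_pct']:.1f}%")
```

If the change is smaller than the run-to-run spread (the stdev values), it is probably noise rather than a real gain.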

Undervolting : my understanding is that this actually marginally decreases performance, but it helps keep things cool and more power efficient for a relatively minor performance cost.

Overclocking GPUs : this is something I haven't had a chance to explore yet, and I'm not sure what the best way to go about it would be, or whether it's safe for the hardware long term.

MTP (multi-token prediction) : this has been more helpful for dense models like qwen3.6 27b via vLLM than for MoE models. I would like to be able to run MiniMax M3 at a decent quant and speed, but I suspect that I won't be able to take advantage of MTP, given that MiniMax did not make MTP available for the open-weights M2.7

Why use llama.cpp instead of vLLM with cpu-RAM offload? : my understanding is that while vLLM is capable of RAM offload, it ends up being slower than llamacpp, despite vLLM using tensor parallelism vs pipeline parallelism in llamacpp

My hardware setup:
1x 2060super8gb (each on pcie4.0 x8)
4x 5060ti16gb (each on pcie4.0 x16)
256gb ddr4 3200 ram, running at essentially 4-channel
MC62-G40 mobo
3945WX cpu
(by the way, I would not recommend this hardware path. this cpu only has 2 ccds, which limits it to basically 4-channel bandwidth, and getting to 8-channel bandwidth would be significantly more expensive. also, this cpu does not support AVX-512, which would have helped prompt processing speeds. so overall it's costly with limited utility, especially for GPUs like 5060ti16gb's which only have pcie5.0x8 anyways)
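
To show why the channel count matters, here is the theoretical peak bandwidth arithmetic for DDR4-3200 (64-bit, i.e. 8 bytes, per channel). The last function is a crude ceiling on generation speed for weights streamed from system RAM; the 8 GB read per token is a hypothetical figure, not a measurement of any particular model or quant.

```python
def ddr_bandwidth_gbs(mt_per_s, channels, bytes_per_transfer=8):
    """Theoretical peak in GB/s: transfers/s x 8 bytes per channel x channels."""
    return mt_per_s * 1e6 * bytes_per_transfer * channels / 1e9


def tg_ceiling_tps(bandwidth_gbs, gb_read_per_token):
    """Upper bound on tokens/s if every token must stream this much from RAM."""
    return bandwidth_gbs / gb_read_per_token


four_channel = ddr_bandwidth_gbs(3200, 4)    # what this cpu effectively gets
eight_channel = ddr_bandwidth_gbs(3200, 8)   # what the platform could offer

# Hypothetical: 8 GB of CPU-resident expert weights touched per token.
HYPOTHETICAL_GB_PER_TOKEN = 8.0

print("4-channel GB/s:", round(four_channel, 1))
print("8-channel GB/s:", round(eight_channel, 1))
print("4-channel tg ceiling:",
      round(tg_ceiling_tps(four_channel, HYPOTHETICAL_GB_PER_TOKEN), 1))
```

Real throughput lands below these peaks, but the ratio holds: for the part of the model that lives in system RAM, doubling the effective channels roughly doubles the ceiling on token generation.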

submitted by /u/Ambitious_Fold_2874