Speed difference between Windows 11 and Linux with llama.cpp: a myth when using medium and large MoE models
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
As the title says, there is no meaningful speed difference between Linux and Windows when using llama.cpp. I kept two operating systems on my computer for a long time because of this misconception. But when I got tired of constantly switching, I decided to check how much performance I'd lose if I moved to Windows.

First, a brief overview of the PC used in these tests:

- CPU: Core Ultra 7 265KF under water cooling, with a slight overclock to 5.6/4.7 GHz core frequencies
- Motherboard: Asus Z890 with three PCIe slots, two of them PCIe 4.0 x4
- RAM: Kingston Beast DDR5 192 GB (4×48 GB) at 6400 MHz, with slightly reduced voltage and relaxed timings to keep temperatures down
- GPUs: Nvidia GeForce RTX 5080 16 GB + RTX 5060 Ti 16 GB + RTX 5060 Ti 16 GB, all undervolted with a slight memory overclock
- PSU: 1200 W 80 Plus Gold (1000 W would have been enough, but I went with headroom from the start)

Operating systems used: Ubuntu 26.04 with KDE and GNOME (I also ran one test with Xfce), and Windows 11 with all updates installed. The llama.cpp version was the same across the board, built via cmake the day before yesterday, which happened to include a commit for reducing VRAM usage: "llama: use f16 mask for FA to save VRAM".

Models tested: Qwen 3.5 122B Q8, Qwen 3.5 397B iq4_xs, MiniMax 2.7 Q5.

llama.cpp launch parameters: `-nocb -dio --no-mmap -np 1 -t 15 -tb 15 -c 50000` (for coding, `-c 150000`) `-mg 0 -fa on --reasoning-budget 19000 --reasoning-budget-message " ... reasoning budget exceeded, need to answer." --no-mmproj`. It was also configured to start with the RTX 5080 by setting `CUDA_VISIBLE_DEVICES=1,2,0`. The only per-OS difference was the fit option: `-fit on` on Linux, `-fit-target 250` on Windows.

Results:

- Qwen 3.5 122B: PP 300, TG 28 on Windows; PP 290, TG 28.5 on Linux
- Qwen 3.5 397B: PP 140, TG 16 on Windows; PP 150, TG 15.2 on Linux
- MiniMax 2.7: PP 220, TG 17 on Windows; PP 230, TG 16 on Linux

All tests were run 4 times each, across the following tasks:

WSL turned out to be the slowest. I ran a test with just Qwen 3.5 397B, and the speed dropped from PP 140, TG 16 on native Windows to PP 110, TG 13.5.

I've laid out the exact llama.cpp launch parameters, so anyone can easily reproduce the results on their own hardware. Of course, everyone's setup is different, but the Windows-to-Linux performance ratio should stay about the same for MoE models with hybrid CPU+GPU offloading.

And running such large models doesn't require a ton of space, massive power draw, or all the other things people often list. Measured at the wall, the system drew only 550–600 W while running the 397B model. I also attached a photo of the PC. In a closed case, air convection is better with 140 mm fans.
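To put the reported numbers side by side, here is a small sketch that turns them into percentage differences. The figures are copied from the post. The dictionary layout and function names are made up for illustration, and the assumption that PP and TG are throughput values where higher is better (presumably tokens per second) is mine, since the post does not state units.

```python
# Figures copied from the post. PP = prompt processing, TG = token generation.
# Assumption: both are throughput values where higher is better.
RESULTS = {
    "Qwen 3.5 122B": {"windows": {"pp": 300, "tg": 28.0}, "linux": {"pp": 290, "tg": 28.5}},
    "Qwen 3.5 397B": {"windows": {"pp": 140, "tg": 16.0}, "linux": {"pp": 150, "tg": 15.2}},
    "MiniMax 2.7": {"windows": {"pp": 220, "tg": 17.0}, "linux": {"pp": 230, "tg": 16.0}},
}

# The single WSL run reported in the post (Qwen 3.5 397B only).
WSL_397B = {"pp": 110, "tg": 13.5}


def percent_change(new, base):
    """Signed percentage change of `new` relative to `base`."""
    return (new - base) / base * 100.0


def linux_vs_windows(results):
    """Return {model: {"pp": pct, "tg": pct}}; positive means Linux is faster."""
    return {
        model: {
            metric: percent_change(runs["linux"][metric], runs["windows"][metric])
            for metric in ("pp", "tg")
        }
        for model, runs in results.items()
    }


if __name__ == "__main__":
    for model, diff in linux_vs_windows(RESULTS).items():
        print(f"{model}: PP {diff['pp']:+.1f}%, TG {diff['tg']:+.1f}%")

    native = RESULTS["Qwen 3.5 397B"]["windows"]
    for metric in ("pp", "tg"):
        change = percent_change(WSL_397B[metric], native[metric])
        print(f"WSL vs native Windows, 397B {metric.upper()}: {change:+.1f}%")
```

With these inputs, every native Linux-versus-Windows difference stays within roughly 7% and the sign flips between models and metrics, while the WSL run is about 21% slower in PP and about 16% slower in TG than native Windows.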