r/LocalLLaMA · · 3 min read

Speed difference between Windows 11 and Linux with llama.cpp: a myth when using medium and large MoE models

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

As the title says, there is no meaningful speed difference between Linux and Windows when running medium and large MoE models with llama.cpp. I kept two operating systems on my computer for a long time because of this misconception. When I got tired of constantly switching, I decided to check how much performance I’d actually lose by moving to Windows.

First, a brief overview of the PC used in these tests:

- CPU: Core Ultra 7 265KF under water cooling, with a slight overclock to 5.6/4.7 GHz core frequencies

- Motherboard: Asus Z890 with three PCIe slots, two of them PCIe 4.0 x4

- RAM: Kingston Beast DDR5 192 GB (4×48 GB) at 6400 MHz, with slightly reduced voltage and relaxed timings to keep temperatures down

- GPUs: Nvidia GeForce RTX 5080 16 GB + RTX 5060 Ti 16 GB + RTX 5060 Ti 16 GB, all undervolted with a slight memory overclock

- PSU: 1200 W 80 Plus Gold — 1000 W would have been enough, but I went with headroom from the start

Operating systems used: Ubuntu 26.04 with KDE and GNOME — I also ran one test with Xfce — and Windows 11 with all updates installed. The llama.cpp version was the same across the board, built via cmake the day before yesterday, which happened to include a commit for reducing VRAM usage: “llama: use f16 mask for FA to save VRAM”.

Models tested: Qwen 3.5 122B Q8, Qwen 3.5 397B iq4_xs, MiniMax 2.7 Q5.

llama.cpp launch parameters: `-nocb -dio --no-mmap -np 1 -t 15 -tb 15 -c 50000` (for coding, `-c 150000`) `-mg 0 -fa on --reasoning-budget 19000 --reasoning-budget-message " ... reasoning budget exceeded, need to answer." --no-mmproj`. The only per-OS difference was the fit option: `-fit on` on Linux and `-fit-target 250` on Windows. It was also configured to start with the RTX 5080 by setting `CUDA_VISIBLE_DEVICES=1,2,0`.
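For anyone scripting the runs, here is a minimal sketch of how the two per-OS command lines could be assembled. The flags are copied as posted above and not re-verified against any particular build. The binary name `llama-server`, the `-m` model argument, and the `model.gguf` path are my assumptions for illustration, since the post does not say which binary or model files were used.

```python
import os
import shlex

# Flags copied verbatim from the post; not re-checked against a specific build.
COMMON_FLAGS = [
    "-nocb", "-dio", "--no-mmap",
    "-np", "1",
    "-t", "15", "-tb", "15",
    "-mg", "0",
    "-fa", "on",
    "--reasoning-budget", "19000",
    "--reasoning-budget-message",
    " ... reasoning budget exceeded, need to answer.",
    "--no-mmproj",
]

# The only per-OS difference stated in the post.
FIT_FLAGS = {
    "linux": ["-fit", "on"],
    "windows": ["-fit-target", "250"],
}


def build_command(binary, model_path, os_name, coding=False):
    """Return the argv list for one run.

    `binary` and `model_path` are placeholders (assumptions, not from the post).
    """
    ctx = "150000" if coding else "50000"  # -c 150000 was used for the coding test
    return [binary, "-m", model_path, *COMMON_FLAGS, "-c", ctx, *FIT_FLAGS[os_name]]


def build_env(base_env=None):
    """Copy the environment and apply the device order from the post."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = "1,2,0"
    return env


if __name__ == "__main__":
    # Hypothetical binary name and model path, for display only.
    print(shlex.join(build_command("llama-server", "model.gguf", "linux")))
```

The returned list and environment can be passed to whatever process launcher you prefer; keeping the shared flags in one place makes it harder for the two systems to drift apart between runs.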

Results (PP is prompt processing, TG is token generation, both in tokens per second):

- Qwen 3.5 122B: PP 300, TG 28 on Windows; PP 290, TG 28.5 on Linux

- Qwen 3.5 397B: PP 140, TG 16 on Windows; PP 150, TG 15.2 on Linux

- MiniMax 2.7: PP 220, TG 17 on Windows; PP 230, TG 16 on Linux
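To put the figures above in relative terms, here is a small sketch that computes the Linux-versus-Windows difference per model. The numbers are copied from the list; the helper names are mine, and treating the values as tokens per second is an assumption.

```python
# Throughput figures copied from the results list: (PP, TG) per OS.
RESULTS = {
    "Qwen 3.5 122B": {"windows": (300.0, 28.0), "linux": (290.0, 28.5)},
    "Qwen 3.5 397B": {"windows": (140.0, 16.0), "linux": (150.0, 15.2)},
    "MiniMax 2.7": {"windows": (220.0, 17.0), "linux": (230.0, 16.0)},
}


def pct_diff(linux, windows):
    """Linux relative to Windows, in percent (positive means Linux is faster)."""
    return (linux - windows) / windows * 100.0


def compare(results):
    """Map each model to its (PP difference, TG difference) in percent."""
    rows = {}
    for model, r in results.items():
        pp_w, tg_w = r["windows"]
        pp_l, tg_l = r["linux"]
        rows[model] = (pct_diff(pp_l, pp_w), pct_diff(tg_l, tg_w))
    return rows


for model, (pp, tg) in compare(RESULTS).items():
    print(f"{model}: PP {pp:+.1f}%, TG {tg:+.1f}%")
```

On these figures every gap stays within about 7% in either direction, and neither OS is consistently ahead: Linux leads on PP for two of the three models, while Windows leads on TG for two of the three.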

All tests were run 4 times each, across the following tasks:

  1. A brief article summary with 8k tokens of prompt processing.
  2. Translating a portion of a book from Chinese — 20k tokens of prompt processing.
  3. A Java test: deliberate errors were introduced in two classes, with a total of 85k tokens of prompt processing. The percentage results were the same across all models.

WSL turned out to be the slowest. I tested it with Qwen 3.5 397B only, and the speed dropped from PP 140, TG 16 on native Windows to PP 110, TG 13.5.
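The size of the WSL penalty can be worked out from those two pairs of numbers. This is a rough sketch: the baseline is taken to be the native Windows run, since its figures match the Windows result for the same model, and the function name is mine.

```python
def slowdown_pct(baseline, measured):
    """Percentage drop of `measured` relative to `baseline`."""
    return (baseline - measured) / baseline * 100.0


# Qwen 3.5 397B figures from the post (assumed baseline: native Windows).
NATIVE = {"PP": 140.0, "TG": 16.0}
WSL = {"PP": 110.0, "TG": 13.5}

drops = {metric: slowdown_pct(NATIVE[metric], WSL[metric]) for metric in NATIVE}

for metric, drop in drops.items():
    print(f"{metric}: {drop:.1f}% slower under WSL")
```

That works out to roughly a 21% loss in prompt processing and roughly 16% in generation, well outside the few-percent spread between native Windows and Linux.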

I’ve laid out the exact llama.cpp launch parameters, so anyone can reproduce the results on their own hardware. Of course, everyone’s setup is different, but I would expect the performance ratio to hold for MoE models with hybrid CPU+GPU offloading.

Running models this large also doesn’t require a ton of space, massive power draw, or the other things people often list. Measured at the wall, the system drew only 550–600 W while running the 397B model. I also attached a photo of the PC; in a closed case, air convection is better with 140 mm fans.

https://preview.redd.it/nb4i22ya3g4h1.jpg?width=3000&format=pjpg&auto=webp&s=2b259fcd089c0a4bb1c92a4a077bbfbae4d2b036

https://preview.redd.it/4fxd51ya3g4h1.jpg?width=4000&format=pjpg&auto=webp&s=7a3d67c87139f86d774fe4bd1942d39601624358

submitted by /u/Far-Usual5771