PSA: Test your "threads" argument in llama.cpp (+80% performance in my case)
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
When GPT-OSS 120B was released last year, I played around with it and tried to maximize its performance. One thing many people pointed out was that on hybrid CPUs (Performance + Efficiency cores) you should use only the P-cores, via the "--threads" argument and taskset/affinity. Back then I set that model up on my friend's 14700K, and yeah, limiting threads to 8 (because it has 8 P-cores) increased performance. So I've kept using that setting and recommending it ever since.
Today I was playing around with MTP draft settings on Gemma 4 26B A4B QAT and randomly thought "Let's try increasing the thread count". My CPU (250K Plus) has 18 cores (6 performance + 12 efficiency).
The performance uplift was so big that I wrote a simple script just to be sure (a simple prompt to write PHP code for WordPress, same settings apart from the threads argument, same seed, 1 warmup run, then 5 runs to reduce error). Here are the results:
threads  runs  min_tok/s  mean_tok/s  max_tok/s
-------  ----  ---------  ----------  ---------
6        5     48,938     49,144      49,451
12       5     61,329     62,938      67,614
16       5     87,877     88,765      89,126
18       5     64,154     66,478      67,373

Yea. Casual +80% performance uplift by using 16 threads instead of 6. YEA I ALSO DIDN'T BELIEVE THAT IT BECAME SO FAST THAT'S WHY I'VE MADE THAT BENCH SCRIPT TO CONFIRM.
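For anyone who wants to check the +80% figure, here is a minimal Python sketch of the arithmetic. It assumes the table uses decimal commas (so 88,765 means 88.765 tok/s), and the names `mean_tok_s` and `uplift_percent` are made up for this illustration; they are not part of the original bench script.

```python
# Mean tok/s per thread count, copied from the table above.
# Assumption: the table uses decimal commas, so "49,144" is 49.144 tok/s.
mean_tok_s = {6: 49.144, 12: 62.938, 16: 88.765, 18: 66.478}


def uplift_percent(value: float, baseline: float) -> float:
    """Relative change of `value` over `baseline`, in percent."""
    return (value / baseline - 1.0) * 100.0


baseline = mean_tok_s[6]  # the 6-thread (P-cores only) run is the reference
for threads, value in mean_tok_s.items():
    change = uplift_percent(value, baseline)
    print(f"{threads:>2} threads: {value:.3f} tok/s ({change:+.1f}% vs 6 threads)")
```

The 16-thread row comes out at roughly +80% over the 6-thread baseline, which matches the number in the title.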
In the 6-thread test the process was pinned to the P-cores with the /affinity argument, but the result was the same as without pinning, so maybe the Thread Director on Arrow Lake is better than on Raptor Lake (the 14700K I tested previously).
Curiously, performance drops with 18 threads (all cores), but I don't see any throttling; every core still runs at full boost, so a bottleneck must be showing up somewhere else. If anybody knows where, please drop it in the comments.
Config:
Intel 250K Plus + 64GB 6400MT/s + RTX 4070 SUPER 12GB with memory OC to 571GB/s + llama.cpp b9601
The command that gives me the best performance of everything I've tested so far (for example, I see many people use a spec draft of 3; for me, setting it to 2 increased performance on the QAT model, while 3 was fine on the non-QAT one):
llama-server -m models/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --model-draft models/mtp-gemma-4-26B-A4B-it-qat.gguf --alias gemma4-26b-a4b-qat-q4xl-mtp -c 131072 -np 1 -b 2048 -ub 512 --threads 16 -ngl 99 -ncmoe 18 -fa on --spec-type draft-mtp --spec-draft-ngl 99 --spec-draft-n-max 2 --cache-type-k-draft q8_0 --cache-type-v-draft q8_0 --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.0 --repeat-penalty 1.0

So if you have 12GB of VRAM like me, try the above command; the quant and the MTP model are from Unsloth. Of course tok/s will drop as the context fills up, but the percentage difference stays the same. This command may still not be perfect. After I wake up I'll retest every single assumption I had, because maybe I set other arguments wrong too lol
Check how performance scales on your CPU, because you may be missing nearly half of the performance like I was... Now I'm even more sad that Gemma 4 124B has not been released, because it would 100% be fast enough with that 16-thread setting. I would just put 32GB more RAM into this PC and it would be a perfect match :( :( :( :(
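If you want to run a similar sweep yourself, the procedure described above (1 warmup run, then 5 measured runs per thread count, reporting min/mean/max) could be sketched in Python roughly like this. This is not the script used for the table. `measure_tok_per_s` is a hypothetical placeholder for whatever you use to run one generation at a given "--threads" value and read back tok/s, and `sweep_threads` and `best_row` are names invented for this sketch.

```python
from statistics import mean
from typing import Callable, Dict, Iterable, List


def sweep_threads(
    thread_counts: Iterable[int],
    measure_tok_per_s: Callable[[int], float],  # hypothetical: one run -> tok/s
    warmup: int = 1,
    runs: int = 5,
) -> List[Dict[str, float]]:
    """Benchmark each thread count and return one summary row per count."""
    rows = []
    for threads in thread_counts:
        # Warmup runs are executed but their results are thrown away.
        for _ in range(warmup):
            measure_tok_per_s(threads)
        samples = [measure_tok_per_s(threads) for _ in range(runs)]
        rows.append({
            "threads": threads,
            "runs": runs,
            "min": min(samples),
            "mean": mean(samples),
            "max": max(samples),
        })
    return rows


def best_row(rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Pick the row with the highest mean tok/s."""
    return max(rows, key=lambda row: row["mean"])


if __name__ == "__main__":
    # Stand-in measurement so the sketch runs without llama.cpp installed.
    # Replace it with a function that really runs your model.
    fake_results = {6: 49.1, 12: 62.9, 16: 88.8, 18: 66.5}
    rows = sweep_threads([6, 12, 16, 18], lambda t: fake_results[t])
    for row in rows:
        print(f'{row["threads"]:>7} {row["runs"]:>4} '
              f'{row["min"]:>9.3f} {row["mean"]:>10.3f} {row["max"]:>9.3f}')
    print("best thread count:", best_row(rows)["threads"])
```

Keeping the measurement behind a plain callable means the sweep logic stays the same whether you time a llama-server request or some other benchmark. Only the thread count should change between runs, with the prompt, seed and all other settings fixed as in the test above.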
Sorry mods if I set an incorrect post flair, I have no idea which one I should use for this post.
Edit: This post assumes that you're using hybrid (CPU+GPU) inference like me, or pure CPU inference.