What's up with CPU inference these days?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
What are the best models, quants and llama.cpp versions/forks for CPU inference these days?
I have AVX2 but no AVX-512: Intel Core Ultra 7 165H, 64 GB RAM.
This seems to call for a large MoE model (plenty of RAM, but limited bandwidth and compute). Qwen3.6 35B A3B Q4_K_M on standard llama.cpp gets about 10 tokens per second (tps): usable in non-thinking mode, not usable in thinking mode.
Is this the best I can get or are there other options?
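For reference, here is a rough back-of-envelope sketch of the bandwidth reasoning above: if decoding is limited by reading the active weights once per token, the ceiling is roughly memory bandwidth divided by bytes read per token. The function name and all three inputs are assumptions for illustration. The post only gives "A3B" (about 3B active parameters); the 4.8 bits per weight for Q4_K_M and the 60 GB/s sustained bandwidth are placeholder guesses, not measurements from this machine.

```python
def decode_tps_upper_bound(active_params_billion: float,
                           bits_per_weight: float,
                           bandwidth_gb_s: float) -> float:
    """Rough ceiling on decode tokens/s, assuming weight reads dominate.

    Ignores KV cache traffic, attention compute, and any overhead, so
    real throughput should land below this number.
    """
    # Bytes of weights that must be streamed from RAM for each token.
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    # ASSUMED values, not measured: ~3B active params (from "A3B"),
    # ~4.8 bits/weight for Q4_K_M, ~60 GB/s sustained RAM bandwidth.
    moe = decode_tps_upper_bound(3.0, 4.8, 60.0)
    # Same quant and bandwidth, but if all 35B params were read per token.
    dense = decode_tps_upper_bound(35.0, 4.8, 60.0)
    print(f"MoE ceiling:   {moe:.1f} tps")
    print(f"Dense ceiling: {dense:.1f} tps")
```

Swapping in a measured bandwidth figure for your own RAM would make this more meaningful. The point is only that the ceiling scales with active parameters, not total parameters, which is why MoE looks attractive for CPU.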