r/LocalLLaMA

What's up on CPU inference these days?

What are the best models, quants and llama.cpp versions/forks for CPU inference these days?

I have AVX2 but no AVX-512: Intel Core Ultra 7 165H, 64 GB RAM.

This hardware seems to call for a large MoE model: plenty of RAM, but not much memory bandwidth or compute. Qwen3.6 35B A3B at Q4_K_M with standard llama.cpp gives about 10 tps, which is usable in non-thinking mode but not in thinking mode.
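To make the bandwidth reasoning concrete, here is a rough back-of-the-envelope sketch. It assumes token generation is purely memory-bandwidth bound, so every active weight is read from RAM once per token; it ignores KV cache reads, attention compute, and any overhead. The function name and every number in the example call are my own placeholders, not measurements: ~4.8 bits per weight is only an approximation for a 4-bit K-quant, and 60 GB/s is an assumed figure rather than the measured bandwidth of this laptop.

```python
def decode_tps_upper_bound(active_params_billion: float,
                           bits_per_weight: float,
                           bandwidth_gb_s: float) -> float:
    """Rough ceiling on generated tokens/s if decode is memory-bandwidth bound.

    Assumes each active parameter is read from RAM exactly once per token.
    """
    # Bytes of weights that must be streamed from memory for one token.
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    # Bandwidth divided by bytes per token gives tokens per second.
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    # Placeholder inputs (assumptions, not measurements):
    # 3B active parameters, ~4.8 bits/weight, 60 GB/s effective bandwidth.
    ceiling = decode_tps_upper_bound(3.0, 4.8, 60.0)
    print(f"theoretical ceiling: {ceiling:.1f} tokens/s")
```

If the real number lands well below a ceiling like this, the gap is what a different quant, fork, or thread setting could plausibly recover; if it is already close, only fewer active parameters or fewer bits per weight would help.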

Is this the best I can get or are there other options?

submitted by /u/ramendik