Cheapest way to run GLM 5.x locally on something that's not a unified memory system?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
This is primarily an exercise to find the possible options, obscure as they might be, for running at least a 4-bit quant (let's say roughly IQ4_XS).
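For a rough sense of what "fits" means here: the quantized weights alone take about parameters × bits-per-weight / 8 bytes. Below is a minimal sketch of that arithmetic. It assumes IQ4_XS averages about 4.25 bits per weight, which is llama.cpp's nominal figure; real GGUF files vary because some tensors are kept at higher precision. The parameter counts and the `headroom_gib` margin for KV cache and OS overhead are hypothetical placeholders, not real GLM 5.x figures.

```python
# Rough memory sizing for a quantized model (sketch, not a precise calculator).

IQ4_XS_BPW = 4.25  # assumption: nominal llama.cpp bits-per-weight for IQ4_XS


def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GiB.

    Ignores KV cache, activations, and runtime overhead, so real usage is higher.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billion: float, bits_per_weight: float,
         ram_gib: float, vram_gib: float, headroom_gib: float = 8.0) -> bool:
    """True if weights plus a hypothetical headroom fit in combined RAM + VRAM."""
    needed = weights_gib(params_billion, bits_per_weight) + headroom_gib
    return needed <= ram_gib + vram_gib


if __name__ == "__main__":
    # Hypothetical parameter counts; substitute the real size of the model you care about.
    for params in (230, 400, 700):
        size = weights_gib(params, IQ4_XS_BPW)
        ok = fits(params, IQ4_XS_BPW, ram_gib=128, vram_gib=20)
        print(f"{params}B @ IQ4_XS ~ {size:.1f} GiB, fits in 128 GiB RAM + 20 GiB VRAM: {ok}")
```

For a MoE model only the active experts are read per token, so capacity and speed are separate questions: this sketch only answers whether the weights fit, not how fast they run.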
Got a CPU-only setup? Please share your experience. A 56-core Sapphire Rapids ES with DDR5 might be an option.
Multi-GPU setups with partial or complete offloading? What's your performance like?
It's not limited to GLM 5.x, anything similarly sized is ok too for the scope of this discussion.
Personally, I'm running a 5900X + 128GB DDR4 + 7900 XT 20GB. The largest model I can run is MiniMax M2.7 from AesSedai at Q4_K_S: https://huggingface.co/AesSedai/MiniMax-M2.7-GGUF
For smaller stuff, it's still Qwen 3.6 27B at IQ4_XS from Unsloth/Bartowski.