Giving GLM-5.2 a spin locally on CPU only! (poor man's rig for big models)
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
This is the UD-Q2_K_XL quant. Hardware is a Dell PowerEdge R740.
I'm using ik_llama.cpp, which provides significant performance improvements over base llama.cpp for CPU-only inference. Unfortunately, we dual-CPU folks have to worry about NUMA nodes and cross-socket memory latency, which tanks performance, so I've isolated the run to a single node for both CPU cores and memory. That gives me 24 cores and 384 GB of node-local RAM to play with. I have the model weights and 1M context fully in RAM.
In basic chat, it's alright, all things considered: 4 to 5.5 tok/s generation with MTP drafting turned on. It gets progressively worse as context grows, of course, like when coding. I'm seeing about 3 tok/s as I start working with it in opencode. Speaking of which, here's the prompt I gave it; its output is in the screenshot:
So yeah, it's not really seriously usable on this hardware, of course, but I wanted to play with this beast of a model a bit locally. In coding, it really is giving frontier vibes. I'm just happy that we can actually run a model this strong on our own hardware, and it's got me excited for what's coming next!
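
For anyone curious what "isolating to a single node" can look like in practice, here is a minimal Python sketch. It is not necessarily how the run above was set up. It assumes Linux, where each NUMA node lists its CPUs in `/sys/devices/system/node/nodeN/cpulist`, and the helper names (`parse_cpulist`, `node_cpus`, `pin_to_node`) are made up for illustration. It only restricts CPU affinity. Binding memory to the same node is a separate step, usually done by launching the process under `numactl --cpunodebind=0 --membind=0`.

```python
from __future__ import annotations

import os


def parse_cpulist(cpulist: str) -> set[int]:
    """Expand a Linux cpulist string such as "0-3,8,10-11" into a set of CPU ids."""
    cpus: set[int] = set()
    for part in cpulist.strip().split(","):
        if not part:
            continue  # tolerate empty input or stray commas
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))  # ranges are inclusive
        else:
            cpus.add(int(part))
    return cpus


def node_cpus(node: int) -> set[int]:
    """Read the CPU ids that belong to one NUMA node from sysfs (Linux only)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        return parse_cpulist(f.read())


def pin_to_node(node: int) -> set[int]:
    """Restrict the current process to the CPUs of a single NUMA node.

    Child processes started afterwards inherit the affinity mask.
    This does NOT bind memory allocations to the node.
    """
    cpus = node_cpus(node)
    os.sched_setaffinity(0, cpus)  # pid 0 means "the calling process"
    return cpus
```

Calling `pin_to_node(0)` before spawning the inference server keeps its threads on node 0's cores. On a box like the one above, `node_cpus(0)` would be expected to return 24 ids if SMT is off, though the exact numbering depends on the BIOS and kernel.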