GLM 5.2 on consumer hardware
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I tried out the unsloth quants of GLM 5.2 on hardware that is still "consumer-ish":
32C Zen5 Threadripper Pro 9975WX, Asus WRX90E-SAGE-SE PCIe Gen5, 512GB DDR5 ECC RAM @ 4800 MT/s, dual RTX 5090.
This machine was put together pre-RAMpocalypse, when it was not exceedingly expensive compared to today's grotesque prices.
The quant I used was unsloth/GLM-5.2-GGUF, UD-Q5_K_S (492GB of weights).
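For a rough sense of how the 492GB of weights compares with the combined RAM and VRAM of this box, here is a back-of-the-envelope sketch. It assumes 32 GB of VRAM per RTX 5090 and treats every "GB" figure as the same unit; the variable names are only for illustration. It ignores KV cache, compute buffers, and whatever the OS needs, so the real headroom is smaller than what it prints.

```sh
# Figures from the post; 32 GB VRAM per RTX 5090 is an assumption added here.
weights_gb=492
ram_gb=512
gpus=2
vram_per_gpu_gb=32

# Total memory the weights can be spread across (system RAM + both GPUs).
total_gb=$(( ram_gb + gpus * vram_per_gpu_gb ))

# What is left for KV cache, buffers, and the OS once the weights are loaded.
headroom_gb=$(( total_gb - weights_gb ))

echo "RAM + VRAM: ${total_gb} GB, left after weights: ${headroom_gb} GB"
```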
I used a freshly compiled llama.cpp, built with:
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="120f" \
  -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_FORCE_MMQ=ON \
  -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_GRAPHS=ON -DGGML_CCACHE=OFF \
  -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=0
cmake --build build --config Release -j 64
and ran it with the following invocation:
CUDA_VISIBLE_DEVICES=0,1 numactl --physcpubind=0-31 --localalloc llama.cpp/build/bin/llama-server \
  --model ./GLM-5.2-UD-Q5_K_S-00001-of-00012.gguf \
  --temp 1.0 \
  --top-p 0.95 \
  --min-p 0.01 \
  --fit on --no-mmap --flash-attn on --ctx-size 32768 --no-warmup --prio 3 \
  --threads 32 --threads-batch 32 --numa isolate --log-verbosity 4 --split-mode layer --direct-io --jinja
With this I consistently get 12 t/s. I only tried chatting, no agentic stuff.
There is little to no variation in speed whether I omit or use the llama.cpp options on the last line; the same applies to the NUMA settings.
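To put a number on "little to no variation", one option is to compute tokens per second for each run and then the relative difference between two configurations. The following is a minimal sketch, assuming you note the generated token count and the elapsed seconds yourself. The helper names tokens_per_second and pct_diff are made up for illustration and are not part of llama.cpp, and the sample numbers are placeholders, not measurements from this machine.

```sh
# Hypothetical helpers, not part of llama.cpp.

# tokens_per_second N_TOKENS SECONDS -> generation rate with one decimal place
tokens_per_second() {
  awk -v n="$1" -v s="$2" 'BEGIN { printf "%.1f\n", n / s }'
}

# pct_diff A B -> how far rate A is from rate B, as a percentage of B
pct_diff() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) / b * 100 }'
}

# Placeholder input: 600 tokens generated in 50 seconds.
tokens_per_second 600 50
```

Running the same prompt once with and once without a given option, then feeding both rates to pct_diff, gives a percentage that is easier to compare than eyeballing two t/s readings.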