r/LocalLLaMA · June 25, 2026 · 1 min read

GLM 5.2 on consumer hardware

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I tried out the unsloth quants of GLM 5.2 on hardware that is still "consumer-ish":

32-core Zen 5 Threadripper Pro 9975WX, Asus WRX90E-SAGE-SE (PCIe Gen5), 512GB DDR5 ECC RAM @ 4800 MT/s, dual RTX 5090.

This machine was put together pre-RAMpocalypse, when it was not exceedingly expensive compared to today's grotesque prices.

The quant I used was unsloth/GLM-5.2-GGUF, UD-Q5_K_S (492GB of weights).
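
As a rough sanity check on why this quant fits at all, the memory budget can be sketched as below. The 32 GB per RTX 5090 comes from the card's spec, not from the post. Treating all sizes as the same GB unit and ignoring KV cache, compute buffers and OS overhead are simplifying assumptions, so the real headroom is smaller than what this prints.

```python
# Rough memory budget for the setup above (all figures in GB).
# Assumptions: 32 GB VRAM per RTX 5090; KV cache, compute buffers
# and OS overhead are ignored, so the real headroom is smaller.

WEIGHTS_GB = 492        # UD-Q5_K_S weights, as stated in the post
SYSTEM_RAM_GB = 512     # as stated in the post
VRAM_PER_GPU_GB = 32    # RTX 5090 spec (assumption, not from the post)
NUM_GPUS = 2


def memory_budget(weights, ram, vram_per_gpu, gpus):
    """Return a coarse breakdown of where the weights can live."""
    total_vram = vram_per_gpu * gpus
    total_memory = ram + total_vram
    return {
        "total_vram": total_vram,
        "total_memory": total_memory,
        # Memory left over once the weights are loaded somewhere.
        "headroom": total_memory - weights,
        # Weights that cannot fit in VRAM even if the GPUs held nothing else.
        "min_weights_in_ram": max(0, weights - total_vram),
    }


budget = memory_budget(WEIGHTS_GB, SYSTEM_RAM_GB, VRAM_PER_GPU_GB, NUM_GPUS)
for name, value in budget.items():
    print(f"{name}: {value} GB")
```

Under these assumptions at least 428 GB of the weights have to stay in system RAM, so most of the model necessarily lives on the CPU side.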

I used a freshly compiled llama.cpp, built with:

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="120f" \
  -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_FORCE_MMQ=ON \
  -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_GRAPHS=ON -DGGML_CCACHE=OFF \
  -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=0
cmake --build build --config Release -j 64

and ran it with the following invocation:

CUDA_VISIBLE_DEVICES=0,1 numactl --physcpubind=0-31 --localalloc llama.cpp/build/bin/llama-server \
  --model ./GLM-5.2-UD-Q5_K_S-00001-of-00012.gguf \
  --temp 1.0 \
  --top-p 0.95 \
  --min-p 0.01 \
  --fit on --no-mmap --flash-attn on --ctx-size 32768 --no-warmup --prio 3 \
  --threads 32 --threads-batch 32 --numa isolate --log-verbosity 4 --split-mode layer --direct-io --jinja

With this I consistently get 12 t/s. I only tried chatting, no agentic stuff.
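
To put that figure in context, here is a back-of-the-envelope bound from memory bandwidth. It assumes all eight memory channels of the WRX90 platform are populated (the post does not say) and uses the theoretical DDR5-4800 peak of 8 bytes per transfer per channel. Real sustained bandwidth is lower, and part of the model is read from VRAM rather than RAM, so this is a sketch and not a measurement.

```python
# Theoretical RAM bandwidth and the resulting per-token read budget.
# Assumptions: 8 populated memory channels, 64-bit (8-byte) channels,
# theoretical peak rate; sustained bandwidth will be lower in practice.

def peak_bandwidth_gbs(mega_transfers_per_s, channels, bytes_per_transfer=8):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return mega_transfers_per_s * bytes_per_transfer * channels / 1000


def max_gb_read_per_token(bandwidth_gbs, tokens_per_s):
    """Upper bound on data streamed from RAM for each generated token."""
    return bandwidth_gbs / tokens_per_s


bandwidth = peak_bandwidth_gbs(4800, channels=8)
per_token = max_gb_read_per_token(bandwidth, tokens_per_s=12.0)

print(f"peak RAM bandwidth: {bandwidth:.1f} GB/s")
print(f"max RAM read per token at 12 t/s: {per_token:.1f} GB")
```

Under these assumptions, at most roughly 25.6 GB can be streamed from RAM per token at 12 t/s, far below the 492 GB of total weights. Only a fraction of the weights can therefore be read for each token, which is what you would expect if the model is a mixture-of-experts design.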

Including or omitting the llama.cpp options on the last line of the invocation makes little to no difference in speed; the same applies to the NUMA settings.

submitted by /u/phwlarxoc