GLM-5.2 (744B, 2-bit) at 7.3 tok/s on 4×3090 + 192GB — and why IQ1_M wasn't any faster
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
TLDR: For the first time, I feel relief that they could shut down the cloud services and I would be ok. I got my 4th 3090 and then unsloth dropped the Q2 and Q1. I wrote nothing else here its from CC, so it might be wrong. GLM-5.2 UD-IQ2_M runs across 4×3090 + RAM expert offload at ~7.3 tok/s. Two decode A/Bs: halving the quant (IQ2->IQ1) did NOTHING; going 6->12 CPU threads gave +22%. The offloaded-expert decode is bound by CPU compute, not memory bandwidth.
## Hardware
- Ryzen 9900X, 192GB DDR5-5600
- 4× RTX 3090 (1 Ti + 3 FE), 96GB total. One card sits on a PCIe x1 link (chipset-lane tradeoff to keep the boot NVMe at x4).
## Config
- unsloth GLM-5.2 UD-IQ2_M, 223GB on disk (744B total / 40B active)
- llama.cpp master. Arch is glm-dsa (MLA + DeepSeek sparse attn + nextn). Older releases won't load it — needs a current build.
- ~83GB across the 4 GPUs (19 of 75 MoE layers' experts) + ~166GB resident RAM (the other 56 layers, computed on CPU). q8_0 KV is basically free thanks to MLA.
## --n-cpu-moe will OOM you
With -sm layer, the kept-on-GPU experts all land on the LAST card and it tried to alloc 54GB on a 24GB GPU. Fix: place experts per-device explicitly —
-ot "blk\.(3|4|5)\.ffn_(gate|up|down)_exps=CUDA0" ... CUDA1/2/3, with a =CPU catch-all last. Spread evenly; the card holding output/embeddings runs tightest.
## What actually moves decode (two A/Bs, one variable each)
- IQ1_M (213GB) vs IQ2_M (238GB), same split: 7.30 vs 7.29 tok/s. Identical.
- 6 threads vs 12 threads, same everything: 5.83 vs 7.14 tok/s. +22%.
Decode is bound by the CPU compute of the active offloaded experts (dequant + matmul), NOT bandwidth. Smaller quant = same matmul shape = same FLOPs = no gain. More cores = gain, up to your physical core count. (Prefill was flat at 135 tok/s across threads -- not core-bound.) The levers that work: more cores, more experts on GPU (fewer offloaded layers). Quant size isn't one.
## MLA helps long ctx but doesn't make 1M free
KV is ~6GB at 128K, but scales linearly: ~50GB at 1M (q8), ~29GB (q4_1). With ~15GB
free VRAM, 1M is out. q4_1 gets ~360K, q8_0 ~200K. DSA shrinks attention COMPUTE at
long ctx, not the cache size.
## The x1 card: useless for splits, perfect for a sidecar
A little bonus if you are ok with 5 toks instead of 7, you can do this with a Q1 across 3 cards and it frees a gpu. A x1 link kills tensor/layer split, but a single-card model never crosses the link at inference — x1 only costs load time. Dropped GLM to 3 cards and put a Qwen3.6-35B-A3B on the x1 card alone: 116 tok/s, full speed.
## No MTP yet
glm-dsa ships a nextn/MTP head but it's an unimplemented stub in llama.cpp (loads the tensors, builds no graph — only Qwen has MTP merged). ngram self-speculative is the fallback; helps on code/structured output, not prose.
## Biggest real speed lever: turn thinking off
Decode rate is fixed, but thinking burns tokens. Same prompt, same correct answer: non-thinking 13.5s vs reasoning_effort high/max 60-80s — ~5-6× wall-clock. Per-request dial; default it off, opt in for hard problems.
## Cost
192GB DDR5 + 4 used 3090s + a 9900X. No cloud, no subscriptions. Running cost is electricity (cards capped at 200W each).
This is the first validated config (even 5 layers/card, ubatch 512) — simplest to explain:
#!/usr/bin/env bash
# GLM-5.2 UD-IQ2_M (2-bit) on 4x 24GB GPUs + ~190GB RAM, llama.cpp expert offload.
# Arch is glm-dsa -> needs a CURRENT llama.cpp build. Older releases won't load it.
#
# Build llama.cpp master first (static avoids RUNPATH headaches):
# git clone https://github.com/ggml-org/llama.cpp
# cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON \
# -DCMAKE_CUDA_ARCHITECTURES=86 # 86=Ampere/3090; set to your arch
# cmake --build llama.cpp/build -j --target llama-server
#
# Download: hf download unsloth/GLM-5.2-GGUF --include "*UD-IQ2_M*" --local-dir GLM-5.2
SERVER=./llama.cpp/build/bin/llama-server
MODEL=./GLM-5.2/UD-IQ2_M/GLM-5.2-UD-IQ2_M-00001-of-00006.gguf
# THE KEY BIT: distribute on-GPU experts EXPLICITLY across cards, rest to CPU.
# DON'T use --n-cpu-moe here -- with -sm layer it dumps all kept-on-GPU experts
# onto the LAST card and OOMs (it tried 54GB on a 24GB card). Instead, pin ~5 MoE
# layers' experts per card via -ot, and send the rest to CPU with the catch-all.
# Tune the layer counts to your VRAM: more on GPU = faster (fewer CPU round-trips),
# but the card holding output+embeddings (CUDA0) runs tightest -- back it off if it OOMs.
# blk.0-2 are dense (no experts); MoE layers are 3-77.
CUDA_VISIBLE_DEVICES=0,1,2,3 CUDA_DEVICE_ORDER=PCI_BUS_ID "$SERVER" \
--model "$MODEL" \
--host 0.0.0.0 --port 8001 \
--ctx-size 131072 \
--n-predict -1 \
--n-gpu-layers 999 \
--split-mode layer --tensor-split 1,1,1,1 \
-ot "blk\.(3|4|5|6|7)\.ffn_(gate|up|down)_exps\.=CUDA0" \
-ot "blk\.(8|9|10|11|12)\.ffn_(gate|up|down)_exps\.=CUDA1" \
-ot "blk\.(13|14|15|16|17)\.ffn_(gate|up|down)_exps\.=CUDA2" \
-ot "blk\.(18|19|20|21|22)\.ffn_(gate|up|down)_exps\.=CUDA3" \
-ot "ffn_(gate|up|down)_exps\.=CPU" \
--threads 12 \
--batch-size 2048 --ubatch-size 512 \
--flash-attn on \
--cache-type-k q8_0 --cache-type-v q8_0 \
--no-mmap \
--jinja \
--reasoning off # default non-thinking (~5-6x faster wall-clock);
# callers opt in per-request with
# chat_template_kwargs:{"enable_thinking":true}
Notes for whoever reads it:
- 20 of 75 MoE layers on GPU, 55 on CPU → ~83 GB VRAM + ~166 GB RAM, ~7.3 tok/s decode.
- Generic paths (./llama.cpp, ./GLM-5.2) so they edit two lines and go.
- The -ot block is the whole point — that's the OOM-avoiding trick and the comment explains the tuning. The catch-all =CPU must come last.
- I dropped the ngram/--spec-type line (niche, optional) and all my env-var scaffolding.
[link] [comments]
More from r/LocalLLaMA
-
Researchers trained a Deep Research agent with 32 H100s and open-sourced everything
Jun 19
-
GLM-5.2 can now run locally in llama.cpp and Unsloth Studio.
Jun 19
-
[NEW MODEL] SupraLabs just released SupraVL-Nano-900k, a Vision-Language Model built entirely from scratch!
Jun 19
-
SETI @ Home aka distributed LLM inference engine. Does this exist and if not, should we make one?
Jun 19
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.