r/LocalLLaMA · · 5 min read

GLM-5.2 (744B, 2-bit) at 7.3 tok/s on 4×3090 + 192GB — and why IQ1_M wasn't any faster

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

TLDR: For the first time, I feel relief that they could shut down the cloud services and I would be ok. I got my 4th 3090 and then unsloth dropped the Q2 and Q1. I wrote nothing else here its from CC, so it might be wrong. GLM-5.2 UD-IQ2_M runs across 4×3090 + RAM expert offload at ~7.3 tok/s. Two decode A/Bs: halving the quant (IQ2->IQ1) did NOTHING; going 6->12 CPU threads gave +22%. The offloaded-expert decode is bound by CPU compute, not memory bandwidth.

## Hardware

- Ryzen 9900X, 192GB DDR5-5600

- 4× RTX 3090 (1 Ti + 3 FE), 96GB total. One card sits on a PCIe x1 link (chipset-lane tradeoff to keep the boot NVMe at x4).

## Config

- unsloth GLM-5.2 UD-IQ2_M, 223GB on disk (744B total / 40B active)

- llama.cpp master. Arch is glm-dsa (MLA + DeepSeek sparse attn + nextn). Older releases won't load it — needs a current build.

- ~83GB across the 4 GPUs (19 of 75 MoE layers' experts) + ~166GB resident RAM (the other 56 layers, computed on CPU). q8_0 KV is basically free thanks to MLA.

## --n-cpu-moe will OOM you

With -sm layer, the kept-on-GPU experts all land on the LAST card and it tried to alloc 54GB on a 24GB GPU. Fix: place experts per-device explicitly —

-ot "blk\.(3|4|5)\.ffn_(gate|up|down)_exps=CUDA0" ... CUDA1/2/3, with a =CPU catch-all last. Spread evenly; the card holding output/embeddings runs tightest.

## What actually moves decode (two A/Bs, one variable each)

- IQ1_M (213GB) vs IQ2_M (238GB), same split: 7.30 vs 7.29 tok/s. Identical.

- 6 threads vs 12 threads, same everything: 5.83 vs 7.14 tok/s. +22%.

Decode is bound by the CPU compute of the active offloaded experts (dequant + matmul), NOT bandwidth. Smaller quant = same matmul shape = same FLOPs = no gain. More cores = gain, up to your physical core count. (Prefill was flat at 135 tok/s across threads -- not core-bound.) The levers that work: more cores, more experts on GPU (fewer offloaded layers). Quant size isn't one.

## MLA helps long ctx but doesn't make 1M free

KV is ~6GB at 128K, but scales linearly: ~50GB at 1M (q8), ~29GB (q4_1). With ~15GB

free VRAM, 1M is out. q4_1 gets ~360K, q8_0 ~200K. DSA shrinks attention COMPUTE at

long ctx, not the cache size.

## The x1 card: useless for splits, perfect for a sidecar

A little bonus if you are ok with 5 toks instead of 7, you can do this with a Q1 across 3 cards and it frees a gpu. A x1 link kills tensor/layer split, but a single-card model never crosses the link at inference — x1 only costs load time. Dropped GLM to 3 cards and put a Qwen3.6-35B-A3B on the x1 card alone: 116 tok/s, full speed.

## No MTP yet

glm-dsa ships a nextn/MTP head but it's an unimplemented stub in llama.cpp (loads the tensors, builds no graph — only Qwen has MTP merged). ngram self-speculative is the fallback; helps on code/structured output, not prose.

## Biggest real speed lever: turn thinking off

Decode rate is fixed, but thinking burns tokens. Same prompt, same correct answer: non-thinking 13.5s vs reasoning_effort high/max 60-80s — ~5-6× wall-clock. Per-request dial; default it off, opt in for hard problems.

## Cost

192GB DDR5 + 4 used 3090s + a 9900X. No cloud, no subscriptions. Running cost is electricity (cards capped at 200W each).

This is the first validated config (even 5 layers/card, ubatch 512) — simplest to explain:

#!/usr/bin/env bash

# GLM-5.2 UD-IQ2_M (2-bit) on 4x 24GB GPUs + ~190GB RAM, llama.cpp expert offload.

# Arch is glm-dsa -> needs a CURRENT llama.cpp build. Older releases won't load it.

#

# Build llama.cpp master first (static avoids RUNPATH headaches):

# git clone https://github.com/ggml-org/llama.cpp

# cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON \

# -DCMAKE_CUDA_ARCHITECTURES=86 # 86=Ampere/3090; set to your arch

# cmake --build llama.cpp/build -j --target llama-server

#

# Download: hf download unsloth/GLM-5.2-GGUF --include "*UD-IQ2_M*" --local-dir GLM-5.2

SERVER=./llama.cpp/build/bin/llama-server

MODEL=./GLM-5.2/UD-IQ2_M/GLM-5.2-UD-IQ2_M-00001-of-00006.gguf

# THE KEY BIT: distribute on-GPU experts EXPLICITLY across cards, rest to CPU.

# DON'T use --n-cpu-moe here -- with -sm layer it dumps all kept-on-GPU experts

# onto the LAST card and OOMs (it tried 54GB on a 24GB card). Instead, pin ~5 MoE

# layers' experts per card via -ot, and send the rest to CPU with the catch-all.

# Tune the layer counts to your VRAM: more on GPU = faster (fewer CPU round-trips),

# but the card holding output+embeddings (CUDA0) runs tightest -- back it off if it OOMs.

# blk.0-2 are dense (no experts); MoE layers are 3-77.

CUDA_VISIBLE_DEVICES=0,1,2,3 CUDA_DEVICE_ORDER=PCI_BUS_ID "$SERVER" \

--model "$MODEL" \

--host 0.0.0.0 --port 8001 \

--ctx-size 131072 \

--n-predict -1 \

--n-gpu-layers 999 \

--split-mode layer --tensor-split 1,1,1,1 \

-ot "blk\.(3|4|5|6|7)\.ffn_(gate|up|down)_exps\.=CUDA0" \

-ot "blk\.(8|9|10|11|12)\.ffn_(gate|up|down)_exps\.=CUDA1" \

-ot "blk\.(13|14|15|16|17)\.ffn_(gate|up|down)_exps\.=CUDA2" \

-ot "blk\.(18|19|20|21|22)\.ffn_(gate|up|down)_exps\.=CUDA3" \

-ot "ffn_(gate|up|down)_exps\.=CPU" \

--threads 12 \

--batch-size 2048 --ubatch-size 512 \

--flash-attn on \

--cache-type-k q8_0 --cache-type-v q8_0 \

--no-mmap \

--jinja \

--reasoning off # default non-thinking (~5-6x faster wall-clock);

# callers opt in per-request with

# chat_template_kwargs:{"enable_thinking":true}

Notes for whoever reads it:

- 20 of 75 MoE layers on GPU, 55 on CPU → ~83 GB VRAM + ~166 GB RAM, ~7.3 tok/s decode.

- Generic paths (./llama.cpp, ./GLM-5.2) so they edit two lines and go.

- The -ot block is the whole point — that's the OOM-avoiding trick and the comment explains the tuning. The catch-all =CPU must come last.

- I dropped the ngram/--spec-type line (niche, optional) and all my env-var scaffolding.

submitted by /u/Important_Quote_1180
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA