Got GLM-5.2 + MTP speculative decode running on 4× DGX Spark (GB10) — and the build piece the public recipe is missing
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
TL;DR: the recipe's image-build mods aren't actually public – I reconstructed them from the public kernels (with Claude) – and you have to build vLLM at the author's exact pinned ref or the real AWQ weights crash on load. Running now at ~9.4 tok/s on my own 4× GB10.
Saw a link on X to CosmicRaisins' GLM-5.2 stack for 4× GB10: vLLM TP=4, MTP speculative decode, ported sparse-MLA Triton kernels (the Hopper-only _flashmla_C path doesn't exist on sm_121), and a data-free 15% expert prune so the AWQ-INT4 weights fit. Great work. I'd actually tried vanilla vLLM for GLM-5.2 on these boxes months ago and it fell over at around 512 tokens of context, so I'd been serving it on llama.cpp RPC (~5 tok/s) instead – a working sparse-MLA + MTP path was exactly what I'd been after.
Porting it to my own 4-node Spark cluster, I hit two walls worth sharing:
The image isn't reproducible from the public repo. The README points at two vLLM mods in a spark-vllm-docker fork, but they aren't actually published (only the kernels are). So I reconstructed them from the public kernels as a single build-recon-image.sh that bakes the kernels in, patches deep_gemm.py (routes the 3 DSA fns to the sm12x_* fallbacks on the sm_120/121 family, before the _missing() gate) and sparse_attn_indexer.py (drops the has_deep_gemm gate on sm12x), auto-applies the flashmla→Triton monkeypatch, and runs pip install b12x==0.23.0. The wiring validates with a quick import check on the GPU.
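For anyone trying to picture the deep_gemm.py change, here is a minimal sketch of the routing idea only, not the actual patch. The names `route`, `native`, `fallbacks`, and `dsa_fn` are hypothetical stand-ins; the only thing taken from the post is the ordering: check for the sm_120/121 family first, and only then fall through to a "native kernel missing" check.

```python
from typing import Callable, Dict, Tuple

Capability = Tuple[int, int]  # (major, minor), e.g. (12, 1) for sm_121


def is_sm12x(capability: Capability) -> bool:
    """Return True for the sm_120/sm_121 family (compute capability 12.x)."""
    return capability[0] == 12


def route(name: str, capability: Capability,
          native: Dict[str, Callable], fallbacks: Dict[str, Callable]) -> Callable:
    """Pick the sm12x fallback before any 'is the native kernel missing?' check."""
    if is_sm12x(capability) and name in fallbacks:
        return fallbacks[name]
    if name not in native:
        raise RuntimeError(f"{name}: no native kernel and no sm12x fallback")
    return native[name]


# Hypothetical stand-ins; the real functions live in the patched deep_gemm.py.
native = {"dsa_fn": lambda: "native"}
fallbacks = {"dsa_fn": lambda: "sm12x_fallback"}

print(route("dsa_fn", (12, 1), native, fallbacks)())  # sm12x_fallback
print(route("dsa_fn", (9, 0), native, fallbacks)())   # native
```

On a real node the capability tuple would come from `torch.cuda.get_device_capability()`, which returns `(major, minor)`.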
The base vLLM ref really matters. Building on a newer vLLM than the author's pinned commit made the real AWQ weights crash at process_weights_after_loading (_k_scale.fill_ → async CUDA error: invalid argument). Dummy weights loaded fine, so it was specific to real-weight processing. Rebuilding vLLM at the author's exact ref fixed it instantly. If you port this: pin the ref.
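A cheap guard against building on the wrong ref is to compare the checkout's HEAD against the pinned commit before kicking off the image build. This is a sketch under my own assumptions: `matches_pin` and `git_head` are hypothetical helpers, and the pinned hash is deliberately left as a placeholder to be filled in from the author's recipe.

```python
import subprocess


def matches_pin(head: str, pinned: str) -> bool:
    """True if the full HEAD hash starts with the (possibly abbreviated) pinned hash."""
    head, pinned = head.strip().lower(), pinned.strip().lower()
    # Require at least 7 hex chars so an empty or tiny prefix never matches.
    return len(pinned) >= 7 and head.startswith(pinned)


def git_head(repo_dir: str) -> str:
    """Return the full commit hash that repo_dir currently has checked out."""
    return subprocess.check_output(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"], text=True
    ).strip()


# Usage (placeholder values, not the real pin):
# PINNED = "<commit hash from the author's recipe>"
# if not matches_pin(git_head("./vllm"), PINNED):
#     raise SystemExit("vLLM checkout is not at the pinned ref; refusing to build")
```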
Other port notes: you can skip the 378 GB weight download – the 15% prune is deterministic from the cyankiwi AWQ base via the repo's awq_surgery.py (~20 min, pure safetensors surgery). On nodes with less free memory, gpu-memory-utilization 0.93 trips the boot guard – drop to 0.90 + lower max-model-len. No shared FS? NFS-export the weights from the head. And set the RoCE HCA/GID-index for your fabric.
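To see why 0.93 can trip the guard while 0.90 passes, here is a back-of-envelope sketch. It assumes the boot guard simply requires free memory at startup to be at least gpu-memory-utilization × total memory; that is my reading of the behaviour, not a quote of the vLLM source. The 120/110 GiB figures are made up for illustration, so plug in what your node actually reports.

```python
def requested_gib(total_gib: float, utilization: float) -> float:
    """Memory the server would try to claim at the given utilization."""
    return total_gib * utilization


def passes_boot_guard(free_gib: float, total_gib: float, utilization: float) -> bool:
    """Assumed guard: free memory at startup must cover utilization * total."""
    return free_gib >= requested_gib(total_gib, utilization)


# Illustrative numbers only (not measured on GB10).
total_gib, free_gib = 120.0, 110.0
for util in (0.93, 0.90):
    ok = passes_boot_guard(free_gib, total_gib, util)
    print(f"util={util:.2f} needs {requested_gib(total_gib, util):.1f} GiB -> {'ok' if ok else 'refused'}")
```

Lowering the utilization shrinks what is left for KV cache, which is why max-model-len has to come down with it.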
Result: serving fine, coherent output, ~9.4 tok/s decode on a single RoCE rail – roughly 2× the llama.cpp fallback it replaced (MTP acceptance ~2.8/4). The author gets ~20 with dual-rail – the inter-node allreduce bandwidth is the decode bottleneck, so the 2nd rail is the ~2× lever (still debugging NCCL dual-rail GID resolution on mine).
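As a sanity check on the "roughly 2×" claims, the ratios work out like this from the approximate figures quoted above (~5, ~9.4, and ~20 tok/s); the `speedup` helper is just for illustration.

```python
def speedup(new_tok_s: float, old_tok_s: float) -> float:
    """Ratio of two decode throughputs, rounded to two decimals."""
    return round(new_tok_s / old_tok_s, 2)


llama_cpp_rpc = 5.0   # ~tok/s, llama.cpp RPC fallback
single_rail = 9.4     # ~tok/s, this port on one RoCE rail
dual_rail = 20.0      # ~tok/s, the author's dual-rail figure

print(speedup(single_rail, llama_cpp_rpc))  # 1.88
print(speedup(dual_rail, single_rail))      # 2.13
```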
Full notes + my fork + the reconstruction script: https://github.com/anvarazizov/glm-5.2-gb10
Huge credit to CosmicRaisins for the kernels/prune/MTP work — this is just the integration glue to make it portable. Would love for the maintainer to vendor the build script so nobody else has to reverse-engineer it.