My GLM-5.2-FP8 HGX-H200 SGLang docker deploy config
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Halo lads. Name says it all. Right now, after 1-2 hours of experimenting, this is maximum i could squeeze out current hardware
No, im not rich. Its my companies GPUs, just sharing my experience
docker run -d \ --name glm-5.2-sglang \ --restart unless-stopped \ --gpus all \ --shm-size 32g \ --ipc=host \ -v /data/models/glm-5.2:/model \ -p 30000:30000 \ lmsysorg/sglang:latest \ sglang serve \ --model-path /model \ --served-model-name glm-5.2 \ --host 0.0.0.0 \ --port 30000 \ --tp 8 \ --mem-fraction-static 0.83 \ --enable-metrics \ --reasoning-parser glm45 \ --tool-call-parser glm47 \ --cuda-graph-max-bs 256 Cookbook`s flags, i did not use:
- DP - limits context to 120k~ on each shard. I turned off everything related to it, just pure TP
- moe-a2a-backend deepep - idk how, but it actually slows down token/s. 50t/s~ on vs 70t/s~ off
- mem-fraction-static 0.83 - if you try to use more, OOM guaranteed
result is 262k context and 70t/s
So ye, that`s it. If you have any questions feel free to ask, i`ll try to answer
btw vLLM official recipes wont work for H200. i guess, its because of kv cache fp8 quant on dsv3 architecture
[link] [comments]
More from r/LocalLLaMA
-
Why Dario is on fire: lesson from dotcom bubble.
Jun 30
-
Been running Qwen3.6-27B through a 3-critic harness. The harness matters more than I thought
Jun 30
-
I Hate Dario Amodei, and everything he stands for.
Jun 29
-
Introducing LongCat-2.0 - , a large-scale MoE language model with 1.6 trillion total parameters and ~48 billion activated per token. This was the stealth model that was on Openrouter under the name 'owl-alpha'.
Jun 29
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.