r/LocalLLaMA · June 17, 2026 · 1 min read

My GLM-5.2-FP8 HGX-H200 SGLang docker deploy config

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Halo lads. Name says it all. Right now, after 1-2 hours of experimenting, this is maximum i could squeeze out current hardware

No, im not rich. Its my companies GPUs, just sharing my experience

docker run -d \ --name glm-5.2-sglang \ --restart unless-stopped \ --gpus all \ --shm-size 32g \ --ipc=host \ -v /data/models/glm-5.2:/model \ -p 30000:30000 \ lmsysorg/sglang:latest \ sglang serve \ --model-path /model \ --served-model-name glm-5.2 \ --host 0.0.0.0 \ --port 30000 \ --tp 8 \ --mem-fraction-static 0.83 \ --enable-metrics \ --reasoning-parser glm45 \ --tool-call-parser glm47 \ --cuda-graph-max-bs 256

Cookbook`s flags, i did not use:

DP - limits context to 120k~ on each shard. I turned off everything related to it, just pure TP
moe-a2a-backend deepep - idk how, but it actually slows down token/s. 50t/s~ on vs 70t/s~ off
mem-fraction-static 0.83 - if you try to use more, OOM guaranteed

result is 262k context and 70t/s

So ye, that`s it. If you have any questions feel free to ask, i`ll try to answer

btw vLLM official recipes wont work for H200. i guess, its because of kv cache fp8 quant on dsv3 architecture

submitted by /u/Soft-Wedding4595
[link] [comments]

Discussion (0)

No comments yet. Sign in and be the first to say something.

Discussion (0)

More from r/LocalLLaMA