I just bought an Asus Ascent (Nvidia GB10, DGX) and it is slower than my Ryzen AI Max
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
It is supposed to be 2-4x faster, but I am only getting 6 tk/s on Gemma4-31B. What am I doing wrong?
- Inference engine: llama.cpp, latest as of 15 May 2026, built myself via https://ggml.ai/dgx-spark.sh
- Tested models:
  - Step3.5-Apex-I-Quality: DGX 27 tk/s, AI-Max 30 tk/s
  - gemma-4-31B-it-UD-Q8_K_XL: DGX 6.19 tk/s, AI-Max 7.10 tk/s
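To put the gap in numbers, here is a minimal sketch that computes the DGX-to-AI-Max ratio using only the throughput figures reported above. The variable and function names are illustrative, and it assumes the first Gemma figure (6.19 tk/s) is the DGX result.

```python
# Throughput reported in this post, in tokens per second: (DGX, AI-Max).
# Assumption: the first Gemma number (6.19) is the DGX measurement.
reported = {
    "Step3.5-Apex-I-Quality": (27.0, 30.0),
    "gemma-4-31B-it-UD-Q8_K_XL": (6.19, 7.10),
}


def speedup(dgx_tps: float, ai_max_tps: float) -> float:
    """Return DGX throughput relative to AI-Max (1.0 means equal speed)."""
    return dgx_tps / ai_max_tps


ratios = {model: speedup(*tps) for model, tps in reported.items()}

for model, ratio in ratios.items():
    print(f"{model}: DGX runs at {ratio:.2f}x the AI-Max speed (expected 2-4x)")
```

Both ratios come out below 1.0, so on these two models the DGX is slightly slower rather than 2-4x faster.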
Command:
llama-server --models-preset /home/dgx/models/models.ini --models-dir /home/dgx/models/ --host 0.0.0.0 --port 8080 --models-max 1 --parallel 1
models.ini:
```
[*]
threads = 12
flash-attn = on
mlock = off
mmap = off
fit = on
warmup = on
; batch-size = 4096
; ubatch-size = 512
cache-type-k = q8_0
cache-type-v = q8_0
jinja = true
direct-io = on
cache-prompt = true
cache-reuse = 256
cache-ram = 32768
reasoning-format = auto
n-gpu-layers = 999
```