r/LocalLLaMA · May 15, 2026 · 1 min read

I just bought an Asus Ascent (Nvidia GB10, DGX) and it is slower than my Ryzen AI Max

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

It is supposed to be 2-4x faster, but I am only getting 6 tk/s on Gemma4-31B. What am I doing wrong?

Inference engine: llama-cpp, latest as of 15th May 2026, built myself via https://ggml.ai/dgx-spark.sh

Tested models:
- Step3.5-Apex-I-Quality: DGX 27 tk/s, AI-Max 30 tk/s
- gemma-4-31B-it-UD-Q8_K_XL: DGX 6.19 tk/s, AI-Max 7.10 tk/s
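
To put the gap in numbers, here is a minimal Python sketch that turns the throughput figures above into a DGX-to-AI-Max ratio. The dictionary layout and the `speed_ratio` helper are just illustrative names, and the 6.19 tk/s Gemma figure is assumed to be the DGX result.

```python
# Decode throughput reported in the post, in tokens per second.
# Assumption: the 6.19 tk/s Gemma figure is the DGX measurement.
results = {
    "Step3.5-Apex-I-Quality": {"dgx": 27.0, "ai_max": 30.0},
    "gemma-4-31B-it-UD-Q8_K_XL": {"dgx": 6.19, "ai_max": 7.10},
}


def speed_ratio(dgx_tps: float, ai_max_tps: float) -> float:
    """Return DGX throughput as a multiple of AI-Max throughput."""
    return dgx_tps / ai_max_tps


ratios = {
    name: speed_ratio(r["dgx"], r["ai_max"]) for name, r in results.items()
}

for name, ratio in ratios.items():
    print(f"{name}: DGX runs at {ratio:.2f}x the AI-Max speed")
```

Both ratios come out below 1.0, so on these two models the DGX is roughly 10-13% slower rather than the expected 2-4x faster.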

Command:

llama-server --models-preset /home/dgx/models/models.ini --models-dir /home/dgx/models/ --host 0.0.0.0 --port 8080 --models-max 1 --parallel 1

models.ini:

```
[*]
threads = 12
flash-attn = on
mlock = off
mmap = off
fit = on
warmup = on
; batch-size = 4096
; ubatch-size = 512
cache-type-k = q8_0
cache-type-v = q8_0
jinja = true
direct-io = on
cache-prompt = true
cache-reuse = 256
cache-ram = 32768
reasoning-format = auto
n-gpu-layers = 999
```

submitted by /u/Voxandr