r/LocalLLaMA

MTP experiences on 7900xtx?


Hi! I have been happily using Qwen3.6 35B A3B for the past few weeks, and I wanted to try out Qwen3.6 27B with the fancy new MTP speculative drafting!

These are my settings currently:

    llama-server \
      -m $HOME/Documents/ML/Qwen3.6-27B-Q4_K_M.gguf \
      -c 64000 \
      -ngl 65 \
      --parallel 1 \
      -t 8 \
      --jinja \
      --host 0.0.0.0 \
      --port 5566 \
      --reasoning-budget 0 \
      --spec-type draft-mtp --spec-draft-n-max 3;

I have a 7900XTX. My llama.cpp is built with Vulkan, not ROCm.

I was hoping to get usable speeds with good context to upgrade from the MoE, but so far I'm not super impressed :(

With these settings my VRAM usage is at 93%.

Token speed isn't unusable with these settings but it's still quite slow :(

    prompt eval time =  4794.47 ms / 3445 tokens ( 1.39 ms per token, 718.54 tokens per second)
           eval time = 38484.86 ms /  872 tokens (44.13 ms per token,  22.66 tokens per second)
          total time = 43279.33 ms / 4317 tokens
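As a quick sanity check on the log, the per-second figures can be re-derived from the elapsed times and token counts it reports. This is just a small sketch; the helper name `tokens_per_second` is made up for illustration and is not part of llama.cpp.

```python
def tokens_per_second(elapsed_ms: float, n_tokens: int) -> float:
    """Convert an elapsed time in milliseconds and a token count to tokens/s."""
    return n_tokens / (elapsed_ms / 1000.0)

# Numbers copied from the llama-server timing output in the post.
prompt_tps = tokens_per_second(4794.47, 3445)   # prompt processing
gen_tps = tokens_per_second(38484.86, 872)      # token generation

print(f"prompt: {prompt_tps:.2f} tok/s, generation: {gen_tps:.2f} tok/s")
```

The generation number is the one that matters for interactive use; prompt processing is already fast by comparison.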

Do I need to quantize my cache? Should I drop to Q3 27B? Is 27B at Q3 better than the MoE?

Additionally, I was used to 128K context on the MoE, and I didn't quantize the cache.

What are your settings?

Edit: I tried a q8 cache and was able to fit the entire model in VRAM with 64k context. My speed is much better now at 50 tok/s, which is definitely very usable :)
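For a rough sense of why a q8 cache frees up VRAM, here is a back-of-the-envelope KV cache estimate. It is only a sketch under loud assumptions: the layer count, KV head count, and head dimension below are hypothetical placeholders, not Qwen3.6 27B's real architecture, and the formula assumes ordinary full attention in every layer (models with hybrid or sliding-window attention cache less). The byte sizes are the part that is solid: f16 uses 2 bytes per value, and q8_0 packs 32 int8 values plus one 2-byte scale into 34 bytes.

```python
# Bytes per cached value for two llama.cpp cache types.
# q8_0: 32 int8 values + one fp16 scale = 34 bytes per 32 values.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32}

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Estimate KV cache size, assuming full attention in every layer."""
    # One K and one V vector per layer, per KV head, per token.
    elems_per_token = 2 * n_layers * n_kv_heads * head_dim
    return elems_per_token * n_ctx * BYTES_PER_ELEM[cache_type]

# HYPOTHETICAL architecture values, chosen only to illustrate the ratio.
HYPOTHETICAL = dict(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=64000)

f16 = kv_cache_bytes(**HYPOTHETICAL, cache_type="f16")
q8 = kv_cache_bytes(**HYPOTHETICAL, cache_type="q8_0")
print(f"f16: {f16 / 2**30:.2f} GiB, q8_0: {q8 / 2**30:.2f} GiB, "
      f"ratio: {q8 / f16:.3f}")
```

Whatever the real architecture is, the ratio holds: a q8_0 cache takes a bit over half the space of an f16 one, which is consistent with the whole model fitting in VRAM after the change.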

submitted by /u/Combinatorilliance