Strix Halo Llama.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed

TL;DR

All models tested were Qwen3.6 variants (see Models Tested below).

27B-MTP vs Base 27B (15k single-turn): Faster overall

  • Total Time (wall): 87.44s → 77.39s (10.05s faster / -11.50%)
  • Generation: 7.63 → 16.15 t/s (+111.77% speedup)
  • Prompt Processing: 279.75 → 244.90 t/s (-12.46% slowdown)

35B-MTP vs Base 35B (15k single-turn): Slower overall

  • Total Time (wall): 20.83s → 23.16s (2.33s slower / +11.17%)
  • Generation: 48.18 → 56.12 t/s (+16.47% speedup)
  • Prompt Processing: 972.18 → 811.90 t/s (-16.49% slowdown)

27B-MTP vs Base 27B (5-turn chat, ~28.5k context): Massive time savings

  • Total Time (wall): 258.65s → 200.55s (58.10s faster / -22.46%)
  • Turns 2-5 (wall): 211.37s → 155.33s (56.04s faster / -26.51%)
  • Avg Generation: 7.61 → 17.98 t/s (+136.41% speedup)
  • Avg Prompt Processing: 254.20 → 207.87 t/s (-18.23% slowdown)

35B-MTP vs Base 35B (5-turn chat, ~28.5k context): Roughly tied, slightly slower

  • Total Time (wall): 58.86s → 60.24s (1.38s slower / +2.34%)
  • Turns 2-5 (wall): 47.96s → 49.21s (1.25s slower / +2.62%)
  • Avg Generation: 46.66 → 58.23 t/s (+24.80% speedup)
  • Avg Prompt Processing: 826.47 → 703.45 t/s (-14.89% slowdown)

Terminology:

  • wall = real end-to-end elapsed time from sending the request to receiving the full response.
  • pp = prompt processing throughput (tokens/sec).
  • gen t/s = generation throughput (tokens/sec).

Hardware / Software

  • CPU: AMD Ryzen AI Max+ 395 (16C/32T)
  • iGPU: Radeon 8060S (RADV GFX1151)
  • RAM: 30 GiB
  • OS: Ubuntu 24.04, kernel 6.17
  • llama.cpp / llama-server: build 9187 (0253fb21f)
  • Vulkan Instance: 1.4.313
  • GPU API: 1.4.305
  • Mesa RADV: 25.0.7

Models Tested (all Unsloth)

  • Qwen3.6-27B-Q8_0.gguf
  • Qwen3.6-27B-Q8_0-MTP.gguf
  • Qwen3.6-35B-A3B-Q8_0.gguf
  • Qwen3.6-35B-A3B-Q8_0-MTP.gguf

Runtime Config Used

  • --ctx-size 128000
  • -b 2048
  • --ubatch-size 1024
  • --flash-attn on
  • --threads 16
  • --threads-batch 16

MTP models only (a combined launch command is sketched after this list):

  • --spec-type draft-mtp
  • --spec-draft-n-max 3
  • --spec-draft-p-min 0.75
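
Put together, a launch for one of the MTP models would look roughly like the sketch below. The model filename and port are placeholders, not the exact command from the post; the base (non-MTP) runs would simply drop the three --spec-* flags.

    llama-server -m Qwen3.6-27B-Q8_0-MTP.gguf --port 8080 \
      --ctx-size 128000 -b 2048 --ubatch-size 1024 \
      --flash-attn on --threads 16 --threads-batch 16 \
      --spec-type draft-mtp --spec-draft-n-max 3 --spec-draft-p-min 0.75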

Methodology

15k single-turn uncached

  • Synthetic agentic prompt calibrated to ~15k prompt tokens.
  • max_tokens=256, temperature=0.
  • Prompt randomized each run (RUN_TAG) so cache_n=0 (true uncached prefill); see the client sketch after this list.
  • 2 runs per model.
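
A minimal sketch of what one uncached single-turn measurement could look like, assuming the OpenAI-compatible /v1/chat/completions endpoint that llama-server exposes. The URL, prompt file, and RUN_TAG placement are illustrative placeholders, not the author's actual harness.

    import time, uuid, requests

    URL = "http://localhost:8080/v1/chat/completions"    # assumed llama-server address
    BASE_PROMPT = open("agentic_prompt_15k.txt").read()  # placeholder ~15k-token synthetic prompt

    run_tag = uuid.uuid4().hex  # fresh tag each run so the prompt prefix never hits the cache (cache_n=0)
    payload = {
        "messages": [{"role": "user", "content": f"[RUN_TAG:{run_tag}]\n{BASE_PROMPT}"}],
        "max_tokens": 256,
        "temperature": 0,
    }

    t0 = time.time()
    resp = requests.post(URL, json=payload, timeout=600)
    wall = time.time() - t0  # client-observed wall time (the "wall" figures reported above)
    print(f"wall={wall:.2f}s status={resp.status_code}")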

5-turn subsequent-turn test

  • Same scripted 5-turn back-and-forth for each model (loop sketched after this list).
  • ~3900-word user payload each turn.
  • Context grows to ~28.5k prompt tokens by turn 5.
  • max_tokens=220, temperature=0.
  • Reported both full 5-turn total and turns 2-5 only (to isolate “subsequent turn” behavior).
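
For the multi-turn test, the scripted loop could look like the sketch below, again assuming the OpenAI-compatible endpoint; the payload file stands in for the ~3900-word message, and per-turn wall times are summed to get both the 5-turn total and the turns-2-5 figure.

    import time, requests

    URL = "http://localhost:8080/v1/chat/completions"    # assumed llama-server address
    USER_PAYLOAD = open("payload_3900_words.txt").read() # placeholder ~3900-word per-turn message

    messages, turn_walls = [], []
    for turn in range(1, 6):
        messages.append({"role": "user", "content": USER_PAYLOAD})
        t0 = time.time()
        resp = requests.post(URL, json={"messages": messages, "max_tokens": 220, "temperature": 0}, timeout=900)
        turn_walls.append(time.time() - t0)
        # keep the assistant reply in context so the prompt grows (~28.5k tokens by turn 5)
        messages.append({"role": "assistant", "content": resp.json()["choices"][0]["message"]["content"]})

    print(f"5-turn total: {sum(turn_walls):.2f}s | turns 2-5: {sum(turn_walls[1:]):.2f}s")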

Stability

  • Retry logic on transient 502/503/504 for long runs (sketched below).
  • Reported both server infer timing and client-observed wall time.
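
The retry behavior could be as simple as the helper below; this is a sketch, with the status codes matching the ones called out above and arbitrary backoff values.

    import time, requests

    def post_with_retry(url, payload, attempts=4, backoff=5.0):
        """Retry on transient gateway errors (502/503/504) during long runs."""
        last = None
        for i in range(attempts):
            try:
                last = requests.post(url, json=payload, timeout=900)
                if last.status_code not in (502, 503, 504):
                    return last
            except requests.ConnectionError:
                pass  # treat a dropped connection like a transient error and retry
            time.sleep(backoff * (i + 1))  # simple linear backoff between attempts
        raise RuntimeError(f"request still failing after {attempts} attempts")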

Full Results (Latency-Focused)

15k single-turn

Family   Non-MTP wall   MTP wall   Delta
27B      87.44s         77.39s     -11.50%
35B      20.83s         23.16s     +11.17%

5-turn total (~28.5k by turn 5)

Family   Non-MTP wall   MTP wall   Delta
27B      258.65s        200.55s    -22.46%
35B      58.86s         60.24s     +2.34%

Subsequent turns only (turns 2-5)

Family   Non-MTP wall   MTP wall   Delta
27B      211.37s        155.33s    -26.51%
35B      47.96s         49.21s     +2.62%

Takeaways

  • MTP consistently lowers pp and increases generation t/s.
  • Workload shape dictates the overall winner:
      • If decode dominates, MTP can win hard (as seen on the 27B here).
      • If prefill dominates enough, MTP may lose slightly overall (as seen on the 35B here); a back-of-envelope breakdown follows this list.
  • On this Strix Halo setup:
      • 27B-MTP is a strong practical upgrade for long-context chat workflows.
      • 35B-MTP is mixed: faster token generation, but slightly slower end-to-end for these specific long-context tests.
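
As a rough sanity check, each single-turn wall time can be approximated from the reported throughputs alone as prompt_tokens/pp + gen_tokens/gen. The sketch below plugs in the 15k single-turn numbers above with ~15k prompt tokens and 256 generated tokens (both approximate, since the prompt was only calibrated to ~15k), so treat it as a back-of-envelope estimate rather than a reproduction of the harness.

    # Back-of-envelope: wall ~ prompt_tokens/pp + gen_tokens/gen, using the 15k single-turn numbers.
    PROMPT_TOKENS, GEN_TOKENS = 15_000, 256  # approximate (prompt "calibrated to ~15k", max_tokens=256)

    runs = {
        "27B base": (279.75, 7.63),
        "27B MTP":  (244.90, 16.15),
        "35B base": (972.18, 48.18),
        "35B MTP":  (811.90, 56.12),
    }

    for name, (pp, gen) in runs.items():
        prefill = PROMPT_TOKENS / pp
        decode = GEN_TOKENS / gen
        print(f"{name}: prefill ~{prefill:.1f}s + decode ~{decode:.1f}s = ~{prefill + decode:.1f}s")

    # 27B: decode dominates, so the +112% gen speedup (~-17.7s) outweighs the pp loss (~+7.6s): net ~-10s.
    # 35B: prefill dominates, so the ~16% pp slowdown (~+3s) outweighs the decode savings (~-0.8s): net ~+2.3s.

Those estimates land within a few tenths of a second of the measured single-turn deltas (-10.05s on 27B, +2.33s on 35B), so the pp/gen split largely explains the overall outcomes.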

submitted by /u/xjE4644Eyc
