Strix Halo Llama.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed
TL;DR
All models tested were Qwen3.6 variants.
27B-MTP vs Base 27B (15k single-turn): Faster overall
- Total Time (wall): 87.44s → 77.39s (10.05s faster / -11.50%)
- Generation: 7.63 → 16.15 t/s (+111.77% speedup)
- Prompt Processing: 279.75 → 244.90 t/s (-12.46% slowdown)
35B-MTP vs Base 35B (15k single-turn): Slower overall
- Total Time (wall): 20.83s → 23.16s (2.33s slower / +11.17%)
- Generation: 48.18 → 56.12 t/s (+16.47% speedup)
- Prompt Processing: 972.18 → 811.90 t/s (-16.49% slowdown)
27B-MTP vs Base 27B (5-turn chat, ~28.5k context): Massive time savings
- Total Time (wall): 258.65s → 200.55s (58.10s faster / -22.46%)
- Turns 2-5 (wall): 211.37s → 155.33s (56.04s faster / -26.51%)
- Avg Generation: 7.61 → 17.98 t/s (+136.41% speedup)
- Avg Prompt Processing: 254.20 → 207.87 t/s (-18.23% slowdown)
35B-MTP vs Base 35B (5-turn chat, ~28.5k context): Roughly tied, slightly slower
- Total Time (wall): 58.86s → 60.24s (1.38s slower / +2.34%)
- Turns 2-5 (wall): 47.96s → 49.21s (1.25s slower / +2.62%)
- Avg Generation: 46.66 → 58.23 t/s (+24.80% speedup)
- Avg Prompt Processing: 826.47 → 703.45 t/s (-14.89% slowdown)
Terminology:
- wall = real end-to-end elapsed time from sending the request to receiving the full response.
- pp = prompt processing throughput (tokens/sec).
- gen t/s = generation throughput (tokens/sec).
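For reference, here is a minimal sketch of where those numbers come from, assuming the server is on localhost:8080 and that the response carries llama-server's non-standard timings block (field names can differ across builds):

```bash
# Client-observed wall time: stopwatch around the whole HTTP request.
start=$(date +%s.%N)
resp=$(curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"ping"}],"max_tokens":64,"temperature":0}')
end=$(date +%s.%N)
echo "wall: $(echo "$end - $start" | bc) s"

# pp / gen t/s: read from the server-reported timings block
# (field names are an assumption based on recent llama-server builds).
echo "$resp" | jq '.timings | {pp: .prompt_per_second, gen: .predicted_per_second}'
```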
Hardware / Software
- CPU: AMD RYZEN AI MAX+ 395 (16C/32T)
- iGPU: Radeon 8060S (RADV GFX1151)
- RAM: 30 GiB
- OS: Ubuntu 24.04, kernel 6.17
- llama.cpp / llama-server: build 9187 (0253fb21f)
- Vulkan Instance: 1.4.313
- GPU API: 1.4.305
- Mesa RADV: 25.0.7
Models Tested (all Unsloth)
- Qwen3.6-27B-Q8_0.gguf
- Qwen3.6-27B-Q8_0-MTP.gguf
- Qwen3.6-35B-A3B-Q8_0.gguf
- Qwen3.6-35B-A3B-Q8_0-MTP.gguf
Runtime Config Used
--ctx-size 128000 -b 2048 --ubatch-size 1024 --flash-attn on --threads 16 --threads-batch 16
MTP models only:
--spec-type draft-mtp --spec-draft-n-max 3 --spec-draft-p-min 0.75
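Assembled into full launch commands, that looks roughly like this (the flags and filenames are from the post; host, port, and working directory are assumptions):

```bash
# Base model
llama-server -m Qwen3.6-27B-Q8_0.gguf \
  --ctx-size 128000 -b 2048 --ubatch-size 1024 \
  --flash-attn on --threads 16 --threads-batch 16 \
  --host 0.0.0.0 --port 8080

# MTP model: same flags plus the speculative-decoding options
llama-server -m Qwen3.6-27B-Q8_0-MTP.gguf \
  --ctx-size 128000 -b 2048 --ubatch-size 1024 \
  --flash-attn on --threads 16 --threads-batch 16 \
  --spec-type draft-mtp --spec-draft-n-max 3 --spec-draft-p-min 0.75 \
  --host 0.0.0.0 --port 8080
```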
Methodology
15k single-turn uncached
- Synthetic agentic prompt calibrated to ~15k prompt tokens.
- max_tokens=256, temperature=0.
- Prompt randomized each run (RUN_TAG) so cache_n=0 (true uncached prefill).
- 2 runs per model.
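A minimal sketch of one uncached run; the prompt file name and the RUN_TAG placement are assumptions, but the key points from the protocol above are the randomized tag, temperature 0, max_tokens 256, and client-side wall timing:

```bash
RUN_TAG=$(date +%s)-$RANDOM                              # randomize so the prompt cache never hits
PROMPT="[run:$RUN_TAG] $(cat agentic_15k_prompt.txt)"    # hypothetical ~15k-token synthetic prompt file

start=$(date +%s.%N)
resp=$(jq -n --arg p "$PROMPT" \
  '{messages:[{role:"user",content:$p}], max_tokens:256, temperature:0}' |
  curl -s http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' -d @-)
end=$(date +%s.%N)

echo "wall: $(echo "$end - $start" | bc) s"
# Server-side timings (field names assumed); check the server log to confirm cache_n=0.
echo "$resp" | jq '.timings'
```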
5-turn subsequent-turn test
- Same scripted 5-turn back-and-forth for each model.
- ~3900-word user payload each turn.
- Context grows to ~28.5k prompt tokens by turn 5.
- max_tokens=220, temperature=0.
- Reported both the full 5-turn total and turns 2-5 only (to isolate “subsequent turn” behavior).
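A sketch of the scripted multi-turn loop; the payload file and the exact history handling are assumptions, the turn count, payload size, and sampling settings are from the protocol above:

```bash
PAYLOAD=$(cat user_payload_3900_words.txt)   # hypothetical ~3900-word user message
messages='[]'

for turn in 1 2 3 4 5; do
  # Append this turn's user message to the running history.
  messages=$(echo "$messages" | jq --arg c "Turn $turn: $PAYLOAD" \
    '. + [{role:"user", content:$c}]')

  start=$(date +%s.%N)
  resp=$(jq -n --argjson m "$messages" \
    '{messages:$m, max_tokens:220, temperature:0}' |
    curl -s http://localhost:8080/v1/chat/completions \
      -H 'Content-Type: application/json' -d @-)
  end=$(date +%s.%N)
  echo "turn $turn wall: $(echo "$end - $start" | bc) s"

  # Append the assistant reply so context keeps growing (~28.5k prompt tokens by turn 5).
  reply=$(echo "$resp" | jq -r '.choices[0].message.content')
  messages=$(echo "$messages" | jq --arg c "$reply" '. + [{role:"assistant", content:$c}]')
done
```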
Stability
- Retry logic on transient 502/503/504 for long runs.
- Reported both server infer timing and client-observed wall time.
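The retry behavior was along these lines; the status codes are from the post, while the attempt count, backoff, and request file are assumptions:

```bash
for attempt in 1 2 3; do
  status=$(curl -s -o resp.json -w '%{http_code}' \
    http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' -d @request.json)
  case "$status" in
    502|503|504) echo "transient $status, retrying..."; sleep $((attempt * 5)) ;;
    *) break ;;                      # success or a non-transient error: stop retrying
  esac
done
```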
Full Results (Latency-Focused)
15k single-turn
| Family | Non-MTP wall | MTP wall | Delta |
|---|---|---|---|
| 27B | 87.44s | 77.39s | -11.50% |
| 35B | 20.83s | 23.16s | +11.17% |
5-turn total (~28.5k by turn 5)
| Family | Non-MTP wall | MTP wall | Delta |
|---|---|---|---|
| 27B | 258.65s | 200.55s | -22.46% |
| 35B | 58.86s | 60.24s | +2.34% |
Subsequent turns only (turns 2-5)
| Family | Non-MTP wall | MTP wall | Delta |
|---|---|---|---|
| 27B | 211.37s | 155.33s | -26.51% |
| 35B | 47.96s | 49.21s | +2.62% |
Takeaways
- MTP consistently lowers pp and increases generation t/s.
- Workload shape dictates the overall winner (see the rough decomposition below):
  - If decode dominates, MTP can win hard (as seen on 27B here).
  - If prefill dominates enough, MTP may lose slightly overall (as seen on 35B here).
- On this Strix Halo setup:
  - 27B-MTP is a strong practical upgrade for long-context chat workflows.
  - 35B-MTP is mixed: faster token generation, but slightly slower end-to-end for these specific long-context tests.
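To see why 35B flips despite the faster generation, the 15k single-turn wall time can be roughly decomposed using the measured throughputs above (~15,000 prompt tokens, 256 generated tokens), which lands close to the measured 20.83s and 23.16s:

```bash
# prefill ≈ prompt_tokens / pp, decode ≈ generated_tokens / gen t/s (35B, 15k single-turn)
awk 'BEGIN {
  printf "base: %.1f + %.1f = %.1f s\n", 15000/972.18, 256/48.18, 15000/972.18 + 256/48.18
  printf "mtp : %.1f + %.1f = %.1f s\n", 15000/811.90, 256/56.12, 15000/811.90 + 256/56.12
}'
```

With prefill taking roughly three quarters of the run, the ~16% pp slowdown costs more seconds than the ~16% generation speedup saves.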