Strix Halo Llama.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed

TL;DR

All models tested were Qwen3.6 variants (see Models Tested below).

27B-MTP vs Base 27B (15k single-turn): Faster overall

  • Total Time (wall): 87.44s → 77.39s (10.05s faster / -11.50%)
  • Generation: 7.63 → 16.15 t/s (+111.77% speedup)
  • Prompt Processing: 279.75 → 244.90 t/s (-12.46% slowdown)

35B-MTP vs Base 35B (15k single-turn): Slower overall

  • Total Time (wall): 20.83s → 23.16s (2.33s slower / +11.17%)
  • Generation: 48.18 → 56.12 t/s (+16.47% speedup)
  • Prompt Processing: 972.18 → 811.90 t/s (-16.49% slowdown)

27B-MTP vs Base 27B (5-turn chat, ~28.5k context): Massive time savings

  • Total Time (wall): 258.65s → 200.55s (58.10s faster / -22.46%)
  • Turns 2-5 (wall): 211.37s → 155.33s (56.04s faster / -26.51%)
  • Avg Generation: 7.61 → 17.98 t/s (+136.41% speedup)
  • Avg Prompt Processing: 254.20 → 207.87 t/s (-18.23% slowdown)

35B-MTP vs Base 35B (5-turn chat, ~28.5k context): Roughly tied, slightly slower

  • Total Time (wall): 58.86s → 60.24s (1.38s slower / +2.34%)
  • Turns 2-5 (wall): 47.96s → 49.21s (1.25s slower / +2.62%)
  • Avg Generation: 46.66 → 58.23 t/s (+24.80% speedup)
  • Avg Prompt Processing: 826.47 → 703.45 t/s (-14.89% slowdown)

Terminology:

  • wall = real end-to-end elapsed time from sending the request to receiving the full response.
  • pp = prompt processing throughput (tokens/sec).
  • gen t/s = generation throughput (tokens/sec).

Hardware / Software

  • CPU: AMD Ryzen AI Max+ 395 (16C/32T)
  • iGPU: Radeon 8060S (RADV GFX1151)
  • RAM: 30 GiB
  • OS: Ubuntu 24.04, kernel 6.17
  • llama.cpp / llama-server: build 9187 (0253fb21f)
  • Vulkan Instance: 1.4.313
  • GPU API: 1.4.305
  • Mesa RADV: 25.0.7

Models Tested (all Unsloth)

  • Qwen3.6-27B-Q8_0.gguf
  • Qwen3.6-27B-Q8_0-MTP.gguf
  • Qwen3.6-35B-A3B-Q8_0.gguf
  • Qwen3.6-35B-A3B-Q8_0-MTP.gguf

Runtime Config Used

  • --ctx-size 128000
  • -b 2048
  • --ubatch-size 1024
  • --flash-attn on
  • --threads 16
  • --threads-batch 16

MTP models only (a combined launch command is sketched after this list):

  • --spec-type draft-mtp
  • --spec-draft-n-max 3
  • --spec-draft-p-min 0.75
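
Put together, a launch for one of the MTP models would look roughly like the sketch below. The model filename and port are placeholders, not the exact command from the post; the base (non-MTP) runs would simply drop the three --spec-* flags.

    llama-server -m Qwen3.6-27B-Q8_0-MTP.gguf --port 8080 \
      --ctx-size 128000 -b 2048 --ubatch-size 1024 \
      --flash-attn on --threads 16 --threads-batch 16 \
      --spec-type draft-mtp --spec-draft-n-max 3 --spec-draft-p-min 0.75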

Methodology

15k single-turn uncached

  • Synthetic agentic prompt calibrated to ~15k prompt tokens.
  • max_tokens=256, temperature=0.
  • Prompt randomized each run (RUN_TAG) so cache_n=0 (true uncached prefill); see the client sketch after this list.
  • 2 runs per model.
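
A minimal sketch of what one uncached single-turn measurement could look like, assuming the OpenAI-compatible /v1/chat/completions endpoint that llama-server exposes. The URL, prompt file, and RUN_TAG placement are illustrative placeholders, not the author's actual harness.

    import time, uuid, requests

    URL = "http://localhost:8080/v1/chat/completions"    # assumed llama-server address
    BASE_PROMPT = open("agentic_prompt_15k.txt").read()  # placeholder ~15k-token synthetic prompt

    run_tag = uuid.uuid4().hex  # fresh tag each run so the prompt prefix never hits the cache (cache_n=0)
    payload = {
        "messages": [{"role": "user", "content": f"[RUN_TAG:{run_tag}]\n{BASE_PROMPT}"}],
        "max_tokens": 256,
        "temperature": 0,
    }

    t0 = time.time()
    resp = requests.post(URL, json=payload, timeout=600)
    wall = time.time() - t0  # client-observed wall time (the "wall" figures reported above)
    print(f"wall={wall:.2f}s status={resp.status_code}")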

5-turn subsequent-turn test

  • Same scripted 5-turn back-and-forth for each model (loop sketched after this list).
  • ~3900-word user payload each turn.
  • Context grows to ~28.5k prompt tokens by turn 5.
  • max_tokens=220, temperature=0.
  • Reported both full 5-turn total and turns 2-5 only (to isolate “subsequent turn” behavior).
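
For the multi-turn test, the scripted loop could look like the sketch below, again assuming the OpenAI-compatible endpoint; the payload file stands in for the ~3900-word message, and per-turn wall times are summed to get both the 5-turn total and the turns-2-5 figure.

    import time, requests

    URL = "http://localhost:8080/v1/chat/completions"    # assumed llama-server address
    USER_PAYLOAD = open("payload_3900_words.txt").read() # placeholder ~3900-word per-turn message

    messages, turn_walls = [], []
    for turn in range(1, 6):
        messages.append({"role": "user", "content": USER_PAYLOAD})
        t0 = time.time()
        resp = requests.post(URL, json={"messages": messages, "max_tokens": 220, "temperature": 0}, timeout=900)
        turn_walls.append(time.time() - t0)
        # keep the assistant reply in context so the prompt grows (~28.5k tokens by turn 5)
        messages.append({"role": "assistant", "content": resp.json()["choices"][0]["message"]["content"]})

    print(f"5-turn total: {sum(turn_walls):.2f}s | turns 2-5: {sum(turn_walls[1:]):.2f}s")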

Stability

  • Retry logic on transient 502/503/504 for long runs (sketched below).
  • Reported both server infer timing and client-observed wall time.
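
The retry behavior could be as simple as the helper below; this is a sketch, with the status codes matching the ones called out above and arbitrary backoff values.

    import time, requests

    def post_with_retry(url, payload, attempts=4, backoff=5.0):
        """Retry on transient gateway errors (502/503/504) during long runs."""
        last = None
        for i in range(attempts):
            try:
                last = requests.post(url, json=payload, timeout=900)
                if last.status_code not in (502, 503, 504):
                    return last
            except requests.ConnectionError:
                pass  # treat a dropped connection like a transient error and retry
            time.sleep(backoff * (i + 1))  # simple linear backoff between attempts
        raise RuntimeError(f"request still failing after {attempts} attempts")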

Full Results (Latency-Focused)

15k single-turn

Family   Non-MTP wall   MTP wall   Delta
27B      87.44s         77.39s     -11.50%
35B      20.83s         23.16s     +11.17%

5-turn total (~28.5k by turn 5)

Family   Non-MTP wall   MTP wall   Delta
27B      258.65s        200.55s    -22.46%
35B      58.86s         60.24s     +2.34%

Subsequent turns only (turns 2-5)

Family   Non-MTP wall   MTP wall   Delta
27B      211.37s        155.33s    -26.51%
35B      47.96s         49.21s     +2.62%

Takeaways

  • MTP consistently lowers pp and increases generation t/s.
  • Workload shape dictates the overall winner:
      • If decode dominates, MTP can win hard (as seen on the 27B here).
      • If prefill dominates enough, MTP may lose slightly overall (as seen on the 35B here); a back-of-envelope breakdown follows this list.
  • On this Strix Halo setup:
      • 27B-MTP is a strong practical upgrade for long-context chat workflows.
      • 35B-MTP is mixed: faster token generation, but slightly slower end-to-end for these specific long-context tests.
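
As a rough sanity check, each single-turn wall time can be approximated from the reported throughputs alone as prompt_tokens/pp + gen_tokens/gen. The sketch below plugs in the 15k single-turn numbers above with ~15k prompt tokens and 256 generated tokens (both approximate, since the prompt was only calibrated to ~15k), so treat it as a back-of-envelope estimate rather than a reproduction of the harness.

    # Back-of-envelope: wall ~ prompt_tokens/pp + gen_tokens/gen, using the 15k single-turn numbers.
    PROMPT_TOKENS, GEN_TOKENS = 15_000, 256  # approximate (prompt "calibrated to ~15k", max_tokens=256)

    runs = {
        "27B base": (279.75, 7.63),
        "27B MTP":  (244.90, 16.15),
        "35B base": (972.18, 48.18),
        "35B MTP":  (811.90, 56.12),
    }

    for name, (pp, gen) in runs.items():
        prefill = PROMPT_TOKENS / pp
        decode = GEN_TOKENS / gen
        print(f"{name}: prefill ~{prefill:.1f}s + decode ~{decode:.1f}s = ~{prefill + decode:.1f}s")

    # 27B: decode dominates, so the +112% gen speedup (~-17.7s) outweighs the pp loss (~+7.6s): net ~-10s.
    # 35B: prefill dominates, so the ~16% pp slowdown (~+3s) outweighs the decode savings (~-0.8s): net ~+2.3s.

Those estimates land within a few tenths of a second of the measured single-turn deltas (-10.05s on 27B, +2.33s on 35B), so the pp/gen split largely explains the overall outcomes.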

submitted by /u/xjE4644Eyc
