StepFun 3.7 Flash MTP Bench Strix Halo

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

This bench runs the StepFun Step-3.7-Flash UD-IQ4_XS main model with the official StepFun MTP Q8_0 draft model, served through a patched llama.cpp Vulkan/RADV build.

Host

  • System: AMD Ryzen AI Max+ 395 / Radeon 8060S (gfx1151)
  • Memory: 128 GB unified LPDDR5X
  • BIOS UMA / VRAM: 4 GB dedicated
  • GTT ceiling: 112 GiB
  • IOMMU: enabled (amd_iommu=on)
  • OS: Ubuntu 25.04 (Plucky)
  • Kernel: 6.18.1-061801-generic
  • Mesa / RADV: Mesa 25.2.8 / RADV
  • ROCm: 7.1.1 baseline; some later rows also reference ROCm 7.2.x runtime libraries

Model

  • Main model: StepFun Step-3.7-Flash UD-IQ4_XS
  • Main model size on disk: 95,336,010,208 bytes / 88.79 GiB
  • Main model shards: 3
  • Draft model: Step-3.7-Flash-MTP-Q8_0.gguf
  • Draft model size: about 3.5 GiB
  • Architecture: step35
  • Model class: roughly 200B total parameters / about 11B active parameters per token
  • Backend: llama.cpp Vulkan/RADV b9360 with Step-3.7 MTP patch
  • Context used for this bench: 12,288
  • MTP settings: DRAFT_N=2, PMIN=0.60, UBATCH=512
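
For a rough sense of how tight the memory fit is, here is a minimal sketch using only the sizes listed above. It assumes both the main and draft weights live in GTT-mapped memory and ignores the 4 GB dedicated carve-out; whatever is left has to cover the KV cache at 12,288 context, compute buffers, and anything else the backend allocates. The variable names are just illustrative.

```python
GIB = 1024 ** 3

main_model_bytes = 95_336_010_208  # main model size on disk, from the post
draft_model_gib = 3.5              # "about 3.5 GiB" per the post
gtt_ceiling_gib = 112.0            # GTT ceiling per the post

main_model_gib = main_model_bytes / GIB

# Assumption: all weights sit in GTT; the remainder is what is left
# for KV cache and compute buffers.
headroom_gib = gtt_ceiling_gib - main_model_gib - draft_model_gib

print(f"main model: {main_model_gib:.2f} GiB")
print(f"headroom:   {headroom_gib:.2f} GiB")
```

Under those assumptions the headroom comes out to a bit under 20 GiB, so the approximate draft size is not the deciding factor here.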

Latest measured numbers

| Metric | StepFun MTP | Non-MTP baseline | Change |
| --- | --- | --- | --- |
| Load to listening | ~31 s | ~31 s | no startup penalty observed |
| Prefill / prompt processing | 211.2 tok/s | 212.0 tok/s | basically flat |
| Decode / token generation | 26.0 tok/s | 20.4 tok/s | +27.5% |
| Normalized wall time, 1150-in/2000-out | 82.4 s | 103.4 s | 20.8% faster |
| Two concurrent requests | 19.7 / 19.6 tok/s | 17.14 tok/s each | +15% per slot |
| Two-slot aggregate | 35.7 tok/s | ~34 tok/s | +5% aggregate |
| Socket power during decode | ~73 W | ~85 W | ~14% lower |

The main result: MTP materially improves decode speed without hurting prefill. For a roughly 200B-total MoE model, 26 tok/s single-stream on a 128 GB Strix Halo APU is a useful local lane.
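
The normalized wall time can be sanity-checked from the prefill and decode rates. This is a sketch, assuming wall time is simply prompt tokens over prefill rate plus output tokens over decode rate, which may not be exactly how the bench script computes it. The energy-per-token figure is my own derived number (socket watts divided by decode tok/s), not something the bench reports.

```python
def wall_time_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Estimated wall time: prefill phase plus decode phase, no overlap."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# Rates taken from the measured numbers (rounded as published).
mtp_s = wall_time_s(1150, 2000, prefill_tps=211.2, decode_tps=26.0)
base_s = wall_time_s(1150, 2000, prefill_tps=212.0, decode_tps=20.4)

decode_gain_pct = (26.0 / 20.4 - 1) * 100
time_saved_pct = (1 - mtp_s / base_s) * 100

# Derived, not measured: approximate socket energy per generated token.
joules_per_token_mtp = 73 / 26.0
joules_per_token_base = 85 / 20.4

print(f"MTP: {mtp_s:.1f} s, baseline: {base_s:.1f} s")
print(f"decode gain: {decode_gain_pct:.1f}%, time saved: {time_saved_pct:.1f}%")
print(f"J/token: {joules_per_token_mtp:.2f} vs {joules_per_token_base:.2f}")
```

With the rounded rates this reproduces the 82.4 s MTP figure and lands within about 0.1 s of the baseline. The time saved comes out nearer 20.4% than the reported 20.8%, presumably because the bench used unrounded measurements. On the power side, lower watts at higher throughput works out to roughly 2.8 J per token with MTP versus roughly 4.2 J without.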

Draft acceptance

The standard decode probe showed:

  • Drafted tokens: 491
  • Accepted draft tokens: 416
  • Accepted / drafted: 84.7%

Important source note: the summarized bench.json currently has "mtp.acceptance_pct": null. The 84.7% acceptance number comes from the raw tg_probe.json timing counters, not from the aggregate bench.json field.
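
A minimal sketch of that fallback: prefer the aggregate field when it is populated, otherwise derive acceptance from the raw counters. The function and argument names are hypothetical, and I am not reproducing the actual JSON layout of bench.json or tg_probe.json here, only the arithmetic.

```python
def acceptance_pct(reported_pct, drafted, accepted):
    """Return draft acceptance in percent.

    reported_pct: value of the aggregate field, or None when it is null.
    drafted / accepted: raw draft-token counters from the decode probe.
    """
    if reported_pct is not None:
        return reported_pct
    if drafted == 0:
        return None  # nothing was drafted, so acceptance is undefined
    return 100.0 * accepted / drafted

# Aggregate field is null in this run, so fall back to the counters.
pct = acceptance_pct(None, drafted=491, accepted=416)
print(f"{pct:.1f}%")
```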

Context against other local lanes

These are not quality-equivalent rows, but they help place the speed tier:

| Model / lane | Total / active | Quant / path | Prefill | Decode |
| --- | --- | --- | --- | --- |
| Qwen 3.6 35B MTP | 35B / A3B | Q4_K_M, Vulkan MTP | not listed here | 81.2 tok/s |
| gpt-oss-120b | 117B / A5.1B | MXFP4, Vulkan | 787 tok/s | 46.7 tok/s |
| Qwen3-Coder-Next | coder MoE | UD-Q4_K_XL, Vulkan | 723.2 tok/s | 44.4 tok/s |
| Qwen 3.5 122B MTP | 122B / A10B | MXFP4_MOE, Vulkan MTP | 332.1 tok/s | 26.7 tok/s |
| StepFun 3.7 Flash MTP | ~200B / A11B | UD-IQ4_XS + Q8 MTP draft | 211.2 tok/s | 26.0 tok/s |
| StepFun 3.7 Flash plain | ~200B / A11B | UD-IQ4_XS, no MTP | 212.0 tok/s | 20.4 tok/s |

The interesting part is that StepFun MTP lands in the same rough decode tier as Qwen 122B MTP while moving a much larger total-parameter model. Whether that is the best lane depends on whether StepFun's quality is worth spending the 26 tok/s tier on.

submitted by /u/westsunset