r/LocalLLaMA · June 6, 2026 · 2 min read

StepFun 3.7 Flash MTP Bench on Strix Halo


This bench pairs the StepFun Step-3.7-Flash UD-IQ4_XS main model with the official StepFun MTP Q8_0 draft model, served through a patched llama.cpp Vulkan/RADV build.

Host

System: AMD Ryzen AI Max+ 395 / Radeon 8060S (gfx1151)
Memory: 128 GB unified LPDDR5X
BIOS UMA / VRAM: 4 GB dedicated
GTT ceiling: 112 GiB
IOMMU: enabled (amd_iommu=on)
OS: Ubuntu 25.04 (Plucky)
Kernel: 6.18.1-061801-generic
Mesa / RADV: Mesa 25.2.8 / RADV
ROCm: 7.1.1 baseline; some later rows also reference ROCm 7.2.x runtime libraries

Model

Main model: StepFun Step-3.7-Flash UD-IQ4_XS
Main model size on disk: 95,336,010,208 bytes / 88.79 GiB
Main model shards: 3
Draft model: Step-3.7-Flash-MTP-Q8_0.gguf
Draft model size: about 3.5 GiB
Architecture: step35
Model class: roughly 200B total parameters / about 11B active parameters per token
Backend: llama.cpp Vulkan/RADV b9360 with Step-3.7 MTP patch
Context used for this bench: 12,288
MTP settings: DRAFT_N=2, PMIN=0.60, UBATCH=512
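
To show how the model sizes compare with the GTT ceiling listed under Host, here is a minimal Python sketch. It assumes the main and draft weights both sit in GTT-backed memory and that whatever remains is what the KV cache and compute buffers get; the post does not confirm that split, so treat the headroom figure as a rough estimate.

```python
# Rough memory budget from the figures in the post.
# Assumption: all weights live under the GTT ceiling; KV cache and
# compute buffers are not measured here and come out of the remainder.
GIB = 1024 ** 3

main_model_bytes = 95_336_010_208   # main model size on disk
draft_model_gib = 3.5               # "about 3.5 GiB" per the post
gtt_ceiling_gib = 112               # GTT ceiling from the Host section

main_model_gib = main_model_bytes / GIB
weights_gib = main_model_gib + draft_model_gib
headroom_gib = gtt_ceiling_gib - weights_gib

print(f"main model:             {main_model_gib:.2f} GiB")
print(f"main + draft weights:   {weights_gib:.2f} GiB")
print(f"left under GTT ceiling: {headroom_gib:.2f} GiB")
```

Under those assumptions the weights take a little over 92 GiB, leaving roughly 19.7 GiB for the 12,288-token context and everything else.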

Latest measured numbers

Metric	StepFun MTP	Non-MTP baseline	Change
Load to listening	~31 s	~31 s	no startup penalty observed
Prefill / prompt processing	211.2 tok/s	212.0 tok/s	basically flat
Decode / token generation	26.0 tok/s	20.4 tok/s	+27.5%
Normalized wall time, 1150-in/2000-out	82.4 s	103.4 s	~20% less wall time
Two concurrent requests	19.7 / 19.6 tok/s	17.14 tok/s each	+15% per slot
Two-slot aggregate	35.7 tok/s	~34 tok/s	+5% aggregate
Socket power during decode	~73 W	~85 W	~14% lower

The main result: MTP raises single-stream decode from 20.4 to 26.0 tok/s while leaving prefill flat. For a roughly 200B-total MoE model, 26 tok/s single-stream on a 128 GB Strix Halo APU is a useful local lane.
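
The normalized wall-time row can be reproduced from the prefill and decode rates alone. The sketch below assumes wall time is just prompt tokens over prefill rate plus output tokens over decode rate, with no load time or per-request overhead. The joules-per-token figure is a derived estimate from the approximate socket power readings, not something the post measured directly, and the helper names are mine.

```python
# Re-derive the table's headline numbers from the reported rates.
# Assumption: wall time = prompt/prefill + output/decode, nothing else.

def wall_time_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Single-stream wall time in seconds, ignoring load and overhead."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

def pct_change(new, old):
    """Relative change of new versus old, in percent."""
    return (new - old) / old * 100.0

mtp = {"prefill": 211.2, "decode": 26.0, "watts": 73.0}   # watts are approximate
base = {"prefill": 212.0, "decode": 20.4, "watts": 85.0}

mtp_s = wall_time_s(1150, 2000, mtp["prefill"], mtp["decode"])
base_s = wall_time_s(1150, 2000, base["prefill"], base["decode"])

decode_gain_pct = pct_change(mtp["decode"], base["decode"])
time_saved_pct = -pct_change(mtp_s, base_s)

# Energy per generated token: watts / (tokens per second) = joules per token.
mtp_j_per_tok = mtp["watts"] / mtp["decode"]
base_j_per_tok = base["watts"] / base["decode"]

print(f"wall time: {mtp_s:.1f} s vs {base_s:.1f} s ({time_saved_pct:.1f}% less)")
print(f"decode gain: +{decode_gain_pct:.1f}%")
print(f"energy: {mtp_j_per_tok:.2f} vs {base_j_per_tok:.2f} J/token")
```

Because MTP draws less power and decodes faster at the same time, the per-token energy gap in this estimate is wider than either the power row or the decode row suggests on its own.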

Draft acceptance

The standard decode probe showed:

Drafted tokens: 491
Accepted draft tokens: 416
Accepted / drafted: 84.7%

Important source note: the summarized bench.json currently has "mtp.acceptance_pct": null. The 84.7% acceptance number comes from the raw tg_probe.json timing counters, not from the aggregate bench.json field.
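
A small sketch of that fallback: use the aggregate field when it is populated, otherwise recompute acceptance from the raw counters. The key "mtp.acceptance_pct" is copied as quoted above, though whether it is a flat key or a nested path in the real file is not stated. The probe key names `drafted` and `accepted` are hypothetical stand-ins, since the post does not show the tg_probe.json layout.

```python
import json

def acceptance_pct(drafted, accepted):
    """Accepted draft tokens as a percentage of drafted tokens."""
    if drafted <= 0:
        return None
    return 100.0 * accepted / drafted

# Inline stand-ins for the two files. Key names in `probe` are hypothetical.
bench = json.loads('{"mtp.acceptance_pct": null}')
probe = json.loads('{"drafted": 491, "accepted": 416}')

pct = bench.get("mtp.acceptance_pct")
if pct is None:
    # Aggregate field is empty, so fall back to the raw timing counters.
    pct = acceptance_pct(probe["drafted"], probe["accepted"])

print(f"draft acceptance: {pct:.1f}%")
```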

Context against other local lanes

These are not quality-equivalent rows, but they help place the speed tier:

Model / lane	Total / active	Quant / path	Prefill	Decode
Qwen 3.6 35B MTP	35B / A3B	Q4_K_M, Vulkan MTP	not listed here	81.2 tok/s
gpt-oss-120b	117B / A5.1B	MXFP4, Vulkan	787 tok/s	46.7 tok/s
Qwen3-Coder-Next	coder MoE	UD-Q4_K_XL, Vulkan	723.2 tok/s	44.4 tok/s
Qwen 3.5 122B MTP	122B / A10B	MXFP4_MOE, Vulkan MTP	332.1 tok/s	26.7 tok/s
StepFun 3.7 Flash MTP	~200B / A11B	UD-IQ4_XS + Q8 MTP draft	211.2 tok/s	26.0 tok/s
StepFun 3.7 Flash plain	~200B / A11B	UD-IQ4_XS, no MTP	212.0 tok/s	20.4 tok/s

The interesting part is that StepFun MTP lands in the same rough decode tier as Qwen 122B MTP while moving a much larger total-parameter model. Whether that is the best lane depends on whether StepFun's quality is worth spending the 26 tok/s tier on.
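
To put a number on "same rough decode tier", the sketch below compares decode rate and total parameter count for the rows above. It treats "~200B" as exactly 200 for the arithmetic and leaves out Qwen3-Coder-Next because the table gives no total for it; none of this says anything about output quality.

```python
# (lane, total params in billions, decode tok/s), copied from the table.
# Assumption: "~200B" is taken as 200 for a rough ratio.
lanes = [
    ("Qwen 3.6 35B MTP", 35, 81.2),
    ("gpt-oss-120b", 117, 46.7),
    ("Qwen 3.5 122B MTP", 122, 26.7),
    ("StepFun 3.7 Flash MTP", 200, 26.0),
    ("StepFun 3.7 Flash plain", 200, 20.4),
]

by_name = {name: (total_b, decode) for name, total_b, decode in lanes}
step_total, step_decode = by_name["StepFun 3.7 Flash MTP"]
qwen_total, qwen_decode = by_name["Qwen 3.5 122B MTP"]

decode_ratio = step_decode / qwen_decode   # how close the decode rates are
size_ratio = step_total / qwen_total       # how much larger the model is

print(f"decode: {decode_ratio:.3f}x of Qwen 3.5 122B MTP")
print(f"total parameters: {size_ratio:.2f}x")
```

On these figures StepFun MTP decodes within about 3% of the Qwen 122B MTP lane while carrying roughly 1.6 times the total parameters.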

submitted by /u/westsunset