StepFun 3.7 Flash MTP Bench Strix Halo
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
This is the StepFun Step-3.7-Flash UD-IQ4_XS main model with the official StepFun MTP Q8_0 draft model, served through a patched llama.cpp Vulkan/RADV build.
Host
- System: AMD Ryzen AI Max+ 395 / Radeon 8060S (gfx1151)
- Memory: 128 GB unified LPDDR5X
- BIOS UMA / VRAM: 4 GB UMA dedicated VRAM
- GTT ceiling: 112 GiB
- IOMMU: enabled (amd_iommu=on)
- OS: Ubuntu 25.04 (Plucky)
- Kernel: 6.18.1-061801-generic
- Mesa / RADV: Mesa 25.2.8 / RADV
- ROCm: 7.1.1 baseline; some later rows also reference ROCm 7.2.x runtime libraries
Model
- Main model: StepFun Step-3.7-Flash UD-IQ4_XS
- Main model size on disk: 95,336,010,208 bytes / 88.79 GiB
- Main model shards: 3
- Draft model: Step-3.7-Flash-MTP-Q8_0.gguf
- Draft model size: about 3.5 GiB
- Architecture: step35
- Model class: roughly 200B total parameters / about 11B active parameters per token
- Backend: llama.cpp Vulkan/RADV b9360 with Step-3.7 MTP patch
- Context used for this bench: 12,288
- MTP settings: DRAFT_N=2, PMIN=0.60, UBATCH=512
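As a rough sanity check on why this fits at all, the weight sizes can be set against the GTT ceiling from the Host section. This is only a back-of-the-envelope sketch: it counts weights on disk, and whatever is left over still has to hold the KV cache for the 12,288-token context, compute buffers, and any other GTT users, none of which are measured here.

```python
# Back-of-the-envelope memory budget from the numbers listed above.
GIB = 1024 ** 3

main_model_bytes = 95_336_010_208  # main model size on disk, from the post
draft_model_gib = 3.5              # "about 3.5 GiB", so this is approximate
gtt_ceiling_gib = 112              # GTT ceiling from the Host section

main_model_gib = main_model_bytes / GIB
weights_gib = main_model_gib + draft_model_gib
headroom_gib = gtt_ceiling_gib - weights_gib

print(f"main model:    {main_model_gib:.2f} GiB")
print(f"weights total: {weights_gib:.2f} GiB")
print(f"GTT headroom:  {headroom_gib:.2f} GiB")
```

With these inputs the weights come to roughly 92 GiB, leaving a little under 20 GiB below the 112 GiB ceiling.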
Latest measured numbers
| Metric | StepFun MTP | Non-MTP baseline | Change |
|---|---|---|---|
| Load to listening | ~31 s | ~31 s | no startup penalty observed |
| Prefill / prompt processing | 211.2 tok/s | 212.0 tok/s | basically flat |
| Decode / token generation | 26.0 tok/s | 20.4 tok/s | +27.5% |
| Normalized wall time, 1150-in/2000-out | 82.4 s | 103.4 s | 20.8% faster |
| Two concurrent requests | 19.7 / 19.6 tok/s | 17.14 tok/s each | +15% per slot |
| Two-slot aggregate | 35.7 tok/s | ~34 tok/s | +5% aggregate |
| Socket power during decode | ~73 W | ~85 W | ~14% lower |
The main result: MTP materially improves decode speed without hurting prefill. For a roughly 200B-total MoE model, 26 tok/s single-stream on a 128 GB Strix Halo APU is a useful local lane.
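The normalized wall-time row can be approximately reproduced from the prefill and decode rates. The sketch below assumes the simplest possible model, prompt tokens divided by prefill rate plus output tokens divided by decode rate; the post does not state the formula it actually used, so treat this as an assumption.

```python
def wall_time_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Estimated wall time: prefill phase plus decode phase, no overheads."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# Rates taken from the table above (rounded as published).
mtp = wall_time_s(1150, 2000, prefill_tps=211.2, decode_tps=26.0)
base = wall_time_s(1150, 2000, prefill_tps=212.0, decode_tps=20.4)

decode_gain_pct = (26.0 / 20.4 - 1) * 100  # throughput increase
time_saved_pct = (1 - mtp / base) * 100    # wall-time reduction

print(f"MTP wall time:      {mtp:.1f} s")
print(f"baseline wall time: {base:.1f} s")
print(f"decode gain:        {decode_gain_pct:.1f}%")
print(f"wall time saved:    {time_saved_pct:.1f}%")
```

From the rounded table values this gives about 82.4 s and 103.5 s, and a wall-time reduction near 20.4%. That is slightly below the 20.8% in the table, which may have been computed from unrounded measurements.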
Draft acceptance
The standard decode probe showed:
- Drafted tokens: 491
- Accepted draft tokens: 416
- Accepted / drafted: 84.7%
Important source note: the summarized bench.json currently has "mtp.acceptance_pct": null. The 84.7% acceptance number comes from the raw tg_probe.json timing counters, not from the aggregate bench.json field.
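For anyone recomputing the figure from the raw counters, the arithmetic is just accepted over drafted. The key names in the dictionary below are hypothetical; the actual layout of tg_probe.json is not shown in the post.

```python
def acceptance_pct(drafted, accepted):
    """Percentage of drafted tokens accepted; None if nothing was drafted."""
    if drafted <= 0:
        return None
    return 100.0 * accepted / drafted

# Counter values from the decode probe above; key names are illustrative only.
counters = {"drafted": 491, "accepted": 416}

pct = acceptance_pct(counters["drafted"], counters["accepted"])
print(f"accepted / drafted: {pct:.1f}%")
```

Returning None when no tokens were drafted keeps a missing measurement distinct from a genuine 0% acceptance rate.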
Context against other local lanes
These are not quality-equivalent rows, but they help place the speed tier:
| Model / lane | Total / active | Quant / path | Prefill | Decode |
|---|---|---|---|---|
| Qwen 3.6 35B MTP | 35B / A3B | Q4_K_M, Vulkan MTP | not listed here | 81.2 tok/s |
| gpt-oss-120b | 117B / A5.1B | MXFP4, Vulkan | 787 tok/s | 46.7 tok/s |
| Qwen3-Coder-Next | coder MoE | UD-Q4_K_XL, Vulkan | 723.2 tok/s | 44.4 tok/s |
| Qwen 3.5 122B MTP | 122B / A10B | MXFP4_MOE, Vulkan MTP | 332.1 tok/s | 26.7 tok/s |
| StepFun 3.7 Flash MTP | ~200B / A11B | UD-IQ4_XS + Q8 MTP draft | 211.2 tok/s | 26.0 tok/s |
| StepFun 3.7 Flash plain | ~200B / A11B | UD-IQ4_XS, no MTP | 212.0 tok/s | 20.4 tok/s |
The interesting part is that StepFun MTP lands in the same rough decode tier as Qwen 122B MTP while moving a much larger total-parameter model. Whether that is the best lane depends on whether StepFun's quality is worth spending the 26 tok/s tier on.