r/LocalLLaMA · 3 min read

Qwen 3.6 35B GGUF: NTP vs MTP quantization results across GPUs and CPUs

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Hey r/LocalLLaMA,

We’ve released our ByteShape Qwen 3.6 35B GGUF quantizations in two families: standard NTP (next-token prediction, i.e. non-MTP) and MTP (multi-token prediction).

Blog / Download NTP Models / Download MTP Models

TL;DR

  • For NTP, “pick the largest quant that fits” worked surprisingly well.
  • Lower bpw (bits per weight) was not automatically better: our largest model was very hard to beat on both quality and speed, including prompt processing and token generation.
  • MTP gave a real GPU generation-speed boost, usually around 20–40%, but the extra memory footprint can change what fits.
  • MTP speedup is heavily workload dependent.
  • CPU MTP was not attractive in our tests, so our CPU recommendation remains NTP.
  • We excluded MMLU from this release because Qwen 3.6 showed answer-format compliance issues in full precision, making it a noisy quantization-comparison signal.

For this release, we tried to make the comparison more of a small hardware study than just a model drop. We benchmarked the original model and a broader set of quantized variants across RTX 4090, 5090, Pro 6000, 4080, and 5060 Ti GPUs, plus Intel i7, Intel Ultra 7, Ryzen 9, and Raspberry Pi 5 CPUs. Shoutout to the quantizers we included in the comparisons: Bartowski, Unsloth, Mudler, and AesSedai. We picked a few of the most recommended quants from each quantizer rather than evaluating every single one, since the results would arrive too late for anyone to care (or after 3.7 comes out ;) ).

The main NTP result was a bit counterintuitive. Usually, you expect smaller bpw quants to win clearly on speed. Here our largest release variant often stayed competitive not only in quality but also in prompt processing and token generation. So bpw is not something to minimize blindly: if the larger model fits your memory and context budget, it may still be the better choice.
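To make the "largest quant that fits" rule concrete, here is a minimal Python sketch. It assumes weight size is roughly parameters × bpw / 8 bytes and that you reserve a fixed amount of memory for context and runtime overhead. The quant names, bpw values, and reserve figure are placeholders for illustration, not our actual releases or measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Quant:
    name: str   # hypothetical label, not a real release file name
    bpw: float  # average bits per weight


def weights_gib(n_params: float, bpw: float) -> float:
    """Approximate weight size in GiB: parameters * bits-per-weight / 8 bytes."""
    return n_params * bpw / 8 / 2**30


def pick_largest_that_fits(
    quants: list[Quant],
    n_params: float,
    memory_gib: float,
    reserve_gib: float,
) -> Optional[Quant]:
    """Return the highest-bpw quant whose weights fit in memory_gib - reserve_gib."""
    budget = memory_gib - reserve_gib
    fitting = [q for q in quants if weights_gib(n_params, q.bpw) <= budget]
    return max(fitting, key=lambda q: q.bpw, default=None)


# Placeholder values for illustration only; they are not ByteShape's quants.
candidates = [Quant("quant-a", 3.0), Quant("quant-b", 4.0), Quant("quant-c", 5.0)]
choice = pick_largest_that_fits(
    candidates, n_params=35e9, memory_gib=24.0, reserve_gib=4.0
)
print(choice)
```

The reserve is the part that moves in practice: longer contexts need a larger reserve, which can push the choice down a size.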

There are hardware-specific exceptions, especially on 16GB devices and Raspberry Pi 5, so we put the full recommendations and plots in the blog rather than trying to compress all of them here.

For MTP, the trade-off is different. On GPUs, we saw a meaningful generation-speed boost, usually around 20–40%, though this is heavily workload dependent, so test it on your own workload. But MTP also increases runtime memory, so on 16GB GPUs the larger MTP model was no longer practical at our context settings, making model GPU-2 MTP the usable recommendation. The MTP results also support the same bpw observation: in some cases, the larger model basically catches up with the smaller model in throughput.
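Since the boost depends on workload, it helps to compute it per workload from your own tokens/s numbers. A minimal Python sketch, assuming you have already measured generation speed for the NTP and MTP variants of the same quant; the workload names and numbers below are placeholders, not our results.

```python
def speedup_percent(ntp_tps: float, mtp_tps: float) -> float:
    """Relative generation-speed change of MTP over NTP, in percent."""
    if ntp_tps <= 0:
        raise ValueError("ntp_tps must be positive")
    return (mtp_tps / ntp_tps - 1.0) * 100.0


def compare_workloads(
    measurements: dict[str, tuple[float, float]],
) -> dict[str, float]:
    """Map each workload name to its MTP-over-NTP speedup in percent."""
    return {
        name: speedup_percent(ntp, mtp)
        for name, (ntp, mtp) in measurements.items()
    }


# Placeholder tokens/s pairs (NTP, MTP); replace with your own measurements.
example = {
    "workload-a": (100.0, 140.0),
    "workload-b": (100.0, 105.0),
}
for name, pct in compare_workloads(example).items():
    print(f"{name}: {pct:+.1f}%")
```

A negative value means MTP was slower for that workload, which is worth checking before committing the extra memory.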

CPU MTP was not attractive in our tests. Prompt processing is already slow on CPUs, and MTP makes it worse. For now, our CPU recommendation remains NTP.

Methodology note: we found an answer-format compliance issue in Qwen 3.6 that we did not see in the same way with Qwen 3.5. In several MMLU cases, the full-precision model appeared to know the answer, but did not respond in the strict format expected by the benchmark, despite the prompts being 5-shot. Since this was already a baseline-model behavior rather than a quantization artifact, we excluded MMLU from the benchmarking for this release.
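One way to see this kind of issue is to score the same responses twice, once with a strict parser and once with a lenient one, and track format compliance separately. The Python sketch below is illustrative only: the strict format (a bare letter A–D) and the lenient pattern are assumptions, not the actual rules of our evaluation harness, and the sample responses are made up.

```python
import re

# Assumed strict format: the whole response is a single letter A-D.
STRICT = re.compile(r"\s*([A-D])\s*")
# Assumed lenient pattern: "answer is X" or "answer: X", optional parentheses.
LENIENT = re.compile(r"answer\s*(?:is|:)\s*\(?([A-D])\b\)?", re.IGNORECASE)


def strict_extract(response: str):
    """Return the letter only if the response is exactly one choice letter."""
    m = STRICT.fullmatch(response)
    return m.group(1) if m else None


def lenient_extract(response: str):
    """Return the first letter that follows an 'answer is' / 'answer:' phrase."""
    m = LENIENT.search(response)
    return m.group(1).upper() if m else None


def score(samples):
    """samples: list of (response, gold_letter). Returns rates in [0, 1]."""
    n = len(samples)
    strict_hits = sum(strict_extract(r) == gold for r, gold in samples)
    lenient_hits = sum(
        (strict_extract(r) or lenient_extract(r)) == gold for r, gold in samples
    )
    compliant = sum(strict_extract(r) is not None for r, _ in samples)
    return {
        "strict_accuracy": strict_hits / n,
        "lenient_accuracy": lenient_hits / n,
        "format_compliance": compliant / n,
    }


# Made-up responses for illustration.
samples = [
    ("B", "B"),
    ("The correct answer is (C).", "C"),
    ("A", "D"),
    ("I think the answer: d", "D"),
]
print(score(samples))
```

A large gap between strict and lenient accuracy on the unquantized model points at a formatting problem rather than a knowledge problem, which is why such a benchmark becomes a noisy signal for comparing quants.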

So, the important takeaway is:

For this model, “pick the largest quant that fits” worked surprisingly well for NTP. MTP is worth it on GPUs if you have the memory headroom, but it changes what fits and is not automatically better on CPUs.

We’ll keep Reddit short-ish. The blog has the full graphs, experiments, hardware breakdowns, and methodology details.

submitted by /u/enrique-byteshape