r/LocalLLaMA · · 2 min read

I ran a quantization shootout on Qwen3-Coder and the results are... interesting

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Out of random curiosity I ran a shootout on Qwen3-Coder-Next. I've been using the MXFP4_MOE from unsloth for a while as it's just really fast on my system, but I was curious about precision. I know quantization hurts the model, but I don't think I had really understood that till I tested it myself.

Hardware: 3× R9700 PRO (96 GB VRAM)

Backend: llama.cpp Vulkan

Eval: wikitext-2 (583 chunks, ctx 512)

Formats tested: MXFP4_MOE, Q4_K_M, Q5_K_M, UD-Q5_K_M

TLDR: UD-Q5_K_M is cooking! Better quality than the smaller formats, barely any speed penalty. Unsloth's dynamic precision approach is really good. I might need to test it at lower quants now.

The Numbers
(no shit I asked claude to make me a table to copy pasta)

| Metric | MXFP4 | Q4_K_M | Q5_K_M | UD-Q5_K_M |
|---|---|---|---|---|
| Same top-1 | 89.4% | 89.6% | 93.0% | 94.0% |
| Mean KL divergence | 0.0746 | 0.0685 | 0.0308 | 0.0217 |
| Max KL (worst token) | 13.04 | 5.93 | 8.19 | 4.75 |
| File size | 44.7 GB | 45.2 GB | 52.9 GB | 55.2 GB |

UD-Q5_K_M wins on literally every quality metric while only being ~10 GB larger than MXFP4.
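For anyone wondering what those three quality rows actually measure, here's a minimal sketch of how "same top-1", mean KL and max KL can be computed from per-token probability distributions. This is not the exact code behind the table: the post doesn't say which reference model the quants were compared against, and `kl_divergence`, `compare` and the toy numbers are made up for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two probability lists over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def argmax(probs):
    """Index of the most likely token."""
    return max(range(len(probs)), key=probs.__getitem__)

def compare(ref_probs, quant_probs):
    """Return (same_top1_rate, mean_kl, max_kl) over a list of token positions."""
    kls, matches = [], 0
    for p, q in zip(ref_probs, quant_probs):
        kls.append(kl_divergence(p, q))
        if argmax(p) == argmax(q):
            matches += 1
    n = len(kls)
    return matches / n, sum(kls) / n, max(kls)

# Toy data (hypothetical): 3 token positions, vocabulary of 3 tokens.
ref = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.40, 0.35, 0.25]]
quant = [[0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.30, 0.45, 0.25]]

same_top1, mean_kl, max_kl = compare(ref, quant)
print(f"same top-1: {same_top1:.1%}, mean KL: {mean_kl:.4f}, max KL: {max_kl:.4f}")
```

Mean KL tells you how far the quant drifts on average, max KL is the single worst position, and same top-1 is how often the quant would pick the same greedy token as the reference.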

Here's the thing nobody talks about: token accuracy compounds exponentially.

A ~5-point difference in per-token agreement becomes a roughly 150× difference by token 100. All LLMs are autoregressive. Yann LeCun is always talking about this, that LLMs suffer from exponentially diverging error probabilities. This is where all your hallucinations and stuff happen.

MXFP4 (89.4%) → 100 token output: 0.0014% chance of perfect agreement

UD-Q5_K_M (94%) → 100 token output: 0.21% chance of perfect agreement

That's not a big number, but on long refactoring tasks or multi-step reasoning, you feel it. MXFP4 "goes off the rails" way more often.
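The compounding math is easy to check yourself. A minimal sketch, assuming every token agrees independently with the same probability (a simplification: real agreement rates vary by position, and one mismatch doesn't always derail the output). The rates are the "Same top-1" row from the table; `p_full_agreement` is just a name I picked.

```python
def p_full_agreement(per_token_rate, n_tokens):
    """Chance that all n_tokens match, assuming independent per-token agreement."""
    return per_token_rate ** n_tokens

# "Same top-1" rates from the table above.
rates = {"MXFP4": 0.894, "Q4_K_M": 0.896, "Q5_K_M": 0.930, "UD-Q5_K_M": 0.940}
n = 100

for name, rate in rates.items():
    print(f"{name:10s} {p_full_agreement(rate, n):.4%}")

ratio = p_full_agreement(rates["UD-Q5_K_M"], n) / p_full_agreement(rates["MXFP4"], n)
print(f"UD-Q5_K_M vs MXFP4 at {n} tokens: {ratio:.0f}x")
```

With the table's numbers the ratio lands at roughly 150×. Change `n` to see how fast the gap widens on longer outputs.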

There is a speed tradeoff to all of this though.

Prefill (batch 512): MXFP4 still fastest (hardware kernels)

Prefill (batch 4096): MXFP4 wins again

Decode: Q4_K_M edges UD-Q5 slightly, but UD-Q5 is within 9% despite being 22% larger

For interactive coding (which is decode-bound anyway), the speed hit is negligible.
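For reference, the size percentages fall straight out of the file sizes in the table. A quick sketch (`pct_larger` is just an illustrative helper name):

```python
# File sizes in GB, copied from the table above.
sizes_gb = {"MXFP4": 44.7, "Q4_K_M": 45.2, "Q5_K_M": 52.9, "UD-Q5_K_M": 55.2}

def pct_larger(a, b):
    """How much larger format a is than format b, in percent."""
    return (sizes_gb[a] / sizes_gb[b] - 1) * 100

for other in ("MXFP4", "Q4_K_M", "Q5_K_M"):
    print(f"UD-Q5_K_M is {pct_larger('UD-Q5_K_M', other):.1f}% larger than {other}")
```

The 22% figure is UD-Q5_K_M relative to Q4_K_M; against MXFP4 the gap is a bit bigger.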

For me, I swapped my default from MXFP4 to UD-Q5_K_M. MXFP4 is still great for heavy prefill workloads but for daily code generation where you care about quality over speed, UD-Q5 is the clear winner.

What quants are you guys running for code models? Are you finding the same quality cliff with aggressive compression? And if you're on Nvidia hardware, are you seeing different tradeoffs than RDNA?

https://preview.redd.it/0z8kkkhjkp2h1.png?width=1130&format=png&auto=webp&s=aadcce727dc26d756d67d4e356a709aa96fd030f

submitted by /u/alphatrad
