I ran a quantization shootout on Qwen3-Coder and the results are... interesting
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Out of random curiosity I ran a shootout on Qwen3-Coder-Next. I've been using the MXFP4_MOE quant from unsloth for a while because it's just really fast on my system, but I was curious about precision. I know quantization hurts the model, but I don't think I had really understood that until I tested it myself.

Hardware: 3× R9700 PRO (96 GB VRAM)
Backend: llama.cpp Vulkan
Eval: wikitext-2 (583 chunks, ctx 512)
Formats tested: MXFP4_MOE, Q4_K_M, Q5_K_M, UD-Q5_K_M

TLDR: UD-Q5_K_M is cooking! Better quality than formats half its size, barely any speed penalty. Unsloth's dynamic precision approach is really good. I might need to test it at lower quants now.

The Numbers

UD-Q5_K_M wins on literally every quality metric while only being ~10 GB larger than MXFP4.

Here's the thing nobody talks about: token accuracy compounds exponentially. A roughly 5-point difference in per-token agreement becomes a ~150× difference in the chance of a fully matching output by token 100, if you treat each token as an independent trial. All LLMs are autoregressive. Yann LeCun is always talking about this: LLMs suffer from exponentially diverging error probabilities. This is where all your hallucinations and stuff happen.

MXFP4 (89.4%) → 100 token output: 0.0014% chance of perfect agreement
UD-Q5_K_M (94%) → 100 token output: 0.21% chance of perfect agreement

That's not a big number, but on long refactoring tasks or multi-step reasoning, you feel it. MXFP4 "goes off the rails" way more often.

There is a speed trade-off to all of this though.

Prefill (batch 512): MXFP4 still fastest (hardware kernels)
Prefill (batch 4096): MXFP4 wins again
Decode: Q4_K_M edges UD-Q5 slightly, but UD-Q5 is within 9% despite being 22% larger

For interactive coding (which is decode-bound anyway), the speed hit is negligible.

For me, I swapped my default from MXFP4 to UD-Q5_K_M. MXFP4 is still great for heavy prefill workloads, but for daily code generation where you care about quality over speed, UD-Q5 is the clear winner.

What quants are you guys running for code models? Are you finding the same quality cliff with aggressive compression? And if you're on Nvidia hardware, are you seeing different tradeoffs than RDNA?
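
If you want to sanity-check the compounding math from the post, here is a minimal Python sketch. It assumes each token agrees with the reference independently at the quoted per-token rate, which is a simplification (real errors are correlated). The function name `perfect_agreement_probability` is just made up for illustration.

```python
def perfect_agreement_probability(per_token_agreement: float, n_tokens: int) -> float:
    """Chance that all n_tokens match the reference output.

    Assumption: every token agrees independently with the same probability,
    so the per-token rates simply multiply.
    """
    return per_token_agreement ** n_tokens


# Per-token agreement rates quoted in the post.
rates = {"MXFP4_MOE": 0.894, "UD-Q5_K_M": 0.94}
n_tokens = 100

results = {name: perfect_agreement_probability(p, n_tokens) for name, p in rates.items()}

for name, prob in results.items():
    print(f"{name}: {prob:.4%} chance of a perfect {n_tokens}-token match")

# How many times more likely the higher-precision quant is to match end to end.
ratio = results["UD-Q5_K_M"] / results["MXFP4_MOE"]
print(f"ratio: about {ratio:.0f}x")
```

Under that independence assumption, the two percentages above fall out directly, and the gap between the two quants works out to roughly 150×.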