r/LocalLLaMA · · 7 min read

Here is my llama.cpp NVFP4/MXFP6 GGUF quantizer tool

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Hello everyone

I wanted to share what I've been working on. I started writing NVFP4 kernels for llama.cpp last year and needed the ability to quantize NVFP4 GGUFs, so this project started as an NVFP4 quantizer. It's since become much larger. I would love to get more help to improve it.

This is what I call the advanced-quantizer-tool (MIT license).
It quantizes models directly into NVFP4 and MXFP6 GGUFs, but it can do much more.
The latest models I've made with it are here: Qwen3.6-27B-NVFP4-MTP-GGUF (version 3, 4-June-2026) and Qwopus3.6-27B-v2-MTP-NVFP4-GGUF. I have quite a few others on HF made with older revisions of the quantizer; they are not quite as good as what it produces now, but still better than converted GGUFs. Eval benchmarks were excellent and the models performed very well.

What this does that's special:

The basic idea is to start from a source BF16 GGUF, imatrix data, and a logits KLD file, then search across quantization methods and see how each holds up against the source model. It evaluates all the candidates and quantization types against the predetermined requirements/metrics, the imatrix, and the KLD data, and blends multiple quantization techniques into one final file. I also came up with my own method, which I call "RSF".
This is by no means finished, perfect, or bug-free, but there is a lot of potential for it as a dynamic quantizer tool. In the testing I've done so far, it creates NVFP4 models that perform better than ModelOpt's.
It is meant to be reproducible, so it writes reports, ledgers, tensor assignment maps, and validation logs. You can see exactly what was chosen and why, and debug the quant plan.

Some of the things it can do now:

  • Scores quant target candidates layer by layer using PPL, mean KLD, p95/p99/p999 KLD, tail KLD, RMS probability delta, same-top p, top-flip weight, entropy, file size, BPW, and tensor type.
  • Correctly creates NVFP4 weight and input tensor scales
  • Runs repeated full-model KLD evaluation over the chosen corpus
  • Treats sensitive tensors conservatively (e.g., embeddings, MTP/NextN tensors, and related grouped tensors such as QKV, gate/up pairs, experts, head groups)
  • Supports recipes, ledgers, RSF/candidate reports, and writes manifests, checkpoint keys, final tensor assignment maps and histograms.
  • Integrates the outstanding 4 over 6 NVFP4 improvement into the model (created by Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, Song Han)
  • Various other quantization ideas are incorporated (AWQ, etc)
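As a rough illustration of a few of the scoring metrics listed above, here is a minimal Python sketch, not the tool's code. It compares reference (BF16) and quantized logits token by token. The function names are hypothetical, and RMS Δp is measured here on the reference model's top token as a stand-in, which is an assumption; the tool may define it differently.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kld(p, q, eps=1e-12):
    """KL(p || q) in nats for two probability vectors."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def percentile(sorted_vals, frac):
    """Nearest-rank percentile on an ascending list."""
    idx = min(len(sorted_vals) - 1,
              max(0, math.ceil(frac * len(sorted_vals)) - 1))
    return sorted_vals[idx]

def summarize(ref_logits, quant_logits):
    """Per-token comparison of reference and quantized logits."""
    klds, same_top, sq_dp = [], 0, 0.0
    for ref, qnt in zip(ref_logits, quant_logits):
        p, q = softmax(ref), softmax(qnt)
        klds.append(kld(p, q))
        top = max(range(len(p)), key=p.__getitem__)
        same_top += top == max(range(len(q)), key=q.__getitem__)
        # Assumption: probability delta taken on the reference top token.
        sq_dp += (p[top] - q[top]) ** 2
    klds.sort()
    n = len(klds)
    return {
        "mean_kld": sum(klds) / n,
        "p99_kld": percentile(klds, 0.99),
        "rms_dp": math.sqrt(sq_dp / n),
        "same_top_p": same_top / n,
    }
```

Tail percentiles such as p99 or p999 matter because a low mean KLD can hide a small number of tokens where the quantized model diverges badly.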

RSF (Refined Scale Fitting)

  • RSF measures the imatrix-weighted reconstruction error, then searches nearby scale multipliers and picks a better lattice fit. I originally did this for NVFP4/MXFP6, but have since applied the same idea to the Q2/Q3/Q4/Q5/Q6 K-quants; it improves their quantization, too.
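The general idea of a scale search like this can be sketched in a few lines of Python. This is an illustrative guess at the shape of the technique, not the actual RSF code: it takes one block of weights, starts from the usual max-abs scale for the FP4 E2M1 value grid, tries multipliers around 1.0, and keeps the scale with the lowest importance-weighted squared error. The function names and the multiplier range are assumptions, and the scale is kept as a plain float rather than being stored in FP8 as real NVFP4 block scales are.

```python
import math

# Magnitudes representable by FP4 E2M1 (the NVFP4 element format).
E2M1 = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def snap(x, scale):
    """Round x/scale to the nearest E2M1 magnitude and scale back."""
    if scale == 0.0:
        return 0.0
    mag = min(E2M1, key=lambda g: abs(abs(x) / scale - g))
    return math.copysign(mag * scale, x)

def weighted_err(block, importance, scale):
    """Importance-weighted squared reconstruction error of one block."""
    return sum(w * (x - snap(x, scale)) ** 2
               for x, w in zip(block, importance))

def refine_scale(block, importance, multipliers=None):
    """Return (base_scale, refined_scale) for one block of weights."""
    if multipliers is None:
        # Hypothetical search range: 0.80x .. 1.20x in 0.02 steps.
        multipliers = [i / 100 for i in range(80, 121, 2)]
    base = max(abs(x) for x in block) / 6.0  # map max-abs onto 6.0
    best = min((base * m for m in multipliers),
               key=lambda s: weighted_err(block, importance, s))
    return base, best
```

Because the multiplier 1.0 is in the search set, the refined scale is never worse than the max-abs scale under this error measure; it can only match or beat it.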

Tensor promotion

It starts everything as NVFP4 (or whatever type is specified), and then promotes tensors to a higher-precision type at the final stage when the remaining error justifies the size/speed loss, using a weighted score.
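One way such a weighted score could look is sketched below in Python. Everything here is hypothetical: the `Candidate` fields, the weight values, and the "promote if the score is positive" rule are assumptions for illustration, not the tool's real scoring.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    err_drop: float   # error removed by promoting (e.g. a KLD delta)
    extra_mib: float  # file size added by the promotion
    slowdown: float   # estimated relative decode slowdown

def promotion_score(c, w_err=1.0, w_size=0.002, w_speed=0.5):
    """Benefit minus weighted size and speed costs (weights are made up)."""
    return w_err * c.err_drop - w_size * c.extra_mib - w_speed * c.slowdown

def plan_promotions(cands, **weights):
    """Names of tensors worth promoting, best score first."""
    scored = [(promotion_score(c, **weights), c) for c in cands]
    scored.sort(key=lambda sc: sc[0], reverse=True)
    return [c.name for s, c in scored if s > 0.0]

if __name__ == "__main__":
    cands = [
        Candidate("blk.0.attn_v.weight", 0.020, 4.0, 0.001),
        Candidate("blk.5.ffn_up.weight", 0.001, 8.0, 0.002),
    ]
    print(plan_promotions(cands))
```

In this toy input only the first tensor clears the bar: its error reduction outweighs the size and speed penalties, while the second costs more than it gains.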

MXFP6 future
Blackwell supports native hardware scaling for MXFP6 right now, but nobody has written any real kernels for it and there haven't been any models. So I wrote a full working MXFP6 CUDA implementation that works great for me. I have posted a few mixed NVFP4/MXFP6 models (made prior to the latest improvements in the tool, so new versions will be even better), and found that promoting just a few 'weak' tensors from NVFP4 to MXFP6 improves model quality significantly. The latest MXFP6 kernels are still slower than NVFP4 when the model is all MXFP6 (as expected, since it's larger), but they have come a long way and the latest CUDA builds are almost there. In terms of quantization error, MXFP6 is superior to NVFP4. An NVFP4 model with a small portion of MXFP6 layers isn't noticeably slower (on Blackwell at least) and is barely larger.

Quantization Depth Presets:
There are three default modes to choose from.

  • Fast: smaller depth search, lighter RSF work, quicker candidate filtering.
  • Normal: the intended default for real, serious runs. Slower, and may require better GPU resources.
  • Deep: intense, wider, exhaustive search with improved validation. This is very slow. I would love to know how this works on big Blackwell GPUs like B200.

Mode comparison for Qwen3.5-0.8B on RTX 5090:

Mode    Size        Quant time  Mean PPL(Q)  Mean KLD  99.9% KLD  RMS Δp  Same top p  Top flip
normal  431.25 MiB  35:48.81    21.348164    0.120205  1.629277   8.491%  80.468%     0.019304
deep    432.18 MiB  57:53.39    21.017407    0.100507  1.245584   7.672%  81.869%     0.016312

The selector stage compares candidate policies using a quick proxy error evaluated from the initial KLD data, caches it, then looks for per-tensor wins subject to KLD guards. It then reviews the final list of tensor candidates and patches each layer with the best one.
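A selector loop of this kind might be shaped roughly like the Python sketch below. It is a simplified illustration under stated assumptions, not the tool's logic: `proxy_error` and `full_kld` are hypothetical callbacks supplied by the caller, the first entry in `candidates` is treated as the baseline type, and the guard simply rejects any swap that fails to lower the full-model KLD.

```python
def select(tensors, candidates, proxy_error, full_kld, kld_slack=0.0):
    """Pick one candidate per tensor: rank by proxy, confirm by KLD."""
    cache = {}

    def cached_proxy(tensor, cand):
        # The cheap proxy is computed once per (tensor, candidate) pair.
        key = (tensor, cand)
        if key not in cache:
            cache[key] = proxy_error(tensor, cand)
        return cache[key]

    plan = {t: candidates[0] for t in tensors}  # start from the baseline
    best_kld = full_kld(plan)
    for t in tensors:
        ranked = sorted(candidates, key=lambda c: cached_proxy(t, c))
        for cand in ranked:
            if cand == plan[t]:
                break  # nothing ranked above the current choice passed
            trial = dict(plan)
            trial[t] = cand
            trial_kld = full_kld(trial)
            # KLD guard: keep the swap only if the full-model KLD
            # does not get worse than the best plan so far.
            if trial_kld <= best_kld - kld_slack:
                plan, best_kld = trial, trial_kld
                break
    return plan, best_kld
```

The point of the two-level check is cost: the proxy is cheap enough to run for every pair, while the expensive full-model evaluation is reserved for candidates the proxy already favors.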

Powered by CUDA and llama.cpp

KLD and the heavy evaluations use CUDA as much as possible and are designed to keep as much work on the device as they can, reducing host/device copies. The model is patched in VRAM repeatedly, so it's only written to disk once. Every evaluation requantizes the layer into each of the available candidate types and then rechecks the KLD/PPL, entirely in memory. Host-side work uses parallel CPU workers to speed things up, and the maximum number of threads can be specified. The final GGUF write is only done at the very end.

The tool picks the n_seq to use for KLD eval based on available VRAM and writes reusable checkpoints to disk, so long runs can be stopped and resumed later. Existing, previously quantized GGUF models can be edited and improved further as needed, provided the source KLD data and BF16 model are available. This can also be used in a different way, to do a form of finetuning with a new imatrix file. I am investigating doing more of that separately, in a more defined way.

Modular for Research
The design brings in candidates and quantization policies/techniques as choices while it quantizes, and adding a new one is really easy. If a better way to quantize NVFP4 (or any other type) becomes available, or you want to study one, all it needs is the new method written as a regular C or C++ function, then added to the policies and as a candidate. The rest of the quantization, PPL/KLD handling, imatrix, inference, backend handling, etc., is normal llama.cpp. So a new quantization technique or method can easily be tested and compared against the others. You can quickly make a real model with it and see how it performs in a real setting.
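The plug-in shape described here can be illustrated with a small registry. Note the assumptions: the tool itself expects new methods as C or C++ functions, so this Python version only shows the pattern (register a function under a name, then compare every registered method on the same data). The registry, the decorator, and the two toy rounding methods are all hypothetical.

```python
from typing import Callable, Dict, List

QuantFn = Callable[[List[float]], List[float]]
REGISTRY: Dict[str, QuantFn] = {}

def register(name):
    """Decorator that adds a quantize-dequantize function as a candidate."""
    def deco(fn):
        REGISTRY[name] = fn
        return fn
    return deco

@register("round_to_half")
def round_to_half(values):
    # Toy method: snap to multiples of 0.5.
    return [round(v * 2) / 2 for v in values]

@register("round_to_int")
def round_to_int(values):
    # Toy method: snap to whole numbers.
    return [float(round(v)) for v in values]

def compare(values):
    """Mean squared reconstruction error of each registered method."""
    return {
        name: sum((a - b) ** 2 for a, b in zip(values, fn(values))) / len(values)
        for name, fn in REGISTRY.items()
    }
```

Adding a third method means writing one more decorated function; the comparison code does not change, which is the property the paragraph above is describing.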

There is a text-based UI wizard, but it is far from finished or perfect and was not the primary focus. I've created various SKILLS/AGENTS MD files for an AI coder to work with it: tell it exactly what you want, and it will know what to do from the MD instructions. Everything can still be done from the command line, however.

Known issues:

  • There are too many options and parameters exposed as CLI flags or defines, which makes it quite complicated to understand.
  • Much of the code and many of the options still presume NVFP4 is the only quant target.
  • Various functions and candidate logic need further human cleanup from bloaty AI code.
  • ETA/progress reporting is not perfect and can be quite misleading, mostly at the late selector eval stages
  • Docs need to be improved
  • The entire process would benefit from more simplification once it is feature complete.
  • Speed could be optimized much further by removing and reducing duplicated candidate logic. The deep search optimization for Qwen3.6-27B took my machine about 17 hours. The model is great. But this is far too slow.
  • Not tested with multiple GPUs [I only have done all of this with a single 5090]
  • The scoring and weight values for "what metric is most important", which guide the selector in prioritizing candidates, would be better tuned by people who know more about this than I do

I’m hopeful this can be a useful tool for everyone and for improving NVFP4 and MXFP6. PRs or help making the tool better would be very welcome!

submitted by /u/ElectronicStranger53