We built a calibration-aware Q4_K_M quant of Qwen3.5 0.8B that recovers 96.5% of the BF16 gap vs pure llama.cpp Q4_K_M (SpectralQuant)

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Hey everyone,

We at Spectral Labs just published our first release candidate: a Qwen3.5 0.8B Q4_K_M built with a new calibration-aware quantization approach we're calling SpectralQuant.

The goal here was to see if we could make a standard Q4_K_M footprint behave more like a larger quant format, without breaking standard llama.cpp compatibility or adding mixed-precision sidecars.

The Method (SpectralQuant)

Normally, quantization is treated as a local rounding problem. SpectralQuant tackles it differently. We use calibration signals to identify behaviorally sensitive directions in the model. Instead of spreading quantization error evenly, we shape the error so that lower-impact areas absorb more of the compression burden, protecting the weights that matter most.
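To make the idea of shaping error concrete, here is a minimal sketch of importance-weighted scale selection for a single block of weights. This is not the SpectralQuant algorithm, which the post does not describe in detail. It assumes a simplified symmetric 4-bit format rather than the real Q4_K_M block layout, and the `importance` vector, `quantize_block`, and `best_scale` are hypothetical stand-ins for whatever calibration signal and search the actual method uses.

```python
import numpy as np

def quantize_block(w, scale, bits=4):
    """Symmetric round-to-nearest quantization of one block, returned dequantized."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def best_scale(w, importance, bits=4, n_candidates=64):
    """Grid-search the block scale that minimizes importance-weighted squared error."""
    qmax = 2 ** (bits - 1) - 1
    base = np.max(np.abs(w)) / qmax  # plain abs-max scale
    candidates = base * np.linspace(0.5, 1.2, n_candidates)
    errors = [np.sum(importance * (w - quantize_block(w, s, bits)) ** 2)
              for s in candidates]
    return candidates[int(np.argmin(errors))]

def weighted_err(w, importance, scale):
    """Importance-weighted squared error of the block at a given scale."""
    return float(np.sum(importance * (w - quantize_block(w, scale)) ** 2))

rng = np.random.default_rng(0)
w = rng.normal(size=32)                       # one block of 32 weights
importance = rng.uniform(0.01, 1.0, size=32)  # hypothetical calibration-derived weights
importance[:4] *= 50                          # pretend four weights are very sensitive

s_plain = best_scale(w, np.ones_like(w))      # treats every weight equally
s_aware = best_scale(w, importance)           # lets low-importance weights absorb error

err_plain = weighted_err(w, importance, s_plain)
err_aware = weighted_err(w, importance, s_aware)
print(f"weighted error, plain scale: {err_plain:.4f}")
print(f"weighted error, aware scale: {err_aware:.4f}")
```

Both scales are drawn from the same candidate grid, so the importance-aware choice can never score worse on the weighted metric; the storage format and bit budget are unchanged, only where the rounding error ends up.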

The Results

We evaluate based on prompt loss across multiple validation sets (lower is better). For this release, we compared our fixed-footprint Q4_K_M (4.52 BPW / 415.7 MiB) against the BF16 reference, standard llama.cpp pure Q4_K_M, and a range of Unsloth quants.

| Model | BPW (est.) | Size (MiB) | convergence60 | heldout120 | C4 (64x256) |
|---|---|---|---|---|---|
| BF16 reference | 16.01 | 1446.5 | 2.2682 | 2.9809 | |
| SpectralQuant Q4_K_M | 4.52 | 415.7 | 2.2509 | 2.9961 | 3.2874 |
| Unsloth UD-Q4_K_XL | 5.79 | 532.9 | 2.2833 | 2.9913 | |
| Unsloth IQ4_NL | 5.26 | 483.4 | 2.3289 | 3.0484 | |
| Unsloth Q4_K_M | 5.52 | 507.8 | 2.3268 | 3.0510 | 3.2574 |
| Unsloth Q4_K_S | 5.27 | 484.6 | 2.3126 | 3.0700 | |
| Unsloth IQ4_XS | 5.11 | 469.8 | 2.3869 | 3.1061 | |
| llama.cpp pure Q4_K_M | 4.52 | 415.7 | 2.7404 | 3.4135 | 3.3014 |

  • BF16 Gap Recovery: On our heldout120 evaluation suite, pure llama.cpp Q4_K_M hits a loss of 3.4135 (vs BF16's 2.9809). SpectralQuant drops that loss to 2.9961. That is a 96.5% recovery of the gap between pure Q4_K_M and full BF16.
  • Vs. Unsloth: At 4.52 BPW, SpectralQuant achieves lower prompt loss on heldout120 than Unsloth's Q4_K_S, Q4_K_M, IQ4_NL, and IQ4_XS, all of which have a larger footprint (5.11 to 5.52 BPW). Unsloth's UD-Q4_K_XL (5.79 BPW) is still slightly ahead at 2.9913.
  • C4 Validation: We also see an improvement on standard C4 validation over pure Q4_K_M at the same footprint (3.2874 vs 3.3014), though Unsloth's Q4_K_M edges it out here at 3.2574 (while using ~92 MiB more).
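The gap-recovery figure can be reproduced from the table. The sketch below assumes recovery is defined as the share of the pure-Q4_K_M-to-BF16 loss gap that SpectralQuant closes, which matches the quoted 96.5%; the function name `gap_recovery` is just a label for this example.

```python
# Losses copied from the results table (lower is better).
bf16 = {"convergence60": 2.2682, "heldout120": 2.9809}
pure_q4km = {"convergence60": 2.7404, "heldout120": 3.4135}
spectral = {"convergence60": 2.2509, "heldout120": 2.9961}

def gap_recovery(baseline, candidate, reference):
    """Fraction of the baseline-to-reference loss gap closed by the candidate."""
    return (baseline - candidate) / (baseline - reference)

recovery = {
    suite: gap_recovery(pure_q4km[suite], spectral[suite], bf16[suite])
    for suite in bf16
}

for suite, r in recovery.items():
    print(f"{suite}: {r:.1%} of the gap recovered")

# Footprint difference between Unsloth Q4_K_M and SpectralQuant, in MiB.
size_delta_mib = 507.8 - 415.7
print(f"Unsloth Q4_K_M is {size_delta_mib:.1f} MiB larger")
```

On heldout120 this gives roughly 96.5%. On convergence60 the same formula lands above 100%, because SpectralQuant's loss there is lower than the BF16 reference.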

Note: On convergence60, SpectralQuant slightly undercuts the BF16 reference loss (2.2509 vs 2.2682). A quant beating its own reference is a warning sign, so we're actively analyzing this to untangle genuine behavioral recovery from localized calibration alignment.

Limitations & Transparency

We want to be clear about what this is and isn't.

  1. The claims are strictly bounded to this release table and same-footprint Q4_K_M behavior.
  2. Larger or dynamic quantizations can still win in certain setups. You should always evaluate on your specific workload.
  3. There are no FP-kept modules and no dynamic quant formats here. It's a strict, standard GGUF that you can run today with llama-cli or llama-server.

Hugging Face Repo: https://huggingface.co/Spectral-Labs25/Qwen3.5-0.8B-SpectralQuant-Q4_K_M

A detailed technical blog post breaking down the math and methodology is coming soon. Let us know how it runs for you!

submitted by /u/RevealIndividual7567