r/LocalLLaMA · June 23, 2026 · 1 min read

I mapped the KLD of KV cache quantization for Qwen3.6-35B-A3B and Gemma4-E2B QAT


TL;DR version

q8/q8 is nearly free on both models
q4/q4 is usable on Qwen and catastrophic on Gemma
turbo4 is sometimes slightly better, sometimes slightly worse, than q4_0
turbo3 and turbo2 compress the cache to unprecedented levels, but you'll pay dearly for it in KLD
K is sometimes more sensitive than V, sometimes less, sometimes they're symmetrical
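
For anyone unsure what is being measured: a minimal sketch of the KLD metric, assuming it means the KL divergence between the next-token distribution of a baseline run (unquantized cache) and a run with a quantized KV cache, averaged over token positions. The function names and the toy logits are made up for illustration; this is not the code from the linked repo.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(base_logits, quant_logits):
    """KL(P_base || P_quant) in nats for a single token position."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kld(base_rows, quant_rows):
    """Average the per-position KLD over a sequence of positions."""
    values = [kl_divergence(b, q) for b, q in zip(base_rows, quant_rows)]
    return sum(values) / len(values)


# Hypothetical logits for two positions over a 4-token vocabulary.
baseline = [[2.0, 1.0, 0.1, -1.0], [0.5, 0.4, 3.0, -2.0]]
mild = [[2.0, 1.1, 0.1, -1.0], [0.5, 0.4, 2.9, -2.0]]    # small perturbation
harsh = [[1.0, 2.0, 0.1, -1.0], [3.0, 0.4, 0.5, -2.0]]   # top token flipped

print("mild :", mean_kld(baseline, mild))
print("harsh:", mean_kld(baseline, harsh))
```

A KLD of zero means the quantized run produces exactly the baseline distribution. "Nearly free" corresponds to values close to zero, and "catastrophic" to large ones.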

Full analysis

Nuance, caveats, zoomable plots, and the software to replicate these plots with any model:

https://github.com/crusaderky/pixi-llm-recipes/tree/main/perplexity#readme

submitted by /u/crusaderky