r/LocalLLaMA · 1 min read

I mapped the KLD of KV cache quantization for Qwen3.6-35B-A3B and Gemma4-E2B QAT

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

TL;DR version

  • q8/q8 is nearly free on both models
  • q4/q4 is usable on Qwen and catastrophic on Gemma
  • turbo4 is sometimes slightly better, sometimes slightly worse, than q4_0
  • turbo3 and turbo2 compress the cache to unprecedented levels, but you'll pay dearly for it
  • K is sometimes more sensitive than V, sometimes less, sometimes they're symmetrical
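
For anyone unfamiliar with the metric: KLD (Kullback-Leibler divergence) measures how far the next-token distribution produced with a quantized KV cache drifts from the distribution produced with the unquantized cache, typically averaged over many token positions. Below is a minimal sketch of that calculation in plain Python. It illustrates the metric only and is not the tooling from the linked repo; the function names `log_softmax`, `kl_divergence`, and `mean_kld` are hypothetical, and the inputs are assumed to be raw logits over the same vocabulary.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(base_logits, quant_logits):
    """KL(P_base || P_quant) in nats for a single token position.

    base_logits: logits from the run with the unquantized KV cache.
    quant_logits: logits from the run with the quantized KV cache.
    """
    log_p = log_softmax(base_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kld(base_rows, quant_rows):
    """Average the per-position KLD over a sequence of positions."""
    values = [kl_divergence(b, q) for b, q in zip(base_rows, quant_rows)]
    return sum(values) / len(values)
```

A mean KLD of 0 means the quantized cache leaves the output distribution unchanged. Larger values mean the model's predictions are being distorted more, even in cases where the top token stays the same.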

Full analysis

Nuance, caveats, zoomable plots, and the software to replicate these plots with any model:

https://github.com/crusaderky/pixi-llm-recipes/tree/main/perplexity#readme

submitted by /u/crusaderky