Statistical Inference and Quality Measures of KV Cache Quantisations Inspired by TurboQuant
Mirrored from arXiv — Machine Learning for archival readability. Support the source by reading on the original site.
arXiv:2605.08114v1 Announce Type: new
Abstract: We analyse three KV cache quantization schemes under a fair bit budget: \textbf{KV} (scalar MSE baseline), \textbf{KQV} (WHT + MSE on $K$; WHT + MSE + QJL on $V$), and \textbf{QKQV} (WHT + MSE + QJL on both). Starting from the Beta distribution on the hypersphere, we trace how QJL on $K$ inflates inner product variance by $\pi/2$, which softmax amplifies nonlinearly via Jensen's inequality, and we present statistical inference and information metrics to highlight practical differences.
Three empirical findings emerge. (1)~At $n=4$ (the practically dominant budget), KQV wins on every measure -- KL divergence, geometric $K$ error, and 6D distance -- across all distributions and ranks tested. (2)~The K--V asymmetry is unconditional: QKQV is consistently worse than KQV in KL divergence at every budget and distribution. (3)~A budget-dependent crossover exists: QKQV achieves better geometric $K$ reconstruction at $n \in \{2,3,5\}$, KQV at $n \in \{4,6\}$, invariant to rank and tail weight -- an open rate-distortion problem.
$\mathrm{KL}(p_{\mathrm{ref}} \| p_{\mathrm{quant}})$, K-only by construction, bridges K direction error to routing corruption and output collapse. We present a sufficient condition when the Jensen mechanism amplifies superlinearly through the softmax. At $n \in \{2,3,5\}$, QKQV wins geometrically because this assumption does not bind. At $n=4$, elevated K error and KL divergence for QKQV strongly suggest the Jensen mechanism is the operative cause of the crossover, providing a new perspective and explanation.
More from arXiv — Machine Learning
-
Can AI Draw Science? A Benchmark for Evaluating Scientific Figure Generation by Text-to-Image and Multimodal Models
Jun 30
-
On the Necessity of a Liquid Substrate for Mesh Intelligence
Jun 30
-
Position: RL Researchers Need to Distinguish Between Solving Simulators and Using Simulators as a Proxy
Jun 30
-
Learning to Distributedly Estimate under Partially Known Dynamics: A Covariance-Agnostic Neural Kalman Consensus Filter
Jun 30
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.