OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

https://huggingface.co/Zhongzhu/OSCAR-RotationZoo

OSCAR RotationZoo

Precomputed K/V rotation matrices for OSCAR INT2 KV-cache quantization.

This repository contains the artifacts for the paper "OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization" by Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen, Shuaiwen Leon Song, Ben Athiwaratkun, and Xiaoxia Wu.

OSCAR captures Q/K/V activations on a small calibration set, estimates attention-aware K/V covariance offline, and derives per-layer orthogonal rotations that align INT2 quantization with the directions attention actually consumes. The result is roughly 7× compression of the KV-cache memory footprint with a single-digit percentage-point accuracy drop on GPQA for dense reasoning models.
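The roughly 7× figure, rather than the naive 16 / 2 = 8×, is consistent with per-group quantization metadata. Here is a back-of-the-envelope check in Python. It assumes a 16-bit baseline (as with BF16), a group size of 128 (inferred from the `group128` suffix in the calibration names), and one 16-bit scale plus one 16-bit zero-point per group. The metadata layout is my assumption, not something the README states, and the function name is made up for illustration.

```python
def kv_compression_ratio(bits=2, group_size=128, scale_bits=16,
                         zero_bits=16, baseline_bits=16):
    """Baseline bits per element divided by effective quantized bits per element.

    Assumption: each group of `group_size` elements stores one scale and one
    zero-point; the real OSCAR storage layout may differ.
    """
    effective_bits = bits + (scale_bits + zero_bits) / group_size
    return baseline_bits / effective_bits


ratio = kv_compression_ratio()
print(f"{ratio:.2f}x")  # 16 / 2.25 -> 7.11x
```

With no metadata at all the ratio would be exactly 8×, so the per-group overhead is what pulls it down toward 7×.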

This repo packages the rotations as drop-in .pt files so you don't need to re-run the Q/K/V dump and eigendecomposition yourself.
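To give a feel for the offline step these files let you skip, here is a minimal NumPy sketch. It is not the paper's method: it uses synthetic activations and a plain key covariance instead of OSCAR's attention-aware estimate, and the per-row INT2 quantizer and all names are assumptions for illustration. It only shows two generic facts: the eigenvectors of a covariance matrix form an orthogonal rotation, and rotating both Q and K by the same orthogonal matrix leaves the attention scores unchanged, so the rotation can be applied before quantizing the cache.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, head_dim = 512, 64

# Synthetic stand-ins for captured query/key activations of one head.
Q = rng.standard_normal((n_tokens, head_dim))
K = rng.standard_normal((n_tokens, head_dim)) * np.linspace(0.5, 4.0, head_dim)

# Plain key covariance (a simplification, not the attention-aware estimate).
cov = K.T @ K / n_tokens

# eigh returns orthonormal eigenvectors of a symmetric matrix; use them as R.
_, R = np.linalg.eigh(cov)


def quantize_int2(x):
    """Asymmetric 2-bit quantization with one scale and offset per row."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 3.0  # 2 bits give 4 levels: 0..3
    codes = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo


def dequantize_int2(codes, scale, lo):
    return codes * scale + lo


# Rotate, quantize the rotated keys, and reconstruct them.
Q_rot, K_rot = Q @ R, K @ R
codes, scale, lo = quantize_int2(K_rot)
K_hat = dequantize_int2(codes, scale, lo)

orth_err = np.abs(R.T @ R - np.eye(head_dim)).max()
score_err = np.abs(Q_rot @ K_rot.T - Q @ K.T).max()
ref = Q @ K.T
quant_rel_err = np.linalg.norm(Q_rot @ K_hat.T - ref) / np.linalg.norm(ref)

print(f"orthogonality error:          {orth_err:.2e}")
print(f"score change from rotation:   {score_err:.2e}")
print(f"relative score error at INT2: {quant_rel_err:.3f}")
```

The sketch makes no claim that this particular rotation reduces INT2 error. Choosing a rotation that does is what the precomputed files are for.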

Available rotations

| Model | Calibration | GPQA (BF16) | GPQA (OSCAR INT2) |
| --- | --- | --- | --- |
| Qwen/Qwen3-4B-Thinking-2507 | seq20000_prompt83_group128 | 67.27 | 67.17 |
| Qwen/Qwen3-4B-Thinking-2507 | seq20000_prompt85_group128 (fresh re-dump) | 67.27 | |
| Qwen/Qwen3-8B | seq20000_prompt83_group128 | 56.67 | 55.56 |
| Qwen/Qwen3-32B | seq16000_prompt69_group128 | 58.49 | 60.40 |
| zai-org/GLM-4.7-FP8 | seq10000_prompt43_group128 | 73.23 | 73.57 |

From time to time we get things like this, and I keep updating this thread with them. Hopefully, by the end of this year, I can run medium-size (30-40B) MoE models (and 10-20B dense models) better and faster on 8GB of VRAM.

Would be awesome to have this on llama.cpp.

submitted by /u/pmttyji