OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
https://huggingface.co/Zhongzhu/OSCAR-RotationZoo
OSCAR RotationZoo
Precomputed K/V rotation matrices for OSCAR INT2 KV-cache quantization. This repository contains the artifacts for the paper:
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen, Shuaiwen Leon Song, Ben Athiwaratkun, Xiaoxia Wu
OSCAR captures Q/K/V activations on a small calibration set, estimates attention-aware K/V covariance offline, and derives per-layer orthogonal rotations that align INT2 quantization with the directions attention actually consumes. The result is ~7× compression of the KV-cache memory footprint with a single-digit percentage-point (pp) accuracy drop on GPQA for dense reasoning models. This repo packages the rotations as drop-in ...
Available rotations ...
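To make the idea in the README excerpt a bit more concrete, here is a rough NumPy sketch of "rotate, then quantize to 2 bits". It is not the OSCAR algorithm: it uses the plain sample covariance of synthetic keys instead of the attention-aware covariance the paper describes, and the function names (`covariance_rotation`, `quantize_int2`, `dequantize_int2`) and the per-row min/max quantizer are assumptions made up for illustration. What it does show is why an orthogonal rotation is "free" for attention: applying the same rotation to queries and keys leaves the q·k scores unchanged, so only the quantization step introduces error.

```python
import numpy as np

def covariance_rotation(x):
    """Orthogonal matrix built from the eigenvectors of the sample covariance of x (n, d).
    Assumption: plain covariance, not the attention-aware estimate used by OSCAR."""
    cov = np.cov(x, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)  # symmetric input -> orthonormal eigenvectors
    return eigvecs

def quantize_int2(x):
    """Per-row asymmetric 2-bit quantization: codes in {0, 1, 2, 3} plus scale and offset."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 3.0  # 4 levels -> 3 steps
    codes = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def dequantize_int2(codes, scale, lo):
    """Map 2-bit codes back to floating point."""
    return codes.astype(np.float64) * scale + lo

rng = np.random.default_rng(0)
d = 64
channel_scale = np.linspace(0.1, 3.0, d)  # synthetic, uneven channel variances

# "Offline" step: derive a rotation from calibration keys.
calib_k = rng.normal(size=(512, d)) * channel_scale
R = covariance_rotation(calib_k)

# "Online" step: rotate queries and keys, quantize only the cached keys.
q = rng.normal(size=(4, d))
k = rng.normal(size=(16, d)) * channel_scale

scores_ref = q @ k.T                    # unrotated, unquantized scores
scores_rot = (q @ R) @ (k @ R).T        # same scores in the rotated basis
codes, scale, lo = quantize_int2(k @ R)
k_hat = dequantize_int2(codes, scale, lo)
scores_q = (q @ R) @ k_hat.T            # scores computed from the 2-bit keys

print("max |scores_rot - scores_ref|:", np.abs(scores_rot - scores_ref).max())
print("max |scores_q   - scores_ref|:", np.abs(scores_q - scores_ref).max())
```

The first printed number should sit at floating-point noise level, since the rotation is orthogonal; the second is the error introduced by the 2-bit step in this toy setup and says nothing about the accuracy figures reported for OSCAR.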
From time to time we get stuff like this, and I keep updating this thread with these things. Hopefully, by the end of this year, I can run medium-size (30-40B) MoE models (and 10-20B dense models) better and faster with 8GB of VRAM. It would be awesome to have this in llama.cpp.