r/LocalLLaMA · 1 min read

This is amazing: token speed doubled and the KV cache now needs far less VRAM (Qwen 27B)

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

"Qwen3.6-27B Q4_K_M on a single RTX 3090: native 256K context at 38.6 tok/s with 72 MiB of resident KV, needle recall 88-100% at 6% residency, harness accuracy unchanged (36/36 vs full cache)."

On the same hardware, generation speed doubled and VRAM usage dropped significantly (21 GB to 17.5 GB) while maintaining full context accuracy.
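For anyone who wants the numbers above as plain arithmetic, here is a minimal sketch. The function name `vram_savings` is just illustrative, and the implied baseline speed assumes "doubled" means exactly 2x relative to the quoted 38.6 tok/s; the post does not state the baseline figure.

```python
def vram_savings(before_gb: float, after_gb: float) -> tuple[float, float]:
    """Return (GB saved, percent saved relative to the starting usage)."""
    saved = before_gb - after_gb
    return saved, 100.0 * saved / before_gb


# Figures taken from the post: 21 GB before, 17.5 GB after.
saved_gb, saved_pct = vram_savings(21.0, 17.5)
print(f"VRAM saved: {saved_gb:.1f} GB ({saved_pct:.1f}%)")

# Assumption: "doubled" is read as exactly 2x, so the implied
# baseline is half of the quoted 38.6 tok/s. This is an inference,
# not a measured number from the post.
implied_baseline = 38.6 / 2
print(f"Implied baseline speed: {implied_baseline:.1f} tok/s")
```

So the drop is about 3.5 GB, roughly a sixth of the original footprint, on the same card.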

YouTube video by Fahd: https://youtu.be/8rTVCRWvRDo?si=MYiVrQQltbSsMAOP

GitHub link: https://github.com/Luce-Org/lucebox-hub/tree/main/optimizations/kvflash

Quality loss? Quality verdict (harness ground truth, base-vs-base control included): full results are in RESULTS.md. Outputs are not guaranteed byte-identical to the full cache on long generations (the masked kernel path rounds differently, giving a different deterministic lineage), but correctness is identical: 36/36 vs 36/36 across HumanEval, GSM, MATH, and agent suites.

submitted by /u/9r4n4y