r/LocalLLaMA · 1 min read

Developers who use local AI - Q4_0 vs Q8_0 KV quant?


I'd love to hear from developers who use big context windows whether they notice a difference.

Obviously I would love to cut the KV cache VRAM requirement roughly in half, but I'm worried about quality, especially once we get into 50k+ context territory.
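For anyone who wants back-of-the-envelope numbers, here's a rough estimator sketch. The layer/head counts below are placeholders for illustration (pull the real values from your model's GGUF metadata), and the per-element sizes follow the llama.cpp q8_0/q4_0 block layouts as I understand them:

```python
# Back-of-the-envelope KV cache size for a GQA model.
# Per-element sizes follow the llama.cpp block layouts:
#   f16  -> 2 bytes
#   q8_0 -> 34 bytes per 32-value block (2-byte scale + 32x int8)
#   q4_0 -> 18 bytes per 32-value block (2-byte scale + 16 packed bytes)
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_gib(ctx_tokens, n_layers, n_kv_heads, head_dim, cache_type):
    """Approximate K+V cache size in GiB."""
    elems_per_token = 2 * n_layers * n_kv_heads * head_dim   # K and V
    return ctx_tokens * elems_per_token * BYTES_PER_ELEMENT[cache_type] / 1024**3

# Placeholder dimensions (not a real config - read yours from the GGUF metadata)
for ct in ("f16", "q8_0", "q4_0"):
    print(ct, round(kv_cache_gib(50_000, 48, 8, 128, ct), 1), "GiB")
```

On those made-up dimensions, 50k tokens works out to roughly 9.2 GiB at f16, 4.9 GiB at q8_0, and 2.6 GiB at q4_0, which is where the "cut it in half" appeal comes from.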

I don't really need a full study, just wondering, anecdotally, what people have experienced.

My current setup: Docker stack with a Llama.cpp server at the helm (Vulkan - I pay the AMD tax daily) - 32GB VRAM, using mostly Qwen 3.6 models for development. I go back and forth between the 27B dense and 35B MoE, with a dash of the lil guy (3.5 9B omnicoder variant) for smaller stuff, since it's so zippy and uses a shite-ton less VRAM.
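For reference, this is roughly how the KV cache types get set on the llama.cpp server side - a minimal sketch with placeholder paths and sizes, and flag spellings from memory, so double-check them against your build:

```python
# Minimal sketch of a llama.cpp server launch with quantized KV cache.
# Paths, context size, and layer count are placeholders; flag names are
# the ones I know of, so verify against your llama.cpp version.
import subprocess

cmd = [
    "llama-server",
    "-m", "/models/your-model.gguf",   # placeholder model path
    "-c", "50000",                     # context window
    "--n-gpu-layers", "99",
    "--flash-attn",                    # quantized V cache generally needs flash attention;
                                       # newer builds may expect an explicit on/off/auto value
    "--cache-type-k", "q8_0",          # K cache: f16 / q8_0 / q4_0
    "--cache-type-v", "q8_0",          # V cache: f16 / q8_0 / q4_0
]
subprocess.run(cmd, check=True)
```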

submitted by /u/Jorlen
