r/LocalLLaMA · May 18, 2026 · 1 min read

Configuration for Qwen3.6-35B-A3B (12 GB VRAM)

#model-release #long-context #gpu #inference


Has anyone here tested different KV cache quantizations and compared their performance?

I’m currently running the model as a Q5_K_M quant with a Q4-quantized KV cache on a GPU with 12 GB of VRAM. With this setup, I’m offloading about 27 MoE layers to the CPU and getting around 40 tok/s with a 128k total context window.

I’m trying to see if I can push it a bit further, since I’m using it inside my own AI agent. The model is already pretty smart, but in agentic workflows it’s not always as strong or consistent as I’d like.

I’d be curious to know what KV quantization settings people are using, and how much difference they noticed in speed, memory usage, and output quality.
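For the memory side of that question, here is a rough back-of-the-envelope sketch of how KV cache size scales with the cache type. It assumes ggml-style block formats (32 elements per block plus one fp16 scale), which gives 8.5 bits per element for q8_0 and 4.5 for q4_0. The architecture numbers at the bottom are placeholders, not the real values for this model, so swap in the actual values from the model's config, and count only the layers that really hold a KV cache.

```python
# Bits per cached element, including the per-block fp16 scale, for
# ggml-style block formats (assumed to match the backend in use).
BITS_PER_ELEMENT = {
    "f16": 16.0,
    "q8_0": 8.5,  # 32 int8 values + one fp16 scale = 34 bytes per block
    "q4_0": 4.5,  # 32 4-bit values + one fp16 scale = 18 bytes per block
}


def kv_cache_bytes(n_ctx, n_attn_layers, n_kv_heads, head_dim, cache_type):
    """Estimate KV cache size in bytes for one sequence of n_ctx tokens."""
    # Factor of 2 covers both the K and the V tensors.
    elements = 2 * n_ctx * n_attn_layers * n_kv_heads * head_dim
    return elements * BITS_PER_ELEMENT[cache_type] / 8


if __name__ == "__main__":
    # Placeholder architecture values (hypothetical, NOT taken from the
    # actual model); replace them with the numbers from the model config.
    N_CTX = 128 * 1024
    N_ATTN_LAYERS = 40
    N_KV_HEADS = 4
    HEAD_DIM = 128

    for cache_type in BITS_PER_ELEMENT:
        size = kv_cache_bytes(N_CTX, N_ATTN_LAYERS, N_KV_HEADS, HEAD_DIM, cache_type)
        print(f"{cache_type}: {size / 2**30:.2f} GiB")
```

This only covers memory. Speed and output quality at each cache type still have to be measured on the actual workload.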

Also, would you recommend trying a different model quantization than Q5_K_M for this setup? For example, would Q4_K_M, Q6_K, or another quant be a better trade-off for speed, VRAM usage, and reasoning quality?

submitted by /u/HomoAgens1