7900XTX 24GB vram, can finally fit Q6K+MTP with Qwen 3.6 27B at 131k context
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
OS: CachyOS
Instructions:
Connect your monitor directly to the iGPU so that the dGPU's VRAM is 100% free when you boot Linux. By default, driving the display from the dGPU costs about 700 MB to 1.2 GB of VRAM that could otherwise hold context. Yes, you can still game normally with this setup.
Set the KV cache to q5_0/q4_0 (make sure to compile with GGML_CUDA_FA_ALL_QUANTS).
Yes, Q5_0/Q4_0 is about 1.6% less precise than Q8, in exchange for about 12% less VRAM usage, as shown here (Qwen does an amazing job with KV cache quantization):
https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context
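To make the cache-type trade-off more concrete, here is a minimal sketch of the storage arithmetic. It uses the byte sizes of ggml's 32-element quantization blocks and assumes that K and V hold the same number of elements. The function names are made up for illustration and are not part of llama.cpp.

```python
# Bytes per 32-element block in ggml's cache formats.
# Quantized blocks carry a 2-byte fp16 scale plus the packed values.
BLOCK_ELEMS = 32
BLOCK_BYTES = {
    "f16": 64,   # 32 values * 2 bytes, no scale
    "q8_0": 34,  # 2 (scale) + 32 (int8 values)
    "q5_0": 22,  # 2 (scale) + 4 (high bits) + 16 (low nibbles)
    "q4_0": 18,  # 2 (scale) + 16 (nibbles)
}


def bits_per_element(cache_type: str) -> float:
    """Average storage cost of one cached value, in bits."""
    return BLOCK_BYTES[cache_type] * 8 / BLOCK_ELEMS


def cache_ratio(k_type: str, v_type: str,
                ref_k: str = "q8_0", ref_v: str = "q8_0") -> float:
    """Size of a (k_type, v_type) cache relative to a reference cache.

    Assumption: K and V store the same number of elements.
    """
    size = BLOCK_BYTES[k_type] + BLOCK_BYTES[v_type]
    ref = BLOCK_BYTES[ref_k] + BLOCK_BYTES[ref_v]
    return size / ref


if __name__ == "__main__":
    for name in BLOCK_BYTES:
        print(f"{name}: {bits_per_element(name):.1f} bits/element")
    ratio = cache_ratio("q4_0", "q5_0")
    print(f"K=q4_0, V=q5_0 is {ratio:.2f}x the size of a q8_0/q8_0 cache")
```

Under these assumptions the cache itself comes out roughly 41% smaller than with q8_0 for both K and V. How much that changes total VRAM usage depends on how large the cache is next to the model weights, which this sketch does not model.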
Now I can run the Qwen 3.6 27B Unsloth Q6K model (~22 GB) with 131k context at 55~60 t/s.
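A rough way to see why the iGPU trick matters is to work out the headroom from the numbers in this post: 24 GB of VRAM, a ~22 GB model, and 0.7 to 1.2 GB lost when the dGPU drives the display. This is only a back-of-the-envelope sketch. It treats 1 GB as 10^9 bytes, ignores compute buffers, and the helper names are hypothetical.

```python
def headroom_gb(total_vram_gb: float, model_gb: float,
                display_gb: float = 0.0) -> float:
    """VRAM left for KV cache and compute buffers after weights and display use."""
    return total_vram_gb - model_gb - display_gb


def budget_per_token_kb(headroom: float, n_ctx: int) -> float:
    """Upper bound on per-token cache cost in KB (1 GB = 1e6 KB),
    assuming the entire headroom went to the KV cache."""
    return headroom * 1e6 / n_ctx


if __name__ == "__main__":
    TOTAL, MODEL, N_CTX = 24.0, 22.0, 131_000  # figures quoted in the post
    # 0.0 = monitor on the iGPU; 0.7 and 1.2 = the post's dGPU overhead range
    for display in (0.0, 0.7, 1.2):
        free = headroom_gb(TOTAL, MODEL, display)
        print(f"display={display:.1f} GB -> {free:.1f} GB free, "
              f"<= {budget_per_token_kb(free, N_CTX):.1f} KB per token")
```

With about 2 GB left after the weights, a 0.7 to 1.2 GB display overhead would take 35% to 60% of the space available for context, which is why moving the monitor to the iGPU makes a difference at this model size.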
Add these arguments when compiling (I got the BLAS changes from someone who said they helped him reduce VRAM usage, and well...):
-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_CUDA_FA_ALL_QUANTS=true
You can then just pass the llama.cpp arguments:
-ctv q5_0 -ctk q4_0 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --presence-penalty 0.0 --repeat-penalty 1.0 -c 131000 --jinja --mlock --parallel 1 --no-mmproj