Long-context performance at lower quants
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I've been using Qwen3.5 122B A10B (Q3_K_XL) a lot lately for coding, and it's been pretty incredible overall like it feels not far off from frontier-level for most tasks -- but I've been noticing that usually once I hit around 75-80k context use, it starts to get dumb all of a sudden.
It just hits a brick wall and quality deteriorates rapidly and drastically. It'll begin hallucinating, forgetting things, or think something it said/suggested was actually something that I said.
I found I have to compact before I get to that point, and then it keeps going on just fine.
Is this because I'm running Q3? Unfortunately Q4 is just outside of the capability of my system specs unless I want to start disk swapping.
So is it just an issue with this particular model? Or because it's Q3? Are there llama.cpp settings that can help?
I'm already using BF16 KV cache.
EDIT to add the snippet of my model config file for this one:
[*] flash-attn = on n = 8192 t = 8 tb = 8 cpu-range = 0-7 cpu-strict = 1 cpu-range-batch = 0-15 cpu-strict-batch = 1 jinja = on reasoning-budget = 4096 reasoning-budget-message = " -- Reasoning budget exceeded, proceed to final answer." [Qwen3.5-122B-A10B-UD-Q3_K_XL] model = G:\models\Qwen3.6-122B-A10B\UD-Q3_K_XL\Qwen3.5-122B-A10B-UD-Q3_K_XL-00001-of-00003.gguf ctx-size = 131072 cache-type-k = bf16 cache-type-v = bf16 presence-penalty = 1.1 repeat-penalty = 1.05 repeat-last-n = 512 temp = 0.1 top-p = 0.95 top-k = 20 min-p = 0.00 [link] [comments]
More from r/LocalLLaMA
-
Stop traumatizing AI into loops and turn hallucinations into an honest "I don't know!" by being NICE to them (Proof of Concept, Research, I don't want to sell anything)
May 27
-
Cactus Hybrid Router: Gemma4-2B can match Gemini-3.1-Flash-Lite by routing 15-55% of tasks to Gemini And Running The Rest Locally.
May 26
-
Small comparison on full compute performance (Anima) of 5090 (600,475 and 400W) vs 6000 PRO MaxQ (325W), and 6000 PRO WS/SE (600W).
May 26
-
$400 Qwen 3.6-27B Setup - Dual RTX 3060 - 30-50 t/s
May 26
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.