r/LocalLLaMA

Long-context performance at lower quants


I've been using Qwen3.5 122B A10B (Q3_K_XL) a lot lately for coding, and it's been pretty incredible overall. It feels not far off from frontier-level for most tasks. But I've been noticing that once I hit around 75-80k tokens of context, it suddenly starts to get dumb.

It just hits a brick wall, and quality deteriorates rapidly and drastically. It'll begin hallucinating, forgetting things, or thinking that something it said or suggested was actually something I said.

I found I have to compact before I get to that point, and then it keeps going on just fine.
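
For anyone wanting to automate that workaround, here is a minimal sketch of a "compact before the wall" check. The function name, the `next_turn_estimate` parameter, and the 70k soft limit are all assumptions picked for illustration (a bit under the 75-80k range where I see the drop-off); none of this is a llama.cpp feature.

```python
# Hypothetical helper: the name and the 70k default are illustrative only.
def should_compact(used_tokens: int,
                   next_turn_estimate: int = 0,
                   soft_limit: int = 70_000) -> bool:
    """Return True when the conversation should be compacted before the next turn.

    used_tokens:        tokens currently in the context window
    next_turn_estimate: rough size of the upcoming prompt plus reply
    soft_limit:         assumed safe ceiling, set below the observed drop-off
    """
    return used_tokens + next_turn_estimate >= soft_limit


if __name__ == "__main__":
    print(should_compact(60_000))          # still under the soft limit
    print(should_compact(68_000, 4_096))   # next turn would cross it
```

The soft limit would need tuning per model and quant, since the drop-off point is just what I've observed with this one setup.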

Is this because I'm running Q3? Unfortunately Q4 is just outside of the capability of my system specs unless I want to start disk swapping.

So is it just an issue with this particular model? Or because it's Q3? Are there llama.cpp settings that can help?

I'm already using BF16 KV cache.

EDIT to add the snippet of my model config file for this one:

    [*]
    flash-attn = on
    n = 8192
    t = 8
    tb = 8
    cpu-range = 0-7
    cpu-strict = 1
    cpu-range-batch = 0-15
    cpu-strict-batch = 1
    jinja = on
    reasoning-budget = 4096
    reasoning-budget-message = " -- Reasoning budget exceeded, proceed to final answer."

    [Qwen3.5-122B-A10B-UD-Q3_K_XL]
    model = G:\models\Qwen3.6-122B-A10B\UD-Q3_K_XL\Qwen3.5-122B-A10B-UD-Q3_K_XL-00001-of-00003.gguf
    ctx-size = 131072
    cache-type-k = bf16
    cache-type-v = bf16
    presence-penalty = 1.1
    repeat-penalty = 1.05
    repeat-last-n = 512
    temp = 0.1
    top-p = 0.95
    top-k = 20
    min-p = 0.00

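
To put the 75-80k figure in perspective against the configured window, here is a rough sketch that reads `ctx-size` from a trimmed copy of the config and reports what fraction of the window is in use when quality drops. It treats the file as generic INI via Python's `configparser`, which is an assumption about the format and not how llama.cpp itself parses it; `degradation_fraction` is a made-up helper name.

```python
import configparser

# Trimmed copy of the config from the post; only the keys needed here.
PRESET = """
[*]
flash-attn = on

[Qwen3.5-122B-A10B-UD-Q3_K_XL]
ctx-size = 131072
cache-type-k = bf16
cache-type-v = bf16
"""

SECTION = "Qwen3.5-122B-A10B-UD-Q3_K_XL"


def degradation_fraction(preset_text: str, section: str, observed_tokens: int) -> float:
    """Fraction of the configured context window in use at observed_tokens."""
    # interpolation=None so a literal '%' in any value would not be misread.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(preset_text)
    ctx_size = parser.getint(section, "ctx-size")
    return observed_tokens / ctx_size


if __name__ == "__main__":
    for tokens in (75_000, 80_000):
        frac = degradation_fraction(PRESET, SECTION, tokens)
        print(f"{tokens} tokens = {frac:.0%} of the configured window")
```

So the wall shows up at roughly 57-61% of the 131072-token window, well before the configured limit.
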
submitted by /u/_TheWolfOfWalmart_