
You guys were right - Qwen 3.6 35B IS good...and KV Cache DOES matter.

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

WARNING: I'm speed typing this, no time to organize/format, so if short paragraph chunks bother you, just keep it moving.

When Qwen 3.6 35B dropped, a lot of people were heaping praise on it and I thought they were just glazing it because of the speed. On 3.5, the 27B was objectively smarter than the 35B.

So when I got around to using the 27B version (unsloth's Q5KXL UD @ KV Q8/8), it became my daily driver without a second thought. No loops, solid speeds. And I've been mostly fine. Until the past two days.

I never gave the 35B a chance because speed (at the time) wasn't that important to me and, again, the 27B is known to be smarter. But after wasting 2 days trying to debug subgraphs in Rivet and blowing HOURS of time constantly dropping quants due to context overflow and watching the model's intelligence get lobotomized, I remembered reading a post recently where someone did a test comparing the IQ4NXLs (MTP + standard) against the Q4KXL, Q5 and others.

So, I gave Qwen 3.6 35B IQ4NXL a shot, no kv cache compression since vram wasn't as much an issue, and it nearly one-shotted the solution. I've since run a few more tests with it and for a minute I've just been confused - like why is the 35 better? So, I figured it must be a) Qwens are still really good at lower quants, and more importantly b) kv cache REALLY MATTERS.

The 35B still creeps when it hits high context, even worse than the 27B it seems, and the only way I can do my end session routines is to switch to the Q4KXL at KV Q4/4, but then it's a risk that it'll forget a routine or miss details in the session summary. Also, I haven't spent a lot of time learning the 35Bs, so I need some time to feel them out and figure out what works best.

Anyway, the point is - the IQ4NXL w/unquanted KV cache outperformed the 27B Q5KXL at KV Q8/8, to say nothing of the 27B Q4 at KV Q4/4. I always thought KV cache quantization didn't matter much because of different comments and AI saying it's only a slight decrease in intelligence. But when it comes to agentic work, it clearly makes a difference and can save you HOURS of time.
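For anyone wondering why quantizing the KV cache is so tempting on a 24 GB card in the first place, here's a rough back-of-the-envelope sketch. The model shape below (layers, KV heads, head dim, context length) is made up for illustration and is NOT the real Qwen config, and it assumes plain full attention in every layer, so treat the numbers as ballpark only. The bits-per-element values assume the usual block layout of 32 values plus one fp16 scale for q8_0 and q4_0.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_elem):
    """Rough KV cache size: K and V each hold one vector per layer, head, and token."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K plus V
    return elems * bits_per_elem / 8


# Hypothetical model shape, chosen only to make the arithmetic concrete.
LAYERS, KV_HEADS, HEAD_DIM, CTX = 48, 8, 128, 65536

# Assumed effective bits per element, including the per-block fp16 scale.
CACHE_TYPES = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

for label, bits in CACHE_TYPES.items():
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CTX, bits) / 2**30
    print(f"{label:>5}: {gib:.2f} GiB")
```

With that made-up shape, the cache alone is 12 GiB at f16 and a bit under 3.5 GiB at q4_0, which is the VRAM you're buying back when you drop to Q4/4. The catch, as above, is what it can cost you in accuracy on long agentic sessions.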

And...it's fast. So yeah, I'm using 35B a lot more now - at least for this particular project. I still love the 27B and there's other stuff that I'd prefer even the quanted 27B to do over the 35B. And to be fair to the 27B, I haven't tried it w/no KV cache compression because I need speed, but I'm going to assume it'll probably have a leap in intelligence unquanted as well. But for now, I've got a lot of work to do, time is of the essence, and I've only got an RTX 3090 Ti.

Side note: I've been using LM Studio since I started using LLMs a couple of years ago, but it currently has a bug where it won't overflow or compact context. That slows everything down: I have to start new sessions, have my agent re-read all the notes, eat all that context, summarize at the end when context is full again, rinse, repeat. So I've moved over to llama.cpp.

I hesitated on llama.cpp because I didn't feel like learning a new tool (adding to my ever-growing-and-already-too-large list of apps), but since I've gone agentic, I just had my agent compile it and it works fine, so yeah. Just let the agent do it. 😄

submitted by /u/GrungeWerX