r/LocalLLaMA

I just realized how good MoE models are for consumer hardware

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I've been tinkering with LLMs for a while now. I started with LM Studio, like probably all of us, and wanted to move to a headless self-hosted setup so that I can work from my MacBook and still use my AI models.

I've been using Qwen 3.6 (and 3.5) 27B on my main computer, which has a Ryzen 7 3800X, a 7900XT, and 32 GB of RAM, and that thing was pretty sloooooow even with MTP enabled.

You can probably call this a skill issue, as I'm not yet familiar with llama.cpp's forest of arguments, even though I read the documentation whenever I'm confused about something.

And this morning I just had the urge to break everything I'd done so far: I tried a new GGUF that isn't from Unsloth, got the 35BA3B, and moved all the expert tensors of the model to the "cpu" (they actually end up in system RAM, but whatever). I'm a bit sad that my GPU VRAM is so empty now, BUT that thing is ripping fast.
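For anyone wondering what "moving the experts to the cpu" amounts to, here is a minimal sketch of the idea, not the exact setup used above. llama.cpp has a tensor override option (`--override-tensor` / `-ot`) that maps tensor-name patterns to a device. The regex and the tensor names below are assumptions for illustration only, so check `llama-server --help` and the tensor list of your own GGUF before copying anything.

```python
import re

# Assumed pattern: matches tensor names that look like MoE expert weights.
# This is an illustrative guess at a typical naming scheme, not a verified
# setting for any particular model file.
EXPERT_PATTERN = re.compile(r"\.ffn_.*_exps\.")


def place_tensor(name: str) -> str:
    """Return "CPU" for expert tensors and "GPU" for everything else."""
    return "CPU" if EXPERT_PATTERN.search(name) else "GPU"


def split_placement(names):
    """Group tensor names by the device this sketch would assign them to."""
    placement = {"CPU": [], "GPU": []}
    for name in names:
        placement[place_tensor(name)].append(name)
    return placement


# Hypothetical tensor names for a single MoE block.
EXAMPLE_TENSORS = [
    "blk.0.attn_q.weight",
    "blk.0.attn_output.weight",
    "blk.0.ffn_gate_inp.weight",   # router, stays on the GPU in this sketch
    "blk.0.ffn_gate_exps.weight",
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_down_exps.weight",
]

if __name__ == "__main__":
    for device, names in split_placement(EXAMPLE_TENSORS).items():
        print(device, names)
```

The point of the split is that attention and router weights are used for every token, so they stay in VRAM, while the large expert weights sit in system RAM and only a few of them are read per token.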

The difference between 27B and 35BA3B is kind of mind-blowing, and I think the speed gain alone might make it the more efficient choice on the productivity side.
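A rough way to see where the gap comes from is to compare how many weight bytes have to be read per generated token. This is a back-of-the-envelope sketch under loud assumptions: "35BA3B" is read as roughly 35B total parameters with about 3B active per token, the quantization is assumed to average 4.5 bits per weight, and generation is assumed to be purely memory-bandwidth bound. It ignores the KV cache, compute, and which device each tensor lives on, so treat the result as an upper bound, not a benchmark.

```python
def bytes_per_token(active_params_billion: float, bits_per_weight: float) -> float:
    """Weight bytes that must be read to generate one token."""
    return active_params_billion * 1e9 * bits_per_weight / 8


def tokens_per_second_ceiling(
    active_params_billion: float,
    bits_per_weight: float,
    bandwidth_gb_s: float,
) -> float:
    """Upper bound on tokens/s if weight reads were the only cost."""
    per_token = bytes_per_token(active_params_billion, bits_per_weight)
    return bandwidth_gb_s * 1e9 / per_token


# Assumed values, chosen only for illustration.
BITS_PER_WEIGHT = 4.5      # hypothetical average for a 4-bit style quant
DENSE_ACTIVE_B = 27.0      # dense model: every parameter is read per token
MOE_ACTIVE_B = 3.0         # MoE: only the active parameters are read per token

if __name__ == "__main__":
    dense = bytes_per_token(DENSE_ACTIVE_B, BITS_PER_WEIGHT)
    moe = bytes_per_token(MOE_ACTIVE_B, BITS_PER_WEIGHT)
    print(f"dense bytes/token: {dense / 1e9:.2f} GB")
    print(f"moe bytes/token:   {moe / 1e9:.2f} GB")
    print(f"ratio:             {dense / moe:.1f}x")
```

Under these assumptions the dense model reads about nine times more weight data per token, which is why the MoE can stay fast even when most of its weights sit in slower system RAM.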

Before, I had time for a coffee while 27B worked on a task; now it is just a short pause between iterations with 35BA3B. So even though there was a ton of hype (justified, for sure) for 27B, give 35BA3B a shot, especially if you are VRAM-limited and have a decent amount of RAM.

Please give me some tips on what I could try to optimise both 27B and 35BA3B, as I'm still a beginner in this area and just want to learn more.

submitted by /u/ego100trique