r/LocalLLaMA · · 3 min read

Qwen3.6 35B-A3B on a Laptop: My Zero to One Moment

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Hi everyone, I'm new here - because I only have a laptop and I only just realized local models are actually good enough now. So I'd like to share my experience, in case it helps others, and also to learn from the more experienced people here.

This is the first model that works for me on my ASUS Zenbook Pro 14 (RTX 4060 8GB VRAM, 64GB RAM):

  • fast enough: ~27TPS generation speed at 32k context, or ~18TPS at 256k context
  • smart enough: it can read and write files, use skills, execute CLI commands, use git, follow instructions, and act as a useful thinking partner.

Why it's important to me

For me this is important because it's where I unconsciously decided to draw the line - that I didn't want to share private information or more personal thoughts with cloud models (even TEE ones). I know I can still get hacked and my data leaked, but for me that's different than giving it up from the first prompt.

So for the first time, I now have this fully local, second brain. For me, it's a game changer.

I still use cloud models for public stuff

I'm still using cloud models for public projects, but for brainstorming and simple personal projects, local is now good enough for me. I'm also now looking into a more powerful desktop machine where maybe I can do some more serious coding. I have had a taste and I want more 😄

Now whenever I see Claude's black box "✽ Envisioning… (41s · ↓ 2.9k tokens · thinking some more with high effort)" it's so frustrating. I have no idea if it's going in the right direction. (whether this is an "efficient" way to do things is another story)

My issues so far with Qwen3.6

Qwen3.6 35B A3B is not perfect, here are some minor issues I observed, which I can work around:

  • It makes some mistakes, but normally recovers on its own.
  • Very occasionally it does get stuck in a loop. It does need some human monitoring, which is fine for me.
  • It sometimes doesn't read a skill in full or make the best decision even when it can fit it in context. It seems to sometimes be "lazy".
  • It is very non-deterministic. I didn't do any tweaks here though (because normally it ends up with the result I need).

I guess some of these could be improved if I used a larger quantization.

My setup

For inference I use llama.cpp, with unsloth's Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf.

For my harness, I use Pi with pi-llama-cpp extension. The harness runs in multipass and connects to the host running llama.cpp. I've also connected it to my phone through an E2EE Matrix chat (a custom one I built off of pi-messenger-bridge) - although it means I have to keep my laptop on all the time, which is annoying. Another reason for buying another machine which I'm more comfortable to run 24/7.

llama.cpp flags for 256k context(18tps):

./build/bin/llama-server -m Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf -ngl 24 -np 1 -fa on -ctk q4_0 -ctv q4_0 -c 262144 --host 0.0.0.0 --port 8088 -ncmoe 32 --no-mmap --jinja

llama.cpp flags for the 32k context (27tps):

./build/bin/llama-server -m Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf -ngl 99 -np 1 -fa on -ctk q4_0 -ctv q4_0 -c 32000 --host 0.0.0.0 --port 8088 -ncmoe 32 --no-mmap --jinja

What was your Zero to One moment?

submitted by /u/rolznz
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA