Running Qwen3.6-35B-A3B on a laptop RTX 4060 (8GB) — what worked, what didn't, and a surprising speculative-decoding result
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
TL;DR: I spent a long session tuning a 35B MoE on a tiny 8GB laptop GPU. Three things mattered a lot (--no-mmap, VRAM headroom, closing CPU-hungry apps). Several "obvious" optimizations did nothing because of this model's hybrid architecture (TurboQuant, Flash Attention, even i-quants made it worse). And speculative decoding gave me +26%, which contradicts the community benchmarks that found it net-negative. Looking for discussion + ideas.
The setup
- GPU: RTX 4060 Laptop, 8GB VRAM
- CPU/RAM: i7-13620H, 32GB DDR5-5600 dual-channel
- OS: Windows 11 (llama.cpp b9484, CUDA build)
- Model: Qwen3.6-35B-A3B (MoE, 35B total / ~3B active), Q4_K_M (~20GB)
- Key detail: this model is a hybrid — only 10 attention layers + 40 Gated Delta Net (recurrent) layers. That one fact explains most of my results.
Final config (the "default" profile)
-ngl 999 --n-cpu-moe 34 -c 65536 --parallel 1 --no-mmap
--cache-type-k q4_0 --cache-type-v q4_0
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --presence-penalty 1.5
-md Qwen3.5-0.8B-Q4_K_M.gguf -ngld 99 --reasoning off
All dense layers (attention/router/norms) on GPU, experts on CPU. ~39 tok/s gen on a good day, ~5.4GB VRAM, ~2.5GB headroom.
What actually helped
--no-mmap is a big deal when experts are offloaded to CPU. With mmap, every token caused page faults on the expert tensors. Preloading them into RAM jumped generation speed dramatically (I measured ~11 → ~43 tok/s on an idle system). llama.cpp even prints a hint suggesting it when CPU tensor overrides are used.
VRAM headroom is critical on Windows. The NVIDIA driver's "System Memory Fallback" spills to system RAM instead of OOMing when VRAM is nearly full. With only ~740MB free, speed collapsed to ~7 tok/s. Keeping ≥1.5GB free fixed it. Counterintuitively, putting fewer experts on the GPU (higher --n-cpu-moe) was sometimes faster because it avoided the fallback.
The real bottleneck is the CPU, not the GPU. Experts run on CPU. Closing Discord + heavy browser tabs took me from ~6 to ~18 tok/s. GPU was at 59°C, never thermally throttling.
What I tested and rejected
TurboQuant KV quant (turbo3/turbo4, via a fork): works, loads fine, but gave ~0 benefit. Reason: this model's KV cache for 64K context is only ~295 MiB (10 attention layers!). Compressing 295MB is pointless when 7GB of experts fill the VRAM.
Flash Attention: no help (same reason — almost no attention layers to accelerate). Actually slightly slower.
IQ4_XS instead of Q4_K_M: ~35% slower (4.1 vs 6.3 tok/s same conditions). i-quants have expensive lookup-table decode that's slow on CPU; K-quants have optimized CPU kernels (REPACK=1). For CPU-offloaded experts, K-quant > i-quant even though the file is smaller.
--mlock: causes CUDA error: out of memory when combined with --no-mmap (pinned host allocation), and needs a special privilege on Windows anyway.
The surprising one: speculative decoding
Community benchmarks (incl. a dedicated RTX 3090 repo) found spec-decode net-negative on Qwen3.6-35B-A3B. On my setup it gave +26% (31 → 39 tok/s) using a vocab-matched Qwen3.5-0.8B draft.
My theory: with experts on CPU, generation is CPU-bound, and validating N draft tokens in one batched forward pass amortizes the expert compute better than N single-token passes. On a full-GPU 3090 the base model is already fast per token, so the draft overhead dominates. Has anyone else seen spec-decode help specifically in the CPU-offloaded-experts regime?
Bonus Windows gotchas
Smart App Control silently blocked the Open WebUI desktop app's unsigned DLLs (win32job.pyd). Moved Open WebUI into WSL2 instead.
From WSL the Windows-host server IP changes on reboot — fixed with WSL mirrored networking so localhost:8081 is stable.
Open questions for the group
Anyone else seeing spec-decode win on CPU-offloaded MoE (vs net-negative on full-GPU)?
For hybrid attention/recurrent models (Gated Delta Net), KV-cache optimizations seem irrelevant — what does move the needle?
Best way to disable thinking AND use a draft together? --chat-template-kwargs enable_thinking:false and --reasoning-budget 0 both throw "invalid argument" when a draft is loaded (applied to the draft's template too). Only --reasoning off works.
Any better draft model choice than Qwen3.5-0.8B for this target?
Happy to share more numbers / configs. Roast my setup.
[link] [comments]
More from r/LocalLLaMA
-
Github Copilot finally supporting custom endpoints
Jun 6
-
OpenLumara - A different kind of AI agent, written from scratch, not vibecoded. Extremely token-efficient, super small system prompt, made for local models. Everything is modular.
Jun 5
-
Gemma 4 QAT benchmark results (AMD 7900 XTX): faster, less VRAM, no quality loss
Jun 5
-
dots.tts 2B🎙️ SOTA TTS from RedNote
Jun 5
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.