Qwen3.6-35B-A3B Q4 with 262K context on an 8GB 3070 Ti = 30+ tps
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
...and on 8GB of VRAM I can even push the context to 320K, 400K, 512K, and yes, 1M. It does start to slow down noticeably beyond 150K, though, so I'd only do this when I really need the larger context. This is using the APEX-I-Quality or Q4_K_XL quants; both are better than Q4_K_M (IQ4_NL_XL for beyond 512K context). I have a total of 32GB of DDR4-2666, which is only slightly above the minimum DDR4 speed.

I see a lot of users with better GPUs and more VRAM who seem to be getting less efficiency and have to drop context all the way to 64K or below to run at good tps. I don't understand why. But here are two things I've learned from my tweaking so far.

First, since 35B-A3B is an MoE model, it only needs ~3.5B parameters to be in VRAM during runtime. 8GB is enough to hold the active model layers (~3GB) + GPU buffers (~2GB) + a 262144-token KV cache at q8_0 (2.56GB). That adds up to about 7.56GB. It's a tight fit, but it works. Messing with the engine's parameters, like forcing all layers onto VRAM, or with other runtime parameters like sm, fa, etc., seems to actually slow down the model for me and/or exhaust my VRAM and system RAM. Look at this screenshot (in the original post) for example: there's a misunderstanding that an MoE model must fit in its entirety in VRAM to run optimally.

Second, just like Windows 11 sucks for gaming, all that "enhanced experience" also has an impact on LLM inference. Running a compact Linux from the terminal (I chose Ubuntu Server) uses only about 800MB of system RAM and practically no VRAM, compared to Windows 11, and it gives me a +25% boost to tps!

Here are some numbers for the same llama.cpp parameters:

On Windows
On Ubuntu Server (fresh dual-boot install 2 days ago, on a 160GB partition of my fastest NVMe)
So far it's been good enough. But I have an older, small GPU I can connect and use for the operating system while keeping the 3070 Ti entirely dedicated to the LLM.

--------------------

Both profiles are coding-focused and should work under Windows 11 too, but with a lot less memory left.

Main profile with 256K context:

and with 512K context:

I hope someone finds this helpful. I love this community and I'm in the Qwen3.7-35B-A3B waiting room with the rest, eating my nails in anticipation lol
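
As a quick sanity check on the VRAM budget described in the post, here is a minimal Python sketch that adds up the three quoted figures (active layers, GPU buffers, q8_0 KV cache) against the card's 8GB. The numbers are the post's own approximations. The variable names are made up for illustration, and the per-token KV figure assumes that "GB" means GiB and that the cache grows linearly with context, neither of which the post confirms.

```python
# Approximate figures quoted in the post, in GB.
VRAM_TOTAL_GB = 8.0
CONTEXT_TOKENS = 262144

budget_gb = {
    "active model layers": 3.0,
    "GPU buffers": 2.0,
    "KV cache (q8_0)": 2.56,
}

# Total VRAM in use and what is left over on an 8GB card.
used_gb = sum(budget_gb.values())
headroom_gb = VRAM_TOTAL_GB - used_gb

# Assumption: "GB" is read as GiB (1024**3 bytes) and the KV cache
# scales linearly with the number of tokens in context.
kv_bytes_per_token = budget_gb["KV cache (q8_0)"] * 1024**3 / CONTEXT_TOKENS

for name, gb in budget_gb.items():
    print(f"{name:22s} {gb:5.2f} GB")
print(f"{'total used':22s} {used_gb:5.2f} GB")
print(f"{'headroom':22s} {headroom_gb:5.2f} GB")
print(f"KV cache per token: about {kv_bytes_per_token / 1024:.1f} KiB")
```

By these figures the budget uses about 7.56GB and leaves roughly 0.44GB free, which is why anything else that takes VRAM (a desktop environment, for example) matters on this card.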