r/LocalLLaMA · May 22, 2026 · 3 min read

Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

..and on 8GB VRAM I can even push the context to 320K, 400K, 512K, and yes.. 1M. But it does start to slow down noticeably beyond 150k so I'd only do this if I ever really want the larger context.

This is using the APEX-I-Quality or Q4_K_XL quants, both of which are better than Q4_K_M (IQ4_NL_XL for contexts beyond 512k).

I have a total of 32GB of DDR4-2666, which is only slightly above the minimum DDR4 speed.

I see a lot of users with better GPUs and more VRAM who seem to be getting less efficiency and have to drop context all the way to 64k or below to run at good tps, and I don't understand why. But here are two things I've learned from my tweaking so far.

First, since 35B-A3B is an MoE model, it only needs ~3.5B parameters to be in VRAM at runtime.

8GB is enough to hold the active model layers (~3GB) + GPU buffers (~2GB) + the KV cache for a 262144-token context at q8_0 (2.56GB). At roughly 7.56GB in total it's a tight fit, but it works.
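As a rough sanity check on that budget, here is a minimal Python sketch that adds up the three figures quoted above and scales the KV cache to other context sizes. The component sizes are the numbers from this post; the linear scaling of the q8_0 KV cache with context length is an assumption for illustration, not something measured here, and the function names are made up for this example.

```python
# Figures quoted in the post (GB). The split is the author's estimate.
VRAM_GB = 8.0
ACTIVE_LAYERS_GB = 3.0
GPU_BUFFERS_GB = 2.0
KV_Q8_AT_BASE_GB = 2.56
BASE_CTX = 262_144


def kv_cache_gb(ctx_tokens: int) -> float:
    """Assumption: q8_0 KV cache size grows linearly with context length."""
    return KV_Q8_AT_BASE_GB * ctx_tokens / BASE_CTX


def vram_total_gb(ctx_tokens: int) -> float:
    """Sum of the three components from the post for a given context size."""
    return ACTIVE_LAYERS_GB + GPU_BUFFERS_GB + kv_cache_gb(ctx_tokens)


for ctx in (131_072, 262_144, 524_288):
    total = vram_total_gb(ctx)
    verdict = "fits" if total <= VRAM_GB else "does not fit"
    print(f"{ctx:>7} tokens: {total:.2f} GB of {VRAM_GB:.0f} GB -> {verdict}")
```

Under that assumption a 512k context at q8_0 would need about 10GB, which would explain why going past 262144 means switching to a smaller KV cache type.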

Messing with the engine's parameters, like forcing all layers onto VRAM, or with other runtime parameters like sm, fa, etc., seems to actually slow down the model for me and/or exhaust my VRAM and system RAM.

Look at this screenshot for example: there's a common misunderstanding that an MoE model must fit in VRAM in its entirety to run optimally.

https://preview.redd.it/cpc4r9q7cr2h1.png?width=1197&format=png&auto=webp&s=89bd03a4537825b862472009225a7a99b7fbd8b4

Second, just like Windows 11 sucks for gaming, all that "enhanced experience" also has an impact on LLM inference. Running a minimal Linux from the terminal (I chose Ubuntu Server) uses only about 800MB of system RAM and practically no VRAM, and compared to Windows 11 it gives me a +25% boost to tps!

Here are some numbers for the same llama.cpp parameters:

On Windows

Inference is <27 tps and drops quickly beyond 100k. In fact, it starts dropping within the first few thousand output tokens.
System memory is 28GB+ full, and if I mess with other parameters in llama.cpp it fills up immediately (~31GB), dragging tps down with it.
The highest context I was able to run stably is 512k with turbo4 quant for KV.

On Ubuntu Server (fresh dual-boot install 2 days ago, on a 160GB partition of my fastest NVMe drive)

Inference is ~34 tps and doesn't drop. It often goes up to ~37 while generating tokens!
System memory is 22GB full, giving me a full 8GB of system RAM to run i3wm/X11 with whatever software I need (no eye-candy compositors or apps that use the GPU, because that would use up precious VRAM).
I was able to get to 1M context on IQ4_NL_XL with turbo4 quant for KV.
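To put the two sets of numbers side by side, here is a small sketch of the relative speedup. It uses only the tps figures quoted above and treats the Windows "<27" as exactly 27, so the results are a lower bound; the function name is just for this example.

```python
def speedup_pct(before_tps: float, after_tps: float) -> float:
    """Relative throughput gain of after_tps over before_tps, in percent."""
    return (after_tps - before_tps) / before_tps * 100


# Windows ceiling (27 tps) vs. typical and peak Ubuntu Server figures.
print(f"typical: +{speedup_pct(27, 34):.1f}%")
print(f"peak:    +{speedup_pct(27, 37):.1f}%")
```

The typical case comes out at just under 26%, which lines up with the +25% mentioned earlier.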

So far it's been good enough. But I have an older small GPU I can connect and use for the operating system while keeping the 3070 Ti entirely dedicated to the LLM.

--------------------

Both profiles are coding-focused and should work under Windows 11 too, but with a lot less memory left.

Main profile with 256K context:

llama-server \
  -m Qwen3.6-35B-A3B-Q4_K_XL.gguf \
  --jinja \
  --parallel 1 \
  --temp 0.7 \
  --top-k 20 \
  --top-p 0.95 \
  --min-p 0 \
  --reasoning-budget 4096 \
  -n 32768 \
  --no-context-shift \
  --no-mmap \
  -c 262144 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --host 0.0.0.0

and with 512K context:

llama-server \
  -m Qwen3.6-35B-A3B-Q4_K_XL.gguf \
  --jinja \
  --parallel 1 \
  --temp 0.7 \
  --top-k 20 \
  --top-p 0.95 \
  --min-p 0 \
  --reasoning-budget 4096 \
  -n 32768 \
  --no-context-shift \
  --no-mmap \
  -c 524288 \
  --rope-scale 2 \
  --rope-scaling yarn \
  --yarn-orig-ctx 262144 \
  --cache-type-k turbo4 \
  --cache-type-v turbo4 \
  --host 0.0.0.0
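The rope settings in the 512K profile follow a simple pattern: --rope-scale is the target context divided by the original 262144. Here is a minimal sketch of that arithmetic for other context sizes. The helper name `yarn_flags` is hypothetical, and treating "1M" as 1048576 tokens with a scale of 4 is an extrapolation from the 512K profile, not a setting taken from this post.

```python
def yarn_flags(target_ctx: int, native_ctx: int = 262_144) -> list[str]:
    """Build the context and rope flags for a target context size.

    Mirrors the 512K profile: scale = target_ctx / native_ctx.
    At or below the native context, no rope scaling flags are emitted.
    """
    if target_ctx <= native_ctx:
        return ["-c", str(target_ctx)]
    scale = target_ctx / native_ctx
    return [
        "-c", str(target_ctx),
        "--rope-scale", f"{scale:g}",
        "--rope-scaling", "yarn",
        "--yarn-orig-ctx", str(native_ctx),
    ]


for ctx in (262_144, 524_288, 1_048_576):
    print(" ".join(yarn_flags(ctx)))
```

For 524288 this reproduces the four rope-related flags from the 512K profile above; the other flags in the profiles stay unchanged.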

I hope someone finds this helpful. I love this community and I'm in the Qwen3.7-35B-A3B waiting room with the rest eating my nails in anticipation lol

submitted by /u/Alternative-Cat-1347