r/LocalLLaMA · May 28, 2026 · 3 min read

Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I'm posting this because it may be helpful to squeeze the 12GB VRAM in the 3060.

All credit goes to spiritbuun's fork (github.com/spiritbuun/buun-llama-cpp) and mudler's APEX quantizations (huggingface.co/mudler). Spiritbuun's CUDA optimizations for NVIDIA GPUs — fused MMA fix, TurboQuant, fattn improvements — are what make offloading a 17.3 GB model on a 12 GB card at these speeds possible. Mudler's APEX I-Compact quantization gave me the best perplexity/speed trade-off of any variant I tested.

Hardware: - GPU: 1× RTX 3060 12GB (110W power limit) - CPU: Xeon E5-2678 v3 - RAM: 128 GB DDR4-2133 - PCIe 3.0 x16 - Container: Incus (LXC)

Command (optimal for me):

bash ./build/bin/llama-server \ -m /models/mudler/Qwen3.6-35B-A3B-APEX-MTP-I-Compact.gguf \ --no-warmup -c 131072 -np 1 --no-mmap --mlock \ -ctk turbo4 -ctv turbo4 \ --jinja --reasoning-budget 1536 \ --flash-attn on \ --host 0.0.0.0 --port 8000 \ -fitt 1500 \ --mmproj /models/mmproj-Qwen3.6-35B-A3B-Uncensored-Genesis-f16.gguf

Note on -fitt 1500: the mmproj takes ~900 MB. Without a fitting limit, llama-server tries to load it on GPU and OOMs. -fitt makes it work. Leaves room for the mmproj. Not needed without mmproj.

Models tested (72K prompt + 100 gen):

Model	Prompt (t/s)	Gen (t/s)	Notes
mudler/...APEX-MTP-I-Compact + genesis mmproj, MTP off	475	37.17	🏆
mudler/...APEX-MTP-I-Compact, no mmproj, MTP off	487	36.74
mudler/...APEX-I-Compact, no mmproj	461	34.04	No MTP heads in VRAM
unsloth/...UD-IQ3_S, no mmproj	488	26.21
unsloth/...UD-IQ4_NL, no mmproj	462	22.65
mudler/...APEX-MTP-I-Compact, MTP on	412	21.74

Full model names: mudler/Qwen3.6-35B-A3B-APEX-MTP-I-Compact.gguf, mudler/Qwen3.6-35B-A3B-APEX-I-Compact.gguf, unsloth/Qwen3.6-35B-A3B-UD-IQ3_S.gguf, unsloth/Qwen3.6-35B-A3B-UD-IQ4_NL.gguf

Context degradation (optimal config): - Fresh: ~45 t/s gen - @72K filled: 37.17 gen · 475 prompt - @129K filled: 28.08 gen · 420 prompt

llama-perplexity (enwik8 subset, 64K ctx, turbo4, flash-attn): PPL = 3.2529 +/- 0.01852 across 4 chunks

I think it's pretty good for this model and quantization. I'm happy with it.

Needle-in-a-haystack (manual, web UI): 5 trials with hidden codes (e.g. secret=6301) planted in 150K–200K token texts at varying depths. 100% retrieval — model found every hidden code on every trial. I've used academic markdown texts for this.

Key findings:

Spiritbuun's fork + mudler models are the key. Without spiritbuun's CUDA work these numbers wouldn't be possible on a 3060 with a 17 GB model, but as figures show, the mudler model was also fundamental.
MTP hurts on my setup (3060 12GB with heavy offloading): it drops gen by 41% when enabled. On cards with enough VRAM to fit the whole model, MTP works well — there are posts in this sub about it, and about cards with same VRAM but more compute power doing well. On a 3060 with offloading, leave it off.
Mudler's APEX quantizations are decisive over other options. I tried several APEX I-Compact variants from other users and they topped out at 32-34 t/s — mudler's consistently gives the best numbers. The gap vs bartowsky or unsloth is substantial.
The MTP-I file (with MTP heads included) performs better than the APEX-I even with MTP disabled (36.74 vs 34.04). Maybe, I'm not sure, the extra tensors sitting in VRAM seem to make some magic aligning the memory layout. No good explanation, just empirical.
Context degradation: ~18% from fresh to 72K, another ~24% from 72K to 129K. Prompt speed also suffers as context grows.

For a single RTX 3060 12GB, spiritbuun's fork + mudler/Qwen3.6-35B-A3B-APEX-MTP-I-Compact.gguf with MTP off is the best combo I've found for long sessions with large context. 37 t/s gen, PPL 3.25, offloading a 17.3 GB model on a 12 GB card. Again, all credit to spiritbuun and mudler

submitted by /u/old-mike
[link] [comments]

Discussion (0)

No comments yet. Sign in and be the first to say something.

Discussion (0)

More from r/LocalLLaMA