Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I'm posting this because it may help anyone trying to squeeze the most out of the 3060's 12 GB of VRAM.
All credit goes to spiritbuun's fork (github.com/spiritbuun/buun-llama-cpp) and mudler's APEX quantizations (huggingface.co/mudler). Spiritbuun's CUDA optimizations for NVIDIA GPUs — fused MMA fix, TurboQuant, fattn improvements — are what make offloading a 17.3 GB model on a 12 GB card at these speeds possible. Mudler's APEX I-Compact quantization gave me the best perplexity/speed trade-off of any variant I tested.
Hardware:
- GPU: 1× RTX 3060 12GB (110W power limit)
- CPU: Xeon E5-2678 v3
- RAM: 128 GB DDR4-2133
- PCIe 3.0 x16
- Container: Incus (LXC)
Command (optimal for me):
```bash
./build/bin/llama-server \
  -m /models/mudler/Qwen3.6-35B-A3B-APEX-MTP-I-Compact.gguf \
  --no-warmup -c 131072 -np 1 --no-mmap --mlock \
  -ctk turbo4 -ctv turbo4 \
  --jinja --reasoning-budget 1536 \
  --flash-attn on \
  --host 0.0.0.0 --port 8000 \
  -fitt 1500 \
  --mmproj /models/mmproj-Qwen3.6-35B-A3B-Uncensored-Genesis-f16.gguf
```
Note on -fitt 1500: the mmproj takes ~900 MB. Without a fitting limit, llama-server tries to load it on the GPU and OOMs. Setting -fitt leaves room for the mmproj, and it isn't needed when running without one.
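To see how much VRAM is actually left for the mmproj, one option is to check headroom with the model already loaded. This is only a minimal sketch: `vram_spare_mib` is a hypothetical helper, the sample numbers are made up, and the 900 MiB reserve simply mirrors the ~900 MB mmproj figure from the note above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of llama.cpp or the fork).
# Reads "used, total" pairs in MiB, one GPU per line, and prints
# total - used - reserve. A negative result means the reserve won't fit.
vram_spare_mib() {
  local reserve_mib="$1"
  awk -F', *' -v reserve="$reserve_mib" '{ print $2 - $1 - reserve }'
}

# Example with made-up numbers: 10400 MiB used out of 12288 MiB,
# reserving 900 MiB for the mmproj.
echo "10400, 12288" | vram_spare_mib 900

# On a real machine, feed it live data instead:
# nvidia-smi --query-gpu=memory.used,memory.total \
#            --format=csv,noheader,nounits | vram_spare_mib 900
```

If the result is negative with the model loaded, the mmproj would not fit on the GPU as things stand.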
Models tested (72K prompt + 100 gen):
| Model | Prompt (t/s) | Gen (t/s) | Notes |
|---|---|---|---|
| mudler/...APEX-MTP-I-Compact + genesis mmproj, MTP off | 475 | 37.17 | 🏆 |
| mudler/...APEX-MTP-I-Compact, no mmproj, MTP off | 487 | 36.74 | |
| mudler/...APEX-I-Compact, no mmproj | 461 | 34.04 | No MTP heads in VRAM |
| unsloth/...UD-IQ3_S, no mmproj | 488 | 26.21 | |
| unsloth/...UD-IQ4_NL, no mmproj | 462 | 22.65 | |
| mudler/...APEX-MTP-I-Compact, MTP on | 412 | 21.74 | |
Full model names: mudler/Qwen3.6-35B-A3B-APEX-MTP-I-Compact.gguf, mudler/Qwen3.6-35B-A3B-APEX-I-Compact.gguf, unsloth/Qwen3.6-35B-A3B-UD-IQ3_S.gguf, unsloth/Qwen3.6-35B-A3B-UD-IQ4_NL.gguf
Context degradation (optimal config):
- Fresh: ~45 t/s gen
- @72K filled: 37.17 t/s gen · 475 t/s prompt
- @129K filled: 28.08 t/s gen · 420 t/s prompt
llama-perplexity (enwik8 subset, 64K ctx, turbo4, flash-attn): PPL = 3.2529 +/- 0.01852 across 4 chunks
I think it's pretty good for this model and quantization. I'm happy with it.
Needle-in-a-haystack (manual, web UI): 5 trials with hidden codes (e.g. secret=6301) planted at varying depths in 150K–200K token texts. Retrieval was 100%: the model found every hidden code on every trial. I used academic texts in Markdown as the haystack.
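For anyone who wants to repeat a test like this, planting the code can be scripted. The sketch below is not the procedure used above (that was done by hand). `plant_needle` is a hypothetical helper, and depth is measured in lines rather than tokens, which is only a rough proxy.

```bash
#!/usr/bin/env bash
# plant_needle FILE DEPTH_PERCENT NEEDLE
# Prints FILE to stdout with NEEDLE inserted as its own line after
# DEPTH_PERCENT percent of the lines (0 = top, 100 = bottom).
plant_needle() {
  local file="$1" depth="$2" needle="$3"
  local total target
  total=$(wc -l < "$file")
  target=$(( total * depth / 100 ))
  awk -v target="$target" -v needle="$needle" '
    NR == 1 && target == 0 { print needle }   # depth 0: needle goes first
    { print }
    NR == target { print needle }             # otherwise: after line "target"
  ' "$file"
}

# Demo on a throwaway 10-line file: the needle lands after line 5.
demo=$(mktemp)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > "$demo"
plant_needle "$demo" 50 "secret=6301"
rm -f "$demo"
```

Running it at several depths (say 10, 50 and 90) on the same text gives a set of haystacks to paste into the web UI.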
Key findings:
Spiritbuun's fork + mudler models are the key. Without spiritbuun's CUDA work these numbers wouldn't be possible on a 3060 with a 17 GB model but, as the figures show, mudler's model was just as essential.
MTP hurts on my setup (3060 12GB with heavy offloading): it drops gen by 41% when enabled. On cards with enough VRAM to fit the whole model, MTP works well. There are posts in this sub about it, and about cards with the same VRAM but more compute power doing well. On a 3060 with offloading, leave it off.
Mudler's APEX quantizations clearly beat the other options. I tried several APEX I-Compact variants from other users and they topped out at 32-34 t/s, while mudler's consistently give the best numbers. The gap vs bartowski or unsloth is substantial.
The MTP-I file (with MTP heads included) performs better than the APEX-I file even with MTP disabled (36.74 vs 34.04 t/s gen). My guess, and I'm not sure about it, is that the extra tensors sitting in VRAM somehow help with how memory is laid out. I have no good explanation; this is purely empirical.
Context degradation: ~18% from fresh to 72K, another ~24% from 72K to 129K. Prompt speed also suffers as context grows.
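The percentages in these findings can be rechecked from the raw t/s figures. A small sketch, where `pct_drop` is just an illustrative helper name:

```bash
#!/usr/bin/env bash
# pct_drop BEFORE AFTER
# Prints the percentage decrease from BEFORE to AFTER, one decimal place.
pct_drop() {
  LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) / a * 100 }'
}

pct_drop 45    37.17   # gen, fresh -> 72K filled
pct_drop 37.17 28.08   # gen, 72K -> 129K filled
pct_drop 475   420     # prompt, 72K -> 129K filled
pct_drop 36.74 21.74   # gen, MTP off -> MTP on (same file, no mmproj)
```

This prints 17.4, 24.5, 11.6 and 40.8. The first value uses the rounded "~45 t/s" fresh figure, so it is only approximate.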
For a single RTX 3060 12GB, spiritbuun's fork + mudler/Qwen3.6-35B-A3B-APEX-MTP-I-Compact.gguf with MTP off is the best combo I've found for long sessions with large context: 37 t/s gen and PPL 3.25 while offloading a 17.3 GB model on a 12 GB card. Again, all credit to spiritbuun and mudler.