r/LocalLLaMA

Qwen3.6-35B-A3B APEX on a Single RTX 3090 - Getting the Most Out of It

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Resources I used:

  • https://github.com/ikawrakow/ik_llama.cpp - as the reference llama.cpp fork
  • https://github.com/spiritbuun/buun-llama-cpp - to test the TurboQuant feature
  • https://huggingface.co/mudler - for the models
  • https://github.com/noonghunna/club-3090 - for speed references, benchmarking and setup guidance

My Goal

I recently got an RTX 3090 and tried to find the optimal configuration for running the Qwen3.6-35B-A3B model. My priorities were clear:

  • Maximum possible quality without sacrificing good speed
  • Minimum 128k context to handle long documents and long agentic flows

Speed Benchmarks

I tested two llama.cpp forks (ik_llama as suggested by club-3090 and the spiritbuun fork) with both main APEX model versions (I-Compact and I-Quality). Here are the generation speed results, all with 128k context.

| Engine | APEX Model | KV Cache (K / V) | decode_TPS (Narrative) | decode_TPS (Code) |
|---|---|---|---|---|
| ik_llama | I-Compact | q8_0 / q5_0 | ~146 | ~146 |
| spiritbuun | I-Compact | turbo8 / turbo4 | ~142 | ~141 |
| spiritbuun | I-Quality | turbo8 / turbo4 | ~137 | ~137 |
| ik_llama | I-Quality | q8_0 / q5_0 | ~137 | ~137 |

Analysis: ik_llama with I-Compact is the fastest combination at ~146 t/s. With I-Quality, however, spiritbuun with the turbo8/turbo4 cache matches ik_llama with q8_0/q5_0 at ~137 t/s.

Quality Comparison

Here's a comparison table with official data from the APEX repository for Qwen3.5-35B-A3B. Note that these are the official APEX benchmarks for 3.5: I haven't been able to find 3.6-specific benchmark data, but I expect the relative performance between APEX tiers to be similar.

| Model | Size | PPL ↓ | KL mean ↓ | KL max ↓ | HellaSwag ↑ | tg128 (t/s) ↑ |
|---|---|---|---|---|---|---|
| BF16 (reference) | 64.6 GB | 6.537 | - | - | 82.5% | 30.4 |
| APEX I-Quality | 21.3 GB | 6.552 | 0.0102 | 5.59 | 83.5% | 62.3 |
| UD-Q4_K_XL | 20.7 GB | 6.554 | 0.0097 | 3.14 | 83.0% | 58.1 |
| APEX I-Compact | ~17 GB | 6.857 | 0.0451 | 8.76 | 83.5% | - |

On paper, APEX I-Quality and UD-Q4_K_XL look nearly identical: almost the same perplexity (6.552 vs 6.554) and a similar mean KL (0.0102 vs 0.0097), although UD-Q4_K_XL has the lower KL max (3.14 vs 5.59). But here's the kicker: APEX I-Quality is ~7% faster in generation (62.3 vs 58.1 t/s) while delivering slightly better HellaSwag (83.5% vs 83.0%).
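For anyone who wants to check the percentages, here is a minimal sketch of the arithmetic using `awk`. The helper name `speedup_pct` is just something I made up for illustration; the numbers come straight from the two tables above.

```bash
#!/bin/sh
# Relative speedup of a over b, in percent (hypothetical helper, not part of any tool).
speedup_pct() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a / b - 1) * 100 }'
}

speedup_pct 62.3 58.1   # APEX I-Quality vs UD-Q4_K_XL, tg128 from the APEX table
speedup_pct 146 137     # I-Compact vs I-Quality on ik_llama, decode_TPS table
```

This prints 7.2 and 6.6, so the gap between I-Quality and UD-Q4_K_XL is about the same size as the gap between I-Compact and I-Quality on the 3090.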

APEX I-Compact is the efficiency champion: at only ~17 GB it gives up some perplexity (6.857 vs 6.552 for I-Quality) but offers maximum speed, and you can push context to 256k without OOM. It even ties I-Quality on HellaSwag (83.5%).
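A rough way to see why the smaller file leaves so much more room for context is to subtract the model size from the 3090's 24 GB of VRAM. This is only a back-of-the-envelope sketch: it assumes the listed file sizes roughly match what the weights occupy on the GPU, it ignores the GB vs GiB difference, and it ignores compute buffers and anything else using the card. `vram_headroom_gb` is a made-up helper name.

```bash
#!/bin/sh
# Approximate VRAM left for KV cache and buffers after loading the weights.
# Assumption: file size on disk ~ weight size on GPU; overheads are ignored.
vram_headroom_gb() {
  awk -v total="$1" -v model="$2" 'BEGIN { printf "%.1f\n", total - model }'
}

vram_headroom_gb 24 21.3   # APEX I-Quality
vram_headroom_gb 24 17     # APEX I-Compact (size is approximate)
```

Under those assumptions I-Quality leaves about 2.7 GB and I-Compact about 7.0 GB, which is consistent with I-Compact being the one that can go to very long contexts.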

Why turbo8/turbo4 is Better Than q8_0/q5_0

turbo8 is a new KV cache codec from the spiritbuun fork. The author (@spiritbuun) posted benchmarks on X (Twitter) comparing turbo8 against the traditional q8_0 cache:

| ctx | turbo8 tg/s | vs q8_0 | turbo8 mean KLD | vs q8_0 KLD |
|---|---|---|---|---|
| 2048 | 31.34 | +1.9% | 0.007717 | -12% |
| 8192 | 30.22 | +3.6% | 0.009450 | -8% |
| 16384 | 29.40 | +6.7% | 0.005235 | -14% |
| 32768 | 28.06 | +15% | 0.003594 | -8% |

Source: https://x.com/spiritbuun/status/2062164396789412256

turbo8 is faster and has lower KLD at every tested context length. The speed gap widens at longer contexts, reaching +15% at 32k tokens, while the KLD improvement stays between 8% and 14%. Using it asymmetrically with turbo4 (turbo8 for Keys, turbo4 for Values) is what is recommended for the best balance.
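The table only gives the turbo8 speeds and the relative gain, but the q8_0 baseline can be backed out from those two columns. A small sketch, assuming "+x%" means turbo8 speed = q8_0 speed × (1 + x/100); the derived q8_0 numbers are my own arithmetic, not figures published by @spiritbuun, and `implied_q8_tps` is a made-up helper name.

```bash
#!/bin/sh
# Back out the implied q8_0 decode speed from the turbo8 table.
# Assumption: "+x%" means turbo8_tps = q8_0_tps * (1 + x/100).
implied_q8_tps() {
  awk '{ printf "%d %.2f\n", $1, $2 / (1 + $3 / 100) }'
}

# Columns: ctx, turbo8 tg/s, speedup vs q8_0 in percent
implied_q8_tps <<'EOF'
2048 31.34 1.9
8192 30.22 3.6
16384 29.40 6.7
32768 28.06 15
EOF
```

Read this way, q8_0 would drop from roughly 30.8 t/s at 2k to 24.4 t/s at 32k, while turbo8 only drops from 31.3 to 28.1 t/s, which is where the widening gap comes from.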

NOTE 1: PR #72 - Essential for spiritbuun

For spiritbuun to perform at its peak, you need to apply PR #72 that I submitted to the repository. A previous change introduced a "fast-path" that invalidated CUDA graph capture during prefill, causing a ~38% prompt eval regression. The PR adds a guard so that the fast-path is only used for single-token decoding, restoring prefill throughput.

NOTE 2: MTP - My Experience

In my testing, the MTP (Multi-Token Prediction) build of the I-Quality model is actually faster with MTP disabled than with it enabled. This might be because adding MTP heads changes the memory layout, or because the quantization script for the MTP version is better optimized.

I've also found that MTP doesn't bring benefits for this model in my setup. You might see speed peaks, but you lose in prefill almost always, and often in generation too. This has been documented by others and the reasoning makes sense: these small MoE models are so quick that MTP can actually penalize performance rather than help.

So, if you're chasing maximum speed, try disabling MTP (simply omit the flag).

Launch Commands

ik_llama + I-Compact (Maximum Speed)

```bash
#!/bin/bash

/root/ik_llama.cpp/build/bin/llama-server \
  -m /models/Qwen3.6-35B-A3B-APEX-MTP-I-Compact.gguf \
  -b 4096 -ub 1024 \
  --cache-ram 4096 \
  --parallel-tool-calls \
  --recurrent-ckpt-mode auto --merge-qkv \
  -c 196608 -np 1 --no-mmap --mlock \
  -ctk q8_0 -ctv q5_0 \
  -vhad -vhad -ngl 99 \
  --jinja --reasoning-budget 0 --flash-attn on \
  --host 0.0.0.0 --port 8000
```

spiritbuun + I-Quality + turbo8/turbo4 (Best Quality/Context)

```bash
#!/bin/bash

/root/buun-llama-cpp/build/bin/llama-server \
  -m /models/Qwen3.6-35B-A3B-APEX-MTP-I-Quality.gguf \
  --host 0.0.0.0 --port 8000 \
  --no-warmup \
  -c 131072 \
  -np 1 \
  --no-mmap --mlock \
  -ctk turbo8 -ctv turbo4 \
  --jinja --reasoning-budget 0 \
  --flash-attn on
```
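Once either server is up, decode speed can be spot-checked from the command line. This is only a sketch and not the method behind the table above: it assumes the fork still exposes upstream llama.cpp's `/completion` endpoint and still reports `timings.predicted_per_second` in the response, which I have not verified for both forks. The function names and the prompt are made up for illustration.

```bash
#!/bin/sh
# Pull timings.predicted_per_second out of a /completion JSON response.
# Assumption: the field name matches upstream llama.cpp's server output.
extract_decode_tps() {
  sed -n 's/.*"predicted_per_second"[[:space:]]*:[[:space:]]*\([0-9.][0-9.]*\).*/\1/p'
}

# Send one generation request and print the reported decode tokens/s.
# $1 is host:port, defaulting to the port used in the launch scripts.
measure_decode_tps() {
  curl -s "http://${1:-127.0.0.1:8000}/completion" \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "Write a short story about a lighthouse keeper.", "n_predict": 256}' \
    | extract_decode_tps
}
```

Calling `measure_decode_tps` with no argument hits the local server on port 8000. A single short request is noisy, so repeat it a few times and use a longer prompt if you want to see the effect of a filled context.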


Final Thoughts

I wrote a similar post about my old 3060. On the 3090, turbo8/turbo4 for the KV cache runs at a similar speed to the turbo4/turbo4 setup I reported in that post, but with the superior coherence of turbo8 for keys.

P.S. I used Hermes Agent (with the I-Quality model from this article as its main model) for translation and formatting in this post.

submitted by /u/old-mike