100+ t/s on Qwen3.6-27B Q8 across a 5090 + 3090 Ti — switching to tensor split-mode got me from 70 to 100+
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Wanted to share a setup that's been working great for me. Running Qwen3.6-27B at Q8_0 across two GPUs (RTX 5090 + RTX 3090 Ti) and getting ~100 t/s.
The big jump came from switching --split-mode to tensor. I was sitting at 70+ t/s on layer split before that. With layer split each card owns whole layers, so during decoding one card waits while the other works through its layers. Tensor split keeps both cards busy on the same tensors at the same time, and with a fast/slow pairing like this it made a real difference. I'm pairing it with a 70/30 tensor split (favoring the 5090) to roughly match the relative compute.
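For a rough feel of what a 70/30 split means in memory terms, here is a minimal sketch. It assumes the --tensor-split values are treated as proportions and normalized, and that Q8_0 costs about 8.5 bits per weight, so a 27B model is roughly 28.7 GB of weights. The helper name split_shares and the 28.7 figure are my own illustration, not something measured on this setup, and KV cache and compute buffers come on top.

```bash
# Hypothetical helper: divide a total size (GB) by --tensor-split style proportions.
split_shares() {
  # $1 = comma-separated proportions (e.g. "70,30"), $2 = total size in GB
  awk -v props="$1" -v total="$2" 'BEGIN {
    n = split(props, p, ",")
    sum = 0
    for (i = 1; i <= n; i++) sum += p[i]          # normalize against the sum
    for (i = 1; i <= n; i++)
      printf "GPU%d: %.1f GB (%.0f%%)\n", i - 1, total * p[i] / sum, 100 * p[i] / sum
  }'
}

# Estimated weight footprint: 27e9 params * 8.5 bits / 8 = ~28.7 GB (assumption)
split_shares 70,30 28.7
# GPU0: 20.1 GB (70%)
# GPU1: 8.6 GB (30%)
```

Under those assumptions the 5090 would hold about 20 GB of weights and the 3090 Ti under 9 GB, which leaves headroom on both cards for the long context.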
Fair warning: this thing turns into a proper space heater under load. During decoding both GPUs pull hard the entire time — 750W+ from the cards alone.
Throughput depends on the prompt as well, with some prompts reaching up to 130 t/s.
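If you want to compare prompts yourself, a small sketch like the one below turns a token count and a generation time into t/s. The helper name tps and the sample numbers are made up for illustration. The commented lines assume the server is reachable on port 8080, that jq is installed, and that the /completion response carries a timings object with predicted_n and predicted_ms, which may differ between builds.

```bash
# Hypothetical helper: tokens per second from a token count and elapsed milliseconds.
tps() {
  # $1 = generated tokens, $2 = generation time in ms
  awk -v n="$1" -v ms="$2" 'BEGIN {
    if (ms <= 0) { print "n/a"; exit }   # avoid dividing by zero
    printf "%.1f\n", n * 1000 / ms
  }'
}

tps 512 5120   # 100.0
tps 512 3938   # 130.0

# Against a live server (assumptions: jq installed, "timings" present in the response):
# resp=$(curl -s http://localhost:8080/completion \
#   -d '{"prompt": "Explain tensor parallelism.", "n_predict": 256}')
# tps "$(echo "$resp" | jq .timings.predicted_n)" "$(echo "$resp" | jq .timings.predicted_ms)"
```

Running the same measurement over a few different prompts is usually enough to see the spread.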
Full llama.cpp server command:
```bash
llama-server \
  -m Qwen3.6-27B-Q8_0.gguf \
  -fa 1 \
  --n-gpu-layers 99 \
  --tensor-split 70,30 \
  --fit off \
  --main-gpu 0 \
  --split-mode tensor \
  --no-mmap \
  --mlock \
  --cpu-range 0-23 \
  --cpu-range-batch 0-7 \
  --ctx-size 196608 \
  --parallel 2 \
  --kv-unified \
  --jinja --no-warmup --threads 24 --numa isolate \
  --batch-size 2048 --ubatch-size 2048 --threads-batch 8 \
  --chat-template-kwargs '{"preserve_thinking": false}' \
  -cms 24000 \
  -ctxcp 5 \
  --alias qwen.3.6-27b.q8 \
  --spec-type draft-mtp --spec-draft-n-max 3 \
  --reasoning-budget 12288 \
  --reasoning-budget-message "Wrap up your reasoning and give the final answer." \
  --host 0.0.0.0 --port 8080
```

Happy to answer questions about the config.
P.S. If you want to understand how tensor splitting works, the llama.cpp documentation covers it here: https://github.com/ggml-org/llama.cpp/blob/master/docs/multi-gpu.md