Dual GPU llama.cpp speedup
llama.cpp has had a long-standing issue with "--split-mode tensor": it gives great results, but it only supports non-quantized KV caches. For that reason a lot of people go with a healthy-sized KV cache and skip tensor parallelism.
I've had a stab at fixing the issue here: https://github.com/RedToasty/llama.cpp_qts. It's branched from mainline as of today, with minimal changes.
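If you want to try it, here's a rough build sketch; it assumes you already have a working CUDA toolchain, and since the fork is branched from mainline with minimal changes, the standard llama.cpp CUDA build should apply unchanged:

```sh
# Rough sketch - standard llama.cpp CUDA build, just pointed at the fork.
git clone https://github.com/RedToasty/llama.cpp_qts
cd llama.cpp_qts
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
# Binaries (llama-bench, llama-server, ...) land in build/bin
# (build/bin/Release with the MSVC generator on Windows).
```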
I'm personally running a 3060 12GB + 4070 Super 12GB, for a combined 24GB of VRAM.
Here are my results with a Q8_0/Q8_0 KV cache and "-sm tensor":
llama-bench.exe -m Qwen3.6-27B-Q4_K_M.gguf -sm tensor -fa 1 -ctk q8_0 -ctv q8_0 -p 128 -n 32 -b 128 -ub 128
| Model | Size | Params | Backend | NGL | Batch | UBatch | Type K | Type V | SM | FA | Test | Tokens/s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3.5 27B Q4_K Medium | 15.65 GiB | 26.90 B | CUDA | 99 | 128 | 128 | q8_0 | q8_0 | tensor | 1 | pp128 | 544.82 ± 6.01 |
| Qwen3.5 27B Q4_K Medium | 15.65 GiB | 26.90 B | CUDA | 99 | 128 | 128 | q8_0 | q8_0 | tensor | 1 | tg32 | 30.05 ± 0.38 |
Here are the same numbers without tensor splitting:
llama-bench.exe -m Qwen3.6-27B-Q4_K_M.gguf -fa 1 -ctk q8_0 -ctv q8_0 -p 128 -n 32 -b 128 -ub 128
| Model | Size | Params | Backend | NGL | Batch | UBatch | Type K | Type V | FA | Test | Tokens/s |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3.5 27B Q4_K Medium | 15.65 GiB | 26.90 B | CUDA | 99 | 128 | 128 | q8_0 | q8_0 | 1 | pp128 | 582.60 ± 28.57 |
| Qwen3.5 27B Q4_K Medium | 15.65 GiB | 26.90 B | CUDA | 99 | 128 | 128 | q8_0 | q8_0 | 1 | tg32 | 21.22 ± 0.52 |
Just over a 40% increase in token generation speed (prompt processing is slightly slower), with no loss of quality. This branch also supports the latest MTP (multi-token prediction) changes; I've personally been using:
--spec-type draft-mtp --spec-draft-p-min 0.75 --spec-draft-n-max 2
In personal use my tokens per second have gone from around 25 to around 40 in short "write a story" style contexts. I think it's down to limited VRAM, but for agentic coding and longer contexts I've personally had more joy with ngram-mod.
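For reference, the sort of single command line this all ends up as. This is only a rough sketch: the model path and context size are placeholders, the spec flags are simply the ones quoted above passed to llama-server, and the flag spellings mirror the bench runs earlier (adjust them, e.g. the "-fa" value, to whatever your build accepts):

```sh
# Rough sketch: tensor split across both GPUs, Q8_0 KV cache, plus the MTP
# draft settings above. Model path and -c value are placeholders.
llama-server -m Qwen3.6-27B-Q4_K_M.gguf -ngl 99 -c 16384 -sm tensor -fa 1 -ctk q8_0 -ctv q8_0 --spec-type draft-mtp --spec-draft-p-min 0.75 --spec-draft-n-max 2
```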
I'd love to hear feedback from anyone running dual 5060 Ti or similar. Anything dual-GPU Vulkan would also be interesting; I'm looking for issues.
TL;DR: If you run dual GPUs, grab/build this fork, add "-sm tensor" to your current command line, and see if it goes any quicker.