r/LocalLLaMA · 2 min read

Dual GPU llama.cpp speedup


Llama.cpp has had a long-standing issue with "--split-mode tensor": it gives great results, but it only supports non-quantized KV caches. For that reason, a lot of people decide to go with a healthy-sized KV cache and skip tensor parallelism.

I've had a stab at fixing the issue here: https://github.com/RedToasty/llama.cpp_qts. It's branched from mainline as of today, with minimal changes.
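If you want to try it, here's a rough build sketch using the standard llama.cpp CUDA build steps (adjust generator, toolchain, and flags to your setup):

```shell
# Clone the fork and build with CUDA enabled
git clone https://github.com/RedToasty/llama.cpp_qts
cd llama.cpp_qts
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```

The binaries (llama-bench, llama-server, etc.) end up under build/bin.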

I'm personally running a 3060 12 GB + 4070 Super 12 GB, for a combined 24 GB.

Here are my results with a Q8_0/Q8_0 KV cache and "-sm tensor":

llama-bench.exe -m Qwen3.6-27B-Q4_K_M.gguf -sm tensor -fa 1 -ctk q8_0 -ctv q8_0 -p 128 -n 32 -b 128 -ub 128  

| Model | Size | Params | Backend | NGL | Batch | UBatch | Type K | Type V | SM | FA | Test | Tokens/s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3.5 27B Q4_K Medium | 15.65 GiB | 26.90 B | CUDA | 99 | 128 | 128 | q8_0 | q8_0 | tensor | 1 | pp128 | 544.82 ± 6.01 |
| Qwen3.5 27B Q4_K Medium | 15.65 GiB | 26.90 B | CUDA | 99 | 128 | 128 | q8_0 | q8_0 | tensor | 1 | tg32 | 30.05 ± 0.38 |

Here are the same runs without tensor splitting:

llama-bench.exe -m Qwen3.6-27B-Q4_K_M.gguf -fa 1 -ctk q8_0 -ctv q8_0 -p 128 -n 32 -b 128 -ub 128  

| Model | Size | Params | Backend | NGL | Batch | UBatch | Type K | Type V | FA | Test | Tokens/s |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3.5 27B Q4_K Medium | 15.65 GiB | 26.90 B | CUDA | 99 | 128 | 128 | q8_0 | q8_0 | 1 | pp128 | 582.60 ± 28.57 |
| Qwen3.5 27B Q4_K Medium | 15.65 GiB | 26.90 B | CUDA | 99 | 128 | 128 | q8_0 | q8_0 | 1 | tg32 | 21.22 ± 0.52 |
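The headline speedup comes from the tg32 rows of the two benchmark runs (error bars ignored, means treated as plain floats):

```python
# tg32 means from the two llama-bench runs
tg_tensor = 30.05  # with -sm tensor
tg_layer = 21.22   # default layer split

speedup_pct = (tg_tensor - tg_layer) / tg_layer * 100
print(f"token generation is {speedup_pct:.1f}% faster")  # ~41.6%
```

Prompt processing is slightly slower (582.60 vs 544.82 t/s), but generation is where the win is.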

Just over a 40% increase in token-generation speed, with no loss of quality. This branch also supports the latest MTP (multi-token prediction) changes; I've personally been using:

--spec-type draft-mtp --spec-draft-p-min 0.75 --spec-draft-n-max 2  
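For context, here's a sketch of what a full server command might look like with both the tensor split and those draft flags (whether the two compose well is an assumption on my part; the model path is the one from the benchmarks above):

```shell
# Hypothetical combined invocation: tensor split + quantized KV + MTP drafting
llama-server \
  -m Qwen3.6-27B-Q4_K_M.gguf \
  -sm tensor -fa 1 \
  -ctk q8_0 -ctv q8_0 \
  --spec-type draft-mtp --spec-draft-p-min 0.75 --spec-draft-n-max 2
```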

In personal use my generation speed has gone from around 25 tps to around 40 tps in short "write a story" style contexts. For agentic coding and longer contexts I've personally had more joy with ngram-mod, which I suspect is down to limited VRAM.

I'd love to hear feedback from anyone running dual 5060 Ti cards or similar. Anything dual-GPU on Vulkan would also be interesting; I'm looking for issues.

TL;DR: If you run dual GPUs, grab or build this fork, add "-sm tensor" to your current command line, and see if it goes any quicker.

submitted by /u/Legitimate-Dog5690

