Dual GPU llama.cpp speedup
llama.cpp has had a long-standing issue with "--split-mode tensor": it gives great results, but it only supports non-quantized KV caches. For that reason a lot of people go with a healthy-sized KV cache and skip tensor parallelism.
I've had a stab at fixing the issue here: https://github.com/RedToasty/llama.cpp_qts. It's branched from mainline as of today, with minimal changes.
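If you want to try it, here's a rough build sketch; it assumes you already have a working CUDA toolchain, and since the fork is branched from mainline with minimal changes, the standard llama.cpp CUDA build should apply unchanged:

```sh
# Rough sketch - standard llama.cpp CUDA build, just pointed at the fork.
git clone https://github.com/RedToasty/llama.cpp_qts
cd llama.cpp_qts
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
# Binaries (llama-bench, llama-server, ...) land in build/bin
# (build/bin/Release with the MSVC generator on Windows).
```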
I'm personally running a 3060 12GB + 4070 Super 12GB, for a combined 24GB of VRAM.
Here are my results with a Q8_0/Q8_0 KV cache and "-sm tensor":
llama-bench.exe -m Qwen3.6-27B-Q4_K_M.gguf -sm tensor -fa 1 -ctk q8_0 -ctv q8_0 -p 128 -n 32 -b 128 -ub 128
| Model | Size | Params | Backend | NGL | Batch | UBatch | Type K | Type V | SM | FA | Test | Tokens/s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3.5 27B Q4_K Medium | 15.65 GiB | 26.90 B | CUDA | 99 | 128 | 128 | q8_0 | q8_0 | tensor | 1 | pp128 | 544.82 ± 6.01 |
| Qwen3.5 27B Q4_K Medium | 15.65 GiB | 26.90 B | CUDA | 99 | 128 | 128 | q8_0 | q8_0 | tensor | 1 | tg32 | 30.05 ± 0.38 |
Here are the same numbers without tensor splitting:
llama-bench.exe -m Qwen3.6-27B-Q4_K_M.gguf -fa 1 -ctk q8_0 -ctv q8_0 -p 128 -n 32 -b 128 -ub 128
| Model | Size | Params | Backend | NGL | Batch | UBatch | Type K | Type V | FA | Test | Tokens/s |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3.5 27B Q4_K Medium | 15.65 GiB | 26.90 B | CUDA | 99 | 128 | 128 | q8_0 | q8_0 | 1 | pp128 | 582.60 ± 28.57 |
| Qwen3.5 27B Q4_K Medium | 15.65 GiB | 26.90 B | CUDA | 99 | 128 | 128 | q8_0 | q8_0 | 1 | tg32 | 21.22 ± 0.52 |
Just over a 40% increase in token generation speed (prompt processing is slightly slower), with no loss of quality. This branch also supports the latest MTP (multi-token prediction) changes; I've personally been using:
--spec-type draft-mtp --spec-draft-p-min 0.75 --spec-draft-n-max 2
In personal use my tokens per second have gone from around 25 to around 40 in short "write a story" style contexts. I think it's down to limited VRAM, but for agentic coding and longer contexts I've personally had more joy with ngram-mod.
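For reference, the sort of single command line this all ends up as. This is only a rough sketch: the model path and context size are placeholders, the spec flags are simply the ones quoted above passed to llama-server, and the flag spellings mirror the bench runs earlier (adjust them, e.g. the "-fa" value, to whatever your build accepts):

```sh
# Rough sketch: tensor split across both GPUs, Q8_0 KV cache, plus the MTP
# draft settings above. Model path and -c value are placeholders.
llama-server -m Qwen3.6-27B-Q4_K_M.gguf -ngl 99 -c 16384 -sm tensor -fa 1 -ctk q8_0 -ctv q8_0 --spec-type draft-mtp --spec-draft-p-min 0.75 --spec-draft-n-max 2
```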
I'd love to hear feedback from anyone running dual 5060 Ti or similar. Anything dual-GPU Vulkan would also be interesting; I'm looking for issues.
TL;DR: If you run dual GPUs, grab/build this fork, add "-sm tensor" to your current command line, and see if it goes any quicker.