r/LocalLLaMA · June 12, 2026 · 4 min read

Comparing dual-GPU inference speed between llama.cpp row/tensor split and ik_llama graph split



Setup:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 610.43.02          KMD Version: 610.43.02     CUDA UMD Version: 13.3         |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080        Off |   00000000:01:00.0 Off |                  N/A |
| 40%   30C    P8             10W /  320W |     238MiB /  20480MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3080        Off |   00000000:03:00.0 Off |                  N/A |
| 40%   29C    P8              8W /  320W |      17MiB /  20480MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Yes, these are the Alibaba 3080 20GB cards; they just arrived today. Great buy tbh.

I used llama-benchy to benchmark prompt processing and token generation speed: llama.cpp with row and tensor split modes, and ik_llama with graph split mode.

Model used: https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/blob/main/Qwen3.6-27B-Q8_0.gguf

No MTP for this benchmark.

Used the latest versions of ik_llama and llama.cpp as of today; both were updated and recompiled right before benchmarking.

Arguments used for all 3 runs:

-m '<...>/Qwen3.6-27B-Q8_0.gguf' \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
-np 1 -c 135000 -ngl 99

Arguments used for llama.cpp:

-sm row

-sm tensor

Arguments for ik_llama:

-sm graph
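
A quick sanity check on the -c 135000 value, as a rough sketch: it assumes each benchmark request has to hold the prefilled depth, the 4096-token measured prompt, and the 128 generated tokens in one context. That accounting is my assumption, not something taken from llama-benchy's documentation; the depths and test sizes come from the result tables in this post.

```python
# Depths and test sizes copied from the benchmark tables in this post.
DEPTHS = [4000, 8000, 16000, 32000, 64000, 128000]
PP_TOKENS = 4096    # pp4096
TG_TOKENS = 128     # tg128
CONTEXT = 135000    # value passed to -c


def tokens_needed(depth: int) -> int:
    # Assumption: one request holds the prefilled depth, the measured
    # prompt, and the generated tokens in the same context window.
    return depth + PP_TOKENS + TG_TOKENS


for depth in DEPTHS:
    need = tokens_needed(depth)
    print(f"d{depth}: needs {need} tokens, headroom {CONTEXT - need}")
```

Under that assumption the deepest test needs 132224 tokens, so 135000 leaves a small margin.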

-sm row:

VRAM usage: GPU0: 18.2 / GPU1: 18.5

Results:

model	test	t/s	peak t/s	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
Qwen/Qwen3.6-27B	pp4096 @ d4000	1732.89 ± 14.86		4673.37 ± 40.08	4673.07 ± 40.08	4673.37 ± 40.08
Qwen/Qwen3.6-27B	tg128 @ d4000	23.03 ± 0.01	24.00 ± 0.00
Qwen/Qwen3.6-27B	pp4096 @ d8000	1766.49 ± 7.45		6848.27 ± 29.08	6847.97 ± 29.08	6848.27 ± 29.08
Qwen/Qwen3.6-27B	tg128 @ d8000	22.83 ± 0.01	23.00 ± 0.00
Qwen/Qwen3.6-27B	pp4096 @ d16000	1756.67 ± 9.84		11441.05 ± 63.85	11440.74 ± 63.85	11441.05 ± 63.85
Qwen/Qwen3.6-27B	tg128 @ d16000	22.44 ± 0.00	23.00 ± 0.00
Qwen/Qwen3.6-27B	pp4096 @ d32000	1670.17 ± 7.88		21613.73 ± 101.44	21613.42 ± 101.44	21613.73 ± 101.44
Qwen/Qwen3.6-27B	tg128 @ d32000	21.71 ± 0.01	22.00 ± 0.00
Qwen/Qwen3.6-27B	pp4096 @ d64000	1481.15 ± 4.23		45976.46 ± 130.94	45976.15 ± 130.94	45976.46 ± 130.94
Qwen/Qwen3.6-27B	tg128 @ d64000	20.41 ± 0.00	21.00 ± 0.00
Qwen/Qwen3.6-27B	pp4096 @ d128000	1195.01 ± 2.36		110541.23 ± 217.70	110540.93 ± 217.70	110541.23 ± 217.70
Qwen/Qwen3.6-27B	tg128 @ d128000	18.23 ± 0.00	19.00 ± 0.00
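
A note on reading the pp columns, inferred from the numbers rather than from llama-benchy's docs: est_ppt looks like (depth + 4096) / t/s, meaning the reported pp t/s seems to cover the whole prefill including the depth, not just the last 4096 tokens. The sketch below checks that guess against the -sm row rows; the formula is my assumption.

```python
# (depth, pp t/s, est_ppt in ms) copied from the -sm row table above.
ROW_PP = [
    (4000, 1732.89, 4673.07),
    (8000, 1766.49, 6847.97),
    (16000, 1756.67, 11440.74),
    (32000, 1670.17, 21613.42),
    (64000, 1481.15, 45976.15),
    (128000, 1195.01, 110540.93),
]
PP_TOKENS = 4096  # pp4096


def predicted_ppt_ms(depth: int, tokens_per_s: float) -> float:
    # Assumed relation: total prefilled tokens divided by throughput.
    return (depth + PP_TOKENS) / tokens_per_s * 1000.0


def rel_error(depth: int, tokens_per_s: float, reported_ms: float) -> float:
    return abs(predicted_ppt_ms(depth, tokens_per_s) - reported_ms) / reported_ms


for depth, tps, reported in ROW_PP:
    pred = predicted_ppt_ms(depth, tps)
    print(f"d{depth}: predicted {pred:.0f} ms, reported {reported:.0f} ms")
```

The predictions land within a fraction of a percent of the reported values, which is about what you'd expect if the tool averages per-run results.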

-sm tensor:

VRAM usage: GPU0: 18.1 / GPU1: 17.9

model	test	t/s	peak t/s	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
Qwen/Qwen3.6-27B	pp4096 @ d4000	1412.73 ± 15.38		5732.50 ± 61.94	5732.15 ± 61.94	5732.50 ± 61.94
Qwen/Qwen3.6-27B	tg128 @ d4000	38.95 ± 0.05	40.00 ± 0.00
Qwen/Qwen3.6-27B	pp4096 @ d8000	1400.96 ± 5.46		8635.04 ± 32.88	8634.68 ± 32.88	8635.04 ± 32.88
Qwen/Qwen3.6-27B	tg128 @ d8000	38.68 ± 0.10	39.00 ± 0.00
Qwen/Qwen3.6-27B	pp4096 @ d16000	1381.89 ± 4.16		14543.59 ± 43.73	14543.23 ± 43.73	14543.59 ± 43.73
Qwen/Qwen3.6-27B	tg128 @ d16000	38.14 ± 0.11	39.00 ± 0.00
Qwen/Qwen3.6-27B	pp4096 @ d32000	1328.03 ± 2.82		27181.67 ± 57.72	27181.31 ± 57.72	27181.67 ± 57.72
Qwen/Qwen3.6-27B	tg128 @ d32000	37.13 ± 0.01	38.00 ± 0.00
Qwen/Qwen3.6-27B	pp4096 @ d64000	1219.17 ± 2.61		55856.47 ± 119.00	55856.12 ± 119.00	55856.47 ± 119.00
Qwen/Qwen3.6-27B	tg128 @ d64000	35.18 ± 0.01	36.00 ± 0.00
Qwen/Qwen3.6-27B	pp4096 @ d128000	1036.75 ± 1.70		127414.43 ± 208.98	127414.08 ± 208.98	127414.43 ± 208.98
Qwen/Qwen3.6-27B	tg128 @ d128000	31.72 ± 0.12	32.00 ± 0.00

-sm graph (ik_llama):

VRAM usage: GPU0: 17.8 / GPU1: 19.2

model	test	t/s	peak t/s	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
Qwen/Qwen3.6-27B	pp4096 @ d4000	1420.56 ± 17.77		5700.41 ± 70.54	5699.81 ± 70.54	5700.41 ± 70.54
Qwen/Qwen3.6-27B	tg128 @ d4000	32.15 ± 0.03	33.00 ± 0.00
Qwen/Qwen3.6-27B	pp4096 @ d8000	1387.88 ± 13.61		8716.90 ± 84.91	8716.29 ± 84.91	8716.90 ± 84.91
Qwen/Qwen3.6-27B	tg128 @ d8000	31.81 ± 0.01	33.00 ± 0.00
Qwen/Qwen3.6-27B	pp4096 @ d16000	1362.43 ± 8.36		14751.24 ± 90.08	14750.64 ± 90.08	14751.24 ± 90.08
Qwen/Qwen3.6-27B	tg128 @ d16000	31.13 ± 0.01	32.00 ± 0.00
Qwen/Qwen3.6-27B	pp4096 @ d32000	1318.72 ± 9.42		27373.72 ± 195.00	27373.12 ± 195.00	27373.72 ± 195.00
Qwen/Qwen3.6-27B	tg128 @ d32000	30.32 ± 0.02	31.00 ± 0.00
Qwen/Qwen3.6-27B	pp4096 @ d64000	1216.07 ± 8.43		55999.88 ± 388.37	55999.27 ± 388.37	55999.88 ± 388.37
Qwen/Qwen3.6-27B	tg128 @ d64000	28.86 ± 0.04	30.00 ± 0.00
Qwen/Qwen3.6-27B	pp4096 @ d128000	1055.71 ± 7.36		125132.30 ± 869.60	125131.69 ± 869.60	125132.30 ± 869.60
Qwen/Qwen3.6-27B	tg128 @ d128000	26.35 ± 0.00	27.00 ± 0.00
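
To make the three tables easier to compare, here is a small sketch that divides the tensor and graph numbers by the row numbers at each depth. The values are the mean t/s figures copied from the tables above; the helper name is just for illustration.

```python
# Mean t/s values copied from the three tables above, keyed by depth.
TG = {
    "row":    {4000: 23.03, 8000: 22.83, 16000: 22.44, 32000: 21.71, 64000: 20.41, 128000: 18.23},
    "tensor": {4000: 38.95, 8000: 38.68, 16000: 38.14, 32000: 37.13, 64000: 35.18, 128000: 31.72},
    "graph":  {4000: 32.15, 8000: 31.81, 16000: 31.13, 32000: 30.32, 64000: 28.86, 128000: 26.35},
}
PP = {
    "row":    {4000: 1732.89, 8000: 1766.49, 16000: 1756.67, 32000: 1670.17, 64000: 1481.15, 128000: 1195.01},
    "tensor": {4000: 1412.73, 8000: 1400.96, 16000: 1381.89, 32000: 1328.03, 64000: 1219.17, 128000: 1036.75},
    "graph":  {4000: 1420.56, 8000: 1387.88, 16000: 1362.43, 32000: 1318.72, 64000: 1216.07, 128000: 1055.71},
}


def ratio_to_row(table: dict, mode: str) -> dict:
    # Speed of `mode` relative to -sm row at each depth (1.0 means equal).
    return {d: table[mode][d] / table["row"][d] for d in table["row"]}


for name, table in (("tg128", TG), ("pp4096", PP)):
    for mode in ("tensor", "graph"):
        ratios = ratio_to_row(table, mode)
        line = ", ".join(f"d{d}: {r:.2f}x" for d, r in ratios.items())
        print(f"{name} {mode} vs row -> {line}")
```

In these numbers, tensor split generates at roughly 1.7x the speed of row split and graph split at roughly 1.4x, while row split keeps the faster prompt processing (the other two land at about 0.78x to 0.88x of row).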

submitted by /u/grumd