
Comparing dual-GPU inference speed between llama.cpp row/tensor split and ik_llama graph split

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.


Setup:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 610.43.02          KMD Version: 610.43.02         CUDA UMD Version: 13.3     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080        Off |   00000000:01:00.0 Off |                  N/A |
| 40%   30C    P8             10W /  320W |     238MiB /  20480MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3080        Off |   00000000:03:00.0 Off |                  N/A |
| 40%   29C    P8              8W /  320W |      17MiB /  20480MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Yes, these are the Alibaba 3080 20GB cards; they just arrived today. Great buy tbh.

I used llama-benchy to benchmark prompt processing and token generation speed with ik_llama and llama.cpp, covering the row, tensor and graph split modes.

Model used: https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/blob/main/Qwen3.6-27B-Q8_0.gguf

No MTP for this benchmark.

Used the latest versions of ik_llama and llama.cpp as of today. Both were updated and recompiled right before benchmarking.

Arguments used for all 3 runs:

-m '<...>/Qwen3.6-27B-Q8_0.gguf' \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
-np 1 -c 135000 -ngl 99

Arguments used for llama.cpp:

-sm row

-sm tensor

Arguments for ik_llama:

-sm graph
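
Put together, each run is the shared argument list plus one split-mode flag. A minimal sketch of that composition in Python, under some assumptions: the binary names are placeholders (the post does not say which executable was launched), and the model path stays elided exactly as in the original.

```python
# Shared arguments, copied from the post (model path left elided as in the original).
COMMON_ARGS = [
    "-m", "<...>/Qwen3.6-27B-Q8_0.gguf",
    "--temp", "0.6", "--top-p", "0.95", "--top-k", "20", "--min-p", "0.0",
    "-np", "1", "-c", "135000", "-ngl", "99",
]

# One split mode per run. The binary names are hypothetical placeholders,
# not the actual executables used in the post.
RUNS = {
    "llama.cpp row": ("<llama.cpp binary>", "row"),
    "llama.cpp tensor": ("<llama.cpp binary>", "tensor"),
    "ik_llama graph": ("<ik_llama binary>", "graph"),
}


def build_command(binary, split_mode):
    """Return the full argument vector for one benchmark run."""
    return [binary, *COMMON_ARGS, "-sm", split_mode]


if __name__ == "__main__":
    for name, (binary, mode) in RUNS.items():
        print(f"{name}: {' '.join(build_command(binary, mode))}")
```

Only the `-sm` value differs between the three runs, so any speed difference comes from the split mode and the project it belongs to, not from sampling or context settings.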

-sm row:

VRAM usage: GPU0: 18.2 / GPU1: 18.5

Results:

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| Qwen/Qwen3.6-27B | pp4096 @ d4000 | 1732.89 ± 14.86 | | 4673.37 ± 40.08 | 4673.07 ± 40.08 | 4673.37 ± 40.08 |
| Qwen/Qwen3.6-27B | tg128 @ d4000 | 23.03 ± 0.01 | 24.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d8000 | 1766.49 ± 7.45 | | 6848.27 ± 29.08 | 6847.97 ± 29.08 | 6848.27 ± 29.08 |
| Qwen/Qwen3.6-27B | tg128 @ d8000 | 22.83 ± 0.01 | 23.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d16000 | 1756.67 ± 9.84 | | 11441.05 ± 63.85 | 11440.74 ± 63.85 | 11441.05 ± 63.85 |
| Qwen/Qwen3.6-27B | tg128 @ d16000 | 22.44 ± 0.00 | 23.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d32000 | 1670.17 ± 7.88 | | 21613.73 ± 101.44 | 21613.42 ± 101.44 | 21613.73 ± 101.44 |
| Qwen/Qwen3.6-27B | tg128 @ d32000 | 21.71 ± 0.01 | 22.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d64000 | 1481.15 ± 4.23 | | 45976.46 ± 130.94 | 45976.15 ± 130.94 | 45976.46 ± 130.94 |
| Qwen/Qwen3.6-27B | tg128 @ d64000 | 20.41 ± 0.00 | 21.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d128000 | 1195.01 ± 2.36 | | 110541.23 ± 217.70 | 110540.93 ± 217.70 | 110541.23 ± 217.70 |
| Qwen/Qwen3.6-27B | tg128 @ d128000 | 18.23 ± 0.00 | 19.00 ± 0.00 | | | |
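
A quick sanity check on how to read the pp rows. The figures are consistent with the pp t/s value covering the whole prefill (context depth plus the 4096 prompt tokens) rather than the 4096 tokens alone. That reading is an inference from the numbers above, not something taken from llama-benchy's documentation, so treat the sketch below as a consistency check under that assumption.

```python
# (depth, pp t/s, est_ppt in ms) for the -sm row run, copied from the results above.
ROW_PP = [
    (4000, 1732.89, 4673.07),
    (8000, 1766.49, 6847.97),
    (16000, 1756.67, 11440.74),
    (32000, 1670.17, 21613.42),
    (64000, 1481.15, 45976.15),
    (128000, 1195.01, 110540.93),
]
PROMPT_TOKENS = 4096  # from the "pp4096" test name


def predicted_ms(depth, tokens_per_s):
    """Prefill time if t/s is measured over depth + prompt tokens (assumption)."""
    return (depth + PROMPT_TOKENS) / tokens_per_s * 1000.0


def relative_error(depth, tokens_per_s, measured_ms):
    """Relative gap between the predicted and the reported prefill time."""
    return abs(predicted_ms(depth, tokens_per_s) - measured_ms) / measured_ms


if __name__ == "__main__":
    for depth, tps, ms in ROW_PP:
        print(
            f"d{depth}: predicted {predicted_ms(depth, tps):.0f} ms, "
            f"reported {ms:.0f} ms, "
            f"rel. error {relative_error(depth, tps, ms):.4%}"
        )
```

Under that assumption the predicted and reported times agree to well under 0.1% at every depth, which is why pp t/s stays in the thousands even when time to first response exceeds 100 seconds at d128000.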

-sm tensor:

VRAM usage: GPU0: 18.1 / GPU1: 17.9

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| Qwen/Qwen3.6-27B | pp4096 @ d4000 | 1412.73 ± 15.38 | | 5732.50 ± 61.94 | 5732.15 ± 61.94 | 5732.50 ± 61.94 |
| Qwen/Qwen3.6-27B | tg128 @ d4000 | 38.95 ± 0.05 | 40.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d8000 | 1400.96 ± 5.46 | | 8635.04 ± 32.88 | 8634.68 ± 32.88 | 8635.04 ± 32.88 |
| Qwen/Qwen3.6-27B | tg128 @ d8000 | 38.68 ± 0.10 | 39.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d16000 | 1381.89 ± 4.16 | | 14543.59 ± 43.73 | 14543.23 ± 43.73 | 14543.59 ± 43.73 |
| Qwen/Qwen3.6-27B | tg128 @ d16000 | 38.14 ± 0.11 | 39.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d32000 | 1328.03 ± 2.82 | | 27181.67 ± 57.72 | 27181.31 ± 57.72 | 27181.67 ± 57.72 |
| Qwen/Qwen3.6-27B | tg128 @ d32000 | 37.13 ± 0.01 | 38.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d64000 | 1219.17 ± 2.61 | | 55856.47 ± 119.00 | 55856.12 ± 119.00 | 55856.47 ± 119.00 |
| Qwen/Qwen3.6-27B | tg128 @ d64000 | 35.18 ± 0.01 | 36.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d128000 | 1036.75 ± 1.70 | | 127414.43 ± 208.98 | 127414.08 ± 208.98 | 127414.43 ± 208.98 |
| Qwen/Qwen3.6-27B | tg128 @ d128000 | 31.72 ± 0.12 | 32.00 ± 0.00 | | | |

-sm graph (ik_llama):

VRAM usage: GPU0: 17.8 / GPU1: 19.2

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| Qwen/Qwen3.6-27B | pp4096 @ d4000 | 1420.56 ± 17.77 | | 5700.41 ± 70.54 | 5699.81 ± 70.54 | 5700.41 ± 70.54 |
| Qwen/Qwen3.6-27B | tg128 @ d4000 | 32.15 ± 0.03 | 33.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d8000 | 1387.88 ± 13.61 | | 8716.90 ± 84.91 | 8716.29 ± 84.91 | 8716.90 ± 84.91 |
| Qwen/Qwen3.6-27B | tg128 @ d8000 | 31.81 ± 0.01 | 33.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d16000 | 1362.43 ± 8.36 | | 14751.24 ± 90.08 | 14750.64 ± 90.08 | 14751.24 ± 90.08 |
| Qwen/Qwen3.6-27B | tg128 @ d16000 | 31.13 ± 0.01 | 32.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d32000 | 1318.72 ± 9.42 | | 27373.72 ± 195.00 | 27373.12 ± 195.00 | 27373.72 ± 195.00 |
| Qwen/Qwen3.6-27B | tg128 @ d32000 | 30.32 ± 0.02 | 31.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d64000 | 1216.07 ± 8.43 | | 55999.88 ± 388.37 | 55999.27 ± 388.37 | 55999.88 ± 388.37 |
| Qwen/Qwen3.6-27B | tg128 @ d64000 | 28.86 ± 0.04 | 30.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d128000 | 1055.71 ± 7.36 | | 125132.30 ± 869.60 | 125131.69 ± 869.60 | 125132.30 ± 869.60 |
| Qwen/Qwen3.6-27B | tg128 @ d128000 | 26.35 ± 0.00 | 27.00 ± 0.00 | | | |
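
To make the three runs easier to compare, here is a small sketch that expresses each mode's throughput relative to -sm row at the same depth. The numbers are the mean t/s values copied from the results above; using row as the baseline is just a choice for illustration.

```python
# Mean t/s per context depth, copied from the three result sets above.
DEPTHS = [4000, 8000, 16000, 32000, 64000, 128000]

TG = {  # token generation (tg128)
    "row": [23.03, 22.83, 22.44, 21.71, 20.41, 18.23],
    "tensor": [38.95, 38.68, 38.14, 37.13, 35.18, 31.72],
    "graph": [32.15, 31.81, 31.13, 30.32, 28.86, 26.35],
}

PP = {  # prompt processing (pp4096)
    "row": [1732.89, 1766.49, 1756.67, 1670.17, 1481.15, 1195.01],
    "tensor": [1412.73, 1400.96, 1381.89, 1328.03, 1219.17, 1036.75],
    "graph": [1420.56, 1387.88, 1362.43, 1318.72, 1216.07, 1055.71],
}


def ratios(table, mode, baseline="row"):
    """Per-depth throughput of `mode` divided by the baseline mode."""
    return [m / b for m, b in zip(table[mode], table[baseline])]


if __name__ == "__main__":
    for label, table in (("tg128", TG), ("pp4096", PP)):
        for mode in ("tensor", "graph"):
            line = ", ".join(
                f"d{d}: {r:.2f}x" for d, r in zip(DEPTHS, ratios(table, mode))
            )
            print(f"{label} {mode} vs row -> {line}")
```

On these numbers, token generation with -sm tensor lands at roughly 1.7x the row-split speed and -sm graph at roughly 1.4x, while both process prompts at about 0.78x to 0.88x of row split. Keep in mind that the graph run uses ik_llama and the other two use llama.cpp, so that comparison spans two codebases as well as two split modes.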
submitted by /u/grumd