Setup:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 610.43.02          KMD Version: 610.43.02         CUDA UMD Version: 13.3     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080        Off |   00000000:01:00.0 Off |                  N/A |
| 40%   30C    P8              10W / 320W |      238MiB / 20480MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3080        Off |   00000000:03:00.0 Off |                  N/A |
| 40%   29C    P8               8W / 320W |       17MiB / 20480MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
Yes, these are the Alibaba 3080 20GB cards; they just arrived today. Great buy tbh.
I used llama-benchy to benchmark prompt processing and token generation speed on llama.cpp (row and tensor split modes) and ik_llama (graph split mode).
Model used: https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/blob/main/Qwen3.6-27B-Q8_0.gguf
No MTP for this benchmark.
Used the latest versions of ik_llama and llama.cpp as of today. Both were updated and recompiled right before benchmarking.
Arguments used for all 3 runs:
-m '<...>/Qwen3.6-27B-Q8_0.gguf' \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
-np 1 -c 135000 -ngl 99
Additional argument for llama.cpp (one run each):
-sm row
-sm tensor
Additional argument for ik_llama:
-sm graph
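For anyone who wants to reproduce the three runs, here is a minimal sketch that assembles the shared arguments plus the per-run split mode into full command lines. The binary name `llama-server` is an assumption (the post does not name the executable), and the model path stays elided exactly as above.

```python
import shlex

# Assumption: the server binary is called "llama-server" in both builds.
# The post does not name the executable, so adjust this to your setup.
SERVER_BINARY = "llama-server"

# The path is elided in the post and is kept elided here.
MODEL_PATH = "<...>/Qwen3.6-27B-Q8_0.gguf"

# Arguments shared by all three runs, copied from the post.
COMMON_ARGS = [
    "-m", MODEL_PATH,
    "--temp", "0.6", "--top-p", "0.95", "--top-k", "20", "--min-p", "0.0",
    "-np", "1", "-c", "135000", "-ngl", "99",
]

# Split mode per run, as listed in the post.
RUNS = {
    "llama.cpp row": "row",
    "llama.cpp tensor": "tensor",
    "ik_llama graph": "graph",
}


def build_command(split_mode: str) -> list[str]:
    """Return the full argument vector for one run."""
    return [SERVER_BINARY, *COMMON_ARGS, "-sm", split_mode]


commands = {name: build_command(mode) for name, mode in RUNS.items()}

for name, cmd in commands.items():
    # shlex.join quotes the elided path so the line can be pasted into a shell.
    print(f"{name}: {shlex.join(cmd)}")
```

Keeping the shared arguments in one list makes it harder to accidentally change sampling or context settings between runs.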
-sm row (llama.cpp):
VRAM usage: GPU0: 18.2 / GPU1: 18.5
Results:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| Qwen/Qwen3.6-27B | pp4096 @ d4000 | 1732.89 ± 14.86 | | 4673.37 ± 40.08 | 4673.07 ± 40.08 | 4673.37 ± 40.08 |
| Qwen/Qwen3.6-27B | tg128 @ d4000 | 23.03 ± 0.01 | 24.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d8000 | 1766.49 ± 7.45 | | 6848.27 ± 29.08 | 6847.97 ± 29.08 | 6848.27 ± 29.08 |
| Qwen/Qwen3.6-27B | tg128 @ d8000 | 22.83 ± 0.01 | 23.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d16000 | 1756.67 ± 9.84 | | 11441.05 ± 63.85 | 11440.74 ± 63.85 | 11441.05 ± 63.85 |
| Qwen/Qwen3.6-27B | tg128 @ d16000 | 22.44 ± 0.00 | 23.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d32000 | 1670.17 ± 7.88 | | 21613.73 ± 101.44 | 21613.42 ± 101.44 | 21613.73 ± 101.44 |
| Qwen/Qwen3.6-27B | tg128 @ d32000 | 21.71 ± 0.01 | 22.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d64000 | 1481.15 ± 4.23 | | 45976.46 ± 130.94 | 45976.15 ± 130.94 | 45976.46 ± 130.94 |
| Qwen/Qwen3.6-27B | tg128 @ d64000 | 20.41 ± 0.00 | 21.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d128000 | 1195.01 ± 2.36 | | 110541.23 ± 217.70 | 110540.93 ± 217.70 | 110541.23 ± 217.70 |
| Qwen/Qwen3.6-27B | tg128 @ d128000 | 18.23 ± 0.00 | 19.00 ± 0.00 | | | |
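A quick sanity check on how the pp columns relate: the est_ppt values in the table above look consistent with (depth + 4096) tokens divided by the reported t/s. That formula is my inference from the numbers, not llama-benchy's documented definition. The sketch below recomputes it for the -sm row results and also checks that the deepest test fits inside the `-c 135000` context.

```python
PROMPT_TOKENS = 4096   # pp4096
GEN_TOKENS = 128       # tg128
CONTEXT_SIZE = 135000  # -c 135000

# (depth, pp t/s, reported est_ppt in ms), copied from the -sm row table.
ROW_PP = [
    (4000, 1732.89, 4673.07),
    (8000, 1766.49, 6847.97),
    (16000, 1756.67, 11440.74),
    (32000, 1670.17, 21613.42),
    (64000, 1481.15, 45976.15),
    (128000, 1195.01, 110540.93),
]


def estimated_ppt_ms(depth: int, tokens_per_second: float) -> float:
    """Assumed relation: time to process depth + prompt tokens at the given rate."""
    return (depth + PROMPT_TOKENS) / tokens_per_second * 1000.0


relative_errors = []
for depth, tps, reported_ms in ROW_PP:
    estimate_ms = estimated_ppt_ms(depth, tps)
    rel_err = abs(estimate_ms - reported_ms) / reported_ms
    relative_errors.append(rel_err)
    print(f"d{depth}: estimate {estimate_ms:.0f} ms, reported {reported_ms:.0f} ms, "
          f"relative error {rel_err:.5f}")

# Largest test: deepest context plus the prompt plus the generated tokens.
max_tokens_needed = max(d for d, _, _ in ROW_PP) + PROMPT_TOKENS + GEN_TOKENS
print(f"max tokens needed: {max_tokens_needed} (context size {CONTEXT_SIZE})")
```

If the relation holds, the t/s figure at each depth is an average over the whole prefill (existing context plus the 4096 new tokens), not only over the last 4096 tokens.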
-sm tensor (llama.cpp):
VRAM usage: GPU0: 18.1 / GPU1: 17.9
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| Qwen/Qwen3.6-27B | pp4096 @ d4000 | 1412.73 ± 15.38 | | 5732.50 ± 61.94 | 5732.15 ± 61.94 | 5732.50 ± 61.94 |
| Qwen/Qwen3.6-27B | tg128 @ d4000 | 38.95 ± 0.05 | 40.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d8000 | 1400.96 ± 5.46 | | 8635.04 ± 32.88 | 8634.68 ± 32.88 | 8635.04 ± 32.88 |
| Qwen/Qwen3.6-27B | tg128 @ d8000 | 38.68 ± 0.10 | 39.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d16000 | 1381.89 ± 4.16 | | 14543.59 ± 43.73 | 14543.23 ± 43.73 | 14543.59 ± 43.73 |
| Qwen/Qwen3.6-27B | tg128 @ d16000 | 38.14 ± 0.11 | 39.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d32000 | 1328.03 ± 2.82 | | 27181.67 ± 57.72 | 27181.31 ± 57.72 | 27181.67 ± 57.72 |
| Qwen/Qwen3.6-27B | tg128 @ d32000 | 37.13 ± 0.01 | 38.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d64000 | 1219.17 ± 2.61 | | 55856.47 ± 119.00 | 55856.12 ± 119.00 | 55856.47 ± 119.00 |
| Qwen/Qwen3.6-27B | tg128 @ d64000 | 35.18 ± 0.01 | 36.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d128000 | 1036.75 ± 1.70 | | 127414.43 ± 208.98 | 127414.08 ± 208.98 | 127414.43 ± 208.98 |
| Qwen/Qwen3.6-27B | tg128 @ d128000 | 31.72 ± 0.12 | 32.00 ± 0.00 | | | |
-sm graph (ik_llama):
VRAM usage: GPU0: 17.8 / GPU1: 19.2
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| Qwen/Qwen3.6-27B | pp4096 @ d4000 | 1420.56 ± 17.77 | | 5700.41 ± 70.54 | 5699.81 ± 70.54 | 5700.41 ± 70.54 |
| Qwen/Qwen3.6-27B | tg128 @ d4000 | 32.15 ± 0.03 | 33.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d8000 | 1387.88 ± 13.61 | | 8716.90 ± 84.91 | 8716.29 ± 84.91 | 8716.90 ± 84.91 |
| Qwen/Qwen3.6-27B | tg128 @ d8000 | 31.81 ± 0.01 | 33.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d16000 | 1362.43 ± 8.36 | | 14751.24 ± 90.08 | 14750.64 ± 90.08 | 14751.24 ± 90.08 |
| Qwen/Qwen3.6-27B | tg128 @ d16000 | 31.13 ± 0.01 | 32.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d32000 | 1318.72 ± 9.42 | | 27373.72 ± 195.00 | 27373.12 ± 195.00 | 27373.72 ± 195.00 |
| Qwen/Qwen3.6-27B | tg128 @ d32000 | 30.32 ± 0.02 | 31.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d64000 | 1216.07 ± 8.43 | | 55999.88 ± 388.37 | 55999.27 ± 388.37 | 55999.88 ± 388.37 |
| Qwen/Qwen3.6-27B | tg128 @ d64000 | 28.86 ± 0.04 | 30.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d128000 | 1055.71 ± 7.36 | | 125132.30 ± 869.60 | 125131.69 ± 869.60 | 125132.30 ± 869.60 |
| Qwen/Qwen3.6-27B | tg128 @ d128000 | 26.35 ± 0.00 | 27.00 ± 0.00 | | | |
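To put the three tables side by side, here is a small sketch that computes each split mode's throughput relative to -sm row at every depth. All numbers are copied from the tables above; nothing is re-measured. On these figures, tensor comes out at roughly 1.7x row for token generation and graph at roughly 1.4x, while row stays ahead in prompt processing at every depth.

```python
DEPTHS = [4000, 8000, 16000, 32000, 64000, 128000]

# tg128 throughput (t/s) per split mode, in DEPTHS order.
TG = {
    "row": [23.03, 22.83, 22.44, 21.71, 20.41, 18.23],
    "tensor": [38.95, 38.68, 38.14, 37.13, 35.18, 31.72],
    "graph": [32.15, 31.81, 31.13, 30.32, 28.86, 26.35],
}

# pp4096 throughput (t/s) per split mode, in DEPTHS order.
PP = {
    "row": [1732.89, 1766.49, 1756.67, 1670.17, 1481.15, 1195.01],
    "tensor": [1412.73, 1400.96, 1381.89, 1328.03, 1219.17, 1036.75],
    "graph": [1420.56, 1387.88, 1362.43, 1318.72, 1216.07, 1055.71],
}


def ratios_vs_row(table: dict[str, list[float]], mode: str) -> list[float]:
    """Throughput of `mode` divided by the row-split throughput at each depth."""
    return [value / base for value, base in zip(table[mode], table["row"])]


tg_ratios = {mode: ratios_vs_row(TG, mode) for mode in ("tensor", "graph")}
pp_ratios = {mode: ratios_vs_row(PP, mode) for mode in ("tensor", "graph")}

for i, depth in enumerate(DEPTHS):
    print(
        f"d{depth:>6}: "
        f"tg tensor x{tg_ratios['tensor'][i]:.2f}, tg graph x{tg_ratios['graph'][i]:.2f}, "
        f"pp tensor x{pp_ratios['tensor'][i]:.2f}, pp graph x{pp_ratios['graph'][i]:.2f}"
    )
```

Values above 1.0 mean the mode is faster than row at that depth, values below 1.0 mean it is slower.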