Pipeline parallelism in llama.cpp may be wasting your VRAM
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
By default, llama.cpp enables pipeline parallelism, presumably to speed up inference. In my testing, I found that pipeline parallelism has no speed benefit and comes at a significant cost of VRAM.
This cost can be avoided by compiling llama.cpp with the -DGGML_SCHED_MAX_COPIES=1 option. This prevents llama.cpp from allocating a much larger compute buffer when pipeline parallelism is enabled.
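For anyone who wants to try it, a build along these lines is a reasonable starting point. This is a minimal sketch, not the exact commands used for this post: it assumes a CMake build run from the root of a llama.cpp checkout, the Vulkan backend (`-DGGML_VULKAN=ON`), and an output directory named `build`. The `CMAKE_FLAGS` variable is only there for illustration; swap the backend flag for whichever one you normally use.

```shell
# Sketch: configure and build llama.cpp with a single sched copy.
# Assumptions: run from the llama.cpp source root, Vulkan backend,
# "build" as an arbitrary output directory name.
CMAKE_FLAGS="-DGGML_VULKAN=ON -DGGML_SCHED_MAX_COPIES=1"

if [ -f CMakeLists.txt ]; then
    # Word splitting of $CMAKE_FLAGS is intentional here.
    cmake -B build $CMAKE_FLAGS
    cmake --build build --config Release
else
    echo "No CMakeLists.txt here; run this from the llama.cpp source root." >&2
fi
```

Because `GGML_SCHED_MAX_COPIES` is a compile-time setting, an existing build directory needs to be reconfigured and rebuilt for the change to take effect.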
Pipeline parallelism is enabled when --split-mode layer is used (the default) and all model layers and all compute are offloaded to the GPU. If compiled with the default options, llama.cpp allocates four sched copies instead of one when pipeline parallelism is enabled. I don't know exactly what a sched copy is, but it's a significant contributor to the size of the compute buffer in VRAM. Four copies consume significantly more VRAM than one, especially when context cache quantization is used.
I did a whole lot of testing to confirm that, with my setup at least, allocating those four sched copies is a complete waste of VRAM. There is no speedup whatsoever.
edit: Multiple users have pointed out in the comments that pipeline parallelism is beneficial when submitting parallel requests. I didn't test that. If you often submit parallel requests, you can test for yourself and see if the speedup is worth the VRAM cost.
For this test, I compared three builds of llama.cpp, all using the Vulkan backend. The first build used the default option, GGML_SCHED_MAX_COPIES=4. The second used GGML_SCHED_MAX_COPIES=1. The third used GGML_BLAS=ON and GGML_BLAS_VENDOR=OpenBLAS, which, I discovered, happens to disable pipeline parallelism.
This is the llama.cpp command I ran with each of the three builds:

    ./llama-server -m models/Qwen3.6-27B-MTP/Qwen3.6-27B-UD-Q5_K_XL.gguf \
        --verbosity 4 \
        --no-op-offload \
        -fa on \
        --mlock \
        -ngl 999 \
        --temp 0.6 --top-k 20 --top-p 0.95 --presence-penalty 0.0 \
        --cache-type-k f16 --cache-type-v q8_0 \
        --host 0.0.0.0

Here are the data I collected after three trials:
| Configuration | Trial | Input tokens | Input t/s | Output tokens | Output t/s | Compute GPU1 (MB) | Compute GPU2 (MB) | Compute Host (MB) | Context size (tokens) |
|---|---|---|---|---|---|---|---|---|---|
| Pipeline parallelism with 4 sched copies | 1 | 30564 | 362.66 | 3524 | 17.24 | 1022 | 910 | 364 | 88832 |
| Pipeline parallelism with 4 sched copies | 2 | 30564 | 362.61 | 4072 | 17.24 | 1023 | 913 | 367 | 88832 |
| Pipeline parallelism with 4 sched copies | 3 | 30564 | 362.86 | 4475 | 17.24 | 1022 | 912 | 366 | 88576 |
| Pipeline parallelism with 1 sched copy | 1 | 30564 | 362.99 | 4100 | 17.26 | 242 | 242 | 130 | 113408 |
| Pipeline parallelism with 1 sched copy | 2 | 30564 | 362.61 | 4055 | 17.26 | 243 | 243 | 131 | 113920 |
| Pipeline parallelism with 1 sched copy | 3 | 30564 | 362.40 | 4062 | 17.27 | 243 | 243 | 131 | 113920 |
| No pipeline parallelism | 1 | 30564 | 362.88 | 3482 | 17.28 | 242 | 242 | 130 | 113408 |
| No pipeline parallelism | 2 | 30564 | 362.93 | 3969 | 17.26 | 243 | 243 | 131 | 113920 |
| No pipeline parallelism | 3 | 30564 | 363.01 | 4001 | 17.26 | 243 | 243 | 131 | 113920 |
As you can see, inference speed was virtually identical in all configurations. However, the compute buffer was much larger with pipeline parallelism and 4 sched copies, which is the llama.cpp default. With my specific model and settings, it consumed about 1.45 GB of additional VRAM (1,448 MB across both GPUs in trial 1) compared to the other configurations, plus roughly 234 MB more on the host.
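The difference can be checked with a bit of shell arithmetic on the table values. This sketch only uses the trial 1 numbers from the table above; the variable names are made up for illustration.

```shell
# Compute buffer sizes in MB, copied from trial 1 of the table above.
four_gpu1=1022; four_gpu2=910; four_host=364   # 4 sched copies (default)
one_gpu1=242;   one_gpu2=242;  one_host=130    # 1 sched copy

# Extra VRAM across both GPUs, and extra host memory, with 4 sched copies.
vram_delta=$(( (four_gpu1 + four_gpu2) - (one_gpu1 + one_gpu2) ))
host_delta=$(( four_host - one_host ))

# Difference in the reported context size (tokens), trial 1.
ctx_delta=$(( 113408 - 88832 ))

echo "Extra VRAM with 4 sched copies: ${vram_delta} MB"          # 1448 MB
echo "Extra host memory with 4 sched copies: ${host_delta} MB"   # 234 MB
echo "Context size difference: ${ctx_delta} tokens"              # 24576 tokens
```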
The compute buffer bloat seems to be much worse if context cache quantization is used. I tried the same test without the --cache-type-k f16 --cache-type-v q8_0 options and got the following results:
| Configuration | Input tokens | Input t/s | Output tokens | Output t/s | Compute GPU1 (MB) | Compute GPU2 (MB) | Compute Host (MB) | Context size (tokens) |
|---|---|---|---|---|---|---|---|---|
| Pipeline parallelism with 4 sched copies | 30564 | 333.77 | 4614 | 17.07 | 481 | 481 | 327 | 78592 |
| Pipeline parallelism with 1 sched copy | 30564 | 333.66 | 4073 | 17.08 | 219 | 219 | 105 | 87552 |
| No pipeline parallelism | 30564 | 333.71 | 4058 | 17.08 | 219 | 219 | 105 | 87552 |
In this test, the compute buffer was "only" about 0.5 GB bigger with pipeline parallelism and 4 sched copies.
The compute buffer bloat with four sched copies and context quantization is so severe that it partially cancels out the VRAM savings of quantizing the cache!
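To put a rough number on that, the sketch below compares how much the reported context size grew from quantizing the cache under each build. It assumes the "Context size" column reflects how much context fit in VRAM for that run, which is a reading of the tables rather than something measured separately. The values are copied from the two tables above (trial 1 for the first table), and the variable names are illustrative.

```shell
# Context sizes in tokens, copied from the tables above.
ctx_quant_4=88832     # with --cache-type-v q8_0, 4 sched copies
ctx_quant_1=113408    # with --cache-type-v q8_0, 1 sched copy
ctx_plain_4=78592     # without the cache-type options, 4 sched copies
ctx_plain_1=87552     # without the cache-type options, 1 sched copy

# Context gained by quantizing the cache, per build.
gain_4=$(( ctx_quant_4 - ctx_plain_4 ))
gain_1=$(( ctx_quant_1 - ctx_plain_1 ))

# Quantized cache with 4 copies vs. unquantized cache with 1 copy.
cross=$(( ctx_quant_4 - ctx_plain_1 ))

echo "Gain from quantizing, 4 sched copies: ${gain_4} tokens"   # 10240
echo "Gain from quantizing, 1 sched copy:   ${gain_1} tokens"   # 25856
echo "Quantized + 4 copies vs. plain + 1 copy: ${cross} tokens" # 1280
```

On these numbers, the default build with a quantized cache ends up with only about 1,280 more tokens of context than the single-copy build with no cache quantization at all.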
Of course, all of these findings are specific to my computer and llama.cpp settings. Your results may vary.
Here's more information about my system:
| Component | Details |
|---|---|
| CPU | Intel Core i5-13600K |
| GPU 1 | AMD Radeon RX 6800 XT (16GB) |
| GPU 2 | AMD Radeon RX 6700 XT (12GB) |
| System RAM | 2x16GB DDR5-3200 |
| Operating System | Kubuntu 26.04 HWE |