r/LocalLLaMA · · 4 min read

Pipeline parallelism in llama.cpp may be wasting your VRAM


By default, llama.cpp enables pipeline parallelism, presumably to speed up inference. In my testing, I found that pipeline parallelism has no speed benefit and comes at a significant cost of VRAM.

This cost can be avoided by compiling llama.cpp with the -DGGML_SCHED_MAX_COPIES=1 option. This prevents llama.cpp from allocating a much larger compute buffer when pipeline parallelism is enabled.
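For reference, a build with a single sched copy might look like the following. This is only a sketch: it assumes a standard CMake build run from the root of a llama.cpp checkout and the Vulkan backend used later in this post, so substitute your own backend flag and any other options you normally pass.

```shell
# Run from the root of a llama.cpp checkout.
# -DGGML_VULKAN=ON is an assumption matching the Vulkan backend used in this post.
# -DGGML_SCHED_MAX_COPIES=1 is the option discussed above (the default is 4).
cmake -B build -DGGML_VULKAN=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j
```

Building into a separate directory (for example `-B build-copies1`) keeps the default build around for a side-by-side comparison.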

Pipeline parallelism is enabled when --split-mode layer is used (the default) and all model layers and all compute are offloaded to the GPU. If compiled with the default options, llama.cpp allocates four sched copies instead of one when pipeline parallelism is enabled. I don't know exactly what a sched copy is, but it's a significant contributor to the size of the compute buffer in VRAM. Four copies consume significantly more VRAM, especially when context cache quantization is used.
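To check what a given build is doing, one option is to save the server log and filter it for the relevant lines. The helper below is a hypothetical sketch: the function name is mine, and the search strings are an assumption about how llama.cpp words its startup messages, which may differ between versions.

```shell
# Hypothetical helper: print log lines that mention pipeline parallelism
# or compute buffer sizes. The patterns are assumed wording, not a stable API.
show_sched_info() {
  grep -iE 'pipeline parallelism|compute buffer' "$1"
}
```

Usage would be something like starting the server with `./llama-server ... > server.log 2>&1` and then running `show_sched_info server.log` once the model has loaded.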

I did a whole lot of testing to confirm that, with my setup at least, allocating those four sched copies is a complete waste of VRAM. There is no speedup whatsoever.

edit: Multiple users have pointed out in the comments that pipeline parallelism is beneficial when submitting parallel requests. I didn't test that. If you often submit parallel requests, you can test for yourself and see if the speedup is worth the VRAM cost.

For this test, I compared three builds of llama.cpp, all using the Vulkan backend. The first build used the default option, GGML_SCHED_MAX_COPIES=4. The second used GGML_SCHED_MAX_COPIES=1. The third used GGML_BLAS=ON GGML_BLAS_VENDOR=OpenBLAS which, I discovered, coincidentally disables pipeline parallelism.

This is the llama.cpp command I ran with each of the three builds:

    ./llama-server -m models/Qwen3.6-27B-MTP/Qwen3.6-27B-UD-Q5_K_XL.gguf \
        --verbosity 4 \
        --no-op-offload \
        -fa on \
        --mlock \
        -ngl 999 \
        --temp 0.6 --top-k 20 --top-p 0.95 --presence-penalty 0.0 \
        --cache-type-k f16 --cache-type-v q8_0 \
        --host 0.0.0.0

Here are the data I collected after three trials:

| Configuration | Trial | Input tokens | Input t/s | Output tokens | Output t/s | Compute GPU1 (MB) | Compute GPU2 (MB) | Compute Host (MB) | Context size (tokens) |
|---|---|---|---|---|---|---|---|---|---|
| Pipeline parallelism with 4 sched copies | 1 | 30564 | 362.66 | 3524 | 17.24 | 1022 | 910 | 364 | 88832 |
| Pipeline parallelism with 4 sched copies | 2 | 30564 | 362.61 | 4072 | 17.24 | 1023 | 913 | 367 | 88832 |
| Pipeline parallelism with 4 sched copies | 3 | 30564 | 362.86 | 4475 | 17.24 | 1022 | 912 | 366 | 88576 |
| Pipeline parallelism with 1 sched copy | 1 | 30564 | 362.99 | 4100 | 17.26 | 242 | 242 | 130 | 113408 |
| Pipeline parallelism with 1 sched copy | 2 | 30564 | 362.61 | 4055 | 17.26 | 243 | 243 | 131 | 113920 |
| Pipeline parallelism with 1 sched copy | 3 | 30564 | 362.40 | 4062 | 17.27 | 243 | 243 | 131 | 113920 |
| No pipeline parallelism | 1 | 30564 | 362.88 | 3482 | 17.28 | 242 | 242 | 130 | 113408 |
| No pipeline parallelism | 2 | 30564 | 362.93 | 3969 | 17.26 | 243 | 243 | 131 | 113920 |
| No pipeline parallelism | 3 | 30564 | 363.01 | 4001 | 17.26 | 243 | 243 | 131 | 113920 |

As you can see, inference speed was virtually identical in all configurations. However, the compute buffer was much larger with pipeline parallelism and 4 sched copies, which is the llama.cpp default. With my specific model and settings, it consumed about 1.45 GB more VRAM than the other configurations (roughly 1,932 MB versus 484 MB, summed across both GPUs in trial 1).
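The overhead can be tallied directly from the trial 1 rows of the table. This is just shell arithmetic on the reported numbers; the variable names are mine.

```shell
# Compute buffer sizes in MB (GPU1 + GPU2), trial 1 of the table above.
four_copies=$((1022 + 910))   # default build, 4 sched copies
one_copy=$((242 + 242))       # GGML_SCHED_MAX_COPIES=1
extra=$((four_copies - one_copy))

# Difference in the "Context size (tokens)" column for the same two rows.
ctx_diff=$((113408 - 88832))

echo "4 copies: ${four_copies} MB, 1 copy: ${one_copy} MB, extra: ${extra} MB"
echo "context size difference: ${ctx_diff} tokens"
# prints:
# 4 copies: 1932 MB, 1 copy: 484 MB, extra: 1448 MB
# context size difference: 24576 tokens
```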

The compute buffer bloat seems to be much worse if context cache quantization is used. I tried the same test without the --cache-type-k f16 --cache-type-v q8_0 options and got the following results:

| Configuration | Input tokens | Input t/s | Output tokens | Output t/s | Compute GPU1 (MB) | Compute GPU2 (MB) | Compute Host (MB) | Context size (tokens) |
|---|---|---|---|---|---|---|---|---|
| Pipeline parallelism with 4 sched copies | 30564 | 333.77 | 4614 | 17.07 | 481 | 481 | 327 | 78592 |
| Pipeline parallelism with 1 sched copy | 30564 | 333.66 | 4073 | 17.08 | 219 | 219 | 105 | 87552 |
| No pipeline parallelism | 30564 | 333.71 | 4058 | 17.08 | 219 | 219 | 105 | 87552 |

In this test, the compute buffer was "only" about 0.5 GB bigger with pipeline parallelism and 4 sched copies.

The compute buffer bloat with four sched copies and context quantization is so severe that it partially cancels out the VRAM savings of quantizing the cache!

Of course, all of these findings are specific to my computer and llama.cpp settings. Your results may vary.

Here's more information about my system:

| Component | Details |
|---|---|
| CPU | Intel Core i5-13600K |
| GPU 1 | AMD Radeon RX 6800 XT (16GB) |
| GPU 2 | AMD Radeon RX 6700 XT (12GB) |
| System RAM | 2x16GB DDR5-3200 |
| Operating System | Kubuntu 26.04 HWE |

submitted by /u/Warrenio