r/LocalLLaMA · June 8, 2026 · 4 min read

Pipeline parallelism in llama.cpp may be wasting your VRAM

By default, llama.cpp enables pipeline parallelism, presumably to speed up inference. In my testing, pipeline parallelism gave no speed benefit and cost a significant amount of VRAM.

This cost can be avoided by compiling llama.cpp with the -DGGML_SCHED_MAX_COPIES=1 option. This prevents llama.cpp from allocating a much larger compute buffer when pipeline parallelism is enabled.
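A build along these lines should apply the option. This is a minimal sketch, not necessarily the exact commands used for the tests in this post: it assumes a llama.cpp checkout, CMake, and the Vulkan SDK are already installed. The function name and the build-1copy directory are hypothetical placeholders, and GGML_VULKAN=ON should be swapped for whatever backend you use.

```shell
# Sketch: configure and build llama.cpp with a single sched copy.
# Assumes cmake and the Vulkan SDK are installed. The function name and the
# "build-1copy" directory are illustrative and not from the original post.
build_single_copy() {
    src_dir="${1:-.}"                 # path to the llama.cpp checkout
    build_dir="$src_dir/build-1copy"  # hypothetical build directory name

    # Configure: Vulkan backend, and cap the scheduler at one copy.
    cmake -S "$src_dir" -B "$build_dir" \
        -DGGML_VULKAN=ON \
        -DGGML_SCHED_MAX_COPIES=1 || return 1

    # Compile in Release mode.
    cmake --build "$build_dir" --config Release
}
```

Calling `build_single_copy path/to/llama.cpp` leaves the default build untouched, so the two builds can be compared side by side.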

Pipeline parallelism is enabled when --split-mode layer is used (the default) and all model layers and all compute are offloaded to the GPU. With the default build options, llama.cpp then allocates four sched copies instead of one. I don't know exactly what a sched copy is, but it is a significant contributor to the size of the compute buffer in VRAM. Four copies consume significantly more VRAM than one, especially when context cache quantization is used.
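One way to check whether a given build actually turned pipeline parallelism on is to search the server log. The sketch below assumes the log contains a line of the form "pipeline parallelism enabled (n_copies=N)". That wording is an assumption about the log format and may differ between llama.cpp versions. The function name is hypothetical.

```shell
# Sketch: print the sched copy count reported in a llama.cpp log, if any.
# ASSUMPTION: the log has a line like
#   "... pipeline parallelism enabled (n_copies=4)"
# The exact wording may vary by version; adjust the pattern if needed.
pp_copies() {
    log_file="$1"
    # Extract N from the first matching line; print nothing if no line matches.
    sed -n 's/.*pipeline parallelism enabled (n_copies=\([0-9][0-9]*\)).*/\1/p' "$log_file" | head -n 1
}
```

To use it, save the server's output to a file (for example by appending `> server.log 2>&1` to the command) and run `pp_copies server.log`. Empty output means the line was not found, which could mean either that pipeline parallelism is off or that the log format differs.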

I did a whole lot of testing to confirm that, with my setup at least, allocating those four sched copies is a complete waste of VRAM. There is no speedup whatsoever.

edit: Multiple users have pointed out in the comments that pipeline parallelism is beneficial when submitting parallel requests. I didn't test that. If you often submit parallel requests, you can test for yourself and see if the speedup is worth the VRAM cost.

For this test, I compared three builds of llama.cpp, all using the Vulkan backend. The first build used the default option, GGML_SCHED_MAX_COPIES=4. The second used GGML_SCHED_MAX_COPIES=1. The third used GGML_BLAS=ON and GGML_BLAS_VENDOR=OpenBLAS, which I discovered happens to disable pipeline parallelism.

This is the llama.cpp command I ran with each of the three builds:

./llama-server -m models/Qwen3.6-27B-MTP/Qwen3.6-27B-UD-Q5_K_XL.gguf \
    --verbosity 4 \
    --no-op-offload \
    -fa on \
    --mlock \
    -ngl 999 \
    --temp 0.6 --top-k 20 --top-p 0.95 --presence-penalty 0.0 \
    --cache-type-k f16 --cache-type-v q8_0 \
    --host 0.0.0.0

Here are the data I collected over three trials of each configuration:

Configuration	Trial	Input tokens	Input t/s	Output tokens	Output t/s	Compute GPU1 (MB)	Compute GPU2 (MB)	Compute Host (MB)	Context size (tokens)
Pipeline parallelism with 4 sched copies	1	30564	362.66	3524	17.24	1022	910	364	88832
Pipeline parallelism with 4 sched copies	2	30564	362.61	4072	17.24	1023	913	367	88832
Pipeline parallelism with 4 sched copies	3	30564	362.86	4475	17.24	1022	912	366	88576
Pipeline parallelism with 1 sched copy	1	30564	362.99	4100	17.26	242	242	130	113408
Pipeline parallelism with 1 sched copy	2	30564	362.61	4055	17.26	243	243	131	113920
Pipeline parallelism with 1 sched copy	3	30564	362.40	4062	17.27	243	243	131	113920
No pipeline parallelism	1	30564	362.88	3482	17.28	242	242	130	113408
No pipeline parallelism	2	30564	362.93	3969	17.26	243	243	131	113920
No pipeline parallelism	3	30564	363.01	4001	17.26	243	243	131	113920

As you can see, inference speed was virtually identical in all configurations. However, the compute buffer size was much larger with pipeline parallelism and 4 sched copies, which is the llama.cpp default. It consumed an additional 1.5 GB of VRAM with my specific model and settings compared to the other configurations.
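The VRAM figure can be reproduced from the table with simple arithmetic. The sketch below sums the two GPU compute buffers for each configuration and prints the difference in MB. The function name is hypothetical, and the host buffer is left out because it lives in system RAM.

```shell
# Difference in total GPU compute buffer size (MB) between two configurations.
# $1,$2: GPU1 and GPU2 buffers for configuration A
# $3,$4: GPU1 and GPU2 buffers for configuration B
gpu_compute_gap() {
    echo $(( ($1 + $2) - ($3 + $4) ))
}

# Trial 1 rows: 4 sched copies vs 1 sched copy.
gpu_compute_gap 1022 910 242 242   # prints 1448
```

That is 1932 MB against 484 MB, a gap of 1448 MB, in line with the roughly 1.5 GB quoted above.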

The compute buffer bloat seems to be much worse if context cache quantization is used. I tried the same test without the --cache-type-k f16 --cache-type-v q8_0 options and got the following results:

Configuration	Input tokens	Input t/s	Output tokens	Output t/s	Compute GPU1 (MB)	Compute GPU2 (MB)	Compute Host (MB)	Context size (tokens)
Pipeline parallelism with 4 sched copies	30564	333.77	4614	17.07	481	481	327	78592
Pipeline parallelism with 1 sched copy	30564	333.66	4073	17.08	219	219	105	87552
No pipeline parallelism	30564	333.71	4058	17.08	219	219	105	87552

In this test, the compute buffer was "only" about 0.5 GB bigger with pipeline parallelism and 4 sched copies.

The compute buffer bloat with four sched copies and context quantization is so severe that it partially cancels out the VRAM savings of quantizing the cache!
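The context size column gives a rough sense of this. The sketch below treats context size as a proxy for how much VRAM was left for the cache. That is an assumption of this sketch, not something measured directly. It compares how many tokens of context cache quantization gained under each build, using trial 1 values for the quantized runs. The function name is hypothetical.

```shell
# Context tokens gained by enabling cache quantization.
# $1: context size with quantization, $2: context size without it
ctx_gain() {
    echo $(( $1 - $2 ))
}

ctx_gain 113408 87552   # 1 sched copy:   prints 25856
ctx_gain 88832 78592    # 4 sched copies: prints 10240
```

By this measure, quantizing the cache bought about 25,900 extra tokens of context with one sched copy but only about 10,200 with four.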

Of course, all of these findings are specific to my computer and llama.cpp settings. Your results may vary.

Here's more information about my system:

Component	Details
CPU	Intel Core i5-13600K
GPU 1	AMD Radeon RX 6800 XT (16GB)
GPU 2	AMD Radeon RX 6700 XT (12GB)
System RAM	2x16GB DDR5-3200
Operating System	Kubuntu 26.04 HWE

submitted by /u/Warrenio