r/LocalLLaMA

Blackwell and PDL performance increase


llama.cpp recently added support for Programmatic Dependent Launch (PDL), a feature of newer Nvidia GPUs with compute capability 9.0 or higher (Hopper and Blackwell; Ada, at 8.9, is not included). See PR 22522.

In short, PDL lets a kernel begin launching before the kernel it depends on has fully finished, which hides launch overhead between back-to-back kernels and improves performance. It is not enabled by default yet, so if you don't know about it, you will likely miss it.

To enable PDL, build llama.cpp with the '-D GGML_CUDA_PDL=ON' flag. Not all kernels use PDL yet, so there is likely more performance to be had as more kernels are converted.

(If you later need to disable PDL, run 'export GGML_CUDA_PDL=0' before starting llama.cpp.)
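Since the switch is an environment variable, one PDL build can be used for an A/B comparison. Here is a rough Python sketch of that idea, not something from the PR. The helper names (pdl_env, bench_command, run_bench) are made up for illustration, the binary and model paths are placeholders, and it assumes that leaving GGML_CUDA_PDL unset keeps PDL active in a build made with the flag.

```python
import os
import subprocess


def pdl_env(enabled, base_env=None):
    """Return a copy of the environment with PDL left on or switched off.

    Assumption: in a build made with GGML_CUDA_PDL=ON, an unset variable
    leaves PDL active and GGML_CUDA_PDL=0 disables it, as the post describes.
    """
    env = dict(os.environ if base_env is None else base_env)
    if enabled:
        env.pop("GGML_CUDA_PDL", None)
    else:
        env["GGML_CUDA_PDL"] = "0"
    return env


def bench_command(bench_path, model_path):
    """Build a llama-bench command line for the pp512 and tg128 tests."""
    return [bench_path, "-m", model_path, "-p", "512", "-n", "128"]


def run_bench(bench_path, model_path, pdl_enabled):
    """Run llama-bench once and return its text output."""
    result = subprocess.run(
        bench_command(bench_path, model_path),
        env=pdl_env(pdl_enabled),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Something like run_bench("build/bin/llama-bench", "model.gguf", pdl_enabled=False) followed by the same call with pdl_enabled=True would give the two sets of numbers to compare; both paths are placeholders for your own setup.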

Benchmarks

| Model | pp512 (t/s) | tg128 (t/s) | pp512 with PDL (t/s) | tg128 with PDL (t/s) | pp gain (%) | tg gain (%) |
|---|---|---|---|---|---|---|
| Qwen 3.6 35B.A3B MXFP4 | 5412.39 ± 62.58 | 172.72 ± 3.94 | 5416.55 ± 58.92 | 183.03 ± 0.93 | 0 | 5.97 |
| Qwen 3.6 35B.A3B UD-Q5_K_XL | 4564.77 ± 47.55 | 162.24 ± 6.67 | 4582.22 ± 45.65 | 177.11 ± 1.29 | 0 | 9.17 |
| Gemma 4 26B.A4B NVFP4 | 6728.74 ± 89.56 | 107.39 ± 2.44 | 6850.46 ± 97.86 | 112.71 ± 0.38 | 1.8 | 4.95 |
| Qwen 3.6 27B NVFP4 | 2687.16 ± 70.18 | 41.31 ± 0.03 | 2708.97 ± 55.56 | 42.22 ± 0.05 | 0 | 2.2 |

(All tests were run with build b9282 on an RTX Pro 4500 Blackwell 32GB. Results are the best of two runs.)
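For anyone who wants to check the tg % column, here is a small Python sketch that recomputes it as the relative change between the two tg128 means. The numbers are copied from the table; the names TG128 and gain_percent are just illustrative, and the ± spreads are ignored.

```python
# Mean tg128 throughput in t/s, copied from the table: (without PDL, with PDL).
TG128 = {
    "Qwen 3.6 35B.A3B MXFP4": (172.72, 183.03),
    "Qwen 3.6 35B.A3B UD-Q5_K_XL": (162.24, 177.11),
    "Gemma 4 26B.A4B NVFP4": (107.39, 112.71),
    "Qwen 3.6 27B NVFP4": (41.31, 42.22),
}


def gain_percent(baseline, with_pdl):
    """Relative change in throughput, in percent."""
    return (with_pdl - baseline) / baseline * 100.0


gains = {model: gain_percent(base, pdl) for model, (base, pdl) in TG128.items()}
mean_gain = sum(gains.values()) / len(gains)

for model, gain in gains.items():
    print(f"{model}: {gain:.2f}%")
print(f"mean tg128 gain: {mean_gain:.2f}%")
```

The per-model values match the tg % column, and the unweighted mean comes out at roughly 5.6%.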

Conclusion

There is virtually no difference in pre-fill (pp512). Token generation (tg128), however, improves by 5% to 6% on average in the tests above, ranging from 2.2% to 9.17% depending on the model. According to the PR, an improvement of somewhere between 4% and 10% on token generation is expected.

As mentioned, PDL is not enabled by default when building. If you are on Blackwell, this is a free lunch and worth trying out.

submitted by /u/UncleRedz