Build 9254 fixes my TG regression and adds PDL for NVIDIA GPUs

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I was seeing a TG (token generation) regression on both MTP and non-MTP models with the last few builds and had to fall back to b9202. I just ran the new b9254, and TG has been restored, with a bonus 3% uplift on 2x 5060 Ti 16GB using tensor split.

I ran cmake with the PDL flag to give it a shot. I'm going to test without it soon to compare, but I'm getting consistent results: 3k PP and 127 t/s TG on qwen3.6-35b-a3b-Q4_K_XL.

I'm not saying PDL is the reason for any of my results, but at least this build is working as well as or better than b9202. Time will tell.

aendk commented 3 weeks ago

Overview

Programmatic Dependent Launch (PDL) is a CUDA optimization for newer NVIDIA GPUs (compute capability >= 9.0, i.e. Hopper and newer; this does not include Ada).
It enables overlapping execution of CUDA kernels of the same CUDA stream. Like CUDA graphs, it reduces kernel launch overhead on the device. The benefits of both are additive (PDL + CG > CG > PDL).
This can best be seen visually in this Nsight Systems screenshot of a single CUDA stream; kernels which should normally be strictly ordered are run concurrently:

PDL was already proposed last year in #15479.
This PR integrates better into the CUDA graph semantics and has vastly better performance. On an RTX PRO 6000, a token generation speedup of 10% is not unusual; on DGX Spark, I've seen a 4-5% improvement (model dependent, see detailed stats below).

For full PDL performance, kernels need to be equipped with two new features: a synchronization barrier (GGML_CUDA_PDL_SYNC) and a launch signal (GGML_CUDA_PDL_LC). The synchronization barrier makes the kernel wait for the data written by the preceding kernel, so that no race conditions or premature data accesses take place. The launch signal marks the point from which the current kernel can tolerate the next kernel starting alongside it. Additionally, kernels need to be launched via the new ggml_cuda_kernel_launch() function.
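The two features can be illustrated with the plain CUDA runtime primitives for PDL. This is a minimal sketch, not the PR's code: that GGML_CUDA_PDL_SYNC, GGML_CUDA_PDL_LC and ggml_cuda_kernel_launch() wrap `cudaGridDependencySynchronize()`, `cudaTriggerProgrammaticLaunchCompletion()` and `cudaLaunchKernelEx()` is an assumption, and `producer`, `consumer`, `launch` and `run_chain` are hypothetical names. It should be built for compute capability 9.0 or newer (e.g. `nvcc -arch=sm_90`); on older targets the guards compile the PDL calls out and the demo falls back to ordinary stream ordering.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Primary kernel: tmp[i] = 2 * in[i].
__global__ void producer(const float * in, float * tmp, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
    // Launch signal: the next kernel in the stream may begin launching from here on.
    cudaTriggerProgrammaticLaunchCompletion();
#endif
    if (i < n) tmp[i] = 2.0f * in[i];
}

// Dependent kernel: out[i] = tmp[i] + 1.
__global__ void consumer(const float * tmp, float * out, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x; // index math only, no data access
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
    // Barrier: wait until the preceding kernel has completed and its writes are visible.
    cudaGridDependencySynchronize();
#endif
    if (i < n) out[i] = tmp[i] + 1.0f; // first real data access comes after the barrier
}

// Hypothetical launch helper: opts a launch into overlap with its predecessor.
static cudaError_t launch(void (*kernel)(const float *, float *, int),
                          const float * src, float * dst, int n,
                          cudaStream_t stream, bool allow_overlap) {
    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim  = dim3((n + 255) / 256);
    cfg.blockDim = dim3(256);
    cfg.stream   = stream;
    cfg.attrs    = allow_overlap ? &attr : nullptr;
    cfg.numAttrs = allow_overlap ? 1 : 0;
    return cudaLaunchKernelEx(&cfg, kernel, src, dst, n);
}

// Runs producer -> consumer on one stream.
// Returns the number of wrong elements (0 = correct) or -1 on a CUDA error.
int run_chain(int n) {
    cudaFuncAttributes fa;
    if (cudaFuncGetAttributes(&fa, consumer) != cudaSuccess) return -1;
    // Only allow overlap if the barrier was actually compiled into the kernel.
    const bool pdl = fa.ptxVersion >= 90;

    std::vector<float> h(n, 1.0f);
    const size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_tmp = nullptr, *d_out = nullptr;
    cudaStream_t stream = nullptr;
    const bool ok = cudaStreamCreate(&stream) == cudaSuccess
        && cudaMalloc(&d_in,  bytes) == cudaSuccess
        && cudaMalloc(&d_tmp, bytes) == cudaSuccess
        && cudaMalloc(&d_out, bytes) == cudaSuccess
        && cudaMemcpy(d_in, h.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
        && launch(producer, d_in,  d_tmp, n, stream, false) == cudaSuccess
        && launch(consumer, d_tmp, d_out, n, stream, pdl)   == cudaSuccess
        && cudaStreamSynchronize(stream) == cudaSuccess
        && cudaMemcpy(h.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    int bad = 0;
    for (float v : h) bad += (v != 3.0f); // 2 * 1 + 1
    cudaFree(d_in); cudaFree(d_tmp); cudaFree(d_out);
    if (stream) cudaStreamDestroy(stream);
    return ok ? bad : -1;
}
```

Calling `run_chain(1 << 16)` from a `main()` exercises the pair. Note that the early launch signal in `producer` does not weaken correctness: the barrier in `consumer` still waits for the whole preceding kernel, so the signal placement only affects how much of the next kernel's launch and prologue can overlap.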

The synchronization barrier can be placed by carefully inspecting the kernel code and identifying the first "real" data access (e.g. excluding pointer arithmetic) to the kernel input. The launch signal placement requires a bit of hand-tuning and benchmarking. In this draft PR, I enrolled all kernels used in gpt-oss 20b, qwen3.5 and nemotron 120B Super. Because these kernels are shared with other models, I also tested additional models. I saw speed-ups in almost all models in token generation phases, with prefill/context phases being mostly neutral.

Applied Heuristics:

  • In this draft, for the synchronization barrier placement, I assumed that the first "real" data access of each kernel is to an input tensor. If there are cases where a preceding kernel outputs a scalar and the current kernel reads this scalar before GGML_CUDA_PDL_SYNC, a data race could occur. Before marking this merge-ready, I will double-check this again. When reviewing, this should be kept in mind.
  • Correct placement of GGML_CUDA_PDL_LC is a bit of trial and error. This is visible in some kernels where I've commented out suboptimal placements in some commits. In some kernels, placing GGML_CUDA_PDL_LC is even perf negative (most notably mul_mat_vec_q). Generally, the more latency-limited a kernel is, the earlier the signal can be placed, and the more shared-resource contention (caused by the early launch of the successive kernel) the kernel can tolerate.
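The scalar hazard from the first heuristic can be sketched as follows. This is an illustration under assumptions, not code from the PR: `pdl_sync()` is a hypothetical stand-in for GGML_CUDA_PDL_SYNC built on `cudaGridDependencySynchronize()`, and the kernels and `run_scalar_chain` are invented for the example. The racy variant is shown only for contrast and is never launched.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define CK(call) do { if ((call) != cudaSuccess) return -1; } while (0) // sketch: no cleanup on error

// Hypothetical stand-in for GGML_CUDA_PDL_SYNC; a no-op below compute capability 9.0.
__device__ __forceinline__ void pdl_sync() {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
    cudaGridDependencySynchronize();
#endif
}

// Preceding kernel whose only output is a scalar.
__global__ void write_scale(float * scale, float value) {
    if (blockIdx.x == 0 && threadIdx.x == 0) *scale = value;
}

// RACY under PDL: the scalar is dereferenced before the barrier, so it may be
// read while write_scale is still running and yield the stale value.
__global__ void apply_scale_racy(const float * x, const float * scale, float * y, int n) {
    const float s = *scale; // real data access placed too early
    pdl_sync();
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = s * x[i];
}

// Safe: only index arithmetic happens before the barrier.
__global__ void apply_scale(const float * x, const float * scale, float * y, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    pdl_sync();
    const float s = *scale; // read after the preceding kernel's writes are visible
    if (i < n) y[i] = s * x[i];
}

// Runs write_scale -> apply_scale with x filled with ones.
// Returns the number of elements that differ from `value`, or -1 on a CUDA error.
int run_scalar_chain(int n, float value) {
    cudaFuncAttributes fa;
    CK(cudaFuncGetAttributes(&fa, apply_scale));

    std::vector<float> h(n, 1.0f);
    const size_t bytes = n * sizeof(float);
    float *x = nullptr, *scale = nullptr, *y = nullptr;
    cudaStream_t stream;
    CK(cudaStreamCreate(&stream));
    CK(cudaMalloc(&x, bytes));
    CK(cudaMalloc(&y, bytes));
    CK(cudaMalloc(&scale, sizeof(float)));
    CK(cudaMemcpy(x, h.data(), bytes, cudaMemcpyHostToDevice));
    CK(cudaMemset(scale, 0, sizeof(float))); // stale value a racy read could observe

    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;
    cudaLaunchConfig_t cfg = {};
    cfg.gridDim  = dim3((n + 255) / 256);
    cfg.blockDim = dim3(256);
    cfg.stream   = stream;
    cfg.attrs    = &attr;
    cfg.numAttrs = fa.ptxVersion >= 90 ? 1 : 0; // overlap only if the barrier was compiled in

    write_scale<<<1, 32, 0, stream>>>(scale, value);
    CK(cudaGetLastError());
    CK(cudaLaunchKernelEx(&cfg, apply_scale, (const float *) x, (const float *) scale, y, n));
    CK(cudaStreamSynchronize(stream));
    CK(cudaMemcpy(h.data(), y, bytes, cudaMemcpyDeviceToHost));

    int bad = 0;
    for (float v : h) bad += (v != value);
    cudaFree(x); cudaFree(y); cudaFree(scale);
    cudaStreamDestroy(stream);
    return bad;
}
```

When reviewing an enrolled kernel, the check is therefore whether anything before the barrier dereferences memory that a preceding kernel may have written, including small scalar or index buffers, not just the main input tensor.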

Further Info on this Implementation

  • This approach can be used even if some kernels in the graph are not enrolled into PDL. If two successive kernels are both enrolled, they leverage PDL (e.g. quantize_q8 and mul_mat_vec_q are enrolled in PDL and are present in many models).
  • Kernels can be enrolled one-by-one.
  • Optimizing the placement of the GGML_CUDA_PDL_LC flag is a bit of trial and error, but good placement for one model appears to be beneficial for other models, too. In internal testing, I did not run into settings that were beneficial for model A but worse for model B.

Known issues/TODOs

  • Currently, there is no tooling like memcheck to identify a race condition in the case of an incorrectly placed GGML_CUDA_PDL_SYNC.
  • Need to find a way to automatically disable PDL for unsupported (NVIDIA) GPUs. A simple check on GGML_CUDA_CC_HOPPER did not work.
  • More kernels can be moved to PDL (different launch + sync barrier).
  • Need to remove commented out launch signal experimentation.
  • Like for CUDA graphs themselves, it might make sense to roll this feature out for token generation only at first. Need to check if that is feasible.

How to test it

You need a newer NVIDIA GPU (e.g. Blackwell), and you need to compile with -D GGML_CUDA_PDL=ON.

How to enroll other kernels into PDL

  • Step 1: Change the kernel launch to use ggml_cuda_kernel_launch() and place GGML_CUDA_PDL_SYNC() in the kernel. Modifying the kernel launch without setting the sync barrier leads to a race condition.
  • Step 2: Iterate on the placement of GGML_CUDA_PDL_LC(). My loose heuristic was to place it at the function start, measure performance, and then repeat the process for different locations in the middle of the kernel. I then picked the best performing placement. In my testing, placing it near the bottom of a kernel was almost always unproductive.
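The iteration in Step 2 can be sketched as a small timing harness that compiles one kernel with different launch-signal placements and times a chain of dependent launches. This is a toy under assumptions: the raw CUDA primitives stand in for GGML_CUDA_PDL_SYNC and GGML_CUDA_PDL_LC, and `lc_pos`, `step` and `time_chain` are hypothetical names. A real measurement would add warm-up runs and repetitions, and a kernel this small says little about llama.cpp's kernels.

```cuda
#include <cuda_runtime.h>

#define CK(call) do { if ((call) != cudaSuccess) return -1.0f; } while (0) // sketch: no cleanup on error

// Where the launch signal is placed; LC_NONE relies on the implicit signal at kernel end.
enum lc_pos { LC_NONE, LC_START, LC_MIDDLE };

template <lc_pos POS>
__global__ void step(const float * src, float * dst, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
    if (POS == LC_START) cudaTriggerProgrammaticLaunchCompletion();
    cudaGridDependencySynchronize(); // barrier before the first real data access
#endif
    const float v = i < n ? src[i] : 0.0f;
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
    if (POS == LC_MIDDLE) cudaTriggerProgrammaticLaunchCompletion();
#endif
    if (i < n) dst[i] = v + 1.0f;
}

// Times n_kernels dependent launches that ping-pong between two buffers.
// Returns elapsed milliseconds, or -1 on a CUDA error or a wrong result.
template <lc_pos POS>
float time_chain(int n, int n_kernels) {
    cudaFuncAttributes fa;
    CK(cudaFuncGetAttributes(&fa, step<POS>));

    const size_t bytes = n * sizeof(float);
    float *a = nullptr, *b = nullptr;
    cudaStream_t stream;
    cudaEvent_t t0, t1;
    CK(cudaStreamCreate(&stream));
    CK(cudaEventCreate(&t0));
    CK(cudaEventCreate(&t1));
    CK(cudaMalloc(&a, bytes));
    CK(cudaMalloc(&b, bytes));
    CK(cudaMemset(a, 0, bytes));

    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;
    cudaLaunchConfig_t cfg = {};
    cfg.gridDim  = dim3((n + 255) / 256);
    cfg.blockDim = dim3(256);
    cfg.stream   = stream;
    cfg.attrs    = &attr;
    cfg.numAttrs = fa.ptxVersion >= 90 ? 1 : 0; // overlap only if the barrier was compiled in

    CK(cudaEventRecord(t0, stream));
    for (int k = 0; k < n_kernels; ++k) {
        CK(cudaLaunchKernelEx(&cfg, step<POS>, (const float *) a, b, n));
        float * tmp = a; a = b; b = tmp; // the output becomes the next input
    }
    CK(cudaEventRecord(t1, stream));
    CK(cudaEventSynchronize(t1));

    float ms = 0.0f, first = 0.0f;
    CK(cudaEventElapsedTime(&ms, t0, t1));
    CK(cudaMemcpy(&first, a, sizeof(float), cudaMemcpyDeviceToHost));
    cudaFree(a); cudaFree(b);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    cudaStreamDestroy(stream);
    return first == (float) n_kernels ? ms : -1.0f; // each step adds 1
}
```

Comparing `time_chain<LC_NONE>(n, k)`, `time_chain<LC_START>(n, k)` and `time_chain<LC_MIDDLE>(n, k)` for the same `n` and `k` mirrors the procedure described above: try the start of the kernel, then positions in the middle, and keep whichever placement measures fastest.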
submitted by /u/Bulky-Priority6824