Build 9254 fixes my TG regression and adds PDL for NVIDIA GPUs
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I was seeing a TG regression on both MTP and non-MTP models with the last few builds and had to fall back to b9202, but I just ran the new b9254 and TG has been restored, with a bonus 3% uplift on 2x 5060 Ti 16GB using tensor split.
I ran cmake with the PDL flag to give it a shot. I'm going to test without it soon to compare, but I'm getting consistent results: 3k PP and 127 tg/s on qwen3.6-35b-a3b-Q4_K_XL.
I'm not saying PDL is the reason for any of my results, but at least this build is working as well as or better than b9202. Time will tell.
aendk commented 3 weeks ago
Overview
Programmatic Dependent Launch (PDL) is a CUDA optimization for newer NVIDIA GPUs (compute capability >= 9.0, i.e. Hopper and newer; this does not include Ada).
It enables overlapping execution of CUDA kernels in the same CUDA stream. Like CUDA graphs, it reduces kernel launch overhead on the device. The benefits of both are additive (PDL + CG > CG > PDL).
This is best seen in an Nsight Systems screenshot of a single CUDA stream (image not reproduced here), where kernels that would normally be strictly ordered run concurrently.
PDL was already proposed last year in #15479.
This PR integrates better into the CUDA graph semantics and has vastly better performance. On an RTX PRO 6000, a token generation phase speedup of 10% is not unusual; on DGX Spark, I've seen a 4-5% improvement (model dependent, see detailed stats below).
For full PDL performance, kernels need to be equipped with two new features: a synchronization barrier (GGML_CUDA_PDL_SYNC) and a launch signal (GGML_CUDA_PDL_LC). The synchronization barrier makes the kernel wait for the data written by the preceding kernel, so that no race conditions or premature data accesses take place. The launch signal indicates the point from which the current kernel can tolerate the next kernel starting alongside it. Additionally, kernels need to be launched via the new ggml_cuda_kernel_launch() function.
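As a rough illustration of the two features, here is a minimal sketch written against the plain CUDA runtime API rather than the PR's macros. It assumes that the synchronization barrier corresponds to cudaGridDependencySynchronize() and the launch signal to cudaTriggerProgrammaticLaunchCompletion(); the post does not show how GGML_CUDA_PDL_SYNC and GGML_CUDA_PDL_LC are defined, so that mapping is an assumption. The kernel names, launch_pair() and the toy arithmetic are hypothetical. Actual overlap requires a GPU with compute capability 9.0 or higher and a build targeting that architecture.

```cuda
#include <cuda_runtime.h>

// Producer (hypothetical toy kernel): y[i] = 2 * x[i].
__global__ void scale_kernel(const float * x, float * y, int n) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
    // Launch signal: from here on, the dependent kernel is allowed to start.
    cudaTriggerProgrammaticLaunchCompletion();
#endif
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = 2.0f * x[i];
    }
}

// Consumer (hypothetical toy kernel): z[i] = y[i] + 1, so it depends on the producer.
__global__ void add_one_kernel(const float * y, float * z, int n) {
    // Index arithmetic touches no input data, so it may run before the barrier.
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
    // Synchronization barrier: wait until the preceding kernel has completed
    // and its writes to global memory are visible.
    cudaGridDependencySynchronize();
#endif
    if (i < n) {
        z[i] = y[i] + 1.0f;  // first "real" access to the producer's output
    }
}

// Launch both kernels on one stream; the second launch opts into PDL.
cudaError_t launch_pair(const float * x, float * y, float * z, int n, cudaStream_t stream) {
    const dim3 block(256);
    const dim3 grid((n + block.x - 1) / block.x);

    scale_kernel<<<grid, block, 0, stream>>>(x, y, n);

    // This attribute lets the consumer begin before the producer has finished.
    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim          = grid;
    cfg.blockDim         = block;
    cfg.dynamicSmemBytes = 0;
    cfg.stream           = stream;
    cfg.attrs            = &attr;
    cfg.numAttrs         = 1;

    return cudaLaunchKernelEx(&cfg, add_one_kernel, (const float *) y, z, n);
}
```

If the barrier in add_one_kernel were removed while the launch attribute stayed in place, the read of y could happen before the producer has written it, which is the kind of race the barrier is meant to prevent.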
The synchronization barrier can be placed by carefully inspecting the kernel code and identifying the first "real" data access (e.g. excluding pointer arithmetic) of the kernel input. The launch signal placement requires a bit of hand-tuning and benchmarking. In this draft PR, I enrolled all kernels used in gpt-oss 20b, qwen3.5 and nemotron 120B Super. Because these kernels are shared with other models, I've tested more models. I saw speed-ups in almost all models in token generation phases, with prefill/context phases being mostly neutral.
Applied Heuristics:
- In this draft, for the synchronization barrier placement, I assumed the first "real" data access of each kernel to be an input tensor. If there are cases where a preceding kernel outputs a scalar and the current kernel reads this scalar before GGML_CUDA_PDL_SYNC, a data race could occur. Before marking this merge-ready, I will double-check this again. This should be kept in mind when reviewing.
- Correct placement of GGML_CUDA_PDL_LC is a bit of trial and error. This is visible in some kernels where I've commented out suboptimal placements in some commits. In some kernels, placing GGML_CUDA_PDL_LC is even perf-negative (most notably mul_mat_vec_q). Generally, the more latency-limited a kernel is, the earlier the signal can be placed, because the kernel can tolerate more shared-resource contention from the premature launch of the successive kernel.
Further Info on this Implementation
- This approach can be used even if some kernels in the graph are not enrolled in PDL. If two successive kernels are enrolled, they leverage PDL (e.g. quantize_q8 and mul_mat_vec_q are enrolled in PDL and are present in many models).
- Kernels can be enrolled one by one.
- Optimizing the placement of GGML_CUDA_PDL_LC is a bit of trial and error, but good placement for one model appears to be beneficial for other models, too. In internal testing, I did not run into settings that are, for example, beneficial for model A but worse for model B.
Known issues/TODOs
- Currently, there is no tooling like memcheck to identify a race condition caused by an incorrectly placed GGML_CUDA_PDL_SYNC.
- Need to find a way to automatically disable PDL for unsupported (NVIDIA) GPUs. A simple check on GGML_CUDA_CC_HOPPER did not work.
- More kernels can be moved to PDL (different launch + sync barrier).
- Need to remove commented-out launch signal experimentation.
- As with CUDA graphs themselves, it might make sense to roll this feature out for token generation only at first. Need to check if that is feasible.
How to test it
You need a newer NVIDIA GPU (e.g. Blackwell), and you need to compile with -D GGML_CUDA_PDL=ON.
How to enroll other kernels into PDL
- Step 1: Modify the kernel launch to use ggml_cuda_kernel_launch() and set GGML_CUDA_PDL_SYNC(). Modifying the kernel launch without setting the sync barrier leads to a race condition.
- Step 2: Iterate on the placement of GGML_CUDA_PDL_LC(). My loose heuristic was to place it at the function start, measure performance, and then repeat the process for different locations in the middle of the kernel. I then picked the best-performing placement. In my testing, placing it near the bottom of a kernel was almost always unproductive.
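The two steps above can be sketched on a toy kernel. Everything named here is hypothetical: PDL_SYNC() and PDL_LC() stand in for GGML_CUDA_PDL_SYNC() and GGML_CUDA_PDL_LC(), pdl_launch() stands in for ggml_cuda_kernel_launch() (whose real signature is not shown in this post), and the mapping onto the CUDA runtime calls is an assumption. The launch-signal position is a template parameter so that each candidate placement can be timed separately.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-ins for the PR's macros; they compile to no-ops below CC 9.0.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
#define PDL_SYNC() cudaGridDependencySynchronize()
#define PDL_LC()   cudaTriggerProgrammaticLaunchCompletion()
#else
#define PDL_SYNC() ((void) 0)
#define PDL_LC()   ((void) 0)
#endif

enum lc_place { LC_TOP, LC_MIDDLE, LC_BOTTOM };

// Toy kernel with three candidate launch-signal positions, picked at compile time.
template <lc_place place>
__global__ void scale_kernel(const float * src, float * dst, float scale, int n) {
    if (place == LC_TOP) { PDL_LC(); }
    const int i = blockIdx.x * blockDim.x + threadIdx.x;  // index math only
    PDL_SYNC();  // Step 1: barrier before the first read of input data
    const float v = i < n ? src[i] : 0.0f;
    if (place == LC_MIDDLE) { PDL_LC(); }
    if (i < n) {
        dst[i] = v * scale;
    }
    if (place == LC_BOTTOM) { PDL_LC(); }
}

// Hypothetical launch helper: launches `kernel` with programmatic stream
// serialization enabled so it may overlap with its predecessor in the stream.
template <typename... KArgs, typename... Args>
cudaError_t pdl_launch(void (*kernel)(KArgs...), dim3 grid, dim3 block,
                       cudaStream_t stream, Args... args) {
    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim  = grid;
    cfg.blockDim = block;
    cfg.stream   = stream;
    cfg.attrs    = &attr;
    cfg.numAttrs = 1;
    return cudaLaunchKernelEx(&cfg, kernel, args...);
}

// Step 2: time `iters` back-to-back launches of one placement variant.
// Returns elapsed milliseconds, or -1 if a launch failed.
template <lc_place place>
float time_placement(const float * src, float * dst, int n, int iters, cudaStream_t stream) {
    const dim3 block(256);
    const dim3 grid((n + block.x - 1) / block.x);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    bool ok = true;
    cudaEventRecord(start, stream);
    for (int it = 0; it < iters && ok; ++it) {
        ok = pdl_launch(scale_kernel<place>, grid, block, stream, src, dst, 2.0f, n) == cudaSuccess;
    }
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ok ? ms : -1.0f;
}
```

Comparing time_placement<LC_TOP>, time_placement<LC_MIDDLE> and time_placement<LC_BOTTOM> on the same buffers mirrors the measure-and-compare loop of Step 2. A kernel this small is unlikely to show meaningful differences; the point is only the structure of the experiment.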