llama.cpp releases

b9254

Mirrored from llama.cpp releases for archival readability. Support the source by reading on the original site.

Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (#22522)

  • Adds initial PDL setup.

  • Adds PDL barriers based on simple heuristic: place "sync" before first input pointer access, and "launch" after last write, e.g. to tensors like dst.

  • Further optimization pass of the first half of kernels

  • Optimized PDL barriers for the second batch of kernels

  • Further refinements after rebase.

  • Moves PDL logic to a separate function, removes some whitespace

  • Strips post-hoc PDL logic

  • Adds stream capture PDL setup. Enrolls quantize_q8_1 into PDL to
    overlap its execution with previous kernels

  • Enrolls mul_mat_vec_q, rms_norm_f32 and k_bin_bcast (partly) into PDL

  • Enrolls mmvf, rope, set-rows and topk kernels for gpt-oss into PDL

  • Introduce ggml_cuda_kernel_launch, to abstract away cudaLaunchKernelEx,
    to enable hip/musa compatibility

  • Enrolls cpy_scalar_contiguous, k_get_rows_float and rms_norm_f32

  • Enrolls flash_attn_combine_results

  • Fix: Drops needless and broken check of CUDA arch for PDL. PDL either
    works or is without effect.

  • Enrolls flash-attention kernels into PDL

  • Fix: Inlines ggml_cuda_kernel_launch, and uses perfect forwarding for
    kernel args. This fixes PDL.

  • Perf: Enrolls k_bin_bcast variadic template invocation into PDL, via
    a template alias and template expansion

  • Enrolls all remaining kernels for qwen3-coder-next into PDL

  • Remove all PDL LC calls to create a baseline

  • Added LC according to internal guidance and tested kernel performance.

  • Enrolls missing qwen3.5 kernels passively into PDL.

  • Kernel optimizations (LC signals) for qwen3.5

  • Enrolls ssm-scan kernels into PDL

  • Adds GGML_CUDA_PDL command line option to toggle PDL.

  • Fix: Compilation for Ada and lower, by guarding PDL calls correctly

  • Cleanup: Removes commented out GGML_CUDA_PDL_LC

  • Cleanup: Removes experimental comments

  • Adds 90-virtual to build script so that Hopper GPUs can leverage PDL.

  • Adds stricter checks to enable PDL, adds env-check to disable it, and removes now superfluous compile option to enable PDL.

  • Fix: Correct PDL en/disablement based on device-side arch check. Host
    side check is UB. Required moving from macros to inlined functions

  • Fix: default-disable PDL. Enable by setting GGML_CUDA_ENABLE_PDL=1

  • Enable PDL by default for Hopper+ devices

  • Enrolls softcap_f32 and two flash_attn kernels into PDL.

  • Improves flash attn PDL barrier placement

  • Fix: Perf regression on Ada; excludes Ada and below from PDL launches

  • Improves some sync barrier placements

  • Drops superfluous constructor

  • Adds #endif guard comments

  • Reverts experimental change to top-k-moe.cu, which moved expensive allocations
    in front of the PDL barrier. It did not have a meaningful impact.

  • Replaces GGML_CUDA_DISABLE_PDL with GGML_CUDA_PDL. PDL is disabled
    if and only if GGML_CUDA_PDL=0

  • Revert "Drops superfluous constructor" (this reverts commit 12b1d25).
    Adds const to remaining arguments

  • Cleanup: Removes and fixes some comments and whitespace

  • Clarifies comment of sync-barrier position

  • Relocates and refactors PDL launch functions and accessories

  • Adds error checking to the regular kernel launch path

  • Drops "auto" in favor of "ggml_cuda_kernel_params"

  • Adds "const" to ggml_cuda_kernel_launch_params

  • [Whitespace] Adds final newline to common.cuh to make editorconfig CI job happy
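
Taken together, the later entries describe an enablement policy: PDL is on by default for Hopper and newer devices, Ada and below are excluded, and setting GGML_CUDA_PDL=0 turns it off. The following is a minimal host-side sketch of that decision logic only. The function name pdl_enabled_sketch and its parameters are hypothetical and do not come from llama.cpp. The notes state that the actual en/disablement relies on a device-side arch check, which this sketch does not attempt to reproduce. It also assumes that only the exact string "0" counts as "disabled".

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper for illustration; not llama.cpp's actual code.
// cc_major:  CUDA compute capability major version (9 corresponds to Hopper).
// env_value: value of GGML_CUDA_PDL, or nullptr when the variable is unset.
inline bool pdl_enabled_sketch(int cc_major, const char * env_value) {
    // Ada (compute capability 8.9) and older are excluded from PDL launches.
    if (cc_major < 9) {
        return false;
    }
    // Assumption: PDL is on by default and only the exact value "0" disables it.
    const bool disabled_by_env =
        env_value != nullptr && std::strcmp(env_value, "0") == 0;
    return !disabled_by_env;
}

// Convenience overload that reads the environment variable named in the notes.
inline bool pdl_enabled_sketch(int cc_major) {
    return pdl_enabled_sketch(cc_major, std::getenv("GGML_CUDA_PDL"));
}
```

Under these assumptions, pdl_enabled_sketch(9, nullptr) is true, pdl_enabled_sketch(9, "0") is false, and pdl_enabled_sketch(8, "1") is false because the architecture exclusion takes precedence over the environment variable.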

