llama.cpp releases · June 27, 2026 · 1 min read

b9827

Mirrored from llama.cpp releases for archival readability. Support the source by reading on the original site.

[CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy (#25057)

Add a fast path to the CUDA ggml_cpy implementation for same-type, same-shape strided copies that amount to 2D pitched block copies.
When a tensor is not fully contiguous but each of its rows is, the copy now goes through cudaMemcpy2DAsync instead of the slow element-wise scalar copy kernel.

This fixes the GDN recurrent snapshot update with -np 4, where rollback slots are separated by cache stride gaps.
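
The eligibility check described above can be sketched on the host as follows. This is an illustration only, not the code from the PR: the View and Pitched2D structs and the helpers splits_at, as_pitched_2d and copy_2d_host are hypothetical names, and the check is deliberately conservative (it assumes 4 dimensions with byte strides, and it rejects some layouts with size-1 dimensions that a real implementation might accept). The row loop in copy_2d_host stands in for what a single cudaMemcpy2DAsync call would do on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

// Hypothetical, simplified tensor view (not ggml's struct): ne[] holds element
// counts and nb[] holds byte strides per dimension.
struct View {
    int64_t ne[4];
    size_t  nb[4];
    size_t  elem_size;
};

// A copy of `height` rows, `width` bytes each, with a constant pitch per side.
struct Pitched2D {
    size_t width;
    size_t height;
    size_t src_pitch;
    size_t dst_pitch;
};

// True if dimensions below k are densely packed (one contiguous "row") and
// dimensions k and above advance by a uniform pitch of nb[k].
static bool splits_at(const View& v, int k) {
    size_t expect = v.elem_size;
    for (int i = 0; i < 4; ++i) {
        if (i == k) expect = v.nb[k];  // the one dimension allowed to carry a gap
        if (v.nb[i] != expect) return false;
        expect *= static_cast<size_t>(v.ne[i]);
    }
    return true;
}

// Returns a 2D description if a same-type, same-shape copy between the two
// views is a pitched block copy; std::nullopt means "use the generic kernel".
static std::optional<Pitched2D> as_pitched_2d(const View& src, const View& dst) {
    if (src.elem_size != dst.elem_size) return std::nullopt;
    for (int i = 0; i < 4; ++i) {
        if (src.ne[i] != dst.ne[i]) return std::nullopt;
    }
    for (int k = 1; k < 4; ++k) {
        if (!splits_at(src, k) || !splits_at(dst, k)) continue;
        size_t width = src.elem_size;
        for (int i = 0; i < k; ++i) width *= static_cast<size_t>(src.ne[i]);
        size_t height = 1;
        for (int i = k; i < 4; ++i) height *= static_cast<size_t>(src.ne[i]);
        if (src.nb[k] < width || dst.nb[k] < width) continue;  // rows would overlap
        return Pitched2D{width, height, src.nb[k], dst.nb[k]};
    }
    return std::nullopt;
}

// Host stand-in for the device copy: one memcpy per row.
static void copy_2d_host(void* dst, const void* src, const Pitched2D& p) {
    assert(p.width <= p.src_pitch && p.width <= p.dst_pitch);
    auto* d = static_cast<unsigned char*>(dst);
    auto* s = static_cast<const unsigned char*>(src);
    for (size_t r = 0; r < p.height; ++r) {
        std::memcpy(d + r * p.dst_pitch, s + r * p.src_pitch, p.width);
    }
}
```

With device pointers, the fields of Pitched2D map directly onto the CUDA runtime call cudaMemcpy2DAsync(dst, dst_pitch, src, src_pitch, width, height, cudaMemcpyDeviceToDevice, stream), where width and both pitches are given in bytes.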

Add tests that exercise the new strided copy fast path.
Report strided copies as unsupported in the OpenVINO backend, because the new tests fail there.

macOS/iOS:

macOS Apple Silicon (arm64)
macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
macOS Intel (x64)
iOS XCFramework

Linux:

Android:

Android arm64 (CPU)

Windows:

openEuler:

DISABLED
openEuler x86 (310p)
openEuler x86 (910b, ACL Graph)
openEuler aarch64 (310p)
openEuler aarch64 (910b, ACL Graph)

UI:
