llama.cpp releases
-
llama.cpp releases dev-tools 1mo ago
b9313
ggml : Parallelize quant LUT init ( #23595 ) Use OpenMP to parallelize iq2xs_init_impl and iq3xs_init_impl. Move the OpenMP detection from ggml-cpu to ggml-base. Update OpenMP dependencies in ggml-config.cmake.in. macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon…
28 -
llama.cpp releases dev-tools 1mo ago
b9311
vendor : update cpp-httplib to 0.45.1 ( #23639 ) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU) Ubuntu x64 (Vulkan) Ubuntu arm64 (Vulkan) Ubuntu…
25 -
llama.cpp releases dev-tools 1mo ago
b9310
server: fix checkpoints creation ( #22929 ) common : add common_chat_split_by_role cont : fix spans to reach end of message server: fix checkpoints creation extract message_spans from chat templates find the prompt token position before the latest user message split prompt…
36 -
llama.cpp releases dev-tools 1mo ago
b9309: perplexity : fix even more integer overflows (#23623)
Co-authored-by: Stanisław Szymczyk
30 -
llama.cpp releases dev-tools 1mo ago
b9305
cmake : fix ui build ( #23592 ) cmake/ui : add -fPIC to llama-ui static lib cmake : rename host compiled embed helper macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU)…
29 -
llama.cpp releases dev-tools 1mo ago
b9301
hexagon: apply repl optimization in flash attn softmax as #22993 ( #23 …
27 -
llama.cpp releases dev-tools 1mo ago
b9297
model : add NVFP4 MTP scale tensors ( #23563 ) Add NVFP4 MTP scale tensors Link Qwen3.5 MTP tensors Aligned nullptr macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU)…
11 -
llama.cpp releases dev-tools 1mo ago
b9296
ggml : Check the right iface method before using the fallback 2d get ( #23514 ) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU) Ubuntu x64…
9 -
llama.cpp releases dev-tools 1mo ago
b9295
vulkan: fix windows find_package of SPIRV-Headers ( #23215 ) vulkan: fix windows find_package of SPIRV-Headers not windows-only macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu…
17 -
llama.cpp releases dev-tools 1mo ago
b9294
opencl: generalize Adreno MoE kernels on M ( #23449 ) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU) Ubuntu x64 (Vulkan) Ubuntu arm64 (Vulkan)…
17 -
llama.cpp releases dev-tools 1mo ago
b9291
SYCL: improve MoE prefill throughput ( #23142 ) change k_copy_src1_to_contiguous so that it uses a precomputed contiguous mapping where all rows "owned" by an expert are in one slice with known starts and ends switch the O(n_as * n_routed_rows) contraption to a counting sort-based…
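The counting-sort idea mentioned above can be illustrated with a minimal sketch. This is not the SYCL kernel from the PR: the function name `group_rows_by_expert` and its inputs are hypothetical, and the sketch only shows how one pass of counting plus a prefix sum yields a contiguous slice per expert in O(n_rows + n_experts) time.

```python
def group_rows_by_expert(expert_ids, n_experts):
    """Return (starts, order) so that order[starts[e]:starts[e + 1]]
    lists the routed rows assigned to expert e, in original row order."""
    # Pass 1: count how many rows each expert owns.
    counts = [0] * n_experts
    for e in expert_ids:
        counts[e] += 1

    # Exclusive prefix sum gives the start of each expert's slice;
    # starts[n_experts] is the total number of rows.
    starts = [0] * (n_experts + 1)
    for e in range(n_experts):
        starts[e + 1] = starts[e] + counts[e]

    # Pass 2: scatter each row index into its expert's slice.
    cursor = starts[:n_experts]  # next free position per expert (a copy)
    order = [0] * len(expert_ids)
    for row, e in enumerate(expert_ids):
        order[cursor[e]] = row
        cursor[e] += 1
    return starts, order


# Hypothetical routing: row i is assigned to expert expert_ids[i].
starts, order = group_rows_by_expert([2, 0, 2, 1, 0], n_experts=3)
print(starts, order)
```

Each expert's rows then sit in one slice with a known start and end, so there is no need to scan every routed row once per expert.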
27 -
llama.cpp releases dev-tools 1mo ago
b9292
perplexity : fix integer overflow ( #23496 ) Co-authored-by: Stanisław Szymczyk macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU)…
13 -
llama.cpp releases dev-tools 1mo ago
b9290
sycl : Level Zero detection in ggml_sycl_init ( #23097 ) [SYCL] Centralize Level Zero detection in ggml_sycl_init use the same wording get back the warning macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework…
15 -
llama.cpp releases dev-tools 1mo ago
b9289
SYCL : gated_delta_net K>1 ( #23174 ) sycl_gated_delta_net K>1 editor_config macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU) Ubuntu x64 (Vulkan)…
34 -
llama.cpp releases dev-tools 1mo ago
b9286
ggml-zendnn : add Q8_0 quantization support ( #23414 ) ggml-zendnn : add Q8_0 quantization support ggml-zendnn : sync with latest ZenDNN ggml-zendnn : address review comments for Q8_0 macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS…
15 -
llama.cpp releases dev-tools 1mo ago
b9285
cmake : build router app only during standalone builds ( #23521 ) Co-authored-by: Stanisław Szymczyk macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64…
17 -
llama.cpp releases dev-tools 1mo ago
b9284
vocab : fix HybridDNA tokenizer ( #23466 ) vocab : mark hybriddna k-mers to avoid BPE token collisions improved loop Co-authored-by: Sigbjørn Skjæret macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel…
9 -
llama.cpp releases dev-tools 1mo ago
b9283
cmake : add install() for impl libraries + fix apple builds ( #23511 ) pi : update ci : fix ios build ci : fix android ci : fix apple builds cmake : add install() for impl libraries Add install(TARGETS LIBRARY) for all -impl libraries that were changed from STATIC to shared…
14 -
llama.cpp releases dev-tools 1mo ago
b9279
vulkan: fuse snake activation (mul, sin, sqr, mul, add) ( #22855 ) vulkan: fuse snake activation (mul, sin, sqr, mul, add) Add snake.comp shader with F32 / F16 / BF16 pipelines and ggml_vk_snake_dispatch_fused. The matcher recognizes the naive 5 op decomposition emitted by audio…
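As a rough illustration of what fusing this five-op chain means, the sketch below assumes the commonly used snake definition, snake(x) = x + sin²(αx)/α, and assumes the ops map onto it in the listed order (mul, sin, sqr, mul, add). The exact operands matched by the Vulkan backend are not shown in this excerpt, and the Python function names are hypothetical.

```python
import math


def snake_naive(x, alpha, inv_alpha):
    """Five separate steps, mirroring the unfused op sequence."""
    t = alpha * x      # mul
    t = math.sin(t)    # sin
    t = t * t          # sqr
    t = t * inv_alpha  # mul (by a precomputed 1/alpha, an assumption)
    return x + t       # add


def snake_fused(x, alpha):
    """Same result computed in one pass, as a fused kernel would."""
    s = math.sin(alpha * x)
    return x + s * s / alpha


x, alpha = 0.5, 2.0
print(snake_naive(x, alpha, 1.0 / alpha), snake_fused(x, alpha))
```

On a GPU the benefit of fusion is that the intermediate tensors never round-trip through memory; the arithmetic itself is unchanged.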
23 -
llama.cpp releases dev-tools 1mo ago
b9277
tests : move save-load-state from examples to tests ( #23336 ) tests : move save-load-state from examples to tests Move examples/save-load-state/ to tests/test-save-load-state.cpp Remove subdirectory reference from examples/CMakeLists.txt Add test to tests/CMakeLists.txt as a…
25 -
llama.cpp releases dev-tools 1mo ago
b9276
server: expose prompt token counts in /slots endpoint ( #23454 ) Add n_prompt_tokens, n_prompt_tokens_processed, and n_prompt_tokens_cache to the /slots JSON response. These fields are already tracked internally but were not exposed, making it impossible for clients to monitor…
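A client could use the new fields along these lines. The three field names come from the release note; everything else in this sketch is an assumption: the sample payload is made up, the response is assumed to be a JSON array of slot objects with an `id`, `n_prompt_tokens_cache` is assumed to count prompt tokens served from cache, and `summarize_slots` is a hypothetical helper.

```python
import json

# Made-up example of what a /slots response body might look like.
SAMPLE_SLOTS = """
[
  {"id": 0, "n_prompt_tokens": 120, "n_prompt_tokens_processed": 30, "n_prompt_tokens_cache": 90},
  {"id": 1, "n_prompt_tokens": 0, "n_prompt_tokens_processed": 0, "n_prompt_tokens_cache": 0}
]
"""


def summarize_slots(payload):
    """Return per-slot prompt token counts and the cached fraction."""
    summary = []
    for slot in json.loads(payload):
        total = slot.get("n_prompt_tokens", 0)
        cached = slot.get("n_prompt_tokens_cache", 0)
        summary.append({
            "id": slot.get("id"),
            "total": total,
            "processed": slot.get("n_prompt_tokens_processed", 0),
            "cached": cached,
            # Guard against idle slots that report zero prompt tokens.
            "cache_ratio": cached / total if total else 0.0,
        })
    return summary


for row in summarize_slots(SAMPLE_SLOTS):
    print(row)
```

In practice the payload would come from an HTTP GET against the server's /slots endpoint rather than a string literal.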
15 -
llama.cpp releases dev-tools 1mo ago
b9275
metal : optimize concat kernel and fix set kernel threads ( #23411 ) metal : fix GGML_OP_SET kernel threads tests : extend test_cpy to support different src/dst shapes Extend test_cpy to support different source and destination tensor shapes for CPY operations (reshaping), where…
37 -
llama.cpp releases dev-tools 1mo ago
b9274
server : free draft/MTP resources on sleep to fix VRAM leak ( #23461 ) The destroy() function in server_context_impl only cleaned up the main model and context (via llama_init.reset()) but did not free the speculative decoder (spec), draft context (ctx_dft), or draft model…
22 -
llama.cpp releases dev-tools 1mo ago
b9273
server: re-inject subcommand when router spawns children under unified binary ( #23442 ) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU) Ubuntu…
32 -
llama.cpp releases dev-tools 1mo ago
b9272
app : add batched-bench, fit-params, quantize & perplexity ( #23459 ) app : add batched-bench, fit-params, quantize & perplexity Signed-off-by: Adrien Gallouët Add missing main.cpp Signed-off-by: Adrien Gallouët Add EOL Signed-off-by:…
37 -
llama.cpp releases dev-tools 1mo ago
b9271
mtp: use inp_out_ids for skipping logit computation ( #23433 ) when doing a follow-up decode for the draft model, we were always doing the logit computation even though it is not required. macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS…
23 -
llama.cpp releases dev-tools 1mo ago
b9270
vocab : add Carbon-3B (HybridDNATokenizer) support ( #23410 ) vocab : add Carbon-3B (HybridDNATokenizer) support Adds a new BPE pre-type LLAMA_VOCAB_PRE_TYPE_CARBON for the HybridDNATokenizer used by HuggingFaceBio/Carbon-{500M,3B,8B}. The base BPE is Qwen3-4B-Base's; what…
11 -
llama.cpp releases dev-tools 1mo ago
b9267
ggml : Check the right iface method before using the fallback 2d get ( #23306 ) Probably no backends implement only one of 2d get/set, but this might be annoying for some future backend developer trying to add 2d get/set. macOS/iOS: macOS Apple Silicon (arm64) macOS Apple…
11 -
llama.cpp releases dev-tools 1mo ago
b9266
llama-graph: fix null-buffer crash in llm_graph_input_attn_kv_iswa for SWA-only models ( #23131 ) When a model has zero non-SWA attention layers (e.g. a SWA-only slice of Gemma 4), the base KV cache has no layer tensors. The input tensors (self_k_idxs, self_v_idxs, self_kq_mask)…
13 -
llama.cpp releases dev-tools 1mo ago
b9264
app : show version ( #23426 ) Signed-off-by: Adrien Gallouët macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU) Ubuntu x64…
26 -
llama.cpp releases dev-tools 1mo ago
b9265: hexagon: ssm-conv fix for large prompts (#23307)
hexagon: remove gathers and better handling of vtcm in ssm-conv hexagon: relax ssm-conv gating requirements hexagon: add new prefill ssm-conv backend test hexagon: remove trailing white space hex-rope: uninline rope_cache_init, otherwise it breaks after rebasing with SSM_CONV…
34 -
llama.cpp releases dev-tools 1mo ago
b9263
mtmd, model : merge HunyuanOCR into HunyuanVL and fix OCR vision precision ( #23329 ) HunyuanOCR shares the same HF arch and vision layout as HunyuanVL but was split into a separate path that skipped the +0.1 bilinear sampler used by the HF reference. Collapse OCR into the…
12 -
llama.cpp releases dev-tools 1mo ago
b9260
opencl: refactor backend initialization ( #23318 ) opencl: refactor initialization opencl: refactor GPU identification opencl: rename for consistency opencl: cache global mem size in dev_ctx opencl: adjust log level opencl: load argsort and flash_attn kernels in supports_op…
7 -
llama.cpp releases dev-tools 1mo ago
b9259
common/speculative : fix nullptr crash in get_devices_str ( #23386 ) ggml_backend_dev_by_name always appends a nullptr sentinel to the devices vector. Skipping nullptr entries prevents assertion failure in ggml_backend_dev_name. Assisted-by: llama.cpp:local pi macOS/iOS: macOS…
20 -
llama.cpp releases dev-tools 1mo ago
b9258
mtmd : DeepSeek-OCR image processing fixes, img_tool::resize padding refactor ( #23345 ) mtmd : deepseek-ocr fixes, improvements and refactoring image processing changes to achieve full parity with Pillow (reference impl) SAM mask casting only when flash-attn is on SAM refactor…
24 -
llama.cpp releases dev-tools 1mo ago
b9257
vulkan: optimize operations in the IM2COL shader ( #22685 ) vulkan: optimize operations in the IM2COL shader Add comments and improve the code formatting macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux:…
23 -
llama.cpp releases dev-tools 1mo ago
b9255
hexagon: HMX quantized matmul rework ( #23368 ) hmx-mm: update debug logging in hmx-mm hmx-mm: update dequant logic to use HVX_vector_x2/4 hmx-mm: remove non-pipelined version of the quantize matmul It seems that we don't really need the non-pipelined version hmx-mm: use activation…
36 -
llama.cpp releases dev-tools 1mo ago
b9254
Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) ( #22522 ) Adds initial PDL setup. Adds PDL barriers based on simple heuristic: place "sync" before first input pointer access, and "launch" after last write, e.g. to tensors like dst.…
17 -
llama.cpp releases dev-tools 1mo ago
b9253
app : introduce the llama unified executable ( #23296 ) app : introduce the llama unified executable Signed-off-by: Adrien Gallouët Use serve for server Signed-off-by: Adrien Gallouët Hide completion and bench, add help command…
26 -
llama.cpp releases dev-tools 1mo ago
b9251
mtmd: fit_params now take into account mmproj ( #21489 ) mtmd: fit_params now take into account mmproj rename alloc_compute_meta to reserve_compute_meta rm unused functions add ggml_backend_dev_t support add debug log macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon…
23 -
llama.cpp releases dev-tools 1mo ago
b9247
metal : optimize pad + cpy ( #23354 ) metal : optimize pad metal : optimize cpy cont : better row packing in threadgroup macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64…
29 -
llama.cpp releases dev-tools 1mo ago
b9245
ggml-cuda: tune RDNA3 Q6_K MMVQ nwarps ( #23349 ) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU) Ubuntu x64 (Vulkan) Ubuntu arm64 (Vulkan) Ubuntu…
35 -
llama.cpp releases dev-tools 1mo ago
b9244
opencl: add MoE support for q4_k, q5_k, q6_k on Adreno ( #23303 ) opencl: add q4_k moe support opencl: add q5_k moe support opencl: add q6_k moe support opencl: adjust format Co-authored-by: Li He macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon…
33 -
llama.cpp releases dev-tools 1mo ago
b9243
hexagon: add MROPE and IMROPE support in HTP rope op ( #23317 ) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU) Ubuntu x64 (Vulkan) Ubuntu arm64…
18 -
llama.cpp releases dev-tools 1mo ago
b9235
llama : MTP clean-up ( #23269 ) llama : disable equal splits for recurrent memory with partial rollback spec : re-enable p-min with MTP drafts spec : re-enable ngram spec in combination with RS rollback spec : fix ngram-map-* params spec : fix acceptance logic in combined ngram…
27 -
llama.cpp releases dev-tools 1mo ago
b9240
common: fix --help for --verbosity ( #23278 ) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU) Ubuntu x64 (Vulkan) Ubuntu arm64 (Vulkan) Ubuntu x64…
5 -
llama.cpp releases dev-tools 1mo ago
b9239
common: fix --fit verbosity with --verbosity 4 ( #23282 ) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU) Ubuntu x64 (Vulkan) Ubuntu arm64…
35 -
llama.cpp releases dev-tools 1mo ago
b9222
hexagon: add support for TRI op ( #22822 ) Hexagon: TRI HVX Kernel addition to ggml hexagon HTP ops and context addressed PR review comments for TRI op hexagon: clang format hex-unary: remove merge conflict markers hex-ggml: remove duplicate op cases (merge conflict) hex-ggml:…
36