llama.cpp releases · · 1 min read

b9330

Mirrored from llama.cpp releases for archival readability. Support the source by reading on the original site.

model: tag ffn_latent as MUL_MAT to fix buft probe (#23664)

ffn_latent_down/up are declared GGML_OP_MUL in LLM_TENSOR_INFOS but
nemotron-h feeds them through ggml_mul_mat. The loader buft probe asks
the backend about the declared op, so it tested an elementwise MUL on a
q8_0 weight. That used to return true unconditionally and the weight
stayed on GPU by luck. Once supports_op told the truth, the probe got a
no and the loader pushed the weight and its matmul to CPU, splitting the
graph. Tagging it MUL_MAT asks the real question, the math is unchanged.

Verified on Nemotron 3 Super 120B Q5_K_M: from 64.9 back to 103.22 t/s.

macOS/iOS:

Linux:

Android:

Windows:

openEuler:

UI:

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from llama.cpp releases