r/LocalLLaMA · 1 min read

b9410 MTP VRAM Save for F16 and FA llama.cpp

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

B9410

llama: use f16 mask for FA to save VRAM

23764

Merged: am17an merged 3 commits into ggml-org:master from am17an:kq_mask_f16 13 hours ago (17 comments, 3 commits, 27 checks, 4 files changed).

am17an commented 3 days ago:

Overview

Currently we reserve the KQ mask in f32 even if FA is used, and it is then converted to f16 when passed to the backends. The f32 mask still occupies the compute buffer even though it is not used, taking up extra VRAM. This PR reserves the KQ mask in f16. This saves 1.2 GB of VRAM at -ub 2048 and ~300 MB at -ub 512 when using MTP.
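To get a feel for why the mask's data type matters, and why the saving grows with -ub, here is a back-of-envelope sketch. It assumes the mask is a dense matrix with one element per (KV position, micro-batch token) pair and ignores padding, multiple streams, and any additional masks. The helper name `kq_mask_bytes` and the 131072-token context are hypothetical values chosen for illustration, not taken from the PR, so the output is not expected to reproduce the reported savings exactly.

```python
# Rough size of a dense attention (KQ) mask.
# Assumption: one element per (KV position, micro-batch token) pair;
# padding, multiple streams, and extra masks are ignored.

BYTES_PER_ELEMENT = {"f32": 4, "f16": 2}


def kq_mask_bytes(n_kv: int, n_ubatch: int, dtype: str) -> int:
    """Return the size in bytes of an n_kv x n_ubatch mask of the given dtype."""
    return n_kv * n_ubatch * BYTES_PER_ELEMENT[dtype]


def to_mib(n_bytes: int) -> float:
    """Convert bytes to MiB."""
    return n_bytes / (1024 * 1024)


N_KV = 131072  # hypothetical context length, not a value from the PR

for n_ubatch in (512, 2048):
    f32_size = kq_mask_bytes(N_KV, n_ubatch, "f32")
    f16_size = kq_mask_bytes(N_KV, n_ubatch, "f16")
    print(
        f"-ub {n_ubatch}: f32 mask {to_mib(f32_size):.0f} MiB, "
        f"f16 mask {to_mib(f16_size):.0f} MiB"
    )
```

Under this simplified model the mask size scales linearly with the micro-batch size, which is consistent with the reported saving being roughly four times larger at -ub 2048 than at -ub 512.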

submitted by /u/Bulky-Priority6824