b9410 MTP VRAM Save for F16 and FA llama.cpp
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
llama: use f16 mask for FA to save VRAM
23764
Merged: am17an merged 3 commits into ggml-org:master from am17an:kq_mask_f16 13 hours ago.

am17an commented 3 days ago:

Overview

Currently we reserve the KQ mask in f32 even when FA is used; it is then converted to f16 when it is passed to the backends. The f32 mask still occupies space in the compute buffer even though it is not used, taking up extra VRAM. This PR reserves the KQ mask in f16 instead. This saves 1.2 GB of VRAM at -ub 2048 and ~300 MB at -ub 512 when using MTP.
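As a rough illustration of why the saving grows with -ub, the sketch below estimates the size of a dense KQ mask as (KV cells) x (micro-batch tokens) x (bytes per element). This is back-of-the-envelope arithmetic only, not llama.cpp's actual allocation logic: the helper name `kq_mask_bytes` and the KV size `N_KV` are hypothetical, and padding, multiple streams, and MTP-specific graphs are ignored. It does show that the mask scales linearly with the micro-batch size, which is consistent with the roughly 4x gap between the -ub 2048 and -ub 512 figures quoted above.

```python
# Rough size estimate for a dense KQ mask; NOT llama.cpp's real allocator.
F32_BYTES = 4  # bytes per f32 element
F16_BYTES = 2  # bytes per f16 element

N_KV = 131072  # hypothetical number of KV cells (assumption, not from the PR)


def kq_mask_bytes(n_kv: int, n_ubatch: int, bytes_per_elem: int) -> int:
    """Bytes for an n_kv x n_ubatch mask, ignoring padding and extra streams."""
    return n_kv * n_ubatch * bytes_per_elem


def to_mib(n_bytes: int) -> float:
    """Convert a byte count to MiB."""
    return n_bytes / (1024 * 1024)


if __name__ == "__main__":
    for n_ubatch in (512, 2048):
        f32 = kq_mask_bytes(N_KV, n_ubatch, F32_BYTES)
        f16 = kq_mask_bytes(N_KV, n_ubatch, F16_BYTES)
        print(f"-ub {n_ubatch}: f32 mask {to_mib(f32):.0f} MiB, "
              f"f16 mask {to_mib(f16):.0f} MiB")
```

Under these assumptions the f32 mask is always twice the size of the f16 one, so not reserving the f32 copy frees more memory the larger the micro-batch is.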