Linux - Why does llama.cpp ROCm consume SO much VRAM for KV cache compared to Vulkan?
I have a Docker stack with a bunch of AI services, and the llama.cpp server is the brain.
I've got a working Vulkan yml snippet for llama.cpp, but out of curiosity I flipped it to ROCm (latest build) and did not see ANY performance improvement. In fact, for the SAME model, SAME context setting, and same KV cache quant (Q8_0), the ROCm build consumed 29.1 GB of VRAM vs. 25.3 GB with Vulkan.
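For reference, the compose service looks roughly like this; between the two runs, only the image tag and device mounts change. (A minimal sketch: the image tags, model path, and context size here are illustrative, not my exact stack.)

```yaml
services:
  llama-server:
    # Vulkan build of llama-server; for the ROCm run, swap the image
    # tag and expose /dev/kfd as well.
    image: ghcr.io/ggml-org/llama.cpp:server-vulkan  # ROCm run: :server-rocm
    devices:
      - /dev/dri
      # - /dev/kfd          # uncomment for the ROCm build
    volumes:
      - ./models:/models
    ports:
      - "8080:8080"
    command: >
      -m /models/model.gguf
      --host 0.0.0.0
      -ngl 99
      -c 32768
      --cache-type-k q8_0
      --cache-type-v q8_0
```

Both runs used identical -c and --cache-type-k/--cache-type-v flags, so on paper the KV cache configuration is the same; only the backend differs.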
Am I missing something here? Is this unique to my GPU, or is some other variable in my setup (hardware or software) responsible?