Performance When Offloading Large Models to System RAM?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I've noticed that for people running large models, or models that would be cost-prohibitive to hold entirely in GPU VRAM, the dominant strategy is one GPU plus a large pool of system DRAM to offload the weights to, since VRAM is always more expensive per GB than ordinary DDR5.
However, if that is the case, is there any advantage to having a large VRAM pool anyway? For example, would running Deepseek V4 Pro on an RTX 5090 (48GB) be any different than on an RTX6000 (96GB)? Since the active experts change often, and sometimes differ between sequential tokens, it would seem that experts constantly have to be swapped between VRAM and system memory. If that is the case, are the larger, faster GPUs only worth it for better prefill performance, since during decode the constant streaming of experts is bottlenecked by system RAM bandwidth, and maybe even PCIe bandwidth? Given otherwise identical systems, one with a 5090 and one with an RTX6000, would decode performance be the same?
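To make the bandwidth argument concrete, here is a back-of-envelope sketch in Python. It assumes decode is purely memory-bound and that every active weight is read exactly once per token, either from VRAM or over the slower path (system RAM or PCIe). The function name and all the numbers are placeholders I made up for illustration, not measurements or real model specs.

```python
def decode_tokens_per_s(active_gb, frac_in_vram, vram_gbps, slow_gbps):
    """Rough upper bound on decode speed for a memory-bound model.

    active_gb    -- GB of weights touched per token (hypothetical value)
    frac_in_vram -- fraction of those weights already resident in VRAM
    vram_gbps    -- VRAM bandwidth in GB/s (hypothetical value)
    slow_gbps    -- bandwidth of the offload path in GB/s (hypothetical value)
    """
    if not 0.0 <= frac_in_vram <= 1.0:
        raise ValueError("frac_in_vram must be in [0, 1]")
    # Time spent reading the VRAM-resident part of the active weights.
    fast_s = active_gb * frac_in_vram / vram_gbps
    # Time spent reading the part that lives in system RAM.
    slow_s = active_gb * (1.0 - frac_in_vram) / slow_gbps
    return 1.0 / (fast_s + slow_s)


if __name__ == "__main__":
    # Illustrative numbers only: 20 GB active, 1000 GB/s VRAM, 100 GB/s offload path.
    for frac in (0.0, 0.5, 0.9, 1.0):
        print(frac, round(decode_tokens_per_s(20, frac, 1000, 100), 1))
```

If this simplified model is anywhere near right, the slow path dominates the per-token time until most of the active weights sit in VRAM, which is basically what I'm asking about. It ignores compute, KV cache reads, and any overlap between transfers and compute.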
However, it would seem that if you can store more than one expert, there is a chance the next expert is already cached in VRAM. How does performance scale with the number of experts you can hold in VRAM? If you were to build a system for Deepseek V4 Pro, would it make sense to have two RTX6000s instead of one? Or do you need the vast majority of experts in VRAM to make a difference?
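One way to get a feel for the caching question is a toy simulation. The sketch below assumes routing is uniformly random and that VRAM acts as an LRU cache of whole experts. Both are my assumptions: real routers are likely skewed, and the expert counts used here are hypothetical, not the actual config of any Deepseek model.

```python
import random
from collections import OrderedDict


def lru_hit_rate(num_experts, top_k, cache_slots, tokens=20000, seed=0):
    """Fraction of expert lookups served from an LRU cache of `cache_slots` experts.

    Assumes each token picks `top_k` distinct experts uniformly at random,
    which is a simplification of real MoE routing.
    """
    rng = random.Random(seed)
    cache = OrderedDict()  # expert id -> True, ordered oldest to newest
    hits = total = 0
    for _ in range(tokens):
        for expert in rng.sample(range(num_experts), top_k):
            total += 1
            if expert in cache:
                hits += 1
                cache.move_to_end(expert)  # mark as most recently used
            else:
                cache[expert] = True  # "load" the expert into VRAM
                if len(cache) > cache_slots:
                    cache.popitem(last=False)  # evict the least recently used
    return hits / total


if __name__ == "__main__":
    # Hypothetical layer: 64 experts, 4 active per token.
    for slots in (0, 16, 32, 48, 64):
        print(slots, round(lru_hit_rate(64, 4, slots), 3))
```

Under uniform routing the hit rate should track roughly the fraction of experts that fit in VRAM, so there would be no sharp threshold, just a steady climb. If routing is skewed toward a few hot experts, a smaller cache might do better than this suggests.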
Curious about y'all's thoughts.