r/LocalLLaMA · May 24, 2026 · 1 min read

Performance When Offloading Large Models to System RAM?

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I've noticed that for people running large models, especially ones that would be cost-prohibitive to fit entirely in GPU VRAM, the dominant strategy is a single GPU plus a large pool of system DRAM to hold the offloaded weights, since VRAM is always more expensive per GB than ordinary DDR5.

However, if that is the case, is there any advantage to having a large VRAM pool anyway? For example, would running DeepSeek V4 Pro on an RTX 5090 (48GB) be any different from running it on an RTX6000 (96GB)? Since the active experts change often, and sometimes differ between consecutive tokens, it would seem that experts constantly have to be swapped between VRAM and system memory. If so, are the larger, faster GPUs only worth it for better prefill performance, given that during decode the constant streaming of experts is bottlenecked by system RAM bandwidth, and maybe even PCIe bandwidth? Given otherwise identical systems, one with a 5090 and one with an RTX6000, would decode performance be the same?
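To make the bandwidth argument concrete, here is a rough back-of-envelope sketch. It assumes decode is purely memory-bandwidth bound, that every token reads all of its active weights exactly once, and that weights living in system RAM are limited by a single "effective" bandwidth (whichever of DRAM or PCIe is the bottleneck). The function name, parameters, and every number in the usage lines are placeholders for illustration, not measurements or specs of any particular GPU or model.

```python
def decode_tokens_per_sec(active_bytes_per_token, vram_hit_fraction,
                          vram_bw_gbs, host_bw_gbs):
    """Upper bound on decode speed under a pure bandwidth model.

    active_bytes_per_token: bytes of weights touched per token (assumption)
    vram_hit_fraction: share of those bytes already resident in VRAM (0..1)
    vram_bw_gbs: VRAM bandwidth in GB/s (assumption)
    host_bw_gbs: effective bandwidth for weights in system RAM, in GB/s,
                 i.e. the slower of DRAM and PCIe on that path (assumption)
    """
    gigabytes = active_bytes_per_token / 1e9
    time_vram = gigabytes * vram_hit_fraction / vram_bw_gbs
    time_host = gigabytes * (1.0 - vram_hit_fraction) / host_bw_gbs
    return 1.0 / (time_vram + time_host)


# Placeholder inputs: 20 GB of active weights per token,
# 1000 GB/s VRAM, 50 GB/s effective host path.
for hit in (0.0, 0.25, 0.5, 0.75, 1.0):
    rate = decode_tokens_per_sec(20e9, hit, 1000, 50)
    print(f"VRAM hit fraction {hit:.2f}: {rate:.2f} tokens/s (upper bound)")
```

Under these assumptions the slow path dominates until the hit fraction gets close to 1, so in this model a bigger GPU helps decode only to the extent that it raises the share of active weights already sitting in VRAM.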

However, it would seem that if you can store more than one expert, there is a chance the next expert is already cached in VRAM. How does performance scale with the number of experts you can hold in VRAM? If you were to build a system for DeepSeek V4 Pro, would it make sense to have two RTX6000s instead of one? Or do you need the vast majority of experts in VRAM to make a difference?
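One way to get a feel for the caching question is a toy simulation. The sketch below treats VRAM as an LRU cache of expert slots for a single MoE layer and assumes the router picks `top_k` distinct experts uniformly at random per token. That uniform-routing assumption is the pessimistic case and is almost certainly not how a real model behaves (real routing tends to be skewed, which would help a cache). The expert counts and slot counts are hypothetical and are not DeepSeek's actual configuration.

```python
import random
from collections import OrderedDict


def simulate_hit_rate(num_experts, top_k, cache_slots, num_tokens, seed=0):
    """Fraction of expert lookups served from an LRU cache of `cache_slots`.

    Routing is uniform random per token (an assumption, not real router
    behaviour); one layer only; each expert occupies exactly one slot.
    """
    rng = random.Random(seed)
    cache = OrderedDict()  # expert id -> True, ordered oldest to newest
    hits = 0
    lookups = 0
    for _ in range(num_tokens):
        for expert in rng.sample(range(num_experts), top_k):
            lookups += 1
            if expert in cache:
                hits += 1
                cache.move_to_end(expert)  # mark as most recently used
            else:
                cache[expert] = True  # "load" the expert into VRAM
                if len(cache) > cache_slots:
                    cache.popitem(last=False)  # evict least recently used
    return hits / lookups


# Hypothetical layer: 64 experts, 4 active per token.
for slots in (0, 16, 32, 48, 64):
    rate = simulate_hit_rate(64, 4, slots, 2000)
    print(f"{slots:2d} of 64 experts cached: hit rate {rate:.2f}")
```

With uniform routing the hit rate grows roughly in proportion to the fraction of experts that fit, so there is no sharp threshold in the hit rate itself. Whether that translates into a noticeable decode speedup depends on how slow the miss path is compared to VRAM.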

Curious about y'all's thoughts.

submitted by /u/itisyeetime