Are the rich-RAM / poor-GPU people wrong here?
Hello Guys,
I know everyone has their own definition of local models, but for me there are two "reasonable" types of frontier local model.
A dense one that barely fits in 24GB or 32GB of VRAM, for the most "reasonable" GPU-wealthy guys, and an MoE around 100B params. The 100-ish-B models can run with hybrid offload at a decent speed on 128GB of RAM, since 128GB is the max a standard motherboard supports. Again, it's not exactly cheap, but common people can still afford it; it's still cheaper than a car 😄.
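To put rough numbers on that, here is a minimal back-of-the-envelope sketch in Python; the bits-per-weight figures are illustrative assumptions, and it ignores KV cache and runtime overhead:

```python
# Rough weight-size estimate for a quantized GGUF-style model (sketch).
# bpw values are illustrative assumptions; KV cache and overhead ignored.

def weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: billions of params * bpw / 8."""
    return params_b * bits_per_weight / 8

# A ~100B MoE at ~4.5 bpw (Q4_K_M-ish) fits comfortably in 128GB of RAM:
print(f"100B @ ~4.5 bpw ~ {weight_gb(100, 4.5):.0f} GB")  # ~56 GB

# A dense ~32B at the same quant needs ~18 GB, hence the 24GB-card target:
print(f"32B  @ ~4.5 bpw ~ {weight_gb(32, 4.5):.0f} GB")   # ~18 GB

# Cramming a ~671B model (DeepSeek V3-sized) into 128GB forces ~Q1:
print(f"671B @ ~1.6 bpw ~ {weight_gb(671, 1.6):.0f} GB")  # ~134 GB, barely
```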
We see a lot of dense models near that limit, like Qwen 27B, but for the ~100B MoE type there was only the Qwen 3.5 122B, and they didn't even release a 3.6. The best MoE models otherwise sit in the 30-35B range.
Does it mean that for the rich-RAM, poor-GPU people there isn't much choice, and that big GPUs were the only good road?
Of course you can cram something MiniMax-like at Q3, or DeepSeek V3 at Q1, but for tool calling, speed, and real usage it's barely usable.
I bought a Strix Halo before the RAM-pocalypse, but I see very few use cases for the 128GB except being able to load multiple models, and that can be done with llama-swap anyway.
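For the multi-model angle, here is a minimal llama-swap config sketch; the model names, file paths, quant choices, and the `-ot "exps=CPU"` expert-offload flag are assumptions to check against your own llama.cpp build:

```yaml
# llama-swap config.yaml (sketch): one OpenAI-compatible endpoint that
# loads/unloads models on demand, so they don't all have to fit at once.
models:
  "qwen-dense-32b":
    cmd: >
      llama-server --port ${PORT}
      -m /models/qwen-32b-q4_k_m.gguf
      -ngl 99
  "big-moe-100b":
    cmd: >
      llama-server --port ${PORT}
      -m /models/moe-100b-q4_k_m.gguf
      -ngl 99 -ot "exps=CPU"
```

The `-ot "exps=CPU"` override keeps the big expert tensors in system RAM while attention and shared layers stay on the GPU, which is the usual hybrid-offload trick for ~100B MoEs.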