What is the point of MoE models, beyond being faster?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Hi. Besides the fact that an xByA MoE models runs as fast as a yA models but produces better results, what are other benefits of pursuing an MoE architecture and not a dense one with e.g. x/2 (or x/3) parameters?
Given that we need enough RAM for xB parameter anyway, aren't MoEs at a disadvantage when RAM is scarce, like the current situation?
And thinking of limit cases, is there a limit on x/y, so that it doesn't make sense e.g. to train a 100B1A MoE model?
Thanks.
[link] [comments]
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.