r/LocalLLaMA

What is the point of MoE models, beyond being faster?

Hi. Besides the fact that an xByA MoE model (x billion total parameters, y billion active) runs as fast as a dense yB model but produces better results, what are the other benefits of pursuing an MoE architecture rather than a dense one with, e.g., x/2 (or x/3) parameters?

Given that we need enough RAM for all xB parameters anyway, aren't MoEs at a disadvantage when RAM is scarce, as in the current situation?
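To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It assumes weight memory scales with total parameters and that a forward pass costs roughly 2 FLOPs per active parameter per token, which is a common rule of thumb and not an exact figure. The model sizes (30B3A, dense 15B, dense 3B) and the 4-bit quantization (0.5 bytes per parameter) are hypothetical values picked only for illustration, and the function names are my own.

```python
def weight_memory_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and runtime overhead.
    """
    return total_params_b * bytes_per_param


def forward_flops_per_token(active_params_b: float) -> float:
    """Rule-of-thumb forward cost: about 2 FLOPs per active parameter."""
    return 2.0 * active_params_b * 1e9


# Hypothetical models: (name, total params in B, active params in B)
models = [
    ("MoE 30B3A", 30.0, 3.0),
    ("dense 15B (x/2)", 15.0, 15.0),
    ("dense 3B (y)", 3.0, 3.0),
]

BYTES_PER_PARAM = 0.5  # assumed 4-bit quantization

for name, total_b, active_b in models:
    mem = weight_memory_gb(total_b, BYTES_PER_PARAM)
    flops = forward_flops_per_token(active_b)
    print(f"{name:>16}: {mem:5.1f} GB weights, {flops:.1e} FLOPs/token")
```

Under these assumptions the MoE needs the memory of the full xB model but only the per-token compute of the yB one, which is exactly the tension I am asking about.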

And thinking of limiting cases, is there a limit on x/y beyond which it no longer makes sense, e.g., to train a 100B1A MoE model?

Thanks.

submitted by /u/ihatebeinganonymous