Multi-Tier MoE Caching
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I've never seen much discussion around this, but it feels like where MoE inference is heading.
Most of the big models we use (GLM 5.2, DeepSeek V4, Stepfun, MiniMax) are MoE, meaning inference for each token runs on only a small subset of the experts. Currently we scatter these experts across a mix of CPU and GPU RAM, giving us an aggregate speed of the two pipelines combined.
A fairly typical system may look like:
128 GB of DDR5-6000 at ~48 GB/s
24 GB of GDDR6X at ~936 GB/s
Assuming all memory is used, the capacity-weighted average of the two works out to about 188 GB/s.
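For anyone who wants to check the arithmetic, here is a minimal Python sketch using the figures above. The function names are my own, and the second calculation is an assumption rather than something from the post: it treats both pools as being read once, one after the other, so the slow pool's read time dominates.

```python
# Figures from the post; treat them as rough nominal numbers.
RAM_GB, RAM_BW = 128, 48      # system RAM capacity (GB) and bandwidth (GB/s)
VRAM_GB, VRAM_BW = 24, 936    # GPU memory capacity (GB) and bandwidth (GB/s)

def capacity_weighted_bandwidth(pools):
    """Arithmetic mean of bandwidth, weighted by each pool's capacity."""
    total = sum(size for size, _ in pools)
    return sum(size * bw for size, bw in pools) / total

def time_weighted_bandwidth(pools):
    """Total data divided by total read time (assumes sequential reads)."""
    total = sum(size for size, _ in pools)
    seconds = sum(size / bw for size, bw in pools)
    return total / seconds

pools = [(RAM_GB, RAM_BW), (VRAM_GB, VRAM_BW)]
print(round(capacity_weighted_bandwidth(pools), 1))  # 188.2
print(round(time_weighted_bandwidth(pools), 1))      # 56.5
```

The first number matches the ~188 GB/s figure. The second is what you get if every byte in both pools has to be read for each pass, which is why keeping the frequently used weights in the fast pool matters so much.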
I added some debugging to see the typical expert activation pattern in something like Qwen3.6 35b while processing a large C# codebase, with multiple prompts on top to fill up my context. I get this:
Top 1% of experts represent 20% of activations.
Top 5% of experts represent 50% of activations.
Top 10% of experts represent 70% of activations.
Top 15% of experts represent 80% of activations.
Top 20% of experts represent 85% of activations.
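Numbers like these can be derived from per-expert activation counters. A rough Python sketch of that calculation follows; the counts are made up for illustration and are not the measurements above. In a real model you would probably count per (layer, expert) pair rather than per expert.

```python
def coverage_of_top(counts, fraction):
    """Share of all activations handled by the most-used `fraction` of experts."""
    ranked = sorted(counts, reverse=True)
    k = max(1, round(len(ranked) * fraction))  # always include at least one expert
    return sum(ranked[:k]) / sum(ranked)

# Hypothetical activation counts for 10 experts (illustrative only).
counts = [500, 200, 100, 80, 50, 30, 20, 10, 6, 4]

for fraction in (0.1, 0.2, 0.5):
    print(f"top {fraction:.0%}: {coverage_of_top(counts, fraction):.0%}")
# top 10%: 50%
# top 20%: 70%
# top 50%: 93%
```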
That means if I could keep just the hottest 20% of my experts (or layers/tensors) on the GPU, I should get 85% of activations running at full speed. Caches could adapt to the session over time, perhaps even maintaining separate hot sets for coding, creative writing, etc.
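To make the adaptive part concrete, here is a small Python sketch of one way a hot set could track the session: decayed usage scores, with the top scorers kept resident. This is my own illustration, not how PowerInfer or any llama.cpp branch does it. The class name, the `(layer, expert)` key shape, and the decay value are all hypothetical, and the actual weight transfers are left out.

```python
from collections import defaultdict

class HotExpertSet:
    """Tracks which experts to keep in fast memory using decayed usage scores."""

    def __init__(self, capacity, decay=0.99):
        self.capacity = capacity          # how many experts fit in VRAM
        self.decay = decay                # below 1.0 so old usage fades over time
        self.scores = defaultdict(float)  # (layer, expert) -> decayed use count
        self.resident = set()             # experts currently assumed to be on the GPU
        self.hits = 0
        self.lookups = 0

    def record(self, expert_ids):
        """Record one token's routed experts and count cache hits."""
        for key in self.scores:
            self.scores[key] *= self.decay
        for eid in expert_ids:
            self.lookups += 1
            self.hits += eid in self.resident
            self.scores[eid] += 1.0

    def rebalance(self):
        """Make the highest-scoring experts resident; return (to_load, to_evict)."""
        ranked = sorted(self.scores, key=self.scores.get, reverse=True)
        target = set(ranked[:self.capacity])
        to_load, to_evict = target - self.resident, self.resident - target
        self.resident = target
        return to_load, to_evict

    def hit_rate(self):
        return self.hits / self.lookups if self.lookups else 0.0

cache = HotExpertSet(capacity=2)
for token_experts in [[(0, 1), (0, 2)], [(0, 1), (0, 3)], [(0, 1), (0, 2)]]:
    cache.record(token_experts)
cache.rebalance()
print(sorted(cache.resident))  # [(0, 1), (0, 2)]
```

Separate hot sets for coding or creative writing could then be separate `HotExpertSet` instances, or saved copies of `scores`, swapped in per session. Rebalancing every token would likely cost more in transfers than it saves, so in practice you would call `rebalance()` only occasionally.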
This isn't a new idea. There are quite a few papers on hierarchical caching and expert prefetching, and some practical implementations already exist:
PowerInfer (how the Tiiny.ai box claims to be able to run 122b models):
https://github.com/Tiiny-AI/PowerInfer
Lidenburg's llama.cpp branch:
https://github.com/Lidenburg/llama.cpp
HOBBIT, FlashMoE, Fiddler, DuoServe-MoE, M2Cache, etc.
I'm curious what others think, and whether anyone knows of other work happening in this area. It's obviously focused mainly on improving hybrid RAM/VRAM setups, but it also touches on things like the recent developments that allow running models from NVMe on Mac.