r/LocalLLaMA · June 23, 2026 · 2 min read

Multi-Tier MoE Caching

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I've never seen much discussion around this, but it feels like where MoE inference is heading.

Most of the big models we use (GLM 5.2, DeepSeek V4, Stepfun, MiniMax) are MoE, meaning inference runs on only a small subset of the experts for each token. Currently we scatter these experts across a mix of CPU RAM and GPU VRAM, which gives us the aggregate speed of the two pipelines combined.

A fairly typical system may look like:

128 GB of DDR5-6000 at ~48 GB/s
24 GB of GDDR6X at ~936 GB/s

Assuming all memory is used, that works out to a capacity-weighted average bandwidth of about 188 GB/s.
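
For reference, the ~188 figure is reproduced by weighting each tier's bandwidth by its capacity. The sketch below shows that arithmetic using only the numbers listed above. The second figure is an assumption added for comparison, not something from the post: it estimates the rate if every byte had to be read once with both tiers streaming in parallel, in which case the slower tier sets the pace.

```python
# Figures from the example system above (capacity in GB, bandwidth in GB/s).
tiers = {
    "DDR5": {"capacity_gb": 128, "bandwidth_gbs": 48},
    "GDDR6X": {"capacity_gb": 24, "bandwidth_gbs": 936},
}

total_gb = sum(t["capacity_gb"] for t in tiers.values())

# Capacity-weighted average bandwidth: reproduces the ~188 figure.
weighted_avg = (
    sum(t["capacity_gb"] * t["bandwidth_gbs"] for t in tiers.values()) / total_gb
)

# Assumption (not from the post): reading all memory once, with both tiers
# streaming in parallel, takes as long as the slower tier needs.
sweep_seconds = max(t["capacity_gb"] / t["bandwidth_gbs"] for t in tiers.values())
sweep_bandwidth = total_gb / sweep_seconds

print(f"capacity-weighted average: {weighted_avg:.1f} GB/s")
print(f"full-sweep effective rate: {sweep_bandwidth:.1f} GB/s")
```

The gap between the two numbers is one way to see why placement matters: a uniform scatter lets the slow tier dominate, while a skewed activation pattern gives the fast tier a chance to do most of the work.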

I added some debug logging to look at typical expert activation in something like Qwen3.6 35B while processing a large C# codebase, with multiple prompts stacked on top to fill up my context. This is what I got:

The top 1% of experts account for 20% of activations.
The top 5% of experts account for 50% of activations.
The top 10% of experts account for 70% of activations.
The top 15% of experts account for 80% of activations.
The top 20% of experts account for 85% of activations.
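
Numbers like these can be derived from a plain activation count per expert. The following is a minimal sketch, assuming a hypothetical routing log with one `(layer, expert_id)` entry per activation; the log values are made up for illustration and `top_share` is not a function from any inference engine.

```python
from collections import Counter


def top_share(counts, top_fraction):
    """Share of all activations captured by the most-used `top_fraction` of experts."""
    ordered = sorted(counts.values(), reverse=True)
    k = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical routing log: one (layer, expert_id) entry per activation.
# Real data would come from instrumenting the router; these values are made up.
log = [(0, 3)] * 50 + [(0, 7)] * 30 + [(1, 3)] * 10 + [(1, 5)] * 6 + [(0, 1)] * 4
counts = Counter(log)

# Note: experts that never fire are absent from the Counter, so a real run
# should seed it with every (layer, expert) key to keep the percentages honest.
for frac in (0.2, 0.4, 1.0):
    print(f"top {frac:.0%} of experts: {top_share(counts, frac):.0%} of activations")
```

Counting per `(layer, expert_id)` rather than per expert index alone matters because each layer has its own set of experts.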

This means that if I could move just the hottest 20% of my experts (or layers/tensors) to the GPU, about 85% of activations should run at full GPU speed. Caches could adapt to the session over time, perhaps even maintaining separate hot sets for coding, creative writing, etc.
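
One way a cache could adapt to the session is to keep a decayed activation score per expert and hold the highest-scoring ones in VRAM. This is only a sketch of that idea under stated assumptions: `HotSetTracker`, its `capacity` and `decay` parameters, and the expert keys are all hypothetical, and it does not reflect how any of the projects mentioned here actually work.

```python
import heapq


class HotSetTracker:
    """Tracks decayed activation scores and picks which experts to keep in VRAM."""

    def __init__(self, capacity, decay=0.99):
        self.capacity = capacity  # number of experts that fit in VRAM (assumed)
        self.decay = decay        # below 1.0 so that older activity fades
        self.scores = {}

    def record(self, activated):
        """Call once per token with the expert keys the router selected."""
        for key in self.scores:
            self.scores[key] *= self.decay
        for key in activated:
            self.scores[key] = self.scores.get(key, 0.0) + 1.0

    def hot_set(self):
        """The `capacity` highest-scoring experts, the candidates for VRAM."""
        return set(heapq.nlargest(self.capacity, self.scores, key=self.scores.get))


tracker = HotSetTracker(capacity=2, decay=0.5)
for _ in range(10):
    tracker.record(["e1", "e2"])  # e.g. a coding-heavy stretch
for _ in range(10):
    tracker.record(["e3", "e4"])  # the workload shifts
print(tracker.hot_set())
```

Decaying every score on every token is linear in the number of experts, which is fine for a sketch; a real implementation would also need to weigh the cost of moving weights between tiers before swapping the hot set.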

This isn't a new idea. There are quite a few papers on hierarchical caching and expert prefetching, and some practical implementations already exist:

PowerInfer (how the Tiiny.ai box claims to be able to run 122B models):
https://github.com/Tiiny-AI/PowerInfer

Lidenburg's llama.cpp branch:
https://github.com/Lidenburg/llama.cpp

HOBBIT, FlashMoE, Fiddler, DuoServe-MoE, M2Cache, etc.

I'm curious what others think, and whether anyone knows of other work happening in this area. It's obviously focused mainly on improving hybrid RAM/VRAM setups, but it also touches on things like the recent developments that allow running models from NVMe on Mac.

submitted by /u/Legitimate-Dog5690