r/LocalLLaMA · 3 min read

Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Hey fellow Llamas, your time is precious, so I'll keep it short (while trying to explain everything lol).

TL;DR:

  • 33-35B MoE on a 16 GB GPU. Qwen3.6 35B-A3B: 13.3 GiB (was ~20.5). Laguna XS.2 33B-A3B: 14.6 GiB (was 18.8). Both measured on an RTX 3090, both under 16 GiB.
  • Only the active experts stay on the GPU. An A3B model routes to ~8 of 256 experts per token. Spark calibrates which experts your traffic hits and keeps those hot; the long tail lives in system RAM and is swapped in on demand through a bounded GPU cache.
  • Self-tuning. The placement is learned from live routing and written next to the model. Each restart loads a better profile. No corpus, no offline calibration step required.
  • One command, both backends. dflash_server <model.gguf> --spark works for laguna and qwen35moe. The server picks cache size, loads the learned profile if present, and keeps persisting it.
  • Offload without the speed cliff. Under offload, laguna runs the whole token as one fused graph, not 40 per-layer graphs. At full residency the fused graph produces output bit-identical to all-GPU at the same speed (119 tok/s); at 60% residency it holds ~100 tok/s (about 1.5x the 66 tok/s of a naive offload).

This is open source and you can find it here: https://github.com/Luce-Org/lucebox-hub (Apache 2.0).

The base idea is not new and not magic. Expert offloading is old: llama.cpp does it (--n-cpu-moe / --cpu-moe), ktransformers does it, ik_llama.cpp does it. Keeping the hot experts on the GPU and the rest in RAM is the standard trick.

How it works, three pieces:

  • Calibrated placement. Spark accumulates per-(layer, expert) routing frequencies from real requests and pins the most-used set. On held-out traffic this drops the cold-hit rate from 36% (uniform split) to about 7%.
  • Bounded async cache. A fixed ring of spare GPU slots. On a cold-expert hit the weights copy async from pinned host memory, overlapped with compute, into a spare slot, evicting the LRU entry. A miss costs throughput, not a stall. The ring is a small over-allocation of the hot expert stack, so a swap is just copying three weight tensors and updating one routing entry, served by the existing GPU FFN with no special path. Same mechanism for both backends.
  • One fused graph. The offloaded path was building 40 per-layer graphs per token. Folding the routed FFN into the attention graph and running the whole token as one graph removes that submission overhead. At full residency the fused decode is bit-identical to all-GPU (128/128 tokens, verified by spark/bench.py) and runs at the same ~119 tok/s.
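To make the first two pieces concrete, here is a minimal Python sketch of the bookkeeping only. It is not Spark's code: the class names `Placement` and `ExpertCache` are made up for illustration, and the async copy from pinned host memory is reduced to a comment. It assumes hits are counted per (layer, expert) pair and that the spare slots are evicted in LRU order, as described above.

```python
from collections import Counter, OrderedDict


class Placement:
    """Counts per-(layer, expert) routing hits and picks a hot set."""

    def __init__(self):
        self.counts = Counter()

    def observe(self, layer, expert_ids):
        # Called once per token and layer with the experts the router chose.
        for expert in expert_ids:
            self.counts[(layer, expert)] += 1

    def hot_set(self, layer, budget):
        # Most-used experts first; ties broken by expert id for determinism.
        ranked = sorted(
            (e for (lyr, e) in self.counts if lyr == layer),
            key=lambda e: (-self.counts[(layer, e)], e),
        )
        return set(ranked[:budget])


class ExpertCache:
    """Pinned hot experts plus a fixed number of LRU swap slots (one layer)."""

    def __init__(self, hot, slots):
        if slots < 1:
            raise ValueError("need at least one spare slot")
        self.hot = set(hot)
        self.slots = slots
        self.lru = OrderedDict()  # expert id -> None, least recent first
        self.misses = 0

    def lookup(self, expert):
        if expert in self.hot:
            return "hot"
        if expert in self.lru:
            self.lru.move_to_end(expert)  # mark as most recently used
            return "cached"
        self.misses += 1
        if len(self.lru) >= self.slots:
            self.lru.popitem(last=False)  # evict the least recently used
        # A real implementation would start the host-to-GPU copy here.
        self.lru[expert] = None
        return "miss"


placement = Placement()
for experts in ([1, 2], [1, 3], [1, 2], [4, 1]):
    placement.observe(0, experts)

hot = placement.hot_set(0, budget=2)
cache = ExpertCache(hot, slots=1)
results = [cache.lookup(e) for e in (1, 3, 3, 4, 3)]
```

With a budget of two, experts 1 and 2 are pinned. Expert 3 misses once, is served from the spare slot on the next hit, and is evicted when expert 4 arrives, so the last lookup of 3 misses again. A larger ring trades VRAM for fewer of those repeat misses.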

Memory, peak VRAM on a 3090, ctx 4096:

| Model | All-GPU | Spark | Saved | Fits 16 GB |
|---|---|---|---|---|
| Laguna XS.2 33B-A3B | 18.8 GiB | 14.6 GiB | 4.2 GiB | yes |
| Qwen3.6 35B-A3B | ~20.5 GiB | 13.3 GiB | ~7 GiB | yes |

Speed, where the gains come from:

| Config | Decode (tok/s) | % of all-GPU |
|---|---|---|
| Naive offload (uniform) | 66 | 55% |
| Spark, calibrated placement | 81 | 68% |
| Spark, calibrated + cache + fused graph | ~100 | ~85% |
| All-GPU (needs 24 GB) | 119 | 100% |
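For anyone who wants to check the ratios, the arithmetic behind the percentage column is just each decode rate divided by the all-GPU rate. A quick Python check using only the numbers from the post (the ~100 tok/s figure is approximate, so it is treated as exactly 100 here):

```python
ALL_GPU = 119  # tok/s, all-GPU baseline from the post

decode = {
    "naive offload": 66,
    "calibrated placement": 81,
    "calibrated + cache + fused graph": 100,  # quoted as ~100 in the post
}


def pct_of_all_gpu(tok_s, baseline=ALL_GPU):
    """Decode rate as a rounded percentage of the all-GPU rate."""
    return round(100 * tok_s / baseline)


percentages = {name: pct_of_all_gpu(rate) for name, rate in decode.items()}

# Speedup of the full Spark path over the naive offload.
speedup = decode["calibrated + cache + fused graph"] / decode["naive offload"]
```

Exactly 100 tok/s works out to 84% of all-GPU, which is consistent with the quoted ~85% given that the decode figure is itself rounded. The speedup over naive offload comes to roughly 1.5x.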

One self-tuning command:

    # laguna or qwen35moe, same flag
    dflash_server models/Qwen3.6-35B-A3B-Q4_K_M.gguf --spark

    # optional: cache slots per layer (default 32)
    dflash_server models/laguna-xs2-Q4_K_M.gguf --spark --spark-slots 48
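The "load the profile if present, keep persisting it" loop can be pictured roughly like this. This is a sketch under assumptions, not the server's actual code or file format: the `.spark.json` suffix, the JSON layout, and the function names are all hypothetical. The only thing taken from the post is that routing counts are stored next to the model and accumulate across restarts.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def profile_path(model_path):
    # Hypothetical naming: "<model file name>.spark.json" beside the model.
    p = Path(model_path)
    return p.with_name(p.name + ".spark.json")


def load_profile(model_path):
    """Return saved (layer, expert) -> count, or an empty Counter."""
    path = profile_path(model_path)
    if not path.exists():
        return Counter()
    raw = json.loads(path.read_text())
    return Counter({tuple(map(int, k.split(":"))): v for k, v in raw.items()})


def save_profile(model_path, counts):
    """Write counts via a temp file and rename, so a crash cannot truncate."""
    path = profile_path(model_path)
    raw = {f"{layer}:{expert}": n for (layer, expert), n in counts.items()}
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(raw))
    tmp.replace(path)


with tempfile.TemporaryDirectory() as d:
    model = Path(d) / "model.gguf"

    first = load_profile(model)  # first run: no profile yet
    was_empty = len(first) == 0
    first.update({(0, 7): 3, (1, 42): 1})  # routing seen during this run
    save_profile(model, first)

    second = load_profile(model)  # restart: starts from the saved counts
    second.update({(0, 7): 2})
    save_profile(model, second)

    merged = load_profile(model)
```

Because each run starts from the previous counts and adds to them, the hot set on the next restart reflects all traffic seen so far, which is the sense in which each restart loads a better profile.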

Honest limitations:

  • Measured on a 3090 (24 GB). Peak VRAM lands under 16 GiB, but we have not yet run it on an actual 16 GB card. If someone has a 4060 Ti 16GB / 5060 Ti 16GB, I would love a real number.
  • Offload still trails all-GPU a little. Closing the last ~15% needs either more VRAM or predicting the next experts, and token-level prediction caps around 53% recall, so that is open work, not a free lunch.
  • No head-to-head against llama.cpp --n-cpu-moe on identical settings yet. That is the comparison we most want to add.

We worked hard on this to help the local AI community. Of course, we may have made mistakes. Feedback is more than welcome!

EDIT: made the post more concise sorry guys 😂

submitted by /u/sandropuppo