r/LocalLLaMA · May 26, 2026 · 1 min read

Cactus Hybrid Router: Gemma4-2B can match Gemini-3.1-Flash-Lite by routing 15-55% of tasks to Gemini and running the rest locally.

Last week, we announced the “Simple Attention Network” and trained Needle, a 26M-parameter function-calling model that beats models 10-25x its size. Some LocalLLaMA Redditors asked if we could make a router model. We have now built “Cactus Hybrid Router”, a 65k-parameter model that decides on the fly whether to complete a task with the edge model or route it to a frontier cloud model.

https://preview.redd.it/jm23ff7r1k3h1.png?width=1453&format=png&auto=webp&s=2091ec952216beb2d987d536b08df3aec58fec94
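
For anyone wondering what "decide on the fly" could look like in code, here is a minimal sketch of threshold-based routing. It is not the Cactus API: `HybridRouter`, `score_fn`, `run_edge`, `run_cloud` and the toy scoring rule are all hypothetical stand-ins, and the real router is a trained model, not a word count.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class HybridRouter:
    """Hypothetical wrapper: sends each prompt to the edge or the cloud model."""
    score_fn: Callable[[str], float]   # assumed: higher = edge model more likely to suffice
    run_edge: Callable[[str], str]     # assumed: calls the local model
    run_cloud: Callable[[str], str]    # assumed: calls the cloud model
    threshold: float = 0.5             # scores at or above this stay on the edge

    def route(self, prompt: str) -> Tuple[str, str]:
        """Return (target, answer), where target is 'edge' or 'cloud'."""
        if self.score_fn(prompt) >= self.threshold:
            return "edge", self.run_edge(prompt)
        return "cloud", self.run_cloud(prompt)


def toy_score(prompt: str) -> float:
    # Stand-in for the learned router: treat short prompts as easy.
    return 1.0 if len(prompt.split()) <= 8 else 0.0


router = HybridRouter(
    score_fn=toy_score,
    run_edge=lambda p: f"[edge] {p}",
    run_cloud=lambda p: f"[cloud] {p}",
)

target, answer = router.route("what is the capital of France?")
print(target, answer)
```

The only design point being illustrated is that the routing decision is a cheap scalar comparison made before either model runs, so the cost of the router itself stays negligible next to inference.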

Router performance stays robust even when you quantize the edge model. This is with Cactus Quants though, where our 4-bit uniform quantization naturally comes close to FP16.

https://preview.redd.it/4ri8bkuw1k3h1.png?width=2048&format=png&auto=webp&s=415e8165d5421d509634c165a3fb9feb2f83c209

The edge-cloud ratio is adjustable so you can tune resource allocation, because why run "what is the capital of France?" through a trillion-parameter frontier model on expensive infra?

https://preview.redd.it/dwtg7noc2k3h1.png?width=904&format=png&auto=webp&s=0ecde47c439e7a29af3dca441a9098c98ca38e29
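
One way an adjustable edge-cloud ratio could be implemented is by calibrating a score threshold on a sample of prompts. The sketch below assumes the router emits a scalar score per prompt (higher means the edge model is more likely to suffice) and that prompts scoring below the threshold go to the cloud. The function name and that scoring convention are assumptions for illustration, not how Cactus necessarily does it.

```python
from typing import Sequence


def threshold_for_cloud_fraction(scores: Sequence[float], cloud_fraction: float) -> float:
    """Pick a threshold so that roughly `cloud_fraction` of the calibration
    scores fall strictly below it (those prompts would be sent to the cloud)."""
    if not scores:
        raise ValueError("need at least one calibration score")
    if not 0.0 <= cloud_fraction <= 1.0:
        raise ValueError("cloud_fraction must be in [0, 1]")

    ordered = sorted(scores)
    k = round(cloud_fraction * len(ordered))  # how many prompts to send to cloud
    if k >= len(ordered):
        return float("inf")                   # everything goes to the cloud
    return ordered[k]                         # the k lowest scores fall below this


# Hypothetical calibration scores from a held-out prompt set.
calibration_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
threshold = threshold_for_cloud_fraction(calibration_scores, cloud_fraction=0.3)
sent_to_cloud = sum(score < threshold for score in calibration_scores)
print(threshold, sent_to_cloud)
```

With tied scores the realized fraction can come out lower than the target, so in practice you would check the achieved ratio on held-out traffic rather than trust the calibration set.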

The same 64k router handles text-only, vision, and audio prompts.

We'd love to hear your thoughts on this. What are we not thinking about?

Live AI and coding require a lot of inference, which puts a lot of pressure on cloud infra.

Why not run rudimentary tasks locally and only escalate to the cloud when needed, as a step towards edge inference?

https://github.com/cactus-compute/cactus

submitted by /u/Henrie_the_dreamer