r/LocalLLaMA · 2 min read

Mimo 2.5 is _fast_ at large context (dual RTX Pro 6000)

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

For agentic work, fast high context is king. OpenCode fills the window quickly, and most models that feel snappy at 8k context turn into dial-up ADSL brrr by the time you're 150k tokens deep. So I've been testing lots of models and runners, trying to get a "local Sonnet" on 2x RTX PRO 6000 (spoiler: yes!).

The drop-off is all about how each model handles attention. Mimo 2.5 stays fast on these cards because it uses the same 5-to-1 local/global sliding-window attention that Gemma 3 does: most layers only look at recent tokens, while some still read the full context, so it stays quick without losing the plot.
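To make the local/global split concrete, here is a minimal sketch of which earlier tokens a query position can see in each kind of layer. The 5-to-1 ratio comes from the post; the window size and the layer ordering (the global layer placed last in each group of six) are illustrative assumptions, not Mimo 2.5's actual configuration.

```python
# Sketch of a 5-to-1 local/global layer layout.
# ASSUMPTIONS (not from the post): the window size, and that every sixth
# layer is the global one. Real models may order their layers differently.

LOCAL_PER_GLOBAL = 5   # from the post: 5 local layers per global layer
WINDOW = 1024          # hypothetical sliding-window size in tokens


def is_global_layer(layer_idx: int) -> bool:
    """Every (LOCAL_PER_GLOBAL + 1)-th layer attends to the full context."""
    return (layer_idx + 1) % (LOCAL_PER_GLOBAL + 1) == 0


def visible_range(layer_idx: int, query_pos: int) -> range:
    """Key positions a causal query at `query_pos` can attend to."""
    if is_global_layer(layer_idx):
        start = 0  # global layer: everything up to and including the query
    else:
        start = max(0, query_pos - WINDOW + 1)  # local layer: last WINDOW tokens
    return range(start, query_pos + 1)


if __name__ == "__main__":
    pos = 150_000 - 1  # last token of a 150k-token context
    for layer in range(6):
        kind = "global" if is_global_layer(layer) else "local"
        print(layer, kind, len(visible_range(layer, pos)))
```

Under these assumptions, five of every six layers read only 1,024 keys at 150k context, and only the sixth reads all 150,000.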

MiniMax M3 and DeepSeek V4, by contrast, rely on custom GPU kernels that nobody has written for "consumer" Blackwell yet. Their kernels are written for datacenter Blackwell (SM100, the B200 class). So MiniMax M3 silently falls back to dense attention and slows to a crawl, and DeepSeek V4's ops drop to CPU and grind to a halt at 14 t/s. The reason Unsloth still hasn't shipped a GGUF for DeepSeek V4 Flash is most likely this: https://github.com/ggml-org/llama.cpp/discussions/22376
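A rough way to picture the silent fallback is capability-based kernel selection. The sketch below is purely hypothetical: the function, the registry, and the path names are made up for illustration and do not correspond to any real runtime's API. The only grounded parts are the architecture labels (SM100 for datacenter Blackwell, SM120 for the RTX 5090 / RTX PRO 6000 class).

```python
# Hypothetical sketch of capability-based kernel selection with a silent
# fallback. None of these names belong to a real runtime.

SM100 = (10, 0)  # datacenter Blackwell (B200 class), per the post
SM120 = (12, 0)  # RTX 5090 / RTX PRO 6000 class

# Hypothetical registry: architectures a custom attention kernel was built for.
CUSTOM_KERNEL_ARCHS = {SM100}


def pick_attention_path(compute_capability: tuple[int, int]) -> str:
    """Return the path a runtime might take for a given GPU architecture."""
    if compute_capability in CUSTOM_KERNEL_ARCHS:
        return "custom-kernel"
    # No kernel for this architecture: fall back without raising an error,
    # so the only visible symptom is lower throughput.
    return "dense-fallback"


if __name__ == "__main__":
    for name, cc in [("B200", SM100), ("RTX PRO 6000", SM120)]:
        print(name, pick_attention_path(cc))
```

Nothing errors out on the unsupported card, which is why the problem tends to show up only as a throughput collapse at long context.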

I tested a lot with SGLang and vLLM using NVFP4 variants, but no dice. They do run slightly faster at baseline, but attention still slows down just the same at larger context. NVFP4 on SM120 is buggy right now regardless: https://github.com/sgl-project/sglang/issues/19637

Step 3.7 Flash also uses a sliding-window hybrid (3-to-1 instead of 5-to-1) and keeps up at higher context, around 40 t/s at 178k, so it's a good alternative! (Side note: Step 3.7 Flash seems more driven/creative with fictional writing, if that's your thing.)
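For a back-of-envelope feel for the two ratios, the sketch below averages how many cached keys each layer reads when decoding one token. The ratios and the 178k context are from the post; the window size is a hypothetical value, and the estimate ignores everything except attention reads (MoE/MLP cost, kernel efficiency, and so on), so treat it as a rough intuition rather than a throughput prediction.

```python
# Back-of-envelope: average number of keys attended per layer for one new
# token under an N-to-1 local/global layout.
# ASSUMPTION: the window size is hypothetical; ratios and context are from the post.

def avg_keys_per_layer(context: int, local_per_global: int, window: int) -> float:
    """Mean keys read per layer when decoding one token at `context` tokens."""
    local_keys = min(context, window)   # local layers see at most `window` keys
    global_keys = context               # global layers see the whole context
    group = local_per_global + 1        # layers per repeating group
    return (local_per_global * local_keys + global_keys) / group


if __name__ == "__main__":
    ctx, win = 178_000, 1024
    dense = float(ctx)  # a fully dense model reads every key in every layer
    for name, ratio in [("5-to-1", 5), ("3-to-1", 3)]:
        avg = avg_keys_per_layer(ctx, ratio, win)
        print(f"{name}: {avg:,.0f} keys/layer, {avg / dense:.1%} of dense")
```

With this assumed window, the 5-to-1 layout reads roughly 17% of what dense attention would at 178k and the 3-to-1 layout roughly 25%, so both stay far cheaper than dense.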

In my private coding benchmark, Opus nails it, including an edge case, while Sonnet gets the core right. The local models I've tested (Mimo 2.5, MiniMax M2.7, MiniMax M3, Step 3.7 Flash) landed right at Sonnet's level in quality (no, not you, Qwen 3.5 122B, sorry). The neat part is that Mimo 2.5 solves it in ~4 minutes (same as Opus/Sonnet), while MiniMax M3 takes ~40 minutes (go make a coffee, then lunch, water the plants, watch grass grow).

(Bonus: In my testing, MiniMax M3 (427B) and M2.7 (229B) seem to be roughly the same quality at the same VRAM limit; M3 is just slower, and the intelligence improvements on official benchmarks seem to come from it being a larger model.)

TL;DR: Software is behind on making many of the latest models usable on RTX 5090 / RTX PRO 6000, but Mimo 2.5 and Step 3.7 Flash use an "older" approach that works great for agentic large-context work.

submitted by /u/xquarx