r/MachineLearning · June 3, 2026 · 1 min read

MiniMax dropped a new attention architecture. [N]

The interesting part is how it handles long context windows.

They say it scales natively to 1M tokens with MiniMax Sparse Attention (MSA), which avoids the quadratic cost of standard dense attention by restructuring memory access patterns at the operator level.

Instead of relying on typical sparse approximations that degrade recall, MSA uses what they call a "KV outer gather Q" approach.

The kernel makes KV blocks the outer loop and, for each block, gathers the queries that selected ("hit") it. That way hardware memory reads stay strictly contiguous, and each block is fetched exactly once.
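To make the loop order concrete, here is a minimal pure-Python sketch of the "KV outer, gather Q" idea as I read it from the description above. It is not MiniMax's kernel. The function names, the `selected` input (which KV blocks each query attends to), and the online-softmax accumulation are all my assumptions, since the post does not say how MSA selects blocks or merges partial results.

```python
import math
import random


def kv_outer_gather_q(Q, K, V, block_size, selected):
    """Block-sparse attention with KV blocks as the outer loop.

    selected[i] is the set of KV block indices query i attends to
    (hypothetical input; the post does not describe block selection).
    Returns (outputs, fetches), where fetches[b] counts reads of block b.
    """
    n_q, d_v = len(Q), len(V[0])
    scale = 1.0 / math.sqrt(len(Q[0]))
    n_blocks = (len(K) + block_size - 1) // block_size

    # Invert the selection: for each KV block, list the queries that hit it.
    hits = [[] for _ in range(n_blocks)]
    for qi, blocks in enumerate(selected):
        for b in blocks:
            hits[b].append(qi)

    # Per-query online-softmax state: running max, running sum, weighted sum.
    m = [-math.inf] * n_q
    l = [0.0] * n_q
    acc = [[0.0] * d_v for _ in range(n_q)]
    fetches = [0] * n_blocks

    for b in range(n_blocks):            # outer loop: KV blocks, in memory order
        if not hits[b]:
            continue                     # no query selected this block
        start = b * block_size
        k_blk = K[start:start + block_size]   # one contiguous read per block
        v_blk = V[start:start + block_size]
        fetches[b] += 1
        for qi in hits[b]:               # inner loop: gathered queries
            scores = [scale * sum(x * y for x, y in zip(Q[qi], k)) for k in k_blk]
            m_new = max(m[qi], max(scores))
            rescale = math.exp(m[qi] - m_new)   # exp(-inf) == 0.0 on first hit
            w = [math.exp(s - m_new) for s in scores]
            l[qi] = l[qi] * rescale + sum(w)
            acc[qi] = [a * rescale + sum(wj * v[c] for wj, v in zip(w, v_blk))
                       for c, a in enumerate(acc[qi])]
            m[qi] = m_new

    out = [[a / l[qi] for a in acc[qi]] if l[qi] > 0 else acc[qi]
           for qi in range(n_q)]
    return out, fetches


def q_outer_reference(Q, K, V, block_size, selected):
    """Same math with the usual query-outer loop, for checking only."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for qi, q in enumerate(Q):
        idx = [j for b in sorted(selected[qi])
               for j in range(b * block_size, min((b + 1) * block_size, len(K)))]
        scores = [scale * sum(x * y for x, y in zip(q, K[j])) for j in idx]
        top = max(scores)
        w = [math.exp(s - top) for s in scores]
        total = sum(w)
        out.append([sum(wj * V[j][c] for wj, j in zip(w, idx)) / total
                    for c in range(len(V[0]))])
    return out


# Tiny demo on random data: 6 queries, 16 keys in 4 blocks, 2 blocks per query.
random.seed(0)
Q = [[random.gauss(0, 1) for _ in range(8)] for _ in range(6)]
K = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
V = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
sel = [set(random.sample(range(4), 2)) for _ in range(6)]

out, fetches = kv_outer_gather_q(Q, K, V, 4, sel)
ref = q_outer_reference(Q, K, V, 4, sel)
max_diff = max(abs(a - b) for ro, rr in zip(out, ref) for a, b in zip(ro, rr))
print("max abs difference vs query-outer reference:", max_diff)
print("reads per KV block:", fetches)
```

The point of the comparison is that the loop order changes only the memory traffic, not the result: in the query-outer version each query pulls its own scattered set of KV blocks, while in the KV-outer version each block is read once and the per-query partial results are merged with a running softmax.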

The low-level performance gains are interesting:

→ 4× faster than Flash-Sparse-Attention.

→ Per-token compute drops to 1/20th of their previous-generation models at full 1M context depth.

→ 9× speedup in prefill and 15× speedup in decoding.

MiniMax also claims this is the first open-weight model to combine all three: frontier coding, 1M context, and native multimodality.

Overall, there is some solid optimization of hardware-level data movement and memory layout here, aimed at supporting sustained, long-horizon agent execution.

Thoughts?

submitted by /u/superintelligence03