MiniMax dropped a new attention architecture. [N]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
The release has something interesting to say about context windows. MiniMax reports native scaling to 1M tokens with MiniMax Sparse Attention (MSA), which bypasses standard quadratic attention complexity by restructuring memory access patterns at the operator level. Instead of relying on typical sparse approximations that degrade recall, MSA uses a "KV outer gather Q" approach: KV blocks form the outer loop, and each block aggregates the queries that hit it, so hardware memory reads stay strictly contiguous and each block is fetched exactly once.

The reported low-level performance gains are interesting:
→ 4× faster execution than Flash-Sparse-Attention.
→ Per-token compute drops to 1/20th of that of their previous-generation models at full 1M context depth.
→ 9× speedup in prefill and 15× speedup in decoding.

It also claims to be the first open-weight model with all three: frontier coding, 1M context, and native multimodality. This looks like good optimization of hardware-level data transport and memory layouts to support sustained, long-horizon agent execution.

Thoughts?
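For anyone who wants to see the loop ordering concretely, here is a minimal NumPy sketch of block-sparse attention with the KV block as the outer loop. This is not MiniMax's kernel. The function names, the boolean `hits` mask (which queries selected which KV block), and the use of an online softmax to merge per-block partial results are all my own assumptions for illustration. It also assumes every query hits at least one block and that `hits` has one column per KV block.

```python
import numpy as np


def sparse_attention_kv_outer(q, k, v, hits, block_size):
    """Block-sparse attention with KV blocks as the outer loop (illustrative).

    q: (n_q, d), k: (n_kv, d), v: (n_kv, d_v)
    hits: bool (n_q, n_blocks); hits[i, b] means query i attends to KV block b.
    Hypothetical helper, not an API from any library.
    """
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(n_q, -np.inf)            # running max of scores per query
    l = np.zeros(n_q)                    # running softmax denominator per query
    acc = np.zeros((n_q, v.shape[1]))    # running weighted sum of values

    for b in range(hits.shape[1]):
        q_idx = np.nonzero(hits[:, b])[0]    # gather the queries that hit block b
        if q_idx.size == 0:
            continue
        lo = b * block_size
        hi = min(lo + block_size, k.shape[0])
        k_blk, v_blk = k[lo:hi], v[lo:hi]    # one contiguous slice, read once

        s = (q[q_idx] @ k_blk.T) * scale     # scores, shape (n_hit, block_len)
        m_new = np.maximum(m[q_idx], s.max(axis=1))
        alpha = np.exp(m[q_idx] - m_new)     # rescale previously accumulated state
        p = np.exp(s - m_new[:, None])
        l[q_idx] = l[q_idx] * alpha + p.sum(axis=1)
        acc[q_idx] = acc[q_idx] * alpha[:, None] + p @ v_blk
        m[q_idx] = m_new

    return acc / l[:, None]


def sparse_attention_q_outer(q, k, v, hits, block_size):
    """Reference: query as the outer loop, gathering its selected KV rows."""
    scale = 1.0 / np.sqrt(q.shape[1])
    out = np.zeros((q.shape[0], v.shape[1]))
    for i in range(q.shape[0]):
        rows = np.concatenate([
            np.arange(b * block_size, min((b + 1) * block_size, k.shape[0]))
            for b in np.nonzero(hits[i])[0]
        ])
        s = (k[rows] @ q[i]) * scale
        p = np.exp(s - s.max())
        out[i] = (p / p.sum()) @ v[rows]
    return out
```

In the query-outer version, each query gathers scattered KV rows, and a block shared by many queries gets read many times. In the KV-outer version, the only scattered access is on the query side, and each KV block is sliced once. How that maps onto real GPU memory traffic is up to the actual kernel, which this sketch does not model.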