Elastic Attention Cores for Scalable Vision Transformers [R]
Wanted to share our latest paper on an alternative building block for Vision Transformers.

[Figure: illustration of our model's accuracy and dense features]

Traditional ViTs use dense O(N²) self-attention, which becomes quite costly at higher resolutions. In this work, we propose an alternative backbone with a core-periphery block-sparse attention structure that scales as O(2NC + C²) for C core tokens. We further train the model with nested dropout, which enables elastic adjustment of the inference cost at test time. The full model achieves very competitive dense and classification accuracy compared with DINOv3, and is stable across resolutions (256 all the way to 1024).

Interestingly, the core-dense attention patterns exhibit strong emergent behavior: at early layers of the network the attention maps are isotropic (spherical), but they become increasingly semantically aligned deeper into the network.

[Figure: Visual Elastic Core Attention paper abstract]

Adjusting the number of core tokens also changes the attention patterns: decreasing the number of cores makes them more diffuse, covering a spatially larger region, while increasing it makes them smaller and more concentrated.

Paper: https://arxiv.org/abs/2605.12491

Project with the code (still in progress): https://github.com/alansong1322/VECA

Happy to answer any questions about our research.
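To make the attention structure concrete, here is a minimal single-head PyTorch sketch of the core-periphery pattern and the nested-dropout-style core sampling. This is illustrative only, not the released code: names like `core_periphery_attention` and `num_cores` are made up for exposition, and the sketch drops the projections and multi-head details, so see the repo for the real implementation.

```python
import torch
import torch.nn.functional as F

def core_periphery_attention(x, num_cores):
    """Single-head, projection-free sketch. x: (B, N, D); the first
    `num_cores` tokens are treated as the core set."""
    C = num_cores
    # Core rows attend to every token: cost ~ C*N.
    core_out = F.scaled_dot_product_attention(x[:, :C], x, x)
    # Peripheral rows attend only to the cores: cost ~ (N-C)*C.
    peri_out = F.scaled_dot_product_attention(x[:, C:], x[:, :C], x[:, :C])
    # Total ~ 2NC rather than the N^2 of dense attention.
    return torch.cat([core_out, peri_out], dim=1)

# Nested-dropout-style training: randomly truncate the core set each step,
# so any prefix of the cores yields a usable model at test time.
def sampled_core_count(max_cores):
    return int(torch.randint(1, max_cores + 1, (1,)))
```

At inference you would then just pick `num_cores` to fit your compute budget; consistent with the observation above, a smaller core set means each core's attention spreads over a larger spatial region.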