Your Embedding Model is SMARTer Than You Think
Authors: Jianrui Zhang, Hyun Jung Lee, Sukanta Ganguly, Tae-Eui Kam, Donghyun Kim, Yong Jae Lee
Published: 2026-05-24
Abstract
AI-generated summary: SMART enhances multimodal retrieval by leveraging latent multi-vector capabilities from single-vector models through contrastive training and late-interaction inference, achieving state-of-the-art performance with reduced computational costs.
Multimodal retrieval relies heavily on single-vector retrievers, which compress rich token sequences into a single global representation. While efficient, they discard fine-grained, local evidence critical for dense retrieval tasks. Multi-vector approaches were introduced as a solution, but they strictly require training, and many ignore the need for a globally summarizing representation. To address this, we introduce SMART, a framework that unlocks the latent multi-vector capabilities of standard single-vector models. We first demonstrate that standard contrastive training on the pooled embedding implicitly shapes the retrieval geometry of the preceding hidden states via gradient flow. By applying direct late interaction over these frozen hidden states during inference, SMART acts as a plug-and-play upgrade that consistently improves performance across diverse modalities, further improving even state-of-the-art models on MMEB-V2. We also show that simple, lightweight post-training with SMART saves time and compute while bringing further improvement on Visual Document retrieval, allowing a single-vector model to outperform SoTA multi-vector counterparts. Ultimately, SMART offers both a highly efficient inference enhancement and a powerful finetuning technique for multimodal retrieval. We open source our code and weights at https://github.com/HanSolo9682/SMART.
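To make the contrast between pooled scoring and late interaction concrete, here is a minimal sketch in plain Python. It is not the SMART implementation: it assumes last-token pooling for the single-vector score and a ColBERT-style MaxSim averaged over query tokens for the late-interaction score, and the function names and toy vectors are hypothetical stand-ins for a model's per-token hidden states.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def single_vector_score(query_states, doc_states):
    """Cosine similarity between pooled embeddings.

    Assumption: the pooled embedding is the last token's hidden state.
    """
    return dot(l2_normalize(query_states[-1]), l2_normalize(doc_states[-1]))


def late_interaction_score(query_states, doc_states):
    """MaxSim: each query token keeps its best-matching document token.

    Assumption: the per-token maxima are averaged over query tokens.
    """
    q = [l2_normalize(v) for v in query_states]
    d = [l2_normalize(v) for v in doc_states]
    return sum(max(dot(qv, dv) for dv in d) for qv in q) / len(q)


# Hypothetical hidden states: one 3-dimensional vector per token.
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
doc_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # matches every query token
doc_b = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]                   # matches only the last one

for name, doc in (("doc_a", doc_a), ("doc_b", doc_b)):
    print(name, single_vector_score(query, doc), late_interaction_score(query, doc))
```

In this toy setup the pooled score prefers doc_b because only the final states are compared, while the token-level score prefers doc_a, which covers both query tokens. That is the kind of local evidence a single pooled vector can miss.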
Community
Our work introduces SMART, a framework that unlocks the latent multi-vector capabilities of standard single-vector models for multimodal retrieval. While single-vector retrievers are highly efficient, they often discard the fine-grained, local evidence critical for dense retrieval tasks. To address this, SMART applies direct late-interaction over the frozen hidden states of your model during inference, acting as a plug-and-play upgrade that consistently improves performance across diverse modalities.
Not only does this approach push state-of-the-art performance further on MMEB-V2, but with simple, lightweight post-training, it also enables single-vector models to outperform heavily trained multi-vector counterparts on Visual Document retrieval. Ultimately, it offers both a highly efficient inference enhancement and a powerful finetuning technique to get the absolute most out of existing embedding models, saving both time and compute. Code and weights are open-sourced!
Very interesting idea! I’m curious whether SMART works equally well for smaller or older embedding models, or if its gains mainly depend on strong modern backbones like Qwen3-VL-Embedding.
Great question! We did try SMART on other models, such as VLM2Vec and GME, to show that the method generalizes. Table 1 is the place to look!
Paper: arxiv.org/abs/2605.24938