Hugging Face Daily Papers · May 26, 2026 · 6 min read

Your Embedding Model is SMARTer Than You Think

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Our work introduces SMART, a framework that unlocks the latent multi-vector capabilities of standard single-vector models for multimodal retrieval. While single-vector retrievers are highly efficient, they often discard the fine-grained, local evidence critical for dense retrieval tasks. To address this, SMART applies direct late-interaction over the frozen hidden states of your model during inference, acting as a plug-and-play upgrade that consistently improves performance across diverse modalities.

arxiv:2605.24938

Published on May 24

Submitted by Harris Zhang on May 26

Authors: Jianrui Zhang, Hyun Jung Lee, Sukanta Ganguly, Tae-Eui Kam, Donghyun Kim, Yong Jae Lee

Abstract

AI-generated summary: SMART enhances multimodal retrieval by leveraging latent multi-vector capabilities from single-vector models through contrastive training and late-interaction inference, achieving state-of-the-art performance with reduced computational costs.

Multimodal retrieval relies heavily on single-vector retrievers, which compress rich, sequential token sequences into one single global representation. While efficient, they discard fine-grained, local evidence critical for dense retrieval tasks. Multi-vector approaches were introduced as a solution, but they strictly require training and many ignore the necessity of a globally summarizing representation. To address this, we introduce SMART, a framework that unlocks the latent multi-vector capabilities of standard single-vector models. We first demonstrate that standard contrastive training on the pooled embedding implicitly shapes the retrieval geometry of preceding hidden states via gradient flow. By applying direct late-interaction over these frozen hidden states during inference, SMART acts as a plug-and-play upgrade that consistently improves performance across diverse modalities, improving even the state-of-the-art models further on MMEB-V2. We also reveal SMART's superior performance, as simple lightweight post-training not only saves time and compute, but also brings forth further improvement on Visual Document retrieval, allowing a single-vector model to outperform SoTA multi-vector counterparts. Ultimately, SMART offers both a highly efficient inference enhancement and a powerful finetuning technique for multimodal retrieval. We open source our code and weights at https://github.com/HanSolo9682/SMART.
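To make the contrast between a pooled embedding and late interaction concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the function names are hypothetical, mean pooling is an assumed pooling choice, the late-interaction score is the generic ColBERT-style MaxSim (averaged over query tokens), and the `alpha` blend of global and local scores is an illustrative assumption, since the abstract does not state how SMART combines them. The token vectors stand in for frozen hidden states.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(tokens: Sequence[Vector]) -> List[float]:
    """Average token vectors into one global vector (assumed pooling choice)."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def single_vector_score(query_tokens, doc_tokens) -> float:
    """Cosine similarity between the two pooled (global) embeddings."""
    q = l2_normalize(mean_pool(query_tokens))
    d = l2_normalize(mean_pool(doc_tokens))
    return dot(q, d)


def late_interaction_score(query_tokens, doc_tokens) -> float:
    """Generic MaxSim: each query token keeps its best-matching document token.

    Averaging (rather than summing) over query tokens is a choice made here
    so the value stays on the same scale as a cosine similarity.
    """
    q = [l2_normalize(t) for t in query_tokens]
    d = [l2_normalize(t) for t in doc_tokens]
    return sum(max(dot(qt, dt) for dt in d) for qt in q) / len(q)


def combined_score(query_tokens, doc_tokens, alpha: float = 0.5) -> float:
    """Hypothetical blend of global and local evidence; alpha is an assumption."""
    global_part = single_vector_score(query_tokens, doc_tokens)
    local_part = late_interaction_score(query_tokens, doc_tokens)
    return alpha * global_part + (1.0 - alpha) * local_part


# Toy "hidden states": two tokens per item, two dimensions each.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # contains both query concepts as distinct tokens
doc_b = [[1.0, 1.0], [1.0, 1.0]]  # every token is a blurred mix of the two

for name, doc in (("doc_a", doc_a), ("doc_b", doc_b)):
    print(
        name,
        round(single_vector_score(query, doc), 3),
        round(late_interaction_score(query, doc), 3),
        round(combined_score(query, doc), 3),
    )
```

In this toy setup the pooled vectors of `doc_a` and `doc_b` point in the same direction, so the single-vector score cannot separate them, while the token-level MaxSim score ranks `doc_a` higher. That is the kind of fine-grained, local evidence the abstract says pooling discards.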

Community

HanSolo9682

Paper author Paper submitter about 6 hours ago

Not only does this approach push state-of-the-art performance further on MMEB-V2, but with simple, lightweight post-training, it also enables single-vector models to outperform heavily trained multi-vector counterparts on Visual Document retrieval. Ultimately, it offers both a highly efficient inference enhancement and a powerful finetuning technique to get the absolute most out of existing embedding models, saving both time and compute. Code and weights are open-sourced!

ZebangCheng

about 6 hours ago

Good job!!!

ZebangCheng

about 6 hours ago

Very interesting idea! I’m curious whether SMART works equally well for smaller or older embedding models, or if its gains mainly depend on strong modern backbones like Qwen3-VL-Embedding.

HanSolo9682

Paper author about 6 hours ago · edited about 6 hours ago

Great question! We did try SMART on other models, such as VLM2Vec and GME, to showcase the generalizability of our method. See Table 1 for the results.

Get this paper in your agent:

hf papers read 2605.24938

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
