Hugging Face Daily Papers · 5 min read

KaLM-Reranker-V1: Fast but Not Late Interaction for Compressed Document Reranking

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.22807

Published on Jun 22 · Submitted by Xinping Zhao on Jun 23
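Authors: Xinping Zhao, Jiaxin Xu, Ziqi Dai, Xin Zhang, Shouzheng Huang, Danyu Tang, Xinshuo Hu, Meishan Zhang, Baotian Hu, Min Zhang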

Abstract

KaLM-Reranker-V1 is a fast reranker that decouples query and passage computation using an encoder-decoder architecture, with Matryoshka embedding pooling and cross-attention for efficient relevance modeling.

As retrieval systems scale, high-quality reranking becomes increasingly important. However, most existing rerankers, whether encoder-based or decoder-based, jointly encode the query and passage, tightly coupling their computation and limiting deployment efficiency as well as flexibility. We present KaLM-Reranker-V1, a fast but not late-interaction (FBNL) reranker that decouples query and passage computation while retaining expressive relevance modeling. Built on an encoder-decoder architecture, KaLM-Reranker-V1 uses the encoder to pre-encode passages with Matryoshka embedding pooling, while the decoder models the system instruction, user instruction, and query intent; cross-attention then captures relevance between the query context and passage representations. This design makes KaLM-Reranker-V1 efficient through decoupled passage encoding, yet not late interaction, by preserving rich relevance modeling through cross-attention. We instantiate KaLM-Reranker-V1 in three sizes, Nano, Small, and Large, with 0.27B, 1B, and 4B activated parameters, respectively. Extensive experiments on BEIR, MIRACL, and LMEB demonstrate that KaLM-Reranker-V1 achieves strong reranking performance with superior efficiency. On BEIR, KaLM-Reranker-V1 achieves state-of-the-art performance, on par with strong industrial models such as the Qwen3-Reranker series; on MIRACL, despite not being extensively trained on multilingual data, KaLM-Reranker-V1 still shows excellent reranking performance. Moreover, on LMEB, reranking models demonstrate a clear advantage, with even the 0.27B Nano model remaining competitive with 7-12B embedding models.
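
The decoupling described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation, and it rests on loud assumptions: chunked mean pooling stands in for Matryoshka embedding pooling (whose details are not given on this page), one parameter-free attention step stands in for the decoder's cross-attention, and the dot-product score is a placeholder for a learned relevance head. All function names and the toy vectors are hypothetical.

```python
import math

def mean_pool_chunks(token_vecs, num_slots):
    """Compress token vectors into at most num_slots vectors.

    ASSUMPTION: contiguous-chunk averaging is only a stand-in for the
    paper's Matryoshka embedding pooling, which is not specified here.
    """
    n = len(token_vecs)
    num_slots = min(num_slots, n)
    dim = len(token_vecs[0])
    pooled = []
    for s in range(num_slots):
        chunk = token_vecs[s * n // num_slots:(s + 1) * n // num_slots]
        pooled.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return pooled

def cross_attention(query_vecs, passage_slots):
    """Scaled dot-product attention from query positions to passage slots."""
    dim = len(query_vecs[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in query_vecs:
        logits = [scale * sum(a * b for a, b in zip(q, p)) for p in passage_slots]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        outputs.append([
            sum(w / total * p[d] for w, p in zip(weights, passage_slots))
            for d in range(dim)
        ])
    return outputs

def relevance_score(query_vecs, passage_slots):
    """ASSUMPTION: a plain dot product replaces a learned scoring head."""
    attended = cross_attention(query_vecs, passage_slots)
    dots = [sum(a * b for a, b in zip(q, o)) for q, o in zip(query_vecs, attended)]
    return sum(dots) / len(dots)

def build_passage_cache(passages, num_slots):
    """Offline step: runs once per passage, independent of any query."""
    return {pid: mean_pool_chunks(vecs, num_slots) for pid, vecs in passages.items()}

def rerank(query_vecs, cache):
    """Online step: only the query side is computed at request time."""
    scores = {pid: relevance_score(query_vecs, slots) for pid, slots in cache.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical 2-D "token embeddings" for two passages and one query.
passages = {
    "a": [[1.0, 0.0], [1.0, 0.0], [0.8, 0.2], [0.9, 0.1]],
    "b": [[0.0, 1.0], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]],
}
cache = build_passage_cache(passages, num_slots=2)  # 4 tokens -> 2 slots each
query_vecs = [[1.0, 0.0]]
ranking = rerank(query_vecs, cache)
print(ranking)
```

In this sketch the passage cache is built without seeing the query, so it can be precomputed and stored, while the score still comes from attention between query positions and passage slots rather than from a single fixed similarity between two pooled vectors.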

Community

Xinping Zhao (paper author, paper submitter) · about 22 hours ago

We present KaLM-Reranker-V1, a fast but not late-interaction (FBNL) reranker that decouples query and passage computation while retaining expressive relevance modeling.

Built on an encoder-decoder architecture, KaLM-Reranker-V1 uses the encoder to pre-encode passages with Matryoshka embedding pooling, while the decoder models the system instruction, user instruction, and query intent; cross-attention then captures relevance between the query context and passage representations. This design makes KaLM-Reranker-V1 efficient through decoupled passage encoding, yet not late interaction, by preserving rich relevance modeling through cross-attention.

We instantiate KaLM-Reranker-V1 in three sizes, Nano, Small, and Large, with 0.27B, 1B, and 4B activated parameters, respectively.

Wenhao Li

Awesome job! Learned so much, thanks a lot!

Get this paper in your agent:

hf papers read 2606.22807
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 3

Collections including this paper 2
