Hugging Face Daily Papers · 4 min read

Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.07502

Published on Jun 5, 2026 · Submitted by Songhao Wu on Jun 8, 2026
#1 Paper of the day
Authors: Songhao Wu, Zhongxin Chen, Yuxuan Liu, Heng Cui, Cong Li, Rui Yan

Abstract

Text embeddings from large language models are enhanced by EmbedFilter, a linear transformation that reduces the influence of high-frequency tokens and improves semantic representations while enabling dimensionality reduction.

Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause of this deficiency. Our motivation stems from an unexpected observation: text embeddings tend to align with frequent but uninformative tokens when projected onto the vocabulary space. We argue that this excessive expression of high-frequency tokens suppresses the model's ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation designed to directly refine text embeddings derived from LLMs. Specifically, we find that the unembedding matrix within LLMs encodes a latent subspace that actively writes these frequent tokens into the embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens, thereby enhancing semantic representations. As a compelling byproduct, this enables an inherent dimensionality reduction, lowering index storage and speeding up retrieval while fully preserving the refined embedding quality. Our experiments across multiple LLM backbones demonstrate that LLMs equipped with EmbedFilter achieve superior zero-shot downstream performance even with significantly reduced embedding dimensions. We hope our findings provide deeper insights into the mechanisms of LLM-based representations and inspire more principled designs to improve text embedding training. Our code is available at https://github.com/CentreChen/EmbFilter.
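The two ideas in the abstract, reading an embedding through the unembedding matrix and filtering out a subspace tied to frequent tokens, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the random matrix `W_U`, the choice of `frequent_ids`, and the helper names `vocab_logits` and `build_filter` are all hypothetical, and the way the paper actually selects the subspace may differ. The sketch only assumes that the unwanted subspace is spanned by the unembedding rows of a few chosen tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 100, 32

# Stand-in unembedding matrix, one row per token (assumption: a real one
# would be taken from the LLM's output projection).
W_U = rng.normal(size=(vocab_size, hidden_dim))

# Assumption: these ids play the role of frequent, uninformative tokens.
frequent_ids = [0, 1, 2]


def vocab_logits(embedding, unembedding):
    """Project a text embedding onto the vocabulary space."""
    return unembedding @ embedding


def build_filter(unembedding, token_ids):
    """Orthogonal projector that removes the span of the chosen rows."""
    rows = unembedding[token_ids]            # shape (k, d)
    Q, _ = np.linalg.qr(rows.T)              # shape (d, k), orthonormal columns
    return np.eye(unembedding.shape[1]) - Q @ Q.T


# Synthetic text embedding dominated by the frequent-token directions.
embedding = rng.normal(size=hidden_dim) + 5.0 * W_U[frequent_ids].sum(axis=0)

P = build_filter(W_U, frequent_ids)
filtered = P @ embedding                     # a single linear transformation

before = vocab_logits(embedding, W_U)[frequent_ids]
after = vocab_logits(filtered, W_U)[frequent_ids]
print("frequent-token logits before:", np.round(before, 2))
print("frequent-token logits after: ", np.round(after, 2))
```

Because `P` projects onto the orthogonal complement of the selected rows, the filtered embedding has zero logit on those tokens by construction, while components outside that subspace are left unchanged.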

Community

We show that the unembedding matrix within LLMs serves as an overlooked feature extractor that comes for free. It encodes a latent semantic space; filtering out its effects from the primary text embeddings markedly improves zero-shot representation performance. We also empirically confirm that this can be achieved through a simple linear transformation, which reduces vector dimensionality as a bonus.
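Why filtering a subspace can also shrink the vectors is easy to see in a short sketch. Assuming the unwanted subspace has dimension `k`, an orthonormal basis of its complement maps each `d`-dimensional embedding to `d - k` coordinates, and inner products between filtered embeddings are unchanged. The matrix `unwanted`, the value of `k`, and the helper `reduce_dim` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_dim, k = 32, 4

# Assumption: the columns of `unwanted` span the subspace to filter out
# (for example, directions derived from the unembedding matrix).
unwanted = rng.normal(size=(hidden_dim, k))

# A complete QR factorization yields an orthonormal basis of the whole space:
# the first k columns span the unwanted subspace, the rest span its complement.
Q_full, _ = np.linalg.qr(unwanted, mode="complete")
keep_basis = Q_full[:, k:]                   # shape (d, d - k)


def reduce_dim(embeddings):
    """Filter and reduce in one linear map: (n, d) -> (n, d - k)."""
    return embeddings @ keep_basis


# Equivalent full-size filter, kept only for comparison.
projector = keep_basis @ keep_basis.T        # shape (d, d)

embeddings = rng.normal(size=(6, hidden_dim))
reduced = reduce_dim(embeddings)
filtered_full = embeddings @ projector

# Pairwise inner products are identical in both representations.
same = np.allclose(reduced @ reduced.T, filtered_full @ filtered_full.T)
print(reduced.shape, same)
```

Storing `reduced` instead of `filtered_full` saves `k` coordinates per vector in an index without changing dot-product rankings among the filtered embeddings.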


Get this paper in your agent:

hf papers read 2606.07502
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
