Retrieval from Within: An Intrinsic Capability of Attention-Based Models
Abstract
AI-generated summary
INTRA demonstrates that attention-based models can perform retrieval directly from internal representations, unifying retrieval and generation while improving evidence recall and answer quality.
Retrieval-augmented generation (RAG) typically treats retrieval and generation as separate systems. We ask whether an attention-based encoder-decoder can instead retrieve directly from its own internal representations. We introduce INTRA (INTrinsic Retrieval via Attention), a framework where decoder attention queries score pre-encoded evidence chunks that are then directly reused as context for generation. By construction, INTRA unifies retrieval and generation, eliminating the retriever-generator mismatch typical of RAG pipelines. This design also amortizes context encoding by reusing precomputed encoder states across queries. On question-answering benchmarks, INTRA outperforms strong engineered retrieval pipelines on both evidence recall and end-to-end answer quality. Our results demonstrate that attention-based models already possess a retrieval mechanism that can be elicited, rather than added as an external module.
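To make the mechanism concrete, the sketch below shows one way decoder attention queries could score pre-encoded evidence chunks, with the top-scoring chunks' cached encoder states reused directly as generation context. This is a minimal illustration under assumed dimensions, a stand-in key projection, and a max-pooled scoring rule; it is not the paper's actual architecture, whose details are not given here.

```python
import torch

# Minimal sketch of attention-based chunk retrieval. All dimensions, the
# key projection, and the max-pooling reduction are illustrative
# assumptions, not details taken from the paper.
d_model = 512                     # model width (assumed)
n_chunks, chunk_len = 100, 128    # pre-encoded evidence chunks (assumed)

# Offline: encode the evidence once and cache the encoder states, so the
# encoding cost is amortized across queries. Random tensors stand in for
# a real encoder here.
encoder_states = torch.randn(n_chunks, chunk_len, d_model)

# Online: a single decoder cross-attention query vector derived from the
# question. In INTRA this would come from the decoder's own attention
# heads; it is mocked here.
query = torch.randn(d_model)

# Score each chunk with the attention machinery itself: scaled query-key
# dot products over the chunk's tokens, reduced to one score per chunk.
W_k = torch.randn(d_model, d_model) / d_model ** 0.5  # stand-in key projection
keys = encoder_states @ W_k                      # (n_chunks, chunk_len, d_model)
token_scores = keys @ query / d_model ** 0.5     # (n_chunks, chunk_len)
chunk_scores = token_scores.max(dim=-1).values   # (n_chunks,)

# Retrieve: keep the top-k chunks and reuse their cached encoder states
# directly as cross-attention context for generation, with no re-encoding.
k = 5
top_chunks = chunk_scores.topk(k).indices
context = encoder_states[top_chunks].reshape(-1, d_model)
print(context.shape)  # torch.Size([640, 512])
```

Note how the cached encoder_states tensor is consumed twice, once for scoring and once as context: that reuse is what would let a single set of precomputed states serve both retrieval and generation, matching the amortization claim in the abstract.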
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs (2026): https://huggingface.co/papers/2604.12610
- Latent Abstraction for Retrieval-Augmented Generation (2026): https://huggingface.co/papers/2604.17866
- Bottleneck Tokens for Unified Multimodal Retrieval (2026): https://huggingface.co/papers/2604.11095
- A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation (2026): https://huggingface.co/papers/2604.14403
- R$^3$AG: Retriever Routing for Retrieval-Augmented Generation (2026): https://huggingface.co/papers/2604.22849
- From BM25 to Corrective RAG: Benchmarking Retrieval Strategies for Text-and-Table Documents (2026): https://huggingface.co/papers/2604.01733
- SAGE: Selective Attention-Guided Extraction for Token-Efficient Document Indexing (2026): https://huggingface.co/papers/2604.15583
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend