Hugging Face Daily Papers · 4 min read

Geometric Factual Recall in Transformers

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.12426

Geometric Factual Recall in Transformers

Published on May 12 · Submitted by Shauli Ravfogel on May 13
Authors: Shauli Ravfogel, Gilad Yehudai, Joan Bruna, Alberto Bietti
Abstract

Transformer language models use geometric memorization where embeddings encode linear superpositions of attributes and MLPs act as relation-conditioned selectors rather than associative key-value mappings.

AI-generated summary

How do transformer language models memorize factual associations? A common view casts internal weight matrices as associative memories over pairs of embeddings, requiring parameter counts that scale linearly with the number of facts. We develop a theoretical and empirical account of an alternative, geometric form of memorization in which learned embeddings encode relational structure directly, and the MLP plays a qualitatively different role. In a controlled setting where a single-layer transformer must memorize random bijections from subjects to a shared attribute set, we prove that a logarithmic embedding dimension suffices: subject embeddings encode linear superpositions of their associated attribute vectors, and a small MLP acts as a relation-conditioned selector that extracts the relevant attribute via ReLU gating, and not as an associative key-value mapping. We extend these results to the multi-hop setting (chains of relational queries such as "Who is the mother of the wife of x?"), providing constructions with and without chain-of-thought that exhibit a provable capacity-depth tradeoff, complemented by a matching information-theoretic lower bound. Empirically, gradient descent discovers solutions with precisely the predicted structure. Once trained, the MLP transfers zero-shot to entirely new bijections when subject embeddings are appropriately re-initialized, revealing that it has learned a generic selection mechanism rather than memorized any particular set of facts.
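The core geometric claim can be sanity-checked numerically. The sketch below is not the paper's construction or code; it is a toy numpy illustration, with all names and dimensions chosen for the example. Attribute vectors are random near-orthogonal sign vectors, each subject embedding is the linear superposition of its attribute vectors (one per relation), and a relation-conditioned dot-product readout recovers facts despite the embedding dimension being far smaller than the number of facts. Treating the attribute set as the subject set itself also lets us chain relations for a crude two-hop query.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_relations, d = 200, 2, 128  # d << n_subjects * n_relations facts

# One near-orthogonal random sign vector per (relation, attribute); the
# attribute set is taken to be the subject set so relations can be chained
attr_vecs = rng.choice([-1.0, 1.0], size=(n_relations, n_subjects, d)) / np.sqrt(d)

# Each relation is a random bijection (permutation) over subjects
perms = np.stack([rng.permutation(n_subjects) for _ in range(n_relations)])

# Geometric memorization: a subject's embedding is the linear
# superposition of its attribute vectors, one per relation
subj = attr_vecs[np.arange(n_relations)[:, None], perms].sum(axis=0)

def recall(s, r):
    # Relation-conditioned readout: the queried relation's correct attribute
    # vector has dot product ~1 with the embedding; all cross-terms between
    # independent random sign vectors are O(1/sqrt(d)) noise
    return int(np.argmax(attr_vecs[r] @ subj[s]))

one_hop = np.mean([recall(s, r) == perms[r, s]
                   for s in range(n_subjects) for r in range(n_relations)])
# Crude multi-hop recall by chaining one-hop readouts:
# "relation 1 of relation 0 of s"
two_hop = np.mean([recall(recall(s, 0), 1) == perms[1, perms[0, s]]
                   for s in range(n_subjects)])
print(one_hop, two_hop)
```

Because the cross-term noise shrinks like 1/sqrt(d) while the signal stays at 1, a dimension logarithmic in the number of subjects is enough to separate the correct attribute from the rest, which is the intuition behind the paper's capacity result.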

Community

Paper submitter

We show that transformers memorize facts geometrically: subject embeddings encode superpositions of their attributes, and the MLP acts as a generic relation-conditioned selector.
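The "relation-conditioned selector" can itself be written as an explicit two-layer ReLU network rather than a dot-product readout. The following is a hand-built illustrative sketch, not the paper's trained MLP: the hidden layer holds one gated copy of the subject embedding per relation, the ReLU zeroes every copy except the queried relation's, and a fixed linear readout scores that relation's attribute vectors. All sizes and the constant `C` are assumptions of the example (here the map from subjects to the small shared attribute set is merely random, not a bijection).

```python
import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_relations, n_attrs, d = 100, 3, 8, 128

# Shared attribute set: one random sign vector per (relation, attribute)
attr_vecs = rng.choice([-1.0, 1.0], size=(n_relations, n_attrs, d)) / np.sqrt(d)
facts = rng.integers(0, n_attrs, size=(n_subjects, n_relations))

# Subject embeddings: superposition of the subject's attribute vectors
subj = attr_vecs[np.arange(n_relations)[None, :], facts].sum(axis=1)

C = 10.0  # any bound larger than the largest |embedding coordinate|

def selector(s, r):
    """Two-layer ReLU gate: for relation q, the bias C*rel[q] - C is 0 when
    q is queried and -C otherwise, so ReLU(x) - ReLU(-x) reproduces the
    embedding for the queried relation and zeroes it for all others."""
    rel = np.zeros(n_relations)
    rel[r] = 1.0
    logits = np.zeros(n_attrs)
    for q in range(n_relations):
        pos = np.maximum(subj[s] + (C * rel[q] - C), 0.0)
        neg = np.maximum(-subj[s] + (C * rel[q] - C), 0.0)
        gated = pos - neg  # equals subj[s] iff q == r, else zero
        logits += attr_vecs[q] @ gated
    return int(np.argmax(logits))

acc = np.mean([selector(s, r) == facts[s, r]
               for s in range(n_subjects) for r in range(n_relations)])
print(acc)
```

Note that nothing in `selector` depends on which facts were memorized; only the subject embeddings do. That separation is the toy analogue of the paper's zero-shot transfer finding: the same selector weights work for any freshly re-initialized set of embeddings.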

image



Get this paper in your agent:

hf papers read 2605.12426
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper


Datasets citing this paper 0

No dataset linking this paper


Spaces citing this paper 0

No Space linking this paper


Collections including this paper 0

No Collection including this paper


Discussion (0)


No comments yet.
