Geometric Factual Recall in Transformers
Abstract

How do transformer language models memorize factual associations? A common view casts internal weight matrices as associative memories over pairs of embeddings, requiring parameter counts that scale linearly with the number of facts. We develop a theoretical and empirical account of an alternative, geometric form of memorization in which learned embeddings encode relational structure directly, and the MLP plays a qualitatively different role. In a controlled setting where a single-layer transformer must memorize random bijections from subjects to a shared attribute set, we prove that a logarithmic embedding dimension suffices: subject embeddings encode linear superpositions of their associated attribute vectors, and a small MLP acts as a relation-conditioned selector that extracts the relevant attribute via ReLU gating, rather than as an associative key-value mapping. We extend these results to the multi-hop setting -- chains of relational queries such as "Who is the mother of the wife of x?" -- providing constructions with and without chain-of-thought that exhibit a provable capacity-depth tradeoff, complemented by a matching information-theoretic lower bound. Empirically, gradient descent discovers solutions with precisely the predicted structure. Once trained, the MLP transfers zero-shot to entirely new bijections when subject embeddings are appropriately re-initialized, revealing that it has learned a generic selection mechanism rather than memorized any particular set of facts.

AI-generated summary

Transformer language models use geometric memorization, in which embeddings encode linear superpositions of attributes and MLPs act as relation-conditioned selectors rather than associative key-value mappings.
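To make the geometric picture concrete, here is a minimal NumPy sketch of the idea (not the paper's construction): subject embeddings are built as superpositions of random, near-orthogonal attribute vectors, and a query (subject, relation) is answered by a relation-conditioned, ReLU-thresholded readout against that relation's attribute dictionary. The sizes (1,000 subjects, 4 relations, 50 attributes, dimension 128) and the threshold are arbitrary illustrative choices, and the hand-built readout stands in for the paper's trained MLP selector.

```python
# Illustrative sketch (not the paper's construction): facts are stored as
# superpositions of random attribute vectors, and a query is answered by a
# relation-conditioned, ReLU-thresholded readout.
import numpy as np

rng = np.random.default_rng(0)

num_subjects = 1000    # subjects s
num_relations = 4      # relations r ("mother of", "born in", ...)
num_attributes = 50    # shared attribute set per relation
dim = 128              # embedding dimension, far below num_subjects

# One random unit vector per (relation, attribute) pair; in moderately high
# dimension these vectors are nearly orthogonal with high probability.
attr_vecs = rng.standard_normal((num_relations, num_attributes, dim))
attr_vecs /= np.linalg.norm(attr_vecs, axis=-1, keepdims=True)

# Random facts: facts[s, r] is the attribute of subject s under relation r.
facts = rng.integers(0, num_attributes, size=(num_subjects, num_relations))

# Geometric memorization: each subject embedding is the linear superposition
# of the attribute vectors of its facts, one per relation.
subj_emb = attr_vecs[np.arange(num_relations), facts].sum(axis=1)  # (S, dim)

def answer(s, r, tau=0.5):
    """Answer the query (subject s, relation r).

    Scores the subject embedding against relation r's attribute dictionary and
    zeroes small cross-relation interference with a ReLU threshold; this
    thresholded readout is a hand-built stand-in for the trained MLP selector.
    """
    scores = attr_vecs[r] @ subj_emb[s]        # (num_attributes,)
    gated = np.maximum(scores - tau, 0.0)      # ReLU gating
    return int(np.argmax(gated))

queries = [(s, r) for s in range(num_subjects) for r in range(num_relations)]
correct = sum(answer(s, r) == facts[s, r] for s, r in queries)
print(f"recovered {correct}/{len(queries)} facts with dim={dim}")
```

With these sizes the readout recovers essentially all 4,000 facts even though the embedding dimension is far smaller than the number of subjects; in the paper, the analogous selection is carried out by a small trained MLP, with embedding-dimension bounds that are logarithmic in the number of subjects.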
Community
We show that transformers memorize facts geometrically: subject embeddings encode superpositions of their attributes, and the MLP acts as a generic relation-conditioned selector.
