Hugging Face Daily Papers · May 20, 2026 · 8 min read

RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Yufeng Du, Phillip Harris, Minyang Tian, Eliu A Huerta, Srikanth Ronanki, Subendhu Rongali, Aram Galstyan, Hao Peng

arxiv:2605.15514

Published on May 15 · Submitted by Hao Peng on May 20 · University of Illinois at Urbana-Champaign


Abstract

AI-generated summary: Rotary Positional Embeddings in Transformer models lose locality bias and token relevance consistency as context length increases, leading to unpredictable attention patterns that cannot be mitigated by multi-head, multi-layer architectures.

We identify intrinsic limitations of Rotary Positional Embeddings (RoPE) in Transformer-based long-context language models. Our theoretical analysis abstracts away from the specific content of the context and depends only on its length. We prove that as context length increases, RoPE-based attention becomes unpredictable and loses two properties that are central to its effectiveness. First, it loses its locality bias: RoPE is no more likely to favor nearer positions than substantially farther ones. Second, it loses consistency in token relevance: a key vector that receives a higher attention score than an alternative at one position may receive a lower score at another. In both cases, the probability of failure approaches 0.5, no better than random guessing. We further prove that the attention score can remain unchanged when a key token is moved to a different position, or even replaced by a different token, indicating a failure to distinguish positions or tokens. Adjusting the RoPE base trades off distinguishing positions against distinguishing tokens but cannot preserve both at the same time. Increasing the RoPE base hyperparameter, a common practice in today's long-context models, helps distinguish different tokens, but inevitably sacrifices the ability to distinguish positions. Our empirical analysis shows that multi-head, multi-layer architectures are insufficient to overcome these limitations. Our findings suggest that fundamentally new mechanisms for encoding position and token order may be needed in future Transformer long-context language models.
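To make the objects in the abstract concrete, here is a minimal NumPy sketch of a single-head RoPE attention score and a toy probe of locality bias. It assumes the standard RoPE frequency schedule theta_i = base^(-2i/d) with consecutive coordinate pairs rotated together; the function names, the random query and key, and the sampling ranges are illustrative assumptions, not the authors' code or experimental setup. The probe only estimates how often a nearer key position outscores a farther one for one fixed query/key pair; it does not reproduce the paper's proofs.

```python
import numpy as np


def rope_frequencies(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Rotation frequencies theta_i = base^(-2i/d), one per coordinate pair."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)


def rope_rotate(x: np.ndarray, pos: int, freqs: np.ndarray) -> np.ndarray:
    """Rotate each consecutive coordinate pair of x by the angle pos * theta_i."""
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out


def rope_score(q, k, q_pos: int, k_pos: int, freqs: np.ndarray) -> float:
    """Pre-softmax attention logit between a rotated query and a rotated key."""
    return float(rope_rotate(q, q_pos, freqs) @ rope_rotate(k, k_pos, freqs))


def near_beats_far_rate(q, k, freqs, near_max, far_min, far_max, trials, rng):
    """Fraction of sampled (near, far) offset pairs where the near offset scores higher.

    Offsets are query_pos - key_pos; the key is placed at position 0.
    This is a hypothetical probe for illustration only.
    """
    near = rng.integers(1, near_max + 1, size=trials)
    far = rng.integers(far_min, far_max + 1, size=trials)
    wins = 0
    for d_near, d_far in zip(near, far):
        s_near = rope_score(q, k, int(d_near), 0, freqs)
        s_far = rope_score(q, k, int(d_far), 0, freqs)
        wins += s_near > s_far
    return wins / trials


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 64
    q = rng.standard_normal(d)
    k = q.copy()  # assumption: a content-matched query/key pair
    for base in (1e4, 1e6):
        freqs = rope_frequencies(d, base)
        rate = near_beats_far_rate(q, k, freqs, 16, 1_000, 100_000, 500, rng)
        print(f"base={base:g}  near-beats-far rate={rate:.3f}")
```

Varying `base`, the far range, and the choice of `q` and `k` is one way to explore the trade-off the abstract describes; the numbers from such a toy run should not be read as a replication of the paper's results.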

Community

haopeng01 · Paper submitter · May 20, 2026

LLMs often fail on inputs well within their advertised context lengths. We show that these failures are not merely engineering issues but stem from intrinsic limitations of RoPE in long contexts.

Main finding: In long contexts, RoPE-based attention frequently assigns the same attention weight to a token even when it is moved to different positions. Similarly, it can assign the same attention weight to different tokens at the same position.

In this sense, RoPE attention fails to distinguish both where a token appears and what token appears there — hence the title.
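A small numerical illustration of score collisions across positions, under stated assumptions: a single head, the standard RoPE frequencies theta_i = base^(-2i/d), and a random query and key. Writing each coordinate pair as a complex number, the logit at relative offset D is the real part of sum_i q_i * conj(k_i) * exp(1j * D * theta_i), which is bounded by |q||k|. With N candidate positions, N bounded scores must contain two that differ by at most 2|q||k|/(N-1). That is only a pigeonhole argument, far weaker than the paper's theorems, but it shows why near-identical scores at different positions become unavoidable as the context grows. The function names are hypothetical.

```python
import numpy as np


def rope_scores_by_offset(q, k, offsets, base: float = 10000.0) -> np.ndarray:
    """RoPE logits for one query/key pair at each relative offset in `offsets`."""
    d = q.shape[0]
    freqs = base ** (-np.arange(0, d, 2) / d)
    # Treat each (even, odd) coordinate pair as one complex number.
    q_c = q[0::2] + 1j * q[1::2]
    k_c = k[0::2] + 1j * k[1::2]
    coeff = q_c * np.conj(k_c)                       # shape (d/2,)
    phases = np.exp(1j * np.outer(offsets, freqs))   # shape (n_offsets, d/2)
    return (phases @ coeff).real


def closest_score_pair(scores: np.ndarray):
    """Return indices (a, b) of the two closest scores and the gap between them."""
    order = np.argsort(scores)
    gaps = np.diff(scores[order])
    j = int(np.argmin(gaps))
    return int(order[j]), int(order[j + 1]), float(gaps[j])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal(64)
    k = rng.standard_normal(64)
    for n in (256, 4096, 65536):
        scores = rope_scores_by_offset(q, k, np.arange(n))
        a, b, gap = closest_score_pair(scores)
        print(f"context={n:6d}  offsets {a} and {b} differ in score by {gap:.2e}")
```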

We prove these results theoretically and verify them empirically. While the theoretical analysis focuses on a single attention head, we complement it with experiments on real multi-layer, multi-head LLMs.
The experiments confirm failures predicted by our theory: LLMs optimized for needle-in-a-haystack-style retrieval will inevitably struggle on a very simple task that asks for the k-th item in a list.
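The k-th item task mentioned above can be sketched as a tiny prompt generator. The prompt wording, the item format, and the function name are assumptions made for illustration; the comment does not specify the format used in the paper's experiments. Items are deliberately left unnumbered so that answering requires tracking order rather than retrieving a labeled line.

```python
import random
from typing import Tuple


def make_kth_item_example(num_items: int, k: int, seed: int = 0) -> Tuple[str, str]:
    """Build a (prompt, expected_answer) pair for a 'k-th item in a list' probe.

    Hypothetical format: one unnumbered item per line, then a question about
    the k-th item, counting from 1.
    """
    if not 1 <= k <= num_items:
        raise ValueError("k must satisfy 1 <= k <= num_items")
    rng = random.Random(seed)
    # Sample distinct ids so the expected answer is unambiguous.
    ids = rng.sample(range(10**6), num_items)
    items = [f"item-{n:06d}" for n in ids]
    listing = "\n".join(f"- {item}" for item in items)
    prompt = (
        "Below is a list of items, one per line.\n"
        f"{listing}\n\n"
        f"Question: which item is number {k} in the list, counting from 1? "
        "Answer with the item only."
    )
    return prompt, items[k - 1]


if __name__ == "__main__":
    prompt, answer = make_kth_item_example(num_items=8, k=5, seed=42)
    print(prompt)
    print("expected:", answer)
```

Scaling `num_items` while holding `k` fixed, or sweeping `k` at a fixed length, gives a simple way to see where a given model's accuracy starts to drop.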

My personal takeaway: advertised context lengths should be interpreted with care. Future long-context LMs may require rethinking how position and token order are represented. With current architectures, agentic frameworks that break long contexts into shorter ones may be a more effective way to work around the intrinsic limitations of RoPE.
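As a rough sketch of the "break long contexts into shorter ones" idea, the helper below splits a token sequence into overlapping windows and applies a caller-supplied worker to each window. The names `split_into_chunks`, `map_over_chunks`, and `worker` are hypothetical, and the comment above does not describe any particular agentic framework; a real system would also need a step that merges the per-chunk results.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_chunks(tokens: Sequence[T], chunk_size: int, overlap: int = 0) -> List[List[T]]:
    """Split tokens into windows of at most chunk_size, sharing `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if len(tokens) == 0:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start_bound = max(len(tokens) - overlap, 1)
    return [list(tokens[i:i + chunk_size]) for i in range(0, last_start_bound, step)]


def map_over_chunks(
    tokens: Sequence[T],
    chunk_size: int,
    overlap: int,
    worker: Callable[[List[T]], str],
) -> List[str]:
    """Run a caller-supplied worker (for example, one model call) on each short window."""
    return [worker(chunk) for chunk in split_into_chunks(tokens, chunk_size, overlap)]


if __name__ == "__main__":
    words = "a long document split into short overlapping windows".split()
    for piece in map_over_chunks(words, chunk_size=4, overlap=1, worker=" ".join):
        print(piece)
```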

librarian-bot · May 21, 2026

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

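The following papers were recommended by the Semantic Scholar API:

* Remember to Forget: Gated Adaptive Positional Encoding (2026), https://huggingface.co/papers/2605.10414
* Short Data, Long Context: Distilling Positional Knowledge in Transformers (2026), https://huggingface.co/papers/2604.06070
* TIDE: Every Layer Knows the Token Beneath the Context (2026), https://huggingface.co/papers/2605.06216
* Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings (2026), https://huggingface.co/papers/2604.18603
* Shuffle the Context: RoPE-Perturbed Self-Distillation for Long-Context Adaptation (2026), https://huggingface.co/papers/2604.14339
* EndPrompt: Efficient Long-Context Extension via Terminal Anchoring (2026), https://huggingface.co/papers/2605.14589
* Screening Is Enough (2026), https://huggingface.co/papers/2604.01178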

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Get this paper in your agent:

hf papers read 2605.15514

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
