Hugging Face Daily Papers · May 27, 2026 · 5 min read

Can LLMs Introspect? A Reality Check

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Shashwat Singh, Tal Linzen, Shauli Ravfogel

arxiv:2605.26242

Published on May 25

Submitted by Shauli Ravfogel on May 27 · NYU Machine Learning for Language

Abstract

AI-generated summary: Large language models may not genuinely detect their internal states, as their apparent introspective abilities could reflect surface-level pattern matching rather than true metacognitive monitoring.

Can large language models detect and report their own internal states? A number of studies have argued that the answer to this question is yes. We argue, based on lessons from human metacognition research, that this conclusion may be premature: to be convinced of this conclusion we need to distinguish genuine introspection from pattern matching based on surface-level cues. Furthermore, we argue that behavioral evidence alone is inherently insufficient to establish strong introspective claims.

We re-examine two recently introduced evaluation paradigms in light of this consideration. In the first paradigm, models are expected to detect whether their internal states have been tampered with. We find that models cannot reliably distinguish such interventions on their internal states from manipulations of the input, suggesting that their success in the original studies reflects their ability to detect anomalies more generally, as opposed to interventions on their internal states in particular. In the second paradigm we examine, models are tasked with predicting labels derived from their own hidden states. Here, we find that classifiers that only have access to the input achieve equivalent performance to the model's own in-context predictions, indicating that the original results do not conclusively demonstrate that the model has privileged access to its internal representations. We further introduce a relabeled control setting, where models cannot rely on the semantics of the task to solve it, and instead must rely on the internal representation; models perform closer to chance on this better-controlled version of the task.

Taken together, these results indicate that current evidence is insufficient to establish that LLMs display metacognitive monitoring.
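
The comparison behind the second paradigm's controls can be illustrated with a minimal sketch. This is not the authors' code: the function names (`accuracy`, `access_gap`, `relabel`) are hypothetical, the predictions are hand-written toy values rather than results from the paper, and the relabeling step is only one assumed way to strip semantic cues from label names. The idea it shows is that self-report accuracy counts as evidence of privileged access only if it exceeds what an input-only predictor achieves.

```python
import random


def accuracy(predictions, labels):
    """Fraction of items where the prediction matches the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def access_gap(self_report, input_only, labels):
    """Self-report accuracy minus input-only baseline accuracy.

    A gap near zero means the self-reports can be explained by the input
    alone, so they are not evidence of privileged access.
    """
    return accuracy(self_report, labels) - accuracy(input_only, labels)


def relabel(labels, seed=0):
    """Replace class names with arbitrary tokens (hypothetical control).

    The shuffled mapping removes any semantic hint carried by the original
    label names while keeping the class structure intact.
    """
    classes = sorted(set(labels))
    tokens = [f"class_{i}" for i in range(len(classes))]
    random.Random(seed).shuffle(tokens)
    mapping = dict(zip(classes, tokens))
    return [mapping[y] for y in labels], mapping


# Toy, hand-written data (NOT results from the paper).
labels = ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
self_report = ["pos", "neg", "pos", "pos", "pos", "neg", "neg", "neg"]
input_only = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos"]

gap = access_gap(self_report, input_only, labels)
relabeled, mapping = relabel(labels)

print("self-report accuracy:", accuracy(self_report, labels))
print("input-only accuracy:", accuracy(input_only, labels))
print("gap:", gap)
print("relabeled targets:", relabeled)
```

In this toy case both predictors reach the same accuracy, so the gap is zero and the self-reports add nothing beyond the input. In a real evaluation the gap would need a significance test, and the relabeled targets would replace the original label names in the prompt so that task semantics cannot be used as a shortcut.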

Get this paper in your agent:

hf papers read 2605.26242

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
