Hugging Face Daily Papers · 7 min read

On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.00467

Published on May 30 · Submitted by Rafal Kocielnik on Jun 12
Authors: Etienne Casanova, Rafal Kocielnik, R. Michael Alvarez

Abstract

Large language models show limited ability to correct zero-shot errors through prompting, and model performance is more strongly linked to definition-specific familiarity than to text-level memorization metrics.

Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how model-internalized priors interact with user-provided instructions. We investigate three dimensions of this interaction: (1) how an LLM's familiarity with data and task definitions affects performance, (2) the extent to which additional information in prompts can correct zero-shot errors ("decision stickiness"), and (3) model susceptibility to misaligned task definitions. Through experiments on toxicity detection across diverse datasets (spanning social media, gaming, news, and forums) using both dense and mixture-of-experts models, we find that nearly two-thirds of zero-shot errors are resistant to correction, with an overall rescue rate (fraction of initial errors corrected by prompting) of only 34.8%. High-confidence errors prove especially resistant to correction. When given misaligned definitions, LLMs follow them while maintaining confidence levels unchanged from the aligned condition. Crucially, we introduce Definition-Specific Familiarity (DSF), which measures alignment between a model's internal concept and the task definition. After controlling for dataset-level confounds, DSF shows a positive association with model performance (partial r = +0.41), while three distinct memorization metrics (ROUGE-L, BERTScore, and embedding cosine similarity) all fail to show a positive association. These findings show the limitations of prompt-based correction in annotation tasks, highlighting the importance of definition alignment over text-level memorization.
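The abstract defines the rescue rate as the fraction of initial (zero-shot) errors corrected by prompting. A minimal sketch of that arithmetic follows; the function name `rescue_rate` and the toy labels are illustrative assumptions, not the authors' code or data.

```python
from typing import Sequence


def rescue_rate(gold: Sequence[int],
                zero_shot: Sequence[int],
                prompted: Sequence[int]) -> float:
    """Fraction of zero-shot errors that the prompted run labels correctly.

    Hypothetical helper: it follows the definition quoted in the abstract,
    not any released implementation.
    """
    if not (len(gold) == len(zero_shot) == len(prompted)):
        raise ValueError("all label sequences must have the same length")

    # Indices where the zero-shot prediction disagreed with the gold label.
    initial_errors = [i for i, (g, z) in enumerate(zip(gold, zero_shot)) if g != z]
    if not initial_errors:
        return float("nan")  # no errors to rescue

    # An error is "rescued" if the prompted prediction matches the gold label.
    rescued = sum(1 for i in initial_errors if prompted[i] == gold[i])
    return rescued / len(initial_errors)


# Toy data (made up): zero-shot errors at indices 0, 2, 4; only index 0 is fixed.
gold      = [1, 0, 1, 1, 0, 0]
zero_shot = [0, 0, 0, 1, 1, 0]
prompted  = [1, 0, 0, 1, 1, 0]
rate = rescue_rate(gold, zero_shot, prompted)
print(f"rescue rate: {rate:.3f}")
```

Under this definition, a rescue rate of 34.8% means roughly 65% of zero-shot errors remain wrong after prompting, which matches the "nearly two-thirds" figure in the abstract.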

Community

Paper submitter about 5 hours ago

Do LLM annotators actually follow the definitions we give them?

Our paper studies how model-internalized priors shape LLM annotation behavior. Across 9 models and 5 toxicity-related datasets, we find that performance is better explained by definition alignment than by text memorization. We introduce Definition-Specific Familiarity (DSF), a lightweight diagnostic for measuring whether a model’s internal concept matches the task definition.
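The abstract reports a partial correlation between DSF and performance after controlling for dataset-level confounds. The exact procedure is not described on this page, so the sketch below uses one common approach as an assumption: remove each dataset's mean from both variables, then correlate the residuals. The function name and all numbers are hypothetical.

```python
import numpy as np


def partial_corr_by_group(x, y, groups) -> float:
    """Pearson correlation of x and y after removing per-group means.

    Demeaning within each group is equivalent to regressing both variables
    on group indicator variables and correlating the residuals.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)

    rx, ry = x.copy(), y.copy()
    for g in np.unique(groups):
        mask = groups == g
        rx[mask] -= x[mask].mean()
        ry[mask] -= y[mask].mean()

    denom = np.sqrt((rx ** 2).sum() * (ry ** 2).sum())
    return float((rx * ry).sum() / denom) if denom > 0 else float("nan")


# Made-up scores for three models on two datasets. Dataset "B" is harder
# overall, which hides the within-dataset trend in the raw correlation.
dsf      = [0.2, 0.4, 0.6, 0.5, 0.7, 0.9]
perf     = [0.50, 0.55, 0.60, 0.30, 0.35, 0.40]
datasets = ["A", "A", "A", "B", "B", "B"]

r_raw = float(np.corrcoef(dsf, perf)[0, 1])
r_partial = partial_corr_by_group(dsf, perf, datasets)
print(f"raw r = {r_raw:+.2f}, partial r = {r_partial:+.2f}")
```

In this toy case the raw correlation is negative while the within-dataset association is positive, which illustrates why dataset difficulty needs to be controlled before comparing DSF against memorization metrics.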

We also find strong “decision stickiness”: most zero-shot errors persist even after aligned definitions and few-shot examples, and high-confidence errors are especially hard to correct. Models can also confidently follow misaligned definitions, making confidence an unreliable indicator of whether the intended labeling standard is being applied.
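One way to check whether high-confidence errors are harder to correct is to split zero-shot errors by the model's stated confidence and compute the rescue rate in each bucket. The sketch below assumes a single cutoff of 0.8 and a simple record layout; both are illustrative choices, not details from the paper.

```python
from typing import Dict, Iterable, Optional, Tuple

# (gold label, zero-shot label, zero-shot confidence, prompted label)
Record = Tuple[int, int, float, int]


def rescue_rate_by_confidence(records: Iterable[Record],
                              threshold: float = 0.8) -> Dict[str, Optional[float]]:
    """Rescue rate of zero-shot errors, split into high/low confidence buckets.

    The threshold is a hypothetical cutoff chosen for illustration.
    """
    counts = {"high": [0, 0], "low": [0, 0]}  # [rescued, total errors]
    for gold, zero_shot, confidence, prompted in records:
        if zero_shot == gold:
            continue  # only initial errors can be rescued
        bucket = "high" if confidence >= threshold else "low"
        counts[bucket][1] += 1
        if prompted == gold:
            counts[bucket][0] += 1
    return {k: (r / n if n else None) for k, (r, n) in counts.items()}


# Made-up records for illustration only.
records = [
    (1, 0, 0.95, 0),  # high-confidence error, not rescued
    (1, 0, 0.90, 0),  # high-confidence error, not rescued
    (0, 1, 0.85, 0),  # high-confidence error, rescued
    (1, 0, 0.55, 1),  # low-confidence error, rescued
    (0, 1, 0.60, 0),  # low-confidence error, rescued
    (1, 1, 0.99, 1),  # correct at zero-shot, ignored
    (0, 1, 0.40, 1),  # low-confidence error, not rescued
]
rates = rescue_rate_by_confidence(records)
print(rates)
```

A lower rate in the "high" bucket than in the "low" bucket would be consistent with the stickiness pattern described above.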

The takeaway: LLM annotation pipelines should not assume that prompt definitions fully control model behavior. Definition design and definition alignment need to be measured explicitly.

[Image: rq_overviews_llm_adap]

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

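The following papers were recommended by the Semantic Scholar API:

* Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations (2026), https://huggingface.co/papers/2605.27025
* From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation (2026), https://huggingface.co/papers/2606.06266
* Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification (2026), https://huggingface.co/papers/2604.17112
* The Silent Vote: Improving Zero-Shot LLM Reliability by Aggregating Semantic Neighborhoods (2026), https://huggingface.co/papers/2605.09739
* Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation (2026), https://huggingface.co/papers/2606.12117
* Long Live Fine-Tuning: Task-Specific Transformers Outperform Zero-Shot LLMs for Misinformation Response Classification on Reddit (2026), https://huggingface.co/papers/2606.04274
* The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF (2026), https://huggingface.co/papers/2605.29491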
