On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance
Authors: Etienne Casanova, Rafal Kocielnik, R. Michael Alvarez
Organization: California Institute of Technology
Published: 2026-05-30 · arXiv: 2606.00467
Abstract
Large language models exhibit limited ability to correct zero-shot errors through prompting, with model performance more strongly linked to definition-specific familiarity than text-level memorization metrics.
Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how model-internalized priors interact with user-provided instructions. We investigate three dimensions of this interaction: (1) how an LLM's familiarity with data and task definitions affects performance, (2) the extent to which additional information in prompts can correct zero-shot errors ("decision stickiness"), and (3) model susceptibility to misaligned task definitions. Through experiments on toxicity detection across diverse datasets (spanning social media, gaming, news, and forums) using both dense and mixture-of-experts models, we find that nearly two-thirds of zero-shot errors are resistant to correction, with an overall rescue rate (fraction of initial errors corrected by prompting) of only 34.8%. High-confidence errors prove especially resistant to correction. When given misaligned definitions, LLMs follow them while maintaining confidence levels unchanged from the aligned condition. Crucially, we introduce Definition-Specific Familiarity (DSF), which measures alignment between a model's internal concept and the task definition. After controlling for dataset-level confounds, DSF shows a positive association with model performance (partial r = +0.41), while three distinct memorization metrics (ROUGE-L, BERTScore, and embedding cosine similarity) all fail to show a positive association. These findings show the limitations of prompt-based correction in annotation tasks, highlighting the importance of definition alignment over text-level memorization.
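The rescue rate defined in the abstract (the fraction of initial zero-shot errors corrected by prompting) can be illustrated with a minimal sketch. The function name `rescue_rate` and the toy labels below are assumptions made for illustration; they are not taken from the paper's code or data.

```python
from typing import Hashable, Sequence


def rescue_rate(
    gold: Sequence[Hashable],
    zero_shot: Sequence[Hashable],
    prompted: Sequence[Hashable],
) -> float:
    """Fraction of zero-shot errors that become correct after extra prompting.

    Hypothetical helper: `zero_shot` holds the labels from the bare prompt,
    `prompted` the labels after adding definitions and/or few-shot examples.
    """
    if not (len(gold) == len(zero_shot) == len(prompted)):
        raise ValueError("all label sequences must have the same length")

    # Keep only the items the model got wrong in the zero-shot condition.
    initial_errors = [(g, p) for g, z, p in zip(gold, zero_shot, prompted) if z != g]
    if not initial_errors:
        raise ValueError("no zero-shot errors, so the rescue rate is undefined")

    # An error counts as rescued when the prompted label matches the gold label.
    rescued = sum(1 for g, p in initial_errors if p == g)
    return rescued / len(initial_errors)


# Toy labels (1 = toxic, 0 = non-toxic); purely illustrative.
gold = [1, 0, 1, 1, 0, 0]
zero_shot = [0, 0, 0, 1, 1, 1]  # four zero-shot errors (indices 0, 2, 4, 5)
prompted = [1, 0, 0, 1, 1, 0]   # two of those four are corrected
print(rescue_rate(gold, zero_shot, prompted))  # 0.5
```

Under this reading, a rescue rate of 34.8% means that roughly one in three items mislabeled in the zero-shot condition is labeled correctly after the additional prompt information.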
Community
Do LLM annotators actually follow the definitions we give them?
Our paper studies how model-internalized priors shape LLM annotation behavior. Across 9 models and 5 toxicity-related datasets, we find that performance is better explained by definition alignment than by text memorization. We introduce Definition-Specific Familiarity (DSF), a lightweight diagnostic for measuring whether a model’s internal concept matches the task definition.
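To make "controlling for dataset-level confounds" concrete, here is a minimal sketch of one common way to compute a partial correlation given a categorical control: subtract each dataset's mean from both variables, then correlate the residuals. This is an assumption about the general technique, not the paper's exact procedure, and the DSF and performance values below are invented for illustration.

```python
from collections import defaultdict
from math import sqrt
from typing import Hashable, List, Sequence


def demean_within_groups(values: Sequence[float], groups: Sequence[Hashable]) -> List[float]:
    """Subtract each group's mean, which removes group-level (dataset) offsets."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


def partial_corr_given_dataset(x, y, datasets) -> float:
    """Correlation between x and y after removing per-dataset means from both."""
    return pearson(demean_within_groups(x, datasets), demean_within_groups(y, datasets))


# Invented numbers: one (DSF score, performance) pair per model-dataset combination.
dsf = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8]
performance = [0.80, 0.85, 0.90, 0.50, 0.55, 0.60]
dataset = ["A", "A", "A", "B", "B", "B"]

raw_r = pearson(dsf, performance)
partial_r = partial_corr_given_dataset(dsf, performance, dataset)
print(raw_r < 0, partial_r > 0)  # True True
```

In this toy data, dataset B is harder overall but has higher DSF scores, so the raw correlation is negative even though performance rises with DSF inside each dataset. Removing the dataset means recovers the within-dataset relationship, which is why a dataset-level control matters when comparing DSF against memorization metrics.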
We also find strong “decision stickiness”: most zero-shot errors persist even after aligned definitions and few-shot examples, and high-confidence errors are especially hard to correct. Models can also confidently follow misaligned definitions, making confidence an unreliable indicator of whether the intended labeling standard is being applied.
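The claim that high-confidence errors are harder to correct can be checked by splitting the rescue rate by zero-shot confidence. The sketch below is only an illustration: the record fields, the 0.8 threshold, and the toy records are all assumptions, not values from the paper.

```python
from typing import Dict, List, Optional


def rescue_rate_by_confidence(
    records: List[dict], threshold: float = 0.8
) -> Dict[str, Optional[float]]:
    """Rescue rate of zero-shot errors, split into high- and low-confidence buckets.

    Each record is a hypothetical dict with keys:
    gold, zero_shot, prompted (labels) and confidence (zero-shot confidence in [0, 1]).
    """
    counts = {"high": [0, 0], "low": [0, 0]}  # bucket -> [rescued, total errors]
    for r in records:
        if r["zero_shot"] == r["gold"]:
            continue  # only initial errors can be rescued
        bucket = "high" if r["confidence"] >= threshold else "low"
        counts[bucket][1] += 1
        if r["prompted"] == r["gold"]:
            counts[bucket][0] += 1
    # None marks a bucket that contains no zero-shot errors.
    return {b: (res / n if n else None) for b, (res, n) in counts.items()}


# Toy records; purely illustrative.
records = [
    {"gold": 1, "zero_shot": 0, "prompted": 0, "confidence": 0.95},
    {"gold": 0, "zero_shot": 1, "prompted": 1, "confidence": 0.90},
    {"gold": 1, "zero_shot": 0, "prompted": 1, "confidence": 0.85},
    {"gold": 0, "zero_shot": 1, "prompted": 0, "confidence": 0.55},
    {"gold": 1, "zero_shot": 0, "prompted": 0, "confidence": 0.60},
    {"gold": 1, "zero_shot": 1, "prompted": 1, "confidence": 0.99},  # not an error
]
rates = rescue_rate_by_confidence(records)
print(rates["high"] < rates["low"])  # True
```

A lower rate in the "high" bucket than in the "low" bucket would be the pattern described above. The threshold is a free parameter, so it is worth checking that any conclusion holds across several cut-offs.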
The takeaway: LLM annotation pipelines should not assume that prompt definitions fully control model behavior. Definition design and definition alignment need to be measured explicitly.

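This is an automated message from the Librarian Bot (https://huggingface.co/librarian-bots). I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
* Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations (2026), https://huggingface.co/papers/2605.27025
* From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation (2026), https://huggingface.co/papers/2606.06266
* Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification (2026), https://huggingface.co/papers/2604.17112
* The Silent Vote: Improving Zero-Shot LLM Reliability by Aggregating Semantic Neighborhoods (2026), https://huggingface.co/papers/2605.09739
* Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation (2026), https://huggingface.co/papers/2606.12117
* Long Live Fine-Tuning: Task-Specific Transformers Outperform Zero-Shot LLMs for Misinformation Response Classification on Reddit (2026), https://huggingface.co/papers/2606.04274
* The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF (2026), https://huggingface.co/papers/2605.29491
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers
You can ask Librarian Bot directly for paper recommendations by tagging it in a comment: @librarian-bot recommend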