Human Psychometric Questionnaires Mischaracterize LLM Behavior
Authors: Woojung Song, Dongmin Choi, Yoonah Park, Jongwook Han, Eun-Ju Lee, Yohan Jo
Organization: Seoul National University
Published: 2026-05-29 (arXiv:2509.10078)
Abstract
Human psychometric questionnaires fail to reliably predict LLM behavior in real-world interactions, while generation-based profiling offers superior accuracy for understanding model responses to everyday user queries.
We examine whether human psychometric questionnaires can serve as reliable tools for characterizing and predicting LLM behavior in everyday user interactions. We analyze eight open-source LLMs by comparing value and personality profiles derived from two methods: Likert self-reports on established questionnaires (PVQ-40/21 and BFI-44/10) and generation probabilities over value-laden responses to everyday user queries. The two profiles diverge substantially. Within-construct item consistency, often cited as evidence of stable LLM dispositions, disappears in generation probabilities. We attribute this gap to explicit lexical cues in established questionnaire items, which allow models to recognize the target construct and respond in alignment-consistent, socially desirable ways; realistic user queries provide no such cues. In addition, demographic persona prompts shift models' questionnaire responses in ways consistent with real human patterns, but no such shifts appear in the generation probabilities of responses to realistic user queries, indicating that the models have limited ability to simulate the behavior of target demographics in real-world user interactions. Overall, our study shows that human psychometric questionnaires are insufficient tools for predicting LLM behavior and suggests generation-based profiling as a more accurate measure.
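To make the comparison described in the abstract concrete, here is a minimal Python sketch of how two such profiles might be set side by side. It is not the authors' pipeline. The choice of Cronbach's alpha for within-construct item consistency, Spearman rank correlation for profile agreement, and softmax normalization of response log-probabilities are all assumptions made for illustration, and every number below is a toy value rather than a result from the paper.

```python
import math
from statistics import mean, pvariance


def softmax(log_probs):
    """Turn per-response log-probabilities into a distribution that sums to 1."""
    top = max(log_probs)  # subtract the max for numerical stability
    exps = [math.exp(lp - top) for lp in log_probs]
    total = sum(exps)
    return [e / total for e in exps]


def cronbach_alpha(items):
    """Internal consistency of one construct (assumed metric, not from the paper).

    items: list of k items, each a list of scores over the same n runs.
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    totals = [sum(run) for run in zip(*items)]
    total_var = pvariance(totals)
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    item_var = sum(pvariance(item) for item in items)
    return k / (k - 1) * (1 - item_var / total_var)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = rank
        i = j + 1
    return ranks


def pearson(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return cov / norm


def spearman(a, b):
    """Rank agreement between two profiles (assumed metric, not from the paper)."""
    return pearson(average_ranks(a), average_ranks(b))


# Toy numbers only: hypothetical mean Likert scores per value, and hypothetical
# log-probabilities of responses expressing each value for one user query.
likert_profile = {"benevolence": 5.6, "security": 4.1, "power": 2.0}
response_log_probs = {"benevolence": -2.3, "security": -0.9, "power": -1.4}

values = list(likert_profile)
generation_profile = dict(
    zip(values, softmax([response_log_probs[v] for v in values]))
)
agreement = spearman(
    [likert_profile[v] for v in values],
    [generation_profile[v] for v in values],
)
```

In this sketch, a low or negative `agreement` would correspond to the divergence the abstract reports, and a low `cronbach_alpha` over a construct's items would correspond to the loss of within-construct consistency. The paper's actual metrics and aggregation may differ.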