Hugging Face Daily Papers · June 10, 2026 · 4 min read

PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Like Read original ↗

PsychoSafe is a psychologically informed framework for making LLM refusals more supportive in high-risk situations, such as crises, coercion, or escalating harmful intent. It improved refusal quality substantially over a generic baseline, especially in resource referrals and psychological grounding. In-context learning worked better than fine-tuning, where sometimes responses were less relevant and did not generalize as well outside its training domains.<br><a href=\"https://cdn-uploads.huggingface.co/production/uploads/6652354cb88e4539b2189cd7/2hmzusXuG4ct7EAKqvbkl.png\" rel=\"nofollow\"><img src=\"https://cdn-uploads.huggingface.co/production/uploads/6652354cb88e4539b2189cd7/2hmzusXuG4ct7EAKqvbkl.png\" alt=\"psychosafe_fig1\"></a></p>\n","updatedAt":"2026-06-10T08:11:57.756Z","author":{"_id":"6652354cb88e4539b2189cd7","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6652354cb88e4539b2189cd7/kZ7Mi6Yz7zbOSLqgFW5jt.jpeg","fullname":"Gianluca Barmina","name":"giannor","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":1,"isUserFollowing":false}},"numEdits":1,"identifiedLanguage":{"language":"en","probability":0.9576320648193359},"editors":["giannor"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/6652354cb88e4539b2189cd7/kZ7Mi6Yz7zbOSLqgFW5jt.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2606.09697","authors":[{"_id":"6a2912f4e7d78ea7587e5674","name":"Gianluca Barmina","hidden":false},{"_id":"6a2912f4e7d78ea7587e5675","name":"Federico Torrielli","hidden":false},{"_id":"6a2912f4e7d78ea7587e5676","name":"Sven Harms","hidden":false},{"_id":"6a2912f4e7d78ea7587e5677","name":"Jacob Nielsen","hidden":false},{"_id":"6a2912f4e7d78ea7587e5678","name":"Felix Mächtle","hidden":false},{"_id":"6a2912f4e7d78ea7587e5679","name":"Stine Lyngsø Beltoft","hidden":false},{"_id":"6a2912f4e7d78ea7587e567a","name":"Peter Schneider-Kamp","hidden":false},{"_id":"6a2912f4e7d78ea7587e567b","name":"Thomas Eisenbarth","hidden":false},{"_id":"6a2912f4e7d78ea7587e567c","name":"Lukas Galke Poech","hidden":false},{"_id":"6a2912f4e7d78ea7587e567d","name":"Anne Lauscher","hidden":false}],"publishedAt":"2026-06-08T16:19:18.000Z","submittedOnDailyAt":"2026-06-10T00:00:00.000Z","title":"PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models","submittedOnDailyBy":{"_id":"6652354cb88e4539b2189cd7","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6652354cb88e4539b2189cd7/kZ7Mi6Yz7zbOSLqgFW5jt.jpeg","isPro":false,"fullname":"Gianluca Barmina","user":"giannor","type":"user","name":"giannor"},"summary":"Large language models (LLMs) routinely face requests that should be refused, creating a trade-off between helpfulness and harm prevention. However, refusals themselves can be helpful. In high-risk interactions involving crisis, coercion, or escalating intent, blunt non-compliance may prevent direct harm while still failing to support the needs of the person behind the request. We present PsychoSafe, a psychologically-informed refusal framework that reframes refusal as structured supportive communication grounded in evidence-based intervention strategies. To develop PsychoSafe, we construct a corpus of 8019 prompt-response pairs spanning five psychologically salient risk domains and apply prompting and parameter-efficient fine-tuning to Qwen 3.5 27B. On a balanced validation set of 500 prompts, evaluated with an LLM judge and validated through human ratings, PsychoSafe prompting improves overall refusal quality by 28.1% over a generic baseline, with particularly strong gains in external resource referral (+46.8%) and psychological grounding (+34.8%), while preserving downstream performance on non-refusal tasks. Fine-tuning achieves near-perfect refusal and resource-referral rates but reduces response relevance. Additional evaluations on SORRY-Bench and XSTest show strong in-domain robustness but limited out-of-domain generalization, suggesting that future work should diversify fine-tuning data to help models apply interventions selectively rather than schematically.","upvotes":5,"discussionId":"6a2912f5e7d78ea7587e567e","githubRepo":"https://github.com/aisilab/psychological-safety","githubRepoAddedBy":"user","ai_summary":"A psychologically-informed refusal framework called PsychoSafe is developed for large language models to improve harmful request handling through structured supportive communication, showing enhanced refusal quality and resource referral while maintaining performance on non-refusal tasks.","ai_keywords":["large language models","refusal framework","psychological grounding","prompting","parameter-efficient fine-tuning","Qwen 3.5 27B","LLM judge","SORRY-Bench","XSTest"],"ai_summary_model":"Qwen/Qwen2.5-Coder-32B-Instruct","githubStars":2,"organization":{"_id":"68adbfb1dd070a92488069b1","name":"SDU-Denmark","fullname":"University of Southern Denmark (SDU)","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/68adbdb50fdaa186aa43d1ce/f1kfMH47RIckIAOWEi1mv.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6652354cb88e4539b2189cd7","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6652354cb88e4539b2189cd7/kZ7Mi6Yz7zbOSLqgFW5jt.jpeg","isPro":false,"fullname":"Gianluca Barmina","user":"giannor","type":"user"},{"_id":"624d671d953e603497e0eb28","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/624d671d953e603497e0eb28/8-xsTsJAV0xBfQgqLwIC0.png","isPro":false,"fullname":"Federico Torrielli","user":"EvilScript","type":"user"},{"_id":"65dee4eb2df2dd7ceecb5850","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65dee4eb2df2dd7ceecb5850/WZCx-1X-7944O-BX7h29L.jpeg","isPro":false,"fullname":"Jacob Nielsen","user":"JacobBITLABS","type":"user"},{"_id":"68b031d6aa3a9d6ef8ff91ca","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/-uFUU2OfVN02ttCtgIVOw.png","isPro":false,"fullname":"Annemette Brok Pirchert","user":"popunicorn","type":"user"},{"_id":"69e73ebbf119e40cb8e83cf4","avatarUrl":"/avatars/7e22f0ac3f4b1e85e90fbdc8a688470a.svg","isPro":false,"fullname":"Filippo Tonini","user":"filo362","type":"user"}],"acceptLanguages":["en"],"dailyPaperRank":0,"organization":{"_id":"68adbfb1dd070a92488069b1","name":"SDU-Denmark","fullname":"University of Southern Denmark (SDU)","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/68adbdb50fdaa186aa43d1ce/f1kfMH47RIckIAOWEi1mv.png"},"markdownContentUrl":"https://huggingface.co/buckets/huggingchat/papers-content/resolve/2606/2606.09697.md"}">

Papers

arxiv:2606.09697

PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models

Published on Jun 8

· Submitted by

Gianluca Barmina on Jun 10

University of Southern Denmark (SDU)

Upvote

Authors:

Abstract

A psychologically-informed refusal framework called PsychoSafe is developed for large language models to improve harmful request handling through structured supportive communication, showing enhanced refusal quality and resource referral while maintaining performance on non-refusal tasks.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Large language models (LLMs) routinely face requests that should be refused, creating a trade-off between helpfulness and harm prevention. However, refusals themselves can be helpful. In high-risk interactions involving crisis, coercion, or escalating intent, blunt non-compliance may prevent direct harm while still failing to support the needs of the person behind the request. We present PsychoSafe, a psychologically-informed refusal framework that reframes refusal as structured supportive communication grounded in evidence-based intervention strategies. To develop PsychoSafe, we construct a corpus of 8019 prompt-response pairs spanning five psychologically salient risk domains and apply prompting and parameter-efficient fine-tuning to Qwen 3.5 27B. On a balanced validation set of 500 prompts, evaluated with an LLM judge and validated through human ratings, PsychoSafe prompting improves overall refusal quality by 28.1% over a generic baseline, with particularly strong gains in external resource referral (+46.8%) and psychological grounding (+34.8%), while preserving downstream performance on non-refusal tasks. Fine-tuning achieves near-perfect refusal and resource-referral rates but reduces response relevance. Additional evaluations on SORRY-Bench and XSTest show strong in-domain robustness but limited out-of-domain generalization, suggesting that future work should diversify fine-tuning data to help models apply interventions selectively rather than schematically.

View arXiv page View PDF GitHub 2 Add to collection

Community

giannor

Paper submitter about 9 hours ago

•

edited about 9 hours ago

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2606.09697

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2606.09697 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2606.09697 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.

Discussion (0)

No comments yet. Sign in and be the first to say something.

PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models

Abstract

Community

Models citing this paper 1

Datasets citing this paper 0

Spaces citing this paper 0

Collections including this paper 0

Discussion (0)

More from Hugging Face Daily Papers