SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations
Authors: Taewon Yun, Hyeonseong Park, Jeonghwan Choi, Hayoon Park, Yeeun Choi, Hwanjun Song (DISLab, Data Intelligence System Lab)
Published: 2026-06-04
arXiv: 2606.05563
Project page: https://disl-lab.github.io/SoCRATES/
Abstract
SoCRATES is a realistic multi-domain benchmark for evaluating proactive LLM mediators along five socio-cognitive adaptation axes. It shows that even top-performing models close only about one-third of the unmediated consensus gap.
Evaluating LLM mediators remains challenging, as mediation unfolds as a real-time trajectory shaped by disputants' shifting emotions, intentions, and context. Existing testbeds rely on a few expert-authored domains, vary mainly strategic posture, and score every turn against every topic, introducing off-topic noise. We introduce SoCRATES, a benchmark for evaluating proactive LLM mediators in realistic, multi-domain testbeds. It constructs scenarios from real conflicts through an agentic pipeline across eight domains, probes five socio-cognitive adaptation axes (strategic posture, party composition, history length, emotional reactivity, and cultural identity), and scores each topic only on the turns that advance it via a topic-localized evaluator. The evaluator reaches 0.82 alignment with human experts, more than doubling a per-turn baseline. Benchmarking eight frontier LLMs, we find that even the strongest mediator closes only about a third of the unmediated consensus gap under diverse and realistic testbeds, with performance varying sharply by socio-cognitive axis, highlighting that progress lies in social adaptation to diverse conditions.
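The contrast between per-turn scoring and topic-localized scoring can be illustrated with a minimal sketch. This is not the authors' evaluator: the `TurnScore` record, the `advances` flag, the simple averaging, and the `gap_closed` formula are all hypothetical choices made for illustration, since the abstract does not specify how turns are attributed to topics or how the consensus gap is computed.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TurnScore:
    """Hypothetical record: one judge score for one (turn, topic) pair."""
    turn: int
    topic: str
    score: float    # assumed to lie in [0, 1]
    advances: bool  # assumed flag: does this turn advance this topic?


def per_turn_baseline(scores):
    """Average every turn's score for every topic, off-topic turns included."""
    by_topic = defaultdict(list)
    for s in scores:
        by_topic[s.topic].append(s.score)
    return {topic: sum(v) / len(v) for topic, v in by_topic.items()}


def topic_localized(scores):
    """Average only the turns flagged as advancing each topic.

    Topics with no advancing turn are omitted from the result.
    """
    by_topic = defaultdict(list)
    for s in scores:
        if s.advances:
            by_topic[s.topic].append(s.score)
    return {topic: sum(v) / len(v) for topic, v in by_topic.items()}


def gap_closed(unmediated, mediated, full=1.0):
    """Assumed definition: share of the gap between the unmediated
    consensus level and full consensus that the mediator recovers."""
    gap = full - unmediated
    if gap <= 0:
        raise ValueError("unmediated consensus must be below full consensus")
    return (mediated - unmediated) / gap


# Toy dialogue: turn 1 is about rent, turn 2 is about noise.
scores = [
    TurnScore(turn=1, topic="rent", score=1.0, advances=True),
    TurnScore(turn=1, topic="noise", score=0.0, advances=False),
    TurnScore(turn=2, topic="rent", score=0.0, advances=False),
    TurnScore(turn=2, topic="noise", score=0.5, advances=True),
]
baseline = per_turn_baseline(scores)
localized = topic_localized(scores)
```

In this toy dialogue the baseline halves each topic's score because the off-topic turn contributes a zero, while the localized variant scores each topic only on the turn that addresses it. That is the kind of off-topic noise the abstract attributes to scoring every turn against every topic.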